Class EntityStripper

java.lang.Object
toxi.data.feeds.util.EntityStripper

public class EntityStripper extends Object
Strips HTML entities such as &quot; from a string, replacing them with their Unicode equivalents.
Since:
2002-07-14
Version:
2.6 2009-04-05 - StripEntities now leaves a space behind when it removes a <br>, <p>, <td> etc. tag.

Author:
Roedy Green, Canadian Mind Products
  • Field Details

    • UNICODE_NBSP_160_0x0a

      public static final char UNICODE_NBSP_160_0x0a
      Unicode non-breaking space (NBSP) character, decimal 160, hex 0xA0 (the 0x0a in the constant name is a misnomer).
    • LONGEST_ENTITY

      public static final int LONGEST_ENTITY
      The longest an entity can be is 10 characters, at least in our tables, including the leading & and trailing ;.
    • SHORTEST_ENTITY

      public static final int SHORTEST_ENTITY
      The shortest an entity can be is 4 characters, at least in our tables, including the leading & and trailing ;.
  • Constructor Details

    • EntityStripper

      public EntityStripper()
  • Method Details

    • bareHTMLEntityToChar

      public static char bareHTMLEntityToChar(String bareEntity, char howToTranslateNbsp)
      Converts an entity to a single char.
      Parameters:
      bareEntity - entity to convert. Must have the leading & and trailing ; stripped. May take the form #x12ff, #123, lt or nbsp. Works faster if the entity is in lower case.
      howToTranslateNbsp - char you would like &nbsp; translated to, usually ' ' or (char) 160.
      Returns:
      equivalent character, or 0 if the entity is not recognised.
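The three accepted forms (hex, decimal, named) can be illustrated with a minimal sketch. This is not the library's implementation: the class name BareEntitySketch, the method name bareEntityToChar and the four-entry table are assumptions made for illustration, and lower-casing the name before lookup is a simplification.

```java
import java.util.Locale;
import java.util.Map;

final class BareEntitySketch {
    // Tiny illustrative table (an assumption); a real table would hold many more names.
    private static final Map<String, Character> NAMED = Map.of(
            "lt", '<', "gt", '>', "amp", '&', "quot", '"');

    /** Returns the decoded char, or 0 if the bare entity is not recognised. */
    static char bareEntityToChar(String bare, char howToTranslateNbsp) {
        if (bare == null || bare.isEmpty()) {
            return 0;
        }
        if (bare.charAt(0) == '#') {
            try {
                // "#x12ff" is hexadecimal, "#123" is decimal.
                boolean hex = bare.length() > 1
                        && (bare.charAt(1) == 'x' || bare.charAt(1) == 'X');
                int code = hex
                        ? Integer.parseInt(bare.substring(2), 16)
                        : Integer.parseInt(bare.substring(1), 10);
                // Only code points that fit in a single char are accepted here.
                return (code > 0 && code <= Character.MAX_VALUE) ? (char) code : (char) 0;
            } catch (NumberFormatException e) {
                return 0;
            }
        }
        String name = bare.toLowerCase(Locale.ROOT);
        if (name.equals("nbsp")) {
            return howToTranslateNbsp;
        }
        Character c = NAMED.get(name);
        return c == null ? (char) 0 : c.charValue();
    }

    public static void main(String[] args) {
        System.out.println(bareEntityToChar("#x41", ' '));
        System.out.println(bareEntityToChar("lt", ' '));
        System.out.println((int) bareEntityToChar("nbsp", (char) 160));
    }
}
```

Returning 0 for anything unrecognised mirrors the documented contract, so a caller can fall back to leaving the original text untouched.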
    • flattenHTML

      public static String flattenHTML(String text, char translateNbspTo)
      Strips tags and entities from HTML. Leaves \n and \r unchanged.
      Parameters:
      text - HTML text to flatten
      translateNbspTo - char you would like &nbsp; translated to, usually ' ' or (char) 160.
      Returns:
      flattened text
    • flattenXML

      public static String flattenXML(String text)
      Strips tags and entities from XML.
      Parameters:
      text - XML text to flatten
      Returns:
      flattened text
    • possEntityToChar

      public static char possEntityToChar(String possBareEntityWithSemicolon)
      Checks a number of gauntlet conditions to ensure this is a valid entity, then converts the entity to the corresponding char.
      Parameters:
      possBareEntityWithSemicolon - string that may hold an entity. The leading & must be stripped; text may optionally follow the ;.
      Returns:
      corresponding Unicode character, or 0 if the entity is invalid. nbsp -> (char) 160.
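The kind of gauntlet checks described above might look like the following sketch. The exact conditions the library applies are not documented here, so the checks below (a semicolon is present, the total length lies within the SHORTEST_ENTITY and LONGEST_ENTITY bounds from the field descriptions, and the name holds only letters, digits and an optional leading #) are assumptions, as are the names EntityGauntletSketch and extractBareEntity.

```java
final class EntityGauntletSketch {
    // Values taken from the field descriptions; both include the leading & and trailing ;.
    static final int LONGEST_ENTITY = 10;
    static final int SHORTEST_ENTITY = 4;

    /**
     * Returns the bare entity name (without & and ;) if the candidate passes
     * the checks, or null otherwise. Text after the ; is ignored.
     */
    static String extractBareEntity(String afterAmpersand) {
        if (afterAmpersand == null) {
            return null;
        }
        int semi = afterAmpersand.indexOf(';');
        if (semi < 0) {
            return null; // no terminating semicolon at all
        }
        int fullLength = semi + 2; // bare name plus the stripped & and the ;
        if (fullLength < SHORTEST_ENTITY || fullLength > LONGEST_ENTITY) {
            return null;
        }
        String bare = afterAmpersand.substring(0, semi);
        for (int i = 0; i < bare.length(); i++) {
            char c = bare.charAt(i);
            // Letters or digits anywhere; '#' only as the first character.
            boolean ok = Character.isLetterOrDigit(c) || (i == 0 && c == '#');
            if (!ok) {
                return null;
            }
        }
        return bare;
    }

    public static void main(String[] args) {
        System.out.println(extractBareEntity("lt; and more text"));
        System.out.println(extractBareEntity("no semicolon here"));
    }
}
```

A name that survives these checks would then be handed to a converter such as bareHTMLEntityToChar.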
    • stripHTMLEntities

      public static String stripHTMLEntities(String text, char translateNbspTo)
      Converts HTML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes through unchanged. Also strips decimal and hex entities and stray HTML entities.
      Parameters:
      text - raw text to be processed. Must not be null.
      translateNbspTo - char you would like &nbsp; translated to, usually ' ' or (char) 160.
      Returns:
      translated text. It also handles HTML 4.0 entities such as &hearts;, &#123; and &#xffff;. &nbsp; -> 160. null input returns null.
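A single left-to-right scan is one plausible way to get the behaviour described: ordinary text is copied through, and each & that starts a recognised entity is replaced by its character. The sketch below is an illustration under stated assumptions, not the library's code. The names EntityStripSketch and stripEntities, the four-entry table and the choice to leave unrecognised entities in place are all hypothetical.

```java
import java.util.Map;

final class EntityStripSketch {
    // Tiny illustrative table (an assumption); real tables are far larger.
    private static final Map<String, Character> NAMED = Map.of(
            "lt", '<', "gt", '>', "amp", '&', "quot", '"');
    // Longest bare name: 10 characters minus the leading & and trailing ;.
    private static final int LONGEST_BARE = 8;

    static String stripEntities(String text, char translateNbspTo) {
        if (text == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '&') {
                int semi = text.indexOf(';', i + 1);
                // Require a non-empty name that is not longer than the bound.
                if (semi > i + 1 && semi - i - 1 <= LONGEST_BARE) {
                    char decoded = decode(text.substring(i + 1, semi), translateNbspTo);
                    if (decoded != 0) {
                        out.append(decoded);
                        i = semi + 1; // skip past the whole entity
                        continue;
                    }
                }
            }
            out.append(c); // ordinary text, or an & that starts no known entity
            i++;
        }
        return out.toString();
    }

    /** Decodes a non-empty bare entity; returns 0 if it is not recognised. */
    private static char decode(String bare, char translateNbspTo) {
        if (bare.equals("nbsp")) {
            return translateNbspTo;
        }
        if (bare.charAt(0) == '#') {
            try {
                boolean hex = bare.length() > 1
                        && (bare.charAt(1) == 'x' || bare.charAt(1) == 'X');
                int code = hex
                        ? Integer.parseInt(bare.substring(2), 16)
                        : Integer.parseInt(bare.substring(1), 10);
                return (code > 0 && code <= Character.MAX_VALUE) ? (char) code : (char) 0;
            } catch (NumberFormatException e) {
                return 0;
            }
        }
        Character named = NAMED.get(bare);
        return named == null ? (char) 0 : named.charValue();
    }

    public static void main(String[] args) {
        System.out.println(stripEntities("1 &lt; 2 &amp;&amp; x&nbsp;&#33;", ' '));
    }
}
```

Bounding the distance to the semicolon keeps a stray & in ordinary prose from being paired with a ; that appears much later in the text.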
    • stripHTMLTags

      public static String stripHTMLTags(String html)
      Removes tags from HTML, leaving just the raw text. Leaves &nbsp; and other entities as is, e.g. does not convert &amp; back to &. Similar to code in Quoter. Also removes <!-- --> comments and the text between applet, style and script tag pairs. Presumes perfectly formed HTML: no > in comments, all <...> balanced.
      Parameters:
      html - input HTML
      Returns:
      raw text, with runs of whitespace collapsed to a single space, and trimmed.
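The documented steps (drop comments, drop applet, style and script bodies, drop the remaining tags, collapse whitespace, trim) can be approximated with regular expressions. This is a simplified sketch rather than the library's algorithm: the names TagStripSketch and stripTags are hypothetical, and it replaces every tag with a space, whereas the version note only mentions leaving a space behind for certain tags.

```java
import java.util.regex.Pattern;

final class TagStripSketch {
    private static final Pattern COMMENTS =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // Element pairs whose enclosed text is discarded along with the tags.
    private static final Pattern BLOCKS = Pattern.compile(
            "<(applet|style|script)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Removes comments, applet/style/script bodies and tags; entities are left as is. */
    static String stripTags(String html) {
        if (html == null) {
            return null;
        }
        String s = COMMENTS.matcher(html).replaceAll(" ");
        s = BLOCKS.matcher(s).replaceAll(" ");
        s = TAGS.matcher(s).replaceAll(" ");
        // Collapse runs of whitespace to a single space, then trim.
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>Hello&nbsp;<b>world</b></p><!-- note --><script>var x = 1;</script>";
        System.out.println(stripTags(html)); // prints "Hello&nbsp; world"
    }
}
```

Like the documented method, this assumes well-formed input; a > inside an attribute value would end the tag match early.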
    • stripXMLEntities

      public static String stripXMLEntities(String text)
      Converts XML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes through unchanged. Also strips decimal and hex entities and stray HTML entities.
      Parameters:
      text - raw XML text to be processed. Must not be null.
      Returns:
      translated text. null input returns null.
    • stripXMLTags

      public static String stripXMLTags(String xml)
      Removes tags from XML, leaving just the raw text. Leaves entities as is, e.g. does not convert &amp; back to &. Similar to code in Quoter. Also removes <!-- --> comments. Presumes perfectly formed XML: no > in comments, all <...> balanced.
      Parameters:
      xml - input XML
      Returns:
      raw text, with runs of whitespace collapsed to a single space, and trimmed.