Package toxi.data.feeds.util
Class EntityStripper
java.lang.Object
toxi.data.feeds.util.EntityStripper
Strips HTML entities such as &quot; from a string, replacing them with their
Unicode equivalents.
- Since:
- 2002-07-14
- Version:
- 2.6 2009-04-05 - StripEntities now leaves a space behind when it removes certain tags.
- Author:
- Roedy Green, Canadian Mind Products
-
Field Summary
Fields (modifier and type, field, description)
- static final int LONGEST_ENTITY: The longest an entity can be is 10 characters, at least in our tables, including the leading & and trailing ;.
- static final int SHORTEST_ENTITY: The shortest an entity can be is 4 characters, at least in our tables, including the leading & and trailing ;.
- static final char UNICODE_NBSP_160_0x0a: Unicode no-break space character, decimal 160 (hex 0xA0).
Constructor Summary
Constructors
- EntityStripper()
Method Summary
Methods (modifier and type, method, description)
- static char bareHTMLEntityToChar(String bareEntity, char howToTranslateNbsp): Converts an entity to a single char.
- static String flattenHTML(String text, char translateNbspTo): Strips tags and entities from HTML.
- static String flattenXML(String text): Strips tags and entities from XML.
- static char possEntityToChar(String possBareEntityWithSemicolon): Checks a number of gauntlet conditions to ensure this is a valid entity.
- static String stripHTMLEntities(String text, char translateNbspTo): Converts HTML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged.
- static String stripHTMLTags(String html): Removes tags from HTML, leaving just the raw text.
- static String stripXMLEntities(String text): Converts XML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged.
- static String stripXMLTags(String xml): Removes tags from XML, leaving just the raw text.
-
Field Details
-
UNICODE_NBSP_160_0x0a
public static final char UNICODE_NBSP_160_0x0a
Unicode no-break space character, decimal 160 (hex 0xA0).
-
LONGEST_ENTITY
public static final int LONGEST_ENTITY
The longest an entity can be is 10 characters, at least in our tables, including the leading & and trailing ;.
-
SHORTEST_ENTITY
public static final int SHORTEST_ENTITY
The shortest an entity can be is 4 characters, at least in our tables, including the leading & and trailing ;.
-
-
Constructor Details
-
EntityStripper
public EntityStripper()
-
-
Method Details
-
bareHTMLEntityToChar
public static char bareHTMLEntityToChar(String bareEntity, char howToTranslateNbsp)
Converts an entity to a single char.
- Parameters:
- bareEntity - String entity to convert. Must have the leading & and trailing ; stripped. May have the form #x12ff, #123, lt or nbsp. Works faster if the entity is in lower case.
- howToTranslateNbsp - char you would like &nbsp; translated to, usually ' ' or (char) 160.
- Returns:
- equivalent character, or 0 if not recognised.
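The three accepted forms (hex such as #x12ff, decimal such as #123, and named such as lt or nbsp) can be illustrated with a minimal standalone sketch. This is not the EntityStripper source: the class name BareEntitySketch, the method bareEntityToChar and the five-entry name table are hypothetical, and the real tables are assumed to be far larger.

```java
// Illustrative sketch only; not the toxi.data.feeds.util.EntityStripper source.
final class BareEntitySketch {

    /** Tiny stand-in for a real entity table (hypothetical subset). */
    private static char namedEntity(String name, char nbspTo) {
        switch (name) {
            case "lt":   return '<';
            case "gt":   return '>';
            case "amp":  return '&';
            case "quot": return '"';
            case "nbsp": return nbspTo; // caller chooses the replacement
            default:     return 0;      // not recognised
        }
    }

    /** Converts a bare entity (no leading & or trailing ;) to a char, or 0. */
    static char bareEntityToChar(String bareEntity, char nbspTo) {
        if (bareEntity == null || bareEntity.isEmpty()) {
            return 0;
        }
        if (bareEntity.charAt(0) == '#') {
            boolean hex = bareEntity.startsWith("#x") || bareEntity.startsWith("#X");
            try {
                int code = Integer.parseInt(bareEntity.substring(hex ? 2 : 1), hex ? 16 : 10);
                // Only code points that fit in a single Java char are handled here.
                return (code > 0 && code <= 0xFFFF) ? (char) code : 0;
            } catch (NumberFormatException e) {
                return 0; // malformed number, e.g. "#" or "#xzz"
            }
        }
        // Lower-casing costs time, which is one plausible reason
        // lower-case input would be faster.
        return namedEntity(bareEntity.toLowerCase(java.util.Locale.ROOT), nbspTo);
    }

    public static void main(String[] args) {
        System.out.println(bareEntityToChar("lt", ' '));
        System.out.println((int) bareEntityToChar("#x12ff", ' '));
    }
}
```

Returning 0 for anything unrecognised mirrors the documented contract, so a caller can test the result against 0 and leave the original text untouched.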
-
flattenHTML
public static String flattenHTML(String text, char translateNbspTo)
Strips tags and entities from HTML. Leaves \n and \r unchanged.
- Parameters:
- text - to flatten.
- translateNbspTo - char you would like &nbsp; translated to, usually ' ' or (char) 160.
- Returns:
- flattened text
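One point worth illustrating is the order of the two steps. The sketch below assumes flattening means removing tags first and decoding entities second; whether EntityStripper is built this way is not stated here. FlattenSketch and flatten are hypothetical names, every tag is replaced by a space for simplicity, and only five entities are handled.

```java
// Illustrative sketch only; assumes "flatten" = strip tags, then decode entities.
final class FlattenSketch {

    static String flatten(String html, char nbspTo) {
        // Step 1: remove tags while entities are still encoded, so that
        // text such as &lt;b&gt; is not mistaken for a real tag later.
        String noTags = html.replaceAll("<[^>]*>", " ");

        // Step 2: decode entities. &amp; goes last so that &amp;lt;
        // becomes the literal text &lt; instead of being decoded twice.
        return noTags
                .replace("&nbsp;", String.valueOf(nbspTo))
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(flatten("<p>1 &lt; 2 &amp;&amp; x&nbsp;y</p>", ' '));
    }
}
```

Line breaks pass through untouched in this sketch because nothing collapses whitespace, which matches the documented behaviour of leaving \n and \r unchanged.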
-
flattenXML
public static String flattenXML(String text)
Strips tags and entities from XML.
- Parameters:
- text - to flatten.
- Returns:
- flattened text
-
possEntityToChar
public static char possEntityToChar(String possBareEntityWithSemicolon)
Checks a number of gauntlet conditions to ensure this is a valid entity. Converts the entity to the corresponding char.
- Parameters:
- possBareEntityWithSemicolon - string that may hold an entity. The leading & must be stripped, but the string may optionally contain text past the ;.
- Returns:
- corresponding Unicode character, or 0 if the entity is invalid. &nbsp; -> (char) 160
-
stripHTMLEntities
public static String stripHTMLEntities(String text, char translateNbspTo)
Converts HTML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged. Also strips decimal and hex entities and stray HTML entities.
- Parameters:
- text - raw text to be processed. Must not be null.
- translateNbspTo - char you would like &nbsp; translated to, usually ' ' or (char) 160.
- Returns:
- translated text. It also handles HTML 4.0 entities, such as those for ♥ and {, and &nbsp; -> 160. null input returns null.
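A scan of this kind can be sketched as follows, using the documented length bounds (4 and 10, counting the & and the ;) to reject candidates cheaply before any lookup. This is an illustration under assumptions, not the library's code: EntityScanSketch, stripEntities and decode are hypothetical names, and the entity table is a five-entry stand-in.

```java
// Illustrative sketch only; not the toxi.data.feeds.util.EntityStripper source.
final class EntityScanSketch {
    static final int SHORTEST = 4;  // e.g. "&lt;", counting & and ;
    static final int LONGEST = 10;  // upper bound taken from the field docs

    /** Decodes a bare entity from a tiny stand-in table, or returns 0. */
    private static char decode(String bare, char nbspTo) {
        switch (bare) {
            case "lt":   return '<';
            case "gt":   return '>';
            case "amp":  return '&';
            case "quot": return '"';
            case "nbsp": return nbspTo;
            default:     break;
        }
        if (!bare.startsWith("#")) {
            return 0;
        }
        boolean hex = bare.startsWith("#x") || bare.startsWith("#X");
        try {
            int code = Integer.parseInt(bare.substring(hex ? 2 : 1), hex ? 16 : 10);
            return (code > 0 && code <= 0xFFFF) ? (char) code : 0;
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    /** Replaces recognised entities; everything else passes through unchanged. */
    static String stripEntities(String text, char nbspTo) {
        if (text == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '&') {
                int semi = text.indexOf(';', i + 1);
                int len = semi - i + 1; // candidate length including & and ;
                if (semi > 0 && len >= SHORTEST && len <= LONGEST) {
                    char decoded = decode(text.substring(i + 1, semi), nbspTo);
                    if (decoded != 0) {
                        out.append(decoded);
                        i = semi + 1; // skip past the whole entity
                        continue;
                    }
                }
            }
            out.append(c); // ordinary text, or an & that starts no known entity
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripEntities("a &lt; b &amp; c", ' '));
    }
}
```

A stray & that does not begin a recognised entity is copied through as-is in this sketch; how the library treats such input is not specified here.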
-
stripHTMLTags
public static String stripHTMLTags(String html)
Removes tags from HTML, leaving just the raw text. Leaves &nbsp; and other entities as is, e.g. does not convert &amp; back to &. Similar to code in Quoter. Also removes <!-- --> comments and the text between applet, style and script tag pairs. Presumes perfectly formed HTML: no > in comments, all <...> balanced.
- Parameters:
- html - input HTML.
- Returns:
- raw text, with runs of whitespace collapsed to a single space, trimmed.
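The behaviour described above can be approximated with a few regular expressions. The sketch below is a simplified stand-in, not the library's implementation: TagStripSketch and stripTags are hypothetical names, every removed tag is replaced by a space (an assumption made for simplicity), and, as with the documented method, well-formed input is presumed.

```java
// Illustrative sketch only; not the toxi.data.feeds.util.EntityStripper source.
final class TagStripSketch {

    static String stripTags(String html) {
        if (html == null) {
            return null;
        }
        String s = html
                // comments, which may span several lines
                .replaceAll("(?s)<!--.*?-->", " ")
                // script, style and applet elements together with their content
                .replaceAll("(?is)<(script|style|applet)\\b[^>]*>.*?</\\1\\s*>", " ")
                // any remaining tag; entities such as &amp; are left alone
                .replaceAll("<[^>]*>", " ");
        // collapse runs of whitespace to a single space, then trim
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```

The comment and script patterns run before the generic tag pattern, so that a > inside a script body or comment is never treated as the end of a tag.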
-
stripXMLEntities
public static String stripXMLEntities(String text)
Converts XML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged. Also strips decimal and hex entities and stray HTML entities.
- Parameters:
- text - raw XML text to be processed. Must not be null.
- Returns:
- translated text. null input returns null.
-
stripXMLTags
public static String stripXMLTags(String xml)
Removes tags from XML, leaving just the raw text. Leaves entities as is, e.g. does not convert &amp; back to &. Similar to code in Quoter. Also removes <!-- --> comments. Presumes perfectly formed XML: no > in comments, all <...> balanced.
- Parameters:
- xml - input XML.
- Returns:
- raw text, with runs of whitespace collapsed to a single space, trimmed.
-