Package toxi.data.feeds.util
Class EntityStripper
java.lang.Object
toxi.data.feeds.util.EntityStripper
Strips HTML entities such as &quot; from a string, replacing them with their
Unicode equivalents.
- Since:
- 2002-07-14
- Version:
- 2.6 2009-04-05 - StripEntities now leaves a space behind when it removes a etc tag.
- Author:
- Roedy Green, Canadian Mind Products
-
Field Summary
Modifier and Type | Field | Description
static final int | LONGEST_ENTITY | The longest an entity can be: 10, at least in our tables, including the lead & and trailing ;.
static final int | SHORTEST_ENTITY | The shortest an entity can be: 4, at least in our tables, including the lead & and trailing ;.
static final char | UNICODE_NBSP_160_0x0a | Unicode nbsp control char, 160 (0xa0).
-
Constructor Summary
-
Method Summary
Modifier and Type | Method | Description
static char | bareHTMLEntityToChar(String bareEntity, char howToTranslateNbsp) | Converts an entity to a single char.
static String | flattenHTML(String text, char translateNbspTo) | Strips tags and entities from HTML.
static String | flattenXML(String text) | Strips tags and entities from XML.
static char | possEntityToChar(String possBareEntityWithSemicolon) | Checks a number of gauntlet conditions to ensure this is a valid entity.
static String | stripHTMLEntities(String text, char translateNbspTo) | Converts HTML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged.
static String | stripHTMLTags(String html) | Removes tags from HTML, leaving just the raw text.
static String | stripXMLEntities(String text) | Converts XML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged.
static String | stripXMLTags(String xml) | Removes tags from XML, leaving just the raw text.
-
Field Details
-
UNICODE_NBSP_160_0x0a
public static final char UNICODE_NBSP_160_0x0a
Unicode nbsp control char, 160 (0xa0).
-
LONGEST_ENTITY
public static final int LONGEST_ENTITY
The longest an entity can be: 10, at least in our tables, including the lead & and trailing ;.
-
SHORTEST_ENTITY
public static final int SHORTEST_ENTITY
The shortest an entity can be: 4, at least in our tables, including the lead & and trailing ;.
-
-
Constructor Details
-
EntityStripper
public EntityStripper()
-
-
Method Details
-
bareHTMLEntityToChar
Converts an entity to a single char.
- Parameters:
bareEntity - String entity to convert. Must have the lead & and trailing ; stripped; may have the form #x12ff, #123, lt or nbsp. Works faster if the entity is in lower case.
howToTranslateNbsp - char you would like &nbsp; translated to, usually ' ' or (char) 160.
- Returns:
- equivalent character. 0 if not recognised.
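The three accepted forms (hex, decimal, named) can be illustrated with a minimal sketch. This is not the library's code: the class `BareEntitySketch`, its method `toChar` and the four-entry name table are hypothetical stand-ins, and a real table would cover the full HTML 4.0 entity set.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of bare-entity decoding; not the EntityStripper implementation. */
public class BareEntitySketch {
    // Tiny illustrative table (assumption); the real tables are far larger.
    private static final Map<String, Character> NAMED = Map.of(
            "lt", '<', "gt", '>', "amp", '&', "quot", '"');

    /** Returns the decoded char, or 0 if the bare entity is not recognised. */
    public static char toChar(String bareEntity, char howToTranslateNbsp) {
        if (bareEntity == null || bareEntity.isEmpty()) {
            return 0;
        }
        if (bareEntity.charAt(0) == '#') {
            // Numeric form: #123 (decimal) or #x12ff (hex).
            try {
                String digits = bareEntity.substring(1);
                boolean hex = digits.startsWith("x") || digits.startsWith("X");
                int code = hex ? Integer.parseInt(digits.substring(1), 16)
                               : Integer.parseInt(digits);
                // Only values that fit in a single UTF-16 char are accepted.
                return (code > 0 && code <= Character.MAX_VALUE) ? (char) code : (char) 0;
            } catch (NumberFormatException e) {
                return 0;
            }
        }
        // Named form: lower-case first so that LT and lt behave the same.
        String name = bareEntity.toLowerCase(Locale.ROOT);
        if (name.equals("nbsp")) {
            return howToTranslateNbsp;
        }
        Character c = NAMED.get(name);
        return c == null ? (char) 0 : c.charValue();
    }
}
```

Returning 0 for anything unrecognised mirrors the documented contract, so a caller can fall back to leaving the original text untouched.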
-
flattenHTML
Strips tags and entities from HTML. Leaves \n \r unchanged.
- Parameters:
text - to flatten.
translateNbspTo - char you would like &nbsp; translated to, usually ' ' or (char) 160.
- Returns:
- flattened text
-
flattenXML
Strips tags and entities from XML.
- Parameters:
text - to flatten.
- Returns:
- flattened text
-
possEntityToChar
Checks a number of gauntlet conditions to ensure this is a valid entity. Converts the entity to the corresponding char.
- Parameters:
possBareEntityWithSemicolon - string that may hold an entity. The lead & must be stripped, but the string may optionally contain text past the ;
- Returns:
- corresponding unicode character, or 0 if the entity is invalid. nbsp -> (char) 160
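The page does not list the gauntlet conditions, so the following is only a guess at what such a check could look like: locate the semicolon, test the length against the documented limits of 4 and 10 (counting the lead & and trailing ;), reject bodies with unexpected characters, then look the entity up. The class `PossEntitySketch`, its method `decode` and the five-name lookup are hypothetical.

```java
/** Hypothetical sketch of a validity gauntlet; not the EntityStripper implementation. */
public class PossEntitySketch {
    // Limits taken from the field descriptions; both count the lead & and trailing ;.
    static final int SHORTEST_ENTITY = 4;
    static final int LONGEST_ENTITY = 10;

    /** Input has its lead & stripped and may carry text past the ;. Returns 0 if invalid. */
    public static char decode(String possBareEntityWithSemicolon) {
        String s = possBareEntityWithSemicolon;
        if (s == null) {
            return 0;
        }
        int semi = s.indexOf(';');
        if (semi < 0) {
            return 0;
        }
        // Characters before the ; plus the stripped & plus the ; itself.
        int fullLength = semi + 2;
        if (fullLength < SHORTEST_ENTITY || fullLength > LONGEST_ENTITY) {
            return 0;
        }
        String bare = s.substring(0, semi);
        for (int i = 0; i < bare.length(); i++) {
            char c = bare.charAt(i);
            boolean ok = Character.isLetterOrDigit(c) || (i == 0 && c == '#');
            if (!ok) {
                return 0;
            }
        }
        return lookup(bare);
    }

    // Small illustrative lookup (assumption): five names plus numeric forms.
    private static char lookup(String bare) {
        switch (bare) {
            case "lt": return '<';
            case "gt": return '>';
            case "amp": return '&';
            case "quot": return '"';
            case "nbsp": return (char) 160;
            default: break;
        }
        if (bare.charAt(0) != '#') {
            return 0;
        }
        try {
            boolean hex = bare.charAt(1) == 'x' || bare.charAt(1) == 'X';
            int code = hex ? Integer.parseInt(bare.substring(2), 16)
                           : Integer.parseInt(bare.substring(1));
            return (code > 0 && code <= Character.MAX_VALUE) ? (char) code : (char) 0;
        } catch (NumberFormatException e) {
            return 0;
        }
    }
}
```

Checking the length before anything else keeps the common failure case cheap: a stray & followed by ordinary prose is usually rejected as soon as the distance to the next ; is known.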
-
stripHTMLEntities
Converts HTML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged. Also strips decimal and hex entities and stray HTML entities.
- Parameters:
text - raw text to be processed. Must not be null.
translateNbspTo - char you would like &nbsp; translated to, usually ' ' or (char) 160.
- Returns:
- translated text. It also handles HTML 4.0 entities such as &hearts; and &#123;, and &nbsp; -> 160. null input returns null.
-
stripHTMLTags
Removes tags from HTML, leaving just the raw text. Leaves entities as is, e.g. does not convert &amp; back to &. Similar to code in Quoter. Also removes <!-- --> comments. Presumes perfectly formed HTML: no > in comments, all <...> balanced. Also removes text between applet, style and script tag pairs. Leaves &nbsp; and other entities as is.
- Parameters:
html - input HTML.
- Returns:
- raw text, with whitespaces collapsed to a single space, trimmed.
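As a rough illustration of the behaviour described above, a regex-based sketch might proceed in four passes: drop comments, drop applet/style/script pairs together with their content, drop the remaining tags, then collapse whitespace. The class `TagStripSketch` and its method `stripTags` are hypothetical, and the sketch replaces every tag with a space, which may differ from how the library decides where to leave a space.

```java
import java.util.regex.Pattern;

/** Hypothetical regex-based sketch of tag stripping; not the EntityStripper implementation. */
public class TagStripSketch {
    private static final Pattern COMMENTS =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // Opening tag, content and matching closing tag of applet, style or script.
    private static final Pattern BLOCKS = Pattern.compile(
            "<(applet|style|script)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Presumes well-formed input, as the method description does. Entities are left as is. */
    public static String stripTags(String html) {
        String s = COMMENTS.matcher(html).replaceAll(" ");
        s = BLOCKS.matcher(s).replaceAll(" ");
        s = TAGS.matcher(s).replaceAll(" ");
        // Collapse runs of whitespace to one space and trim the ends.
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }
}
```

Comments and script-like blocks are removed before the generic tag pass so that a < or > inside a script body cannot be mistaken for markup.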
-
stripXMLEntities
Converts XML to text, converting entities such as &quot; back to " and &lt; back to <. Ordinary text passes unchanged. Also strips decimal and hex entities and stray HTML entities.
- Parameters:
text - raw XML text to be processed. Must not be null.
- Returns:
- translated text. null input returns null.
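A single-pass decoder for the five predefined XML entities plus decimal and hex character references gives a feel for this kind of conversion. The class `XmlEntitySketch` and its method `decode` are hypothetical. Unlike the method documented here, the sketch does not attempt stray HTML entities and simply leaves anything it does not recognise unchanged.

```java
/** Hypothetical sketch of XML entity decoding; not the EntityStripper implementation. */
public class XmlEntitySketch {
    public static String decode(String text) {
        if (text == null) {
            return null;
        }
        if (text.indexOf('&') < 0) {
            return text; // fast path: ordinary text passes unchanged
        }
        StringBuilder sb = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int semi = (c == '&') ? text.indexOf(';', i + 1) : -1;
            int decoded = (semi > 0) ? toCodePoint(text.substring(i + 1, semi)) : -1;
            if (decoded >= 0) {
                sb.appendCodePoint(decoded);
                i = semi + 1; // skip past the whole entity, so output is never re-decoded
            } else {
                sb.append(c);
                i++;
            }
        }
        return sb.toString();
    }

    /** Returns the code point for a bare entity, or -1 if it is not recognised. */
    private static int toCodePoint(String bare) {
        switch (bare) {
            case "lt": return '<';
            case "gt": return '>';
            case "amp": return '&';
            case "quot": return '"';
            case "apos": return '\'';
            default: break;
        }
        if (bare.length() < 2 || bare.charAt(0) != '#') {
            return -1;
        }
        try {
            int code = bare.charAt(1) == 'x'
                    ? Integer.parseInt(bare.substring(2), 16)
                    : Integer.parseInt(bare.substring(1));
            return (code > 0 && Character.isValidCodePoint(code)) ? code : -1;
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

Because the scan moves past each decoded entity, input such as `&amp;lt;` yields the literal text `&lt;` and is not decoded a second time.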
-
stripXMLTags
Removes tags from XML, leaving just the raw text. Leaves entities as is, e.g. does not convert &amp; back to &. Similar to code in Quoter. Also removes <!-- --> comments. Presumes perfectly formed XML: no > in comments, all <...> balanced.
- Parameters:
xml - input XML.
- Returns:
- raw text, with whitespaces collapsed to a single space, trimmed.
-