Html Processing

Ideas

Some separation of concerns:

  1. Parsing HTML for skin rendering
  2. Adding missing tags
  3. Adding formatting tags
  4. Entity encoding
  5. Plugin architecture for things like wiki formatting

Potentially of interest: <http://mercury.ccil.org/~cowan/XML/tagsoup/>.

Another interesting package (via Jürg on helma-dev): <http://htmlparser.sourceforge.net/>

Plan A

  1. Keep current code as starting point, as I can't find any other code with a similar feature mix (most importantly smart formatting) that looks like it's worth the switch.
  2. Separate character entity escaping from the formatting/tag closing.
  3. Update the list of recognized tags from the Tagsoup project.
  4. Allow for plugins to handle formatting at various stages, e.g. before/after/instead of default formatting.
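
The plugin stages from item 4 could be sketched roughly as follows. This is only an illustration: `FormatPlugin`, `FormatPipeline` and `defaultFormat` are hypothetical names that do not exist in Helma, and the stand-in default formatter merely turns newlines into break tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical plugin contract; none of these names exist in Helma. */
interface FormatPlugin {
    /** Runs before default formatting and returns the (possibly changed) text. */
    default String before(String text) { return text; }

    /** May replace default formatting entirely; empty means "not handled". */
    default Optional<String> instead(String text) { return Optional.empty(); }

    /** Runs on the formatted result. */
    default String after(String text) { return text; }
}

final class FormatPipeline {
    private final List<FormatPlugin> plugins = new ArrayList<>();

    FormatPipeline register(FormatPlugin plugin) {
        plugins.add(plugin);
        return this;
    }

    String format(String input) {
        String text = input;
        for (FormatPlugin p : plugins) {
            text = p.before(text);
        }

        // The first plugin that claims the text replaces the default formatter.
        String formatted = null;
        for (FormatPlugin p : plugins) {
            Optional<String> result = p.instead(text);
            if (result.isPresent()) {
                formatted = result.get();
                break;
            }
        }
        if (formatted == null) {
            formatted = defaultFormat(text);
        }

        for (FormatPlugin p : plugins) {
            formatted = p.after(formatted);
        }
        return formatted;
    }

    /** Stand-in for the existing smart formatter: newlines become break tags. */
    static String defaultFormat(String text) {
        return text.replace("\n", "<br />\n");
    }
}
```

A wiki plugin would then override only `before` (to translate wiki markup into HTML) or `instead` (to take over formatting completely), and leave the other stages at their pass-through defaults.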

Plan B

  1. Keep current code for character entity escaping only.
  2. Use Tagsoup to clean up tags and, drawing on the logic of Helma's old HTML formatter, to generate break/paragraph tags.
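
Character entity escaping as a standalone step, which both plans keep apart from tag handling, might look like the minimal sketch below. `EntityEncoder` is a hypothetical name, and the sketch covers only the four characters that are significant in HTML text and double-quoted attribute values; what Helma's current code escapes beyond that is not assumed here.

```java
/** Hypothetical helper; not part of Helma. */
final class EntityEncoder {
    private EntityEncoder() {}

    /** Escapes &, <, > and " so the text can be embedded in HTML. */
    static String encode(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

Note that this sketch re-escapes entities that are already present in the input (`&amp;` becomes `&amp;amp;`), so it should be applied exactly once, to raw text only.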

Open Issues

We should provide a way to allow only certain tag/attribute combinations, either to exclude scripts or simply to keep people from breaking the layout.
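
One possible shape for such a tag/attribute allowlist is sketched below. `TagPolicy` and its methods are hypothetical names, and the sketch only checks names; it does not inspect attribute values, so something like a `javascript:` URL in an allowed `href` would still need separate handling.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical allowlist of tags and, per tag, of attribute names. */
final class TagPolicy {
    private final Map<String, Set<String>> allowed = new HashMap<>();

    /** Allows a tag together with the listed attributes (names are case-insensitive). */
    TagPolicy allow(String tag, String... attributes) {
        Set<String> attrs = new HashSet<>();
        for (String a : attributes) {
            attrs.add(a.toLowerCase(Locale.ROOT));
        }
        allowed.put(tag.toLowerCase(Locale.ROOT), attrs);
        return this;
    }

    boolean allowsTag(String tag) {
        return allowed.containsKey(tag.toLowerCase(Locale.ROOT));
    }

    /** Returns only the attributes permitted for the tag, in their original order. */
    Map<String, String> filterAttributes(String tag, Map<String, String> attributes) {
        Set<String> permitted =
            allowed.getOrDefault(tag.toLowerCase(Locale.ROOT), Collections.emptySet());
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            if (permitted.contains(e.getKey().toLowerCase(Locale.ROOT))) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

With `new TagPolicy().allow("a", "href").allow("b")`, a `script` tag is rejected outright and an `onclick` attribute on a link is dropped while its `href` is kept.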

Skin parsing might start from this code too if we move to HTML/XML style skin tags.

Tagsoup notes

  • Very small (around 50 k)
  • Implements a plain SAX2 parser
  • Never ever throws an exception
  • But does tag balancing, tag insertion etc.
  • Is pretty good at this tag balancing business, probably better than HtmlParser
  • Likes to convert HTML snippets to whole documents, which is not really convenient for some of our purposes (but not a big problem either: just provide convenience methods to drop the html and body tags)
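
Because Tagsoup reports its results as SAX2 events, the convenience method for dropping the html and body wrappers could be a small SAX filter along these lines. This is a sketch: `WrapperStrippingFilter` is a hypothetical name, and it uses only the JDK's `org.xml.sax` classes, so it works with any `XMLReader` as its parent.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/** Forwards all SAX events except the start and end of the html and body wrappers. */
final class WrapperStrippingFilter extends XMLFilterImpl {
    private static final List<String> WRAPPERS = Arrays.asList("html", "body");

    /** Uses the local name when the parser is namespace-aware, the qualified name otherwise. */
    private static boolean isWrapper(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return name != null && WRAPPERS.contains(name.toLowerCase(Locale.ROOT));
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (!isWrapper(localName, qName)) {
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (!isWrapper(localName, qName)) {
            super.endElement(uri, localName, qName);
        }
    }
}
```

In use, the filter would sit between the parser and the real content handler: set the Tagsoup parser as its parent with `setParent(...)`, register the downstream handler with `setContentHandler(...)`, and call `parse(...)` on the filter instead of on the parser.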

HtmlParser notes

  • sufficiently small (full jar is ~300 k)
  • low level lexer is even smaller (~70 k), might be enough for us
  • NodeFactory class is used to create Nodes
  • most important Node subinterfaces are Text and Tag
  • Tag has methods getEnders() and getEndTagEnders() that determine which start tags and which end tags, respectively, will close this tag; this is how tag balancing/injection of virtual end tags is implemented
  • Tag balancing is only done if a matching subclass of CompositeTag exists and is registered
  • Nodes and NodeLists have a toHtml() method that converts them back to HTML text
  • HtmlParser provides an advanced filtering/nodewalking framework that would be quite useful for tag/attribute filtering
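
To make the enders idea concrete, here is a simplified model of that kind of tag balancing. It is not HtmlParser's code: `EnderBalancer` and `rule` are hypothetical names, and tokens are reduced to bare tag names (`"li"` for a start tag, `"/li"` for an end tag) with no text nodes or attributes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified model of ender-based balancing; not the HtmlParser implementation. */
final class EnderBalancer {
    // For each tag: start tags that implicitly close it, and end tags that implicitly close it.
    private final Map<String, Set<String>> enders = new HashMap<>();
    private final Map<String, Set<String>> endTagEnders = new HashMap<>();

    EnderBalancer rule(String tag, Set<String> startEnders, Set<String> endEnders) {
        enders.put(tag, startEnders);
        endTagEnders.put(tag, endEnders);
        return this;
    }

    /** Returns the token stream with virtual end tags inserted. */
    List<String> balance(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>();
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            boolean isEnd = token.startsWith("/");
            String name = isEnd ? token.substring(1) : token;
            Map<String, Set<String>> table = isEnd ? endTagEnders : enders;

            // Close every open tag, innermost first, that this token is an ender for.
            while (!open.isEmpty()
                    && table.getOrDefault(open.peek(), Collections.emptySet()).contains(name)) {
                out.add("/" + open.pop());
            }

            if (!isEnd) {
                open.push(name);
                out.add(token);
            } else if (!open.isEmpty() && open.peek().equals(name)) {
                open.pop();
                out.add(token);
            }
            // An end tag that matches nothing open is dropped.
        }
        // Close whatever is still open at the end of input.
        while (!open.isEmpty()) {
            out.add("/" + open.pop());
        }
        return out;
    }
}
```

With a rule saying that `li` is closed by a following `li` start tag and by a `ul` or `ol` end tag, the input `ul, li, li, /ul` comes out as `ul, li, /li, li, /li, /ul`.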