XML developers constantly face questions about how fine-grained their data structures should be, along with the harder problem of working with documents where other people chose coarse-grained structures. While tools like XSLT can do an excellent job of retrieving needles from haystacks, it's much easier to extract needles that are labelled and cleanly separated from the surrounding content.
The com.simonstl.fragment package allows developers to specify rules, written as regular expressions, that are applied to element content during parsing. As the document is parsed, those rules are applied to the textual content of the specified elements and new child elements are created, adding extra markup information to the document.
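To make the idea concrete, here is a minimal sketch of the general technique, not the package's own API. It assumes a hypothetical DateFragmenter filter with one hard-coded rule for a date element, uses java.util.regex as a stand-in regular expression engine, and assumes the target element contains only text. The class names, element names, and the rule itself are all invented for illustration; the package's real rule syntax is described in its javadoc.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical SAX2 filter: splits the text of <date> into child elements. */
class DateFragmenter extends XMLFilterImpl {
    // Illustrative rule: one capture group per child element to create.
    private static final Pattern RULE = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");
    private static final String[] NAMES = {"year", "month", "day"};

    private final StringBuilder text = new StringBuilder();
    private boolean inTarget = false;

    DateFragmenter(XMLReader parent) { super(parent); }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, local, qName, atts);
        // Matching on qName, which a non-namespace-aware parser always reports.
        if ("date".equals(qName)) { inTarget = true; text.setLength(0); }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Buffer text inside the target element; pass everything else through.
        if (inTarget) text.append(ch, start, length);
        else super.characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (inTarget && "date".equals(qName)) {
            Matcher m = RULE.matcher(text);
            if (m.matches()) {
                // Emit one new child element per capture group.
                for (int i = 0; i < NAMES.length; i++) {
                    char[] part = m.group(i + 1).toCharArray();
                    super.startElement("", NAMES[i], NAMES[i], new AttributesImpl());
                    super.characters(part, 0, part.length);
                    super.endElement("", NAMES[i], NAMES[i]);
                }
            } else {
                // No match: forward the buffered text unchanged.
                char[] raw = text.toString().toCharArray();
                super.characters(raw, 0, raw.length);
            }
            inTarget = false;
        }
        super.endElement(uri, local, qName);
    }
}

/** Runs the filter and writes the resulting events back out as markup. */
class FragmentDemo {
    static String fragment(String xml) {
        final StringBuilder out = new StringBuilder();
        // Bare-bones serializer: ignores attributes and does no escaping.
        DefaultHandler printer = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(qName).append('>');
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(qName).append('>');
            }
        };
        try {
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            DateFragmenter filter = new DateFragmenter(parser);
            filter.setContentHandler(printer);
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Given the input <entry><date>2001-05-17</date></entry>, FragmentDemo.fragment returns <entry><date><year>2001</year><month>05</month><day>17</day></date></entry>. Text that does not match the rule passes through unchanged.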
Although there are a number of tools which provide text-based processing of XML documents, this one is free, open source under the Mozilla Public License, uses widely understood technology, and fits into a wide variety of Java-based parsing environments because it sits on top of any SAX2-based parser. Set up properly, it can even feed its events into a DOM builder, allowing developers to specify fragmentations during the parse and work with the results as a DOM tree.
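One way to get from SAX events to a DOM tree is sketched below, using the standard JAXP TransformerHandler rather than anything specific to this package. The SaxToDom class is hypothetical, and a plain pass-through XMLFilterImpl stands in for a fragmenting filter; any SAX2 filter could take its place.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical helper: parser -> SAX2 filter -> DOM tree. */
class SaxToDom {
    static Document build(String xml) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader parser = spf.newSAXParser().getXMLReader();

            // Stand-in for a fragmenting filter: forwards every event as-is.
            XMLFilterImpl filter = new XMLFilterImpl(parser);

            // Empty document that will receive the nodes built from the events.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();

            // An identity TransformerHandler is a ContentHandler that can
            // write the events it receives into a DOMResult.
            SAXTransformerFactory stf =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler toDom = stf.newTransformerHandler();
            toDom.setResult(new DOMResult(doc));

            filter.setContentHandler(toDom);
            filter.parse(new InputSource(new StringReader(xml)));
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the filter sits between the parser and the DOM builder, whatever child elements it creates during the parse show up as ordinary nodes in the finished tree.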
An overview of the discussions which led to this package is available on XML.com, in Leigh Dodds' "Parsing the Atom".
The Regular Fragmentations work is just getting started, and the tools are probably only fit for developers familiar with the Simple API for XML (SAX) and regular expressions. I'm hoping to add some supporting tools, like a viewer for testing expressions and watching them being applied, but the core functionality and the syntax for describing rules need to come first. Much more information - examples, API, etc. - is available in the javadoc for the package.
Download version 0.06 here.
Version 0.06 is about refactoring and the creation of some new classes. The code works, but isn't especially stable at this point - more refactoring is on the way.
Download version 0.05 here.
Version 0.05 adds support for the skip and chars results.
Download version 0.04 here.
Version 0.04 includes a build.xml file for use with the Apache Jakarta Project's Ant tool.
Download version 0.03 here.
Download version 0.02 here.
Download version 0.01 here.
Copyright 2001 Simon St.Laurent