Regular Fragmentations

Intro | Status | Download

Breaking Element Content into Child Elements

XML developers constantly face questions about how fine-grained their data structures should be, along with the difficult problem of handling documents where other people chose coarse-grained structures. While tools like XSLT can do an excellent job of retrieving needles from haystacks, it's much easier to extract needles that are labelled and cleanly separated from the surrounding content.

The com.simonstl.fragment package allows developers to specify rules using regular expressions. While the document is parsed, those rules are applied to the textual content of the specified elements, and new child elements are created, adding extra markup information to the document.
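The idea described above can be sketched with nothing but the JDK. This is not the com.simonstl.fragment API; the DateFragmenter class, the date, year, month, and day element names, and the date pattern are all assumptions made for illustration, and java.util.regex stands in for whichever regex engine the package wraps. The sketch buffers the text of one element, applies a regular expression to it, and emits a child element for each capture group.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter, not part of com.simonstl.fragment.
// Assumes <date> holds text only (no nested elements).
class DateFragmenter extends XMLFilterImpl {
    // Illustrative rule: an ISO-style date such as 2001-06-15.
    private static final Pattern RULE = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");
    private static final String[] CHILDREN = {"year", "month", "day"};

    private StringBuilder buffer; // non-null only while inside <date>

    DateFragmenter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, local, qName, atts);
        if ("date".equals(qName)) {
            buffer = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (buffer != null) {
            buffer.append(ch, start, length); // hold the text until the end tag
        } else {
            super.characters(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (buffer != null) {
            String text = buffer.toString();
            buffer = null;
            Matcher m = RULE.matcher(text.trim());
            if (m.matches()) {
                // One child element per capture group.
                for (int i = 0; i < CHILDREN.length; i++) {
                    String name = CHILDREN[i];
                    String value = m.group(i + 1);
                    super.startElement("", name, name, new AttributesImpl());
                    super.characters(value.toCharArray(), 0, value.length());
                    super.endElement("", name, name);
                }
            } else {
                // No match: pass the original text through unchanged.
                super.characters(text.toCharArray(), 0, text.length());
            }
        }
        super.endElement(uri, local, qName);
    }

    // Parses xml through the filter and returns the serialized result.
    static String fragment(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader filter = new DateFragmenter(factory.newSAXParser().getXMLReader());
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new SAXSource(filter, new InputSource(new StringReader(xml))),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fragment("<doc><date>2001-06-15</date></doc>"));
    }
}
```

For the input in main, the date element should come out with year, month, and day children in place of its original text. Content that does not match the rule is passed through untouched, which seems the safest default for a sketch like this.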

Although a number of tools provide text-based processing of XML documents, this one is free, open source under the Mozilla Public License, uses widely understood technology, and fits into a wide variety of Java-based parsing environments because it sits on top of any SAX2-based parser. Set up properly, it can even feed events into a DOM tree, so developers can specify fragmentations during the parse and then work with the results as a DOM tree.
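One way to picture the SAX-to-DOM arrangement mentioned above is to route a SAX filter's events through the JDK's identity transformer into a DOMResult. The package may wire this up differently; the SplitToDom and CommaSplitter classes and the keywords and item element names below are hypothetical, and the filter is a bare-bones split-style rule written only for this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

class SplitToDom {
    // Hypothetical filter: each comma-separated token in <keywords> becomes an <item>.
    // Assumes <keywords> holds text only.
    static class CommaSplitter extends XMLFilterImpl {
        private StringBuilder buffer; // non-null only while inside <keywords>

        CommaSplitter(XMLReader parent) {
            super(parent);
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            super.startElement(uri, local, qName, atts);
            if ("keywords".equals(qName)) {
                buffer = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (buffer != null) {
                buffer.append(ch, start, length);
            } else {
                super.characters(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            if (buffer != null) {
                String[] tokens = buffer.toString().trim().split("\\s*,\\s*");
                buffer = null;
                for (String token : tokens) {
                    if (token.isEmpty()) {
                        continue; // skip empty pieces rather than emit empty elements
                    }
                    super.startElement("", "item", "item", new AttributesImpl());
                    super.characters(token.toCharArray(), 0, token.length());
                    super.endElement("", "item", "item");
                }
            }
            super.endElement(uri, local, qName);
        }
    }

    // Parses xml through the filter and collects the resulting events as a DOM tree.
    static Document parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader filter = new CommaSplitter(factory.newSAXParser().getXMLReader());
            DOMResult result = new DOMResult();
            TransformerFactory.newInstance().newTransformer().transform(
                    new SAXSource(filter, new InputSource(new StringReader(xml))), result);
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<doc><keywords>xml, sax, regex</keywords></doc>");
        System.out.println(doc.getElementsByTagName("item").getLength());
    }
}
```

Because the filter sits between the parser and the DOM builder, the tree never contains the unfragmented text; application code sees only the item elements, as if the document had been authored that way.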

An overview of the discussions which led to this package is available on XML.com, in Leigh Dodds' "Parsing the Atom".

Status

The Regular Fragmentations work is just getting started, and the tools are probably only fit for developers familiar with the Simple API for XML (SAX) and regular expressions. I'm hoping to have some supporting tools, like a viewer for testing expressions and for watching the application of expressions, but the core functionality and the syntax for describing rules need to come first. Much more information - examples, API, etc. - is available in the javadoc for the package.

Download

Download version 0.11 here.

Version 0.11 refactors rule creation substantially, adding a ComponentFactory which uses SAX startElement information to build rule components.

Previous versions:

Download version 0.10 here.

Version 0.10 starts on a refactoring of the processing, but mostly succeeds in the regular expression area. You can now build wrappers around the regex engine of your choice. Three wrappers are provided - two for matching (with Xerces and the IBM regex4j package) and one for splitting (with the Apache regexp package).

Download version 0.09 here.

Version 0.09 cleans up the rules vocabulary a bit - resultElement and targetElement are both now element, using context to determine their meaning. The command line now has additional options, including a rules dump and the ability to choose your own parser.

Download version 0.08 here.

Version 0.08 adds much more thorough support for the DocComponent view of recursive processing. The code is more stable at this point - most refactoring is complete, but there's a bit left to go.

Download version 0.07 here.

Version 0.07 adds support for split() functionality. Again, the code works, but isn't especially stable at this point - yet more refactoring is on the way.

Download version 0.06 here.

Version 0.06 is about refactoring and the creation of some new classes. The code works, but isn't especially stable at this point - more refactoring is on the way.

Download version 0.05 here.

Version 0.05 adds support for the skip and chars results.

Download version 0.04 here.

Version 0.04 includes a build.xml file for use with the Java Apache Project's Ant tool.

Download version 0.03 here.

Download version 0.02 here.

Download version 0.01 here.


Copyright 2001 Simon St.Laurent