Regular Fragmentations

Packages
com.simonstl.fragment  

 

Intro | Example | Command Line | Recursion | Skips | Characters | Attributes | Splits | Repetition | Future | License (MPL) | Download

Fragmenting textual content into XML elements

The com.simonstl.fragment package lets developers fragment chunks of element content into smaller pieces while parsing with a SAX-compliant parser. It relies on the filtering capabilities of the SAX2 API.
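To make the filtering idea concrete, here is a minimal sketch of a SAX2 filter that sits between a parser and a handler and rewrites text into child elements. It is not the FragmentFilter class: WordFilter, Recorder, and the word element name are hypothetical, and the sketch assumes that splitting on whitespace is an acceptable stand-in for rule-driven fragmentation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class SaxFilterSketch {

    /** Hypothetical filter: reports each whitespace-separated word as a word element. */
    static class WordFilter extends XMLFilterImpl {
        private final StringBuilder text = new StringBuilder();

        WordFilter(XMLReader parent) { super(parent); }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Buffer the text: a parser may deliver one text node in several chunks.
            text.append(ch, start, length);
        }

        /** Emits buffered text as word elements; whitespace-only text is dropped. */
        private void flush() throws SAXException {
            String pending = text.toString().trim();
            text.setLength(0);
            if (pending.isEmpty()) return;
            for (String word : pending.split("\\s+")) {
                super.startElement("", "word", "word", new AttributesImpl());
                super.characters(word.toCharArray(), 0, word.length());
                super.endElement("", "word", "word");
            }
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            flush();
            super.startElement(uri, local, qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            flush();
            super.endElement(uri, local, qName);
        }
    }

    /** Downstream handler that writes the events it receives into a string. */
    static class Recorder extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            out.append('<').append(local.isEmpty() ? qName : local).append('>');
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            out.append("</").append(local.isEmpty() ? qName : local).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    /** Parses xml through the filter and returns what the downstream handler saw. */
    static String run(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            WordFilter filter = new WordFilter(factory.newSAXParser().getXMLReader());
            Recorder recorder = new Recorder();
            filter.setContentHandler(recorder);
            filter.parse(new InputSource(new StringReader(xml)));
            return recorder.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints <doc><word>1970</word><word>11</word></doc>
        System.out.println(run("<doc>1970 11</doc>"));
    }
}
```

The downstream handler never sees the original text node, only the elements the filter chose to report, which is the same position a fragmenting filter occupies in a SAX2 pipeline.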

The fragments are described using regular expressions, much like those specified in Appendix F of XML Schema Part 2: Datatypes. Matches are processed with a utility package from the Apache XML project's Xerces parser, and splits with the Apache Jakarta project's Regexp package. (For convenience, the FilterTester uses the Apache parser's SAX functionality as well as David Megginson's XML Writer.)

An overview of the discussions which led to this package is available on XML.com, in Leigh Dodds' Parsing the Atom.

An example

Fragment processing applies a set of rules to a given document during parsing. The rules are specified as an XML document, using a simple vocabulary:

<?xml version="1.0" encoding="UTF-8"?>
<fragmentRules xmlns="http://simonstl.com/ns/fragments/">

<fragmentRule pattern="(\d{2,5})(\d{2})-(\d{2})">
<applyTo>
  <element nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
  <element nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
</applyTo>
<produce>
  <element nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
  <element nsURI="http://simonstl.com/ns/types/" localName="year" prefix="type" />
  <element nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
</produce>
</fragmentRule>

</fragmentRules>

This set of fragmentRules includes one rule, which applies to two elements in the http://simonstl.com/ns/test/ namespace: gYearMonth and myYearMonth. The pattern attribute contains a regular expression, (\d{2,5})(\d{2})-(\d{2}), which defines how the content of those elements will be broken down. The produce element describes how each of those parts should be reported: as an element, as characters, or not at all (skipped).

This means that the XML document:

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth>1970-11</gYearMonth>
<myYearMonth>1970-11</myYearMonth>
</test>

will be reported as:

<?xml version="1.0" standalone="yes"?>
<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century><type:year xmlns:type="http://simonstl.com/ns/types/">70</type:year><type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></gYearMonth>
<myYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century><type:year xmlns:type="http://simonstl.com/ns/types/">70</type:year><type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></myYearMonth>
</test>
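The way the pattern divides 1970-11 into 19, 70, and 11 can be checked outside the package. The sketch below uses java.util.regex as a stand-in for the Xerces regular expression utility, and assumes that the whole element content must match the pattern; the class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupFragments {
    // Same expression as the pattern attribute of the fragmentRule above.
    static final Pattern YEAR_MONTH = Pattern.compile("(\\d{2,5})(\\d{2})-(\\d{2})");

    /** Returns capture groups 1..n, or an empty list if the text does not match. */
    static List<String> fragment(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = YEAR_MONTH.matcher(text);
        if (m.matches()) {
            // Group 0 (the entire match) is deliberately left out.
            for (int i = 1; i <= m.groupCount(); i++) {
                parts.add(m.group(i));
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(fragment("1970-11")); // prints [19, 70, 11]
    }
}
```

The first group is greedy, so it initially claims all four digits and then gives back two so that the second group can match. That backtracking is why the century comes out as 19 rather than 1970.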

Command Line Processing

The FilterTester class provides a means for you to test files and see the results. To run it, enter:

java com.simonstl.fragment.FilterTester rules.xml targetFile.xml

where rules.xml is the file containing the fragment rules and targetFile.xml contains the XML document to which those rules should be applied.

The FilterTester class also allows you to specify a parser from the command line or print out a view of the rules:

FilterTester requires two arguments - a rules file and then a target XML file

Options may appear before those arguments:
-m [parser class name]  - use the match processor specified
-p [parser class name] - specifies a SAX2 parser for use with this tool
-r  - parse and display the rules only
-s [parser class name]  - use the split processor specified
-ss [parser class name]  - use the series processor specified (not implemented)
More information is available at http://simonstl.com/projects/fragment/

The -m, -p, -r, and -s options are shortcuts for setting the Java properties used to configure the processor.

The FilterTester.java file also contains code which can show you how to instantiate the FragmentFilter class.

Recursive Processing

Rules may also be applied recursively to the results of prior rules. For some types - notably dates - this is important for proper processing of complex pieces. For example, we could add a rule which breaks the date information above even further, into individual digits:

<fragmentRule pattern="(\d{1})(\d{1})">
<applyTo>
<element nsURI="http://simonstl.com/ns/types/" localName="century"  />
<element nsURI="http://simonstl.com/ns/types/" localName="year"  />
<element nsURI="http://simonstl.com/ns/types/" localName="month"  />
</applyTo>
<produce>
<element nsURI="http://simonstl.com/ns/types/" localName="digit" prefix="type" />
<element nsURI="http://simonstl.com/ns/types/" localName="digit" prefix="type" />
</produce>
</fragmentRule>

When that rule is applied in combination with the previous rule, the results look like:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/"><type:digit>1</type:digit><type:digit>9</type:digit></type:century><type:year xmlns:type="http://simonstl.com/ns/types/"><type:digit>7</type:digit><type:digit>0</type:digit></type:year><type:month xmlns:type="http://simonstl.com/ns/types/"><type:digit>1</type:digit><type:digit>1</type:digit></type:month></gYearMonth>
<myYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/"><type:digit>1</type:digit><type:digit>9</type:digit></type:century><type:year xmlns:type="http://simonstl.com/ns/types/"><type:digit>7</type:digit><type:digit>0</type:digit></type:year><type:month xmlns:type="http://simonstl.com/ns/types/"><type:digit>1</type:digit><type:digit>1</type:digit></type:month></myYearMonth>
</test>
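The two-stage breakdown can be mimicked with plain java.util.regex calls: the first pattern produces the century, year, and month, and the second pattern is then applied to each of those pieces. This is only a sketch of the idea, assuming full-match semantics; the package itself drives recursion from the rules file, and the names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecursiveFragments {
    static final Pattern YEAR_MONTH = Pattern.compile("(\\d{2,5})(\\d{2})-(\\d{2})");
    static final Pattern DIGITS = Pattern.compile("(\\d{1})(\\d{1})");

    /** Returns capture groups 1..n of a full match, or an empty list. */
    static List<String> groups(Pattern pattern, String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                parts.add(m.group(i));
            }
        }
        return parts;
    }

    /** Applies the date pattern, then the digit pattern to every resulting piece. */
    static List<List<String>> fragmentTwice(String text) {
        List<List<String>> result = new ArrayList<>();
        for (String piece : groups(YEAR_MONTH, text)) {
            result.add(groups(DIGITS, piece));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fragmentTwice("1970-11")); // prints [[1, 9], [7, 0], [1, 1]]
    }
}
```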

Developers may find these smaller parts easier to work with.

Skipping results

Regular expressions sometimes produce more results than a program needs. The first result of a match, for example, is typically the entire matched content, which doesn't fit well with the fragmentation approach. Currently, the FragmentFilter discards that result unless you set skipFirst="false" on the fragmentRule element. At other times there may simply be content that needs to be deleted.

Content may be deleted by specifying a skip. For example, to skip the year in the century-year-month pattern of gYearMonth, the fragmentRule element below would work:

<fragmentRule pattern="(\d{2,5})(\d{2})-(\d{2})">
<applyTo>
<element nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
<element nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
</applyTo>
<produce>
<element nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
<skip/>
<element nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
</produce>
</fragmentRule>

Applied to the same document shown above, this rule would produce:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century><type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></gYearMonth>
<myYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century><type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></myYearMonth>
</test>

The year elements are now missing, as is their content.

Reporting some content as characters

While fragmenting content into child elements is useful, there may also be times when you want to generate mixed content, with one or more of the fragments passed through as characters. To do that, use the chars element, which tells the processor to pass the content of the matched fragment through as text without an element wrapper. For example, to allow the year to pass through without the type:year element containing it, you could apply these rules:

<fragmentRule pattern="(\d{2,5})(\d{2})-(\d{2})">
<applyTo>
<element nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
<element nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
</applyTo>
<produce>
<element nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
<chars/>
<element nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
</produce>
</fragmentRule>

Applied to the same document again, the results would look like:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century>70<type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></gYearMonth>
<myYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century>70<type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></myYearMonth>
</test>

You can also use the before and after attributes on the chars element to add textual content either before or after a fragment of text. (You can also use these attributes on the element and skip elements.) For example, you might emphasize the year by adding a "year: " label and an exclamation point afterward with the rule below:

<fragmentRule pattern="(\d{2,5})(\d{2})-(\d{2})">
<applyTo>
<element nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
<element nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
</applyTo>
<produce>
<element nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
<chars before="year: " after="!"/><!--includes year as characters -->
<element nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
</produce>
</fragmentRule>

This would produce:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century>year: 70!<type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></gYearMonth>
<myYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century>year: 70!<type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></myYearMonth>
</test>
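One way to picture the produce list is as a sequence of steps paired with the matched fragments in order. The sketch below models element, chars (with before and after text), and skip as a small hypothetical Step type and renders the result as a string. It leaves out namespace declarations, and it does not model before and after on element or skip, since their exact placement is not described here.

```java
import java.util.Arrays;
import java.util.List;

class ProduceSketch {
    enum Kind { ELEMENT, CHARS, SKIP }

    /** Hypothetical model of one child of the produce element. */
    static final class Step {
        final Kind kind;
        final String name;
        final String before;
        final String after;

        private Step(Kind kind, String name, String before, String after) {
            this.kind = kind;
            this.name = name;
            this.before = before;
            this.after = after;
        }

        static Step element(String name) { return new Step(Kind.ELEMENT, name, "", ""); }
        static Step chars(String before, String after) { return new Step(Kind.CHARS, null, before, after); }
        static Step skip() { return new Step(Kind.SKIP, null, "", ""); }
    }

    /** Pairs fragment i with step i and renders the combined output. */
    static String render(List<String> fragments, List<Step> steps) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < fragments.size() && i < steps.size(); i++) {
            Step step = steps.get(i);
            String text = fragments.get(i);
            switch (step.kind) {
                case ELEMENT:
                    out.append('<').append(step.name).append('>')
                       .append(text)
                       .append("</").append(step.name).append('>');
                    break;
                case CHARS:
                    out.append(step.before).append(text).append(step.after);
                    break;
                case SKIP:
                    break; // the fragment is dropped entirely
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList("19", "70", "11");
        List<Step> steps = Arrays.asList(
                Step.element("type:century"),
                Step.chars("year: ", "!"),
                Step.element("type:month"));
        // prints <type:century>19</type:century>year: 70!<type:month>11</type:month>
        System.out.println(render(fragments, steps));
    }
}
```

Swapping the middle step for Step.skip() drops the 70 entirely, which mirrors the skip example earlier in this section.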

Returning fragments as attributes

You can also report fragments as attributes of the element containing them by using attribute elements inside the produce element. You can mix and match attribute with skip, chars, and element as you find appropriate. For example, to report the century, year, and month as attributes rather than child elements, you might use:

<?xml version="1.0" encoding="UTF-8"?>
<fragmentRules xmlns="http://simonstl.com/ns/fragments/">
<fragmentRule pattern="(\d{2,5})(\d{2})-(\d{2})">
<applyTo>
<element nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
<element nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
</applyTo>

<produce>
<attribute nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
<attribute nsURI="http://simonstl.com/ns/types/" localName="year" prefix="type" />
<attribute nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
</produce>
</fragmentRule>
</fragmentRules>

When applied to the test document, the results would look like:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">

<message>Hello!  This document contains a gYearMonth.</message>

<gYearMonth type:year="70" type:month="11" type:century="19" xmlns:type="http://simonstl.com/ns/types/"></gYearMonth>

<myYearMonth type:year="70" type:month="11" type:century="19" xmlns:type="http://simonstl.com/ns/types/"></myYearMonth>
</test>

Repeating the same attribute name will leave only the last version in the final output.

Splitting content with delimiters

Some content doesn't simply match a pattern; instead, it needs to be split. The Regular Fragmentation package allows developers to specify split functionality (akin to the Perl split() function) by setting the split attribute of the fragmentRule element to "true":

<?xml version="1.0" encoding="UTF-8"?>
<fragmentRules xmlns="http://simonstl.com/ns/fragments/">

<fragmentRule pattern="(-)" split="true" skipFirst="false">
<applyTo>
<element nsURI="http://simonstl.com/ns/test/" localName="threepart"/>
</applyTo>
<produce>
<element nsURI="http://simonstl.com/ns/types/" localName="one" prefix="type" />
<element nsURI="http://simonstl.com/ns/types/" localName="two" prefix="type" />
<element nsURI="http://simonstl.com/ns/types/" localName="three" prefix="type" />
</produce>
</fragmentRule>

</fragmentRules>

Also note the use of skipFirst="false" to avoid discarding the first result of the split. These rules apply to a document that uses delimiters, like:

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a splittable piece.</message>
<threepart>2100-b4a50-999F</threepart>
</test>

When the rules are applied to the document, the results look like:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">

<message>Hello!  This document contains a splittable piece.</message>

<threepart>
<type:one xmlns:type="http://simonstl.com/ns/types/">2100</type:one>
<type:two xmlns:type="http://simonstl.com/ns/types/">b4a50</type:two>
<type:three xmlns:type="http://simonstl.com/ns/types/">999F</type:three>
</threepart>
</test>
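The split behaviour can be approximated with java.util.regex, used here as a stand-in for the Jakarta Regexp package that the processor relies on. In this sketch the delimiter pattern is the same (-) as in the rule, every piece is kept (as with skipFirst="false"), and the element names are paired with the pieces in order; the class and method names are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SplitFragments {
    // Same delimiter expression as the pattern attribute of the split rule.
    static final Pattern DELIMITER = Pattern.compile("(-)");

    /** Splits the text on the delimiter; the delimiters themselves are not returned. */
    static List<String> split(String text) {
        return Arrays.asList(DELIMITER.split(text));
    }

    /** Wraps piece i in the element named names.get(i), ignoring unmatched extras. */
    static String label(List<String> pieces, List<String> names) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < pieces.size() && i < names.size(); i++) {
            out.append('<').append(names.get(i)).append('>')
               .append(pieces.get(i))
               .append("</").append(names.get(i)).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> pieces = split("2100-b4a50-999F");
        System.out.println(pieces); // prints [2100, b4a50, 999F]
        System.out.println(label(pieces, Arrays.asList("type:one", "type:two", "type:three")));
    }
}
```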

Complex splits may benefit from the repetition feature described below.

Repetition

There may be times when you want to cycle through a set of elements. In this case, you can set repeat="yes" on the fragmentRule element. The processor will cycle through the list of rules you provide.
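As a rough illustration of cycling, the sketch below assigns element names to fragments by wrapping around the list of names once it is exhausted. This is an assumption about what cycling means in practice rather than a description of the processor's internals, and the key and value names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class RepeatSketch {
    /** Wraps fragment i in names.get(i % names.size()), so the names repeat in order. */
    static String label(List<String> fragments, List<String> names) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < fragments.size(); i++) {
            String name = names.get(i % names.size());
            out.append('<').append(name).append('>')
               .append(fragments.get(i))
               .append("</").append(name).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList("a", "1", "b", "2");
        // prints <key>a</key><value>1</value><key>b</key><value>2</value>
        System.out.println(label(fragments, Arrays.asList("key", "value")));
    }
}
```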

Future development

Future development will focus on:

It's very simple stuff, but these kinds of transformations can simplify processing XML content with compound information chunks significantly.

License

The contents of this package are subject to the Mozilla Public License Version 1.1 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.mozilla.org/MPL/.

Software distributed under the License is distributed on an "AS IS" basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License for the specific language governing rights and limitations under the License.

The Original Code is available at http://simonstl.com/projects/fragment/original.

The Initial Developer of the Original Code is Simon St.Laurent. Portions created by Simon St.Laurent are Copyright (C) 2001 Simon St.Laurent. All Rights Reserved.

Contributor(s):

Download

Download version 0.11 here.

Previous versions:

Download version 0.10 here.

Download version 0.09 here.

Download version 0.08 here.

Download version 0.07 here.

Download version 0.06 here.

Download version 0.05 here.

Download version 0.04 here.

Download version 0.03 here.

Download version 0.02 here.

Download version 0.01 here.