Intro | Example | Command Line | Recursion | Skips | Repetition | Future | License (MPL) | Download
The com.simonstl.fragment package allows developers to break chunks of element content into smaller pieces while parsing with a SAX-compliant parser. It relies on the filtering capabilities built into the SAX2 API.
The fragments are described using regular expressions, much like those specified in Appendix F of XML Schema Part 2, Datatypes. The actual regular expression processing is done using a utility package of the Apache XML project's Xerces parser. (For convenience, the FilterTester uses the Apache parser's SAX functionality as well as David Megginson's XML Writer.)
An overview of the discussions which led to this package is available on XML.com, in Leigh Dodds' Parsing the Atom.
Fragment processing applies a set of rules to a given document during parsing. The rules are specified as an XML document, using a simple vocabulary:
<?xml version="1.0" encoding="UTF-8"?>
<fragmentRules xmlns="http://simonstl.com/ns/fragments/">
  <fragmentRule pattern="(\d{2,5})(\d{2})-(\d{2})">
    <applyTo>
      <targetElement nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
      <targetElement nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
    </applyTo>
    <produce>
      <resultElement nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
      <resultElement nsURI="http://simonstl.com/ns/types/" localName="year" prefix="type" />
      <resultElement nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
    </produce>
  </fragmentRule>
</fragmentRules>
This set of fragmentRules includes one rule which applies to two elements in the http://simonstl.com/ns/test/ namespace: gYearMonth and myYearMonth. The pattern attribute contains a regular expression, (\d{2,5})(\d{2})-(\d{2}), which defines how the content of those elements will be broken down. The produce element contains information describing how those parts should be represented as elements.
This means that the XML document:
<test xmlns="http://simonstl.com/ns/test/">
  <message>Hello! This document contains a gYearMonth.</message>
  <gYearMonth>1970-11</gYearMonth>
  <myYearMonth>1970-11</myYearMonth>
</test>
will be reported as:
<?xml version="1.0" standalone="yes"?>
<test xmlns="http://simonstl.com/ns/test/">
  <message>Hello! This document contains a gYearMonth.</message>
  <gYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century><type:year xmlns:type="http://simonstl.com/ns/types/">70</type:year><type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></gYearMonth>
  <myYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century><type:year xmlns:type="http://simonstl.com/ns/types/">70</type:year><type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></myYearMonth>
</test>
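The transformation above can be approximated with a small SAX2 filter. The following is only a sketch of the general technique, not the package's FragmentFilter: the class name FragmentSketch and its run helper are hypothetical, the rule is hard-coded rather than loaded from a rules file, it uses java.util.regex instead of the Xerces regular expression utilities, and it assumes the target element contains text only.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical stand-in for illustration; not the package's FragmentFilter. */
public class FragmentSketch extends XMLFilterImpl {
    private static final String TEST_NS = "http://simonstl.com/ns/test/";
    private static final String TYPES_NS = "http://simonstl.com/ns/types/";
    private static final Pattern PATTERN = Pattern.compile("(\\d{2,5})(\\d{2})-(\\d{2})");
    private static final String[] RESULT_NAMES = {"century", "year", "month"};

    private StringBuilder buffer; // non-null while inside a target element

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, local, qName, atts);
        if (TEST_NS.equals(uri) && "gYearMonth".equals(local)) {
            buffer = new StringBuilder(); // start collecting text (assumes text-only content)
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (buffer != null) {
            buffer.append(ch, start, length); // hold text back until the element ends
        } else {
            super.characters(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (buffer != null) {
            emitFragments(buffer.toString());
            buffer = null;
        }
        super.endElement(uri, local, qName);
    }

    /** Reports one child element per capture group, or the text unchanged if no match. */
    private void emitFragments(String text) throws SAXException {
        Matcher m = PATTERN.matcher(text);
        if (!m.matches()) {
            super.characters(text.toCharArray(), 0, text.length());
            return;
        }
        super.startPrefixMapping("type", TYPES_NS);
        for (int i = 0; i < RESULT_NAMES.length; i++) {
            String piece = m.group(i + 1); // group 0 (the whole match) is not used
            String qName = "type:" + RESULT_NAMES[i];
            super.startElement(TYPES_NS, RESULT_NAMES[i], qName, new AttributesImpl());
            super.characters(piece.toCharArray(), 0, piece.length());
            super.endElement(TYPES_NS, RESULT_NAMES[i], qName);
        }
        super.endPrefixMapping("type");
    }

    /** Parses xml through the filter and returns a compact log of the events reported. */
    public static String run(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader parser = factory.newSAXParser().getXMLReader();
            FragmentSketch filter = new FragmentSketch();
            filter.setParent(parser);
            final StringBuilder log = new StringBuilder();
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes a) {
                    log.append('<').append(local).append('>');
                }
                @Override
                public void characters(char[] ch, int start, int length) {
                    log.append(ch, start, length);
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
            return log.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run("<test xmlns=\"http://simonstl.com/ns/test/\">"
                + "<gYearMonth>1970-11</gYearMonth></test>"));
    }
}
```

The downstream handler sees century, year and month start-element events with the text 19, 70 and 11, in place of the original character data. Buffering the text until endElement matters because a parser is free to deliver an element's content in several characters() calls.");
if (!out.equals("<test><gYearMonth><century>19<year>70<month>11")) throw new AssertionError(out);
String untouched = FragmentSketch.run("<test xmlns=\"http://simonstl.com/ns/test/\"><message>hi</message></test>");
if (!untouched.equals("<test><message>hi")) throw new AssertionError(untouched);
</test>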
The FilterTester class provides a means for you to test files and see the results. To run it, enter:
java FilterTester rules.xml targetFile.xml
where rules.xml is the file containing the fragment rules and targetFile.xml contains the XML document to which those rules should be applied. The FilterTester.java file also contains code which shows how to instantiate the FragmentFilter class.
Rules may also be applied recursively to the results of prior rules. For some types - notably dates - this is important for proper processing of complex pieces. For example, we could add a rule which breaks the date information above even further, into individual digits:
<fragmentRule pattern="(\d{1})(\d{1})">
  <applyTo>
    <targetElement nsURI="http://simonstl.com/ns/types/" localName="century" />
    <targetElement nsURI="http://simonstl.com/ns/types/" localName="year" />
    <targetElement nsURI="http://simonstl.com/ns/types/" localName="month" />
  </applyTo>
  <produce>
    <resultElement nsURI="http://simonstl.com/ns/types/" localName="digit" prefix="type" />
    <resultElement nsURI="http://simonstl.com/ns/types/" localName="digit" prefix="type" />
  </produce>
</fragmentRule>
When that rule is applied in combination with the previous rule, the results look like:
<?xml version="1.0" standalone="yes"?>
<test xmlns="http://simonstl.com/ns/test/">
  <message>Hello! This document contains a gYearMonth.</message>
  <gYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/"><type:digit>1</type:digit><type:digit>9</type:digit></type:century><type:year xmlns:type="http://simonstl.com/ns/types/"><type:digit>7</type:digit><type:digit>0</type:digit></type:year><type:month xmlns:type="http://simonstl.com/ns/types/"><type:digit>1</type:digit><type:digit>1</type:digit></type:month></gYearMonth>
  <myYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/"><type:digit>1</type:digit><type:digit>9</type:digit></type:century><type:year xmlns:type="http://simonstl.com/ns/types/"><type:digit>7</type:digit><type:digit>0</type:digit></type:year><type:month xmlns:type="http://simonstl.com/ns/types/"><type:digit>1</type:digit><type:digit>1</type:digit></type:month></myYearMonth>
</test>
Developers may find these smaller parts easier to work with.
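The idea of applying rules to the results of earlier rules can be sketched without any SAX machinery. In the following, the class RecursiveFragmentSketch, its Rule holder and the fragment method are hypothetical names used only for illustration; namespaces and prefixes are left out, rules are keyed by local name alone, and java.util.regex stands in for the Xerces utilities.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of recursive rule application; not part of the package. */
public class RecursiveFragmentSketch {

    /** A pattern plus the element names produced for its capture groups. */
    static final class Rule {
        final Pattern pattern;
        final String[] names;

        Rule(String regex, String... names) {
            this.pattern = Pattern.compile(regex);
            this.names = names;
        }
    }

    // Rules keyed by the local name of the element they apply to.
    private static final Map<String, Rule> RULES = new HashMap<>();
    static {
        RULES.put("gYearMonth",
                new Rule("(\\d{2,5})(\\d{2})-(\\d{2})", "century", "year", "month"));
        Rule digits = new Rule("(\\d{1})(\\d{1})", "digit", "digit");
        RULES.put("century", digits);
        RULES.put("year", digits);
        RULES.put("month", digits);
    }

    /** Wraps text in element, fragmenting it first if a rule matches. */
    public static String fragment(String element, String text) {
        StringBuilder out = new StringBuilder("<" + element + ">");
        Rule rule = RULES.get(element);
        Matcher m = (rule == null) ? null : rule.pattern.matcher(text);
        if (m != null && m.matches()) {
            for (int i = 0; i < rule.names.length; i++) {
                // Each produced element is itself checked against the rules.
                out.append(fragment(rule.names[i], m.group(i + 1)));
            }
        } else {
            out.append(text); // no rule, or no match: keep the text as it is
        }
        return out.append("</").append(element).append(">").toString();
    }

    public static void main(String[] args) {
        System.out.println(fragment("gYearMonth", "1970-11"));
    }
}
```

The recursion stops at digit because no rule targets that element. A rule whose result elements include its own target would need a guard against endless recursion, which this sketch does not provide.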
Regular expressions sometimes produce more results than a program needs. The first result (group 0), for example, is typically the entire matched content, which doesn't fit well with the fragmentation approach. Currently, the FragmentFilter discards that result unless you set skipFirst="false" on the fragmentRule element. At other times there may simply be content that needs to be deleted.
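To see why the first result is usually unwanted, it helps to list what a regular expression engine returns for the gYearMonth pattern. This is a small sketch using java.util.regex rather than the Xerces utilities; the class SkipFirstSketch and the pieces method are hypothetical, and the skipFirst parameter only imitates the attribute described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch showing group 0 next to the capture groups. */
public class SkipFirstSketch {

    /** Returns the match results, leaving out group 0 when skipFirst is true. */
    public static List<String> pieces(String regex, String text, boolean skipFirst) {
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> result = new ArrayList<>();
        if (m.matches()) {
            // Group 0 is the whole match; groups 1..groupCount() are the parenthesized parts.
            for (int i = skipFirst ? 1 : 0; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String regex = "(\\d{2,5})(\\d{2})-(\\d{2})";
        System.out.println(pieces(regex, "1970-11", true));
        System.out.println(pieces(regex, "1970-11", false));
    }
}
```

With skipFirst true the list is [19, 70, 11]; with skipFirst false it is [1970-11, 19, 70, 11], so every result shifts by one position against the result elements listed in produce.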
Content may be deleted by specifying a resultElement whose attributes are all set to the empty string. The RulesLoader also permits the use of an empty skip element, which produces the same result.
For example, to skip the year in the century-year-month pattern of gYearMonth, the fragmentRule element below would work:
<fragmentRule pattern="(\d{2,5})(\d{2})-(\d{2})">
  <applyTo>
    <targetElement nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
    <targetElement nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
  </applyTo>
  <produce>
    <resultElement nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
    <skip/><!--skips year - is equivalent to: <resultElement nsURI="" localName="" prefix="" />-->
    <resultElement nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
  </produce>
</fragmentRule>
Applied to the same document shown above, this rule would produce:
<?xml version="1.0" standalone="yes"?>
<test xmlns="http://simonstl.com/ns/test/">
  <message>Hello! This document contains a gYearMonth.</message>
  <gYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century><type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></gYearMonth>
  <myYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century><type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></myYearMonth>
</test>
The year elements are now missing, as is their content.
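The skipping behaviour can be imitated in a few lines. In this sketch an empty local name marks a result to drop, mirroring the resultElement with empty attributes described above; the class SkipSketch and the produce method are hypothetical, namespaces are omitted, and java.util.regex is used in place of the Xerces utilities.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of dropping a capture group from the output. */
public class SkipSketch {

    /** Wraps each capture group in the matching name; an empty name drops the group. */
    public static String produce(String regex, String text, String... localNames) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.matches()) {
            return text; // no match: content is left as it is
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < localNames.length && i < m.groupCount(); i++) {
            String name = localNames[i];
            if (name.isEmpty()) {
                continue; // skipped: neither the element nor its text is emitted
            }
            out.append('<').append(name).append('>')
               .append(m.group(i + 1))
               .append("</").append(name).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(
                produce("(\\d{2,5})(\\d{2})-(\\d{2})", "1970-11", "century", "", "month"));
    }
}
```

For the input 1970-11 this yields a century element containing 19 followed by a month element containing 11, with the year text gone entirely.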
There may be times when you want to cycle through a set of elements. In this case, you can set repeat="yes" on the fragmentRule element. The processor will cycle through the list of rules you provide.
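The exact semantics of repeat are not spelled out here, so the following shows only one possible reading: the pattern is matched repeatedly against the content, and the result element names are reused in order once the list runs out. The class RepeatSketch, the cycle method and the sample names x and y are all hypothetical, and java.util.regex stands in for the Xerces utilities.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of cycling through result element names. */
public class RepeatSketch {

    /** Wraps group 1 of every successive match, reusing names in round-robin order. */
    public static String cycle(String regex, String text, String... names) {
        Matcher m = Pattern.compile(regex).matcher(text);
        StringBuilder out = new StringBuilder();
        int count = 0;
        while (m.find()) {
            String name = names[count % names.length]; // wrap around the name list
            count++;
            out.append('<').append(name).append('>')
               .append(m.group(1))
               .append("</").append(name).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(cycle("(\\d+)", "10 20 30 40", "x", "y"));
    }
}
```

Under this reading, a list of coordinate pairs such as 10 20 30 40 comes out as alternating x and y elements. Text between matches is discarded in this sketch, which may or may not be what the FragmentFilter does.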
Future development will focus on:
- split() functionality as well as matching
- style attribute
- FragmentFilter class

It's very simple stuff, but these kinds of transformations can significantly simplify processing XML content with compound information chunks.
The contents of this package are subject to the Mozilla Public License Version 1.1 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.mozilla.org/MPL/.
Software distributed under the License is distributed on an "AS IS" basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License for the specific language governing rights and limitations under the License.
The Original Code is available at http://simonstl.com/projects/fragment/original.
The Initial Developer of the Original Code is Simon St.Laurent. Portions created by Simon St.Laurent are Copyright (C) 2001 Simon St.Laurent. All Rights Reserved.
Contributor(s):
Download version 0.03 here.
Download version 0.02 here.
Download version 0.01 here.