Regular Fragmentations Rule Language (RFRL)

2. Details

In general, RFRL isn't very picky about where it's stored. You can have "RFRL documents", where the fragmentRules element is the root element, or you can mix RFRL with other kinds of XML documents. It might, for example, make sense to mix RFRL with namespace processing vocabularies. What matters is the creation of complete fragmentRule elements, each containing an applyTo and a produce element.

Simple processing

Fragment processing applies a set of rules to a given document during parsing. The rules are specified as an XML document, using a simple vocabulary:

<?xml version="1.0" encoding="UTF-8"?>
<fragmentRules xmlns="http://simonstl.com/ns/fragments/">

<fragmentRule pattern="(\d{2,5})(\d{2})-(\d{2})">
<applyTo>
  <element nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
  <element nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
</applyTo>
<produce>
  <element nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
  <element nsURI="http://simonstl.com/ns/types/" localName="year" prefix="type" />
  <element nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
</produce>
</fragmentRule>

</fragmentRules>

This set of fragmentRules includes one rule which applies to two elements in the http://simonstl.com/ns/test/ namespace: gYearMonth and myYearMonth. The pattern attribute contains a regular expression, (\d{2,5})(\d{2})-(\d{2}), which defines how the content of those elements will be broken into parts. The produce element describes, in order, whether each part should be reported as an element, reported as characters, or simply skipped.

This means that the XML document:

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth>1970-11</gYearMonth>
<myYearMonth>1970-11</myYearMonth>
</test>

will be reported as:

<?xml version="1.0" standalone="yes"?>
<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century><type:year xmlns:type="http://simonstl.com/ns/types/">70</type:year><type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></gYearMonth>
<myYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century><type:year xmlns:type="http://simonstl.com/ns/types/">70</type:year><type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></myYearMonth>
</test>
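
The way the pattern divides 1970-11 can be checked outside the filter. The sketch below is only an illustration in Python, not the FragmentFilter code: the function name fragment and the NAMES list are hypothetical, and it assumes the pattern must match the whole element content and that each capture group is paired with the produce entry in the same position.

```python
import re

# Pattern copied from the fragmentRule above.
PATTERN = re.compile(r"(\d{2,5})(\d{2})-(\d{2})")

# Local names from the produce element, in document order.
NAMES = ["century", "year", "month"]


def fragment(text):
    """Pair each capture group with the produce entry in the same position."""
    match = PATTERN.fullmatch(text)  # assumption: the whole content must match
    if match is None:
        return None  # in this sketch, non-matching content is left alone
    return list(zip(NAMES, match.groups()))


print(fragment("1970-11"))
# [('century', '19'), ('year', '70'), ('month', '11')]
```

The first group, \d{2,5}, is greedy, but it gives digits back until the second group can still find two digits before the hyphen. That is why 1970 divides into 19 and 70.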

Recursive Processing

Rules may also be applied recursively to the results of prior rules. For some types - notably dates - this is important for proper processing of complex pieces. For example, we could add a rule which breaks the date information above even further, into individual digits:

<fragmentRule pattern="(\d{1})(\d{1})">
<applyTo>
<element nsURI="http://simonstl.com/ns/types/" localName="century"  />
<element nsURI="http://simonstl.com/ns/types/" localName="year"  />
<element nsURI="http://simonstl.com/ns/types/" localName="month"  />
</applyTo>
<produce>
<element nsURI="http://simonstl.com/ns/types/" localName="digit" prefix="type" />
<element nsURI="http://simonstl.com/ns/types/" localName="digit" prefix="type" />
</produce>
</fragmentRule>

When that rule is applied in combination with the previous rule, the results look like:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/"><type:digit>1</type:digit><type:digit>9</type:digit></type:century><type:year xmlns:type="http://simonstl.com/ns/types/"><type:digit>7</type:digit><type:digit>0</type:digit></type:year><type:month xmlns:type="http://simonstl.com/ns/types/"><type:digit>1</type:digit><type:digit>1</type:digit></type:month></gYearMonth>
<myYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/"><type:digit>1</type:digit><type:digit>9</type:digit></type:century><type:year xmlns:type="http://simonstl.com/ns/types/"><type:digit>7</type:digit><type:digit>0</type:digit></type:year><type:month xmlns:type="http://simonstl.com/ns/types/"><type:digit>1</type:digit><type:digit>1</type:digit></type:month></myYearMonth>
</test>

Developers may find these smaller parts easier to work with.
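
Recursive application can be pictured as a lookup repeated on each newly produced element. The following Python sketch is an assumption-laden illustration, not the filter's actual algorithm: the RULES table and the fragment function are hypothetical stand-ins for the two fragmentRule elements, and namespaces are ignored for brevity.

```python
import re

# Hypothetical stand-in for the two rules: element name -> (pattern, produced names).
RULES = {
    "gYearMonth": (r"(\d{2,5})(\d{2})-(\d{2})", ["century", "year", "month"]),
    "century": (r"(\d{1})(\d{1})", ["digit", "digit"]),
    "year": (r"(\d{1})(\d{1})", ["digit", "digit"]),
    "month": (r"(\d{1})(\d{1})", ["digit", "digit"]),
}


def fragment(name, text):
    """Return (name, text) or (name, [children]) after applying rules recursively."""
    rule = RULES.get(name)
    match = re.fullmatch(rule[0], text) if rule else None
    if match is None:
        return (name, text)  # no rule, or no match: keep the text as it is
    # Each produced element is itself checked against the rule table.
    children = [fragment(child, part) for child, part in zip(rule[1], match.groups())]
    return (name, children)


print(fragment("gYearMonth", "1970-11"))
```

The recursion stops at digit because no rule applies to that element.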

Skipping results

Regular expressions sometimes produce more results than a program needs. The first match, for example, typically returns the entire matched content, which doesn't fit well with the fragmentation approach. Currently, the FragmentFilter discards that result, unless you set skipFirst="false" on the fragmentRule element. Other times there may simply be content that needs deletion.

Content may be deleted by specifying a skip. For example, to skip the year in the century-year-month pattern of gYearMonth, the fragmentRule element below would work:

<fragmentRule pattern="(\d{2,5})(\d{2})-(\d{2})">
<applyTo>
<element nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
<element nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
</applyTo>
<produce>
<element nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
<skip/>
<element nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
</produce>
</fragmentRule>

Applied to the same document shown above, this rule would produce:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century><type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></gYearMonth>
<myYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century><type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></myYearMonth>
</test>

The year elements are now missing, as is their content.
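
Both kinds of skipping can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the FragmentFilter implementation: the names results, apply_rule, SKIP, and PRODUCE are hypothetical, and the skip_first parameter only mimics the described effect of the skipFirst attribute.

```python
import re

PATTERN = re.compile(r"(\d{2,5})(\d{2})-(\d{2})")


def results(text, skip_first=True):
    """List the regex results, optionally dropping the whole-match result."""
    match = PATTERN.fullmatch(text)
    if match is None:
        return []
    # Result 0 is the entire matched content; the groups follow it.
    everything = [match.group(0), *match.groups()]
    return everything[1:] if skip_first else everything


SKIP = object()  # hypothetical marker standing in for <skip/>
PRODUCE = ["century", SKIP, "month"]


def apply_rule(text):
    """Pair results with produce entries and drop the skipped ones."""
    return [(name, part) for name, part in zip(PRODUCE, results(text)) if name is not SKIP]


print(results("1970-11", skip_first=False))  # ['1970-11', '19', '70', '11']
print(apply_rule("1970-11"))                 # [('century', '19'), ('month', '11')]
```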

Reporting some content as characters

While fragmenting content into child elements is useful, there may also be times when you want to generate mixed content, with one or more of the fragments passed through as characters. To do that, you use the chars element. The chars element tells the processor to pass the content of the matched fragment through as text without an element wrapper. For example, to allow the year to pass through without the type:year element containing it, you could apply these rules:

<fragmentRule pattern="(\d{2,5})(\d{2})-(\d{2})">
<applyTo>
<element nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
<element nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
</applyTo>
<produce>
<element nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
<chars/>
<element nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
</produce>
</fragmentRule>

Applied to the same document again, the results would look like:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century>70<type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></gYearMonth>
<myYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century>70<type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></myYearMonth>
</test>

You can also use the before and after attributes on the chars element to add textual content either before or after a fragment of text. (You can also use these attributes on the element and skip elements.) For example, you might emphasize the year by adding a "year: " label and an exclamation point afterward with the rule below:

<fragmentRule pattern="(\d{2,5})(\d{2})-(\d{2})">
<applyTo>
<element nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
<element nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
</applyTo>
<produce>
<element nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
<chars before="year: " after="!"/><!--includes year as characters -->
<element nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
</produce>
</fragmentRule>

This would produce:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century>year: 70!<type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></gYearMonth>
<myYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century>year: 70!<type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></myYearMonth>
</test>
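
The mixed content shown above can be imitated with a small Python sketch. It is an illustration only: the PRODUCE list and the render function are hypothetical, namespace declarations are omitted, and before and after are modeled on chars entries alone because the placement of that text for element and skip entries is not spelled out here.

```python
import re
from xml.sax.saxutils import escape

PATTERN = re.compile(r"(\d{2,5})(\d{2})-(\d{2})")

# Hypothetical model of the produce element: (kind, name, before, after).
PRODUCE = [
    ("element", "type:century", "", ""),
    ("chars", None, "year: ", "!"),
    ("element", "type:month", "", ""),
]


def render(text):
    """Serialize the fragments of one element's content as mixed content."""
    match = PATTERN.fullmatch(text)
    if match is None:
        return escape(text)
    out = []
    for (kind, name, before, after), part in zip(PRODUCE, match.groups()):
        if kind == "element":
            out.append(f"<{name}>{escape(part)}</{name}>")
        else:
            # chars: the fragment is passed through as text, with its labels
            out.append(escape(before) + escape(part) + escape(after))
    return "".join(out)


print(render("1970-11"))
# <type:century>19</type:century>year: 70!<type:month>11</type:month>
```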

Returning fragments as attributes

You can also report fragments as attributes of the element containing them. To report fragments as attributes, just use attribute elements inside the produce element. You can mix and match attribute with skip, chars, and element as you find appropriate. For example, to report the century, year, and month as attributes rather than child elements, you might use:

<?xml version="1.0" encoding="UTF-8"?>
<fragmentRules xmlns="http://simonstl.com/ns/fragments/">
<fragmentRule pattern="(\d{2,5})(\d{2})-(\d{2})">
<applyTo>
<element nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
<element nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
</applyTo>

<produce>
<attribute nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
<attribute nsURI="http://simonstl.com/ns/types/" localName="year" prefix="type" />
<attribute nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
</produce>
</fragmentRule>
</fragmentRules>

When applied to the test document, the results would look like:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">

<message>Hello!  This document contains a gYearMonth.</message>

<gYearMonth type:year="70" type:month="11" type:century="19" xmlns:type="http://simonstl.com/ns/types/"></gYearMonth>

<myYearMonth type:year="70" type:month="11" type:century="19" xmlns:type="http://simonstl.com/ns/types/"></myYearMonth>
</test>

Repeating the same attribute name will leave only the last version in the final output.
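
The last-one-wins behavior is what a name-to-value mapping gives naturally. The Python sketch below illustrates that idea and is not the filter's code: the attributes function is hypothetical and type:part is an invented attribute name used only to show a collision.

```python
import re

PATTERN = re.compile(r"(\d{2,5})(\d{2})-(\d{2})")


def attributes(text, names):
    """Map attribute names to fragments; a repeated name keeps the last value."""
    match = PATTERN.fullmatch(text)
    if match is None:
        return {}
    attrs = {}
    for name, part in zip(names, match.groups()):
        attrs[name] = part  # overwrites any earlier fragment with the same name
    return attrs


print(attributes("1970-11", ["type:century", "type:year", "type:month"]))
# {'type:century': '19', 'type:year': '70', 'type:month': '11'}
print(attributes("1970-11", ["type:part", "type:part", "type:month"]))
# {'type:part': '70', 'type:month': '11'}
```

Attribute order carries no meaning in XML, which is why the sample output above can list year before century.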

Splitting content with delimiters

Some content doesn't simply match a pattern; instead, it needs to be split. The Regular Fragmentation package allows developers to specify split functionality (akin to the Perl split() function) by setting the split attribute of the fragmentRule element to "true":

<?xml version="1.0" encoding="UTF-8"?>
<fragmentRules xmlns="http://simonstl.com/ns/fragments/">

<fragmentRule pattern="(-)" split="true" skipFirst="false">
<applyTo>
<element nsURI="http://simonstl.com/ns/test/" localName="threepart"/>
</applyTo>
<produce>
<element nsURI="http://simonstl.com/ns/types/" localName="one" prefix="type" />
<element nsURI="http://simonstl.com/ns/types/" localName="two" prefix="type" />
<element nsURI="http://simonstl.com/ns/types/" localName="three" prefix="type" />
</produce>
</fragmentRule>

</fragmentRules>

Also note the use of skipFirst="false", which keeps the first result of the split instead of discarding it. These rules apply to a document that uses delimiters, like:

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a splittable piece.</message>
<threepart>2100-b4a50-999F</threepart>
</test>

When the rules are applied to the document, the results look like:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">

<message>Hello!  This document contains a splittable piece.</message>

<threepart>
<type:one xmlns:type="http://simonstl.com/ns/types/">2100</type:one>
<type:two xmlns:type="http://simonstl.com/ns/types/">b4a50</type:two>
<type:three xmlns:type="http://simonstl.com/ns/types/">999F</type:three>
</threepart>
</test>
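
For comparison, the same split can be sketched with Python's re.split. This is only an analogy to the behavior described here, not the package's implementation: split_fragment and NAMES are hypothetical, and the sketch drops the capturing parentheses from the pattern because Python's re.split would otherwise return the delimiters as results too.

```python
import re

# Local names from the produce element, in document order.
NAMES = ["one", "two", "three"]


def split_fragment(text, delimiter="-"):
    """Split the content on the delimiter and pair the pieces with the names."""
    parts = re.split(re.escape(delimiter), text)
    return list(zip(NAMES, parts))


print(split_fragment("2100-b4a50-999F"))
# [('one', '2100'), ('two', 'b4a50'), ('three', '999F')]

# With a capturing group, Python keeps the delimiters in the result list.
print(re.split(r"(-)", "2100-b4a50-999F"))
# ['2100', '-', 'b4a50', '-', '999F']
```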

Complex splits may benefit from the repetition feature described below.

Repetition

There may be times when you want to cycle through a set of elements. In this case, you can set repeat="yes" on the fragmentRule element. The processor will cycle through the list of rules you provide.
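
One possible reading of this cycling, sketched in Python: when a split yields more pieces than there are entries to produce, the entries are reused from the start. Everything in the sketch is an assumption for illustration, including the split_repeating function, the comma delimiter, and the x and y names; the processor's actual behavior with repeat="yes" may differ.

```python
import re
from itertools import cycle


def split_repeating(text, names, delimiter=","):
    """Split the content and reuse the list of names for as long as pieces remain."""
    parts = re.split(re.escape(delimiter), text)
    # zip stops when the pieces run out; cycle restarts the names as needed.
    return list(zip(cycle(names), parts))


print(split_repeating("1,2,3,4,5", ["x", "y"]))
# [('x', '1'), ('y', '2'), ('x', '3'), ('y', '4'), ('x', '5')]
```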

Future development

Future development will focus on:

Extending this functionality to attributes
More command-line options for the FragmentFilter class
Test code for all of the modules used
The ability to resequence fragments
lex-style processing of expression series
A friendlier graphic interface for exploration and testing
A small library of prefab regular expressions

It's very simple stuff, but these kinds of transformations can simplify processing XML content with compound information chunks significantly.
