Regular Fragmentations Rule Language (RFRL)

This Version: February 28, 2002

Latest Version: http://simonstl.com/ns/fragment/

Previous Versions:
http://simonstl.com/ns/fragment/rfrl-072201.htm
http://simonstl.com/ns/fragment/rfrl-070501.htm

Editors:

Table of contents

  1. Introduction
  2. Details
  3. Related Resources for RFRL
  4. Normative References
  5. Informative References

1. Introduction

Regular Fragmentations Rule Language (RFRL) defines a list of rules that are applied to element or attribute content to divide that content into smaller and more readily processed pieces.

XML allows document creators to apply markup to information, creating descriptive document structures. Creators of XML documents and document descriptions (DTDs, W3C Schemas, RELAX NG, etc.) can choose different levels of markup. Sometimes the level varies by document instance, and sometimes document descriptions specify particular combinations of information which may appear in a single markup component.

While "compound" types are frequently discouraged as bad practice, compound types are enshrined in several key W3C specifications, notably XML Schema Datatypes, and The Style Attribute in HTML. Developers working with documents created using these specifications and others like them generally have to perform post-processing on these compound types. Regular Fragmentations are designed to allow these types to be fragmented into smaller labelled pieces during the course of parsing. Developers can then perform processing using XML-oriented tools without having to focus on post-processing the content.

An overview of the discussions which led to this rules vocabulary is available on XML.com, in Leigh Dodds' Parsing the Atom.

2. Details

In general, RFRL isn't very picky about where it's stored. You can have "RFRL documents", where the fragmentRules element is the root element, or you can mix RFRL with other kinds of XML documents. It might, for example, make sense to mix RFRL with namespace processing vocabularies. What matters is the creation of complete fragmentRule elements, each containing an applyTo and a produce element.

Simple processing

Fragment processing applies a set of rules to a given document during parsing. The rules are specified as an XML document, using a simple vocabulary:

<?xml version="1.0" encoding="UTF-8"?>
<fragmentRules xmlns="http://simonstl.com/ns/fragments/">

<fragmentRule pattern="(\d{2,5})(\d{2})-(\d{2})">
<applyTo>
  <element nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
  <element nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
</applyTo>
<produce>
  <element nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
  <element nsURI="http://simonstl.com/ns/types/" localName="year" prefix="type" />
  <element nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
</produce>
</fragmentRule>

</fragmentRules>

This set of fragmentRules includes one rule which applies to two elements in the http://simonstl.com/ns/test/ namespace - gYearMonth and myYearMonth. The pattern attribute contains a regular expression ((\d{2,5})(\d{2})-(\d{2})) which defines how the content of those elements will be broken down. The produce element contains information describing how those parts should be represented as elements, characters, or simply skipped.

This means that the XML document:

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth>1970-11</gYearMonth>
<myYearMonth>1970-11</myYearMonth>
</test>

will be reported as:

<?xml version="1.0" standalone="yes"?>
<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century><type:year xmlns:type="http://simonstl.com/ns/types/">70</type:year><type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></gYearMonth>
<myYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century><type:year xmlns:type="http://simonstl.com/ns/types/">70</type:year><type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></myYearMonth>
</test>
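
The matching step in this example can be sketched in Python. This is only an illustration of how the capture groups line up with the entries in produce, not the FragmentFilter implementation. The names PRODUCE, fragment, and to_xml are hypothetical, the sketch assumes the pattern must match the whole element content, and namespace declarations are omitted from the output.

```python
import re

# Pattern copied from the fragmentRule above.
PATTERN = re.compile(r"(\d{2,5})(\d{2})-(\d{2})")
# Hypothetical stand-in for the <produce> list: one local name per capture group.
PRODUCE = ["century", "year", "month"]


def fragment(text):
    """Return (name, value) pairs, one per capture group, or None if no match."""
    match = PATTERN.fullmatch(text)  # assumption: the whole content must match
    if match is None:
        return None
    # match.groups() leaves out group 0 (the entire match).
    return list(zip(PRODUCE, match.groups()))


def to_xml(pairs, prefix="type"):
    """Serialize the pairs as prefixed child elements (xmlns declarations omitted)."""
    return "".join(f"<{prefix}:{name}>{value}</{prefix}:{name}>" for name, value in pairs)


pairs = fragment("1970-11")
print(pairs)
print(to_xml(pairs))
```

Because \d{2,5} is greedy but must leave two digits for the second group, "1970" is divided into "19" and "70", which is why the century and year come out as shown in the output above.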

Recursive Processing

Rules may also be applied recursively to the results of prior rules. For some types - notably dates - this is important for proper processing of complex pieces. For example, we could add a rule which breaks the date information above even further, into individual digits:

<fragmentRule pattern="(\d{1})(\d{1})">
<applyTo>
<element nsURI="http://simonstl.com/ns/types/" localName="century"  />
<element nsURI="http://simonstl.com/ns/types/" localName="year"  />
<element nsURI="http://simonstl.com/ns/types/" localName="month"  />
</applyTo>
<produce>
<element nsURI="http://simonstl.com/ns/types/" localName="digit" prefix="type" />
<element nsURI="http://simonstl.com/ns/types/" localName="digit" prefix="type" />
</produce>
</fragmentRule>

When that rule is applied in combination with the previous rule, the results look like:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/"><type:digit>1</type:digit><type:digit>9</type:digit></type:century><type:year xmlns:type="http://simonstl.com/ns/types/"><type:digit>7</type:digit><type:digit>0</type:digit></type:year><type:month xmlns:type="http://simonstl.com/ns/types/"><type:digit>1</type:digit><type:digit>1</type:digit></type:month></gYearMonth>
<myYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/"><type:digit>1</type:digit><type:digit>9</type:digit></type:century><type:year xmlns:type="http://simonstl.com/ns/types/"><type:digit>7</type:digit><type:digit>0</type:digit></type:year><type:month xmlns:type="http://simonstl.com/ns/types/"><type:digit>1</type:digit><type:digit>1</type:digit></type:month></myYearMonth>
</test>

Developers may find these smaller parts easier to work with.
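
Recursive application can be sketched as a function that looks up a rule for each newly produced element and fragments again until no rule applies. The rule table RULES, the function name fragment, and the nested-tuple result are all hypothetical simplifications. Namespaces are ignored and only local names are used as keys.

```python
import re

DIGITS = re.compile(r"(\d{1})(\d{1})")

# Hypothetical rule table: element local name -> (pattern, names of produced children).
RULES = {
    "gYearMonth": (re.compile(r"(\d{2,5})(\d{2})-(\d{2})"), ["century", "year", "month"]),
    "century": (DIGITS, ["digit", "digit"]),
    "year": (DIGITS, ["digit", "digit"]),
    "month": (DIGITS, ["digit", "digit"]),
}


def fragment(name, text):
    """Return (name, text) for a leaf, or (name, [children]) when a rule matches."""
    rule = RULES.get(name)
    match = rule[0].fullmatch(text) if rule else None
    if match is None:
        return (name, text)  # no rule, or no match: content is left alone
    # Each produced child is itself checked against the rule table.
    children = [fragment(child, value) for child, value in zip(rule[1], match.groups())]
    return (name, children)


print(fragment("gYearMonth", "1970-11"))
```

The recursion stops at digit because no rule is registered for it. A rule whose produced elements matched the rule itself could recurse indefinitely, so a real rule set would need to avoid that.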

Skipping results

Regular expressions sometimes produce more results than a program needs. The first result of a match, for example, is typically the entire matched content, which doesn't fit well with the fragmentation approach. Currently, the FragmentFilter discards that result unless you set skipFirst="false" on the fragmentRule element. Other times there may simply be content that needs deletion.

Content may be deleted by specifying a skip. For example, to skip the year in the century-year-month pattern of gYearMonth, the fragmentRule element below would work:

<fragmentRule pattern="(\d{2,5})(\d{2})-(\d{2})">
<applyTo>
<element nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
<element nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
</applyTo>
<produce>
<element nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
<skip/>
<element nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
</produce>
</fragmentRule>

Applied to the same document shown above, this rule would produce:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century><type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></gYearMonth>
<myYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century><type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></myYearMonth>
</test>

The year elements are now missing, as is their content.
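
A minimal sketch of the skip behaviour, assuming each entry in produce consumes exactly one capture group in order. The encoding of skip as None and the names PRODUCE and fragment are hypothetical.

```python
import re

PATTERN = re.compile(r"(\d{2,5})(\d{2})-(\d{2})")
# Hypothetical encoding of <produce>: a local name for <element>, None for <skip/>.
PRODUCE = ["century", None, "month"]


def fragment(text):
    """Pair each capture group with its produce entry and drop the skipped ones."""
    match = PATTERN.fullmatch(text)
    if match is None:
        return None
    return [(name, value)
            for name, value in zip(PRODUCE, match.groups())
            if name is not None]  # a skip consumes its group but emits nothing


print(fragment("1970-11"))
```

The skip still occupies a position in the list, so the month remains paired with the third group rather than sliding into the second.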

Reporting some content as characters

While fragmenting content into child elements is useful, there may also be times when you want to generate mixed content, with one or more of the fragments passed through as characters. To do that, you use the chars element. The chars element tells the processor to pass the content of the matched fragment through as text without an element wrapper. For example, to allow the year to pass through without the type:year element containing it, you could apply these rules:

<fragmentRule pattern="(\d{2,5})(\d{2})-(\d{2})">
<applyTo>
<element nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
<element nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
</applyTo>
<produce>
<element nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
<chars/>
<element nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
</produce>
</fragmentRule>

Applied to the same document again, the results would look like:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century>70<type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></gYearMonth>
<myYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century>70<type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></myYearMonth>
</test>

You can also use the before and after attributes on the chars element to add textual content either before or after a fragment of text. (You can also use these attributes on the element and skip elements.) For example, you might emphasize the year by adding a "year: " label and an exclamation point afterward with the rule below:

<fragmentRule pattern="(\d{2,5})(\d{2})-(\d{2})">
<applyTo>
<element nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
<element nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
</applyTo>
<produce>
<element nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
<chars before="year: " after="!"/><!--includes year as characters -->
<element nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
</produce>
</fragmentRule>

This would produce:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a gYearMonth.</message>
<gYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century>year: 70!<type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></gYearMonth>
<myYearMonth><type:century xmlns:type="http://simonstl.com/ns/types/">19</type:century>year: 70!<type:month xmlns:type="http://simonstl.com/ns/types/">11</type:month></myYearMonth>
</test>
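
The mixed-content output above can be approximated with the following sketch. The tuple encoding of the produce entries and the function name render are hypothetical, and namespace declarations are omitted. The sketch applies before and after only to chars, because the text above does not say where those strings land relative to an element's tags.

```python
import re
from xml.sax.saxutils import escape

PATTERN = re.compile(r"(\d{2,5})(\d{2})-(\d{2})")
# Hypothetical encoding of <produce>: (kind, local name, before, after).
PRODUCE = [
    ("element", "century", "", ""),
    ("chars", None, "year: ", "!"),
    ("element", "month", "", ""),
]


def render(text):
    """Return the new content of the matched element as a string of markup."""
    match = PATTERN.fullmatch(text)
    if match is None:
        return escape(text)  # no match: content passes through unchanged
    out = []
    for (kind, name, before, after), value in zip(PRODUCE, match.groups()):
        if kind == "element":
            out.append(f"<type:{name}>{escape(value)}</type:{name}>")
        elif kind == "chars":
            # Plain text with no element wrapper, plus the added strings.
            out.append(escape(before + value + after))
    return "".join(out)


print(render("1970-11"))
```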

Returning fragments as attributes

You can also report fragments as attributes of the element containing them. To report fragments as attributes, just use attribute elements inside the produce element. You can mix and match attribute with skip, chars, and element as you find appropriate. For example, to report the date information as attributes rather than child elements, you might use:

<?xml version="1.0" encoding="UTF-8"?>
<fragmentRules xmlns="http://simonstl.com/ns/fragments/">
<fragmentRule pattern="(\d{2,5})(\d{2})-(\d{2})">
<applyTo>
<element nsURI="http://simonstl.com/ns/test/" localName="gYearMonth"/>
<element nsURI="http://simonstl.com/ns/test/" localName="myYearMonth"/>
</applyTo>

<produce>
<attribute nsURI="http://simonstl.com/ns/types/" localName="century" prefix="type" />
<attribute nsURI="http://simonstl.com/ns/types/" localName="year" prefix="type" />
<attribute nsURI="http://simonstl.com/ns/types/" localName="month" prefix="type" />
</produce>
</fragmentRule>
</fragmentRules>

When applied to the test document, the results would look like:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">

<message>Hello!  This document contains a gYearMonth.</message>

<gYearMonth type:year="70" type:month="11" type:century="19" xmlns:type="http://simonstl.com/ns/types/"></gYearMonth>

<myYearMonth type:year="70" type:month="11" type:century="19" xmlns:type="http://simonstl.com/ns/types/"></myYearMonth>
</test>

Repeating the same attribute name will leave only the last version in the final output.
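
The last-one-wins behaviour for repeated attribute names falls out naturally if the attributes are collected in a map, as in this sketch. The function as_attributes and its produce parameter are hypothetical, and the type prefix is hard-coded for brevity.

```python
import re

PATTERN = re.compile(r"(\d{2,5})(\d{2})-(\d{2})")
PRODUCE = ["century", "year", "month"]


def as_attributes(text, produce=PRODUCE):
    """Map each capture group to an attribute on the containing element."""
    match = PATTERN.fullmatch(text)
    if match is None:
        return {}
    attrs = {}
    for name, value in zip(produce, match.groups()):
        attrs[f"type:{name}"] = value  # a repeated name overwrites the earlier value
    return attrs


print(as_attributes("1970-11"))
print(as_attributes("1970-11", ["part", "part", "part"]))
```

In the second call all three groups target the same name, so only the final value, the month, survives.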

Splitting content with delimiters

Some content doesn't simply match a pattern - instead, it needs to be split. The Regular Fragmentation package allows developers to specify split functionality (akin to the Perl split() function) by setting the split attribute of the fragmentRule element to "true":

<?xml version="1.0" encoding="UTF-8"?>
<fragmentRules xmlns="http://simonstl.com/ns/fragments/">

<fragmentRule pattern="(-)" split="true" skipFirst="false">
<applyTo>
<element nsURI="http://simonstl.com/ns/test/" localName="threepart"/>
</applyTo>
<produce>
<element nsURI="http://simonstl.com/ns/types/" localName="one" prefix="type" />
<element nsURI="http://simonstl.com/ns/types/" localName="two" prefix="type" />
<element nsURI="http://simonstl.com/ns/types/" localName="three" prefix="type" />
</produce>
</fragmentRule>

</fragmentRules>

Also note the use of skipFirst="false" to avoid discarding the first result of the split. These rules apply to a document that uses delimiters, like:

<test xmlns="http://simonstl.com/ns/test/">
<message>Hello!  This document contains a splittable piece.</message>
<threepart>2100-b4a50-999F</threepart>
</test>

When the rules are applied to the document, the results look like:

<?xml version="1.0" standalone="yes"?>

<test xmlns="http://simonstl.com/ns/test/">

<message>Hello!  This document contains a splittable piece.</message>

<threepart>
<type:one xmlns:type="http://simonstl.com/ns/types/">2100</type:one>
<type:two xmlns:type="http://simonstl.com/ns/types/">b4a50</type:two>
<type:three xmlns:type="http://simonstl.com/ns/types/">999F</type:three>
</threepart>
</test>

Complex splits may want to use the repetition feature below.
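
A rough Python equivalent of the split example is shown here. It is a sketch only: the names DELIMITER, PRODUCE, and split_fragment are hypothetical. The rule above writes the pattern as "(-)", but Python's re.split also returns any text captured by a group, so the sketch uses an uncaptured "-" to get just the three pieces. Dropping pieces beyond the end of the produce list is an assumption, not documented behaviour.

```python
import re

# Uncaptured delimiter; with r"(-)" Python would return the "-" pieces as well.
DELIMITER = re.compile(r"-")
PRODUCE = ["one", "two", "three"]


def split_fragment(text):
    """Split on the delimiter and pair each piece with a produce entry, in order."""
    parts = DELIMITER.split(text)
    # zip stops at the shorter sequence, so surplus pieces are dropped (assumption).
    return list(zip(PRODUCE, parts))


print(split_fragment("2100-b4a50-999F"))
```

Every piece is kept from the first one onward, which corresponds to setting skipFirst="false" in the rule.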

Repetition

There may be times when you want to cycle through a set of elements. In this case, you can set repeat="yes" on the fragmentRule element. The processor will cycle through the list of rules you provide.
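
One way to picture repetition is a split whose produce list is reused from the start whenever it runs out. This sketch assumes that reading of repeat="yes". The comma delimiter, the key and value names, and the function split_repeating are hypothetical and do not come from the RFRL package.

```python
import re
from itertools import cycle

DELIMITER = re.compile(r",")
# Hypothetical produce list that is reused for as long as pieces remain.
PRODUCE = ["key", "value"]


def split_repeating(text):
    """Split on the delimiter and label the pieces by cycling through PRODUCE."""
    parts = DELIMITER.split(text)
    # cycle() restarts the produce list each time it is exhausted.
    return list(zip(cycle(PRODUCE), parts))


print(split_repeating("a,1,b,2"))
```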

Future development

Future development will focus on:

It's very simple stuff, but these kinds of transformations can simplify processing XML content with compound information chunks significantly.

3. Related Resources for RFRL

3.1 Document Type Definition

A DTD fragrules.dtd for RFRL.

3.2 SAX Filter

A SAX Filter implementing RFRL. JavaDoc describing its class structure and usage is also available.

This code is provided as an example and is not normative.

3.3 JAR

The above code packaged as a Java archive.

4. Normative References

  1. Extensible Markup Language (XML) 1.0
  2. W3C XML Names
  3. W3C XML Schema Part 2, Datatypes, Appendix F

5. Informative References

  1. W3C XML Schema Part 1 - Structures
  2. W3C XML Schema Part 2, Datatypes
  3. The Style Attribute in HTML
  4. RELAX NG
  5. Friedl, Jeffrey. Mastering Regular Expressions (O'Reilly & Associates, 1998.)
  6. Dodds, Leigh. Parsing The Atom (XML.com, 2001.)