Gorille - a tool for specifying XML Unicode usage.


com.simonstl.common:  Common pieces for the MOE, TAM, and Gorille packages.
com.simonstl.gorille: Code for working with XML documents as both markup and text.



Introduction | Warnings | Future Directions | License | Acknowledgments | Download

Gorille is a small Java package designed to let developers of various kinds of XML processors test the content and names of XML structures in their XML documents. While Gorille ships with test files for both XML 1.0 and the draft XML 1.1, you can create your own configuration files as well.

Introduction - Why Gorille?

Gorille attempts to provide a standard means of addressing complex issues between XML and Unicode. The initial release of XML 1.1 has provoked discussion, much like the earlier Blueberry discussions, and it doesn't appear that issues emerging from XML's lack of direct synchronization with Unicode are going to disappear any time soon. Gorille sidesteps the notion of fixed Unicode character assignments dictated by W3C specifications, and opens the field to character listings of whatever form seems necessary. XML 1.0 and XML 1.1 conventions are supported (in the xml10chars.xml and xml11chars.xml configuration files), but developers can go their own route as well (as demonstrated by the asciichars.xml file).

The main class is CharRules, where most of the processing logic takes place. The supporting classes provide containers for information (CharRange, CharRanges), file loading support (CharRulesLoader), or testing capabilities (the command-line CharTester and the simple test suite TestCharRules). Version 0.4 added the CharRulesGen class, which generates fixed Java classes from XML configuration files, along with Xml10Rules, Xml11Rules, and AsciiRules, the code it generated.
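To make the idea of range-based character rules concrete, here is a minimal, self-contained sketch. It does not use Gorille's actual API: the RangeCheckSketch class, its nested Range class, and the allAllowed method are hypothetical stand-ins, and the ASCII range chosen is only an illustration. The XML 1.0 ranges are taken from the Char production of the XML 1.0 Recommendation. The check walks code points rather than Java chars, so surrogate pairs are treated as single characters.

```java
import java.util.Arrays;
import java.util.List;

public class RangeCheckSketch {

    /** An inclusive range of Unicode code points (hypothetical, not Gorille's CharRange). */
    public static final class Range {
        final int low;
        final int high;

        public Range(int low, int high) {
            this.low = low;
            this.high = high;
        }

        boolean contains(int codePoint) {
            return codePoint >= low && codePoint <= high;
        }
    }

    /** XML 1.0 Char production: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF. */
    public static final List<Range> XML10_CHARS = Arrays.asList(
            new Range(0x9, 0x9), new Range(0xA, 0xA), new Range(0xD, 0xD),
            new Range(0x20, 0xD7FF), new Range(0xE000, 0xFFFD),
            new Range(0x10000, 0x10FFFF));

    /** Printable ASCII only; an illustrative "go your own route" rule set. */
    public static final List<Range> ASCII_CHARS = Arrays.asList(new Range(0x20, 0x7E));

    /** Returns true if every code point in text falls inside at least one range. */
    public static boolean allAllowed(String text, List<Range> ranges) {
        return text.codePoints()
                .allMatch(cp -> ranges.stream().anyMatch(r -> r.contains(cp)));
    }

    public static void main(String[] args) {
        System.out.println(allAllowed("caf\u00e9", XML10_CHARS));      // true
        System.out.println(allAllowed("caf\u00e9", ASCII_CHARS));      // false
        // A surrogate pair (U+1F600) is checked as one code point.
        System.out.println(allAllowed("\uD83D\uDE00", XML10_CHARS));   // true
        // The control character U+0001 is not a legal XML 1.0 character.
        System.out.println(allAllowed("\u0001", XML10_CHARS));         // false
    }
}
```

Swapping in a different list of ranges changes what counts as legal without touching the checking logic, which is the separation that configuration files make possible.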

At the moment, I'm putting most of my attention into the Ripper class, which reports XML documents as a context and character events. While Ripper is not an XML parser, it is designed as a foundation for processing XML at the character level and possibly for the creation of more proper XML parsers.

Gorille is named after an unruly brute who stars in Georges Brassens' Le Gorille. Gare au gorille!


Warnings

Gorille is thoroughly experimental and perhaps not even a good idea. Gorille will definitely see future use in Markup Object Events (MOE) for name and content-checking, but its infinite configurability certainly opens enormous possibilities for very, very bad practice. Using Gorille you can, for instance, require that all content be represented as control characters and all names as ideographs.

Future directions

This is just getting started. Gorille will eventually sprout a SAX filter interface for checking content and names, as well as connections to MOE. A tool which performs Gorille checking on Java Readers seems like a good future idea as well.
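As a rough illustration of what a SAX filter for content and name checking might look like, here is a sketch built on the standard org.xml.sax.helpers.XMLFilterImpl class. It is not Gorille code: the CharCheckFilterSketch class, its passes helper, and the use of an IntPredicate as the rule are all assumptions made for the example. Note that a parser may deliver text in several characters() calls, so a surrogate pair could in principle be split across two calls; this sketch does not handle that case.

```java
import java.io.StringReader;
import java.util.function.IntPredicate;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter that rejects documents containing disallowed characters. */
public class CharCheckFilterSketch extends XMLFilterImpl {

    private final IntPredicate allowed;

    public CharCheckFilterSketch(XMLReader parent, IntPredicate allowed) {
        super(parent);
        this.allowed = allowed;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        check(new String(ch, start, length), "content");
        super.characters(ch, start, length);   // pass the event downstream
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        check(qName, "element name");
        super.startElement(uri, localName, qName, atts);
    }

    /** Throws if any code point in text fails the rule. */
    private void check(String text, String what) throws SAXException {
        int bad = text.codePoints().filter(cp -> !allowed.test(cp)).findFirst().orElse(-1);
        if (bad >= 0) {
            throw new SAXException(
                    String.format("Disallowed character U+%04X in %s", bad, what));
        }
    }

    /** Parses xml through the filter; false on a rule violation or malformed input. */
    public static boolean passes(String xml, IntPredicate allowed) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            CharCheckFilterSketch filter = new CharCheckFilterSketch(reader, allowed);
            filter.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O trouble, unrelated to the character rule.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        IntPredicate asciiOnly = cp -> cp < 0x80;
        System.out.println(passes("<a>hello</a>", asciiOnly));      // true
        System.out.println(passes("<a>caf\u00e9</a>", asciiOnly));  // false
    }
}
```

Because the filter forwards every event it accepts, it could sit in front of any existing ContentHandler without that handler knowing the check took place.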

There is an enormous amount of cleanup needed in the Ripper class, as well as in the Context classes. As proof-of-concept, I think they're okay, but there's a long way to go. There's also a lot of work to be done in integrating the rest of Gorille and Ents with the Ripper context, which would be a big step toward a 'real' XML parser. DOCTYPE processing is also on the list, though I plan to keep that out of the main Ripper class and treat it as a separate layer. See Toward a Layered Model for XML for some reasoning.

A formal test structure, using JUnit, is also high on the list of priorities.


License

The contents of this package are subject to the Mozilla Public License Version 1.1 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.mozilla.org/MPL/.

Software distributed under the License is distributed on an "AS IS" basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License for the specific language governing rights and limitations under the License.

The Original Code is available at http://simonstl.com/projects/fragment/original.

The Initial Developer of the Original Code is Simon St.Laurent. Portions created by Simon St.Laurent are Copyright (C) 2001 Simon St.Laurent. All Rights Reserved.



Acknowledgments

Thanks to Elliotte Rusty Harold for pointing out surrogate pair issues and John Cowan for pointing the way to a solution.

Thanks to the xml-dev mailing list for its continuing discussions of all these issues.

Thanks to BBC correspondent John Simpson's book, A Mad World, My Masters, for leading me to the Le Gorille song.


Download

A download is available.
