Office 2000 cleanup using a Java FilterReader


I finally bought a copy of Word 2000, and was happy to see that the HTML output looked like it might be cleaner than what Word 97 had put out. Though it used some XML, its creators seemed to enjoy the laxness of HTML syntax, and put strange conditionals into the structure. As a result, you can't normally run Office 2000 output through an XML parser and do anything with it.

The O2KCleaner is a Java class that extends FilterReader, allowing it to read the flow of characters coming in and correct syntactical (not structural) problems as they appear before sending the information on to an XML parser. It preserves all namespace information, though I had to encode the if and end if conditionals in my own namespace of markup.
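
As a rough illustration of the FilterReader approach (this is not the O2KCleaner source; the class name ControlCharStripper and its clean() helper are made up for the example), the sketch below wraps another Reader and repairs one purely syntactical problem as characters flow through: it drops control characters that XML 1.0 does not allow. The real cleaner has to handle much more than this, but the shape of the class is similar.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical example class; not part of the O2KCleaner distribution.
class ControlCharStripper extends FilterReader {

    ControlCharStripper(Reader in) {
        super(in);
    }

    // XML 1.0 permits tab, line feed and carriage return, but no other
    // characters below U+0020.
    private static boolean isAllowed(int c) {
        return c >= 0x20 || c == 0x09 || c == 0x0A || c == 0x0D;
    }

    // Returns the next allowed character, silently skipping any others.
    @Override
    public int read() throws IOException {
        int c;
        do {
            c = in.read();
        } while (c != -1 && !isAllowed(c));
        return c;
    }

    // Fills the buffer from the single-character read() so that both
    // entry points apply the same filtering.
    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int count = 0;
        while (count < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + count++] = (char) c;
        }
        return count == 0 ? -1 : count;
    }

    // Convenience helper for trying the filter on an in-memory string.
    static String clean(String text) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new ControlCharStripper(new StringReader(text))) {
            int c;
            while ((c = r.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Because the class is just another Reader, it can be handed to anything that accepts a character stream, including an XML parser's input source.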

The O2KCleaner returns well-formed XML, not valid XML or XHTML. It currently requires a connection to the Internet, as Office 2000 likes to use HTML entities, though you could configure a parser to use the public identifiers provided for entity processing to avoid repeated trips to the W3C's entity files.
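
One way to keep the parser off the network is a SAX EntityResolver that maps public identifiers to local copies of the entity files. The sketch below is only an assumption about how you might set that up: the class name LocalEntityResolver and its register() method are invented for the example, and you would need to check which public identifiers actually appear in the cleaner's output (the XHTML 1.0 entity sets use identifiers such as "-//W3C//ENTITIES Latin 1 for XHTML//EN").

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical example class; not part of the O2KCleaner distribution.
class LocalEntityResolver implements EntityResolver {

    // Maps a public identifier to a local copy of the entity file.
    private final Map<String, File> localCopies = new HashMap<>();

    void register(String publicId, File localFile) {
        localCopies.put(publicId, localFile);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        File local = (publicId == null) ? null : localCopies.get(publicId);
        if (local == null) {
            // Returning null tells the parser to fall back to its default
            // behaviour, which is to fetch the system identifier.
            return null;
        }
        InputSource source = new InputSource(local.toURI().toString());
        source.setPublicId(publicId);
        return source;
    }
}
```

To use it, register each entity file you have saved locally and pass the resolver to XMLReader.setEntityResolver() before parsing. Identifiers you have not registered still resolve over the network as before.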

My testing so far has been more limited than I'd like, as I only own Word 2000. My reading of the Office 2000 format documentation suggests that this should also work on other Office formats. I've tested it on large and small documents, including some large (200K) files using revisions.

Right now, O2KCleaner is stamping all output as using the ISO-8859-1 (Latin-1) encoding. If you are working with other encodings, you need to use the setEncoding() method to change that. I'm working on a more generalized XML encoding handler, but it isn't ready yet.
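
The encoding label only helps if it agrees with how the bytes were actually decoded. The sketch below is independent of O2KCleaner (the class EncodingDemo and its methods are made up for the example) and shows the same bytes decoded under two different encodings, which is why a mismatched label can produce garbled characters downstream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical example class; not part of the O2KCleaner distribution.
class EncodingDemo {

    // Wraps raw bytes in a Reader that decodes them with the named charset.
    // The same name is what you would want the XML output to be stamped with.
    static Reader open(InputStream bytes, String encodingName) {
        return new InputStreamReader(bytes, Charset.forName(encodingName));
    }

    // Decodes a byte array fully, for comparing encodings side by side.
    static String decode(byte[] bytes, String encodingName) {
        StringBuilder out = new StringBuilder();
        try (Reader r = open(new ByteArrayInputStream(bytes), encodingName)) {
            int c;
            while ((c = r.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

For instance, the two bytes 0xC3 0xA9 decode to a single e-acute character as UTF-8, but to two separate characters as ISO-8859-1.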

As this is a very early release, the code isn't optimized and is only partially commented. Improvements will come, promise.

The O2KCleaner is distributed under the Mozilla Public License (MPL), version 1.1. Feedback and contributions are welcome. Contributions will be acknowledged. Elliotte Rusty Harold substantially improved the readability of version 0.03, and caught an encoding mismatch.

Thanks to Elliotte Rusty Harold for his fine book Java I/O, which made this adventure possible. (David Flanagan's Java in a Nutshell was helpful for test code as well.)

Comments are welcome.

Download here:

Last updated 08/02/00