xml-xhtml-tips Archives - Tools

20 September 2000 - A Closer Look at Tidy

HTML Tidy (often called Tidy) is an HTML and XHTML cleanup tool created by one of HTML's leading lights, Dave Raggett of the W3C. Originally written to help developers create valid HTML, it now also helps developers create valid XHTML.

Tidy is a command-line utility that provides a number of options for cleaning up HTML and XHTML, as well as a few extra features like slide creation based on heading levels. It cleans up start and end tags, quotes attributes, adds end tags where appropriate, and sorts out some common badly structured HTML, like lists without containers and horizontal rules stuck inside of headlines. (I've made that last mistake for years.)

Tidy can also clean up formatting-oriented code, replacing it with Cascading Style Sheets when appropriate. It can work around some flavors of non-HTML markup, including Active Server Pages and PHP. Most important for XHTML, it offers an '-asxml' option that makes Tidy generate XHTML.

Tidy also reads a configuration file, which makes it easy to set options once instead of repeating them on the command line. Using these options, you can make XHTML that's ready for non-validating XML parsers, complete with numeric character references replacing the named entities that XML parsers may or may not process.
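As a rough illustration of the configuration-file approach, the Python sketch below writes a small Tidy configuration and builds (but does not run) a matching command line. The `output-xhtml` and `numeric-entities` options and the `-config` flag are Tidy's own; the helper functions and file names are hypothetical, made up for this example.

```python
from pathlib import Path

# Tidy options matching the goals described above: XHTML output, with
# numeric character references in place of named entities.
TIDY_OPTIONS = {
    "output-xhtml": "yes",
    "numeric-entities": "yes",
}


def write_tidy_config(path, options=TIDY_OPTIONS):
    """Write options in Tidy's 'name: value' configuration format."""
    lines = [f"{name}: {value}" for name, value in options.items()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="ascii")
    return path


def tidy_command(config_path, html_path):
    """Build the argument list for running Tidy with a config file.

    The list is only built here, not executed; hand it to
    subprocess.run() on a machine where Tidy is installed.
    """
    return ["tidy", "-config", str(config_path), str(html_path)]
```

Keeping the options in one file means every page in a project gets cleaned the same way, whoever runs the tool.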

Some HTML is just too broken for Tidy to handle. While Tidy will do its best, and report issues it's not certain it handled properly, some issues will raise errors rather than warnings, requiring manual intervention.

Tidy is written in C and available under an open source license, and compiled versions are available for a large number of platforms. A complete list of platform-specific binaries, including cases where Tidy has been built into HTML development environments, is available.

A Java port is also available.

Dave Raggett does ask that developers who want to say thanks for Tidy send him a postcard from their home area - the mailing address is on the main Tidy page.

19 September 2000 - Using HTML editors for XHTML

While many developers are used to HTML editing systems having their own style of output, XHTML requires a somewhat higher degree of control over that output. During the transition period from HTML to XHTML, developers will need to monitor their usage of HTML editors closely.

More and more HTML editors are producing cleaner code: they include start and end tags consistently, offer users a choice among Strict, Transitional, Frameset, and 'anything goes', and support key technologies like Cascading Style Sheets (CSS).

These improvements make it much easier to produce consistent HTML which can be converted to XHTML without loss, but there hasn't really been a rush to release new versions of software to support XHTML, especially as a core function. Some tools, like Evrsoft's 1st Page 2000, include the W3C's Tidy as an auxiliary function, but still focus on HTML internally rather than XHTML.

Tidy and tools like it are going to provide a bridge between HTML and XHTML for a long while. Even as more software companies integrate XHTML support with their products, users probably won't move as quickly.

Web developers can take advantage of Tidy where it is built into software in order to skip a step, but will probably have to keep a copy of Tidy around for handling information coming from sources they don't control.

Although one of the early complaints about XHTML was that it makes life for hand-coders more difficult, requiring them to keep track of balanced start and end tags manually, hand-coders may actually have an easier time producing clean XHTML than those using HTML editors, at least for the near future.

Developers looking for a pure XHTML solution may want to look at Mozquito Factory, an XHTML editor with a focus on creating HTML-compliant smart forms.

15 September 2000 - Style sheets from HTML to XML and back

As the W3C works to prune HTML of its formatting-oriented past, the tool of choice for formatting is becoming Cascading Style Sheets (CSS). While weak (though slowly improving) implementations have held CSS back, the W3C is pressing forward with CSS, extending it to give designers finer control over the look and feel of their pages. At the same time, the XML world has developed a very different approach to style sheets in XSL, the Extensible Stylesheet Language, which may also have something to contribute to XHTML developers.

Cascading Style Sheets allow developers to assign formatting properties to particular elements and combinations of elements within documents. CSS makes it easy to tell Web browsers things like "make all h1 elements 24pt sans-serif type in bold purple" or "make all list items nested inside list items italic". CSS provides a set of rules for allowing these statements about formatting to interact, and for allowing multiple sets of statements to interact.

While complete CSS Level 1 implementations are only starting to appear (the latest versions of Mozilla, Opera, and Internet Explorer 5 offer substantial support), CSS is critical to the W3C's hopes of making the Strict flavor of XHTML dominate as XHTML moves forward into XHTML 1.1 and XHTML 2.0. Without real CSS implementations, designers aren't going to be able to create the pages customers demand without falling back on the 'legacy' formatting tools. In the case of frames, where the W3C's proposed replacement would rely on CSS positioning, this is especially critical.

The W3C is building CSS to support both sides of XHTML's heritage - both HTML and XML. On the XML side, however, a different approach to style may offer tools to XHTML developers. Extensible Stylesheet Language (XSL) doesn't just describe what different elements and element combinations should look like. Instead, it provides rules for transforming source documents into result documents, and a result document vocabulary which is strictly formatting-oriented, called formatting objects.

While formatting objects are still in development, the tool for moving from a source document into a result document is available today. Called XSLT (XSL Transformations), it may prove useful to developers who need to convert XML documents into XHTML. At the same time, it could be applied to XHTML documents, converting them into formatting objects or other XML vocabularies. While XSLT isn't as familiar as the Document Object Model (DOM), it has a growing user base and more and more XSLT material is becoming available every day.

14 September 2000 - The risks of using XML tools with XHTML

Apart from the difficulties developers may encounter in learning how to apply tools built with XML in mind, there are a few potential tricks brought on by the XML approach that might throw XHTML developers.

While XML includes all of the parts most HTML developers are familiar with - elements, attributes, comments, and the DOCTYPE declaration - it also includes some extra parts like stronger use of the DTD, processing instructions, and different priorities for preserving information.

XML 1.0 allows, and in some ways expects, that parsers will transmit only a 'finished' version of a document to an application. This 'finished' version may come without comments, will have entities fully expanded (to single characters in the XHTML case), and may include default values for unspecified attributes, but no DOCTYPE information.

Applications that save a document back out as an XML document may be saving a document that contains the same information, but in a very different form from what was supplied. XML processors may also not know to leave a space after the element name in an empty tag (like <br />), as this workaround is specific to the needs of HTML browsers, not XML processing.
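To make the round-trip problem concrete, here is a small Python sketch that runs one document through two parsers from the standard library, `xml.etree.ElementTree` and `xml.dom.minidom`, and serializes the result. The sample document is invented for this example, and it leaves out the XHTML namespace declaration to keep the output easy to read.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Invented sample: a DOCTYPE, a comment, a numeric character reference,
# and an empty tag written the HTML-friendly way.
SOURCE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<!-- navigation goes here -->
<body><p>Copyright &#169; 2000<br />All rights reserved</p></body>
</html>"""


def round_trip_etree(text):
    """Parse with ElementTree and serialize the resulting tree."""
    return ET.tostring(ET.fromstring(text), encoding="unicode")


def round_trip_minidom(text):
    """Parse with minidom and serialize the root element."""
    return minidom.parseString(text).documentElement.toxml()


if __name__ == "__main__":
    print(round_trip_etree(SOURCE))
    print(round_trip_minidom(SOURCE))
```

With a default setup, ElementTree hands back a tree with no comment and no DOCTYPE, and the character reference comes out as a literal character. minidom keeps the comment but writes the empty tag as `<br/>`, without the space. Neither result is wrong as XML; they just aren't what went in.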

As more and more XHTML developers start to take advantage of the toolkits already built for processing and storing XML, the life cycle of an XHTML document will become more important. While Web developers have grown used to (and sometimes weary of) the changes that HTML editors will make in documents, the prospect of changes happening in other environments - from Save As... in a browser to incoming and outgoing information in an XML-based data storage system - may not be so appealing.

Even within a browser context, treating XHTML as XML may require some extra work. Parsers have enough latitude that some of them (non-validating parsers) can ignore external resources, like the DTDs used by XHTML to assign default values to attributes and to define character entities. While some applications (notably Mozilla) are making an effort to address these issues in software, many applications, especially home-grown applications, may not. All of this is written into the XML 1.0 specification, but many of these details aren't well-understood.
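One defensive measure is to replace named entities with numeric character references before a document ever reaches a parser that may skip the DTD. The Python sketch below assumes the standard library's `html.entities.name2codepoint` table is an acceptable stand-in for the entity sets in the XHTML DTDs; the function names are hypothetical. It works on raw text with a regular expression, so it would also rewrite entity-like strings inside CDATA sections.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# XML itself defines these five, so every parser understands them.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_references(text):
    """Replace known named entities with numeric character references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave it for the parser to judge
        return "&#%d;" % name2codepoint[name]

    return _ENTITY.sub(replace, text)


def parses(text):
    """Report whether ElementTree, which reads no external DTD, accepts text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False
```

For example, `parses("<p>&nbsp;</p>")` is false because the parser never sees a definition for `nbsp`, while the converted form `<p>&#160;</p>` parses without complaint.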

These issues don't appear in every XML processor, but they appear in many, without ever violating even the spirit of XML 1.0. 'Round-tripping' a document through a parser and back can be a tricky business, and there is already at least one guide to what a 'safe' subset of XML might look like. While Common XML may be useful for developers creating XML applications from scratch, the recommendations in that document don't fit with the tools XHTML has already used. (They may serve as useful warnings, however.)

XHTML developers who want to use XML tools will have to carefully examine the tools to ensure that they support all of the features XHTML demands for strict conformance. This may mean extra work in making XML tools XHTML-safe, as the 4xt.org group has done with XT, a popular XSLT processor, or it may mean adding a layer of code that makes sure DOCTYPE declarations appear in their proper place and that empty tags are presented properly.
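The "layer of code" option might look something like the following Python sketch, which post-processes the text an XML tool has written. Everything here is an assumption made for illustration: the function name, the choice of the XHTML 1.0 Strict DOCTYPE as a default, and the use of regular expressions, which will be fooled by `/>` inside comments or CDATA sections and by a literal `>` inside an attribute value.

```python
import re

XHTML_STRICT_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
    '  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
)

# An empty-element tag, with or without whitespace before the "/>".
_EMPTY_TAG = re.compile(r"<([A-Za-z][^<>]*?)\s*/>")
# An optional XML declaration at the very start of the document.
_XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>\s*")


def make_xhtml_safe(serialized, doctype=XHTML_STRICT_DOCTYPE):
    """Normalize empty tags to '<br />' style and restore a DOCTYPE."""
    text = _EMPTY_TAG.sub(r"<\1 />", serialized)
    if "<!DOCTYPE" in text:
        return text  # the tool kept the declaration; leave it alone
    match = _XML_DECL.match(text)
    if match:
        # The DOCTYPE belongs after the XML declaration, not before it.
        declaration = match.group(0).strip()
        return declaration + "\n" + doctype + text[match.end():]
    return doctype + text
```

Run over `<html><body><p>a<br/>b</p></body></html>`, this returns the same markup with the DOCTYPE in front and the empty tag rewritten as `<br />`.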

Although XML and XML toolkits have an enormous amount to offer XHTML developers, XHTML developers need to make sure that the toolkits fit their needs completely.

For more general information on XML interoperability issues, see http://www.simonstl.com/articles/interop/.

And yes, I am the editor of that Common XML document, but it derives from the work of the SML-DEV mailing list.

13 September 2000 - Using SAX for lightweight XHTML processing

While the Document Object Model (DOM) is very useful for kinds of processing that require complete access to the entire document at the same time, there are many cases where documents are being created, filtered, or transformed but developers never need access to the entire document at once.

Recognizing the need for a lightweight API, David Megginson and the XML-Dev mailing list created the Simple API for XML (SAX), an event-based Java interface in which an XML parser 'reads' a document to an application, announcing events like the start of an element, the appearance of text, and the end of an element.
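The event model is easiest to see in a few lines of code. This sketch uses Python's `xml.sax` module, one of the ports of SAX outside Java; the `EventLogger` class and `log_events` function are hypothetical names chosen for this example.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record each SAX event as a (kind, detail) tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # A parser may deliver one run of text in several pieces.
        if content.strip():
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))


def log_events(xml_bytes):
    """Parse a document and return the events the parser announced."""
    handler = EventLogger()
    xml.sax.parseString(xml_bytes, handler)
    return handler.events
```

Feeding it `b"<p>Hello <b>world</b></p>"` produces start and end events for `p` and `b`, with text events in between, and at no point does the whole document sit in memory as a tree.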

Initially devised as a standard for communications between XML parsers and the applications using XML parsers, the SAX standard quickly spread and developed new variations. By creating programs that both accepted SAX events from the parser and transmitted SAX events to the application, developers could create 'filters' that extracted information from documents, restructured information, or deleted it entirely. By creating small applications that only accepted SAX events and wrote them back out as XML documents, developers could then use SAX events to create documents.
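A filter of this kind can be sketched with Python's `xml.sax.saxutils`, which supplies `XMLFilterBase` for the filter and `XMLGenerator` for writing events back out as XML. The `DropFontTags` class and `strip_font_tags` function are hypothetical; dropping `font` elements while keeping their text is just one example of what a filter might do.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class DropFontTags(XMLFilterBase):
    """Pass every event along except the start and end of font elements."""

    def startElement(self, name, attrs):
        if name != "font":
            super().startElement(name, attrs)

    def endElement(self, name):
        if name != "font":
            super().endElement(name)


def strip_font_tags(xhtml_bytes):
    """Parser -> filter -> writer: return the document minus font tags."""
    out = io.StringIO()
    pipeline = DropFontTags(xml.sax.make_parser())
    pipeline.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    pipeline.parse(io.BytesIO(xhtml_bytes))
    return out.getvalue()
```

Because the filter looks like a parser to whatever follows it, several filters can be chained in front of the same writer.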

SAX has developed into a general-purpose API for handling XML-based information, including XHTML. It has spread beyond its Java roots to Python, Perl, and C++. Microsoft's latest releases of their MSXML parser include a SAX2 implementation that can be used from C++, Visual Basic, and scripting languages.

Developers who want to create XHTML from within programs now have an alternative to writing large flows of text. SAX allows developers to receive and describe documents as discrete events, passing off syntax reading (parsing) and syntax writing (XML output) to other tools, all while avoiding the overhead of large document trees in memory. You still need to make sure that elements begin and end in the proper order, but it becomes easier to abstract documents a little further from the actual markup.
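As a sketch of creating XHTML from events rather than from strings, the Python code below drives `xml.sax.saxutils.XMLGenerator` directly. The `build_page` function is hypothetical, and passing `xmlns` as an ordinary attribute is a shortcut taken for brevity rather than full namespace-aware SAX2 usage.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_page(title, paragraphs):
    """Describe a page as SAX events and let XMLGenerator write the markup."""
    out = io.StringIO()
    gen = XMLGenerator(out, encoding="utf-8")
    no_attrs = AttributesImpl({})

    def element(name, text):
        # The generator escapes &, < and > in character data for us.
        gen.startElement(name, no_attrs)
        gen.characters(text)
        gen.endElement(name)

    gen.startDocument()
    gen.startElement("html", AttributesImpl({"xmlns": XHTML_NS}))
    gen.startElement("head", no_attrs)
    element("title", title)
    gen.endElement("head")
    gen.startElement("body", no_attrs)
    for paragraph in paragraphs:
        element("p", paragraph)
    gen.endElement("body")
    gen.endElement("html")
    gen.endDocument()
    return out.getvalue()
```

The caller is still responsible for opening and closing elements in the right order, but escaping and tag syntax are the generator's problem.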

SAX is now in its second generation, called SAX2. For more on SAX, see http://www.megginson.com/SAX/.

12 September 2000 - Taking the DOM beyond Dynamic HTML

Developers who are familiar with dynamic HTML - and in particular the W3C DOM - have a head start on creating and processing XML and XHTML. The Document Object Model (DOM) provides an abstract view of an XML, HTML, or XHTML document which can be manipulated using various scripting and programming languages.

Most dynamic HTML developers are familiar with JavaScript or VBScript, but the W3C DOM provides JavaScript/ECMAScript and Java bindings, along with a CORBA IDL which makes it easier to port the DOM to other environments.

While the DOM is missing a lot of parts (like a common interface for loading documents and saving them back out), it provides a foundation which developers can take from Web browsers and JavaScript to back-end servlets written in Java, Active Server Pages (ASP) written in VBScript, or COM implementations written using C or Delphi, to name just a few.

The DOM represents a document as a set of nodes - a document node which contains all of the rest of the nodes, and then element and attribute nodes for storing element and attribute content, text nodes for containing text, and comment nodes containing comments. Mapping an XML or XHTML document to a DOM is pretty straightforward, and generally much easier than mapping old HTML to a DOM.
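The node structure can be explored with any DOM implementation. This sketch uses Python's `xml.dom.minidom`, a lightweight implementation of the DOM interfaces; the `outline` function and its output format are hypothetical.

```python
from xml.dom import Node, minidom

NODE_KINDS = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
}


def outline(node, depth=0):
    """Return one indented line per node, with attributes shown on elements."""
    kind = NODE_KINDS.get(node.nodeType, "other")
    if node.nodeType == Node.ELEMENT_NODE:
        attrs = " ".join(f'{k}="{v}"' for k, v in node.attributes.items())
        label = f"{kind} {node.tagName}" + (f" [{attrs}]" if attrs else "")
    elif node.nodeType in (Node.TEXT_NODE, Node.COMMENT_NODE):
        label = f"{kind} {node.data!r}"
    else:
        label = kind
    lines = ["  " * depth + label]
    for child in node.childNodes:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    doc = minidom.parseString('<p class="note">Hello <!-- aside --><b>world</b></p>')
    print("\n".join(outline(doc)))
```

The output shows the document node at the top, the `p` element beneath it, and the text, comment, and `b` element nodes as its children.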

If you've developed dynamic HTML before, even if you were using an environment like Internet Explorer, which provides an enormous number of non-standard extensions to the DOM, you already have experience in working with this key abstraction of a document. The skills you have today for programming interfaces in browsers can be easily transferred to document creation, manipulation, and other processing including (of course) interface-building.
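To show how directly those skills carry over, here is a sketch that builds a fragment outside the browser using the same DOM Level One methods a script would call on a page: `createElement`, `createTextNode`, `appendChild`, and `setAttribute`. It uses Python's `xml.dom.minidom`; the `build_list` function and the class name it assigns are hypothetical.

```python
from xml.dom import minidom


def build_list(items, css_class="tools"):
    """Build a <ul> document with one <li> per item, using DOM calls only."""
    doc = minidom.Document()
    ul = doc.createElement("ul")
    ul.setAttribute("class", css_class)
    doc.appendChild(ul)
    for item in items:
        li = doc.createElement("li")
        li.appendChild(doc.createTextNode(item))
        ul.appendChild(li)
    return doc


if __name__ == "__main__":
    doc = build_list(["Tidy", "SAX", "DOM"])
    # Manipulate the tree after the fact, as a dynamic HTML script would.
    root = doc.documentElement
    root.removeChild(root.firstChild)
    print(root.toxml())
```

Only the loading and saving steps differ from one environment to the next, which is exactly the gap in the DOM noted above.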

Many XML parsers, including Apache's Xerces, IBM's XML4J, Sun's Project X (now Apache's Crimson), and Microsoft's MSXML provide some level of support for the DOM. Recent generations of browsers, including Microsoft's Internet Explorer 5.x, Mozilla, and Netscape Communicator 6.x, provide solid or improving support for the DOM Level One.

Work on the DOM is continuing at the W3C. While Level One of the DOM has been complete for nearly two years, Level Two is currently a Candidate Recommendation and Level Three is just getting started. Each Level adds new material, rather than functioning as a version number.

The DOM is not without its critics, of course. Storing document trees in memory as interconnected objects can bring enormous overhead when large documents are being processed, and the DOM interface sometimes feels clunky to developers who find its 'document nature' counter-intuitive.

There are a number of alternatives to the DOM, including the Simple API for XML (SAX), a lightweight interface for processing XML (and XHTML) documents which doesn't create a document tree in memory. JDOM builds a tree, but provides an interface aimed specifically at Java programmers. There are also a number of tools for data binding and mapping XML to object structures, though most of these go well beyond the needs of XHTML developers.

For more information about this list, visit the main page.