xml-xhtml-tips Archives - Workarounds

2 September 2000 - Empty elements and empty tags

HTML has always had a number of elements which don't include textual content directly, including:

br - just a line break
hr - horizontal rules don't contain text
img - references an image through an attribute value
input, param, area, col, base, meta, link, basefont - all values are set through attributes

In HTML, all of these elements were typically represented with just a start tag. Browsers understood that these elements didn't have textual content, so they never worried about finding an end tag for the element.

XML doesn't assume that browsers or other applications know anything about the vocabulary, so it doesn't allow document authors to assume their applications are very smart. All elements must have a clearly defined beginning and end.

An empty element may be represented in XML syntax by a start tag immediately followed by an end tag:

<br></br>

Alternatively, XML provides a shortcut syntax, called an empty tag, that puts a slash (/) right before the closing pointy bracket (>):

<br/>

XHTML uses these empty tags for its empty elements, but it does so with a slight tweak to get around incompatibilities with a wide range of older browsers. Many browsers have problems with <br></br> (producing an extra line break) and with <br/> (they either don't recognize it as a line break or they display the slash immediately afterward).

To get around these difficulties, the XHTML 1.0 Recommendation suggests that all empty tags include a space before the slash:

<br />

The same approach can be used with empty tags containing attributes:

<img src="mypic.gif" alt="picture of me" />
<input type="text" name="email" maxlength="255" size="20" />

This relatively simple workaround gives XHTML developers the best of both worlds: proper display in HTML browsers, and clean structures that can run through XML parsers and into XML tools. Empty tags are both a shortcut for developers and a workaround that helps XHTML fit into the existing HTML world.
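To an XML parser, the start-tag-plus-end-tag form, the empty tag, and the empty tag with a space before the slash all describe the same empty element. The following is a minimal sketch of that equivalence using Python's standard xml.etree.ElementTree module. The choice of Python and the helper name describe are assumptions made for illustration; they are not part of the original tip.

```python
import xml.etree.ElementTree as ET

def describe(markup):
    """Parse one element and report its name, text content, and child count."""
    elem = ET.fromstring(markup)
    return (elem.tag, elem.text, len(elem))

# Start tag plus end tag, empty tag, and empty tag with the XHTML space.
forms = ["<br></br>", "<br/>", "<br />"]
results = [describe(form) for form in forms]

# Attributes survive the empty-tag syntax unchanged.
img = ET.fromstring('<img src="mypic.gif" alt="picture of me" />')

print(results)
print(img.attrib)
```

All three forms come back with the same element name, no text, and no children, so the extra space matters only to older HTML browsers, not to XML tools.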

10 September 2000 - Comments and XHTML

Developers who have used comments in HTML as a place to hold information intended purely for human consumption will be happy to find that those comments work exactly the same way in XHTML that they worked in HTML. On the other hand, developers who have used comments to hide scripts and style sheets, or to pass information to programs, may find that it's time to update their documents.

XHTML comments look exactly like HTML comments:

<!--This is a comment. Heh heh. -->

Like HTML comments, XHTML comments can appear before the start tag of the html element, in text within the document, or after the end tag of the html element.

If the XML declaration is used, it must be the very first thing in the document, and any comments should appear after it. If comments or whitespace appear before it, XML parsers won't recognize it as the XML declaration and will typically report an error.
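The placement rule for the XML declaration can be checked with any conforming parser. This sketch uses Python's standard xml.etree.ElementTree module (built on the expat parser); the helper name is_well_formed and the three sample documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(document):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# Declaration first, then the comment: accepted.
declaration_first = '<?xml version="1.0"?>\n<!-- a comment -->\n<html></html>'
# A comment ahead of the declaration: rejected.
comment_first = '<!-- a comment -->\n<?xml version="1.0"?>\n<html></html>'
# Even a single leading space ahead of the declaration: rejected.
space_first = ' <?xml version="1.0"?>\n<html></html>'

for doc in (declaration_first, comment_first, space_first):
    print(is_well_formed(doc))
```

With this parser, a misplaced declaration is reported as a parsing error rather than silently ignored.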

Developers who have been using comments to hide scripts and style sheets may encounter some problems. XML takes comments very seriously - XML parsers aren't even required to report the contents of comments to applications. That means that if you use the script-hiding technique shown below, your scripting code may simply disappear in certain XML-oriented applications:

<script type="text/javascript">
<!--
...
if (i<12) {
}
...
//-->
</script>

While the comments will keep Netscape 2.0 and earlier browsers from misinterpreting the < in the if statement as markup, XML parsers may discard the script contents entirely.
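As one concrete case of a script disappearing, the sketch below runs the comment-hidden script through Python's standard xml.etree.ElementTree module, whose default tree builder does not keep comments. The choice of parser is an assumption for illustration; other XML tools may keep comments or drop them, since either is allowed.

```python
import xml.etree.ElementTree as ET

hidden_script = (
    '<script type="text/javascript">\n'
    '<!--\n'
    'if (i<12) {\n'
    '}\n'
    '//-->\n'
    '</script>'
)

script = ET.fromstring(hidden_script)

# The default tree builder does not keep comments, so only the
# whitespace around the comment is left as the element's text.
remaining = (script.text or "").strip()
serialized = ET.tostring(script, encoding="unicode")

print(repr(remaining))
print(serialized)
```

Once the document has been read and written back out this way, the script element is still there but the code inside it is gone.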

Although XML parsers aren't commonly used in Web browsers (yet, at any rate), you may find yourself taking advantage of more and more XML tools, like content management systems and transformation engines, that weren't built for XHTML in particular - they just read it as XML.

XML offers a solution to this situation: CDATA sections. They allow you to mark text that shouldn't be parsed, and can be used inside script elements with script comments to keep them from interfering with the scripting engine:

<script type="text/javascript">
//<![CDATA[
...
if (i<12) {
}
...
//]]>
</script>

CDATA sections begin with <![CDATA[ and end with ]]>. If your script's code includes a ]]>, you'll need to break it up a little - ] ]> or ]] >, as seems appropriate. Otherwise, XML parsers will interpret the CDATA section as ending at the first ]]> they find, corrupting your script (by stripping the ]]>) and then leaving the second ]]> in the text or reporting a parsing error.
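The effect of a stray ]]> can be seen with a small experiment. This is a sketch using Python's standard xml.etree.ElementTree module; the helper name script_text, the template string, and the one-line JavaScript test are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def script_text(document):
    """Return the script element's text, or None if the document is rejected."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError:
        return None

template = '<script type="text/javascript">//<![CDATA[\n{code}\n//]]></script>'

# The array test below happens to contain the sequence ]]> in its code,
# which ends the CDATA section early and leaves a bare ]]> behind.
broken = template.format(code="if (a[b[0]]>c) { }")

# A space before > keeps the JavaScript valid and the CDATA section open.
fixed = template.format(code="if (a[b[0]] >c) { }")

print(script_text(broken))
print(script_text(fixed))
```

This particular parser rejects the first document outright and returns the full script, including the // comment markers, for the second.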

You can use the characters <, >, and & anywhere you like inside a CDATA section, so they're quite useful for more than script-hiding.

Other tools that use comments, like server-side includes, may want to consider shifting to a processing instruction syntax (<? instead of <!--) if their templates need to pass through XML-oriented environments.
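Unlike comments, processing instructions are always passed along to the application. The sketch below uses Python's standard xml.parsers.expat module to collect them. The include target and its file="header.html" content are hypothetical, invented here only to stand in for a server-side include; no real tool's syntax is implied.

```python
import xml.parsers.expat

def collect_pis(document):
    """Return (target, data) pairs for every processing instruction found."""
    found = []
    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = (
        lambda target, data: found.append((target, data))
    )
    parser.Parse(document, True)
    return found

# Hypothetical include directive written as a processing instruction.
doc = '<body><?include file="header.html"?><p>Hello</p></body>'
pis = collect_pis(doc)
print(pis)
```

A template processor built on an XML parser could act on each reported target and data pair, without relying on comments surviving the trip.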

26 September 2000 - Dealing with Markup Characters, Part I: Entities

XHTML is less forgiving of stray markup characters than HTML, and provides additional mechanisms to help developers 'hide' these characters from parsers. Some of those mechanisms will be familiar to HTML developers, others existed in HTML but were obscure, and one is new to XHTML.

Just like HTML, XHTML provides built-in entities for representing markup characters inside of markup without disrupting parsing:

Entity Reference	Character Represented
&amp;	Ampersand (&)
&lt;	Less Than (<)
&gt;	Greater Than (>)
&apos;	Apostrophe (')
&quot;	Quote (")

HTML parsers were fairly relaxed about letting ampersands and greater than symbols appear in some places within a document, especially within URLs, but XML parsers are much less forgiving and XHTML requires conformance to the XML standard.

To stay out of trouble the simplest possible way, use the entity references every time you need to use these characters as something other than markup. (You only need to use the references for quotes and apostrophes inside of attribute values, and then only to avoid conflicting with the quotes containing the attribute value.)
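URLs with query strings are the most common place for a stray ampersand to slip through. The following sketch uses Python's standard xml.etree.ElementTree module to compare a raw and an escaped ampersand in an href attribute; the helper name href_of and the sample URL search.cgi?q=xhtml are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def href_of(markup):
    """Return the href attribute value, or None if the markup is rejected."""
    try:
        return ET.fromstring(markup).get("href")
    except ET.ParseError:
        return None

# A raw ampersand, as many HTML pages wrote it: not well-formed XML.
raw = '<a href="search.cgi?q=xhtml&lang=en">Search</a>'
# The same link with the ampersand written as an entity reference.
escaped = '<a href="search.cgi?q=xhtml&amp;lang=en">Search</a>'

# Quote and apostrophe references inside an attribute value.
title = ET.fromstring('<p title="5&apos; 10&quot; tall">x</p>').get("title")

print(href_of(raw))
print(href_of(escaped))
print(title)
```

The escaped form is what the parser accepts, and the application still receives the plain & in the attribute value.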

For example, we'll include the XML document below in an XHTML document:

<example>This document contains &lt;, represented by an &amp;lt;
entity.</example>

The XML document contains element markup (<, >), and some trickier bits of escaping: &lt;, which stands for <, and &amp;lt;, which is designed to leave the literal text &lt; in the document. The element markup is easily handled with &lt; and &gt;. If we want to keep the XML document looking as is for an example in the XHTML version, we'll have to replace the initial ampersands of both entity references with another ampersand reference, as &amp;lt; and &amp;amp;lt;.

If we wanted to put this into a code element in XHTML, it might look like:

<code>&lt;example&gt;This document contains &amp;lt;, represented 
by an &amp;amp;lt; entity.&lt;/example&gt;</code>

In the browser, all of this would be presented as:

<example>This document contains &lt;, represented by an &amp;lt;
entity.</example>

To make this work, you just need to apply the same techniques which were available in HTML and do it consistently.
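Applying the technique consistently is easy to automate. As a sketch, the escape function in Python's standard xml.sax.saxutils module replaces &, <, and > in one pass, and parsing the result shows that the original example text comes back intact. The variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The XML document we want to show as an example.
xml_source = (
    "<example>This document contains &lt;, represented by an &amp;lt;\n"
    "entity.</example>"
)

# One pass of escaping: & becomes &amp;, < becomes &lt;, > becomes &gt;.
escaped = escape(xml_source)

# Wrap the escaped text in a code element and read it back with a parser.
code = ET.fromstring("<code>" + escaped + "</code>")

print(escaped)
print(code.text == xml_source)
```

The escaped string matches the hand-written code element, with &amp;lt; and &amp;amp;lt; in place of the original references.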

3 October 2000 - Dealing with Markup Characters, Part II: CDATA sections

While replacing markup characters with XML's built-in entities may work in many cases, it still leaves a few difficult situations unresolved. The scripting engines built into most Web browsers won't accept scripts that substitute entities for <, >, and &. Fortunately, CDATA sections offer an XML-safe alternative that works in some HTML situations.

CDATA sections are an XML feature that tells the parser to ignore all occurrences of markup between the initial <![CDATA[ and the closing ]]>. These can be included in scripts while hidden by script comments:

<script type="text/javascript">
//<![CDATA[
...
if (i<12) {
}
...
//]]>
</script>

If your script includes a ]]>, write it as ] ]> or ]] > to avoid ending the CDATA section prematurely. In theory, you can also use CDATA sections to escape markup anywhere in a document, in markup examples, for instance:

<code><![CDATA[
<example>This is an example.  You should be able to see the
start and end tags here.</example>
]]></code>

Unfortunately, this isn't likely to work on regular HTML browsers - they won't know what to do with the opening <![CDATA[, may display the closing ]]>, and will probably interpret the markup characters included in the section. Someday...
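XML tools, on the other hand, already handle this. The sketch below uses Python's standard xml.etree.ElementTree module to show that a CDATA section and the equivalent entity references give an application exactly the same text; the shortened example string and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

example = "<example>This is an example.</example>"

# Markup hidden inside a CDATA section.
with_cdata = ET.fromstring("<code><![CDATA[" + example + "]]></code>")

# The same markup hidden with entity references.
with_entities = ET.fromstring(
    "<code>&lt;example&gt;This is an example.&lt;/example&gt;</code>"
)

print(with_cdata.text)
print(with_cdata.text == with_entities.text)
```

In both cases the code element has no child elements: the example tags arrive as plain text rather than as markup.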
