xml-xhtml-tips Archives - Tactics


1 September 2000 - Think in elements, not tags

When I first learned HTML, it was all about using tags to mark up and format text. Start tags were the main weapon, and end tags were useful for turning off whatever I'd done with a start tag.

I rarely used </body> and </html> - browsers didn't care, so why should I? Similarly, I never marked the end of paragraphs, since the browser would accept the next <p> as the equivalent of a paragraph mark.

XHTML requires a different approach, taken from XML. Tags are just markers for the beginnings and ends of elements, and these must be marked explicitly. It's the elements that matter, not just the tags. Elements may contain other elements, but they can't contain a portion of another element. All nesting must be clean and explicit.

For example, this markup is illegal in XHTML (and XML):

<b>This is bold. <i>This is bold italic.</b> This is italic.</i>

Because the b and i elements overlap, this markup isn't proper XML or XHTML. To make it work, you need to write:

<b>This is bold. <i>This is bold italic.</i></b><i> This is italic.</i>

Why? XHTML expects elements to be containers, not just formatting markers in a stream. This container-based approach makes it much easier to process and store information based on its structure, and is critical to XHTML's eventual goal of letting developers mix and match new vocabularies with 'classic HTML'.
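
One way to see the container rule at work is to hand an overlapping and a cleanly nested version of the bold/italic example to an XML parser. The sketch below uses Python's standard xml.etree.ElementTree module, which is just my choice for illustration; the helper name is_well_formed and the wrapping p element are made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if the XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# b and i overlap: b is closed while i is still open.
overlapping = "<p><b>This is bold. <i>This is bold italic.</b> This is italic.</i></p>"

# Every element is closed inside the element that contains it.
nested = "<p><b>This is bold. <i>This is bold italic.</i></b><i> This is italic.</i></p>"

print(is_well_formed(overlapping))  # False
print(is_well_formed(nested))       # True
```

The parser never gets as far as asking what b or i mean; the overlapping version is rejected purely on structure.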

2 September 2000 - Empty elements and empty tags

HTML has always had a number of elements which don't include textual content directly, including:

br - just a line break
hr - horizontal rules don't contain text
img - references an image through an attribute value
input, param, area, col, base, meta, link, basefont - all values are set through attributes

In HTML, all of these elements were typically represented with just a start tag. Browsers understood that these elements didn't have textual content, so they never worried about finding an end tag for the element.

XML doesn't assume that browsers or other applications know anything about the vocabulary, so it doesn't allow document authors to assume their applications are very smart. All elements must have a clearly defined beginning and end.

An empty element may be represented in XML syntax by a start tag immediately followed by an end tag:

<br></br>

Alternatively, XML provides a shortcut syntax, called an empty tag, that puts a slash (/) right before the closing pointy bracket (>):

<br/>

XHTML uses these empty tags for its empty elements, but it does so with a slight tweak to get around incompatibilities with a wide range of older browsers. Many browsers have problems with <br></br> (producing an extra line break) and with <br/> (they either don't recognize it as a line break or they display the slash immediately afterward).

To get around these difficulties, the XHTML 1.0 Recommendation suggests that all empty tags include a space before the slash:

<br />

The same approach can be used with empty tags containing attributes:

<img src="mypic.gif" alt="picture of me" />
<input type="text" name="email" maxlength="255" size="20" />

This relatively simple workaround gives XHTML developers the best of both worlds: proper display in HTML browsers, and clean structures that can run through XML parsers and into XML tools. Empty tags are both a shortcut for developers and a workaround that helps XHTML fit into the existing HTML world.
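
To check that the different spellings of an empty element really are the same thing to an XML parser, here is a small sketch using Python's standard xml.etree.ElementTree module (my choice of tool, nothing XHTML prescribes). The sample paragraphs and the names forms, parsed, and bare_start_tag_ok are invented for the example.

```python
import xml.etree.ElementTree as ET

forms = {
    "start tag plus end tag": "<p>one<br></br>two</p>",
    "empty tag": "<p>one<br/>two</p>",
    "empty tag with space": "<p>one<br />two</p>",
}

parsed = {}
for label, markup in forms.items():
    br = ET.fromstring(markup).find("br")
    # Record the name, the text content, and the number of child elements.
    parsed[label] = (br.tag, br.text, len(br))

print(parsed)

# An HTML-style bare start tag is never closed, so the XML parser rejects it.
try:
    ET.fromstring("<p>one<br>two</p>")
    bare_start_tag_ok = True
except ET.ParseError:
    bare_start_tag_ok = False

print(bare_start_tag_ok)  # False
```

All three accepted forms come out as the same br element with no text and no children; the space before the slash matters only to older HTML browsers.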

4 September 2000 - Moving to lower case

One of the biggest complaints about XHTML is that all markup - element names, attribute names, and even some attribute values - must be lower case. Upper case and mixed case will both generate validation errors.

For some developers, this writer included, this has meant a fairly drastic change in hand-coding style, and the transition hasn't always been smooth. For tool developers, it can be even more of a problem, requiring picking through code to find all the markup and convert it to consistent lower case.

why

Despite the cost, there are some solid reasons why the XHTML Working Group had to choose a particular case for markup and stick to it.

The W3C Working Group that created XML considered internationalization a critical issue for building infrastructure on the Internet. XML moved away from the focus on ASCII or ISO-8859-1 (Latin-1) that SGML and HTML had often had, and built itself on Unicode.

Unicode provides enough character positions for nearly every language in the world, along with facilities for extensions and private character areas that should keep it from hitting a wall in the foreseeable future. Effectively, Unicode gives developers a way to mix English, Chinese, Basque, Hindi, Korean, Vietnamese, Russian, Japanese, Arabic, Urdu, and many many more languages in a document without having to split the document into smaller pieces or use strange escape sequences.

One consequence of using Unicode for markup - and in particular of allowing non-Latin characters to be used in element and attribute names - is that there are many languages that don't recognize conventions like upper and lower case. Requiring that XML processors perform case-folding brings a new level of complexity to the parsing operation. Since XML parsers were supposed to be reasonably simple and even small, this could have complicated matters enormously.

As a result, the XML Working Group decided that XML markup would be entirely case-sensitive, making br and BR two entirely different elements. It's much simpler and leaves far less room for conflicting interpretations of the same document.
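
A quick sketch of what case-sensitivity looks like from the parser's side, again using Python's standard xml.etree.ElementTree module as an arbitrary stand-in for "an XML parser" (the variable names are mine):

```python
import xml.etree.ElementTree as ET

lower = ET.fromstring("<br/>")
upper = ET.fromstring("<BR/>")

# The parser performs no case-folding: these are two different element names.
print(lower.tag, upper.tag, lower.tag == upper.tag)  # br BR False

# Start and end tags must match exactly, including case.
try:
    ET.fromstring("<P>Mixed case tags</p>")
    mixed_case_ok = True
except ET.ParseError:
    mixed_case_ok = False

print(mixed_case_ok)  # False
```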

It's not entirely clear why the XHTML working group chose to use lower case for XHTML markup rather than upper case, but it is clear that they would have faced complaints from partisans of the losing side in either case.

how to deal with it

For XHTML developers, the impact of this decision is pretty simple: all markup must be in lower case. This means that all element names, attribute names, and some attribute values must be in lower case.

For element names, this means img instead of IMG, blockquote instead of BLOCKQUOTE, table instead of TABLE, and so on.

For attribute names, this means href instead of HREF, onclick instead of onClick, src instead of SRC, and so on. The entire attribute name must be in lower case, even when it is a combination of two words, like most of the event handling attributes.

For attribute values, case sometimes matters and sometimes doesn't. Some attributes, like the alt attribute of the img element, are designed to accept free-form text descriptions. Similarly, domain names within URLs are not case-sensitive, so you can still write "http://SIMONSTL.com" instead of "http://simonstl.com" if you like - unless you're using the URI in a namespace declaration (like xmlns).

However, in cases where you select a value from a range of choices, you must use lower case. To create a text input on a form, you might use:

<input type="text" name="email" maxlength="255" size="20" />

but you can't use

<input type="TEXT" name="email" maxlength="255" size="20" />

Similarly, the method attribute now takes get or post, not GET or POST.

Generally speaking, if the value of an attribute is a choice that directly changes the way your XHTML is presented, it needs to be lower case. If it's pointing to an external resource, providing alternate text, or providing a numeric (hex) value, case still doesn't matter. When in doubt, use lower case.

The new pieces XHTML adds to HTML, like the XML declaration at the start of the document, need to be in lower case as well. (The DOCTYPE declaration should retain its familiar mixed case.)

The contents of comments and elements may of course use whatever case you need to properly convey your documents' message.
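
If you have a pile of documents to convert, a small script can at least point out where the upper and mixed case names are. This is only a sketch under some assumptions: it uses Python's standard xml.etree.ElementTree module, the function name names_not_lower_case is made up, the input must already be well-formed XML, and it checks element and attribute names only. Attribute values like type="TEXT" need a validating tool and the DTD.

```python
import xml.etree.ElementTree as ET

def names_not_lower_case(markup):
    """List element and attribute names that contain upper-case letters."""
    offenders = []
    for element in ET.fromstring(markup).iter():
        if element.tag != element.tag.lower():
            offenders.append(element.tag)
        for name in element.attrib:
            if name != name.lower():
                offenders.append(name)
    return offenders

sample = '<p><A HREF="http://simonstl.com" onClick="go()">home</A></p>'
print(names_not_lower_case(sample))  # ['A', 'HREF', 'onClick']
```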

5 September 2000 - Quote those attribute values

HTML browsers were always quite forgiving about whether or not attribute values were quoted. The only times that attribute values really needed to be quoted were cases when the values contained whitespace. Otherwise, the browser would guess which parts of a tag were attribute values based on whitespace.

Markup like this was permitted, and remains common:

<IMG SRC=mypic.gif HEIGHT=20 WIDTH=30 ALT="This is my picture">

XML took a much stricter approach to syntax in general, requiring that all attribute values be quoted. Enforcing this requirement makes it much simpler for parsers to figure out which content in a tag is an attribute, and avoids the potential for chaos brought on by the possibility of quotes or equals signs inside of attribute content. XHTML enforces the same requirement.

XHTML requires that the element above look like:

<img src="mypic.gif" height="20" width="30" alt="This is my picture" />

or:

<img src='mypic.gif' height='20' width='30' alt='This is my picture' />

XHTML, like XML, permits developers to choose single or double quotes as attribute delimiters. These can vary from attribute to attribute, but the type of quote used to mark the beginning of an attribute value must also be used to mark its end.

Thus, you can use markup like:

<img src="mypic.gif" height='20' width="30" alt='This is my picture' />

but you can't use:

<img src="mypic.gif' height='20" width="30' alt='This is my picture" />
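
Here is a rough sketch of how an XML parser treats the quoting styles, using Python's standard xml.etree.ElementTree module as a convenient example parser. The helper attributes_or_none is a name I made up, and the img elements are trimmed versions of the ones above.

```python
import xml.etree.ElementTree as ET

def attributes_or_none(markup):
    """Return the parsed attributes, or None if the parser rejects the markup."""
    try:
        return ET.fromstring(markup).attrib
    except ET.ParseError:
        return None

double = attributes_or_none('<img src="mypic.gif" height="20" alt="This is my picture" />')
single = attributes_or_none("<img src='mypic.gif' height='20' alt='This is my picture' />")
mixed = attributes_or_none('''<img src="mypic.gif" height='20' alt="This is my picture" />''')

# HTML-style unquoted values are not allowed in XML.
unquoted = attributes_or_none('<img src=mypic.gif height=20 alt="This is my picture" />')

# A value opened with a double quote is never closed by a single quote.
unclosed = attributes_or_none('''<img src="mypic.gif' />''')

print(double == single == mixed)  # True
print(unquoted, unclosed)         # None None
```

The choice of quote character leaves no trace in the parsed result; only the value between the quotes survives.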

6 September 2000 - Namespaces in XHTML 1.0

XHTML 1.0 makes very simple use of XML's namespace facilities, using them only to label the elements within an XHTML document as XHTML. Despite some initial controversy, XHTML 1.0's use of namespaces now seems settled, stable, and simple to use.

While the DOCTYPE declaration has long served to identify HTML and XHTML documents, providing element-level identification is becoming more important as the W3C develops new ways to include multiple vocabularies in the same document. As Scalable Vector Graphics (SVG), Synchronized Multimedia Integration Language (SMIL), Mathematical Markup Language (MathML), and other task-specific formats emerge, the W3C would like to allow developers to integrate them with XHTML.

To avoid the 'tag soup' that characterized HTML's growth until recently, the W3C has come up with a mechanism for identifying vocabularies uniquely. The Namespaces in XML Recommendation allows developers to associate Uniform Resource Identifiers (URIs) with element and attribute names. Programs which understand XML Namespaces can then work with a combination of the base element name and the URI with which it is associated.

There are two ways to associate a URI with an element name. The element may have a prefix in front of the element name, with a colon between the prefix and the element name:

xhtml:p          xhtml:img           xhtml:table

That prefix is then mapped to a URI. The other option avoids the use of prefixes, and provides a URI mapping for all elements without a prefix, associating them with the 'default namespace'. XHTML 1.0 uses this approach, sparing XHTML developers a lot of typing which isn't yet necessary.

The namespace URI for XHTML 1.0 documents (whatever DTD they may use) is http://www.w3.org/1999/xhtml. The namespace declaration, which is made using an attribute, appears in the html element of all conforming XHTML 1.0 documents:

<html xmlns="http://www.w3.org/1999/xhtml">

This declaration tells XML parsers and XHTML processors to associate the namespace URI http://www.w3.org/1999/xhtml with all non-prefixed elements contained by the html element.

As long as the namespace declaration shown above appears in your html elements, your documents will have met XHTML 1.0's namespace requirements. (You still need to meet the other requirements, of course!) The namespace declaration made here will apply to all of the elements contained by the html element, unless another element redefines the default namespace.
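
To make the "base element name plus URI" idea concrete, here is a sketch using Python's standard xml.etree.ElementTree module, which happens to report each element name as {URI}name. The tiny documents and variable names are invented for the example, and the prefixed form is shown only for comparison; XHTML 1.0 documents use the default namespace.

```python
import xml.etree.ElementTree as ET

# Default namespace: no prefixes, as XHTML 1.0 documents are written.
default_ns = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Hello</p></body></html>'
)

# The same structure with an explicit prefix mapped to the same URI.
prefixed = ET.fromstring(
    '<xhtml:html xmlns:xhtml="http://www.w3.org/1999/xhtml">'
    '<xhtml:body><xhtml:p>Hello</xhtml:p></xhtml:body></xhtml:html>'
)

default_names = [element.tag for element in default_ns.iter()]
prefixed_names = [element.tag for element in prefixed.iter()]

print(default_names[2])                 # {http://www.w3.org/1999/xhtml}p
print(default_names == prefixed_names)  # True
```

A namespace-aware program sees the same names either way, which is why the prefix-free default namespace costs XHTML authors nothing.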

Multiple namespaces shouldn't appear in conforming XHTML 1.0 documents. However, XHTML 1.0 includes a section describing how mixing elements from different namespaces might work, though such documents are not strictly conforming XHTML 1.0 documents.

Later tips will cover namespaces in more detail.

8 September 2000 - XHTML and the DOCTYPE declaration

Every XHTML 1.0 document is required to have a Document Type Declaration (DOCTYPE declaration) that indicates which of the three XHTML Document Type Definitions (DTDs) is used by the document. Each DTD - Strict, Transitional, and Frameset - has a DOCTYPE declaration that must appear to identify which type of XHTML is in use.

The DOCTYPE declaration must appear before the html element, but after the XML Declaration, if one appears. This means that

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
....

is legal, but the following two examples are not:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<!--this example is illegal.  The XML declaration may only appear at the
very start of a document, so a conforming XML parser will report an error.-->
....

and:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!--this example is illegal, and will not load-->
....

XHTML 1.0 provides three Document Type Declarations, one each for the Strict, Transitional, and Frameset DTDs. The contents of the Document Type Declaration must match the version in the specification - even matching case - except for the URL at the end. This URL must point to a copy of the XHTML 1.0 DTD, either the copy at the W3C or another copy that will be accessible to XML parsers processing the document.

For example, the XHTML 1.0 specification presents the Document Type Declaration for the Strict DTD as:

<!DOCTYPE html 
   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "DTD/xhtml1-strict.dtd">

Since the XHTML 1.0 specification is stored at http://www.w3.org/TR/xhtml1, the relative URL is acceptable in this context. However, since most documents created using XHTML won't be stored on the W3C's servers, developers should use:

<!DOCTYPE html 
   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

Similarly, the DOCTYPE declaration for the Transitional DTD should be:

<!DOCTYPE html 
   PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

And the DOCTYPE declaration for the Frameset DTD should be:

<!DOCTYPE html 
   PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">

If you examine the DTDs, this information is provided in the opening comments. Each DTD has a public identifier, which provides a network-independent way for software to identify the DTD, as well as a system identifier, the URL for reaching a copy of the DTD.

Whitespace in the DOCTYPE Declaration outside of the quoted contents will be normalized by the parser, so you can put this declaration on a single line or multiple lines as you find convenient.
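
The placement and whitespace rules can be tried out with a short sketch. It uses Python's standard xml.etree.ElementTree module, which is a non-validating parser: it reads the DOCTYPE declaration but, as far as I know, does not fetch the DTD or check the document against it, so this only tests well-formedness. The helper loads and the skeleton document are made up for the example.

```python
import xml.etree.ElementTree as ET

XML_DECL = b'<?xml version="1.0" encoding="UTF-8"?>\n'
DOCTYPE_MULTI_LINE = (
    b'<!DOCTYPE html\n'
    b'   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
    b'   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
)
DOCTYPE_ONE_LINE = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
)
HTML_OPEN = b'<html xmlns="http://www.w3.org/1999/xhtml">\n'
REST = b'<head><title>Test</title></head><body><p>Hello</p></body></html>'

def loads(document):
    """Return True if the parser accepts the document (hypothetical helper)."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# XML declaration, then DOCTYPE, then the html element.
multi_line_ok = loads(XML_DECL + DOCTYPE_MULTI_LINE + HTML_OPEN + REST)
one_line_ok = loads(XML_DECL + DOCTYPE_ONE_LINE + HTML_OPEN + REST)

# DOCTYPE placed after the html start tag.
doctype_inside_ok = loads(XML_DECL + HTML_OPEN + DOCTYPE_MULTI_LINE + REST)

print(multi_line_ok, one_line_ok)  # True True
print(doctype_inside_ok)           # False
```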
