XHTML — myths and reality

It is difficult to find a web development language today which is as misunderstood as XHTML. In the following article we’ll examine why, sort out a few concepts that frequently confuse authors, and offer practical suggestions on real–life XHTML usage.

The intended audience for this article is developers who are considering XHTML for the first time, as well as authors and content producers who want to learn more about extensible markup languages.

Tina Holmboe

  1. Introduction
  2. The Purpose of XHTML
  3. XHTML and the Content Type
  4. Strictly XHTML
  5. Lack of support
  6. Content–Negotiation
  7. Recommendations
  8. References
  9. Document Information

Introduction

When the first informal version of HTML was released in 1992, it was described as a “hypertext mark-up language”[1] and “an SGML format”. SGML, a powerful meta–language for creating markup languages, was developed between 1969 and 1980[2] but grew out of work done as early as 1945[3].

Powerful, and close to infinitely adaptable, SGML has been widely adopted, particularly in high–tech industry, government, and academia[4], and has given rise to a great number of applications. It was a logical choice with which to create the markup language of the World Wide Web, but it was also too complex for the casual author. To make HTML easy to use, some corners were cut; so much so that some claim it does not fulfil the requirements to be an SGML application[5].

When the W3C decided to create XML in 1996, the rationale was given as "The goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. For this reason, XML has been designed for ease of implementation, and for interoperability with both SGML and HTML"[6]. It was further described as “… an extremely simple dialect of SGML …” [7].

Since the design goals of XML itself partially mirrored those of the original HTML, it was logical for work to begin on formulating an XML–based markup language. XHTML 1.0[8] became a W3C Recommendation in January 2000, followed by XHTML 1.1[9] in May 2001. Work on XHTML 2.0[10] is ongoing as of September 2008. Another project is underway in the XHTML WG[11] to produce a revision 1.2, containing features such as ARIA[12], roles[13], the ACCESS module[14], and RDFa[15].

It is worth noting that the above means, in short, that XML is still SGML. The rationale is, in David Megginson’s SGML FAQ[16], given as follows: “Unlike HTML, XML is not an SGML application -- instead, it’s a set of simple conventions for using SGML without some of the more esoteric features.”

Summary The XHTML family of languages is created and maintained by the XHTML WG, with the current version being 1.1. A revision 1.2 and a new version 2.0 are in the pipeline.

The Purpose of XHTML

XHTML has, when it comes down to it, two distinct reasons for existing. The first is to shift the markup language of the World Wide Web from SGML to XML, and in the process clean up some of the leftover crud that has plagued the web for many years. The 1.* series of XHTML is only the first step in this process.

The second, and potentially even more important, reason is to add the XML ability to extend the language through namespaces. This will make it possible for an author to express more structures and richer semantics than is possible with HTML today. In effect XHTML inherits the possibility of supporting more than one language: instead of extending HTML in a monolithic fashion, XHTML can be extended through modules, where each module defines a specific subset of the language.

This, theoretically, means extension of the language can be done without the need for a browser upgrade.

In practice it is reasonably easy to create a browser which can parse even multi–namespace XML documents, apply CSS to them and use JS with a node tree to manipulate the resulting bits and pieces. It is far more complicated to construct the browser in such a way that it can present, regardless of the actual method involved, the meaning of structures. This goes beyond what can be achieved with snippets of JavaScript which add behaviour to elements: despite the best intentions of page authors, the semantics of elements must be communicated to browsers so that the presentation, whether visual, aural, or tactile, can be adapted to the needs of the user.

Achieving this today, with HTML, is simple enough. A browser “knows” what the <h1> tag is supposed to mean, and can present the content of the header in a way which makes sense to the user, whether by speaking it out loud, showing it in a particular font, weight, and colour on a screen, or by raising a distinct set of dots on a Braille strip.

In a mixed namespace XML document, on the other hand, the presence of an unknown tag, say <foo:bar>, will make no sense to the browser, and so it cannot tell what form of presentation is appropriate to convey the meaning of the element. Some work, notably the W3C’s semantic web activity[17], is being undertaken to find a solution to this problem.

It is, of course, possible to write a browser to understand more than one markup language, and so give authors much more flexibility as to which elements they can use to express concepts and ideas. In this way an XML–based browser can parse both generic XML and “know about” specific languages such as XHTML, MathML and SVG.
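As a rough illustration of the difference between parsing a mixed namespace document and “knowing about” the languages in it, the following Python sketch parses a small document and reports which elements belong to a language the processor recognises. The table of known namespaces, the function name and the http://example.org/foo namespace are assumptions made for this example; a real browser does far more than look up a namespace URI.

```python
import xml.etree.ElementTree as ET

# Namespaces this hypothetical processor "knows about".
KNOWN_NAMESPACES = {
    "http://www.w3.org/1999/xhtml": "XHTML",
    "http://www.w3.org/1998/Math/MathML": "MathML",
    "http://www.w3.org/2000/svg": "SVG",
}


def classify_elements(document: str) -> list[tuple[str, str]]:
    """Return (local name, language) pairs for every element, in document order."""
    result = []
    for element in ET.fromstring(document).iter():
        # ElementTree spells qualified names as "{namespace-uri}local-name".
        if element.tag.startswith("{"):
            namespace, local_name = element.tag[1:].split("}", 1)
        else:
            namespace, local_name = "", element.tag
        result.append((local_name, KNOWN_NAMESPACES.get(namespace, "unknown")))
    return result


# The foo namespace below is invented for the example.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:foo="http://example.org/foo">
  <body>
    <h1>Title</h1>
    <foo:bar>What does this mean?</foo:bar>
  </body>
</html>"""

for name, language in classify_elements(SAMPLE):
    print(name, language)
```

The parser handles <foo:bar> without complaint, but the lookup can only label it “unknown”: nothing in the document says how the element should be presented.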

A few more pitfalls exist if you want to use XHTML on the World Wide Web, as we shall see, but the language family can also be used for information storage purposes in other contexts, where it is processed by more specialised tools.

Summary XHTML is meant to make the use of XML–based languages in end–user applications such as browsers easy, but can also be used for various data processing and storage purposes in situations where the web is only one of several channels. XHTML takes advantage of the extensibility of XML to support multiple namespaces and, through them, multiple languages.

XHTML and the Content Type

When content is transmitted from a web server to a web browser by way of HTTP, a mechanism exists for identifying the type of data: the aptly named Content–Type[18] header. By reading this header, a browser can quickly determine how to deal with the content it is being sent — whether to render, process in some manner, or even offer a download prompt.

HTTP Content–Type values are Internet Media Types[19], and ought to be duly registered with the Internet Assigned Numbers Authority (IANA)[20]. By using this registry, browser makers can more easily add support for media types and be reasonably certain that their support is cross–browser.

In order to tell a browser that it is indeed XHTML we want it to deal with, the correct content–type should be set. Keep in mind that those browsers today which have the capability of handling XML have two parsers, one for XML and one for HTML. The browser uses the media type to decide which of the two to use.

An important aspect of the HTTP specification is that a Content–Type sent from the server is authoritative[21], which means that it will override constructions such as this one:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
   

For XHTML the appropriate content–type is application/xhtml+xml. Any data with this type is considered by the browser as XML, specifically XHTML, and treated according to XML rules — but only if the browser is able to parse XML in the first place, and able to “understand” XHTML. Not all do.
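The decision described above can be sketched in a few lines of Python. This is an illustration only, not how any particular browser is implemented: the function name and the set of XML media types are assumptions made for the example. Note that the function looks only at the HTTP header value, never at a meta element inside the document, since the header is authoritative.

```python
# Media types that an XML-capable user agent would hand to its XML parser.
# application/xml and text/xml are parsed as XML too, but without any
# knowledge of XHTML semantics.
XML_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}


def choose_parser(content_type_header: str) -> str:
    """Return 'xml', 'html' or 'other' for an HTTP Content-Type header value."""
    # Drop parameters such as charset, and compare case-insensitively.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    if media_type in XML_TYPES:
        return "xml"
    if media_type == "text/html":
        return "html"
    return "other"


print(choose_parser("application/xhtml+xml; charset=utf-8"))
print(choose_parser("text/html;charset=utf-8"))
```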

In order to ensure that authors could start using XHTML, get a feel for the changed rules, and investigate the possibilities of the new language — yet still be able to use the markup in current browsers — it was decided to allow use of the text/html content type.

This has a number of consequences: well–formedness errors will not be detected, HTML rules for applying styles and using the DOM will apply, and mixing of namespaces, i.e. having an XHTML document contain MathML or similar XML–based language, cannot be done without relying on browser–specific methods for handling “alien” markup embedded in HTML.

When using XHTML syntax rules in an HTML user agent, it is strongly recommended that you also follow the compatibility guidelines[22] laid out in the XHTML specification to avoid any nasty surprises. Most browsers handle XHTML as yet another form of HTML tagsoup so there shouldn’t be much of a problem.

Summary Browsers use the authoritative HTTP Content–Type header to determine what kind of content is being received. It is the value of this header which decides whether to use an XML or HTML parser on the data. XHTML, even 1.1, can be labelled as text/html if you accept that your page is treated as HTML, and not as XHTML.

Strictly XHTML

The idea that XHTML is “stricter” than HTML springs from the fact that, as an application of XML, documents written in XHTML are subject to XML processing rules, specifically the handling of “fatal errors”.

When a conforming XML processor detects a fatal error it, in standards terminology, “MUST NOT” continue processing in the normal way — i.e. there is an absolute prohibition on error recovery if a well–formedness violation is detected.

Several errors are considered fatal — violation of the well-formedness constraint[23] is the most commonly known of these. A number of other fatal errors exist related to the handling of entities. These are outside the scope of this article.

It is not a fatal error to have validity constraint violations, and so the handling of syntax errors other than those involving well–formedness is lenient, and such errors should, if requested, simply be reported to the user.
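The difference in error handling can be observed with the two parsers in the Python standard library. This is a sketch under stated assumptions: html.parser is a lenient tokenizer rather than a conforming HTML parser, and ElementTree is non-validating, but the contrast between stopping and carrying on is the same. The sample markup and the helper names are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Mis-nested markup: the em element is never closed before the paragraph ends.
MARKUP = "<p><em>unclosed emphasis</p>"


class TagCollector(HTMLParser):
    """Records every start tag the lenient HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def xml_accepts(markup: str) -> bool:
    """True if the markup is well-formed XML, False if a fatal error occurs."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


collector = TagCollector()
collector.feed(MARKUP)
collector.close()

print("HTML tokenizer saw:", collector.start_tags)
print("XML parser accepted it:", xml_accepts(MARKUP))
```

The HTML side reports both start tags and moves on; the XML side refuses the document outright, which is the behaviour the specification demands of a conforming processor.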

Another major difference exists between SGML–based HTML, and XML–based XHTML: in an SGML DTD it is possible to specify which elements should be excluded from appearing inside which other elements — illustrated by the definition of P and FORM below:

<!ELEMENT P - O (%inline;)*                  -- paragraph -->
<!ELEMENT FORM - - (%block;|SCRIPT)+ -(FORM) -- interactive form -->
   

In plain English this specifies that for a document to be valid, a P cannot contain anything but inline elements, and a FORM cannot contain another FORM. Such a prohibition is not possible in an XML DTD, and rules for which element can go in which must, consequently, be specified in prose[24], making situations like the following undetectable by formal validators and XML processors alike:

<form action="x">
 <div>
  <form action="y">
   <fieldset>
     <legend>z</legend>
   </fieldset>
  </form>
 </div>
</form>
   

The construct above would give an HTML validator the jitters, whilst an XML parser, even a validating one, wouldn’t bat an eyelash. Even so it should be noted that in XML Schema[25] such constraints as described above can be implemented.
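Since the prohibition exists only in prose, anything that wants to enforce it must do so with its own logic on top of the XML parser. The following Python sketch shows one way such a check might look; the function name is an assumption for the example, and the sample wraps the nested forms from above in a minimal XHTML document.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def nested_forms(document: str) -> int:
    """Count form elements that have another form element as an ancestor."""
    form_tag = f"{{{XHTML_NS}}}form"

    def walk(element, inside_form):
        is_form = element.tag == form_tag
        count = 1 if (is_form and inside_form) else 0
        for child in element:
            count += walk(child, inside_form or is_form)
        return count

    return walk(ET.fromstring(document), False)


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<form action="x">
 <div>
  <form action="y">
   <fieldset><legend>z</legend></fieldset>
  </form>
 </div>
</form>
</body></html>"""

print(nested_forms(SAMPLE))
```

The parser itself raises no error on the sample, since the document is perfectly well–formed; only the extra walk over the tree finds the nested form.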

On the other hand HTML does allow certain constructions which XHTML does not, such as omitted end–tags on certain non–empty elements, as well as on empty elements, making the following valid:

<p>here is a paragraph.<p>here is another paragraph.
<br><hr>
   

However: this form of markup is not sloppy, but rather precisely defined in the HTML 4.01 DTD as allowed per SGML rules, and as such cannot be considered “less strict”. In addition HTML does not allow incorrect nesting of elements, and makes the presence of a DOCTYPE mandatory.

With this knowledge in hand we can conclude that while XML processors have the requirement to stop on fatal errors, not all syntax violations in XHTML can be formally specified as they can in HTML, that strictly speaking an HTML document is an HTML document only when fully valid[26], and that XML documents may very well be syntactically invalid but remain XML documents.

For authors the important thing to note is that it is the processing application which is more, or less, strict; not the language itself — and that only when sent with the appropriate content–type will an XHTML document be handled by a conforming XML processor.

Summary When sent as text/html, your XHTML document will not be subject to the XML processing rules described above, and so the browser will not treat it any differently than if it were HTML.

Lack of support

Lack of support for XHTML is a fact of life on the web in 2008. Prior to the 3.0 series of Firefox the XHTML processor in Gecko was so poor that Mozilla’s own engineers recommended against it[27]; no version of Internet Explorer up to, and including, IE 8 supports XHTML at all, and a number of other browsers such as Lynx were never written to handle XML in the first place.

Several other user–agents, such as search engines[28], are in the same situation: no support for XML or XHTML.

As a side note it is actually possible to fool Internet Explorer up to at least version 7 into processing XHTML sent as application/xhtml+xml. The trick is to take advantage of what Microsoft refers to as "mimetype sniffing"[29], which in brief means that IE will override the content–type header under certain circumstances, for instance if the URI contains the string .html. An example of this can be seen by opening http://www.w3.org/International/tests/sec-ruby-markup-1.html which, despite the HTML–look–a–like extension in the URI, is served up as application/xhtml+xml (October 2nd 2008) and as such should cause a download prompt in standard versions of IE.

The end result is, however, the same. It is the HTML tagsoup parser in Internet Explorer which processes the markup, leaving at least this author to question the purpose of such an exercise.

It is also possible, albeit not recommended, to send XHTML documents as either application/xml or text/xml, but while this will enable XML–capable browsers without XHTML support to actually parse the document and even to an extent style it, constructs such as the following will not work:

<a href="URI">link text</a>
   

Simply parsing an XML–based language is not, as we see, enough for the browser to “understand” that the A–element represents a hyperlink. This can be partially alleviated by use of the XLink[30] module, which supplies a number of attributes such as xlink:href. By teaching the browser that the value of this attribute should be treated as the URI in a hyperlink, even generic XML–based languages can be given A–element equivalents, provided of course that the XML–based language uses the XLink module.
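A processor taught about XLink needs no knowledge of the surrounding vocabulary in order to find hyperlinks. The Python sketch below illustrates the idea; the note and see elements and the function name are invented for the example, while the namespace URI is the one defined for XLink.

```python
import xml.etree.ElementTree as ET

# The xlink:href attribute, spelled the way ElementTree reports it.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def find_links(document: str) -> list[tuple[str, str]]:
    """Return (element name, target URI) for every element carrying xlink:href."""
    root = ET.fromstring(document)
    return [
        (element.tag, element.attrib[XLINK_HREF])
        for element in root.iter()
        if XLINK_HREF in element.attrib
    ]


# A made-up generic XML vocabulary; only the xlink attributes are standard.
SAMPLE = """<note xmlns:xlink="http://www.w3.org/1999/xlink">
  <see xlink:type="simple" xlink:href="http://www.w3.org/TR/xlink/">the XLink specification</see>
  <a href="URI">link text</a>
</note>"""

print(find_links(SAMPLE))
```

The see element is recognised as a link through its attribute alone. The a element with a plain href is not, since outside XHTML nothing tells the processor what that attribute means.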

Summary In order to ensure that you support the widest possible range of browsers and other user–agents such as search engines, you should either consider using HTML 4.01 as text/html, or use content negotiation to serve either HTML or XHTML as appropriate.

Content–Negotiation

The idea of a web server negotiating with a user–agent in order to figure out which response will best serve the user is not new. It was introduced in HTTP 1.1 as early as 1999, but is woefully underutilised. This part of the article will focus on server–based content negotiation by way of the Accept HTTP request header.

In theory negotiating for HTML or XHTML is simple enough. Look at the HTTP Accept header. Determine which of text/html and application/xhtml+xml has the highest priority. Send the document in the correct markup language, transforming it if a version is not already cached.

In practice it is made much more difficult by the fact that several generations of Internet Explorer include “*/*” in the Accept string, thereby claiming they support absolutely everything. Giving proof for why this is patently absurd is left to the reader.

Luckily one can make the argument for disregarding “*/*”, and on close examination it becomes clear that the HTTP specification does not explicitly prohibit such behaviour, nor does it state authoritatively what to do in such a situation. Consequently we’ll take the pragmatic route, and forget about IE’s claim. The following algorithm, described by way of Perl, analyses an HTTP Accept header and returns either ’xhtml’ or ’html’ depending on which is judged best.

The following is a pragmatic approach to solving the “HTML or XHTML?” equation, and will not work as a generic Accept parser.

sub examineAccept {
 my $accept = shift() ;
 $accept = '' if ( !defined($accept) ) ;
 $accept =~ s#(\n|\r)##g ;
                              # ------------------------------------------- #
                              # This is the conservative default.           #
                              # ------------------------------------------- #
 my $default = 'html' ;

                              # ------------------------------------------- #
                              # If, at this spot, there is no XHTML         #
                              # explicitly mentioned, we return 'html',     #
                              # and vice versa.                             #
                              # ------------------------------------------- #
 return('html') if ( $accept !~ m#\Qapplication/xhtml+xml\E#i ) ;
 return('xhtml') if ( $accept !~ m#\Qtext/html\E#i ) ;

                              # ------------------------------------------- #
                              # We explicitly retrieve the Q-parameter      #
                              # for text/html and application/xhtml+xml.    #
                              # A missing Q-parameter counts as 1.0.        #
                              # ------------------------------------------- #
 my($html_quality) = $accept =~ m#text/html(?:\s*;\s*q\s*=\s*([0-9.]+))?#i ;
 if ( !defined($html_quality) ) {
  $html_quality = '1.0' ;
 }

                              # ------------------------------------------- #
                              # The "+" in the media type must be quoted,   #
                              # or the pattern looks for one or more "l"    #
                              # and never matches.                          #
                              # ------------------------------------------- #
 my($xhtml_quality) = $accept =~ m#\Qapplication/xhtml+xml\E(?:\s*;\s*q\s*=\s*([0-9.]+))?#i ;
 if ( !defined($xhtml_quality) ) {
  $xhtml_quality = '1.0' ;
 }

                              # ------------------------------------------- #
                              # If they are of equal weight, we return      #
                              # the default.                                #
                              # ------------------------------------------- #
 if ( $html_quality == $xhtml_quality ) {
  return($default) ;
 }

                              # ------------------------------------------- #
                              # If the Q-parameter of text/html is          #
                              # heavier, return 'html'                      #
                              # ------------------------------------------- #
 if ( $html_quality > $xhtml_quality ) {
  return('html') ;
 }

                              # ------------------------------------------- #
                              # And vice versa.                             #
                              # ------------------------------------------- #
 if ( $html_quality < $xhtml_quality ) {
  return('xhtml') ;
 }

                              # ------------------------------------------- #
                              # If, for some unfathomable reason, we        #
                              # arrive here, we return the default.         #
                              # ------------------------------------------- #
 return($default) ;
}
   

Once you have determined which markup language to send, you need to do a proper transformation of the content. Replacing the Content-Type and DOCTYPE isn’t really enough — you need to change the XHTML syntax into HTML. This, luckily, is a small task if you have not utilised any of XHTML’s special features, such as namespace mixing — and as long as you have not used the extended structures of XHTML 1.1 such as Ruby, or XHTML 1.2 such as ARIA or ACCESS.

For the simple case you need only replace all occurrences of /> with >, all selected="selected" with selected, and all checked="checked" with checked.

For the more complex case you will need to replace the new structures with equivalent HTML ones, which is often difficult since there is no Ruby or ACCESS support in HTML, or disregard them, thereby losing out on structure and semantics.

In either case it is recommended that you use XSLT transformations, and cache the separate versions so that processing time is not significantly impacted. The above described simple replacements are, in general, a fragile solution.
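For completeness, the simple replacements described above might be sketched as follows in Python. The function name is made up for the example, and one small liberty is taken: whitespace in front of /> is removed along with the slash. The sketch is every bit as fragile as the text warns, since it will also rewrite matching text inside comments, CDATA sections and attribute values.

```python
import re


def xhtml_to_html_naive(markup: str) -> str:
    """Apply the simple textual XHTML-to-HTML replacements; not a real transformation."""
    # <br /> and <br/> both become <br>.
    markup = re.sub(r"\s*/>", ">", markup)
    # Minimise the boolean attributes mentioned in the text.
    for name in ("selected", "checked"):
        markup = markup.replace(f'{name}="{name}"', name)
    return markup


print(xhtml_to_html_naive('<input type="checkbox" checked="checked" />'))
```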

Summary Content–negotiation can be a practical method by which to serve up HTML or XHTML depending on what is requested by the browser, but hinges on ignoring the special value */*.

Recommendations

Disclosure

The author is a member of the World Wide Web Consortium’s XHTML Working Group. This article is not endorsed by either the W3C or the Working Group, although experience and knowledge gained from both have gone into the authoring.

References

Title | Author | Date
A Brief History of the Development of SGML | SGML User’s Group | June 1990
Authoritative Metadata | Fielding, Roy T. et al. | April 2006
Assigned Numbers (STD 2, RFC 1700) | Reynolds, J.; Postel, J. | October 1994
As We May Think | Bush, Vannevar | July 1945
Cover Pages Technology Reports | Cover, Robin | July 2002
David Megginson’s SGML FAQ | Megginson, David | September 1998
Dropping the Normative Reference to SGML | Ray, Arjun | October 1999
Content–Type in HTTP 1.1 | Fielding, R. et al. | June 1999
HyperText Mark–up Language | Berners–Lee, Tim | November 1992
Extensible Markup Language | Bray, Tim; Sperberg–McQueen, C. M. | November 1996
HTML 4.01: Conformance | Raggett, Dave; Le Hors, Arnaud; Jacobs, Ian | December 1999
Media Type Registration Procedure (RFC 1590) | Postel, J. | November 1996
IE content–type logic | Gupta, Vishu | February 2005
Mozilla Web Developer FAQ: | Sivonen, Henri | May 2007
W3C Semantic Web Activity | Berners–Lee, Sir Tim et al. | July 2008
SGML | Berners–Lee, Tim | November 1992
4.9 SGML Exclusions in XHTML 1.0 | Pemberton, Steven | August 2002
well–formedness constraint in XML 1.0, 4th edition | Bray, Tim et al. | September 2006
XHTML 1.0 | Pemberton, Steven et al. | January 2000
XHTML 1.1 | McCarron, Shane; Ishikawa, Masayasu | February 2007
XHTML 2.0 | Axelsson, Jonny et al. | July 2006
XHTML 2 WG | W3C | September 2008
 | W3C | June 2001
XHTML in Search Engines | Dorward, David | February 2008
C. HTML Compatibility Guidelines | Pemberton, Steven et al. | January 2000
XHTML Media Types | 石川 雅康 (Ishikawa, Masayasu) | August 2002
XHTML Modularization 1.1 | Austin, Daniel et al. | July 2006

Document Information

First published: 3rd of October 2008
Last update: 6th of October 2008
Prerequisite: Knowledge of HTML
Author: Tina Holmboe
Maintained by: Tina Holmboe