This is a legacy document, retained on the site in order to avoid link rot. The content is likely no longer (a) accurate, (b) representative of the views and philosophies of current site management, or (c) up to date.

Markup vs Tag soup


Toby and Arjun posted to ciwah

Where it started

 Anonymous wrote:
 |
 | Toby Speight wrote:
 | > You're falling into the
 | > embedded-formatting-commands model of HTML,
 | > which causes all sorts of confusion.
 |
 | (The what? please explain!)

It is the idea that a tag constitutes a command or directive, or in general conveys semantic information by itself. It doesn't.

A tag is just a marker, like a parenthesis: its role is syntactic only, to group with its matching parenthesis the contents in between, where the block as a whole is given a name for semantic purposes.

Such a name, called the generic identifier, is included in the start-tag because it's syntactically convenient to do so, but this name is not a property of the tag. It's a property of the element for which the tag serves as a delimiter. In other words, all semantics apply to elements: the tags merely locate where these elements are.

Similarly, the scope of the element determines the scope of the semantics to be applied, so proper semantic processing depends critically on such scopes being identifiable.

Markup «--» Tag soup

The syntax to scope an element has three parts: a start, content, and an end. The content in turn consists of other elements (also with generic identifiers as "hooks" to semantics) and possibly text. The elements form a containment hierarchy that is actually a tree, as a data structure, with all text data at the leaf nodes. In fact, a document with SGML markup is no more than a linearized representation of such a tree, where all text is embedded in markup.

In terms of semantic information content, these representations are exactly equivalent:

 1. <HTML><HEAD><TITLE>Example</TITLE></HEAD><BODY><H1>Hello World!</H1></BODY></HTML>
 2. <HTML><HEAD><TITLE>Example</></><BODY><H1>Hello World!</></></>
 3. ((HTML)((HEAD)((TITLE) (Example)))((BODY)((H1) (Hello World!))))

#1 is a normalized form with all omissible tags included.

#2 happens to be valid HTML.

#3 makes the tree more evident at the expense of inconvenient syntax.

But #1 is best understood in terms of #3.
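As a rough illustration of that equivalence, here is a minimal Python sketch that rebuilds the tree from form #1 and writes it back out in the parenthesized style of form #3. The names `Node`, `TreeBuilder`, `parse` and `to_parens` are made up for this example, and the parser used is Python's `html.parser`, which is a lenient tag scanner rather than an SGML parser; the sketch assumes fully normalized, well-nested input and would not cope with form #2.

```python
from html.parser import HTMLParser


class Node:
    """One element: a generic identifier plus ordered children (Nodes or text)."""

    def __init__(self, name):
        self.name = name
        self.children = []


class TreeBuilder(HTMLParser):
    """Rebuilds the element tree from fully normalized markup (form #1)."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = Node(tag.upper())  # html.parser reports names in lower case
        if self.stack:
            self.stack[-1].children.append(node)
        else:
            self.root = node
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumption: input is well nested, so an end-tag closes the innermost element.
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text always lands at a leaf, inside whichever element is open.
        if self.stack and data.strip():
            self.stack[-1].children.append(data)


def parse(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def to_parens(node):
    """Serialize the tree in the parenthesized style of form #3."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(" (" + child + ")")
        else:
            parts.append(to_parens(child))
    return "((" + node.name + ")" + "".join(parts) + ")"


FORM_1 = ("<HTML><HEAD><TITLE>Example</TITLE></HEAD>"
          "<BODY><H1>Hello World!</H1></BODY></HTML>")
print(to_parens(parse(FORM_1)))
```

The tags never make it into the tree: they only tell the builder where each element starts and stops.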

For further info on this, see

Contrast this paradigm, all text embedded in markup, with another in which all markup is embedded in text instead. Here, each markup construct is intended to function by itself, as an independent unit of semantic information. The data structure underlying this is a linked list with two types of nodes, text and markup, in arbitrary order. A linearized representation is basically trivial; semantic processing is similarly well suited to "stream mode": do something one tag at a time, and no tag, no action.
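A minimal sketch of that second paradigm, in Python, might look like the following. The tag names `IndentIn`, `Bullet` and `IndentOut` and the functions `tokenize` and `render` are purely illustrative assumptions, not anything a real browser implements.

```python
import re

# A tag is anything between '<' and '>'; everything else is text.
TOKEN = re.compile(r"<([^<>]+)>|([^<]+)")


def tokenize(source):
    """Flatten the source into a list of ("markup", name) / ("text", data) nodes."""
    nodes = []
    for match in TOKEN.finditer(source):
        name, text = match.groups()
        if name is not None:
            nodes.append(("markup", name))
        else:
            nodes.append(("text", text))
    return nodes


def render(nodes):
    """Stream mode: act on one tag at a time, keeping only a little running state."""
    depth = 0
    bullet = False
    lines = []
    for kind, value in nodes:
        if kind == "text":
            prefix = "* " if bullet else ""
            lines.append("  " * depth + prefix + value)
            bullet = False
        elif value == "IndentIn":
            depth += 1
        elif value == "IndentOut":
            depth -= 1
        elif value == "Bullet":
            bullet = True
        # Any other tag: no action.
    return lines


print(render(tokenize("<IndentIn><Bullet>Item 1<Bullet>Item 2<IndentOut>")))
```

Nothing here knows or cares whether the tags pair up; the list is simply walked from front to back.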

It is possible to process the linear representation of a tree one tag at a time (if only to reconstruct the tree). But lists do not correspond to trees in general, and the mere fact of a linear representation neither cancels or invalidates the tree nor mandates a tag-at-a-time approach.

Markup syntax

The surface syntax of HTML doesn't clarify which paradigm should apply. Either one could. For instance, it could be argued that all tags are between '<' and '>', and end-tags are distinguished by the presence of a cancellation operator, '/', so that </FOO> parses as…

 </FOO> :: < + /FOO + >

…or…

 </FOO> :: {'<' {'/' 'FOO'} '>'}

…neither of which reflects the correct parse…

 </FOO> :: </ + FOO + >

There is plenty of rubbish like this on the Web; it seems to be a theory many people are prone to fall into. More on this below.
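To make the correct parse concrete: in SGML's reference concrete syntax the delimiters are named STAGO (`<`, start-tag open), ETAGO (`</`, end-tag open) and TAGC (`>`, tag close). The following Python sketch splits a bare tag along those lines. The function `split_tag` is a made-up helper, and it deliberately ignores attributes and the other tag forms SGML allows.

```python
import re

# Delimiters of the SGML reference concrete syntax.
STAGO = "<"    # start-tag open
ETAGO = "</"   # end-tag open: one delimiter, not '<' followed by an operator
TAGC = ">"     # tag close

# ETAGO is listed first so that "</" is never split into "<" plus "/FOO".
TAG = re.compile(r"(</|<)([A-Za-z][A-Za-z0-9.-]*)(>)")


def split_tag(tag):
    """Return (opening delimiter, generic identifier, closing delimiter)."""
    match = TAG.fullmatch(tag)
    if match is None:
        raise ValueError("not a bare start-tag or end-tag: " + repr(tag))
    return match.groups()


print(split_tag("</FOO>"))
print(split_tag("<FOO>"))
```

Either way the generic identifier comes out as plain `FOO`; the slash belongs to the delimiter, not to the name.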

The fact is that the only existing formal specification for HTML identifies the correct paradigm: an HTML document is a tree of elements with text, not a list of tags and text. But there's another giveaway, which has to do with characteristic usage in the two paradigms.

With a tree of elements, the structural relation of containment -- an essentially descriptive function -- is built into the syntax, and the semantics come in as a late binding of element names to procedures.
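A small Python sketch may help show what late binding means here. The tree below only describes containment; which procedure runs for each element name is decided afterwards by a lookup table, and swapping the table changes the result without touching the document. The element names `List` and `Item`, the tuple layout, and the two binding tables are all assumptions made for the example.

```python
# The document: a tree of (generic identifier, children); children are trees or text.
DOCUMENT = ("List", [("Item", ["Item 1"]), ("Item", ["Item 2"])])

# Two alternative bindings of element names to procedures.
BULLETED = {
    "List": lambda kids: "\n".join(kids),
    "Item": lambda kids: "* " + "".join(kids),
}
NUMBERED = {
    "List": lambda kids: "\n".join(f"{i}. {kid}" for i, kid in enumerate(kids, 1)),
    "Item": lambda kids: "".join(kids),
}


def render(tree, bindings):
    """Process the children first, then apply whatever is bound to the element name."""
    name, children = tree
    rendered = [
        child if isinstance(child, str) else render(child, bindings)
        for child in children
    ]
    return bindings[name](rendered)


print(render(DOCUMENT, BULLETED))
print(render(DOCUMENT, NUMBERED))
```

The numbered variant only works because `List` gets to see all of its items at once, which is exactly what the containment structure provides.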

The names are just identifiers, and so by and large they tend to be nouns. With a list of tags, there's no further structure beyond the sequencing, and so the names -- in terms of what they're supposed to convey -- are typically verbs.

This is a characteristic difference…

…between descriptive markup such as…

 <List>
  <Item>Item 1</Item>
  <Item>Item 2</Item>
 </List>

…and procedural markup such as…

 <IndentIn>
  <Bullet>Item 1
  <Bullet>Item 2
 <IndentOut>

Unfortunately, HTML doesn't help much here. Many of its elements have utterly impenetrable names, such as UL, OL, LI, TR, TD, DL, DT, DD, and so on.

For all anyone might care to know, UL could easily be Indent in Sanskrit, and LI could be Bullet in Swahili. But, in general, reading the HTML specification should make it clear that HTML is largely about nouns, not verbs. But as most people don't bother with specs…

Ahem. Sanskrit? Swahili? Why not Mosaic?

It so happens that the current crop of browsers are procedural markup processors: their MO is basically one tag at a time, with contortions to handle problems such as two-pass parsing of tables (so it's no surprise that they make heavy weather of getting such things right).

Faced with figuring out what UL means, someone might just look at what Mosaic did with it, and conclude that <UL> was a command that Mosaic seemed to obey dutifully, and consistently, in the sense that the same observable result appeared regardless of context. In fact, it was probably extremely fortuitous that HTML elements were named so obscurely, because there's nothing obviously wrong in believing that…

 <UL>
  Indented stuff
 </UL>

…is actually just computerese for…

 <IndentIn>
  Indented stuff
 <IndentOut>

…whereas something like…

 <List>
  Indented stuff
 </List>

…might just give pause for thought about the plain meanings of words.


When Toby mentioned the embedded formatting commands model, he was referring to the tendency to think of tags as (angle brackets around) verbs. It's a theory; obscure names help it along; it seems superficially valid, since the Mosaic spawn have precisely that implementation strategy; but the specs say otherwise.

Aside: IMHO, the prospects for good CSS implementations in the Mosaic spawn are poor. Stylesheets also constitute a kind of late binding of properties to elements, and the inheritance model relies on that in critical ways. But the Mosaic spawn prefer to deal with tags one at a time, in isolation, which is why their tag-salad parsers quite often have to be helped along with explicit end-tags. No surprise there.
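To see why inheritance leans on the tree, consider this deliberately simplified Python sketch. It models only the "ask the enclosing element" part of inheritance, for properties that inherit at all, and has nothing to say about selectors or the cascade. The `Element` class and `inherited_value` function are assumptions made for illustration.

```python
class Element:
    """An element that knows its parent and any properties bound directly to it."""

    def __init__(self, name, parent=None, style=None):
        self.name = name
        self.parent = parent
        self.style = style or {}


def inherited_value(element, prop, default=None):
    """Walk up the containment hierarchy until some ancestor supplies the property."""
    node = element
    while node is not None:
        if prop in node.style:
            return node.style[prop]
        node = node.parent
    return default


body = Element("BODY", style={"color": "navy"})
ul = Element("UL", parent=body)
li = Element("LI", parent=ul, style={"font-style": "italic"})

print(inherited_value(li, "color"))
print(inherited_value(li, "font-style"))
```

Without reliable element scopes there is no parent to ask, and a processor that only ever sees one tag at a time has nothing to walk up.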

 Anonymous wrote:
 | As I asked in another post, what about:
 |
 | ..bla<i>text1<b>text2</i>text3</b>bla..
 |
 | How does that survive your cute little
 | "nesting" theory?

Well, it isn't Toby's nesting theory. It's the paradigm of hierarchic structure in all SGML applications. The trouble is in confusing "it works for me" with "it's what I meant", which really was…

 ..bla<i-on>text1 <b-on>text2<i-off>text3<b-off>bla..
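The two readings of that fragment can be put side by side in a short Python sketch. `is_well_nested` asks the tree question (does every end-tag close the innermost open element?), while `toggle_styles` does what the on/off reading implies. Both functions, and the simplified `tokenize` that feeds them, are assumptions made for this example; no browser's actual parser is being reproduced.

```python
import re

# "</" is matched as a single delimiter ahead of "<"; attributes are not handled.
TOKEN = re.compile(r"(</|<)([A-Za-z][A-Za-z0-9-]*)>|([^<]+)")


def tokenize(source):
    tokens = []
    for match in TOKEN.finditer(source):
        delimiter, name, text = match.groups()
        if name is None:
            tokens.append(("text", text))
        elif delimiter == "</":
            tokens.append(("end", name))
        else:
            tokens.append(("start", name))
    return tokens


def is_well_nested(tokens):
    """Tree reading: each end-tag must close the innermost open element."""
    stack = []
    for kind, value in tokens:
        if kind == "start":
            stack.append(value)
        elif kind == "end":
            if not stack or stack.pop() != value:
                return False
    return not stack


def toggle_styles(tokens):
    """Tag-at-a-time reading: <x> switches x on, </x> switches it off."""
    active = set()
    runs = []
    for kind, value in tokens:
        if kind == "start":
            active.add(value)
        elif kind == "end":
            active.discard(value)
        else:
            runs.append((value, sorted(active)))
    return runs


FRAGMENT = "..bla<i>text1<b>text2</i>text3</b>bla.."
print(is_well_nested(tokenize(FRAGMENT)))
print(toggle_styles(tokenize(FRAGMENT)))
```

The toggle reading happily produces styled runs, which is the "it works for me" part; the nesting check fails, because as a description of elements the fragment has no tree behind it.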




Jan Roland Eriksson