Wednesday, June 23, 2021

An Overview of the W3C HTML5 Document Outliner Algorithm

HTML4’s use of div and heading tags to describe a document’s structure has many limitations. First, the div tag acts as a generic block-level division. That’s all well and good until you start nesting divs. Moreover, without a descriptive ID or class attribute, there’s no way to know whether a given div’s function is primarily one of presentation or of semantics.

With regard to headings, it is not possible to describe a subtitle or secondary title in HTML4. And since every section is part of the document outline, there is no way to define a section containing information related to the site as a whole, like logos, menus, tables of contents, etc. HTML5 introduces several new elements that describe the structure of a web document with standard semantics.

This article specifically focuses on HTML5’s Header and Section elements and describes how to use them to define the desired outline for your documents.

Why New Tags?

The definitive new structural elements include header, hgroup, article, section, aside, footer, and nav. Others have still not solidified their standing in the spec. As new tags, they are meant to complement the existing div and heading tags, not replace them. Their role is to help organize our page content according to what the content is. Hence, it is not about “where” the content goes on the page, but rather “what” its relationship is to other page content.

When dealing with existing content, start with a bird’s eye view of the page and gradually work your way to more internal content. A good starting place is to divide content by sections and headers.

The Document Outline

Traditionally, a document’s outline has been defined by its section headings, whereby the first element of heading content (an h1 to h6 or hgroup tag) in an element of sectioning content represents the heading for that section. Subsequent headings of equal or higher rank start new (implied) sections, while headings of lower rank start implied subsections that are part of the previous one. In both cases, the element represents the heading of the implied section.
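The implied-section rule can be sketched in a few lines of JavaScript. This is a simplified illustration, not the W3C algorithm itself: it looks only at heading ranks and ignores sectioning elements entirely. The function names and the sample headings are hypothetical.

```javascript
// Simplified sketch: builds an outline from headings alone (no <section> handling).
// Each heading is { rank, text }; rank 1 stands for <h1>, rank 2 for <h2>, and so on.
function buildImpliedOutline(headings) {
  const root = { rank: 0, text: null, children: [] };
  const open = [root]; // currently open sections, outermost first

  for (const { rank, text } of headings) {
    // A heading of equal or higher rank (same or smaller number) closes
    // every open section it cannot be nested inside.
    while (open[open.length - 1].rank >= rank) {
      open.pop();
    }
    const section = { rank, text, children: [] };
    open[open.length - 1].children.push(section); // lower rank: implied subsection
    open.push(section);
  }
  return root.children;
}

// Flatten the outline into indented, numbered lines for display.
function formatOutline(sections, indent = '') {
  return sections.flatMap((section, i) => [
    `${indent}${i + 1}. ${section.text}`,
    ...formatOutline(section.children, indent + '  '),
  ]);
}

// Hypothetical sample data, purely for illustration.
const sampleHeadings = [
  { rank: 1, text: 'News' },
  { rank: 2, text: 'World' },
  { rank: 3, text: 'Europe' },
  { rank: 2, text: 'Sports' },
  { rank: 1, text: 'About' },
];
console.log(formatOutline(buildImpliedOutline(sampleHeadings)).join('\n'));
```

Here the second h2 closes both the h3 and the first h2 before becoming a sibling of "World", and the final h1 starts a new top-level section.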

Sounds simple enough. However, the inclusion of the new section tag complicates matters a bit, because the rank of a section depends partially on how the tag is used. For instance, section elements are always considered subsections of either their nearest ancestor sectioning root (such as the body tag) or their nearest ancestor section element, depending on which of the two is closest. This explicit outline also supersedes any implied sections other headings may have created. The lesson here is that it’s best to explicitly wrap sections in elements of sectioning content, and not just rely on the implicit sections generated by the headings.

As a general rule, the W3C strongly encourages authors either to use only h1 elements, or to use headings of the appropriate rank for each section’s nesting level.
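To see why using only h1 elements can work, consider this rough sketch, in which the outline level comes from how deeply a section is nested rather than from the rank of its heading. The tree shape, the function name, and the sample page are all hypothetical, and the sketch assumes every section carries exactly one heading.

```javascript
// Hypothetical tree shape: { heading: 'text', sections: [ ...nested sections ] }.
// Every heading here could be an <h1>; only the nesting decides the outline level.
function outlineExplicitSections(section, depth = 0) {
  const line = '  '.repeat(depth) + section.heading;
  const nested = section.sections || [];
  return [
    line,
    ...nested.flatMap((child) => outlineExplicitSections(child, depth + 1)),
  ];
}

// Hypothetical page: a body with two sections, one of which contains another.
const samplePage = {
  heading: 'My Site',
  sections: [
    { heading: 'News', sections: [{ heading: 'Sports' }] },
    { heading: 'About' },
  ],
};
console.log(outlineExplicitSections(samplePage).join('\n'));
```

Since heading rank never enters the picture, explicitly sectioned content keeps its place in the outline even when it is moved to a different nesting level.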

To help you get your document outline right, there is an implementation of the W3C’s outline algorithm in the HTML5 outliner (h5o) on GitHub. It is available as a Chrome extension, a bookmarklet (with a limited version for IE), a very early experimental Firebug extension, and a minified JavaScript file. I applied the JS version to the home page by calling it from the window.onload event handler as shown here:

Shared Web Workers Help Spread the News
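A call of that kind might look like the following sketch. It rests on assumptions: that the minified h5o script has been loaded and exposes a global HTML5Outline(root) function returning an object with an asHTML(createLinks) method, and that the page contains an element with the ID "outline" to receive the result. The showOutline helper and that ID are hypothetical names, so check the h5o documentation for the exact API.

```javascript
// Assumption: outliner.min.js from the h5o project is loaded and exposes
// a global HTML5Outline(root) function. showOutline and the "outline"
// element ID are hypothetical names used only in this sketch.
function showOutline(outliner, root, container, createLinks) {
  const outline = outliner(root);
  // asHTML(createLinks) is assumed to return the outline as an <ol> string.
  container.innerHTML = outline.asHTML(createLinks);
}

// Only wire up the handler when running in a browser with h5o available.
if (typeof window !== 'undefined' && typeof HTML5Outline === 'function') {
  window.onload = function () {
    showOutline(
      HTML5Outline,
      document.body,
      document.getElementById('outline'),
      true // ask for links to the sections
    );
  };
}
```

Passing the outliner in as a parameter keeps the helper easy to try out with a stand-in function before the real library is on the page.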

The boolean argument to asHTML() tells the function whether or not we want links to the sections in the document. Here is a portion of the HTML produced (I added the indentations):

<ol>
  <li><a href="#h5o-1"><em>No text content inside H3</em></a></li>
  <li><a href="#h5o-2">Two dead in Virginia Tech shooting, suspect on loose</a></li>
  <li><a href="#h5o-3">Attawapiskat consultant to be paid $180,000</a></li>
  <li><a href="#h5o-4">Gallery: Actors with the most bang for the buck</a></li>
  <li><a href="#h5o-5">Deer and ram in love pose ethical dilemma</a></li>
  <li><a href="#h5o-6">Gallery: Crazy cool Christmas lights</a></li>
  <li><a href="#h5o-7">Virginia Tech</a></li>
  <li><a href="#h5o-8">Attawapiskat</a></li>
  <li><a href="#h5o-9">Profitable Actors</a></li>
  <li><a href="#h5o-10">Science</a></li>
  <li><a href="#h5o-11">Christmas lights</a><ol>
    <li><a href="#h5o-12">Headlines</a></li>
    <li><a href="#h5o-13">Canal Killings</a></li>
  </ol></li>
  <li><a href="#h5o-14">'My children did a lot of cruelty toward me': Shafia</a><ol>
    <li><a href="#h5o-15">Virginia Tech</a></li>
  </ol></li>
  <li><a href="#h5o-16">2 dead in Virginia Tech shooting, suspect on loose</a><ol>
    <li><a href="#h5o-17">Albert Pujols</a></li>
  </ol></li>
  <li><a href="#h5o-18">Albert Pujols heading to Angels</a><ol>
    <li><a href="#h5o-19"><em>No text content inside H3</em></a></li>
  </ol></li>
  <li><a href="#h5o-20">Markets drop as ECB disappoints</a><ol>
    <li><a href="#h5o-21">Today's Photos &raquo;</a></li>
    <li><a href="#h5o-22">Popular Links</a></li>
    <li><a href="#h5o-23">The Daily Bright &raquo;</a></li>
    <li><a href="#h5o-24">MythBusters</a></li>
  </ol></li>
  <li><a href="#h5o-25">Video: Cannonball hits home in 'MythBusters' TV shoot</a><ol>
    <li><a href="#h5o-26">Border deal</a></li>
  </ol></li>
  <li><a href="#h5o-27">What Canada-U.S. border deal means</a><ol>
    <li><a href="#h5o-28">Television</a></li>
  </ol></li>
  <li><a href="#h5o-29">Will Ryan Seacrest replace Matt Lauer?</a><ol>
    <li><a href="#h5o-30">Metallica</a></li>
  </ol></li>
</ol>

…which renders the following in a browser:

  1. No text content inside H3
  2. Two dead in Virginia Tech shooting, suspect on loose
  3. Attawapiskat consultant to be paid $180,000
  4. Gallery: Actors with the most bang for the buck
  5. Deer and ram in love pose ethical dilemma
  6. Gallery: Crazy cool Christmas lights
  7. Virginia Tech
  8. Attawapiskat
  9. Profitable Actors
  10. Science
  11. Christmas lights
    1. Headlines
    2. Canal Killings
  12. ‘My children did a lot of cruelty toward me’: Shafia
    1. Virginia Tech
  13. 2 dead in Virginia Tech shooting, suspect on loose
    1. Albert Pujols
  14. Albert Pujols heading to Angels
    1. No text content inside H3
  15. Markets drop as ECB disappoints
    1. Today’s Photos »
    2. Popular Links
    3. The Daily Bright »
    4. MythBusters
  16. Video: Cannonball hits home in ‘MythBusters’ TV shoot
    1. Border deal
  17. What Canada-U.S. border deal means
    1. Television
  18. Will Ryan Seacrest replace Matt Lauer?
    1. Metallica

Don’t be surprised if the links don’t do anything. The content that they refer to is not in this document, and even if it were, the function is unobtrusive in that it only links to IDs that already exist in the document. Hence, if the Metallica section possessed an ID attribute, the link would point to it. Since it does not, the H5O algorithm generates its own, but does not insert it into the DOM. The generated link ID takes the form 'h5o-' + (++linkCounter), giving the Metallica section an ID of h5o-30.

The inclusion of links certainly is a great tool for generating a table of contents. Just be sure to assign IDs to each section if you want to save yourself a bit of work.
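The link-target rule described above can be sketched as follows. This is not h5o’s actual source, only an illustration of the behavior as described here: reuse a section’s existing ID if it has one, otherwise build one from a running counter. The createLinkIdGenerator name and the { id } section shape are hypothetical.

```javascript
// Sketch of the rule described above, not h5o's real code.
// A section is modeled as a plain object with an optional id property.
function createLinkIdGenerator() {
  let linkCounter = 0;
  return function linkIdFor(section) {
    if (section.id) {
      return section.id; // an existing ID is used as the link target
    }
    return 'h5o-' + (++linkCounter); // otherwise generate h5o-1, h5o-2, ...
  };
}

const linkIdFor = createLinkIdGenerator();
console.log(linkIdFor({}));                  // generated ID
console.log(linkIdFor({ id: 'metallica' })); // existing ID is reused
```

In this sketch the counter advances only when an ID has to be generated, which is one more reason to assign your own IDs if you want stable link targets.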

Headerless Sections

A section that does not contain a child heading is labeled as an “Untitled” section so that the outline is still preserved, as seen in the code below:

<h1>Shared Web Workers Help Spread the News</h1>
<section>
<p>After being a fixture in languages like Java for years, Web Workers have now made multi-threading in Web applications a reality. Right now, they are supported as of Opera 10.6, Safari 4.0, Chrome 11.0, Firefox 4.0 and are expected to be included in IE 10&hellip;</p>
</section>
<h2>The Difference between the Two</h2>

…which renders the following in a browser:

  1. Shared Web Workers Help Spread the News
    1. Untitled SECTION
    2. The Difference between the Two
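The placeholder labels seen in these outlines ("Untitled SECTION" here, and "No text content inside H3" in the earlier listing) suggest a fallback rule along the lines of the sketch below. The sectionLabel function and the object shape are hypothetical, and the real outliner may decide differently in edge cases.

```javascript
// Hypothetical section shape:
//   { tagName: 'section', heading: { tagName: 'h3', text: '...' } }
// where heading is null or missing when the section has no heading at all.
function sectionLabel(section) {
  if (!section.heading) {
    // No heading element: fall back to the sectioning element's name.
    return 'Untitled ' + section.tagName.toUpperCase();
  }
  const text = section.heading.text.trim();
  if (text === '') {
    // A heading exists but has no text, e.g. it only wraps an image.
    return 'No text content inside ' + section.heading.tagName.toUpperCase();
  }
  return text;
}

console.log(sectionLabel({ tagName: 'section', heading: null }));
```

Either way, the section keeps its slot in the outline, so the numbering of the sections that follow is not disturbed.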

Rules for Header Groups

The outliner will disregard all headings within an hgroup except for the one with the highest rank. For example, if an hgroup contains an <h1>, an <h2>, and an <h3>, only the <h1>’s text will be used as the section title in the outline. Thus:

<hgroup><h1>News</h1>
<h2>It's all about the music.</h2></hgroup>
<p>Rob voted best guitarist of 2011&hellip;by his wife.</p>

…would produce the following outline:

  1. News
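The selection rule for header groups boils down to picking the heading with the highest rank. Here is a small sketch, assuming each heading is represented as a { rank, text } object where rank 1 stands for h1. The hgroupTitle name is hypothetical, and letting the first heading win a tie is an assumption of this sketch.

```javascript
// Returns the text of the highest-ranked heading in a group, or null if
// the group is empty. Rank 1 (<h1>) is the highest rank, so the smallest
// number wins; on a tie the earlier heading is kept (an assumption).
function hgroupTitle(headings) {
  if (headings.length === 0) {
    return null;
  }
  return headings.reduce((best, h) => (h.rank < best.rank ? h : best)).text;
}

console.log(hgroupTitle([
  { rank: 1, text: 'News' },
  { rank: 2, text: "It's all about the music." },
]));
```

The lower-ranked headings still render on the page as usual; they are simply left out of the outline.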

For more information, please visit the W3C markup specification.

Limitations of HTML5 Document Sectioning

One thing that HTML5 does not include is a mechanism for adding new semantic information to a document as required. So for the time being, we’ll have to make the current set of new tags work for us. Hopefully, these will eventually evolve into a quasi-language of their own, much like CSS did. If and when that happens, HTML5 document sections will live up to their promise of doing for page semantics what CSS did for the look of our web pages.

Robert Gravelle
Rob Gravelle resides in Ottawa, Canada, and has been an IT guru for over 20 years. In that time, Rob has built systems for intelligence-related organizations such as Canada Border Services and various commercial businesses. In his spare time, Rob has become an accomplished music artist with several CDs and digital releases to his credit.
