Automatic TOC Generation


Mihai Bazon


Many one-page HTML documents need a simple Table of Contents (abbreviated to TOC for the rest of this article). A hand-written TOC becomes difficult to maintain over time. To have an automatically generated TOC I see two solutions:

  • writing the document using a WYSIWYG editor that can generate a TOC. This has the unacceptable disadvantage that the generated code is usually atrocious and in some cases does not even validate.

  • writing the document without a TOC and having the TOC generated automatically through the wonders of JavaScript and the DOM.

In this article we will discuss the second method. So the motto is: let the browser do it.

What we get

First, let's just assume that we already have the script and see how we are going to use it. Basically, we want to be able to write the BODY tag like this:

<body onload="generate_TOC('toc');">

And somewhere inside the BODY content, where we want the TOC to appear, we should have the following:

<div id="toc"></div>

This DIV will be known as the parent of the TOC.

With this setup, when a browser which has proper support for JavaScript and the DOM loads our page, it will automatically generate the TOC inside the parent.

Problem overview

The generate_TOC function should walk through the document and remember all the headlines (<Hx> tags where x is 1, 2, …, 6). For each headline it should extract the text inside, create a link which has the headline as its target and add the link to the parent DIV.

The script will not handle styling. Styling can, and should, be done using external CSS, knowing that the parent DIV has the ID "toc" and a few other small details that we will discuss later.

Tasks

Here is the complete definition of tasks involved in TOC generation:

  1. Retrieving text. We will create a function that retrieves the text from an element. While this function should work with any element type, we will only use it for retrieving text from headlines.

  2. Finding the headlines. We will create a list of objects that contain a reference to the headline element, the text inside it and an integer variable representing the TOC item level (for proper indentation).

  3. Creating the TOC. For all items retrieved at step 2 we will:

    • Check if the headline element has an ID assigned. If not we will add an automatically generated ID. This will allow us to create a link that targets the headline.

    • Create a DIV element (D) that has the class "levelx", where x is the level of the TOC element (1, 2, ..., 6). This will help us style the TOC using CSS.

    • Create a link element that displays the text retrieved at step 2 for the current item and has the ID of the current headline as its HREF (remember, if the ID did not exist then we have generated it). Append this link to the DIV created above.

    • Finally, append the (D) DIV into the TOC's parent DIV.

The Code

Now, for the most interesting part of this article, the code.

The text retrieval function

For retrieving text from a simple heading like "<H2>Overview</H2>" we could use simple code like "text = element.firstChild.data;". However, we want to be able to retrieve text even from complicated headings, like:

<H2><b>This</b> is a
<em>complicated</em>
<u>heading</u></H2>

Such headings may contain more than simple text, so we need a recursive function that walks through all child nodes and accumulates every piece of text it finds. The code is given below.

function H_getText(el) {
  var text = "";
  for (var i = el.firstChild; i != null; i = i.nextSibling) {
    if (i.nodeType == 3 /* Node.TEXT_NODE, IE doesn't speak constants */)
      text += i.data;
    else if (i.firstChild != null)
      text += H_getText(i);
  }
  return text;
}

The only parameter to this function, el, is a reference to the HTML element from which we need to extract the text. We can get one, for instance, using document.getElementById().
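To see the recursion at work outside a browser, here is a minimal sketch that runs H_getText on plain objects shaped like DOM nodes. The helpers textNode and elementNode are hypothetical stand-ins invented for this demo; in a real page the nodes come from the document itself.

```javascript
// Hypothetical helpers: plain objects that mimic the few DOM properties
// H_getText relies on (nodeType, data, firstChild, nextSibling).
function textNode(data) {
  return { nodeType: 3, data: data, firstChild: null, nextSibling: null };
}
function elementNode(tagName, children) {
  var node = { nodeType: 1, tagName: tagName, firstChild: null, nextSibling: null };
  for (var i = 0; i < children.length; ++i) {
    if (i == 0) node.firstChild = children[i];
    else children[i - 1].nextSibling = children[i];
  }
  return node;
}

// Same function as in the article.
function H_getText(el) {
  var text = "";
  for (var i = el.firstChild; i != null; i = i.nextSibling) {
    if (i.nodeType == 3 /* Node.TEXT_NODE */)
      text += i.data;
    else if (i.firstChild != null)
      text += H_getText(i);
  }
  return text;
}

// <H2><b>This</b> is a <em>complicated</em> <u>heading</u></H2>
var h2 = elementNode("H2", [
  elementNode("B", [textNode("This")]),
  textNode(" is a "),
  elementNode("EM", [textNode("complicated")]),
  textNode(" "),
  elementNode("U", [textNode("heading")])
]);

var result = H_getText(h2); // "This is a complicated heading"
```

The markup is dropped and only the text pieces are concatenated, in document order.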

A simple JavaScript object

JavaScript arrays are very powerful tools. However, they can only contain one value for each element. If we need to store more values inside, say, a[1], then we need a function that creates a new object containing all of these values and then we can store that object (or more correctly, a reference to it) in a[1].

Note that another good approach would be to just store another Array object, instead of a customized object. But using an object customized for our problem is more elegant, not to mention allowing better code readability.

Our object needs three properties: a reference to the headline element (element), the text inside it (text) and the item's TOC level (level). The code needed to create it is below.

function TOC_EL(el, text, level) {
  this.element = el;
  this.text = text;
  this.level = level;
}
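As a small illustration of why the customized object reads better than a nested array, the sketch below stores the same data both ways. The fakeHeading object is only a stand-in for a real headline element, assumed for this demo.

```javascript
function TOC_EL(el, text, level) {
  this.element = el;
  this.text = text;
  this.level = level;
}

// A plain object stands in for a real <h2> element (assumption for the demo).
var fakeHeading = { tagName: "H2", id: "overview" };

// Store one TOC item as a customized object...
var items = [];
items[items.length] = new TOC_EL(fakeHeading, "Overview", 2);

// ...and the same data as a nested array, where positions must be memorized.
var asArray = [fakeHeading, "Overview", 2];

// Named properties document themselves; array indexes do not.
var label = items[0].text + " (level " + items[0].level + ")";
var sameLabel = asArray[1] + " (level " + asArray[2] + ")";
```

Both label and sameLabel hold the same string, but only the first expression says what it is reading.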

Retrieving headlines

A very simple solution for this problem is to use document.getElementsByTagName("*"), which should return all elements from the HTML document. Then, for each element, we compare the tagName property with "H1", "H2", etc.

However, this method does not work in Internet Explorer 5.0, because that browser doesn't properly understand the "*" parameter and returns no elements. Therefore we created the following function to deal with this problem. The function returns an Array object containing all the headlines found in the document, as objects defined in the previous section (thus also carrying the headline text and the TOC level). It should be called with a reference to the <BODY> element.

function getHeadlines(el) {
  var l = new Array();
  // anchored so that only a whole tag name H1..H6 matches
  var rx = /^[hH]([1-6])$/;
  // internal recursive function that scans the DOM tree
  var rec = function (el) {
    for (var i = el.firstChild; i != null; i = i.nextSibling) {
      if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
        // use the match result directly instead of the global RegExp.$1
        var m = rx.exec(i.tagName);
        if (m)
          l[l.length] = new TOC_EL(i, H_getText(i), parseInt(m[1], 10));
        rec(i);
      }
    }
  };
  rec(el);
  return l;
}

Some notes:

  • getHeadlines() contains a nested function. This function doesn't have a name, but we "assign" it to the variable rec so that we can call it. This kind of construct is very useful to avoid putting too many parameters or local variables inside the recursive function — it has access to variables defined in the containing function.
  • We are using a RegExp to test if the current element is a headline, that is, if its tagName matches any of Hx, where x is 1 .. 6.
  • A somewhat confusing construct is used to push an element into the array: "l[l.length] = ...". That is because IE5 does not have the push method in the Array object. I recently found a good article that shows how to solve such problems at a more general level.
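The tag-name test can be tried in isolation. The sketch below wraps it in a helper called headlineLevel, a hypothetical name that is not part of the script. The pattern here is anchored with ^ and $ so that only a whole tag name matches; that anchoring is a choice made for this sketch.

```javascript
// Hypothetical helper: returns the headline level (1..6) for a tag name,
// or 0 when the tag is not a headline.
function headlineLevel(tagName) {
  var m = /^[hH]([1-6])$/.exec(tagName);
  // m[1] is the captured digit; radix 10 keeps parseInt unambiguous
  return m ? parseInt(m[1], 10) : 0;
}

var levels = [
  headlineLevel("H1"),  // 1
  headlineLevel("h6"),  // 6, lower-case tag names match too
  headlineLevel("DIV"), // 0
  headlineLevel("HR"),  // 0, starts with H but has no digit
  headlineLevel("H7")   // 0, there is no seventh headline level
];
```

Returning the level as a number keeps the caller free of regular-expression details, and the same value can be used directly to build the "levelX" class name.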

The generate_TOC function

This function is the main entry point into this script. It is intended to be called from the <body onload=""> handler with the ID of some <div> element (or anything else) that will be the parent of the TOC.

It will construct the list of all headings, using getHeadlines(), then iterate through it and create elements inside the parent for each headline. The created elements will consist of a <div> having the class name "levelX", where X is the level of indentation (1 to 6) of that TOC entry. This will allow easy customization through CSS.

function generate_TOC(parent_id) {
  var parent = document.getElementById(parent_id);
  if (!parent)
    return; // no TOC parent in this page, nothing to do
  var hs = getHeadlines(document.getElementsByTagName("body")[0]);
  for (var i = 0; i < hs.length; ++i) {
    var hi = hs[i];
    var d = document.createElement("div");
    // headlines without an ID get a generated one, so the link has a target
    if (hi.element.id == "")
      hi.element.id = "gen" + i;
    var a = document.createElement("a");
    a.href = "#" + hi.element.id;
    a.appendChild(document.createTextNode(hi.text));
    d.appendChild(a);
    // the class name carries the level, so CSS can indent each entry
    d.className = "level" + hi.level;
    parent.appendChild(d);
  }
}

For example, the following is the HTML code that the above function generates for this page. As a side note, I used a handy feature of Mozilla: if you select something on the page and then right-click, the menu that appears offers the option "View Selection Source". It shows the HTML code for the selected block as it is at that moment, so it also shows code generated by JavaScript.

<div id="toc">
  <div class="level1"><a href="#gen0">Automatic TOC Generation</a></div>
  <div class="level2"><a href="#wwg">What we get</a></div>
  <div class="level2"><a href="#gen2">Problem overview</a></div>
  <div class="level3"><a href="#gen3">Tasks</a></div>
  <div class="level2"><a href="#gen4">The Code</a></div>
  <div class="level3"><a href="#gettext">The text retrieval function</a></div>
  <div class="level3"><a href="#gen6">A simple JavaScript object</a></div>
  <div class="level3"><a href="#getheadlines">Retrieving headlines</a></div>
  <div class="level3"><a href="#generatetoc">The generate_TOC function</a></div>
  <div class="level2"><a href="#styling">Styling and indentation</a></div>
  <div class="level2"><a href="#gen10">Putting all together</a></div>
</div>
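How each headline maps to one line of this markup can be sketched without a browser. The function tocMarkup below is a hypothetical helper, not part of the script: it builds strings instead of DOM nodes, takes plain objects with id, text and level properties, and assumes the text contains no characters that would need HTML escaping.

```javascript
// Hypothetical helper: produces the TOC markup as text, one line per item.
function tocMarkup(items) {
  var lines = [];
  for (var i = 0; i < items.length; ++i) {
    var item = items[i];
    // Same fallback as generate_TOC: a headline without an ID gets "gen" + index.
    var id = item.id == "" ? "gen" + i : item.id;
    lines[lines.length] = '<div class="level' + item.level + '">' +
      '<a href="#' + id + '">' + item.text + '</a></div>';
  }
  return lines.join("\n");
}

// The first three headlines of this page; only the second has its own ID.
var sample = [
  { id: "",    text: "Automatic TOC Generation", level: 1 },
  { id: "wwg", text: "What we get",              level: 2 },
  { id: "",    text: "Problem overview",         level: 2 }
];

var markup = tocMarkup(sample);
```

Because the generated ID uses the index of the headline in the whole list, the third entry gets "gen2" rather than "gen1", which is why the numbering in the listing has gaps.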

Styling and indentation

We can style the TOC extensively with external CSS, knowing only the ID of the parent DIV and the fact that items of different levels have different classes (from "level1" to "level6"). Indentation is also done in CSS, so the script does not need to know how much to indent each level.

An example style is shown below. For a more fancy look you can check this page.

#toc {
  float: right;
  font-size: 80%;
  border: 1px solid #000;
  margin: 0px 0px 20px 20px;
  padding: 5px;
  background: #ddd;
}
#toc .level2 { margin-left: 1em; }
#toc .level3 { margin-left: 2em; }
#toc .level4 { margin-left: 3em; }
#toc .level5 { margin-left: 4em; }
#toc .level6 { margin-left: 5em; }

Putting all together

Usage is simple: just dump all these functions inside a ".js" file, load that file into your page that needs a TOC and do the simple setup described in section What we get.

To get a properly indented TOC you should also include a stylesheet, like the one above. Further customization is possible, e.g. a different background or color for each TOC level, or a fancy hover / active style for links inside #toc.

More about mishoo in his personal home-page.

Mozilla Composer has an RFE about this

Submitted by glazou on November 18, 2002 - 03:56.

Hi Mishoo, I suggest you take a look at http://bugzilla.mozilla.org/show_bug.cgi?id=170050 My code does a bit more than your TOC generator since it provides UI to select the datasources of the TOC. I am also using DOM NodeIterator/TreeWalker instead of diving "by hand" into the document's structure. It also "saves" the configuration of the TOC into the TOC itself using a comment. Best regards, Daniel

I can't seem to be able to install TOCMaker

Submitted by mishoo on November 18, 2002 - 05:00.

Or, more exactly, I installed it as normal user, but can't find any menu to access it from. Then I switched to root and installed it again, but the result's the same.

Use meaningful HTML, ban the generic DIV

Submitted by theuiguy on November 18, 2002 - 14:32.

Although the basic idea expressed in this article is excellent, it breaks down in the implementation. Constructing this TOC with nested lists instead of DIVs would provide no less ability to style the results and has the benefit of proper structure built in.

Re: Use meaningful HTML, ban the generic DIV

Submitted by mishoo on November 19, 2002 - 02:36.

Why does it matter since people would not ordinarily see the generated HTML?

For instance, a list has its own spacing (margin-left, margin-top, etc.). I use a spacing of 1em (something like spacing between paragraphs) between items of a list in my pages. Because that's how I like the lists to be.

If I would have used a list for constructing the TOC, I would have had to specifically remove this spacing, because I don't want TOC items to be separated by such a high space.

So the principle is this: I start from a basic element that almost has no default style -- it's a block element without margins, padding, etc. This makes it easier for me to style it as I want it to, i.e. I don't need to "remove" styles. Just need to add.

On the other hand, the TOC somehow resembles the structure of HTML itself: the sections (H1, H2, ...) in a document are not nested (although they should be, in my opinion, like in SGML). So why construct the TOC using nested lists, since the structure of the document is plain?

WOW !!!

Submitted by glazou on November 19, 2002 - 15:22.

Wow, saying that lists are nested in SGML shows a very serious lack of knowledge about SGML... Most SGML DTDs have nested sections but not all of them. If you did not write that comment too fast, you are confusing SGML, the meta-language for the DTDs and the instances conformant to the DTDs, with the DTDs themselves.

Re: WOW !!!

Submitted by mishoo on November 20, 2002 - 02:42.

I didn't mention anything about lists in SGML. I was talking about document sections: in the DocBook DTD, for example, you usually have <chapter>, and a <chapter> can have a <section>, which can itself have a <section>... -- they are nested.

But you are right -- I didn't mean SGML; I meant, for instance, the DocBook DTD. Sorry for creating confusion.

Nice script

Submitted by ppk on November 21, 2002 - 04:53.

Nice script overall. I need such a script myself and wondered about how to translate document.getElementsByTagName('H*') to something the browsers would understand. The regexp is an elegant solution.

The only objection I have is the complicated stuff for reading out complex headlines which contain tags. A simple innerHTML call would also do the trick and is much simpler and faster than your script.

Re: Nice script

Submitted by mishoo on November 21, 2002 - 06:09.

Yep, but for a tag that is a bit more complicated, like the one I showed as an example, innerHTML would yield:

<b>This</b> is a
<em>complicated</em>
<u>heading</u>

I was assuming that everyone would be interested in the text itself, without any markup (at least that's what I needed). The function H_getText is actually very simple and for a tag like the one above would yield "This is a complicated heading" -- thus it would ignore the markup. I found no simpler solution to do that than to recurse through the element as needed.

Re: Nice script

Submitted by ppk on November 21, 2002 - 08:42.

"I was assuming that everyone would be interested in the text itself, without any markup (at least that's what I needed)"

OK, now I understand. Too bad innerText is not supported by Netscape 6.

very nice, but

Submitted by fished on November 24, 2002 - 04:20.

"And somewhere inside the BODY content, where we want the TOC to appear, we should have the following: <div id="toc"></div>"

You'd want to add this via DOM too because it's useless having an empty DIV there for old browsers.

Re: Mozilla Composer has an RFE about this

Submitted by wazungu on November 25, 2002 - 11:40.

This page is interesting, although I don't know enough about XUL and JS to translate this to a script I can use on a web page. The example provided looks like an extension to Mozilla/composer; how do we translate this to use in an on-the-fly TOC generator for web pages? I'm sure there is a way because it uses JS, just not sure how to take it to the next level. I would like to hear more from Daniel if possible. Thanks!

List or DIV's?

Submitted by wazungu on November 25, 2002 - 12:35.

I would have to concur with the earlier statement about using list items instead of a bunch of DIVs to populate the TOC. It may be more "work" to format a list item, but if we care about using XHTML then we care about properly structuring our markup. Presentation should be handled by CSS (as you have suggested), and list items can be formatted the way you want.

Re: List or DIV's?

Submitted by mishoo on November 26, 2002 - 01:54.

I tried this script yesterday in an XHTML page and it looks awful. I have to mention that it only happened when I passed the content-type as text/xml (as opposed to text/html). Does anyone know why?

Evolt.org is an all-volunteer resource for web developers made up of a discussion list, a browser archive, and member-submitted articles. This article is the property of its author; please do not redistribute or use elsewhere without checking with the author.