Search evolt.org

Work

# Four ways to work with hierarchical data

#### Kirby Fling

User since: December 05, 2000

Articles written: 1

Many thanks to RatFace for his article BBS style recursion How-To. I have also recently been wrestling with the concept of displaying heirarchical lists, specifically in a discussion forum tool that I've been writing.

I've spent quite a bit of time working out the data structure and display methods for the concept of heirarchical data and I've come to the conclusion that there are four main programming methods to approach the problem:

1. Recursion
2. Stack
3. Flat Table
4. Modified Preorder Tree Traversal Algorithm

So let's begin our discussion of each method:

## Recursion

To many computer science types, recursion is frequently found to be the most "elegant" method to drill down through a hierarchical list of data.

And I would have to agree with that summation, but then again, that might be my first year of computer science speaking. ("Welcome to CS 101: Recursion until your head explodes")

The reason it is such an "elegant" solution is that you essentually have one function that does all of your data processing, display, and iteration to possible children. You call the function once (with the root of the tree) and everything is done. The added benefit is that you can call the function with any subnode of the tree and it will display the entire subtree without an additional logic or coding effort. In pseudo-code, a hierarchical list display function would look something like this:

```function DisplayChildren(uid, indent_level)
SELECT id, (info)
FROM (db)
WHERE ParentID = uid

if recordcount > 0 then
(indent by indent_level) (HTML to display info for this child)
child_id = recordset.id
DisplayChildren(child_id, indent_level+1)
end if
end function
```

That's really all there is to the recursive method, which is why it's seen as such an elegant solution.

The problem is that the vast majority of coding solutions we have at our disposal handle recursion very poorly. The idea is that every function (ideally) gets executed in its own memory space. And have to leave cleanup to whatever middleware we are using. So let's do a quick little check on efficiency with arbitrary numbers.

Let's take the following to be true (just for example):

• Each time we get one record from the DB, it takes `n` seconds.
• Each DB connection/disconnection takes `n * 2` seconds.
• Allocating memory for a new incidence of the function takes `n/10 * iteration` seconds (in ASP, and Cold Fusion, recursive algorithms have demonstrated a logorithmic decay in performance).

So, if we had 1000 messages in our discussion forum, the equation would look something like this: `(1000 * n) + (1000 * n * 2) + (n/10 * iteration)`. Take a look at the following chart for better clarification:

```Record    1:   3 * n seconds
Record  250:  29 * n seconds
Record  500:  53 * n seconds
Record  750:  79 * n seconds
Record 1000: 103 * n seconds
----------------------------
Total = approx 50,000 * n seconds
```

As you can see, that logorithmic decay really ends up biting in the end. And granted, the numbers I used are just for the sake of example, but you can see that there is a fundamental issue of excess over head with the recursion model. It is a very elegant model, it fits in your head very well, and in the code even better, but unless the system you are using is designed for recursion (e.g. scheme, lisp, etc), you will always run into excessive overhead when the numbers start getting huge.

## Using a Stack

The stack is usually the second method of approaching large, crazy lists. There are 2 types of stack programming: First In First Out and First In Last Out. When working with heirachcical lists, you have to use the FILO model. The idea is that you create a stack somewhere, which is essentially just a list. Then add the id of the record you are currently working with, and when you are done with that record, you pull it off the list. If that record has any children, then you start working with the first child and put it's id at the end of the list. etc.

Much like the recursion model, the data looks something like this:

UIDint, numeric, etc.nothe unique identifier for each record
name, title, etcvarchar(x)yesthe name or title of this record
ParentIDint, numeric, etc.yesthe UID of the parent to this object, NULL for the top level of the tree

But now the question is whether you want to do everything in the middlewareor closer to the data in an SQL stored procedure.

1. you can display your HTML inside each iteration.

1. just as many DB connections being created as in the recursive model

Advantages of running the stack with a stored procedure:

1. executes closer to the data, so it's faster
2. gives you a compiled temp table that needs little effort in the middleware
3. only one connection to the database

1. you might not be able to run stored procedures on your DB system
3. more difficult to code/read/manage it it's remote and in a different language

Here is the code for a hierarchical stack algorithm that is outlined in the Microsost SQL 6.5 documentation as an example:

```CREATE PROC expand (@current char(20)) AS
SET nocount on
DECLARE @level int, @line char(20)
CREATE TABLE #stack (item char(20), level int)
INSERT INTO #stack VALUES (@current, 1)
SELECT @level = 1
WHILE @level > 0
BEGIN
if EXISTS (SELECT * FROM #stack WHERE level = @level)
BEGIN
SELECT @current = item
FROM #stack
WHERE level = @level
SELECT @line = space(@level - 1)  @current
PRINT @line
DELETE FROM #stack
WHERE level = @level
AND item = @current
INSERT #stack
SELECT child, @level  1
FROM hierarchy
WHERE parent = @current
if @@rowcount > 0
SELECT @level = @level  1
END
else
SELECT @level = @level - 1
END
```

You can modify the concept in this code to fit your particular middleware.

Now let's take a look at the theoretical efficiency numbers like we did in therecursive model:

Let's take the following to be true (just for example):

• Each time we get one record from the DB, it takes `n` seconds.
• Each DB connection/disconnection (from middleware) takes `n * 2` seconds.
• Time to execute the list/stack operations is almost nonexistant
• Time to create the temporary table in the stored procedure takes `n / 2` seconds

So, we have the same 1000 messages in our discussion forum, the equation would look something like this:

Running the Stack with middleware: `(1000 * n) + (1000 * n * 2) = 3000 * n`

Running the Stack with a stored procedure: `(1000 * n) + (2 * n) + (1000 * (n / 2)) = 1502 * n`

## Flat Table Model

While I was building my discussion forum, I immediately went for the recursive model to get my data and display it on the screen. I successfully built my forum in both Cold Fusion and ASP. It worked in both, but I wasn't very happy with the performance of either. I decided to investigate other ways of turning a sequence of parent-child relationships into a hierarchical list of data. I started playing with the stack model, and eventually discovered the stored procedure method, and while I was working with that, I had this little flash of insight:

A flat table will always yield the fastest query.

So I started asking myself "How do I turn the data I've got (or am about to get) into a flat table?" As it turns out, it's really quite easy. Take the example of a discussion forum. What information is really on the page? If the information looks like this:

 "hello world" johnDoe 12/2/00 "Re: hello world" janeDoe 12/4/00 "How is your foobar?" johnDoe 12/3/00 "my foobar is fine" sysadmin 12/2/00 "hello yourself" jackDoe 12/2/00 "my app RuLeZ" idiotboy 12/1/00 "learn how to spell" jacksnot 12/2/00

Everybody sees that there is a subject, author and date. But the indenting on the left is usually seen as the parent-child relationship. "my foobar is fine" is a child of "How is your foobar?" etc. But you can also look at it apart from the parent-child relationship and see it as:

1. the order in which to be displayed
2. the level to be indented

So, we can contruct the data schema to look something like this:

UIDint, numeric, etc.nothe unique identifier for each record
name, title, etcvarchar(x)yesthe name or title of this record
ParentIDint, numeric, etc.yesthe UID of the parent to this object, NULL for the top level of the tree
rank (or display_order)int, numeric, etc.yesthe order to display all records
indent_levelint, numeric, etc.yeshow much to indent this item

Now, how do we work with this data structure once it's built? The selection of the data is easy:

```function DisplayChildren()
SELECT id, (info), indent_level
FROM (db)
ORDER BY rank

while the recordset isn't empty {
(indent by indent_level) (HTML to display info for this child)
}
end function
```

But, as easy as that is, the difficulty is now shifted to the point of insertion. The indent level is easy, it is always (parent.indent_level + 1). The display_order (or rank) is a little more challenging. The rank of the new child needs to be greater than it's parent, but less than any of it's children or the next sibling at the same level. In the example above a new child of "hello world" should have a rank of 2. But if they were all ranked from 1 to 7 to begin with, that means that you need to increment the rank of all messages whose rank is greater than 1. Which just happens to be the parent of the child we are inserting. So you end up calling a stored procedure that looks something like this:

```CREATE PROC increment_rank (@start_rank int) AS
BEGIN
UPDATE (table_name)
SET rank = rank + 1
WHERE rank > @start_rank
END
```

Note: you can do this in middleware if you need, but it's such a simply query that will gain such benefit from a pre-compiled query path that a stored procedure is the best way to go.

This solution ends up with the biggest database hit at the point of insertion, but even that hit isn't terribly bad, and the lessening of the hit for each view of the folder is VERY significant. Let's go back to our efficiency numbers for a moment:

Let's take the following to be true (just for example):

• Each time we get one record from the DB, it takes `n` seconds.
• Each DB connection/disconnection (from middleware) takes `n * 2` seconds.

In order to view all of of our 1000 example messages, the equation looks like this: `(1000 * n) + (2 * n) = 1002 * n`

But in order to be fair, we also have to look at our numbers for the insertion of a given message...

Let's take the following to be true (just for example):

• Each time a record is inserted, it takes `n * 2` seconds.
• Each time a record is updated, it takes `n * 2` seconds.

So, for both the recursive and stack models, the cost of insertion is simply: (connection + insertion) `n * 2 + n * 2 = n * 4 `

But for the flat table model, the cost of insertion is something akin to: (connection + updating 1-1000 records + insertion) `n * 2 + (n * 2 * x) + n * 2 = [6n to 1004n]`

As you can see, the cost of insertion is MUCH higher for the flat table model, but even the highest cost of insertion is not as high as the fastest viewing method that we've discussed thus far (the stored procedure/stack model @ 1502 * n). And the viewing speed is 30% faster. So if you expand the numbers to include some kind of usage metric, with an estimated 8,000 page views a month, and an estimated 300 messaged inserted a month. The flat table model is the hands down winner..... for a discussion forum....

Which brings us to our disadvantages to the flat table model. It isn't easily ported to a lot of other uses. With a discussion forum, you almost always want it sorted hierarchically, then by date descending. And the rank-increment method works perfectly for this application. If you wanted to sort it by date ascending, it is still relatively easy (insert the child before the next message with the same parent id). But for other large lists of information that needs to be displayed hierarchically, it's very difficult to get them to display properly without some kind of re-ordering interface that you can expose to whoever is in charge of the information. In my experience, that is almost always the case anyway so the flat table model is once again a viable option for your hierarchical list pages.

## A Modified Preorder Tree Traversal Algorithm

To be completely honest, I've never used this method for working with hierarchical data, and I don't even understand it that well. I read about in a book called SQL for Smarties where Joe Celko devotes 2 entire chapters to working with hierarchical data in databases that aren't really designed to. I figured I should at least make a reference to this method since it purportedly has so many great advantages to the traditional parent-child relationship model.

The basic idea is that you start at the right-most branch of the root of your tree and give it a 1 where the relationship starts and a 2 where it ends. Then you take that node's right-most relationship and give it a 3 where it starts and a 4 where it ends. So that node ends up with 2 numbers: left and right. Decrement the "left" and you get the "right" of it's parent, increment the "right" and you get either it's child or it's sibling at the same level. Continue that idea all the way around the tree until you are back at the root again. Each object ends up with 2 numbers and from those number you can run some fun queries and come up with lots of information about that particular node, the tree in general, paths between different nodes and lots of other stuff. But as I said, I don't fully understand it yet. Maybe when I do, I'll write another article.

Here are some more links I've found on tree traversal:

Submitted by ironclad on December 7, 2000 - 22:40.

I don't know where this fits into the bigger picture, but one way I solved the problem of sorting was by storing into each entry its lineage as a long text string, and then sorting by that field.

IDSubjectLineage
001 Hello World 001
002  Goodbye Cruel World 001-002
006    Skool sukz 001-002-004-006
008     Grow up 001-002-004-006-008
007   Cheer up 001-002-004-007
003 Foo was here 003
005  Kilroy was there 003-005

Although the messages were created out of strict sequence, you can safely assume that message #006 was created after #004 (unless you have some weird recordID thing going). Thus, to put all the messages into correct sorted order you just sort by the lineage.

When creating a new node you can optimise the calculation of the lineage by realising you don't need to traverse the entire ancestry - you just ask your poppa for the history.

I'm guessing this method is more suited to wider rather than deeper hierarchies, which most discussion forums are. Also, given the social dynamics of discussion forums its a good bet that the records are already close to being sorted correctly, what with replies all occuring after their parents, and replies likely to be near in time to those parents.

This approach also facilitates displaying a sub-branch at any time. For example, search for lineage starting with 001-002-004 for the entire "why so sad" thread.

Nonetheless, given the usual ratio of reading vs posting, it still probably better to simply re-sequence the lot and not sort on the fly. Note too the social dynamic of replies being near in time also lessens the impact of your update routine.

#### Using Objects and Recursion

Submitted by avatraxiom on November 21, 2002 - 21:13.

There's also another option. Well, it's really just a variation on the first option you gave. Do one query, and the use recursion to load all of the data into an object. You still have to use recursion, but if you come up with some clever object (some sort of clever self-referencing structure, for example, in ColdFusion), then you only have to QUERY once. (And the query seems like the biggest overhead.) -M

#### Just wondering something about recursion

Submitted by liorean on April 30, 2003 - 10:33.

How large is the difference if you use a recursive language such as LISP versus serial languages such as C/C++ or Java?

#### Great way!

Submitted by cebtaur on February 9, 2004 - 01:47.

I found ironclad's post very useful for myself, almost as useful as the article itself. Indeed, if you need to have just a plain hierarchy of all forum postings etc just sort all the entries by LINEAGE - and you are done. I am mpressed and I am going to implement it tonight. Thanks, guys!

Submitted by born_confused on June 15, 2004 - 15:24.

http://www.sitepoint.com/article/1105 heres a great link for working with trees to store hierrarchical data also FILO = stack, FIFO = queue

#### Another Solution - textual heiarchy (paths)

Submitted by regx on November 20, 2004 - 05:16.

There is another very simple solution although it flies in the face of normilization. If you don't mind bending the rules a bit for the sake of performance and simplicity. One of the most common needs for treelike structures in sql is categories say for a shopping site. A simple solution is to store the path (heiarchy) and the name in seperate columns Of course this requires one not to allow the delimiter used for the path to be entered as part of the category name but that is common sense So your table might look like this
 001 electronics / 002 televisions /electronics/ 003 mp3 players /electronics/ 004 ipod /electronics/mp3 players/
etc. The disadvantages: 1. You have to update related categories when renaming a category, deleting etc - but both can be done with a single query regardless if we want deleting to remove all child categories, or to simply bump them to the next available parent. 2. Since we are using text we have to use proper escaping especially if we use the path in urls The Advantages: 1. When looking at the raw tables the relation ship is visible 2. No recursion or self joins are needed 3. To create a breadcrumb you can simply split on the / and build the urls by traversing an array 4. If you use the path in the url, then your log files make sense as well (it is easy to use ids though with this method) 5. Some search engines give higher priority based on urls so having the category path can be beneficial there as well 6. To get the level for tabbing you can simply count the delimiters To relate items to categories use the ID so you are only breaking normalization for the categories. I know this is breaking the rules, but it is a very simple and fast solution. I would be happy to post source code examples - If I survive the retaliation from this post :) For threaded items such as forums I recomend ironclads' solution.

#### eye-opener

Submitted by thedane on December 3, 2004 - 16:17.

Just found the entire discussion about storing and viewing tree information, and even if it's not a current issue for me, I couldn't stop reading. This article, along with ironclad has been one major eye-opener for me. thanks for your effort guys :-)

#### Sorting by name or whatever?

Submitted by thany on November 16, 2005 - 19:46.

I've found out the hard way that a hierchial structure that allows for retrieval with a single query doesn't allow sorting the nodes by name, date, or whatever you want. This is because the nodes are already sorted on a column that has unique numbers in it, which is used to represent a hierchial structure.

So in the end, the only way to get sorting working is to use the adjacent list model. I'll have no choice but to use recursion, or use the WITH-clause when I switch over to MSSQL 2005 (altough the WITH-clause also causes recursion, but internally)

#### Materialized path

Submitted by mmstud on March 31, 2006 - 17:12.

I made a small approach to handle tree structure in database through materialized path with this: http://www.phpclasses.org/browse/package/2981.html It is possible to sort data by name or any field using order clause in sql statement. Sorting is based on simple logic: "ORDER BY path ASC, field ASC" where second field is used to make final sorting.

#### One or two addendums to lineage (for the beginners like myself)

Submitted by omi on April 27, 2006 - 23:06.

The first question I asked myself after finding Ironclad's elegant solution is how I create the lineage string to use within queries for the sorting of threads.

Originally, I tried to add the uid of the new node to the parentage string at insert. But determining the ID of the new entry was difficult because MySQL determines that part last during an INSERT. Then I tried a transaction where the INSERT was immediately followed by an UPDATE that used the LAST_INSERT_ID() function. But that seemed wrong and too potentially vulnerable to error.

Then I realized it could be done simply. Each entry doesn't store its lineage as a single string. It just stores its parentage. Then, in the SELECT statement, you manufacture the lineage string.

Let's look at a variation on Ironport's table. I've used 0 as the parentage for a top-level comment.

<table width=350 border=1>
<tr><td width=30 align=left><b>ID</b></td><td width=150 align=left><b>Subject</b></td><td width=170 align=left><b>Parentage</b></td></tr>
<tr><td align=left>001</td><td align=left>Hello World</td><td align=left>0</td></tr>
<tr><td align=left>002</td><td align=left>&nbsp;&nbsp;Goodbye Cruel World</td><td align=left>0-001</td></tr>
<tr><td align=left>004</td><td align=left>&nbsp;&nbsp;&nbsp;&nbsp;Why so sad?</td><td align=left>0-001-002</td></tr>
<tr><td align=left>006</td><td align=left>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Skool sukz</td><td align=left>0-001-002-004</td></tr>
<tr><td align=left>008</td><td align=left>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Grow up</td><td align=left>0-001-002-004-006</td></tr>
<tr><td align=left>007</td><td align=left>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Cheer up</td><td align=left>0-001-002-004</td></tr>
<tr><td align=left>003</td><td align=left>Foo was here</td><td align=left>0</td></tr>
<tr><td align=left>005</td><td align=left>&nbsp;&nbsp;Kilroy was there</td><td align=left>0-003</td></tr>
</table>

Now, using a CONCAT, we create a lineage during the SELECT.

SELECT ID, Subject, CONCAT(parentage, '-', ID) as Lineage from 'tablename' ORDER BY Lineage ASC

We get...

<table width=350 border=1>
<tr><td width=30 align=left><b>ID</b></td><td width=150 align=left><b>Subject</b></td><td width=170 align=left><b>Lineage</b></td></tr>
<tr><td align=left>001</td><td align=left>Hello World</td><td align=left>0-001</td></tr>
<tr><td align=left>002</td><td align=left>&nbsp;&nbsp;Goodbye Cruel World</td><td align=left>0-001-002</td></tr>
<tr><td align=left>004</td><td align=left>&nbsp;&nbsp;&nbsp;&nbsp;Why so sad?</td><td align=left>0-001-002-004</td></tr>
<tr><td align=left>006</td><td align=left>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Skool sukz</td><td align=left>0-001-002-004-006</td></tr>
<tr><td align=left>008</td><td align=left>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Grow up</td><td align=left>0-001-002-004-006-008</td></tr>
<tr><td align=left>007</td><td align=left>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Cheer up</td><td align=left>0-001-002-004-007</td></tr>
<tr><td align=left>003</td><td align=left>Foo was here</td><td align=left>0-003</td></tr>
<tr><td align=left>005</td><td align=left>&nbsp;&nbsp;Kilroy was there</td><td align=left>0-003-005</td></tr>
</table>

If you're using this to store comments on blog posts, you can add a post_id field that contains a unique ID for the blog post and use the table to store comments for all of your blog posts. Simply do the following select...

SELECT ID, Subject, CONCAT(parentage, '-', ID) as Lineage from 'tablename' WHERE post_id = "Surfing" ORDER BY Lineage ASC

If may not be as efficient as having a comments table for each post, but if you're posting twice a day and averaging 4 comments per post, would you rather have one table with 2920 entries, or 730 tables with 4 entries each?

You just have to preserve the autoincrement of the ID field by NEVER EVER deleting a comment. Add a "killed" field (value of either Y or N), and if you need to delete a comment, set it's killed value from N to Y. Then modify your SELECT thusly:

SELECT ID, Subject, CONCAT(parentage, '-', ID) as Lineage from 'tablename' WHERE post_id = "Surfing" AND killed = "N" ORDER BY Lineage ASC

To quote Bubba, in <i>Forrest Gump</i>: "That's about it."

- Greg

#### A variant of the recursion that solves the database problem

Submitted by twohills on September 10, 2006 - 03:03.

My problem is to walk a family tree, therefore it is not actually hierarchal - it is a complex network of interconnected hierarchies. There is no root. There is an interim root at the time of query, i.e. the person of interest from whom we want to diplay a tree of ancestors (or descendants), but there is no fixed root. Also the nature of the query changes: show all ancestors, show only male ancestors etc... So recursively walking the tree at query time would seem to be the only practical solution. The fifth answer - or rather a variant of the first - is to read ALL relevant records into memory as a flat array, then walk it recursively in memory. One database query, but simple parent/child database structure and flexible runtime recursion. Neat, huh? It's not my idea. I found it very close to home for this site. evolt.org is obviously a Drupal site (www.drupal.org). See Drupal PHP code in function book_toc in book.module.

#### Modified Preorder Tree & Flat Table are the same

Submitted by bigeek on October 4, 2006 - 20:45.

I just wanna add a quick comment. In essense, the third method descibed (Flat Table) and the last method (Modified Preorder Tree) are the same. The "order" column is the "left" and "rank" is the "right". What Modified Preorder Tree Traversal says is: travel the tree top-down-right-first (meaning: go from a parent to its rightmost child). And when you reach a leaf (a node with no children), you go back up to the last parent's second child. Now, the IMPORTANT part. Label each node while you are doing this. Start from number 1. When you are going down the tree, place a left label, and when you are going back up the tree, place a right label. Try it out, with a pen and paper, you will get it.

#### Sorting with lineage

Submitted by DejaVudew on November 9, 2006 - 15:49.

You should just have to add a nullable parentId column (essentially combining the adjacency model and the "flat table" model that ironclad presented) to be able sort the hierarchy correctly.

#### How do i create this kind of tree

Submitted by halfdogg on November 22, 2006 - 10:31.

I'm working on a classifieds listing website. Your help shall be much appreciated.
Regards.

Submitted by adelevie on July 12, 2008 - 19:51.

While very efficient for retrieving data, the modified preorder tree traversal algorithm is very slow and in efficient (because of lengthy recursion) when it comes to adding or deleting nodes from the tree. I wrote a little post here about another method. This method never requires recursion.

Submitted by inash on March 4, 2009 - 23:53.

Hello. I've been successfully using this method to organize hierarchical data with a couple of applications. But I'm in need for a method that works as efficient with the exception of a node having multiple parents. Haven't been able to figure out a way yet? Some advice is highly appreciated. Cheers.

#### recursive method

Submitted by blaah11 on September 25, 2009 - 23:02.

actually, there is a very simple solution to go around the main criticism of the recursive method (too many query's and db connections) ... as already mentioned above (as an idea) - just do ONE query, load everything into array and do recursive processing after that in memory ... just select all, and emulate the parent>child relation in a results array. it is really easy do implement and it literally kills the main arguments against the recursive method. it has to be some mighty tree, if you run out of memory. anyway, the concept is really simple, I'll post the (very simple) solution if this gets any curious feedback. i have a working solution that does just that and afterwords a child class spits out a json tree to extjs library tree widget to use. It's too "fuzzy" to just copy-paste here, it needs to be cut down to illustrate/underline the point, but - the basic idea is really simple, not clever at all (not as suggested above, that some clever solution is needed)

#### multiple parents

Submitted by blaah11 on September 25, 2009 - 23:26.

don't complicate things! instead of multiple parents, just develop a workaround that ties different items together, something like "symbolic links" in file system level. No need to reinvent the wheel.... keep the base tree strictly classic and use outside concepts to interconnect various branches and child items together

#### Recursion with linear access

Submitted by stephen_mcd on January 27, 2010 - 00:14.

I wrote up a short article describing how you can get linear performance with the simplicity of recursion.