What is the "top ten" problem? Using SQL, how do you select only those rows that have one of the top ten values for a certain column?

Suppose we have a SAMPLETABLE of people and scores.

The "top ten" problem here is how to select the people who have the top ten scores.


Here's an example:

PERSON SCORE
Tom 2
Dick 15
Harry 4
Winken 3
Blinken 11
Nod 7
Dopey 7
Sneezy 21
Bashful 17
Doc 2
Dasher 7
Dancer 9
Comet 7
Cupid 11
Donner 11
Blitzen 13

Can't I just sort the data?

One way to find the top ten values in a column would be to select all rows and bring them back sorted in descending order by that column:

SELECT PERSON

, SCORE

FROM SAMPLETABLE

ORDER BY SCORE DESC

PERSON SCORE
Sneezy 21
Bashful 17
Dick 15
Blitzen 13
Blinken 11
Donner 11
Cupid 11
Dancer 9
Nod 7
Dopey 7
Dasher 7
Comet 7
Harry 4
Winken 3
Tom 2
Doc 2

Since the result set comes back sorted, all you need to do is use the top ten records.

Some front-end programs, like Sybase's Infomaker and Microsoft's Access, allow you to specify a parameter to limit the number of rows in the result set.

In Infomaker it's called maximum number of rows retrieved, but for this discussion, we'll use the Access term, TopValues.


Uh oh, what about ties?

Take another look at the sorted scores, this time with the results numbered like they do in races.

PERSON SCORE PLACE
Sneezy 21 1st
Bashful 17 2nd
Dick 15 3rd
Blitzen 13 4th
Blinken 11 5th
Donner 11 5th
Cupid 11 5th
Dancer 9 8th
Nod 7 9th
Dopey 7 9th
Dasher 7 9th
Comet 7 9th
Harry 4 13th
Winken 3 14th
Tom 2 15th
Doc 2 16th

Did you notice that there are actually several people tied for 9th place?

How would you interpret the "top ten" problem in light of the possibility of ties across 10th place?

Most people would agree that it makes sense to include ties.


Note: No, the SAMPLETABLE does not contain a PLACE column (if it did, you could just select all rows where PLACE < 10, and then where would this tutorial be?). The place values are for illustrative purposes only, and no, it is not easy to generate PLACE values into a table -- that's another SQL tutorial for another day.

Okay, I understand the problem -- what's the solution?

If you don't have a TopValues feature.....

If your front end program doesn't have a TopValues feature, you can just select the entire table with an ORDER BY, then inspect the results (watching for ties), and discard all records past the place you're looking for -- or past the last of the ties.

This is a manual procedure but it's simple.

Once you have determined what the cutoff point is, you would then re-run the query with an absolute value like this:

SELECT PERSON

, SCORE

FROM SAMPLETABLE

WHERE SCORE >= 7

ORDER BY SCORE DESC

Why TopValues doesn't work well

If your front end program does have a TopValues feature, go ahead and use it -- but be sure to set it high enough to get at least one record beyond any ties for the last place you're looking for.

This, too, is a manual procedure, and here's the problem -- you may have to repeat it a few times, or turn off the TopValues feature at least once.

Why is this?

In the last example above, you would have to set TopValues to 13 in order to get past the ties for 9th place. The problem here, of course, is that you don't know in advance how high to go, whether 13 is high enough to snag all the ties across 10th place.

Imagine the case where every person has the same score, i.e. everybody is tied for first place. You would have to set TopValues to a number that is higher than the number of records in the table. With luck, you might happen to notice that the first 10 people had the same score, and realize that something was up.

In the worst case scenario, though, the top few scores are all different, but there's a tie that starts at 10th place and goes on from there. You probably wouldn't realize that you needed to increase TopValues. Therefore it's always a good idea to set it substantially higher than you need to.

This, in turn, means that even when you do use the TopValues feature, you will still have to inspect the results -- you have to make sure you got past any ties, and discard any records past the place you're looking for or past the last of the ties.

But with or without the TopValues feature, sorting the table is simple and efficient -- although it does require manual intervention.

Isn't there an automated way?

Yes, with the "top ten" SQL --

SELECT PERSON

, SCORE

FROM SAMPLETABLE XXX

WHERE 10 > (

SELECT COUNT(*)

FROM SAMPLETABLE

WHERE SCORE > XXX.SCORE )

Let's try to understand this in stages.

First, the query uses a correlated subquery. Basically, a correlated subquery is an inner query that must be evaluated for each row of the outer query. The XXX identifies the table in the outer query, so that in the subquery, the XXX.SCORE identifies that column as belonging to the outer query's table.

So for each row of the outer query, the subquery counts the number of rows with a larger score than that of the outer row under consideration.

If there are fewer than 10 rows with a larger score, then the outer row satisfies the outer WHERE clause, i.e. it's in the top ten.

Are ties across 10th place included by the top ten SQL?

Yes.

PERSON SCORE
Sneezy 21
Bashful 17
Dick 15
Blitzen 13
Blinken 11
Donner 11
Cupid 11
Dancer 9
Nod 7
Dopey 7
Dasher 7
Comet 7

To see why this is so, let's take Dasher as a test case.

Count the number of rows that have a score greater than 7, which is Dasher's score.

The answer is eight rows, so Dasher must be in the top ten.

The top ten SQL automatically includes all ties across 10th place.


What about efficiency?

Clearly, a correlated subquery could cause a performance problem -- the database has to look at all rows of a table for every row.

On the face of it, this looks like it will have "n-squared" efficiency, which means processing requirements go up by the square of the number of rows.

Thankfully, the difference is not too noticeable for tables with a modest number of rows. If your table is large and the query seems to perform poorly, try declaring a descending index on the SCORE column.

For very large tables, you should be on an industrial-strength database server (all of which have a built-in optimizer for situations like this).

For further information...

Please feel free to send your SQL questions to

Rudy Limeback (r937@interlog.com)