
Quick And Dirty Record Sorting In Perl


dwayne h


It's a common thing - you've got some text files lying around with data in them, and you want to get it into some kind of reasonable order. Say, alphabetical, rather than whatever order your form script dumped the data into them.

It's a simple enough task, and doing it is going to give us a couple of techniques that should come in handy in any number of situations.

We just need two tools to accomplish this - the sort function and references.

First, sorting. Perl has a handy sort function that you can use as simply as:

@sorted_names = sort @names;

which will give you an alphabetically sorted array of the names contained in @names. As usual, perl makes some assumptions about what you want done if you're not explicit about it. Unless you tell it otherwise, sort will return the list that you gave it, sorted in standard string comparison order, i.e., 'a' before 'b', and 'A' before 'a'. You may recognize this as the order ASCII runs in.

This is all well and good, but it's not quite what we want. What if someone like rudy enters their last name in lowercase? We want 'limeback' to come before 'Lombardo'. So when the time comes for us to sort the data, we're going to want to make sure that all the last names are compared as if they were the same case. We'll use lc to lowercase everything we sort.
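To see the default ordering in action, here's a minimal sketch. The names in @names are made-up sample data, not anything from a real file.

```perl
use strict;
use warnings;

# Hypothetical sample data, mixing upper and lower case.
my @names = ('banana', 'Cherry', 'apple', 'Apple');

# With no comparison block, sort compares the elements as strings,
# so every uppercase letter sorts ahead of every lowercase one.
my @sorted_names = sort @names;

print "@sorted_names\n";    # prints: Apple Cherry apple banana
```

Note that 'Cherry' lands ahead of 'apple'. That's the ASCII ordering at work.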

Back to sort. When we say:

@sorted_names = sort @names;

what perl actually thinks is:

@sorted_names = sort { $a cmp $b } @names;

That { $a cmp $b } is the key to the whole thing. It's a block of code that describes how we want things sorted. Every time sort needs to compare two elements of the list we pass it, it runs this code with those two elements in $a and $b and keeps track of which one comes first. If $a should come first, the block should return a negative number; if $b should come first, it should return a positive number. If they're the same and their order doesn't matter, the block should return 0. The cmp and <=> operators are perfect for this, 'cause that just happens to be what they do: cmp compares strings and <=> compares numbers. If you use something like the above, you'll get your list sorted in ascending alphabetical order. If you want descending order, you use $b cmp $a.
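Here's a small sketch of the comparison block doing its thing, with cmp for strings and <=> for numbers. The contents of @names and @numbers are sample values I made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data.
my @names   = ('Lombardo', 'Arban', 'Mullholloch');
my @numbers = (10, 9, 100, 1);

my @ascending  = sort { $a cmp $b } @names;      # same result as plain sort
my @descending = sort { $b cmp $a } @names;      # swap $a and $b to reverse
my @as_strings = sort { $a cmp $b } @numbers;    # string comparison
my @as_numbers = sort { $a <=> $b } @numbers;    # numeric comparison

print "@ascending\n";     # prints: Arban Lombardo Mullholloch
print "@descending\n";    # prints: Mullholloch Lombardo Arban
print "@as_strings\n";    # prints: 1 10 100 9
print "@as_numbers\n";    # prints: 1 9 10 100
```

The last two lines are the reason both operators exist: compared as strings, '100' comes before '9'.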

Right about now you're probably wondering where the hell $a and $b came from. I'm going to cop out and say they're magic. Just think about what you want to happen to any two values in your list, and imagine $a and $b being assigned their values. (This is roughly what happens anyway).

Combining sort and lc, we find that we can do something like:

@sorted_names = sort { lc($a) cmp lc($b) } @names;
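A quick sketch of the difference lc makes, using the names from the example above plus one more from our data. Note that lc only affects the comparison; the strings that come back are untouched.

```perl
use strict;
use warnings;

# Sample last names; 'limeback' is deliberately lowercase.
my @names = ('Lombardo', 'limeback', 'Arban');

my @plain       = sort @names;                          # case-sensitive
my @ignore_case = sort { lc($a) cmp lc($b) } @names;    # case-insensitive

print "@plain\n";          # prints: Arban Lombardo limeback
print "@ignore_case\n";    # prints: Arban limeback Lombardo
```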

So now we have our sorting dealt with. Next we need to think about our data structure.

Let's say we have a tab delimited file that looks something like this:

lombardo \t guy \t Las Vegas \t Caesar's \t guy@caesars.com
Mullholloch \t Big Ed \t Boondocks \t Zippy Corp \t biged@boonies.com
Arban \t Louis \t Green River \t Eviltoys \t al@somewhere.com

Essentially, what we have is a list of records, each record having a bunch of fields for each person's last name, first name, location, employer and email address. If we only had one record to deal with, we might use a hash to keep track of its values, and key the hash on the type of value we were keeping track of. Then when we wanted to find out Big Ed Mullholloch's email, for example, we could say:

print $big_ed{email};

which is great if we only have one person to keep track of. But we have a bunch.
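For the one-person case, a minimal sketch of such a hash might look like this. The key names (lastname, firstname, and so on) are the ones used in the script later in this article. The values are single-quoted so perl doesn't try to interpolate the @ in the email address.

```perl
use strict;
use warnings;

# One record, keyed on the type of value we're keeping track of.
my %big_ed = (
    lastname  => 'Mullholloch',
    firstname => 'Big Ed',
    location  => 'Boondocks',
    employer  => 'Zippy Corp',
    email     => 'biged@boonies.com',
);

print "$big_ed{email}\n";    # prints: biged@boonies.com
```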

The usual way to keep track of a bunch of individual things in perl is an array. So the natural thing to do seems to be to put each person's information into a hash, and put each of these hashes into an array. Then we'll have (surprise!) an array of hashes. We can then do all kinds of things like iterate through the array and pick out everyone's email address. Or, more germane to our problem, pick out their last name so we can sort on it.

Arrays in perl are fabulous. You can put any kind of scalar value into an element of an array. You can even put a reference to another array or a hash into them. This is the basis for all complex data structures, and this is exactly what we're going to do. We're going to read through our flat file, split up each person's information into a hash, then put a reference to that hash into an element of an array. What we'll end up with is an array of hashes where each element of the array refers to one person's information, and each key/value pair of the hash is a particular bit of data about that person.

Let's start with Big Ed. We've already got his data in the hash %big_ed. It has keys for his first and last name, email, etc. To put his information into an element of an array, we could say:

$people[0] = \%big_ed;

The backslash indicates that what we're assigning to the first element of @people is a reference to the hash %big_ed. (This is important. If you use:

$people[0] = %big_ed;

perl works with %big_ed in scalar context, and you'll get something weird like '5/8' assigned to $people[0], or on newer perls just the number of keys in the hash. Either way, it isn't Big Ed's data, and it doesn't mean much of anything to anyone except perl.)
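If you want to check which of the two you ended up with, the built-in ref function will tell you. This is just a sketch; $kind, $scalar_copy and $plain are names I made up for the illustration.

```perl
use strict;
use warnings;

my %big_ed = (firstname => 'Big Ed', email => 'biged@boonies.com');
my @people;

# With the backslash: a reference to the hash.
$people[0] = \%big_ed;
my $kind = ref $people[0];     # ref returns 'HASH' for a hash reference

# Without the backslash: the hash in scalar context, not a reference.
my $scalar_copy = %big_ed;
my $plain = ref $scalar_copy;  # ref returns '' for an ordinary value

print "with backslash: $kind\n";       # prints: with backslash: HASH
print "without backslash: '$plain'\n"; # prints: without backslash: ''
```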

(For more on references, check out the perlref documentation in your perl installation (type 'perldoc perlref' at the command line, or look for it in the html directory of your perl installation). This is also a good tutorial).

Now that we have Big Ed's data in @people, how do we get it out? The syntax is actually pretty intuitive if you're used to working with arrays and hashes anyway. To get at an element of an array, you use $array[0]. To get at an element of a hash, you use $hash{key}. Perl knows by the kind of brackets you use whether you're talking about an array or a hash. Same with references. The only thing you have to remember is that the value in $people[0] isn't the hash itself. It's something like 'HASH(0x82d01fc)', which tells perl what's being pointed to (a hash) and where to find it. To follow the reference, you put an arrow between it and the subscript, as in $people[0]->{email}, and between two subscripts perl lets you leave the arrow out. (The long way to write it is ${$people[0]}{email}. The extra-$ shorthand only works on a plain scalar holding a reference, like $$person{email}; $$people[0]{email} would go looking for a scalar called $people.) To get Big Ed's email, now we can say:

     print $people[0]{email};

and the output will be 'biged@boonies.com'.
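Here's a small sketch that tries the equivalent spellings side by side. The variable names on the left ($block_form and friends) are just labels for the illustration.

```perl
use strict;
use warnings;

my %big_ed = (firstname => 'Big Ed', email => 'biged@boonies.com');
my @people = (\%big_ed);

# Three ways of following the reference stored in $people[0]:
my $block_form = ${ $people[0] }{email};    # explicit dereference
my $arrow_form = $people[0]->{email};       # arrow dereference
my $short_form = $people[0]{email};         # arrow omitted between subscripts

# A plain scalar holding the reference can use the extra-$ shorthand.
my $person   = $people[0];
my $two_sigs = $$person{email};

# Note: $$people[0]{email} is NOT one of these. Perl reads it as
# ${$people}[0]{email}, and there is no scalar $people here.

print "$short_form\n";    # prints: biged@boonies.com
```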

Now we've got all the tools we need to make a data structure containing everybody's information, sort it, and spit it back out again. Probably the easiest way to see this is just to give you the code. Hopefully I've been clear enough that with a few comments you can figure out what's going on. If I've been completely murky, you can find me on thelist and I'll try to clarify.

There's just one last expansion to the above description of getting Big Ed's data into @people. We don't actually have to put his info into a hash first. If we assign a piece of data to an element of a hash in an array, the necessary data structure will just pop into existence. If we say:

     $people[0]{firstname} = 'Big Ed';

perl knows you want to put a hash with the key 'firstname' and the value 'Big Ed' into the first element of @people. This is immensely useful in things like while loops, as we'll soon see.
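A minimal sketch of hashes popping into existence. The second person's values are borrowed from the sample file; $count and $kind are names I added for the illustration.

```perl
use strict;
use warnings;

my @people;    # starts out empty, no hashes anywhere

# Assigning to a key creates the hash (and the array element) on the spot.
$people[0]{firstname} = 'Big Ed';
$people[0]{lastname}  = 'Mullholloch';
$people[1]{firstname} = 'Louis';

my $count = scalar @people;    # number of elements now in @people
my $kind  = ref $people[0];    # what kind of thing element 0 holds

print "$count people, each stored as a $kind\n";
# prints: 2 people, each stored as a HASH
```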

Below is a stripped down script that reads in a file like the one above into an array of hashes and prints it out again. I've written it as if it was an offline script just to make it easier to read.

#!/usr/bin/perl

# always use strict, at least in development.
# It makes you keep good habits.
use strict;
use warnings;

# open the data file for reading, and stop if we can't
open(my $fh, '<', 'data.txt') or die "Can't open data.txt: $!";

# the array where we'll keep everyone's data
my @people;

# counter
# we'll use this to keep track of which
# array element we're assigning to
my $i = 0;

# iterate through each line in data.txt
while (<$fh>) {

    # strip the trailing newline so it doesn't end up in 'email'
    chomp;

    # hashes popping into existence and getting
    # stored in @people
    ($people[$i]{lastname},
     $people[$i]{firstname},
     $people[$i]{location},
     $people[$i]{employer},
     $people[$i]{email}
    ) = split(/\t/);    # $_ is implicit

    # increment our counter by one so we can assign the
    # next person's data to the next element of @people
    $i++;
}

close($fh);

# sort @people according to the value of 'lastname' in each element's hash
my @sorted_people = sort {
    lc( $$a{lastname} ) cmp lc( $$b{lastname} )
} @people;

# now we iterate through @sorted_people, assigning each hash
# in it to $person. Remember that we need two '$'s because
# $person is a reference to a hash. If we just have one '$', perl
# will think we're looking for the values of '%person', which
# doesn't exist
for my $person (@sorted_people) {
    print "$$person{lastname}, $$person{firstname}, $$person{location}, $$person{employer}, $$person{email}\n";
}
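If you want to try the whole thing without creating data.txt first, here's a self-contained sketch of the same steps. The $data string and the in-memory filehandle (opening a reference to a scalar) are my additions for the demo, not part of the script above, and @report exists only so the result is easy to inspect.

```perl
use strict;
use warnings;

# The three sample records, built with join so the tabs are real tabs.
my $data = join "\n",
    join("\t", 'lombardo', 'guy', 'Las Vegas', "Caesar's", 'guy@caesars.com'),
    join("\t", 'Mullholloch', 'Big Ed', 'Boondocks', 'Zippy Corp', 'biged@boonies.com'),
    join("\t", 'Arban', 'Louis', 'Green River', 'Eviltoys', 'al@somewhere.com');

# Read from the string as if it were a file.
open(my $fh, '<', \$data) or die "Can't open in-memory data: $!";

my @people;
my $i = 0;
while (<$fh>) {
    chomp;
    ($people[$i]{lastname},
     $people[$i]{firstname},
     $people[$i]{location},
     $people[$i]{employer},
     $people[$i]{email}
    ) = split(/\t/);
    $i++;
}
close($fh);

# Case-insensitive sort on last name.
my @sorted_people = sort {
    lc( $$a{lastname} ) cmp lc( $$b{lastname} )
} @people;

my @report;
for my $person (@sorted_people) {
    push @report, "$$person{lastname}, $$person{firstname}, $$person{email}";
}
print "$_\n" for @report;
# prints:
# Arban, Louis, al@somewhere.com
# lombardo, guy, guy@caesars.com
# Mullholloch, Big Ed, biged@boonies.com
```

Notice that 'lombardo' sorts between 'Arban' and 'Mullholloch' even though it's lowercase, which is exactly what the lc in the sort block buys us.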


evolt.org Evolt.org is an all-volunteer resource for web developers made up of a discussion list, a browser archive, and member-submitted articles. This article is the property of its author, please do not redistribute or use elsewhere without checking with the author.