In the initial article, Using Apache to Stop Spambots, we laid the foundation for using Apache to identify, trap, and block Spambots (or Spiderts) from your website. In this article, we'll address some of the concerns readers raised in response to that article. We'll also introduce some more foolproof tools, methods, and procedures for keeping the Spiderts out.

A Recap

Spiderts are evil. In the nearly five months since the original article, in which we discussed some of the reasons to keep email harvesting robots off our sites, the amount of SPAM people receive seems to have grown in both volume and directness. Even the U.S. Government has finally started to take notice and is considering legislation on the subject. Other types of SPAM, such as the pornographic variety, are increasingly thrust into the public spotlight as they become more and more of a problem for the average Netizen.

In the first article, we showed how the SetEnvIfNoCase Apache directive could be used to block Spiderts from your website by comparing their User-Agent string to a list of known Spidert User-Agents, or of User-Agents that had accessed a fictitious directory listed in our robots.txt file. If the User-Agent was found in the banned list, an environment variable was set within Apache, and requests carrying that variable were answered with a "403 Forbidden" error.

As many readers pointed out, however, what about those Spiderts that used common User-Agent strings such as "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)"? After all, if some Spiderts are evil enough to index directories explicitly denied in the robots.txt file, surely they would have no problem using a fake User-Agent string to slip past our defences!

In this article, we'll discuss a more foolproof method of blocking the Spiderts from our websites by using their IP address - which is much harder to spoof - instead of their User-Agent, which can be modified to match commonly used strings.

The Setup

As we did last time, we'll be using a "honey-pot" directory within our robots.txt file to lure the Spiderts into a non-existent directory. Here's an example honey-pot robots.txt file:

User-agent: *
Disallow: /user/
# Here's our honey-pot directory, which doesn't actually exist.
Disallow: /email-addresses/

What we'll be doing differently this time is blocking their IP address, not their User-Agent string. As an added little bonus, we're also going to use a Perl script to automatically parse our access_log and update a list of banned IP addresses that Apache reads! This alleviates the problem of catching a Spidert "after the fact": as soon as it touches the honey-pot directory, its IP address is added to the list, and every request it makes from then on gets the "403 Forbidden" error. All right! Let's move on...

The Flypaper

Here is the Perl script that will traverse our access_log looking for IP addresses which access the honey-pot directory (Thanks to Dean Mah for coding the Perl script for me!):

#!/usr/bin/perl -w
#
# greplog.pl - monitor the access_log for agents accessing the honey-pot
#              and block their IP addresses.
#
# Notes:
#
#   - The log format is currently hardcoded.  Could grab this from the Apache
#     configuration file.
#
# DSM - Dean Mah (dmah at members.evolt.org)

use strict;
use File::Tail;

# Unbuffer output.
$| = 1;

# Configuration.
my $access_log = '/usr/local/apache/logs/access_log';
my $pidfile    = '/usr/local/apache/logs/httpd.pid';
my $block_file = '/usr/local/apache/conf/badbots.txt';

# Open the access log.
my $file = File::Tail->new($access_log);
my $pid  = '';

# Grab a list of already blocked IP addresses.
my %blocked = %{ init_blocked($block_file) };

while (defined($_ = $file->read)) {
    my ($ip, $path, $agent) = m|^([\d\.]+).+GET\s+(\S+).+\"([^\"]+)\"$|;

    # Skip lines that don't match the expected format (e.g. non-GET requests).
    next unless defined $path;

    if ($path =~ /email-addresses/) {
        # If the IP hasn't been blocked already, add it to the blocked
        # IP list and restart the server.
        if (!defined($blocked{$ip})) {
            if (open(FILE, ">> $block_file")) {
                print FILE "SetEnvIfNoCase Remote_Addr $ip bad_bot\n";
                close(FILE);
            }
            $blocked{$ip} = 1;

            # Get Apache to re-read its configuration.  We need to
            # keep grabbing Apache's pid in case it changes while running.
            # The signal is sent by name because its number varies by platform.
            kill 'USR1', $pid if (($pid = get_httpd_pid($pidfile)) != 0);
        }
    }
}

exit(0);

########## Subroutines ##########

#
# init_blocked - load the IP addresses already listed in the block file.
#
sub init_blocked {
    my $block_file = shift;
    my %blocked = ();

    if (open(FILE, $block_file)) {
        while (<FILE>) {
            chomp;
            # Each line reads "SetEnvIfNoCase Remote_Addr <ip> bad_bot";
            # key the hash on the IP so it matches the lookup in the main loop.
            $blocked{$1} = 1 if /Remote_Addr\s+(\S+)/;
        }
        close(FILE);
    }

    return \%blocked;
}

#
# get_httpd_pid - read Apache's pid from its pid file (0 if unreadable).
#
sub get_httpd_pid {
    my $pidfile = shift;

    open(FILE, $pidfile) || return 0;
    chomp(my $pid = <FILE>);
    close(FILE);

    return $pid;
}

First, we define the path to the access_log, the PID file for Apache, and the name of the file we'll write the offending Spiderts' IP addresses to. Adjust these variables to taste, as they'll more than likely vary from system to system.

While a full dissection of the Perl script is outside the scope of this article, let's run through it quickly so you can get an idea of exactly what we're doing.

Basically, it watches the access_log, and if the string "email-addresses" (remember, this is our honey-pot directory from the robots.txt file) appears anywhere in the path of a GET request, it appends a SetEnvIfNoCase directive to the badbots.txt file. That directive sets the bad_bot environment variable for the IP address of the client that's poking its nose where it shouldn't be. If it's a new IP address, the Perl script will signal Apache to restart gracefully so it re-reads the badbots.txt file and blocks the new entry. If you'd like to get creative, adding a subroutine that emails an administrator each time a new IP address is blocked might be a nice idea.
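To make the matching step concrete, here is a minimal sketch that runs the script's regular expression against a single log line and builds the directive it would append to badbots.txt. The sample log line, the address 192.0.2.10, and the helper names parse_line, directive_for, and anchored_directive_for are made up for illustration, and the line assumes Apache's combined log format. The anchored variant is an optional tightening, not part of the original script: SetEnvIfNoCase treats its argument as a regular expression, so escaping the dots and anchoring the pattern should keep an entry for 192.0.2.1 from also matching 192.0.2.10.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pull the client IP, request path, and User-Agent out of one log line,
# using the same pattern as greplog.pl. Returns an empty list on no match.
sub parse_line {
    my ($line) = @_;
    return $line =~ m|^([\d\.]+).+GET\s+(\S+).+\"([^\"]+)\"$|;
}

# The directive in the same form greplog.pl writes it.
sub directive_for {
    my ($ip) = @_;
    return "SetEnvIfNoCase Remote_Addr $ip bad_bot\n";
}

# Optional, stricter variant (an assumption, not in the original script):
# escape the dots and anchor the pattern so only this exact address matches.
sub anchored_directive_for {
    my ($ip) = @_;
    (my $pattern = $ip) =~ s/\./\\./g;
    return "SetEnvIfNoCase Remote_Addr ^$pattern\$ bad_bot\n";
}

# Hypothetical log line in Apache's combined log format.
my $sample = '192.0.2.10 - - [10/Oct/2002:13:55:36 -0700] '
           . '"GET /email-addresses/ HTTP/1.0" 404 287 "-" '
           . '"Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)"';

my ($ip, $path, $agent) = parse_line($sample);
if (defined $path && $path =~ /email-addresses/) {
    print directive_for($ip);
    print anchored_directive_for($ip);
}
```

If you adopt the anchored form, whatever reads badbots.txt back in has to undo the escaping before comparing addresses.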

As is, the Perl script will continuously 'tail' the access_log file, but you may want to modify it to run as a cron job, or run it manually from time to time. Whatever floats your boat!
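If you go the cron route, the tailing loop can be swapped for a single pass over the log. The sketch below is one way that might look, not a drop-in replacement: the function name new_offenders and the sample log lines are invented for illustration. It only collects the IP addresses that hit the honey-pot and aren't blocked yet, leaving the writing of badbots.txt and the Apache restart to the code greplog.pl already has.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scan a list of access_log lines once and return the IP addresses that hit
# the honey-pot and are not yet in the %$blocked hash (keyed by IP).
sub new_offenders {
    my ($lines, $blocked) = @_;
    my (%seen, @offenders);

    for my $line (@$lines) {
        my ($ip, $path) = $line =~ m|^([\d\.]+).+GET\s+(\S+)|;
        next unless defined $path && $path =~ /email-addresses/;
        next if $blocked->{$ip} || $seen{$ip}++;   # skip known and repeat IPs
        push @offenders, $ip;
    }
    return @offenders;
}

# Hypothetical log lines; a cron job would read these from the access_log.
my @log = (
    '192.0.2.10 - - [10/Oct/2002:13:55:36 -0700] "GET /email-addresses/ HTTP/1.0" 404 287 "-" "Mozilla/4.0"',
    '192.0.2.11 - - [10/Oct/2002:13:55:40 -0700] "GET /index.html HTTP/1.0" 200 1024 "-" "Mozilla/4.0"',
    '192.0.2.10 - - [10/Oct/2002:13:56:02 -0700] "GET /email-addresses/list.html HTTP/1.0" 404 287 "-" "Mozilla/4.0"',
    '192.0.2.12 - - [10/Oct/2002:13:57:11 -0700] "GET /email-addresses/ HTTP/1.0" 404 287 "-" "Mozilla/4.0"',
);
my %already_blocked = ('192.0.2.12' => 1);

print "$_\n" for new_offenders(\@log, \%already_blocked);
```

Because each run rescans the whole log, the list of already blocked addresses is what keeps an IP from being written to badbots.txt twice.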

Wrapping Up

In our Apache httpd.conf file, we're going to include the file containing the blocked IP addresses with the Include directive like so:

# Include $APACHE_HOME/conf/badbots.txt
Include conf/badbots.txt

And just in case you forgot from the last article, we would use the following code to block any clients that match the "bad_bot" environment variable from a directory:

<Directory "/home/djc/public_html/*">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Directory>

It should also be mentioned that the Perl script could easily be modified to output the IP address of the Spidert to a firewall rule that would completely block any TCP/IP traffic from the offending IP address to the server. That exercise is left to the reader as it doesn't have too much to do with Apache!
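As a starting point for that exercise, here is a hedged sketch of a helper that turns an offending IP address into a firewall command. It assumes a Linux host that uses iptables, and the function name firewall_rule_for is made up. The sketch only prints the command rather than running it, since inserting rules needs root privileges and deserves some care.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build an iptables command that drops all traffic from one IPv4 address.
# Returns undef if the input doesn't look like a dotted-quad address, so a
# malformed log entry can never end up inside a shell command.
sub firewall_rule_for {
    my ($ip) = @_;
    return unless defined $ip
        && $ip =~ /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\z/;
    return if grep { $_ > 255 } $1, $2, $3, $4;
    return "iptables -A INPUT -s $ip -j DROP";
}

# Hypothetical offender; greplog.pl would pass in the IP it just blocked.
my $rule = firewall_rule_for('192.0.2.10');
print "$rule\n" if defined $rule;
```

Validating the address before building the command matters here because the value ultimately comes from a log file that remote clients can influence.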

When you've got everything configured, just restart Apache, run the Perl script, and feel confident that even though your email address is on a publicly available website, it's much less likely to be harvested and abused. And while there is no foolproof method to keep your email address completely out of the hands of SPAMmers, we can (and will!) keep refining our techniques to make it as difficult as possible.

Have suggestions or comments? I'd love to hear what you have to say below!