Stopping Spambots II - The Admin Strikes Back

Daniel Cody

In the initial article, Using Apache to Stop Spambots, we laid the foundation for using Apache to identify, trap, and block Spambots (or Spiderts) from your website. In this article, we'll address some of the concerns readers raised in response to that article. We'll also introduce some more foolproof tools, methods, and procedures for keeping the Spiderts out.

A Recap

Spiderts are evil. In the nearly five months since the original article, in which we discussed some of the reasons to keep email-harvesting robots off our sites, the amount of SPAM people receive seems to have increased in frequency and become more direct. Even the U.S. Government has finally started to take notice and is considering legislation on the subject. Other types of SPAM, such as the pornographic variety, are increasingly thrust into the public spotlight as they become more and more of a problem for the average Netizen.

In the first article, we showed how the SetEnvIfNoCase Apache directive could be used to block Spiderts from your website by comparing their User-Agent string to a list of known Spidert User-Agents, plus any User-Agents that had accessed a fictitious directory listed in our robots.txt file. If the User-Agent was found in the banned list, an environment variable was set within Apache, and Apache returned a "403 Forbidden" error to that User-Agent.

As many readers asked, however, what about those Spiderts that use common User-Agent strings such as "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)"? After all, if some Spiderts are evil enough to index directories explicitly denied in the robots.txt file, surely they would have no problem using a fake User-Agent string to slip past our defences!

In this article, we'll discuss a more foolproof method to block the Spiderts from our websites by using their IP address - which is much harder to spoof - instead of their User-Agent, which can be modified to match commonly used strings.

The Setup

As we did last time, we'll be using a "honey-pot" directory within our robots.txt file to lure the Spiderts into a non-existent directory. Here's an example honey-pot robots.txt file:

User-agent: *
Disallow: /user/
# Here's our honey-pot directory, which doesn't actually exist.
Disallow: /email-addresses/

What we'll be doing differently this time is blocking their IP address, and not their User-Agent string. As an added little bonus, we're also going to be using a Perl script to automatically parse our access_log and update a list of banned IP addresses which Apache will be reading! This will alleviate the problem of catching a Spidert "after the fact": the Spidert is blocked within moments of its first request for the honey-pot directory, and every request it makes after that receives the "403 Forbidden" error. All right! Let's move on...

The Flypaper

Here is the Perl script that will traverse our access_log looking for IP addresses which access the honey-pot directory (Thanks to Dean Mah for coding the Perl script for me!):

#!/usr/bin/perl -w
#
# greplog.pl - monitor the access_log for agents accessing the honey-pot
#              and block their IP addresses.
#
# Notes:
#
# - The log format is currently hardcoded.  Could grab this from the Apache
#   configuration file.
#
#
# DSM - Dean Mah (dmah at members.evolt.org)

use strict;

use File::Tail;

$| = 1;

# Configuration.

my $access_log = '/usr/local/apache/logs/access_log';
my $pidfile = '/usr/local/apache/logs/httpd.pid';
my $block_file = '/usr/local/apache/conf/badbots.txt';

# Constants.

my $SIGUSR1 = 'USR1';   # Signal name; the number (10 on Linux) varies by platform.

# Open the access log.
my $file = File::Tail->new($access_log);

my $pid = '';

# Grab a list of already blocked IP addresses.
my %blocked = %{ init_blocked($block_file) };

while (defined($_ = $file->read)) {
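    my ($ip, $path, $agent) = m|^([\d\.]+).+GET\s+(\S+).+\"([^\"]+)\"$|;

    # Skip lines that do not match the expected format (e.g. POST requests).
    next unless defined $path;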

    if ($path =~ /email-addresses/) {

        # If the IP hasn't been blocked already, add it to the blocked
        # IP list and restart the server.

        if (!defined($blocked{$ip})) {
            if (open(FILE, ">> $block_file")) {
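                # The address is used as a regular expression, so escape the
                # dots and anchor it; otherwise 1.2.3.4 also matches 11.2.3.40.
                (my $pattern = $ip) =~ s/\./\\./g;
                print FILE "SetEnvIfNoCase Remote_Addr ^$pattern\$ bad_bot\n";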
                close(FILE);
            }

            $blocked{$ip} = 1;

            # Get Apache to re-read its configuration.  We need to
            # keep grabbing Apache's pid in case it changes while running.
            kill $SIGUSR1, $pid if (($pid = get_httpd_pid($pidfile)) != 0);
        }
    }
}

exit(0);
########## Subroutines ##########

#
# init_blocked
#

sub init_blocked {
    my $block_file = shift;

    my %blocked = ();

    if (open(FILE, $block_file)) {
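        while (<FILE>) {
            chomp;
            # Each line reads "SetEnvIfNoCase Remote_Addr <pattern> bad_bot".
            # Recover the bare IP address so it matches the keys used in the
            # main loop; keying on the whole line would never match.
            my (undef, undef, $ip) = split ' ';
            next unless defined $ip;
            $ip =~ s/[\\^\$]//g;
            $blocked{$ip} = 1;
        }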
        close(FILE);
    }

    return \%blocked;
}

#
# get_httpd_pid
#
sub get_httpd_pid {
    my $pidfile = shift;

    open(FILE, $pidfile) || return 0;
    chomp(my $pid = <FILE>);
    close(FILE);

    return $pid;
}

First, we define the paths to the access_log and to Apache's PID file, along with the name of the file we'll write the offending Spiderts' IP addresses to. Adjust these variables to taste, as they'll more than likely vary from system to system.

While a full dissection of the Perl script is outside the scope of this article, let's run through it quickly so you can get an idea of exactly what we're doing.

Basically, it watches the access_log, and if the string "email-addresses" (remember, this is our honey-pot directory from the robots.txt file) appears anywhere in the path of a GET request, it checks the IP address of the client that's poking its nose where it shouldn't be. If it's a new IP address, the script appends a SetEnvIfNoCase directive to the badbots.txt file that sets the bad_bot environment variable for that address, then signals Apache to restart gracefully so it can re-read the badbots.txt file and block the new entry. If you'd like to get creative, adding a subroutine that emails an administrator each time a new IP address is blocked might be a nice idea.
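To make the matching step concrete, here is a minimal standalone sketch that applies the script's regular expression to a single log line and prints the directive it would lead to. The log line is made up for illustration, and the helper names parse_log_line and directive_for are hypothetical, not part of greplog.pl. The directive is written with the dots escaped and the pattern anchored, since SetEnvIfNoCase treats its second argument as a regular expression.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as greplog.pl: client IP, requested path, and the last
# quoted field on the line (the User-Agent in the combined log format).
sub parse_log_line {
    my ($line) = @_;
    my ($ip, $path, $agent) =
        $line =~ m|^([\d\.]+).+GET\s+(\S+).+\"([^\"]+)\"$|;
    return unless defined $path;    # line did not match the expected format
    return ($ip, $path, $agent);
}

# Build the directive for one IP address, escaping the dots and anchoring
# the pattern so that it matches this address only.
sub directive_for {
    my ($ip) = @_;
    (my $pattern = $ip) =~ s/\./\\./g;
    return "SetEnvIfNoCase Remote_Addr ^$pattern\$ bad_bot";
}

# Hypothetical log line, made up for illustration.
my $sample = '192.0.2.44 - - [19/Feb/2002:02:03:00 -0600] '
           . '"GET /email-addresses/ HTTP/1.0" 404 287 "-" '
           . '"Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)"';

my ($ip, $path, $agent) = parse_log_line($sample);
if (defined $path && $path =~ /email-addresses/) {
    print directive_for($ip), "\n";
}
```

Run as is, the sketch prints the single line SetEnvIfNoCase Remote_Addr ^192\.0\.2\.44$ bad_bot, which is the kind of entry that ends up in badbots.txt.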

As is, the Perl script will continuously 'tail' the access_log file, but you may want to modify it to run as a cron job, or run it manually from time to time. Whatever floats your boat!
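For the cron-job variation, one possible shape is a single pass over the log instead of a continuous tail. The sketch below is only an outline under that assumption: find_new_offenders is a hypothetical helper, and the log excerpt is invented and read from an in-memory filehandle so the example runs without a real access_log.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scan every line available on a filehandle once and return the IP addresses
# that requested the honey-pot and are not in the %$blocked hash yet.
sub find_new_offenders {
    my ($fh, $blocked) = @_;
    my @new;
    while (my $line = <$fh>) {
        my ($ip, $path) = $line =~ m|^([\d\.]+).+GET\s+(\S+)|;
        next unless defined $path && $path =~ /email-addresses/;
        next if $blocked->{$ip}++;    # already blocked, or seen earlier in this run
        push @new, $ip;
    }
    return @new;
}

# Hypothetical log excerpt, read from an in-memory filehandle.
my $log = <<'END';
192.0.2.44 - - [19/Feb/2002:02:03:00 -0600] "GET /email-addresses/ HTTP/1.0" 404 287 "-" "Mozilla/4.0"
198.51.100.7 - - [19/Feb/2002:02:04:10 -0600] "GET /index.html HTTP/1.0" 200 5120 "-" "Mozilla/4.0"
192.0.2.44 - - [19/Feb/2002:02:05:30 -0600] "GET /email-addresses/list.html HTTP/1.0" 404 287 "-" "Mozilla/4.0"
END

open(my $fh, '<', \$log) or die "Cannot open in-memory log: $!";
my %blocked;
my @new = find_new_offenders($fh, \%blocked);
close($fh);

print "$_\n" for @new;
```

In a real cron job, the filehandle would be opened on the access_log, %blocked would be loaded from badbots.txt first, and Apache would only need to be signalled when the returned list is not empty. Note that each run re-reads the whole log, so this trades immediacy for simplicity.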

Wrapping Up

In our Apache httpd.conf file, we're going to include the file containing the blocked IP addresses with the Include directive like so:

# Include $APACHE_HOME/conf/badbots.txt
Include conf/badbots.txt

And just in case you forgot from the last article, we would use the following code to block any clients that match the "bad_bot" environment variable from a directory:

<Directory "/home/djc/public_html/*">
        Order Allow,Deny
        Allow from all
        Deny from env=bad_bot
</Directory>

It should also be mentioned that the Perl script could easily be modified to output the IP address of the Spidert to a firewall rule that would completely block any TCP/IP traffic from the offending IP address to the server. That exercise is left to the reader as it doesn't have too much to do with Apache!
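As a starting point for that exercise, the sketch below builds (but does not run) a firewall command for an offending address. It assumes a Linux server using iptables, which the article itself does not specify, and firewall_command is a hypothetical helper. The address is validated first, because anything parsed out of a log file should be treated as untrusted input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the argument list for an iptables rule that drops all traffic from
# $ip, or an empty list if $ip is not a plausible dotted-quad address.
sub firewall_command {
    my ($ip) = @_;
    return unless defined $ip;
    my @octets = $ip =~ /\A(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\z/
        or return;
    return if grep { $_ > 255 } @octets;
    return ('iptables', '-A', 'INPUT', '-s', $ip, '-j', 'DROP');
}

my @cmd = firewall_command('192.0.2.44');
print "@cmd\n" if @cmd;    # prints: iptables -A INPUT -s 192.0.2.44 -j DROP
```

To actually apply the rule, the script would have to run as root and call system(@cmd); the list form of system avoids passing the address through a shell.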

When you've got everything configured, just restart Apache, run the Perl script, and feel confident that, even though your email address is on a publicly available website, it's much less likely to be harvested and abused. And while there is no foolproof method to keep your email address completely out of the hands of SPAMmers, we can (and will!) keep refining our techniques to make it as difficult as possible.

Have suggestions or comments? I'd love to hear what you have to say below!

Dan lives a quiet life in the bustling city of Milwaukee, WI. Although he founded what would become evolt.org in 1998, he's since moved on to other projects and is now the owner of Progressive Networks, a Zimbra hosting company based in Milwaukee.

His personal site can be found at http://dancody.org/

Not Keen to restart Apache.

Submitted by dannycarroll on February 19, 2002 - 02:03.

Instead of restarting Apache could you not set a bad-cookie for your honeypot, to be disabled by the protected dir?

That way if you force cookies for your protected dir you would be disallowing anyone who A) has cookies un-set or B) accesses the honeypot dir.

I think a number of approaches is what is needed here. I also like the invisible link on a front page, if it is the first link on a page then you'd have to get a lot of spammers as well. That link could be to your honeypot dir as well.

-D

What if they are using a dial-up pool?

Submitted by neoliminal on February 19, 2002 - 11:03.

If another user in the pool gets the same IP number, you're going to block them from viewing the site. Could the filter be a bit more sophisticated and have some rules to determine whether the spider is back rather than a legitimate user? In other words, is there a way to code for this type of error?

Announcing Robotcop.org

Submitted by Robotcop on February 19, 2002 - 16:02.

Wow, excellent timing. In response to your previous article on this subject last year, several of us set out to write an Apache module which would implement your suggestions! http://www.robotcop.org/ just went up, and a first working version is available.

The basic idea is that it uses the techniques described above, but once caught the spider is blocked EVERYWHERE on your site, even across virtual hosts, for a configurable period of minutes. It does some pretty cool stuff like robots.txt parsing to automatically block spiders who read your rules and ignore them. It comes with basic tarpit + poisoned e-mail list functions, and it's easy to add your own "arrest handlers" to do whatever evil things you want to misbehaving spiders.

I hope people will try it out and let me know how it goes! Our approach probably isn't the best for everyone, but for many sites this works much better than a Perl CGI, and with way less setup required than mod_rewrite hacking.

P.S. Coming Soon: blacklist sharing for server farms (RBL anyone?), Apache 2.0 and IIS support. :-)

RE: dial-up

Submitted by djc on February 19, 2002 - 16:11.

neo, it's very unlikely that anyone would be scraping email accounts from a dial-up account due to the large amount of bandwidth it would take.
that said, this is why i mentioned it might be a good idea to put a mailer function in the perl script so the admin would know when an IP address was blocked, and could remove it at a later date if s/he was comfortable with that ;)

a similar function in a mod_perl module

Submitted by amoore on February 22, 2002 - 11:28.

This sounds like a very useful approach, and it reminds me of something I have seen before, perhaps at directnic.net's free hosting site. Anyway, I wrote a mod_perl module to perform a similar function. It's at http://cow.mooresystems.com/~amoore/home/logs/article/2002/02/21/01.html. I hope it is useful to someone. Thanks for the inspiration!

You don't have to block the bots

Submitted by efti on February 25, 2002 - 02:31.

As much as I would like to do something nasty to the spambot like feed it a never-ending HTML page, this may not always be a good idea. Some people suggested blocking wget, unknown User-Agents or certain IP addresses. Obviously this could lead to legitimate users denied access.

What you could try instead is give suspicious visitors a version of your content that doesn't contain any email addresses. For example dynamic pages could have code that displays or hides email addresses (for example based on the presence of an URL variable that gets added either by mod_rewrite or by the PHP/Perl/Whatever code itself). This way nobody would be denied any content but the spambots would still leave empty-handed.

Just a thought

maybe ?

Submitted by Junglee on March 11, 2002 - 02:45.

To be honest..i 've never installed Apache in my life... So i am not even sure if this is relevant or would make sense...here goes..
not so long back i did statistics in uni for a year... I was just wondering, wouldn't it be feasible to use some kind of correlation coefficient on the browsing/site navigation characteristics of a spambot and a normal user browsing the site to distinguish between the two? for e.g. a human user may not sequentially click on all the links of a page at fixed intervals of time... he/she might do it randomly, as against a bot browsing the same site...

Re: maybe?

Submitted by Robotcop on March 14, 2002 - 23:53.

The problem with catching spiders by watching request rates is that sometimes many users share the same IP address, such as AOL users, and they may appear to be a really fast spider when a group of them happen to be surfing your site at the same time.

Check out this discussion at Slashdot about Robotcop, which covers a lot of ideas on blocking spiders and some of the problems with those techniques.

using the firewall.

Submitted by sharp on June 26, 2002 - 23:02.

I have it set up on my web server so that if a person tries to make use of an NT vulnerability, or uses a spider that disobeys robots.txt and goes after a "bait file", their ip gets listed for all to see and within a minute is completely locked out from my server with the help of ip chains. It took a little shell scripting, but I got it so a 404 script would id if the bot was following the bait (which doesn't exist), leave a message in a little file to a script, which crontab would execute as root every minute. I firewalled the spambot which got my attention before I got this idea, so as of this writing I only have people trying NT vulnerabilities in the "public logfile," but if you want to see what it can do, go to sharph.net/abuse.txt.

Evolt.org is an all-volunteer resource for web developers made up of a discussion list, a browser archive, and member-submitted articles. This article is the property of its author; please do not redistribute or use elsewhere without checking with the author.