
Using Apache To Stop Bad Robots


Daniel Cody


The honest truth about bad robots

For just about as long as the commercial Internet has existed, SPAM email has been the bane of users worldwide. The harder we try to fight the spammers and keep our email addresses out of their hands, the smarter they get and the harder they fight back. One example of people's attempts to fight back is the large number of joe@NOSPAM.email.com, NO.mary.SPAM@REMOVESPAM.mary.com, etc. email addresses you find on Usenet and web-based communities these days. Worse yet, many people hold back from contributing to online discussions for fear that their email address will be harvested from mailing list archives and exploited by evil web spiders (I call them Spiderts: web spiders with a Catbert-type personality).

As one who runs (and uses!) evolt's mailing lists, keeping thousands of people's email addresses out of the tentacles of Spiderts has always been a big concern of mine. At first, it was easily remedied by using the %40 'trick'. Instead of writing archives with an easily recognizable email address (abuse@aol.com for example), I had our mailing list software write all email addresses as

abuse%40aol.com

This still allowed for a fairly easy-to-read address for humans while maintaining the ability to click the mailto: link and have one's associated email client create a new message with the correct email address entered. The Spiderts wouldn't recognize abuse%40aol.com as a valid email address and therefore wouldn't harvest it.
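For illustration (this markup isn't taken from the actual archives, just a sketch of the trick), a mailto link written this way might look like the line below; the browser decodes %40 back to @ when the link is followed:

<!-- %40 is the URL-encoded form of @, so the link still works when clicked -->
<a href="mailto:abuse%40aol.com">abuse%40aol.com</a>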

This was a fairly good solution until its use became widespread, at which point the creators of the Spiderts tweaked their unholy creations to recognize abuse%40aol.com as a harvestable email address and siphon it as well. As if that weren't bad enough, it was also becoming apparent that the newer generations of Spiderts don't play by the rules set out for web spiders, and will disregard any "Disallow: /" entries in the robots.txt file. In fact, I've seen Spiderts that only go for what we specifically tell them not to! What's a webmaster to do?!?

Setting the trap

The first step in our war against the Spiderts is to identify them. There are many techniques for finding out who the bad bots are, from manually searching your access_logs to using a maintained list and picking which ones you want to exclude. At the end of the day it's getting the robot's name - its User-Agent - that's important, not how you get it. That said, here's a method I like that targets the worst offenders.

Add a line like this to your robots.txt file:

Disallow: /email-addresses/
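(A quick aside that isn't spelled out above: a Disallow line has to sit inside a User-agent record to be valid, so if you're starting a robots.txt from scratch, a minimal file for this trap might look like the lines below, with User-agent: * applying the rule to every robot.)

# trap record - /email-addresses/ is a decoy, not a real directory
User-agent: *
Disallow: /email-addresses/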

Here, 'email-addresses' is not a real directory. Wait a decent amount of time (a week to a month), then go through your access_log file and pick out the User-Agent strings that accessed the /email-addresses/ directory. These are the worst of the worst - those that blatantly disregard our attempts to keep them out and fill our inboxes with crap about lowering mortgage rates.

An easy way to get a listing of the User-Agents that did access your fake directory (my examples use grep and awk; win32 folks can check out the Cygwin tools), assuming a combined access_log format, is the following command:

grep /email-addresses access_log | awk '{print $12}' | sort | uniq

This simply searches the access_log file for any occurrences of /email-addresses, prints the 12th column of its results (where $12 is the column of your access_log that contains the User-Agent string), then sorts the output and filters it down so only unique entries show (uniq only collapses adjacent duplicates, hence the sort). More on grep and awk can be found at the GNU software page.
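One caveat worth adding here (it isn't part of the command above): in the combined log format the User-Agent string is quoted and usually contains spaces, so $12 only captures its first word. If you want the whole string, a variant that splits each line on the quote characters instead would look something like this:

# assumes the standard combined log format, where the quoted User-Agent is the sixth quote-delimited field
grep /email-addresses access_log | awk -F'"' '{print $6}' | sort | uniq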

Now that we have their identities, we can put the mechanisms in place to keep these hell-spawns away from our email addresses.

Hook, line and sinker

Here are a couple of the User-Agents that fell for our trap, pulled out of last month's access_log for lists.evolt.org:

Wget/1.6

EmailSiphon

EmailWolf 1.00

To learn more about these and other web spiders, check out http://www.robotstxt.org.

Now that we know the names these Spiderts go by, there are a couple of ways to block them. You can use mod_rewrite as described here, but mod_rewrite can be difficult to configure and learn for many. It's also not compiled into Apache by default, which makes it slightly prohibitive.

We're going to use the environment variable features found in Apache to fight our battle, specifically the 'SetEnvIfNoCase' directive provided by mod_setenvif. This is a simple alternative to mod_rewrite, and almost everything needed is compiled into the webserver by default. In this example we're editing the httpd.conf file, but you should be able to use it in an .htaccess file as well.

The first lines we add to our config file are:

SetEnvIfNoCase User-Agent "^Wget" bad_bot

SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot

SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot

'SetEnvIfNoCase' simply sets an environment (SetEnv) variable called 'bad_bot' if (SetEnvIf) the 'User-Agent' string begins with Wget, EmailSiphon, or EmailWolf (the ^ anchors the match at the start of the string), regardless of case (SetEnvIfNoCase). In English: anytime a browser whose name starts with 'wget', 'emailsiphon', or 'emailwolf' accesses our website, we set a variable called 'bad_bot'. We'd also want to add a line for the User-Agent string of any other Spidert we want to deny.
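(A small aside that isn't required for the setup: mod_setenvif also provides BrowserMatchNoCase, which is simply shorthand for SetEnvIfNoCase User-Agent, so the same three rules could be written as the lines below.)

# equivalent shorthand for the three SetEnvIfNoCase lines above
BrowserMatchNoCase ^Wget bad_bot
BrowserMatchNoCase ^EmailSiphon bad_bot
BrowserMatchNoCase ^EmailWolf bad_bot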

Now we tell Apache which directories to block the Spiderts from with the <Directory> directive:


<Directory "/home/lists/public_html/archive/">

Order Allow,Deny

Allow from all

Deny from env=bad_bot

</Directory>

In English: we're denying access to the /home/lists/public_html/archive directory whenever the environment variable 'bad_bot' is set. Apache will return a standard 403 Forbidden error, and the Spidert gets nothing!

Since most of the email addresses of members are found in lists.evolt.org/archive, this should suffice, but you'll probably want to adjust a couple of things to fit your needs.
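(If you can't edit httpd.conf, the same rules should work from an .htaccess file dropped into the directory you want to protect. This is just a sketch, and it assumes your server's AllowOverride setting permits the FileInfo and Limit overrides that SetEnvIfNoCase and the Order/Allow/Deny lines need.)

# .htaccess in the directory to protect (e.g. the list archive)
# assumes AllowOverride includes FileInfo and Limit (or All) for this directory
SetEnvIfNoCase User-Agent "^Wget" bad_bot
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot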

There are many resources on the Web for discovering the User-Agent strings of Spiderts. The difficult part until now has been the process of actually blocking them from your server. Thankfully, Apache provides us with the ability to easily block those harbingers of SPAM from our servers and, most importantly, our online identities.

Dan lives a quiet life in the bustling city of Milwaukee, WI. Although he founded what would become evolt.org in 1998, he's since moved on to other projects and is now the owner of Progressive Networks, a Zimbra hosting company based in Milwaukee.

His personal site can be found at http://dancody.org/
