Using Apache to stop bad robots


Daniel Cody


The honest truth about bad robots

For just about as long as the commercial Internet has existed, SPAM email has been the bane of users worldwide. The harder we try to fight the spammers and keep our email addresses out of their hands, the smarter they get and the harder they fight back. One example of people's attempts to fight back is the large number of addresses like joe@NOSPAM.email.com and NO.mary.SPAM@REMOVESPAM.mary.com that you find on Usenet and web-based communities these days. Worse yet, many people hold back from contributing to online discussions for fear that their email address will be available for evil web spiders (I call them Spiderts - a web spider with a Catbert-type personality) to harvest and exploit from mailing list archives.

As one who runs (and uses!) evolt's mailing lists, keeping thousands of people's email addresses out of the tentacles of Spiderts has always been a big concern of mine. At first, it was easily remedied by using the %40 'trick'. Instead of writing archives with an easily recognizable email address (abuse@aol.com for example), I had our mailing list software write all email addresses as abuse%40aol.com.

This still allowed for a fairly easy to read address for humans while maintaining the ability to click the mailto: link and have one's associated email client create a new message with the correct email address entered. The Spiderts wouldn't recognize abuse%40aol.com as a valid email address and therefore not harvest it.

This was a fairly good solution until its use became widespread, at which point the creators of the Spiderts tweaked their unholy creations to recognize abuse%40aol.com as a harvestable email address and siphon it as well. As if it couldn't get worse, it was also becoming apparent that the newer generations of Spiderts don't play by the rules set out for web spiders, and would disregard any "Disallow: /" entries in the robots.txt file. In fact, I've seen Spiderts that only go for what we specifically tell them not to! What's a webmaster to do?!?

Setting the trap

The first step in our war against the Spiderts is to identify them. There are many techniques to find out who the bad bots are, from manually searching your access_logs to using a maintained list and picking which ones you want to exclude. At the end of the day it's getting the robot's name - its User-Agent - that's important, not how you get it. That said, here's a method I like that targets the worst offenders.

Add a line like this to your robots.txt file:

Disallow: /email-addresses/

where 'email-addresses' is not a real directory. Wait a decent amount of time (a week to a month), then go through your access_log file and pick out the User-Agent strings that accessed the /email-addresses/ directory. These are the worst of the worst - those that blatantly disregard our attempts to keep them out and fill our inboxes with crap about lowering mortgage rates. An easy way to get a listing of the User-Agents that did access your fake directory, given a combined-format access_log, is the following command (my examples use grep and awk; win32 folks can check out the Cygwin tools):

grep '/email-addresses' access_log | awk -F'"' '{print $6}' | sort -u

This searches the access_log file for any occurrences of /email-addresses, splits each matching line on double quotes and prints the 6th field (in the combined format that is the quoted User-Agent string; splitting on whitespace and printing $12 would return only its first word), then sorts the results so that only unique entries show. Plain uniq is not enough on its own, because it only collapses duplicate lines that are adjacent. More on grep and awk can be found at the GNU software page.
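If you also want to know which IP addresses the trap visitors came from and how often they came back, the same log search can be wrapped in a small shell function. This is only a sketch: the function name trap_visitors is made up for illustration, and it assumes the Apache combined log format and the /email-addresses trap path used in this article.

```shell
# Hypothetical helper (not part of Apache or the original article): list each
# client IP and User-Agent that requested the trap directory, with a hit count.
# Assumes the Apache combined log format.
trap_visitors() {
    log_file="$1"
    trap_path="${2:-/email-addresses}"
    # Split each line on double quotes: field 1 begins with the client IP,
    # field 2 is the request line, field 6 is the User-Agent.
    awk -F'"' -v path="$trap_path" '
        index($2, path) {
            split($1, head, " ")      # head[1] is the client IP
            print head[1] "\t" $6
        }
    ' "$log_file" | sort | uniq -c | sort -rn
}
```

Call it as trap_visitors access_log (the log file name is whatever your server uses). Only the request line is checked for the trap path, so a referrer that happens to mention /email-addresses does not count as a hit.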

Now that we have their identities, we can put the mechanisms in place to keep these hell-spawns away from our email addresses.

Hook, line and sinker

Here are a couple of the User-Agents that fell for our trap, which I pulled out of last month's access_log for lists.evolt.org:

Wget/1.6
EmailSiphon
EmailWolf 1.00

To learn more about these and other web spiders, check out http://www.robotstxt.org.

Now that we have the names of what these Spiderts go by, there are a couple ways to block them. You can use mod_rewrite as described here, but mod_rewrite can be difficult to configure and learn for many. It's also not compiled into Apache by default, which makes it slightly prohibitive.

We're going to use the environment variable features found in Apache to fight our battle, specifically the 'SetEnv' directive. This is a simple alternative to mod_rewrite and almost everything needed is compiled in to the webserver by default. In this example, we're editing the httpd.conf file, but you should be able to use it in an .htaccess file as well.

The first lines we add to our config file are:

SetEnvIfNoCase User-Agent "^Wget" bad_bot
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot

The 'SetEnvIfNoCase' directive simply sets an environment (SetEnv) variable called 'bad_bot' if (SetEnvIf) the 'User-Agent' string begins with Wget, EmailSiphon, or EmailWolf, regardless of case (SetEnvIfNoCase). The leading ^ in each pattern is what anchors the match to the start of the string. In English, anytime a browser with a name starting with 'wget', 'emailsiphon', or 'emailwolf' accesses our website, we set a variable called 'bad_bot'. We'd also want to add a line for the User-Agent string of any other Spidert we want to deny.
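Once the list of trapped User-Agents grows, typing one directive per name gets tedious. As a rough sketch, a shell function can turn a list of names (one per line on standard input) into SetEnvIfNoCase lines. The function name make_bad_bot_rules is hypothetical. It escapes regular-expression metacharacters so that a name such as 'EmailWolf 1.00' is matched literally.

```shell
# Hypothetical helper: read User-Agent names on stdin, one per line, and
# print one SetEnvIfNoCase directive per name.
make_bad_bot_rules() {
    # 1) drop blank lines
    # 2) backslash-escape regex metacharacters so names match literally
    # 3) wrap each name in a directive anchored to the start of the string
    sed -e '/^[[:space:]]*$/d' \
        -e 's/[][\.*^$+?(){}|]/\\&/g' \
        -e 's/.*/SetEnvIfNoCase User-Agent "^&" bad_bot/'
}
```

For example, printf '%s\n' 'Wget' 'EmailSiphon' 'EmailWolf 1.00' | make_bad_bot_rules prints three directives that can be pasted into the config file. Names containing double quotes would still need handling by hand.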

Now we tell Apache which directories to block the Spiderts from with the <Directory> directive:

<Directory "/home/lists/public_html/archive/">
        Order Allow,Deny
        Allow from all
        Deny from env=bad_bot
</Directory>

In English, we're denying access to the /home/lists/public_html/archive directory if the environment variable called 'bad_bot' exists. Apache will return a standard 403 Forbidden error message, and the Spidert gets nothing! Since most of the email addresses of members are found in lists.evolt.org/archive, this should suffice, but you'll probably want to adjust a couple things to fit your needs.

There are many resources on the Web for discovering the User-Agent strings of Spiderts. The difficult part until now has been the process of actually blocking them from your server. Thankfully, Apache provides us with the ability to easily block those harbingers of SPAM from our servers and most importantly, our online identities.

Dan lives a quiet life in the bustling city of Milwaukee, WI. Although he founded what would become evolt.org in 1998, he's since moved on to other projects and is now the owner of Progressive Networks, a Zimbra hosting company based in Milwaukee.

His personal site can be found at http://dancody.org/

Staying one step ahead

Submitted by astro38 on August 24, 2001 - 09:22.

Great article and a great idea! Of course this probably won't work forever. Once the 'bot writers catch wind of it they'll probably start using User-Agent strings from common browsers to sneak through.

Simon

another idea

Submitted by adaw on August 24, 2001 - 09:57.

Another idea (less practical for huge lists, but workable for mailto addresses on web pages) is converting the entire email address to encoded ASCII character codes. The spam harvesters may be able to recognise %40 as being an at sign, but they would have less chance of recognising the entire email address. Of course, I could be wrong :-) me@idesign.net.au
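One way to read this suggestion is to write every character of the address as an HTML numeric character reference. A minimal sketch in shell follows. The function name entity_encode is made up, and the output is only as safe as the harvester is lazy, since any bot that decodes entities will still see the address.

```shell
# Hypothetical helper: print the argument with every byte written as an
# HTML numeric character reference, e.g. "@" becomes "&#64;".
entity_encode() {
    printf '%s' "$1" |
        od -An -tu1 |                 # dump the bytes as decimal numbers
        tr -s ' ' '\n' |              # one number per line
        sed -e '/^$/d' -e 's/.*/\&#&;/' |
        tr -d '\n'                    # join the references back together
    printf '\n'
}
```

The result can be dropped into both the href of a mailto: link and the link text. Browsers decode the references, so human visitors see and click a normal address.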

Slight change to make it work in an .htaccess file

Submitted by djc on August 24, 2001 - 09:58.

To get this working within an .htaccess file, use the following syntax (my fault about the error!):

SetEnvIfNoCase User-Agent "^Mozilla" bad_bot

Deny from env=bad_bot

That will block out all User-Agents that match 'mozilla', which probably isn't the best idea :)

spring the trap immediately

Submitted by ironclad on August 26, 2001 - 01:46.

Great article!

astro38 is correct, evil spiderts will start adopting disguises, and in fact it appears many already do, according to my logs.

Stopping the most pernicious and egregious spiderts can be easy though:

  1. use some tool that does what mod_rewrite does on your server
  2. insert the Disallow: /email_addresses/ line into your robots.txt file
  3. every time some visitor requests that explicitly disallowed directory, you rewrite the request to a CGI that logs their IP address
  4. and finally you configure your htaccess/mod_rewrite files to deny access to any visitor whose IP address is in that log file.

Thus the spidert is kick/banned instantly, rather than much later when you get around to perusing your log files ... by which time it's too late
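Steps 3 and 4 above can be sketched in shell. Everything named here is an assumption for illustration: the function names record_trap_hit and ban_rules, and the idea of keeping the banned addresses in a plain text file, one per line. REMOTE_ADDR is the standard CGI variable holding the client's address.

```shell
# Hypothetical CGI helper for step 3: append the requesting IP address to a
# ban list, once. Returns non-zero if REMOTE_ADDR does not look like an IP.
record_trap_hit() {
    ban_file="$1"
    ip="${REMOTE_ADDR:-}"             # set by the web server for CGI programs
    # Accept only digits and dots so nothing odd reaches the ban list.
    case "$ip" in
        ''|*[!0-9.]*) return 1 ;;
    esac
    # Add the address only if it is not already listed.
    grep -qxF "$ip" "$ban_file" 2>/dev/null || printf '%s\n' "$ip" >> "$ban_file"
}

# Hypothetical helper for step 4: turn the ban list into Apache access rules.
ban_rules() {
    sed 's/^/Deny from /' "$1"
}
```

The output of ban_rules is a set of 'Deny from' lines in the same style as the article's <Directory> block. How they get into the live configuration (an .htaccess file rewritten by the CGI, or an included file plus a reload) depends on the server's permissions and is left open here.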

Webtechniques article on the same issue

Submitted by mathowie on August 26, 2001 - 15:40.

In this month's Webtechniques magazine, Steve Champeon has a story on how to block spam bots with mod_rewrite

RE: webtechniques article..

Submitted by djc on August 27, 2001 - 08:08.

I did link to that article in the 'Hook, line and sinker' subsection, and although it's a decent article, mod_rewrite can be difficult to set up for those looking for a quick solution, and isn't compiled into Apache by default..

Bad bots masquerading as normal browsers?

Submitted by CodeBitch on August 28, 2001 - 03:30.

Folks, it's already happening. Mozilla/4.0 (compatible; MSIE 4.0; Windows NT; ....../1.0 ) is a bot. This browser tag hit MacEdition last week, pumping up the share of MSIE 4.0 (not 4.01) dramatically. It was clearly spidering through the site, based on what the raw log files look like. It was a single IP address that didn't resolve to a hostname. There's also a Win95 version that looks like a bot, too, but I'm less sure about that one.

another tactic that can be used by the spammers !

Submitted by cbird on August 28, 2001 - 05:18.

Spammers might already be using another technique for their evil deeds: randomly generated User-Agent names! This problem can be tackled too, by giving access only to recognised User-Agents, but that falls into the same pitfalls discussed above...

Write Email addresses using javascript

Submitted by pfreitag on October 9, 2001 - 15:21.

You can also use JavaScript's document.write() method to make your address hard to harvest.

document.write("fake");
document.write("\x40");  // "@" written as a hex escape
document.write("address.com");

I wrote a ColdFusion custom tag called cf_AntiSpam that takes care of all this for you. You can get it here: Anti Spam

re: Write Email addresses using javascript

Submitted by djc on October 9, 2001 - 15:51.

Using JavaScript probably isn't the best way to go for a couple reasons.. Spiderts don't process JavaScript (or CSS, or anything client side for that matter), and those people who don't have JavaScript enabled (or JS browsers) will be out in the cold too.. Granted, it's a small percentage, but a fair amount of people nonetheless if you have a decent amount of traffic..

A CF tag to do it would be a good bet - anything that can be done server side should be done server side IMO :)

Why not give them what they ask for ? :-)

Submitted by seanl on October 18, 2001 - 23:48.

Fighting the 'bots and trying to keep them from your email addresses is great, but you are still in the situation where they want something you have. Rather than focus on denying them access to your site, why not give their owners an incentive to change their ways ?

Create some bait pages on your site with lots of bogus email addresses. Disallow them in robots.txt. If you can, detect the 'bot and redirect them to the bait pages.

When the 'bots owner starts getting complaints from their customers, maybe they will wise up and play nicely.

If everyone did this then the net would not be so friendly for badly behaved 'bots.

RE: giving them what they ask for :)

Submitted by djc on October 19, 2001 - 00:01.

seanl -

That's pretty much what I do in the Hook, line, and sinker section of the article.. By capturing the user-agents and IPs of the Spiderts that *blatantly* disregard the robots.txt file, it's like shooting fish in a barrel..

In the next installment of this article, I'm working on a script that grabs the netblock of a bot that goes against the robots.txt file, does an ARIN lookup on that block, and emails the administrator of that block about the problem.. Comments have been made that any bot can switch its user-agent string, which is true. If a Spidert does that though, it's more than likely also going to run through the parts of a site that you *specifically* tell it not to visit in the robots.txt file. When it does that, it's a lot easier to block its user-agent, email the admin of its netblock, or block its class C IP block altogether.

It's like a honeypot for black-hats if you think about it.. And that's one of the *best* ways to find the problem Spiderts and block them out, without blocking any good-natured bot :)

An article I wrote a few years ago has more...

Submitted by erikschorr on October 20, 2001 - 05:34.

More ideas on turning the tables on these spammers and their email harvesters. If you're familiar with CGI and/or perl programming, or just want to read about some of my ideas on this, get http://arpa.org/txt/bugspray.txt - use the username 'guest' and password 'arpa' when prompted.
I admit, it really needs to be updated, but the ideas and code (especially the bogus-email-page CGI) are still valid.

-Erik Schorr

RE: Why not give them what they ask for ? :-)

Submitted by pfreitag on October 20, 2001 - 23:15.

I think that's a great idea to feed them bogus email addresses, only instead of just making up a fake domain, use their IP address, so when they start spamming, their spam bot gets flooded. Or use addresses like abuse@localhost so their mail server's abuse account gets all the spam. Add the spamcop address on your bad robots page. How smart are these bots? It would be fairly easy to code an infinite link loop to trap them in. Kind of like the LaBrea tar pit idea for stopping worms.

pfreitag...

Submitted by erikschorr on October 21, 2001 - 00:44.

pfreitag, the program example in my article does exactly this. It generates pages of random text, random email addresses, and seemingly random "links". Each link is to itself, via a different URL. Being installed on a machine running multiple websites is ideal for this, since you can use an alias to make http://web-foo.com/whatever, http://web-bar.net/whatever, http://other-site.org/whatever, etc, all point to the same CGI, to make it appear to the email bot that it's spidering multiple websites. The email bot can follow each link for eternity, and never get "out" of the loop.
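The general shape of such a page generator can be sketched in a few lines of shell and awk. This is not the program from the article mentioned above. The function name poison_page, the ?page= link format, and the use of the reserved .invalid top-level domain (so that no real mailbox is ever hit) are all assumptions made for this illustration.

```shell
# Hypothetical sketch: print an HTML page of random fake addresses, each
# followed by a link that leads back to another generated page.
# $1 = number of addresses (default 10), $2 = random seed (default: shell PID)
poison_page() {
    count="${1:-10}"
    seed="${2:-$$}"
    awk -v n="$count" -v seed="$seed" 'BEGIN {
        srand(seed)
        letters = "abcdefghijklmnopqrstuvwxyz"
        print "<html><body>"
        for (i = 0; i < n; i++) {
            name = ""
            for (j = 0; j < 8; j++)
                name = name substr(letters, int(rand() * 26) + 1, 1)
            printf "<a href=\"mailto:%s@example.invalid\">%s@example.invalid</a><br>\n", name, name
            # Every link leads to another freshly generated page.
            printf "<a href=\"?page=%s\">more</a><br>\n", name
        }
        print "</body></html>"
    }'
}
```

A CGI wrapper would print a Content-Type: text/html header and a blank line before calling poison_page. Because every link leads to another generated page, a bot that follows links blindly keeps collecting worthless addresses.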

I used to have a working example of this on my website, but since I moved it to a new machine, I've been too lazy to get it working again. I'll try to get it working soon.

I hope to have it at http://arpa.org/raid/ or http://arpa.org/blackflag/ again soon.


-Erik Schorr

sticky honeypot

Submitted by rleir on October 23, 2001 - 07:09.

Just a suggestion. Apache needs a module which would work like LaBrea
-- http://hts.dshield.org/LaBrea/ --
which acts as a sticky honeypot. When a bad_bot is trying to access the fake /email-addresses/ page, LaBrea would 'delay' it for a long time.

Here is a quote:
LaBrea is a program that creates a tarpit or, as some have called it, a "sticky honeypot". LaBrea takes over unused IP addresses on a network and creates "virtual machines" that answer to connection attempts. LaBrea answers those connection attempts in a way that causes the machine at the other end to get "stuck", sometimes for a very long time.

wget

Submitted by donarb on October 24, 2001 - 00:45.

wget is a *nix utility. I wouldn't immediately ban it. It does have uses that are not hostile. It even has an option for the user to honor robots.txt.

My solution is to never put any email address anywhere on a site. If a user wants to contact someone, they go to a page with a custom script, type their email and the script is responsible for delivering it. This also has the benefit of centralizing all relevant email addresses in one place, easier to update should an address ever need to be changed.

Don

Another solution

Submitted by amos_b_haven on October 24, 2001 - 21:25.

I sometimes symlink my robots.txt file to a large .iso file, since I don't pay for bandwidth. My theory is that in some situations, I don't want to have people view my site who have any need for a robots.txt file. (They are in a class of users I'd rather just annoy, since I can't get rid of them.)

You've neglected to mention.

Submitted by Shatai on October 25, 2001 - 05:12.

Those of you who said that 'smart' bad bots would use common strings like Mozilla seem to have forgotten the author's trap. Do you honestly think that the average user is going to seek out the email-addresses pseudo folder? :). If you actually spend your day looking to 'explore' those folders, you are a sad, sad person.

the waiting game

Submitted by lawngnome on October 30, 2001 - 20:28.

One thing to try is to fake out an evil robot by creating lag. Basically you would set up a trap that sends them into a subsection of your site that keeps making more resources to look at, while at the same time getting slower to load. Not only do they get "stuck" in your slow-loading trap, but you also stop them from spidering off to someone else's site. Think of it - after a long day of spidering while little Billy is at school: "WTF?!? 12 emails in 3 hours!!!"

Missing the point

Submitted by meheler on November 8, 2001 - 15:19.

Shatai -- you're missing the point. It doesn't matter if it's painfully obvious that a spider is specifically trying to reach the /email-addresses folder; no webmaster I know would shut out 90% of his visitors by blocking Mozilla/5.0 (compatible; MSIE 5.5; Win98) -- or whatever the string is -- if an evil spider switches to using it.

Fact is, evil spiders have historically been pretty stupid and easy to detect. And to suggest that wget should be blocked is ridiculous. wget is a fantastic tool that honest people can use, and this article suggests that it should be blocked? Just because some guy decided to call wget externally instead of using internal routines.

In short: think before you speak.

Simple Solution

Submitted by J42 on November 9, 2001 - 09:30.

All of your solutions seem workable, but overly complex. Why not write a server-side script which converts any detected email addys into an image? (Surely it must be trivial to convert a font into a simple greyscale bitmap?) That way, no spam bot can ever touch it, and lazy people will just have to type the address into their favourite email app.

Re: Simple Solution

Submitted by mwarden on November 9, 2001 - 22:34.

Allow me to be the first to say: ewwwwwww. ;-)

Firstly, that adds a good deal of processing for every page that has an email address on it (could EASILY be every page)

Secondly, that's X more requests the browser has to make and X more things it must download.

Thirdly, you lose any benefit of font/color styling with CSS. What if you want to change from gray arial font to dark green verdana? You gotta change your image-producing script.

Fourthly, it's bad karma to create text-only images, especially when there's an alternative

Just my $0.02. Thanks.

...

Submitted by J42 on November 9, 2001 - 22:51.

Well, being a greyscale or even a black and white image, the background colour could be transparent, so that takes care of the background colour problem. As for the text colour, I don't know enough about HTML to answer that, but I'm sure there's a way around it. Also, it can't be THAT taxing on the server to produce a simple black and white image.

The only other idea I had was a database of email servers that are closed relays, which doesn't validate open relay servers. It's sort of the opposite approach to blocking all hosts which spam: instead, block _everyone_ except validated anti-spam-friendly servers, then have email clients (optionally) use that database as a filter. You could also add your friends or *.domain.com to a local list that supplements the database, so if you DO know people whose servers aren't in the valid email server database, you can still message them. And as soon as you send email to an address, it gets added to your local exceptions list as well, so if you email anyone, they can email back.

More info on Apache and bad robots

Submitted by killough on November 23, 2001 - 08:05.

I put together this page describing the methods I use to stop bad robots, since there is apparently a lot of interest in the topic lately:

http://www.leekillough.com/robots.html

No single method is perfect, so I present many different methods that I use in combination. They should complement the methods described here and elsewhere.

Enjoy!

Problems with the suggested .htaccess solution

Submitted by gandalf on November 30, 2001 - 09:00.

djc's article is a good start to making life harder for the evil spiderts. Die, spambots, die. I don't want to see Brittany naked. All my body parts are of adequate proportions, and I do not worry about herbal cures for anthrax.

However, I must point out for the sake of completeness that adding User-Agent IDs to the .htaccess file does not have exactly the same effect as adding User-Agent IDs to the httpd.conf file. A custom .htaccess file with specifications like
SetEnvIfNoCase User-Agent "^WhateverUglyRobotNameHere" bad_bot
will create a 500 server error on a secure server under the https:// protocol. The error log will have a comment something like
SetEnvIfNoCase not allowed here.

Clarification

Submitted by gandalf on December 1, 2001 - 06:55.

The 500 server error is generated only if running a version of Apache older than version 1.3.13. Control of User-Agents by httpd.conf or by .htaccess works the same if the version of Apache is 1.3.13 or newer.

RE: problems with .htaccess

Submitted by djc on December 2, 2001 - 14:57.

gandalf -
Thanks for the heads up about the probs with pre-1.3.13 installations of Apache, something I would never have thought of!! Thanks!

Errr....

Submitted by Numbski on December 6, 2001 - 00:48.

The spam bots will have an easy time getting around this. Using Perl as an example, any program that uses libwww can set its User-Agent string:

use LWP::UserAgent;
use HTTP::Cookies;

my $ua = LWP::UserAgent->new;
my $cookies = HTTP::Cookies->new;    # Create a cookie jar
$ua->cookie_jar($cookies);           # Enable cookies
$ua->agent("Numbski's Annoying Leeching Prog/8.0");

That last line could very easily be changed to something like this:

$ua->agent("Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; .NET CLR 1.0.3328)");

Then what are you going to do? You'd be filtering all Windows 2000 users running IE5.5.

Sorry, just ain't gonna fly. If I wanted to be purely evil I could quite easily write a spambot of under 100 lines that would evade all those tricks. :\

A Better Mousetrap

Submitted by JWSmythe on December 27, 2001 - 17:42.

Sorry, that just isn't going to work very well.

I am the SysAdmin for Voyeurweb. What you're doing for email spiders, we were using against password scanners years ago. It worked for a few months.

When I started fighting against them, I'd see that they were using "scanner_x/v1.0" or whatever. Sure, you can block those, but they very quickly adapt to your solution. Soon enough, you'll see very very legitimate User-Agent strings coming in. I've been watching them scan my servers, and see different User-Agent strings on every request. So, even if I blocked one (or 1000), they'd come in with a new one next time.. What happens when you have to block "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)"? Poof, you just killed every MSIE 5.5 user.

A better solution I've started using is this.. Find a good way to identify the bad user.. You've already found it. You've laid the trap of /email-addresses/ . They're following the links on your pages, trying to find Email addresses. What if you had a link on several pages to trap.shtml (or trap.cgi if that pleases you, but they may block it).. Hide it from view. Make it so it's impossible to click. But, the spiders will see it in the HTML, and try to visit it...

Now for the Linux solution. Anyone else will have to adjust one line for their OS..

We know their IP. $ENV{REMOTE_ADDR} . Set a firewall rule against them.

/sbin/ipchains -A input -p tcp -s $ENV{REMOTE_ADDR}/32 -d 0/0 -j DENY

Good news and bad news.

The good news: I've given you all the tools you'd need.

The bad news: this won't work as user "nobody". Nobody can't set firewall rules. You'll have to find your own solution from here.. Maybe you can write a temp file with the IP for a cron job to see and set the firewall rule. Maybe you can suid your script (not a good idea)..

What I've also seen done very successfully is watching your logs. If you know that the most a reader can see within an hour is 30 pages, and you rotate your web server's log files every hour, then you can watch for an excess of 30 requests from a particular IP. If that's exceeded, block the IP. Scanning once every minute or two isn't too bad for a log that's only ever going to be one hour old.
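The request-count check described here is easy to sketch with awk. The function name heavy_hitters and the default limit of 30 are illustrative assumptions taken from the numbers above. The only thing the sketch relies on is that the client address is the first whitespace-separated field of each log line, as it is in Apache's common and combined formats.

```shell
# Hypothetical helper: print "count ip" for every client that made more
# than $2 requests (default 30) in the given log file, busiest first.
heavy_hitters() {
    log_file="$1"
    limit="${2:-30}"
    awk -v limit="$limit" '
        { hits[$1]++ }                # $1 is the client address
        END {
            for (ip in hits)
                if (hits[ip] > limit)
                    print hits[ip], ip
        }
    ' "$log_file" | sort -rn
}
```

Run from cron against the current hourly log, the output gives the candidate addresses to block. A busy proxy can exceed any fixed limit, so the list deserves a look before it is fed to a firewall.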

If you're scanning the logs, you have all kinds of options. You can watch for the unclickable page to be requested, instead of the CGI option. You can watch for too many requests. You can do almost anything you want.. Read your logs frequently, and try to identify trends.

You'll probably want to unblock these at an interval too.. AOL proxies get used very very frequently, by both scanners and AOL users. It seems the AOL users get upset when they can't get to the site too.. If you never unblock them, you'll end up blocking every single AOL proxy, therefore every single AOL user.. [Insert obligatory AOL comment here]
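Expiring old blocks is simple if each ban is stored with the time it was set. The sketch below assumes a hypothetical ban file with one "epoch-seconds ip" pair per line, and the function name prune_bans is likewise made up. It prints only the entries that are still younger than the given age.

```shell
# Hypothetical helper: print the ban entries that have not yet expired.
# $1 = ban file with lines of the form "<epoch seconds> <ip>"
# $2 = maximum age in seconds
# $3 = current time in epoch seconds (defaults to now, via date +%s)
prune_bans() {
    ban_file="$1"
    max_age="$2"
    now="${3:-$(date +%s)}"
    awk -v now="$now" -v max_age="$max_age" 'now - $1 < max_age' "$ban_file"
}
```

A cron job could run something like prune_bans bans.txt 86400 > bans.new && mv bans.new bans.txt (file names are placeholders) and then rebuild the firewall rules from the surviving entries.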

If you know one particular network is used frequently to spider you, block them. I know of a network in China that has IPs in it blocked daily. I haven't ever had a legitimate login from there. That would be a candidate for permanent blocking.

Don't worry, this is an ongoing battle. As long as there's something restricting access, someone will try to get through (or around) it. The original author's ideas were great a few years ago.. This will be pretty much useless in another year or two, I'm sure. They'll find better techniques to abuse us...

RE: a better mousetrap

Submitted by djc on December 27, 2001 - 18:49.

All good points JWSmythe.. The follow-up to this article, which I'm working on right now, will deal with the concerns you and others have brought up - changing user-agents.

Once the offending bot enters the 'honey pot', like the made-up /email-addresses directory, their IP, etc. will be in the access_log. My solution is to have a cron job running a Perl or shell script which will grep the log file at regular intervals for clients entering the honey pot, grab the IP address of the bot, email an administrator, and automatically add a line to the httpd.conf file (or an included one) so that the IP too is given the bad_bot variable. Apache will then return a 403 for that IP address. It's not foolproof, but nothing is. :( Combined with other techniques though, it should provide another barrier to spam-harvesting spiderts.
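Until the follow-up article appears, here is a rough shell sketch of the log-scanning half of that idea. It is not the author's script. The function name trap_ips_to_rules is invented, and it assumes the combined log format and the /email-addresses honey pot path. It relies on Remote_Addr being one of the request attributes that SetEnvIf can test.

```shell
# Hypothetical helper: find every client IP that requested the honey pot
# and print one SetEnvIf line per IP that tags it as bad_bot.
trap_ips_to_rules() {
    log_file="$1"
    trap_path="${2:-/email-addresses}"
    awk -F'"' -v path="$trap_path" '
        index($2, path) {             # $2 is the quoted request line
            split($1, head, " ")
            print head[1]             # client IP
        }
    ' "$log_file" |
        sort -u |
        sed -e 's/\./\\./g' \
            -e 's/.*/SetEnvIf Remote_Addr "^&$" bad_bot/'
}
```

The output could be written to a file that httpd.conf pulls in with an Include directive, followed by a graceful restart so Apache rereads it. The emailing of an administrator and the netblock lookup are left out of this sketch.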

agency against-

Submitted by star9_local on January 1, 2002 - 13:46.

What really needs to happen is for web developers at large to post spider and bots IP addresses to a list and get the cooperation of large ISPs and null route them all worldwide. Or better yet make dead routes for them and pump them into route-servers worldwide. Kilonox

Bot Poisoning is good also

Submitted by timgray on January 4, 2002 - 11:14.

Be sure to add a bot poison page. I have a Perl script I picked up somewhere (probably from freshmeat) that generates pages upon pages of fake email addresses, thus filling the bot with tons of crap email addresses. If you couple the article's suggestion with that, not only can you ferret out the baddies, but you have the satisfaction of knowing that the bot harvested about 100 or so junk email addresses.

The solution - no e-mail addresses!

Submitted by Eeyore on January 18, 2002 - 12:34.

Guys, I think you are missing the point. Even if you detect that bot "Joe" is running through your server, you detect it AFTER it does it once - it already won! You have to keep the bot from EVER getting an address, and the only way I know of to do that is to not put the addresses in the HTML. I wrote a script last week in PHP that lets you set up a list of names and addresses and then displays the NAME on the web page but uses both the name and the address on the e-mail message. The only way to get the e-mail addresses from it is if someone hacks your server, and then you have other, more serious problems. The source to the script is at:

http://www.arkie.net/~scripts/mailme/

Even better

Submitted by artie11 on February 4, 2002 - 05:23.

The best way to find spiders in a log file would be to actually look for accesses to the robots.txt file, no? A pretty simple and ingenious way to pick these insidious bots up. aRTie11@noemailaddressprovidedbyrequest.com.au.tw

Based on an assumption

Submitted by MartinB on February 5, 2002 - 10:21.

Artie - that assumes that the spiders are GoodBots and actually read it at all... by definition, we're not talking about GoodBots...

Part II

Submitted by djc on February 18, 2002 - 17:14.

For those interested, the follow up to this article has been posted. Stopping Spambots II - The Admin Strikes Back.

It addresses most of the problems about this article's implementation that were brought up.. Thanks for all the feedback, look forward to more of it! :)

Options an ordinary user can take.

Submitted by speleolinux on February 18, 2002 - 21:35.

Hi all, I have learnt lots from reading all these methods to stop spam. Many of these methods are being implemented by web admins, but it's helpful if the user that just has a web page can help in the war too by disguising their email address in the first place. I noticed some people using emails like "Remove the animal from my email: person@horse.nocompany.com", so I wrote a little CGI script to do that automatically for my own web page and to have it be fun at the same time. It's at http://www.speleonics.com.au/mikes/spamfooler.php Hopefully it will reduce the spam I get and it might be useful for others. Use it :-) Best wishes Mike

This is a *bad* idea...

Submitted by alane on February 27, 2002 - 15:20.

For example, wget (I'm a contributing developer to that project) will, by default, honor the robots.txt file. You have to specifically tell it to ignore it if that's what you want. If it's only grabbing what's disallowed, then someone wrote a script that gets the robots.txt, parses it, and then fires up wget again with those specific urls.

Wget is a generic http/ftp site mirroring program. It's also just useful to grab large files, since it will retry if the connection goes south on you. Labelling wget as a "bad_bot" is unfair. It's unscrupulous [anal sphincters] who specifically misuse it that are the problem.

I run wget, when I use it, with a user-agent string that can't be blocked without cutting off 1000s of Windows users, just because of webmasters who do obnoxious things like you are advocating. If I want to get a copy of some software docs that are only in HTML on a web site, then I have to resort to hiding the identity of the program if a webmaster has done what you suggest.

Just because the spammers are sociopaths is no reason for webmasters to behave in an equally offensive manner.

As another example of what not to do, there was an article in MSDN magazine about using statistical analysis to detect robots by inter-fetch timing. The author advocated banning the class C network containing the robot, since it'll probably come back with a different address next time due to DHCP. So for each twit, he cuts off 253 innocent users.

I added a feature to wget to randomize the delay between fetches in response to his article.
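
The 253 figure for a banned class C network can be checked with Python's standard `ipaddress` module. This is only an illustration: the network is a documentation range and the robot's address is made up. A class C network is a /24, which has 254 usable host addresses, so banning it to stop one robot shuts out the other 253.

```python
import ipaddress

# A "class C" network is a /24; 192.0.2.0/24 is a documentation range
# used here purely as an example.
network = ipaddress.ip_network("192.0.2.0/24")
robot = ipaddress.ip_address("192.0.2.45")  # hypothetical offending address

usable_hosts = list(network.hosts())  # excludes network and broadcast addresses
bystanders = [host for host in usable_hosts if host != robot]

print(len(usable_hosts), "usable addresses,", len(bystanders), "of them not the robot")
```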

WPOISON: a *good* solution

Submitted by alane on February 27, 2002 - 15:25.

See WPOISON Home Page for a solution that works.

It traps email-harvesters in an infinite loop of web pages containing links that actually link back to a new invocation of wpoison, and every email address on each page is faked.

They add tons of bad addresses to their lists, and once they hit the page, if they just keep spidering, they can never leave.

It also puts a delay in after generating each page to both slow the harvester down and prevent CPU burn on the server.
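
To make the idea concrete, here is a rough Python sketch of a trap page in the same spirit. It is not WPOISON's code: the function names, the `/trap/` URL prefix and the use of the reserved `.example` domain are assumptions made for illustration only.

```python
import random
import string

def fake_address(rng):
    """Build a made-up address under the reserved .example TLD."""
    user = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    host = "".join(rng.choice(string.ascii_lowercase) for _ in range(6))
    return f"{user}@{host}.example"

def poison_page(rng, trap_prefix="/trap/", addresses=5, links=3):
    """Return an HTML page of fake addresses plus links back into the trap."""
    lines = ["<html><body>"]
    for _ in range(addresses):
        addr = fake_address(rng)
        lines.append(f'<a href="mailto:{addr}">{addr}</a><br>')
    for _ in range(links):
        token = rng.randrange(10**8)
        # Each link points at another trap URL, so a crawler never runs out.
        lines.append(f'<a href="{trap_prefix}{token}">more</a><br>')
    lines.append("</body></html>")
    return "\n".join(lines)

page = poison_page(random.Random(0))
print(page)
```

A real handler would serve a fresh page like this for every URL under the trap prefix and pause before responding, which is where the delay mentioned above would go.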

re: wget

Submitted by djc on February 27, 2002 - 15:43.

Thanks for the comments, alane. The second part of this article, Stopping Spambots II, moves on and provides examples of blocking the spambots based upon an IP address, and not upon User-Agents. That way, people who use tools like wget legitimately are not blocked, while people harvesting with a tool like wget are.

Thanks again :)

Help I used the Spider Trap and banned myself

Submitted by WolfPup on December 17, 2002 - 04:42.

I used the spider trap described in the above article http://www.leekillough.com/robots.html and after many days figured out how to make it work on my system. Now I'm banned from my test server. I've cleared the bad bots list (rewritemap bad ...) written by the Perl script, and deleted all the logs and even the runtime, but can't find where the server is drawing the info to ban me. There is no cache file that I know of, and it is not enabled in the config, but I still get banned. Can someone clue me in to where the server gets the info after the badbots (mapfile) file is cleared?

Follow up: I found it.

Submitted by WolfPup on December 17, 2002 - 22:13.

If anyone uses this spider trap and tests it, the solution I found to cure the ban is to change the address in the bad bots file. Not sure if this is the right way, but it worked. Deleting the file only caused it to be rebuilt and I was still getting banned, so I tried cheating it and it worked. I'm sure there is some way other than what I did, but I got frustrated and did a hit and miss on all the files.

Way to show email but stop bots from getting it...

Submitted by jefflee on October 3, 2003 - 06:01.

Create a jpg or gif file of your email address if you want to show it to your users. Yes, lynx users and text browser types will not see it, but I would say that this would stop many, many spammers. You would not be able to easily make that image a mailto: link, but you could send it to a cgi script that did.

Using JavaScript to protect e-mail from spammers

Submitted by julwh on October 9, 2003 - 20:59.

The JavaScript to show the email address to users:

<script type="text/javascript">
     // Build the address from parts so it never appears whole in the page source.
     var $name = "your_name";
     var $domain = "your_domain.com";
     var $email = $name + "@" + $domain;
     // Link text; assign $email here instead to display the address itself.
     var $link = "e-mail";
     document.write("<a href=\"mailto:" + $email + "\">" + $link + "</a>");
</script>

Why so much bad advice, and still online?

Submitted by balour on December 7, 2003 - 04:35.

I read this article, and I read only bad advice in it. I don't understand why the article is rated at 4.48/5. The only reason is the fact that this site is probably run by another pro-spammer group.

My comment on the real bad robots ..

Submitted by evoltisok on March 9, 2004 - 06:26.

This article sure covers an interesting subject, as all webmasters have to deal with unwanted robots and spiders, trash, the robots.txt file, etc. Most of the tips are kind of outdated, I think. I recently purchased a Perl script named Robot Control Pro 4.0, and it put an end to all unknown and unwanted spider visits, hackers, brute force attacks and exploits, all the shit, all at once. It even stops spam robots from collecting email addresses, and I can allow or deny any robot I want. It sends me email notifications on request, allowing me to monitor search engine behavior, etc. Great program: it really gives me full control over all robots, and I set the rules on what is or is not allowed, without complex, long .htaccess files. I do not remember the URL, but search for RCP or Robot Control Pro. Greets, Webmistress Mandy X

Halting a recent PHP worm in its tracks with thi

Submitted by buro9 on December 29, 2004 - 02:08.

Just drop this into your httpd.conf file
SetEnvIfNoCase User-Agent ".*lwp.*" bad_bot


   Order Allow,Deny
   Allow from all
   Deny from env=bad_bot

It checks, case-insensitively, for user-agents that contain 'lwp'. When one matches, it is denied access to PHP files. This can be dropped anywhere into an httpd.conf file. Note that you'd be stupid to then set your 403 file to be a PHP driven file! The point of this is to prevent the recent worm from taking advantage of a PHP vulnerability by denying it access to PHP files according to the file extension. This is a very good and cheap way to protect a whole server when you may not have been able to verify whether files in other virtual servers are vulnerable.

Tsk, stripped brackets... very wise but annoying.

Submitted by buro9 on December 29, 2004 - 02:11.

Can't evolt just script that?
SetEnvIfNoCase User-Agent ".*lwp.*" bad_bot

<FilesMatch "\.php">
   Order Allow,Deny
   Allow from all
   Deny from env=bad_bot
</FilesMatch>
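
For anyone unsure what these directives do to a given request, here is a small Python model of the decision. It is only a sketch of the logic as I read it, not Apache code; the function name and the sample user-agent strings are made up for illustration.

```python
import posixpath
import re

BAD_AGENT = re.compile(r"lwp", re.IGNORECASE)  # same effect as ".*lwp.*" with NoCase
PHP_FILE = re.compile(r"\.php")                # unanchored, like FilesMatch "\.php"

def is_denied(user_agent, path):
    """Return True when the request would be refused under these rules."""
    bad_bot = bool(BAD_AGENT.search(user_agent))
    in_files_match = bool(PHP_FILE.search(posixpath.basename(path)))
    # Order Allow,Deny: "Allow from all" is evaluated first, then a matching Deny wins.
    return in_files_match and bad_bot

for agent, path in [
    ("LWP::Simple/5.800", "/forum/index.php"),
    ("LWP::Simple/5.800", "/index.html"),
    ("Mozilla/4.0 (compatible)", "/forum/index.php"),
]:
    print(agent, path, "->", "403" if is_denied(agent, path) else "allowed")
```

Because the pattern is unanchored, file names such as page.php3 or old.php.bak fall inside the block as well.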

Evolt.org is an all-volunteer resource for web developers made up of a discussion list, a browser archive, and member-submitted articles. This article is the property of its author; please do not redistribute or use elsewhere without checking with the author.