Been doing some digging on this lately. Looking at my logs and at feedback from other folks who monitor this kind of thing, I'm starting to suspect that anywhere from 25 to 50% of the web traffic on my server (and probably yours) comes from bots or scrapers: not human visitors, and not even Google/Yahoo/MSN. Yes, that much. Half my freakin' traffic.
I've done some reading on how to stop this at the website level. You can set up honeypots and checks like:
- List a file in robots.txt that isn't linked to from anywhere else. Some bots will go there immediately, so anyone requesting that file is a bot and can be banned.
- Include a 1x1 pixel link as a honeypot. No human will see or click it, so anything that follows it is a bot.
- Check the speed of page requests, and whether the visitor is requesting CSS/image files and other stuff bots don't care about.
- Check usage at the top and bottom of every page. If, by the time the bottom of a page runs, the same visitor has already requested another page, then they're requesting multiple pages at the same instant: another bot to be banned.
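For anyone who wants to try those checks offline against their logs first, here's a rough Python sketch. Everything named in it is an assumption for illustration: the trap path, the thresholds, and the idea that you've already parsed your requests into (ip, time, path) tuples. It flags an IP if it hits the honeypot, requests pages too fast, or pulls a bunch of pages without ever fetching CSS/images.

```python
from collections import defaultdict

# Hypothetical values: the trap path and thresholds are illustrative assumptions.
HONEYPOT_PATH = "/trap/do-not-follow.html"
MAX_PAGES_PER_WINDOW = 10        # more pages than this inside the window looks automated
WINDOW_SECONDS = 5.0
MIN_PAGES_FOR_STATIC_CHECK = 5   # don't judge visitors who only loaded a page or two
STATIC_SUFFIXES = (".css", ".js", ".png", ".gif", ".jpg", ".ico")


def find_suspects(requests):
    """Return {ip: set_of_reasons} for clients that look automated.

    `requests` is an iterable of (ip, unix_time, path) tuples.
    """
    page_times = defaultdict(list)   # ip -> timestamps of non-static requests
    fetched_static = set()           # ips that requested at least one static file
    suspects = defaultdict(set)

    for ip, when, path in requests:
        path = path.split("?", 1)[0]  # ignore any query string
        if path == HONEYPOT_PATH:
            suspects[ip].add("honeypot")
        if path.lower().endswith(STATIC_SUFFIXES):
            fetched_static.add(ip)
        else:
            page_times[ip].append(when)

    for ip, times in page_times.items():
        times.sort()
        start = 0
        # Sliding window: count page requests within WINDOW_SECONDS of each other.
        for end in range(len(times)):
            while times[end] - times[start] > WINDOW_SECONDS:
                start += 1
            if end - start + 1 > MAX_PAGES_PER_WINDOW:
                suspects[ip].add("too_fast")
                break
        if ip not in fetched_static and len(times) >= MIN_PAGES_FOR_STATIC_CHECK:
            suspects[ip].add("no_static_assets")

    return dict(suspects)
```

The thresholds would need tuning per site; a visitor behind a caching proxy, for example, might legitimately skip the static files.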
However, I'd like to take some action at the server level instead of the website level, maybe via Apache or something. Is anyone doing anything like this at the server level?
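To make the question concrete, this is the kind of thing I have in mind: a small Python sketch that turns a list of banned addresses into an Apache block. It assumes Apache 2.4's `Require` syntax (older versions use different directives), and the function name and example addresses are made up for illustration.

```python
import ipaddress


def apache_deny_block(addresses):
    """Build an Apache 2.4 <RequireAll> block that refuses the given IPs or CIDR ranges.

    Assumes Apache 2.4 authorization syntax. Raises ValueError on an invalid address.
    """
    lines = ["<RequireAll>", "    Require all granted"]
    for text in sorted(set(addresses)):
        network = ipaddress.ip_network(text, strict=False)
        # Single hosts are written as a bare address, ranges in CIDR form.
        if network.num_addresses == 1:
            target = str(network.network_address)
        else:
            target = str(network)
        lines.append("    Require not ip " + target)
    lines.append("</RequireAll>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Documentation-range addresses used purely as examples.
    print(apache_deny_block(["203.0.113.5", "198.51.100.0/24"]))
```

The output could go in a file that every virtual host includes, so one ban list covers all the sites on the box.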
TIA
Bot/scrapers - stop them at the server level?
Re: Bot/scrapers - stop them at the server level?
There aren't a lot of good ways, or reasons, to do this, unfortunately.
The robots.txt file exists specifically so you can specify which files you do and do not want non-human crawlers to index. Simply filling it in correctly is a good start (instead of banning on read).
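As a quick way to check that a robots.txt says what you think it says, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The rules, the paths, and the "BadBot" name are hypothetical examples. Keep in mind that robots.txt is only advisory: it tells well-behaved crawlers what to skip, and a crawler that ignores it is not stopped by it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent, path in [("Googlebot", "/index.html"),
                        ("Googlebot", "/private/report.html"),
                        ("BadBot/1.0", "/index.html")]:
        print(agent, path, allowed(agent, path))
```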
Beyond this, crawlers generally grab only text, so load, bandwidth, etc. are usually of minimal concern. Blocking all automated crawlers will have a massive impact on your website's reach (search engines can't index your content) and is thus generally not recommended (or sought out).
Not so sure about that anymore, which is why I'm investigating. I had a look at one site recently, and a rough guess shows it was likely putting out 25-50 gigs in one month to scrapers. That's excluding the search engines: just places in Korea, China, and the old Iron Curtain countries that were ripping off the content.
I wasn't exaggerating on the 25-50%. That's what it's looking like: a quarter to half of the traffic running through your systems is stuff that folks would ban if they looked at it. Banning it would likely save you some money, and stop the problem I get every couple of weeks where some idiot runs so many MySQL threads that my server stops operating for half an hour.
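If anyone wants to check their own numbers, a rough Python sketch along these lines would do it. It assumes the access log is in Apache's common/combined format (client IP first, response size right after the status code), and it leaves the list of suspect IPs for you to supply; the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches the start of a common/combined log line: ip, ident, user, [time], "request", status, bytes.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (\d+|-)')


def bytes_by_client(lines):
    """Sum response bytes per client IP, skipping lines that don't parse."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        ip, size = match.groups()
        if size != "-":          # "-" means no body was sent
            totals[ip] += int(size)
    return totals


def share_of_traffic(totals, suspect_ips):
    """Fraction of all bytes served that went to the suspect IPs (0.0 to 1.0)."""
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return sum(totals[ip] for ip in suspect_ips) / total
```

Feed it an open log file, then pass in whatever IPs you've decided are scrapers to see what fraction of the bandwidth they account for.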
I think the reason it's not being done is because folks haven't realized just exactly how much this is going on - and it's growing. Two years ago I wouldn't have bothered.
Oh well, I guess I'll have to figure something out at the site level. One good thing about having a colo server instead of a single site is that I can coordinate this stuff across all my sites at once.