The week before last I was in Montreal, Quebec, Canada for PHP Quebec Conference 2005. One of the presentations I gave was about using PHP scripts instead of normally static files like robots.txt, and it gratifyingly raised a couple of eyebrows.
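To make the idea concrete, here's a minimal sketch (the file name and log path are placeholders of mine, not anything from the talk). It assumes the web server is configured to hand requests for robots.txt to the script, say via an Apache rewrite rule, and it records who asked before emitting the same rules a static file would:

```php
<?php
// robots.php -- hypothetical dynamic replacement for a static robots.txt.
// Assumes the web server maps requests for /robots.txt to this script.

header('Content-Type: text/plain');

// Record who asked, so requests to forbidden areas can be checked against
// this log later. (A flat file keeps the sketch simple.)
$entry = sprintf("%s %s %s\n",
    date('c'),                               // ISO-8601 timestamp, no spaces
    $_SERVER['REMOTE_ADDR'],                 // the IP address
    $_SERVER['HTTP_USER_AGENT'] ?? '-');     // the client ID
file_put_contents('/var/log/robots-requests.log', $entry,
    FILE_APPEND | LOCK_EX);

// Emit the rules -- which can now vary according to who is asking.
echo "User-agent: *\n";
echo "Disallow: /private/\n";
```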
Making robots.txt a dynamic file can have a definite impact on performance, but it lets you answer queries a little more specifically. For example, three nasty types of robots are:
1. Those which scan without even asking for robots.txt;
2. Those which request robots.txt and either ignore it or use its information to crawl through areas which it explicitly forbids; and
3. Those which look at the various clients defined in robots.txt and then come back disguised as one with more access.
I'm not sure if there actually are any Type III robots out there, but if I can think of it I'm sure some perp somewhere already has as well.
And since robots.txt is a dynamic file, it can handle Type III robots: the fact that a client asked for the file and then fell into a trap in a forbidden area is a dead giveaway. :-)
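Here's a rough sketch of such a trap (the paths and the log format follow the sketch above and are assumptions of mine): a script sitting inside the Disallow'ed area checks whether the requesting address previously fetched robots.txt, and if so records it as a perp.

```php
<?php
// trap.php -- hypothetical trap page inside the Disallow'ed /private/ area.
// A robot that read robots.txt and honoured it should never request this.

$addr    = $_SERVER['REMOTE_ADDR'];
$entries = @file('/var/log/robots-requests.log',
    FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [];

$sawRobotsTxt = false;
foreach ($entries as $entry) {
    // Each entry: "<timestamp> <address> <client ID>" (format assumed above).
    $fields = explode(' ', $entry, 3);
    if (isset($fields[1]) && $fields[1] === $addr) {
        $sawRobotsTxt = true;
        break;
    }
}

if ($sawRobotsTxt) {
    // It asked for robots.txt and then walked into a forbidden area anyway:
    // a Type II or Type III robot. Record it for later blocking.
    file_put_contents('/var/log/robot-perps.log', date('c') . " $addr\n",
        FILE_APPEND | LOCK_EX);
}

// Serve something bland either way, so the trap isn't obvious.
header('Content-Type: text/html');
echo "<html><body><p>Nothing to see here.</p></body></html>";
```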
The nasty part of the whole process is reliably identifying the perp. In these days of cable and DSL connections with dynamically-assigned addresses, the IPA (IP address) doesn't pin anyone down, and the client ID (the User-Agent request header field) is unreliable because it's easily spoofed and often good software is used to do bad things.
Of course, sometimes the client ID is a dead giveaway, such as User-Agent: EmailSiphon, or the IPA might be in a fixed range known to be assigned to people of debatable virtue, but for the most part a lot of eyeballing is going to be necessary to settle on rules. The area is ripe for enhancing the response scripts to understand client/IPA combinations, requests per time t, and other heuristics. And I'm even working on some of those. :-)
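As one example, a first cut at the requests-per-time-t heuristic might look like this (the thresholds and the use of APCu for the counters are assumptions of mine, purely illustrative):

```php
<?php
// throttle.php -- hypothetical requests-per-time-t check for the response
// scripts to include. Thresholds and APCu storage are sketch assumptions.

function looks_like_a_flood(int $maxRequests = 30, int $windowSeconds = 60): bool
{
    // Key on the client/IPA combination, since neither alone is reliable.
    $key = 'hits:' . $_SERVER['REMOTE_ADDR'] . ':'
         . md5($_SERVER['HTTP_USER_AGENT'] ?? '-');

    // APCu provides a shared counter whose TTL matches the time window.
    if (apcu_add($key, 1, $windowSeconds)) {
        return false;                 // first request in this window
    }
    return apcu_inc($key) > $maxRequests;
}

if (looks_like_a_flood()) {
    http_response_code(429);          // Too Many Requests
    exit;
}
```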