In 1994 the National Science Foundation ended its prohibition of commercial speech on the Internet. At that time, e-mail and Usenet were the main forums for communication, and simple publishing systems such as Gopher were trying to establish a wider user base. The Web had barely emerged. That year a law firm named Canter & Siegel posted the first mass commercial spam on Usenet, hiring a Perl programmer to generate advertisements for its "Green card lottery" services, then blasting these to over 6,000 newsgroups. They became celebrities as well as villains, and promptly launched themselves as an outfit producing spam for others, promoting the benefits of spamming, and writing books on "Internet marketing". Since then no online forum has been safe from inappropriate commercial advertising. As the Web has become a more and more important venue for online discussion, and as Web 2.0 techniques have opened up venues for people to write to the Web as much as to read it, the problem of spam has grown with a vengeance.
Sometimes Web spam is a minor nuisance, but more often it's a pernicious problem. Spammers are rarely content to post one or two messages; they flood forums until the spam overwhelms the desired subject matter. Sometimes the spam includes pornographic or antisocial messages that discourage participation. Most search engines will devalue pages with such messages, or with links to sites associated with spam, which means that spam can hurt your search engine ranking. And the final injury is that Web publishers end up wasting resources on spam-fighting, taking time away from other tasks.
Web spam comes in many forms, including:
- Spam articles and vandalized articles on wikis
- Comment spam on Weblogs
- Spam postings on forums, issue trackers, and other discussion sites
- Referrer spam (when spam sites pretend to refer users to a target site that lists referrers)
- False user entries on social networks
Dealing with Web spam is very difficult, but a Web developer neglects spam prevention at his or her peril. In this article, and in a second part to come later, I present techniques, technologies, and services to combat the many sorts of Web spam.
Online contributions by people come in irregular patterns of timing and content. Spam is generally created by programs, like the Perl script that posted to 6,000 Usenet groups for Canter & Siegel. Sometimes you can use the mechanical patterns of these programs against the spammers. In cases where registration is required, you can use a variety of warning signs to flag an account for further review. For example, you might flag an account that comes from a top-level domain strongly associated with spammers, such as .ru, .br, or .biz. If an account is flagged, you might put a hold on the first few postings by that account and review them before releasing the hold.
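As a rough illustration of flagging accounts at registration, here is a minimal Python sketch. It assumes the warning sign is read from the domain of the registrant's e-mail address, which is only one possible reading of the technique. The names `SUSPECT_TLDS`, `HOLD_FIRST_N_POSTS`, `is_flagged`, and `needs_review` are hypothetical, and the hold size of three postings is an arbitrary choice.

```python
# Hypothetical sketch: the TLD list and the hold size are illustrative assumptions.
SUSPECT_TLDS = {"ru", "br", "biz"}  # the examples named in the text
HOLD_FIRST_N_POSTS = 3              # assumed value for "the first few postings"


def is_flagged(email: str) -> bool:
    """Flag an account whose e-mail domain ends in a suspect top-level domain."""
    domain = email.rsplit("@", 1)[-1].strip().lower()
    tld = domain.rsplit(".", 1)[-1]
    return tld in SUSPECT_TLDS


def needs_review(email: str, posts_already_released: int) -> bool:
    """Hold a flagged account's early postings until a person has reviewed them."""
    return is_flagged(email) and posts_already_released < HOLD_FIRST_N_POSTS
```

A flag of this kind only routes postings to review; it does not reject anyone, so legitimate users from a flagged domain are delayed rather than blocked.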
Wiki, Weblog, and forum spammers often send dozens of requests within seconds, and you can at least limit damage by limiting the number of requests one user or IP address can send within an interval. This technique is called flood control. You should make sure you control requests across the entire site in this way, and not just one page.
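Flood control can be sketched in a few lines of Python. The version below is a sliding-window counter held in memory and keyed by user or IP address; the class name `FloodControl`, its parameters, and the default limits are all assumptions made for illustration, and a production site would more likely keep the counters in shared storage.

```python
import time
from collections import defaultdict, deque


class FloodControl:
    """Allow at most max_requests per client within a sliding window of seconds."""

    def __init__(self, max_requests=5, window=10.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # client key -> timestamps of recent requests

    def allow(self, client_key):
        """Return True and record the request, or False if the client is flooding."""
        now = self.clock()
        hits = self._hits[client_key]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Sharing one `FloodControl` instance across every form handler, rather than creating one per page, is what makes the limit apply to the entire site.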
You can apply these behavior assessment techniques, and others in the next section, to prevent other types of abuse on the Web besides spam. For example, if you host a Webmail service, spammers might try to mass-create accounts from which they can send spam, or if you host an online auction site, people might write programs to manipulate the auction process. Anywhere you have a system for registering new users, some of those users might try to automate registrations for mischievous ends. In general, these techniques help Web developers ensure that people using their services are who they say they are.
Most spam comes from robots, which have some characteristic quirks in their usage of a site that you can exploit. Figure 1 summarizes the difference between typical human and typical robot workflow in submitting a message or comment for addition to your site.
Figure 1. Typical human versus robot Web workflow
You can stop a great deal of spam by detecting typical robot workflow, which goes straight to the POST request.
The first question you ask is whether the main form was loaded by the requester. One way to check is by varying aspects of the form in ways that are visible or invisible to the user. The usual invisible trick is to vary the name of form fields. You might send the text area with the main content to each user under a different field name, built from dates, user IP, and other such modifiers on a base name. You might not want to allow a variant field name to be reused. You can also make visible variations by requiring some users to check a box when posting, perhaps if other patterns indicate that the IP being used is suspect.
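One way to vary the name of form fields is to derive a suffix from the requester's details with a keyed hash. The Python sketch below is an assumption about how this might be done, not a prescribed method: `SECRET_KEY`, `field_name`, and `extract_field` are hypothetical names, and the choice of HMAC-SHA256 truncated to 12 characters is arbitrary.

```python
import hashlib
import hmac
from datetime import date

# Assumption: a secret known only to the server, so robots cannot predict names.
SECRET_KEY = b"replace-with-a-server-side-secret"


def field_name(base: str, ip: str, day: date) -> str:
    """Derive a field name from a base name plus the requester's IP and the date."""
    message = f"{base}|{ip}|{day.isoformat()}".encode()
    tag = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()[:12]
    return f"{base}-{tag}"


def extract_field(form: dict, base: str, ip: str, day: date):
    """Return the submitted value only if it arrived under the expected name."""
    return form.get(field_name(base, ip, day))
```

When rendering the form you would emit `field_name("comment", ip, today)` as the text area's name. A robot that posts a hard-coded `comment` field gets `None` back from `extract_field`, and the submission can be dropped. Because this sketch varies the name only by day, a name stays valid for that IP until midnight; preventing reuse would require recording which names have already been consumed.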
The nonce test
A nonce is a hard-to-guess value generated for each page view that contains a form. You then require that nonce as one of the fields in the form submission. If the robot goes straight to POSTing the form submission, it will not have loaded the HTML page, and so it will not know the expected value for the nonce. There are many ways you can tune the specifics of a nonce test. In all cases, you want to be sure that a nonce cannot be reused; otherwise spammers will easily circumvent it. You might want to use the IP address and the date or full time at which the page request was made in order to generate and validate the nonce.
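The following Python sketch shows one possible shape for a nonce test, under several assumptions: the nonce is signed with HMAC-SHA256 over the IP address, the issue time, and a random salt; it expires after an hour; and used nonces are remembered in an in-memory set. The names `SECRET_KEY`, `MAX_AGE`, `make_nonce`, and `check_nonce` are hypothetical, and a real deployment would need shared, expiring storage for the used nonces.

```python
import hashlib
import hmac
import secrets
import time

SECRET_KEY = b"replace-with-a-server-side-secret"  # assumption: private to the server
MAX_AGE = 3600   # assumed lifetime of a nonce, in seconds
_used = set()    # nonces already accepted (in-memory for this sketch only)


def _sign(ip, issued_text, salt):
    message = f"{ip}:{issued_text}:{salt}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_nonce(ip, now=None):
    """Create a nonce to embed as a hidden field when the form page is served."""
    issued_text = str(int(time.time() if now is None else now))
    salt = secrets.token_hex(8)
    return f"{issued_text}:{salt}:{_sign(ip, issued_text, salt)}"


def check_nonce(nonce, ip, now=None):
    """Accept a nonce once, from the same IP, within MAX_AGE seconds of issue."""
    now = time.time() if now is None else now
    try:
        issued_text, salt, sig = nonce.split(":")
        issued = int(issued_text)
    except ValueError:
        return False  # malformed nonce
    expected = _sign(ip, issued_text, salt)
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False  # forged, or issued to a different IP
    if not 0 <= now - issued <= MAX_AGE:
        return False  # expired
    if nonce in _used:
        return False  # replayed
    _used.add(nonce)
    return True
```

The form handler would call `check_nonce` before doing anything else with the POST, and reject the submission if it returns `False`.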
Test the H.Q. (human quotient)
One very popular approach to fighting Web spam is a variation on the nonce test. You include with the post form a visual test of some sort that's easy for most people but tricky for a robot. The most popular such method is CAPTCHA. In this test you present an image representing some alphanumeric characters that the user must read and type into a form field. Spammers use optical character recognition to defeat CAPTCHA, so there is usually significant distortion to the images. Figure 3 is an example of a CAPTCHA image.
Figure 3. A CAPTCHA image requiring the user to answer "smwm"
Even with the mild distortion in this example, spammers have been able to defeat such CAPTCHAs, so their adversaries have resorted to greater and greater levels of distortion, as in Figure 4.
Figure 4. A CAPTCHA image requiring the user to answer "following finding"
There are several problems here. For one thing, as CAPTCHA gets more distorted to defeat spammers, it gets harder for people to read. And some people, such as the visually impaired, cannot see CAPTCHA images at all, so this technique exhibits poor accessibility, and can even be illegal in some circumstances. Despite these difficulties, CAPTCHA has become one of the most popular spam prevention devices.
A similar technique that doesn't suffer such accessibility problems is to present the user a random textual question. If the site has a specific domain, your questions might require basic knowledge of that domain. So on a medical news and information site, you might ask a question such as, "What is the primary organ associated with breathing?", expecting an answer of "lungs". With this technique it's important to have a wide variation of such questions, and to be careful with the wording so that people can generally answer, but robots cannot generally guess.
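A textual question check can be sketched as a small question bank plus a forgiving comparison. In the Python below, only the breathing question comes from the text; the second question, the accepted spellings, and the names `QUESTIONS`, `pick_question`, and `is_correct` are illustrative assumptions. A real bank would need far more entries to resist guessing.

```python
import random

# Hypothetical question bank: each question maps to its accepted answers.
QUESTIONS = {
    "What is the primary organ associated with breathing?": {"lungs", "lung", "the lungs"},
    "What organ pumps blood through the body?": {"heart", "the heart"},
}


def pick_question(rng=random):
    """Choose a question at random to show alongside the post form."""
    return rng.choice(list(QUESTIONS))


def is_correct(question, answer):
    """Compare the answer case-insensitively, ignoring surrounding whitespace."""
    accepted = QUESTIONS.get(question, set())
    return answer.strip().lower() in accepted
```

Accepting several spellings of each answer reflects the advice about wording: people should generally be able to answer on the first try, while a robot still has to understand the question.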
Assessing behavior and managing workflow are enough to reduce spam, but usually not enough to eliminate it. For example, some spammers hire people to circumvent all the controls discussed in this article (sometimes called a "mechanical turk" attack). They seek people in places where labor costs are cheap and pay them to go to a target site and leave a spam message by hand. To address mechanical turks, as well as the most sophisticated spam robots, you need to harness the power of the large community of people who hate spam as much as you do. I'll explain how you do so in the next installment of this series.
- If you think you can stomach the story, read about pioneering spammers Canter & Siegel. And find more background information in the Spam article.
- The Wikipedia article on CAPTCHA discusses history, variants, and implementations.
- Find a good gathering of useful information in the Wikipedia article on spam in blogs.
- Many developers of Weblog or forum software have great resource pages for fighting spam, including Six Apart (makers of Movable Type) and WordPress.
- In "Charming Python: Beat spam using hashcash", by David Mertz, learn about the hashcash technique for minimizing spam on wikis and such, in addition to e-mail.