Real Web 2.0: Battling Web spam, Part 1

Assess visitor behavior and control workflow to reduce spam

Uche Ogbuji, Partner, Zepheira, LLC
Uche Ogbuji is Partner at Zepheira, LLC, a solutions firm specializing in the next generation of Web technologies. Mr. Ogbuji is lead developer of 4Suite, an open source platform for XML, RDF, and knowledge-management applications, and its successor Akara. He is also lead on the Jacqard agile methodology for team Web development, and the Versa RDF query language. He is a Computer Engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia.

Summary:  Spam is one of the biggest threats a modern Web developer faces. The "bad guys" grow more sophisticated every year in how they vandalize pages and proliferate ads across any Web 2.0 page they can reach. To make matters worse, spam is increasingly used to distribute malware. The arms race is on, and Web developers need to know what basic tools are available to battle spam on their Web sites. This two-part series provides a thorough guide to anti-spam techniques. This first article explains how to assess whether a visitor is a spammer and how to organize site workflow to discourage spam.


Date:  02 Dec 2008
Level:  Intermediate

In 1994 the National Science Foundation ended its prohibition of commercial speech on the Internet. At that time, e-mail and Usenet were the main forums for communication, and simple publishing systems such as Gopher were trying to establish a wider user base. The Web had barely emerged. That year a law firm named Canter & Siegel posted the first mass commercial spam on Usenet, hiring a Perl programmer to generate advertisements for its "Green card lottery" services, then blasting these to over 6,000 newsgroups. They became celebrities as well as villains, and promptly launched themselves as an outfit producing spam for others, promoting the benefits of spamming, and writing books on "Internet marketing". Since then no online forum has been safe from inappropriate commercial advertising. As the Web has become an increasingly important venue for online discussion, and as Web 2.0 techniques have made it possible for people to write to the Web as much as to read it, the problem of spam has grown with a vengeance.

Sometimes Web spam is a minor nuisance, but more often it's a pernicious problem. Spammers are rarely content to post one or two messages. Instead they flood forums until their messages overwhelm the desired subject matter. Sometimes the spam includes pornographic or antisocial messages that discourage participation. Most search engines devalue pages with such messages, or with links to sites associated with spam, which means that spam can hurt your search engine ranking. And the final injury is that Web publishers end up wasting resources on spam-fighting, taking time away from other tasks.

Web spam comes in many forms, including:

  • Spam articles and vandalized articles on wikis
  • Comment spam on Weblogs
  • Spam postings on forums, issue trackers, and other discussion sites
  • Referrer spam (when spam sites pretend to refer users to a target site that lists referrers)
  • False user entries on social networks

Dealing with Web spam is very difficult, but a Web developer neglects spam prevention at his or her peril. In this article, and in a second part to come later, I present techniques, technologies, and services to combat the many sorts of Web spam.

Spammer behavior

Online contributions by people follow irregular patterns of timing and content. Spam is generally created by programs, like the Perl script that posted to 6,000 Usenet groups for Canter & Siegel. Sometimes you can use the mechanical patterns of these programs against the spammers. In cases where registration is required, you can use a variety of warning signs to flag an account for further review. One such sign is an account that comes from a top-level domain, such as .ru, .br, or .biz, that's strongly associated with spammers. If an account is flagged, you might put a hold on the first few postings by that account and review them before releasing the hold.
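As a minimal sketch of this kind of flagging, assuming you know the host name a registration came from, the check could look like the following. The function names, the set of domains, and the hold count of three are hypothetical choices for illustration. In practice a top-level domain is only one warning sign among several.

```python
# Hypothetical set of top-level domains to treat as a warning sign.
# The article names .ru, .br, and .biz as examples.
SUSPICIOUS_TLDS = {"ru", "br", "biz"}


def needs_review(hostname):
    """Return True if the host name ends in a flagged top-level domain."""
    tld = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    return tld in SUSPICIOUS_TLDS


def hold_posting(account_flagged, approved_posts, hold_count=3):
    """Hold a flagged account's first few postings for moderator review.

    hold_count is an assumed value; tune it to your moderation capacity.
    """
    return account_flagged and approved_posts < hold_count
```

For example, needs_review("mail.example.biz") flags the account, and hold_posting(True, 0) then keeps its first posting in the review queue.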

Flood control

Wiki, Weblog, and forum spammers often send dozens of requests within seconds, and you can at least limit damage by limiting the number of requests one user or IP address can send within an interval. This technique is called flood control. You should make sure you control requests across the entire site in this way, and not just one page.
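One way to sketch flood control is a sliding-window counter keyed by user or IP address and shared by every page of the site. The class name, the limit of five requests, and the ten-second interval below are assumptions for illustration. A production site would more likely keep this state in a shared store than in process memory.

```python
import time
from collections import defaultdict, deque


class FloodControl:
    """Allow at most max_requests per client within a sliding interval."""

    def __init__(self, max_requests=5, interval=10.0):
        self.max_requests = max_requests  # assumed limit
        self.interval = interval          # assumed window, in seconds
        self._history = defaultdict(deque)

    def allow(self, client_id, now=None):
        """Record a request and return False if the client is flooding."""
        now = time.monotonic() if now is None else now
        recent = self._history[client_id]
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] >= self.interval:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True
```

Calling allow() with the same instance from the handler for every page is what makes the limit site-wide rather than per page.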

You can apply these behavior assessment techniques, and others in the next section, to prevent other types of abuse on the Web besides spam. For example, if you host a Webmail service, spammers might try to mass-create accounts from which they can send spam, or if you host an online auction site, people might write programs to manipulate the auction process. Anywhere you have a system for registering new users, some of those users might try to automate registration for mischievous ends. More broadly, these techniques help Web developers ensure that people using their services are who they say they are.

Workflow control

Most spam comes from robots, which have characteristic quirks in the way they use a site, and you can exploit these quirks. Figure 1 summarizes the difference between typical human and typical robot workflow in submitting a message or comment for addition to your site.


Figure 1. Typical human versus robot Web workflow

You can stop a great deal of spam by detecting typical robot workflow, which goes straight to the POST request.

Form variation

The first question to ask is whether the requester actually loaded the main form. One way to check is to vary aspects of the form, in ways that are either visible or invisible to the user. The usual invisible trick is to vary the names of form fields. For example, you might serve one user the main content text area under a field name such as content_081010_68_45_76_45, built by appending the date, the user's IP address, and other such modifiers to a base name. A submission that lacks the expected field name probably did not come from the form you served. You might also refuse to let a variant field name be reused. You can make visible variations as well, by requiring some users to check a box when posting, perhaps when other patterns indicate that the IP address being used is suspicious.
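A minimal sketch of the invisible variation follows. It assumes the example name content_081010_68_45_76_45 encodes a two-digit year, month, and day followed by the IP address 68.45.76.45, which is one plausible reading. The function names and the used_names set are hypothetical. Because a name built only from the date and IP is predictable and repeats within a day, a real scheme would add further modifiers.

```python
from datetime import date


def variant_field_name(base, client_ip, day):
    """Build a per-visitor field name from a base name, date, and IP."""
    stamp = day.strftime("%y%m%d")  # assumed YYMMDD layout
    return "_".join([base, stamp] + client_ip.split("."))


def find_content(form, client_ip, day, used_names):
    """Return the posted content only if it arrived under the expected name.

    form is a dict of submitted fields; used_names is a set of variant
    names that have already been accepted once.
    """
    name = variant_field_name("content", client_ip, day)
    if name in used_names or name not in form:
        return None  # wrong or reused field name: treat as suspicious
    used_names.add(name)
    return form[name]
```

With these assumptions, variant_field_name("content", "68.45.76.45", date(2008, 10, 10)) produces the field name used in the example above.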

The nonce test

A nonce is a hard-to-guess value generated for each page view that contains a form. Then you require that nonce as one of the fields in form submission. If the robot goes straight to POSTing the form submission, it will not have loaded the HTML page, and so it will not know the expected value for the nonce. There are many ways you can tune the specifics of a nonce test. In all cases, you want to be sure that a nonce cannot be reused, otherwise spammers will easily circumvent it. You might want to use the IP address and the date or full time at which the page request was made in order to generate and validate the nonce.
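The nonce test can be sketched as below, under several assumptions: the nonce is an HMAC over the IP address, the issue time, and a random salt; it expires after an hour; and redeemed nonces are remembered in an in-memory set. All names here are hypothetical, and a real deployment would need shared, expiring storage for the used nonces.

```python
import hashlib
import hmac
import secrets
import time

SECRET_KEY = secrets.token_bytes(32)  # hypothetical per-server secret
MAX_AGE = 3600                        # assumed nonce lifetime, in seconds
_used = set()                         # redeemed nonces (grows without bound here)


def _sign(client_ip, issued, salt):
    message = f"{client_ip}|{issued}|{salt}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def issue_nonce(client_ip, now=None):
    """Create a nonce to embed in the form served to client_ip."""
    issued = int(time.time() if now is None else now)
    salt = secrets.token_hex(8)
    return f"{issued}:{salt}:{_sign(client_ip, issued, salt)}"


def check_nonce(nonce, client_ip, now=None):
    """Accept a nonce once, from the same IP, within MAX_AGE seconds."""
    now = time.time() if now is None else now
    try:
        issued_text, salt, signature = nonce.split(":")
        issued = int(issued_text)
    except ValueError:
        return False  # malformed nonce
    expected = _sign(client_ip, issued, salt)
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False  # forged, or issued to a different IP
    if not 0 <= now - issued <= MAX_AGE:
        return False  # expired
    if nonce in _used:
        return False  # replayed
    _used.add(nonce)
    return True
```

Signing the IP address and time means the server can validate a nonce without storing it when it is issued. Only redeemed nonces need to be remembered, to block reuse.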

Detecting JavaScript

Some robots try to fool nonce tests by loading the page and reading the nonce. A careless spammer will just have the robot send the POST immediately, now that it has the required information. You might want to flag submissions made so soon after the form was loaded that it's improbable a human could have filled out the form inputs in that time. This is related to flood-control tests. Even when robots are more careful, they often do not run any JavaScript on the form page, and you can use this against them. For example, rather than placing the nonce in the actual content of the form page, you might deliver it through a secondary JavaScript request from the form page, triggered by an event handler that fires the first time the user enters data into the main content field. Figure 2 illustrates.


Figure 2. Using JavaScript to improve the nonce test

The main problem with this is that some users disable JavaScript, and in fact some corporate policies require this, so you should provide secondary means for users to authenticate themselves, perhaps by using form variation only on users not invoking JavaScript. On the other hand, some spammers launch their attacks using scripts within browser engines, so they do invoke JavaScript. You might use JavaScript tests as a weighting factor in spam detection, rather than as an absolute determinant.
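To illustrate treating these tests as weighting factors, the sketch below combines the timing check, the JavaScript check, and an account flag into a single score. The weights, the five-second minimum fill time, and the review threshold are all invented for illustration and would need tuning against real traffic.

```python
MIN_FILL_SECONDS = 5.0   # assumed minimum time a person needs to fill the form
REVIEW_THRESHOLD = 3.0   # assumed score at which a posting is held for review

# Hypothetical weights: no single signal is treated as absolute proof.
WEIGHTS = {
    "too_fast": 3.0,
    "no_javascript": 2.0,
    "flagged_account": 1.5,
}


def spam_score(form_loaded_at, submitted_at, ran_javascript,
               account_flagged=False):
    """Add up the weights of the warning signs a submission shows."""
    score = 0.0
    if submitted_at - form_loaded_at < MIN_FILL_SECONDS:
        score += WEIGHTS["too_fast"]
    if not ran_javascript:
        score += WEIGHTS["no_javascript"]
    if account_flagged:
        score += WEIGHTS["flagged_account"]
    return score


def should_hold(score):
    """Hold the posting for review when the combined score is high enough."""
    return score >= REVIEW_THRESHOLD
```

With these numbers, a user who has merely disabled JavaScript scores 2.0 and is accepted, while a submission that also arrives one second after the form loaded scores 5.0 and is held.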


Test the H.Q. (human quotient)

One very popular approach to fighting Web spam is a variation on the nonce test. You include with the post form a visual test of some sort that's easy for most people but tricky for a robot. The most popular such method is CAPTCHA. In this test you present an image representing some alphanumeric characters that the user must read and type into a form field. Spammers use optical character recognition to defeat CAPTCHA, so there is usually significant distortion to the images. Figure 3 is an example of a CAPTCHA image.


Figure 3. A CAPTCHA image requiring the user to answer "smwm"

Even with the mild distortion in this example, spammers have been able to defeat such CAPTCHAs, so their adversaries have resorted to greater and greater levels of distortion, as in Figure 4.


Figure 4. A CAPTCHA image requiring the user to answer "following finding"

There are several problems here. For one thing, as CAPTCHA gets more distorted to defeat spammers, it gets harder for people to read. And some people, such as the visually impaired, cannot see CAPTCHA images at all, so this technique exhibits poor accessibility, and can even be illegal in some circumstances. Despite these difficulties, CAPTCHA has become one of the most popular spam prevention devices.

Textual confirmation

A similar technique that doesn't suffer such accessibility problems is to present the user with a random textual question. If the site has a specific domain, your questions might require basic knowledge of that domain. So on a medical news and information site, you might ask a question such as, "What is the primary organ associated with breathing?", expecting an answer of "lungs". With this technique it's important to have a wide variety of such questions, and to be careful with the wording so that people can generally answer, but robots cannot generally guess.
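A textual confirmation check might be sketched as follows. The first question comes from the example above. The second question, the sets of accepted answers, and the function names are hypothetical, and a real question bank would need to be much larger.

```python
import secrets

# Each entry pairs a question with the answers to accept for it.
QUESTIONS = [
    ("What is the primary organ associated with breathing?",
     {"lungs", "lung", "the lungs"}),
    # Hypothetical second question, added only for illustration.
    ("Which organ pumps blood through the body?",
     {"heart", "the heart"}),
]


def pick_question():
    """Choose a question at random; return its index and text."""
    index = secrets.randbelow(len(QUESTIONS))
    return index, QUESTIONS[index][0]


def check_answer(index, answer):
    """Compare the answer, ignoring case, extra spaces, and end punctuation."""
    normalized = " ".join(answer.lower().split()).strip(".!?")
    return normalized in QUESTIONS[index][1]
```

Normalizing the answer before comparing it keeps the test forgiving for people, who might type "Lungs." or "the lungs", without making it easier for a robot to guess.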


Wrap up

You can add subtlety to all the techniques I discussed in this article by forcing people to preview their postings. Just by adding this extra workflow stage you'll catch some spammers, and if you do it carefully, for example using JavaScript to automate preview for some users, the level of inconvenience shouldn't stifle contribution. You could apply CAPTCHA, form variation, nonces, and more, depending on workflow, which makes things even more complicated for spammers in a manner that most legitimate users won't even notice.

Assessing behavior and managing workflow are enough to reduce spam, but usually not enough to eliminate it. For example, some spammers hire people to circumvent all the controls discussed in this article (sometimes called a "mechanical turk" attack). They seek people in places where labor is cheap and pay them to go to a target site and leave a spam message by hand. To address mechanical turks, as well as the most sophisticated spam robots, you need to harness the power of the large community of people who hate spam as much as you do. I'll explain how you do so in the next installment of this series.




