Skip to main content

skip to main content

developerWorks  >  Linux  >

Charming Python: Beat spam using hashcash

If they want to send spam, make them pay a price

developerWorks
Document options

Document options requiring JavaScript are not displayed


Rate this page

Help us improve this content


Level: Intermediate

David Mertz (mertz@gnosis.cx), Developer, Gnosis Software, Inc.

09 Nov 2004

Built on the widely available SHA-1 algorithm, hashcash is a clever system that requires a parameterizable amount of work on the part of a requester while staying "cheap" for an evaluator to check. In other words, the sender has to do real work to put something into your inbox. You can certainly use hashcash in preventing spam, but it has other applications as well, including keeping spam off of Wikis and speeding the work of distributed parallel applications. In this article, you'll meet David's own Python-based hashcash implementation.

The hashcash.org Web site (see Resources) indicates that the main function of the hashcash system is as a spam-filtering protocol:

Hashcash is a denial-of-service counter measure tool. Its main current use is to help hashcash users avoid losing email due to content-based and blacklist-based anti-spam systems.

In my opinion, though, the technique has wider applicability than just e-mail. This article illustrates hashcash's relevance to e-mail filtering, but also proposes some other uses. And it introduces my own Python implementation of hashcash (which appears to be the first correct published Python version) that is now included on the hashcash.org site. David McNab created a Python implementation that does not use quite the same protocol as hashcash itself; some other developers have also created incomplete Python versions of parts of hashcash.

Before getting to these topics though, let's review what hashcash is.

The basics of hashcash

The inspiration for hashcash is the idea that some mathematical facts are difficult to discover but simple to verify. A well-known example is factoring large numbers (especially numbers with few factors). It is cheap (CPU cycles are money, after all) to multiply some numbers together to find their product, but much more expensive to find those factors in the first place.

RSA public-key cryptography is based on this property of factorization. Answering a factorization challenge is proof that the respondent has done a quantifiable degree of work (or obtained the factors surreptitiously from the person who generated the composite).

Factorization works well enough for interactive challenges. Say I have an online resource that I want you to symbolically pay for. I can send you a message that says "I will let you have this resource once you factor this number." Dilettantes cannot have my resource, only those who prove they have enough interest to use some CPU cycles in answering a challenge.

Non-interactive challenges

However, some resources cannot easily be interactively negotiated.

One resource I rather value is my e-mail inbox. Uninvited messages take a bit of my disk space and bandwidth, but worst of all, they demand my attention. I do not mind strangers writing to me, but I would like them to be somewhat serious in their intent to contact me personally with a message that is valuable to me. At minimum, I don't want them to be spammers who mail an identical message to me and ten million other people in the hope that a couple of us will buy a product or fall for a scam.

To achieve a non-interactive "payment," hashcash lets me issue a standing challenge to anyone who wants to e-mail me. You have to include a valid hashcash stamp in the headers of your message; specifically, one that includes my recipient address within the stamp.

The way hashcash poses a challenge is by asking "minters" to produce strings (stamps) that when hashed with the Secure Hash Algorithm, SHA-1, have a number of leading zeros in their hash. The number of leading zeros discovered is the bit value of a particular stamp. Given the uniformity and cryptographic strength of SHA-1, the only known way to discover a hashcash stamp of a given bit value b is by running SHA-1 2^b times on average.

However, verifying a stamp requires just one SHA-1 computation. For use in e-mail, a 20-bit value is currently the recommended price: Senders need to perform about a million trials to find a valid stamp, which takes less than a second on the most recent CPUs and compiled applications. And it still takes only a few seconds on relatively old machines.

While we're at it, let's quickly look at the power of the SHA algorithm before we continue discussing the basics of hashcash.

How strong is SHA?

In what may prove a significant event in the cryptographic world, a collision for SHA-0 was discovered (see Resources for a link to Pascal Junod's e-mail, which gives details on the actual collision). The attack that was used required on the order of 2^51 steps, substantially shy of the 2^80 or so steps (and storage space) we would expect for a brute-force construction of a collision (under the "birthday paradox"; see Resources for a link to more information on the birthday paradox and how it applies to hash functions).

Two things to keep in mind before worrying too much about this attack in relation to hashcash is that the approach attacks SHA-0, not SHA-1 (yet). The second relative assurance is that 2^51 steps is still over 9 CPU years on today's fastest CPUs. Even if a similar approach could apply to SHA-1, it is unlikely that a construction of false collisions would be cheaper than constructing even a very large number of 20-bit stamps (or even of 40-bit hashcash stamps).

Back to our previous discussion.

The hashcash (version 1) format

Just requiring a special SHA-1 hash value is not enough. We also want the stamp to be specific to the resource requested -- that is, a stamp for mertz@gnosis.cx should have a different application than one for someuser@yahoo.com. If not, a spammer could mint just one high bit-value stamp and use it everywhere.

Also, once minted, I don't want a stamp to be shared among every spammer who wants to send me mail. Therefore, hashcash takes two extra steps (or at least recommends them as part of the protocol):

  • First, stamps carry a date. A user may decide to consider stamps older than a certain age invalid.
  • Second, a hashcash client may, and probably should, implement a double spend database.

A double spend database is one in which each stamp may be used exactly once; if it is received a second time, it is considered invalid (much as with a postage stamp after it is marked as processed). In full detail, a hashcash (version 1) stamp looks like the following code:

1:bits:date:resource:ext:salt:suffix

The stamp consists of seven fields.

  1. The version number (version zero is simpler, but has some limitations).
  2. The claimed bit value. If the stamp does not really hash with the purported leading zero bits, it is not valid.
  3. The date (and time) a stamp was minted. Stamps in the future and those too far in the past may be judged invalid.
  4. The resource for which a stamp is minted. Perhaps an e-mail address, but also possibly a URI or other named resource.
  5. Extensions that a specialized application may want. Any additional data could be placed in here, but in usage so far, this field is generally left empty.
  6. A random salt that distinguished this stamp from any other one minted for the same resource and date. For example, two different people may reasonably want to send e-mail to my same address on the same day. They should not be disqualified by my use of a double spend database. But if each of them uses a random salt, their complete stamps will differ.
  7. The suffix is the real work of the algorithm. Given the first six fields, a minter must try many sequential suffix values to produce a stamp that hashes with the desired number of leading zeros.

Now let's look at how hashcash works in e-mail.



Back to top


How hashcash works in e-mail (or intends to)

In an ideal world, all senders would include hashcash tokens in their messages; recipients would all check their validity upon receipt. But in real life, hashcash is not nearly so widely used. Nonetheless, starting to use hashcash (as either sender or recipient) will not break anything in existing e-mail tools. In other words, you have nothing to lose by using hashcash in e-mail.

To stamp an outgoing message, you simply add headers to your e-mail: one X-Hashcash header for each To: or Cc: recipient of the e-mail. For example, someone wishing to send me a message might include a header like this sample rfc2822 header:

X-Hashcash: 1:20:040927:mertz@gnosis.cx::odVZhQMP:7ca28

Obviously, MUAs (mail user agents), filters, or MTAs (mail transport agents) should do this work rather than requiring users to do it manually. Doing it manually, however, is not all that difficult, at least experimentally. Checking a stamp starts with just looking at its hash, like so:

$ echo -n 1:20:040927:mertz@gnosis.cx::odVZhQMP:7ca28 | sha
00000b50b85a61e7ba8ac4d5fed317c737706ae5

See the leading zeros (each hex digit is four bits). Of course, it is also worth checking that the resource is one you recognize (like one of your recipient addresses), that the stamp was not spent before, and that the date is current. Also, a valid stamp should have as many leading zeros as it purports to have (but you may decide to impose your own minimal price to let mail through: Twenty bits is semi-standard, though it could eventually change with Moore's Law).



Back to top


Why does this work?

It only takes a couple of seconds to mint a 20-bit stamp. Not a big price when you send a few dozen e-mails in a day. But a couple of extra CPU seconds per message is prohibitive to spammers who want to send millions of messages. There are only 86,400 seconds in a day. Even if spammers utilize zombie machines they have infected with trojans, requiring individualized hashcash stamps at least slows down the traffic out of those zombies. Checking a stamp, of course, takes a tiny fraction of a second.

On the other hand, adding hashcash minting and checking to your own MUA -- unlike some other anti-spam measures -- has no negative effect on anyone else. For recipients who do not use the protocol, it is just an extra header they can easily ignore. For senders who do not add hashcash stamps, recipients who check for X-Hashcash: just do not have anything to check. If senders do not add stamps, you are no worse off for checking; you just aren't better off, either.

A good MUA or spam-filtering system might whitelist e-mails with valid hashcash stamps. Or even more subtly, SpamAssassin gives more +ve points for more valid hashcash bits. To my mind, a hashcash-based approach to whitelisting is an improvement on the interactive challenge systems like TMDA -- challenge messages do not get lost on return, and senders do not forget to respond to challenges. The challenge response is right in the original message (as a hashcash stamp).



Back to top


Other uses for hashcash

Hashcash is most useful for non-interactive challenges. But there is no reason why it cannot be used in an interactive context as well. As more tools add support for hashcash -- especially multi-purpose applications like the Mozilla suite -- it becomes simpler to use hashcash neutrally across both interactive and non-interactive situations.

For example, if Thunderbird mailer gains API calls for hashcash computations, it should be straightforward to let its sibling Firefox Web browser respond to interactive challenges using the same API to produce hashcash stamps.

What is Wiki?

Wiki is the "simplest online database that could possibly work." It supports hyperlinks and possesses a simple text syntax designed for creating new pages and crosslinks between internal pages on the fly.

Wiki is server software that lets users freely build and edit Web page content using a browser, providing a kind of "open editing" that promotes an unusual group communication mechanism. It not only allows any user to edit the content of a page, it lets users edit the organization of contributions to a page or site.

To learn more about Wiki, follow the link to "What is Wiki," in Resources.

Protecting the Wiki

One non-e-mail context in which hashcash appears to be a good solution is in the rather spam-like defacement that Wikis sometimes suffer. Since Wikis are generally open for anyone to edit, a pox on the Wiki community is Wiki-crawling vandalization programs that add irrelevant commercial links to Wiki sites.

A Wiki I help maintain recently suffered repeated vandalism that forced us to the somewhat undesirable response of requiring user accounts for all posters. These accounts are given out on a non-discriminatory basis, based on an automatic e-mailed challenge to send back a message to prove you received a random key. But requiring accounts at all goes against the spirit of Wikis.

Adding a hashcash challenge does not prevent automated defacement of Wiki sites, but it can make a vandalbot crawl much more slowly. If vandalizing one site takes a number of seconds rather than a small fraction of a second, it becomes less appealing to crawl Wikis spewing junk. In fact, for a use like this, my opinion is that requiring more than 20 bits is a good idea. Maybe 24 or 28 bits is a reasonable burden (which might still be bypassed for logged-in Wiki users).

You might think that a simple time delay in accepting a Wiki edit would have a similar effect, but there is a flaw in that thinking. A vandalbot can parallelize its defacements -- if each site adds a five-second delay, for example, the vandalbot can simply spend that five seconds initiating changes to other Wikis on its hit list. By requiring active CPU utilization, as with hashcash, vandalization can no longer run in parallel.

A Wiki challenge can be either interactive or non-interactive. It is possible for a site to direct a user to a challenge screen before directing them to the actual edit screen. A random resource could be issued as the challenge on this guardian screen.

But a better approach is to lose the requirement for interactivity. For example, in an existing Wiki system, you might edit a resource using a URL like the following:

http://somewhere.net/wiki?action=edit&id=SomeTopic

Under a hypothetical hashcash-protected Wiki, you might need to use a different URL, such as:

http://somewhere.net/wiki?stamp=1:24:040928:SomeTopic:edit:KG4E9PaK2VLjKM2Z:0000Zbrc

The Wiki server can check the stamp before it allows edits. But editing does not require creating an account or disclosing any personal information. Double spending and (probably short-duration) expiration checks further assure sincere interest in the edit. It was not difficult for me to generate the above URL using the command:

hashcash -mCb 24 -x edit SomeTopic

Ideally, however, a Web browser might choose to generate likely stamps in the background to assure fewer delays. For example, the above URL might be created in a cache while I am reading the resource:

http://somewhere.net/wiki?SomeTopic

Perhaps some other edit stamps could also be cached for pages linked to by the current Wiki page.

Proving CPU resources

An interactive use for hashcash might be in distributed processing tasks. Projects such as the Great Internet Mersenne Prime Search (GIMPS) and SETI@home and tasks such as protein folding and cryptographic puzzles, to name a few, are sometimes farmed out to a large number of volunteer machines. Each volunteer just downloads some code and runs its part of a large task, sending back intermediate computations to a central server. These jobs are a nice use of spare CPU cycles.

All the distributed tasks I know of pretty much let anyone join. But it is not hard to imagine a task whose coordination requirement is such that a node failing to perform its task in an anticipated time frame causes more harm to the overall computation than the work the slacking node contributes.

In such a case, you might want to require that each participating node have a minimal CPU speed. While checking the speed using the exact type of computation at issue is more precise, hashcash still provides a relatively general CPU benchmark. SHA-1 is a fairly typical mathematical computation. And if candidate nodes already have hashcash installed (but not some custom software tool), answering a hashcash challenge can act as a sort of "you must be this tall to enter this ride" style check.

The trick in checking CPU capability is to demand a high bit value with a short expiration. Only a fast enough CPU can answer the challenge. To make this work, a resource name has to be provided semi-interactively -- otherwise, a candidate could just post-date their datestamp to give a false impression of rapid creation.

For example, a fast Pentium III or slow G4 can mint a 20-bit stamp in less than a second, but a Pentium-II or G3 cannot. We can issue 32-bit challenges that must be answered within an hour to candidate machines as a screening test. A requester might send an e-mail reading, "Send me a challenge"; the coordinating server responds with, "The time is 040927124732; your challenge resource is a37tQk." If the server gets a good hash by 1:47pm that day, the requester is qualified.

Clearly, the protocol I suggest does not assure the work actually gets done on each node. Even fast machines can be unplugged. And users can change their minds about running the distributed software. But at least a plausible qualification can be demonstrated.



Back to top


Generalized hashcash and my contribution

Given the overall concept of hashcash, the use of its specific fields and delimiters is somewhat arbitrary. In fact, hashcash version 0 used different fields than version 1. The choices made are good, but my opinion is that "actual hashcash" is just one member of the family we might call "generalized hashcash." That is, given any challenge string, it is sensible to ask, "Show me a suffix that will produce b bits of collision once challenge+suffix is hashed." Real hashcash is just an example of this generalized challenge.

Now there is a problem with being too general. Creating lots of incompatible, almost-hashcash protocols really does no one any good. For example, one Python implementation of "hashcash" used a challenge protocol that was a little bit like hashcash (and might well be equivalent in cryptographic merit), but it was just not possible to mint hashcash stamps with it.

So, I decided to write a Python implementation of hashcash that was genuinely conformant and even accepts roughly the same command-line switch as the C-coded hashcash utility (but is probably most useful as an imported module in other applications). Even with platforms that benefit (only slightly) from Psyco-ization, the Python version winds up running almost ten times more slowly than the optimized C version. But it still wins in flexibility compared with C.

As well as being correct, my hashcash.py module provides an internal function, _mint(), along with the public function, mint(). The latter makes real hashcash version 1 stamps. That is what you should use.

But the former, _mint(), does the underlying work of finding generalized hashcash suffixes. You probably should not use it, but if you want it (and you promise to play carefully), it is there for your use.

In novel contexts, variations on hashcash might be useful. For what it is worth, I wish the C utility had a similar switch to find generalized hashcash suffixes, even if accompanied by dire warnings in the man pages about why you should not do that. We hackers like to poke at the guts of things.



Back to top


Summing up

I hope this article has given you a sense of the likely applications of hashcash. I think the protocol is a wonderfully clever idea. The challenge now is to get more tools to process hashcash stamps more seamlessly.

A number of MUAs, MTAs, and spam-filtering tools already do a good job of working with hashcash, but significant gaps still exist. Hardly any non-e-mail applications utilize hashcash. Still, I believe the concept is catching on.

If the concept is gaining momentum, it offers a means of regulating access to electronic resources that is consistent with the gestalt of Free Software and open standards and does not slip us into the evils of digital restrictions management (DRM), commercialization of information, and a general loss of privacy.



Resources



About the author

David Mertz is Turing complete, but probably would not pass the Turing Test. For more on his life, see his personal Web page. He's been writing the developerWorks columns Charming Python and XML Matters since 2000. Check out his book Text Processing in Python.




Rate this page


Please take a moment to complete this form to help us better serve you.



 


 


Not
useful
Extremely
useful
 


Share this....

digg Digg this story del.icio.us del.icio.us Slashdot Slashdot it!



Back to top