IBM®
Skip to main content
    Country/region [select]      Terms of use
 
 
      
     Home      Products      Services & solutions      Support & downloads      My account     

developerWorks > Security >
developerWorks
Make your software behave: Cryptography essentials
e-mail it!
Contents:
One-way functions
Hash functions
Internet distribution
Authentication problems
Message authentication codes
Playback attacks
Telnet protocol
Other attacks
Digital signatures
The problem with digital signatures
What's next?
Resources
About the authors
Rate this article
Subscriptions:
dW newsletters
dW Subscription
(CDs and downloads)
Using hashing algorithms for data integrity and authentication

Gary McGraw, Reliable Software Technologies
John Viega, Reliable Software Technologies

01 Jul 2000

In this series on cryptography thus far, Gary and John have touched on both common forms of cryptographic algorithms -- public key cryptosystems, such as RSA, and symmetric algorithms, such as DES -- the most common ways to address the data confidentiality problem. They also discussed the importance of using well-understood algorithms instead of rolling your own, and introduced the risks commonly encountered when implementing cryptography in an application. In this installment, they turn to common approaches to data integrity and authentication, starting with hashing algorithms.

One-way functions
Hashing algorithms are one-way functions. That is, they take a plaintext string, and transform it into a small piece of ciphertext that cannot be used to reconstruct the original plaintext. Clearly, some data need to be lost in the transformation for this sort of function to work.

One-way functions may not sound immediately useful because you can't get the plaintext back out of a one-way computed ciphertext. Why would you want to compute a cipher that you can't undo? Of course, functions that are almost one-way are pretty useful, as all public key functions are, at their essence, one-way functions with "trap doors." Functions that are good candidates for public key cryptography are those that are easy to compute in one direction, but very difficult to compute in the other direction unless you know some secret. Thus, we find public key algorithms are based on factoring and other hard mathematical tricks.

Hash functions
As it turns out, true one-way functions are useful, too. These functions are often called hash functions, and the results of such functions are commonly called a cryptographic hash values, cryptographic checksums, cryptographic fingerprints, or message digests. Such functions play a major role in many cryptographic protocols.

The idea is to take a piece of plaintext and convert it to a piece of (usually smaller) ciphertext in a way that is irreversible. Ideally, all possible plaintexts will hash to a unique ciphertext, but that is not usually what happens. Most of the time, there are infinitely many different strings that can produce the exact same hash value. However, with a good cryptographic hashing function, it should be extremely difficult in practice to find two intelligible strings that hash to the same value. Another property of a good hash function is that the output will not reflect the input in any discernable way.

Hash functions usually produce constant-sized digests. Many algorithms produce very small digests, however, the security of the algorithm rests largely in the size of the resulting digest. We recommend selecting an algorithm that provides a digest size of no fewer than 128 bits. SHA-1, which gives a 160-bit hash, is a good hash function to use.

You can use hash functions to ensure data integrity, much like a traditional checksum. If you publicly release a regular cryptographic hash for a document, anyone can verify the hash, assuming they know the hashing algorithm. Most hashing algorithms that people use in practice are published and well-understood. Again, let us remind you that using proprietary cryptographic algorithms, including hash functions, is usually a bad idea.

Internet distribution
Consider the case of a software package distributed over the Internet. Software packages available through ftp were, in the recent past, associated with checksums. The idea was to download the software, then run a program to calculate your version of the checksum. The self-calculated checksum could then be compared to the checksum available on the ftp site to make sure the two matched and ensure data integrity (of sorts) over the wire. The problem is this old-fashioned approach is not cryptographically sound. For one thing, with many checksumming techniques, modifying the program to download maliciously, while causing the modified program to yield the exact same checksum is possible. For another, a "Trojaned" version of a software package can easily be published on an ftp site with its associated (poorly protected) checksum. Cryptographic hash functions can be used as drop-in replacements for old style checksum algorithms. They have the advantage of making tampering with posted code extremely difficult.

Be forewarned -- there is still a problem with this distribution scheme. What if you, as the software consumer, somehow download the wrong checksum? For example, let's say that we distribute the "xyzzy" software package. One night, some cracker breaks into the distribution machine and replaces the xyzzy software with a slightly modified version, including a malicious Trojan horse. The attacker also replaces our publicly distributed hash with the hash of the Trojaned copy of the release. Now, when some innocent user downloads the target software package, a malicious copy is what arrives. The victim also downloads a cryptographic checksum, and tests it against the software package. It checks out, and the malicious code appears safe for use. Obviously, the hash alone is not a complete solution if we can't guarantee that the hash itself has not been modified. In short, we need a way to authenticate the hash.

Authentication problems
Two possible situations can arise when we consider the authentication problem. We may want everyone to be able to validate the hash. If so, we can use digital signatures based on PKI, discussed below. Or, we may want to restrict who can validate the hash. For example, say we send out an anonymous letter to the sci.crypt newsgroup posting the full source code to a proprietary encryption algorithm, but we want only our closest friend to be able to verify that we posted the message. We can use a message authentication code (MAC) to achieve these results.

Message authentication codes
MACs work by leveraging a shared secret key, one copy of which is used on the recipient end. This key can be used to authenticate the data in question. The sender must possess the other copy of the secret key. A MAC can work several ways. The first is to concatenate the secret key to the end of the data before computing a digest. If you don't have the secret key, there's no way you can confirm the data is unaltered. Another more computationally expensive approach is to compute the hash like normal, and then encrypt the hash using a symmetric algorithm (like DES). To authenticate the hash, it must first be decrypted.

Applied Cryptography
Many constructions for MACs do not depend on cryptographic hashing. Bruce Schneier's Applied Cryptography covers this topic in more detail (see Resources).

MACs are useful in lots of other situations, too. If you need to perform basic message authentication without resorting to encryption (perhaps for efficiency reasons), MACs are the right tool for the job. Even when you are already using encryption, MACs are an excellent way to ensure that the encrypted bit stream is not maliciously modified in transit.

If designed carefully, a good MAC can help solve other common protocol problems. One widespread problem with many protocols is evidenced during what is called a playback attack (or a capture-replay attack). Say we send a request to our bank, asking to transfer $50 into John Doe' s bank account from ours. If John Doe intercepts that communication, he can later send an exact duplicate of the message to the bank! In some cases the bank might believe we sent two requests!

Playback attacks
Playback attacks turn out to be a widespread problem in many real-world systems. Fortunately, they can be mitigated with clever use of MACs. In our bank transfer example, assume we use a primitive MAC that hashes the request along with a secret key. To thwart playback, we can make sure that the hash never comes out the same. One obvious way to do this is to use a timestamp.

Be careful!
If you aren't careful with your code, there's another possible subtle mistake. If an attacker can easily pick out the bits of your message that correspond to the encrypted time and replace them with a different, random set of bits, your code may not work. If your code only checks to see if the timestamp is more recent than 60 seconds ago, and if it doesn't check to see if the timestamp is from the future, then the attacker will get lucky (and steal funds) way too often.

If the server ever sees a request with an out-of-date timestamp (say, more than 60 seconds old), it refuses the request. This may or may not be sufficient, because it still results in a 60-second window in which playback attacks can occur. You might consider forbidding two requests from occurring in the same unit of time, and cache information about what valid requests have arrived in the past 60 seconds. That solution will probably work if you can handle the special case where the same transaction is performed twice in the same unit of time. There's a much easier solution, though. When you calculate your MAC, hash not only the data and the secret key, but also a unique, ordered sequence number. The remote host need only keep track of the last sequence number it serviced, and make sure that it never services a request older than the next expected sequence number. This is a common approach.

In many cases, authentication may not really be an issue. For example, consider using a cryptographic hash to authenticate users logging into a machine from the console. When the user enters a passphrase for the first time in many systems, the passphrase itself is never actually stored. Instead, a cryptographic hash of the passphrase is stored. That's because most users feel better if they believe system administrators can't retrieve their password at will. Assuming the operating system can be trusted (which is a laughably big assumption), we can assume our database of crypto hashed passphrases is correct. When the user tries to log in, and types in a passphrase, the login program hashes it, and compares the newly hashed passphrase to the stored hash. If the two are equal, we assume the user typed in the right passphrase, and login proceeds.

Telnet protocol
Unfortunately, architects and developers sometimes assume that the security of the authentication mechanism isn't really a problem, when in reality it is. For example, consider the telnet protocol. Most telnet servers take a username password as input. They then hash the password, or perform some similar transformation, and compare the result to a local database. The problem is that with the telnet protocol, the password goes over the network in the clear! Anyone who can listen on a network line with a packet sniffer can discover the password. Telnet authentication provides a very low bar for potential attackers to clear. Many well-known protocols have a similarly broken authentication mechanism, including most versions of FTP, POP3, and IMAP.

Of course, many other important security considerations surround the implementation of password authentication systems. This problem is a big enough topic that we'll discuss it in even more gory detail in another article.

Other attacks
Any good cryptographic hashing algorithm should make finding a duplicate hash for an alternative plaintext difficult, even given a known message and its associated message hash. Searching for a collision on purpose amounts to a brute-force attack, and is usually fairly difficult. This is especially true if the attacker wants the second plaintext document to be something other than a string of gibberish.

Another attack on cryptographic hashes is much easier to carry out than the average brute-force attack. Consider the following scenario: Alice shows Bob a document and a cryptographic hash to validate the document where Alice agrees to pay Bob $5 per widget. Bob doesn't want to store the document on his server, so he just stores the cryptographic hash. Alice would like to only pay $1 per widget, so she would like to create a second document that gives the same hash value as the $5 one, then go to court, claiming that Bob over-billed her. When she gets to court, Bob will present a hash value, believing that Alice_s document will not hash to that value, because it isn't the original document she showed him. If her attack is successful, Alice will be able to demonstrate that the document she has does indeed hash to the value Bob stored, and the courts will rule in her favor.

But how? Alice uses what is called a birthday attack. In this attack, she creates two documents, one with the $5 per widget price, the other with the $1 per widget price. Then, in each document, she identifies n places where a cosmetic change can be made (for example, places where a space could be replaced with a tab). A good value of n is usually one more than half the bit length of the resulting hash output (so n = m/2+1 if we assign m to the bit length of the hash output). For a 64-bit hash algorithm, she would select 33 places in each document. She then iterates through the different permutations of each document, creating and storing a hash value. Generally, she will expect to find two documents that hash to the same value after hashing about 2 m/2 messages. That's much more efficient than a brute-force attack, where the expected number of messages she would have to hash is 2 m-1 . If it would take Alice a million years to perform a successful brute-force attack, she could probably perform a successful birthday attack in under a week. As a result, Bob should demand that Alice use an algorithm that gives a digest size for which she cannot perform a birthday attack in any reasonable time.

To get the same security against birthday attacks as you'd want against brute-force attacks, given a symmetric cipher with key length p, you should pick a hashing algorithm that gives a digest of size p*2. Therefore, for applications that require very high levels of security, it is a good idea to demand hashing algorithms that yield message digests that are 256 bits, or even 512 bits.

What's a good hashing algorithm to use? We're particularly fond of SHA-1. Bruce Schneier recommends the same algorithm. But SHA-1 is not sufficient if a hash of length longer than 160 bits is required. For a big bit hash, try using a symmetric cipher adapted to perform hashing. One good example is the GOST hash algorithm, derived from the GOST cipher, with a hash length of 256 bits. Longer hash lengths are likely to require some coding in order to adapt a symmetric cipher. Schneier outlines constructions for building such algorithms in Applied Cryptography. Neither SHA-1 nor the GOST hash algorithm has intellectual property restrictions of any significance.

Digital signatures
The idea behind a digital signature is analogous to a traditional hand-written signature. The idea is to be able to "sign" digital documents in such a way that the signature is as legally binding in court as a physical signature. Digital signatures must be at least as good as hand-written signatures at meeting these main goals of signatures:

  • A signature should be proof of authenticity. Its existence on a document should be able to convince people that the person whose signature appears on the document signed the document deliberately.
  • A signature should be impossible to forge. As a result, the person who signed the document should not be able to claim this signature is not theirs.
  • After the document is signed, it should be impossible to alter the document without detection, so that the signature can be invalidated.
  • It should be impossible to move the signature to another document.

Even in hand-written signatures, these ideals are merely conceptual, and don't really reflect reality. For example, forging signatures is possible, even though few people are skilled enough to actually do it. However, signatures tend not to be abused all that often, and thus hold up in court very well. All-in-all, ink signatures are a "good enough" solution.

Electronic signatures can do at least as well as physical signatures. This fact often amazes people, because they figure such a signature is akin to the signature files that people often place at the bottom of e-mail messages (a bunch of ASCII). If that were what digital signatures were like, then they wouldn't be very useful at all. It would be far to easy to copy a signature from one file, and append it directly to another file for a perfect forgery. It would also be possible to modify a document easily after it's signed, and no one would be the wiser. Thankfully, digital signatures are completely different.

Most digital signature systems use a combination of public key cryptography and cryptographic hashing algorithms. As we have explained, public key cryptosystems are usually used to encrypt a message using the receiver's public key, which the receiver then decrypts with the corresponding private key. A private key can also be used to encrypt a message that can only be decrypted using the corresponding public key. If a person keeps their private key completely private, (you haven't been hacked lately, have you?) being able to decrypt a message using the corresponding public key then constitutes proof that the person in question encrypted the original message.

Digital signatures are not just useful for signing documents -- they're useful for almost all authentication needs. For example, they are often used in conjunction with encryption in order to achieve both data privacy and data authentication.

A digital signature for a document is usually constructed by cryptographically hashing the document, and then encrypting the hash using a private key. The resulting ciphertext is called the signature. Anyone can validate the signature by hashing the document themselves, and then decrypting the signature (using either a public key or a shared secret key), and comparing the two hashes. If the two hashes are equal, then the signature is considered validated (assuming the person doing the validation believes the public key they used really belongs to you).

The signature doesn't need to be stored with the document. Also, the signature applies to any exact digital replica of the document. The signature can also be replicated, but there is no way to apply it to other documents, as the resulting hash would not match the decrypted hash.

The problem with digital signatures
The one problem with digital signatures is non-repudiation. People can always claim their key was stolen. Nonetheless, digital signatures are widely accepted as legal alternatives to a physical signature, because they come at least as close to meeting the ideals presented above as do physical signatures. At least 30 states have digital signature laws now, and more are likely to follow (if the U.S. Congress doesn't beat them to the punch by passing a national law).

Most public key algorithms, including RSA and ElGamal, can be easily extended to support digital signatures. In fact, a good software package that supports one of these algorithms should also support digital signatures. We recommend using your favorite public key cryptography algorithm for digital signatures. Try to use built-in primitives instead of building your own construction out of an encryption algorithm and a hash function.

Cool ideas in Applied Cryptography
Bruce Schneier discusses plenty of cool things you can do with cryptography. For example, he outlines how you might create your own digital certified mail system. He also discusses how to split up a cryptographic secret in such a way that you can distribute parts to n people, and the secret will only be revealed when m of those n people pool their resources together. It's definitely worth perusing his book, if just to fill your head with the myriad possibilities cryptography has to offer.

What's next?
In future articles, we'll cover several related topics. The most important topic is key management: How do you safely generate, store, change, destroy, or transfer cryptographic keys? We'll also get down to brass tacks, and actually study some real cryptographic packages, complete with code examples. We will look at the Cryptix package for Java applications, an SSL implementation for C, and an implementation of Kerberos. We'll also provide links to other cryptographic packages believed to be robust.

In our next installment, we'll go into nitty-gritty detail about how to add password management to a software application, and (perhaps more importantly) how not to.

Cryptography is a huge field, but only one facet of software security. We won't even pretend we'll cover it completely.

Resources

About the authors
Gary McGraw is the vice president of corporate technology at Reliable Software Technologies in Dulles, VA. Working with Consulting Services and Research, he helps set technology research and development direction. McGraw began his career at Reliable Software Technologies as a Research Scientist, and he continues to pursue research in Software Engineering and computer security. He holds a dual Ph.D. in Cognitive Science and Computer Science from Indiana University and a B.A. in Philosophy from the University of Virginia. He has written more than 40 peer-reviewed articles for technical publications, consults with major e-commerce vendors including Visa and the Federal Reserve, and has served as principal investigator on grants from Air Force Research Labs, DARPA, National Science Foundation, and NIST's Advanced Technology Program.
McGraw is a noted authority on mobile code security and co-authored both "Java Security: Hostile Applets, Holes, & Antidotes" (Wiley, 1996) and "Securing Java: Getting down to business with mobile code" (Wiley, 1999) with Professor Ed Felten of Princeton. Along with RST co-founder and Chief Scientist Dr. Jeffrey Voas, McGraw wrote "Software Fault Injection: Inoculating Programs Against Errors" (Wiley, 1998). McGraw regularly contributes to popular trade publications and is often quoted in national press articles.


John Viega is a Senior Research Associate, Software Security Group co-founder, and Senior Consultant at Reliable Software Technologies. He is the Principal Investigator on a DARPA-sponsored grant for developing security extensions for standard programming languages. John has authored over 30 technical publications in the areas of software security and testing. He is responsible for finding several well-publicized security vulnerabilities in major network and e-commerce products, including a recent break in Netscape's security. He is also a prominent member of the open-source software community, having written Mailman, the GNU Mailing List Manager, and, more recently, ITS4, a tool for finding security vulnerabilities in C and C++ code. Viega holds a M.S. in Computer Science from the University of Virginia.



e-mail it!
Rate this article

This content was helpful to me:

Strongly disagree (1)Disagree (2)Neutral (3)Agree (4)Strongly agree (5)

Comments?



developerWorks > Security >
developerWorks
  About IBM  |  Privacy  |  Terms of use  |  Contact