IBM®
Skip to main content
    Country/region [select]      Terms of use
 
 
      
     Home      Products      Services & solutions      Support & downloads      My account     

developerWorks > Security >
developerWorks
Make your software behave: Software strategies
e-mail it!
Contents:
Doing a good job (with the right OS)
Doing a good job (in Java)
Doing an OK job (in C or C++)
Doing a better job (in C or C++)
Other considerations
Better pseudo-random number generators (PRNGs)
Other random number libraries
Resources
About the authors
Rate this article
Subscriptions:
dW newsletters
dW Subscription
(CDs and downloads)
In the absence of hardware, you can devise a reasonably secure random number generator through software

Gary McGraw, Vice president, Reliable Software Technologies
John Viega, Senior research associate and consultant, Reliable Software Technologies

18 Apr 2000

In this installment, the third in the series, Gary and John present strategies for using software to serve as an adequately secure random number generator.

Hardware can be the best source of "secure" random numbers, because it harvests randomness from nature, and can generate numbers fast enough for security-critical uses. Unfortunately, random number generation hardware costs money, and it's not likely to be on every user's desktop anytime soon. Fortunately, we can offer a strategy to achieve reasonable security through software alone.

Two installments ago, we showed why using library calls such as random() usually isn't a good idea. Real systems have fallen prey to this error more than once.

In the previous installment, we talked about using specialized hardware as a source of "secure" random numbers. Hardware-based solutions can be great for this kind of task, because they harvest randomness from nature that can be of high quality, and they can generate numbers at a rate more than fast enough for most security-critical uses. Unfortunately, hardware costs money, and it's not likely that this kind of hardware will be on every user's desktop anytime soon. If you are in a position to guarantee that randomness hardware will be available to all your users, then that's great. But what should you do for systems that can't make that guarantee?

Fortunately, some software-based solutions work fairly well. One common solution is to sample keyboard or mouse events from the user, and derive randomness based on timing or positional data. We briefly touched on such a solution earlier in the series. We noted that it had the problem of relying on input from the user, and thus wasn't useful on servers when a lot of stuff is going on. We also noted that there is some possibility for biased data, which can be removed via a cryptographic transformation. One problem that we did not mention is that the data sampled for randomness need to be unavailable to anyone who is a potential attacker. If you read mouse events under X Windows, most mouse events are potentially visible as they go over the network to other applications! On Windows, any application should be able to listen for mouse events, so you should be certain that nothing malicious is running on a machine before using event data as a high-entropy source.

Another problem with mouse events is what to do when the mouse doesn't change between events. There's no real entropy to be gained in such a situation from the positional information of the mouse. However, you might be able to gain a little bit of entropy based on how long it takes the mouse to move.

In software solutions, the idea for generating good random numbers is to collect as much data as possible that might have entropy in it, such as mouse and keyboard events, and mix it all up into a "pool" of randomness, which is capable of yielding random numbers. In addition to measuring physical events with demonstrable entropy, some people like to measure other information from a computer that changes quickly, and add that to the pool too. (We'll go into more detail about the kinds of things people sample.)

Often, people compute a random seed using system measurement techniques and then feed it into a pseudo-random number generator. Why not just continue to generate "secure" digits? Because generating pseudo-random numbers is fairly fast. By contrast, measuring physical data and then applying transformations to it to eliminate any bias and come up with a single random number can be time consuming (especially when you're measuring the slow, real world). Software-based solutions can usually average only a few bytes of high-quality random information per second.

Be forewarned! As we discussed last time, using a secure seed with a pseudo-random number generator can provide you with security, but there is no guarantee. If you're going to use this approach, make sure you have a good pseudo-random number generator. We'll provide you with one in this column, the Blum Blum Shub generator, which works very well, even though it is fairly slow compared to traditional random number generators. We'll also address random number generators based on cryptographic hashes.

Doing a good job (with the right OS)
Advanced operating systems are beginning to address the problem of randomness directly at the OS level. A good example of this comes from Linux (the popular open source version of UNIX).

If you're using a reasonably recent version of Linux (1995 or later), you can get "secure" random numbers directly from the operating system. Linux has a device that a programmer can read from called /dev/random which yields numbers that are (for all intents and purposes) pretty darn random.

How does /dev/random create random numbers?
The Linux operating system feeds a library data that are essentially random (or that at least have a strong random component). These data generally come from device drivers. For example, the keyboard driver might collect information about the time between key presses, and feed this environmental noise to the random number generator library.

Random data are stored in an entropy pool, which gets "stirred" every time new data come in. This stirring is really a mathematical transformation that helps improve randomness. When data are added to the entropy pool, the system comes up with an estimate of how many bits of true randomness were attained.

Measuring the amount of randomness is important. The problem is that some quantities are likely to have less randomness than they may seem to at first consideration. For example, adding a 32-bit number that represents the number of seconds since the last keyboard press doesn't actually give you 32 new bits of random information, because most key presses tend to be close together.

When you read bytes from /dev/random, the entropy pool is cryptographically hashed using the MD5 algorithm, and bytes from that hash are converted to a number, and then returned.

If there are no bits of randomness available in the entropy pool, /dev/random waits, not returning a result until there is enough randomness available in the pool. That means if you use /dev/random to produce a lot of random numbers, you might find it too slow to be useful. We have often watched /dev/random generate a dozen bytes of data, then yield nothing for many seconds. Argh.

Fortunately, there is an alternate interface to the entropy pool that gets around this limitation: /dev/urandom. The alternative device will always return a random number, even if there is no randomness available in the entropy pool. If you take out a lot of numbers without giving the entropy pool time to recharge, you no longer get the benefit of pooled entropy from various sources; but you still get pretty darn random numbers from the MD5 hash of the entropy pool! The problem with this approach is that if anyone could ever break the MD5 algorithm, and learn something about the input of the hash by looking at the output, then your numbers will suddenly become transparently predictable. Most experts don't believe such analysis is computationally feasible. Nonetheless, /dev/urandom is still considered "less secure" than /dev/random (and it often pays to be paranoid).

An alternative solution would be to use /dev/random to seed the Blum Blum Shub random number generator, which we introduce at the end of this installment.

The attached code listing contains some C code you can use that provides a fairly simple interface to the two Linux devices we just introduced. Our code returns 32-bit random numbers. Of course, this code will only work if the devices /dev/random and /dev/urandom are present on the system. This simple wrapper automates the task of getting information from the device and returning it as a number.

Doing a good job (in Java)
Lots of people don't run Linux; in fact, most people who use computers don't. In this section we'll provide a solution to the random number generation problem written in Java.

If you're willing to use Java, you can use the class java.security.SecureRandom, which returns a fairly good (cryptographically strong) random number. A benefit of this approach is portability. You can use the exact same code to get secure random numbers on a UNIX box, a Windows machine, or even a Mac.

In Java, you shouldn't use the SecureRandom class to seed java.util.Random, because java.util.Random is not good enough. Although java.util.Random takes a long time to return (leading some users to believe it must be a 64-bit pseudo-random number generator), it is really only a 48-bit generator (at least in the Sun JDK). As we discussed in the previous column, these kinds of generators can still be brute forced fairly easily. Unless you implement your own 64-bit random number generator, you're going to be stuck using SecureRandom if you need maximum security, even though it doesn't generate numbers as quickly as a pseudo-random number generator.

Here's what the SecureRandom class does (in the Sun JDK). When you create the class, either you pass it a seed, or it uses a self-generated seed. The seed is cryptographically hashed with the SHA-1 algorithm. Both the result and the internal state of the algorithm are added to the SHA hash to generate the next number. If you don't pass in a seed, the seed is generated by the "seeder," which is itself an instance of SecureRandom. The seeder itself is seeded using internal thread-timing information that should be fairly hard to predict. The requisite timing information can take several seconds to gather, so the first time you create an instance of SecureRandom, you should expect it to hang for a few seconds (up to 20).

According to Sun, the self-seeding algorithm is not well-scrutinized, and may be easy to predict. And, of course, the SecureRandom algorithm is only as good as the quality of the information it is using. On machines where very little is going on, the algorithm is not likely to be very good.

To get around this problem, you may want to use an alternate seeding mechanism. One common approach is to use keyboard events. Source code to make use of the keyboard is available in Jonathan Knudsen's book, Java Cryptography (see Resources).

A brief cryptographic note is also in order. The fact that the SHA hash never gets updated with random information after the initial seeding indicates that if the SHA hash were ever broken by cryptographic analysis, then Java's number generator would not be resistant to attack.

Here's some simple Java code that shuffles a Vector using the SecureRandom class, which we provide as an example of how to use SecureRandom:

import java.util.Vector;
import java.security.SecureRandom;

class VectorShuffler
{
    // This might take a few seconds!
    private static SecureRandom rng = new SecureRandom(); 

    private static int get_random_int()
   {
       int ret;
       byte[] b = new byte[4];
       rng.nextBytes(b);
       // Convert 4 bytes into one number between 2 and 2^32-1.
       ret =  (int)b[0];
       ret = ret*256+(int)b[1];
       ret = ret*256+(int)b[2];
       ret = ret*256+(int)b[3];
       return ret;
   }

   private static int get_random_int_in_range(int n)
   {
       double d = get_random_int()/(double)0xffffffff;
       if(d<0) d *= -1;
       return ((int)(d*n))%n;
   }

    static void shuffle(Vector v)
    {
        
        int i = v.size();
        while(--i != 0)  // We don't have to run the case where i == 0
        {
            Object swap;
             int r = get_random_int_in_range(i+1);
             swap = v.elementAt(r);
             v.setElementAt(v.elementAt(i), r);
             v.setElementAt(swap, i); 
        }
    }
}

Doing an OK job (in C or C++)
Now that we've shown you two ways to do it right, we'll provide some code that shows how this random number generation stuff works under the covers. What we're going to do is read a bunch of information from the system that we hope will change rapidly, and hope that the data have some intrinsic entropy in them. We'll stir that data into our entropy pool, from which we hope to be able to yield some reasonable numbers. This code is meant to be academic. It is meant to help explain the concepts. Don't use it! We'll discuss problems with the code, and then offer you a somewhat better alternative later in this article.

The code listing below produces cryptographically strong random numbers (32 bits or 64 bits in size). It generates about 15 to 20 random numbers per second on a modestly equipped Pentium II 300 with a fairly small load. Your mileage is likely to vary.

Our program reads the output of ps and a few other common commands, and samples the system clock between each command. Then, it does an MD5 hash on these data. From that output, it generates unsigned integers of the requested size (32 bits or 64 bits). We simply take the raw event and a timestamp, hash it into our pool, and we should end up getting a fair amount of entropy.

While this program should give you quite reasonable numbers, it's far more prone to defeat than an algorithm using better sources of randomness. The basic problem is that it's hard for us to determine how much real entropy is actually added each time we feed the entropy pool. Perhaps it's a fair bit, but maybe it's only a bit or two. If it's only a bit or two, we have to collect a lot of bits before we can have any confidence that people won't be able to guess the internal state of our generator.

In practice, an attacker would have to be able to reproduce the output of the commands exactly, and predict the time it takes to run each command fairly accurately. If that can be done, then it might be possible for them to predict the numbers you generate with this code. Again, this is theoretically possible, but we believe it isn't too likely in practice. We'd definitely have a hard time breaking code that used this library, especially if there's a lot happening on the target machine. But having a "hard time" does not mean it's impossible. One thing you don't want to do is use this library at machine startup, because the machine will often be in a very predictable state at that point. A good addition to this code would be to save the state of the entropy pool to disk across reboots.

If you are willing to capture keyboard timing information, you can use the add_noise() function we provide to add better random information to our entropy pool. For examples of how to do this in C, see the source code for PGP.

This code listing requires that you compile in the reference MD5 library provided by RSA (see Resources). Specifically, you need the files global.h, md5.h, and md5.c. You'll need to link in md5.o to get things to work. The code should be portable, but we've only tested it on recent versions of Linux and Solaris. Please let us know if you have problems running this code on other platforms. Again, this code is meant to be academic, and is not intended to be used in production systems. We probably would avoid putting this code into production, unless we first integrated the techniques we'll discuss in the next two sections.

Doing a better job (in C or C++)
The kinds of system information that we poll in the previous section don't change nearly as fast as we'd like. As a result, there is some concern that an attacker might be able to predict the state of our random number generator. True Rand, a technique by Don Mitchell and Matt Blaze, will do a better job than the techniques we've used above. The code is UNIX-specific, but it could be adapted to Windows with a bit of effort.

TrueRand posits some underlying randomness that can be observed even on idle CPUs. The approach involves measuring drift between the system clock, and the generation of interrupts on the processor. Empirical evidence suggests that this drift is present on all machines that use a different clock source for the CPU and interrupt timing, and is a good source of real randomness (and it hasn't been broken yet, at least not publicly).

The attached code listing could be massaged into the above framework, and could even completely replace the bulky calls to many system commands. Note that TrueRand doesn't do any work to remove bias, so the output will definitely need to be post-processed. We suggest sending the data to the add_noise function in the code we provided in the previous section, which will stir the data into the randomness pool, and will MD5-hash the pool before using the new information. That trick should remove all bias.

Other considerations
There are some other things to consider when thinking about software solutions to random number generation. One thing we don't do in our random number collector above is play "estimate the entropy." It would be nice if we could tell how many bits of "random" data we think the entropy pool is good for. That way, when we go to generate a 128-bit key for some cryptographic algorithm, we won't end up with a key that, in reality, offers less than 128 bits of security.

Guessing entropy can be quite difficult. The reason we didn't do this in our code is that we didn't feel that we could reliably estimate how much randomness our technique actually gathers. We expect that it varies greatly from machine to machine. On highly loaded servers, the amount of entropy is probably very high. On single-user machines with few running processes, we wouldn't be surprised if the amount of entropy generated was very small.

It is a bit easier to estimate entropy for TrueRand, since it isn't dependent on processor load. Matt Blaze believes that TrueRand produces 16 bits of entropy for every 32 bits output by the TrueRand algorithm. When sampling keyboard and mouse events, you should be conservative, and count each event (plus timestamp) as giving you one bit of entropy. An exception would be when you're polling the mouse, and the mouse hasn't moved since last time. We'd probably count that as 0 bits of entropy. Also, remember to be careful -- if attackers can see the same event stream that you can, you probably shouldn't be assigning any entropy to that data at all!

Another thing to consider is what happens when the internal state of the random number generator is somehow compromised. We'll talk about how random number generators might be attacked in a future installment. For now, let's just assume that it's possible, and that an attacker has managed to get our internal state. Assuming the attacker knows when we're outputting numbers, and thus, hashing our pool, then the entire stream of numbers is compromised, until we add more entropy to the pool that the attacker cannot guess. As a result, we should always try to keep the entropy pool refreshed with new entropy, so that the attacker can compromise as few numbers as possible.

It is also desirable to buffer up entropy before stirring it into the pool, if we are only collecting a bit or so of it at a time. Why? If we add a bit of new entropy between every random number, an attacker can easily brute-force that bit, and the generator remains compromised. Even if we add 64 individual bits to the generator in this way, the same attack will be possible, but the attacker will just have to perform it 64 times.

By contrast, if we wait until we have 64 bits of entropy collected before adding it back into our generator, the attacker has a much tougher problem, since there are now 264 possibilities that the attacker must brute force once. Previously, only two possibilities needed to be considered -- 64 times.

Better pseudo-random number generators (PRNGs)
Once we have a reasonable amount of entropy, it would be nice if we could use that entropy to seed a good pseudo-random number generator, because generating more random bits is a time-consuming process. Even though we didn't mention it at the time, we've already seen one example of a pseudo-random number generator that gives good results. Our pool-stirring code above took an MD5 hash of the pool before emitting each 32-bit quantity. Then, the resulting hash was mixed back into the pool, before generating a new quantity.

We can do this ad nauseum, even if we don't collect any new entropy. Of course, if we've collected 128 bits of entropy, and we've given out 256 bits, it is theoretically possible that an attacker could guess the internal state of the generator by watching the output. But it isn't very likely, because it would require the attacker to invert the hash operation, which is believed to be computationally infeasible. One caveat is that you do need to make sure not to expose the internal state of the generator. We only give out 32 bits that were yielded by hashing the entropy pool. Then, we mix back in the complete hash. That's a reasonable way to go, because you're not exposing much information about the pool itself (assuming, of course, that the MD5 hash is not invertable). If we just used our pool directly, and cast it to a series of numbers before mixing the pool, then we will have given away our internal state to any potential attacker.

Another pseudo-random number generator that doesn't rely on hash functions is the Blum Blum Shub generator. This generator requires you to seed it with some data that must be based on random input for maximal impact.

First, pick two large prime numbers, p and q, such that p mod 4 yields 3, and q mod 4 yields 3. These primes must remain secret. Calculate the Blum number N by multiplying the two together. Pick a random seed s, which is a random number between 1 and N (but must be neither p nor q). Remember, s should not be able to be guessed.

Calculate the start state of the generator as follows:

x0 = s2 mod N

This generator yields as little as one bit of data at a time. The ith random bit (bi) is computed as follows:


xi = xi-12 mod n
bi = xi mod 2

Basically, we're looking at the least significant bit of xi, and using that as a random bit.

It turns out that you can use more than just the single-least-significant bit at each step. For a Blum number N, you can safely use log2N bits at each step.

To avoid having to pick a new seed, and thus collect entropy each time the system boots, you can save the intermediate state of the algorithm across reboots. Of course, make sure you save it in a place safe from the prying eyes of potential attackers!

Much like the RSA public key cipher, the Blum Blum Shub algorithm gains its security through the difficulty of factoring large numbers. Therefore, you need to pick prime numbers of similar sizes to your RSA keys (we'll get to RSA soon), if you wish to have the same degree of security. That means you'll almost certainly be doing modular arithmetic with a software library, since your Blum number will be too big to be a primitive type.

Other random number libraries
Other libraries are out there for handling random numbers. We've discussed two excellent sources for UNIX machines -- /dev/random and TrueRand. For Windows, Yarrow is an outstanding solution from two highly respected cryptographers, Bruce Schneier and John Kelsey. Yarrow is fairly new, but is looked upon quite favorably by the cryptographic community. It's available in source code form (see Resources). Hopefully, there will be a UNIX implementation of Yarrow in the near future.

Another alternative to our code for stirring entropy is available in the OpenSSL library (see Resources). You can feed their library output from generators like TrueRand, /dev/random, or Yarrow, which will be mixed into their internal pool. From their pool, they generate random numbers using a hash-based pseudo-random number generator much like ours above.

We hope you find the examples we have provided here useful. Remember, your mileage may vary.

Keep your eye out for new randomness functionality built into the next generation of Intel chips. Hopefully, the engineers at Intel avoided the pitfalls we discussed above.

Resources

  • Read Java Cryptography by Jonathan Knudsen (O'Reilly and Associates, Inc., April 1998. ISBN: 1565924029).

  • Visit RSA's reference MD5 library.

  • Visit the OpenSSL library.

  • See the source code for Yarrow.

  • The developerWorks columns by Gary and John have been updated and expanded in the book Building Secure Software, which also includes plenty of new material. Pick up a copy today -- your users will thank you.

About the authors
Gary McGraw is the vice president of corporate technology at Reliable Software Technologies in Dulles, VA. Working with Consulting Services and Research, he helps set technology research and development direction. McGraw began his career at Reliable Software Technologies as a Research Scientist, and he continues to pursue research in Software Engineering and computer security. He holds a dual Ph.D. in Cognitive Science and Computer Science from Indiana University and a B.A. in Philosophy from the University of Virginia. He has written more than 40 peer-reviewed articles for technical publications, consults with major e-commerce vendors including Visa and the Federal Reserve, and has served as principal investigator on grants from Air Force Research Labs, DARPA, National Science Foundation, and NIST's Advanced Technology Program.
McGraw is a noted authority on mobile code security and co-authored both "Java Security: Hostile Applets, Holes, & Antidotes" (Wiley, 1996) and "Securing Java: Getting down to business with mobile code" (Wiley, 1999) with Professor Ed Felten of Princeton. Along with RST co-founder and Chief Scientist Dr. Jeffrey Voas, McGraw wrote "Software Fault Injection: Inoculating Programs Against Errors" (Wiley, 1998). McGraw regularly contributes to popular trade publications and is often quoted in national press articles.


John Viega is a Senior Research Associate, Software Security Group co-founder, and Senior Consultant at Reliable Software Technologies. He is the Principal Investigator on a DARPA-sponsored grant for developing security extensions for standard programming languages. John has authored over 30 technical publications in the areas of software security and testing. He is responsible for finding several well-publicized security vulnerabilities in major network and e-commerce products, including a recent break in Netscape's security. He is also a prominent member of the open-source software community, having written Mailman, the GNU Mailing List Manager, and, more recently, ITS4, a tool for finding security vulnerabilities in C and C++ code. Viega holds a M.S. in Computer Science from the University of Virginia.



e-mail it!
Rate this article

This content was helpful to me:

Strongly disagree (1)Disagree (2)Neutral (3)Agree (4)Strongly agree (5)

Comments?



developerWorks > Security >
developerWorks
  About IBM  |  Privacy  |  Terms of use  |  Contact