Technical Blog Post
IBM Watson - What's In that 1TB ?
"When Watson is booted up, the 15TB of total RAM are loaded up, and thereafter the DeepQA processing is all done from memory. According to IBM Research, the actual size of the data (analyzed and indexed text, knowledge bases, etc.) used for candidate answer generation and evidence evaluation is under 1 Terabyte (TB). For performance reasons, various subsets of the data are replicated in RAM on different functional groups of cluster nodes. The entire system is self-contained, Watson is NOT going to the internet searching for answers."
I had several readers ask me to explain the significance of the "Terabyte". I'll work my way up.
A bit is simply a zero (0) or one (1). This could answer a Yes/No or True/False question.
Most computers have standardized a byte as a collection of 8 bits. There are 256 unique combinations of ones and zeros possible, so a byte could be used to storage a 2-digit integer, or a single upper or lower case character in the English alphabet. In pratical terms, a byte could store your age in years, or your middle initial.
- Kilobyte (KB)
The Kilobyte is a thousand bytes, enough to hold a few paragraphs of text. A typical written page could be held in 4 KB, for example.
The IBM Challenge to play on Jeopardy! is being compared to the historic 1969 moon landing. To land on the moon, Apollo 11 had the "Apollo Guidance Computer" (AGC) which had 74KB of fixed read-only memory, and 2KB of re-writeable memory. Over [3500 IBM employees were involved] to get the astronauts to the moon and safely back to earth again.
The importance of this computer was highlighted in a [lecture by astronaut David Scott] who said: "If you have a basketball and a baseball 14 feet apart, where the baseball represents the moon and the basketball represents the Earth, and you take a piece of paper sideways, the thinness of the paper would be the corridor you have to hit when you come back."
- Megabyte (MB)
The Megabyte is a thousand KB, or a million bytes. The 3.5-inch floppy diskette, mentioned in my post [A Boxfull of Floppies] could hold 1.44MB, or about 360 pages of text.
The first commercial disk system, the [350 Disk Storage Unit, introduced in 1956 for the IBM RAMAC computer], could hold only 5 MB and was the size of two refrigerators. While 5MB might not seem like much today, it is enough to hold the [Complete works of Shakespeare]. That's right, all 42 plays, poems and sonnets.
In the article [Wikipedia as a printed book], the printing of a select 400 articles resulted in a book 29 inches thick. Those 5,000 pages would consume about 20 MB of space.
One of my favorite resources I use to search is the Internet Movie Data Base [IMDB]. Leaving out the photos and videos, the [text-only portion of the IMDB database is just over 600 MB], representing nearly all of the actors, awards, nominations, television shows and movies. A standard CD-ROM can hold 700MB, so the text portion of the IMDB could easily fit on a single CD.
- Gigabyte (GB)
The Gigabyte is a thousand MB, or a billion bytes. My Thinkpad T410 laptop has 4GB of RAM and 320GB of hard disk space. My laptop comes with a DVD burner, and each DVD can hold up to 4.7GB of information.
The popular Wikipedia now has some 17 million articles, of which 3.5 million are in English language. It would only take [14GB of space to hold the entire English portion] of Wikipedia. That is small enough to fit on twenty CDs, three DVDs, an Apple iPad or my cellphone (a Samsung Galaxy S Vibrant).
Perhaps you are thinking, "Someone should offer Wikipedia pre-installed on a small handheld!" Too late. The [The Humane Reader] is able to offer 5,000 books and Wikipedia in a small device that connects to your television. This would be great for people who do not have access to the internet, or for parents who want their kids to do their homework, but not be online while they are doing it.
In the latest 2009 report of [How Much Information?] from the University of California, San Diego, the average American consumes 34 GB of information. This includes all the information from radio, television, newspapers, magazines, books and the internet that a person might look at or listen to throughout the day. This project is sponsored by IBM and others to help people understand the nature of our information-consuption habits.
Back in 1992, I visited a client in Germany. Their 90 GB of disk storage attached to their mainframe was the size of three refrigerators, and took five full-time storage administrators to manage.
- Terabyte (TB)
The Terabyte is a thousand GB, or a trillion bytes. It is now possible to buy external USB drive for your laptop or personal computer that holds 1TB or more. However, at 40MB/sec speeds that USB 2.0 is capable of, it would take seven hours to do a bulk transfer in or out of the device.
IBM offers 1TB and 2TB disk drives in many of our disk systems. In 2008, IBM was preparing to announce the first 1TB tape drive. However, Sun Microsystems announced their own 1TB drive the day before our big announcement, so IBM had to rephrase the TS1130 announcement to [The World's Fastest 1TB tape drive!]
In his book [The Singularity is Near: When Humans Transcend Biology], Ray Kurzweil estimates the human brain's memory can hold about 1.25 TB of information. This would make IBM Watson about 80 percent human.
A typical academic research library will hold about 2TB of information. For the [US Library of Congress] print collection is considered to be about 10TB, and their web capture team has collected 160TB of digital data. If you are ever in the Washington DC, I strongly recommend a visit to the Library of Congress. It is truly stunning!
Full-length computer animated movies, like [Happy Feet], consume about 100TB of disk storage during production. IBM offers disk systems that can hold this much data. For example, the IBM XIV can hold up to 151 TB of usable disk space in the size of one refrigerator.
A Key Performance Indicator (KPI) for some larger companies is the number of TB that can be managed by a full-time employee, referred to as TB/FTE. Discussions about TB/FTE are available from IT analysts including [Forrester Research] and [The Info Pro].
The website [Ancestry.com] claims to have over 540 million names in its genealogical database, with a storage of 600TB, with the inclusion of [US census data from 1790 to 1930]. The US government took nine years to process the 1880 census, so for the 1890 census, it rented equipment from Herman Hollerith's Tabulating Machine Company. This company would later merge with two others in 1911 to form what is now called IBM.
- Petabyte (PB)
A Petabyte is thousand TB, or a quadrillion bytes. It is estimated that all printed materials on Earth would represent approximately 200 PB of information.
IBM's largest disk system, the Scale-Out Network Attach Storage (SONAS) comprised of up to 7,200 disk drives, which can hold over 11 PB of information. A smaller 10-frame model, the same size as IBM Watson, with six interface nodes and 19 storage pods, could hold over 7 PB of information.
IBM's automated [TS3500 tape library with high-density frames] can hold up to 60 PB of compressed tape data. A smaller 10-frame model, the size of IBM Watson, could hold up to 36 PB of data.
For those of us in the IT industry, 1TB is small potatoes. I for one, was expecting it to be much bigger. But for everyone else, the equivalent of 200 million pages of text that IBM Watson has loaded inside is an incredibly large repository of information. I suspect IBM Watson probably contains the complete works of Shakespeare as well as other fiction writers, the IMDB database, all 3.5 million articles of Wikipedia, religious texts like the Bible and the Quran, famous documents like the Magna Carta and the US Constitution, and reference books like a Dictionary, a Thesaurus, and "Gray's Anatomy". And, of course, lots and lots of lists.
For those on Twitter, follow [@ibmwatson] these next three days during the challenge.