Safe Harbor Statement: The information on IBM products is intended to outline IBM's general product direction and it should not be relied on in making a purchasing decision. The information on the new products is for informational purposes only and may not be incorporated into any contract. The information on IBM products is not a commitment, promise, or legal obligation to deliver any material, code, or functionality. The development, release, and timing of any features or functionality described for IBM products remains at IBM's sole discretion.
Tony Pearson is a an active participant in local, regional, and industry-specific interests, and does not receive any special payments to mention them on this blog.
Tony Pearson receives part of the revenue proceeds from sales of books he has authored listed in the side panel.
Tony Pearson is not a medical doctor, and this blog does not reference any IBM product or service that is intended for use in the diagnosis, treatment, cure, prevention or monitoring of a disease or medical condition, unless otherwise specified on individual posts.
Tony Pearson is a Master Inventor and Senior Software Engineer for the IBM Storage product line at the
IBM Executive Briefing Center in Tucson Arizona, and featured contributor
to IBM's developerWorks. In 2016, Tony celebrates his 30th year anniversary with IBM Storage. He is
author of the Inside System Storage series of books. This blog is for the open exchange of ideas relating to storage and storage networking hardware, software and services. You can also follow him on Twitter @az990tony.
(Short URL for this blog: ibm.co/Pearson
The IBM Challenge was a big success. One of the contestants, Ken Jennings, [welcomes our new computer overlords]. Congratulations are in order to the IBM Research team who pulled off this Herculean effort!
Some folks have poked fun at some of the odd responses and wager amounts from the IBM Watson computer during the three-day tournament. Others were surprised as I was that the impressive feat was done with less than 1TB of stored data. Here is what John Webster wrote in CNET yesterday, in hist article [What IBM's Watson says to storage systems developers]:
"All well and good. But here's what I find most interesting as a result of what IBM has done in response to the Grand Challenge that motivated Watson's creators. We know, from Tony Pearson's blog, that the foundation of Watson's data storage system is a modified IBM SONAS cluster with a total of 21.6TB of raw capacity. But Pearson also reveals another very significant, and to me, surprising data point: "When Watson is booted up, the 15TB of total RAM are loaded up, and thereafter the DeepQA processing is all done from memory. According to IBM Research, the actual size of the data (analyzed and indexed text, knowledge bases, etc.) used for candidate answer generation and evidence evaluation is under 1 Terabyte."
What Pearson just said is that the data set Watson actually uses to reach his push-the-button decision would fit on a 1TB drive. So much for big data?"
To better appreciate how difficult the challenge was, and how a small amount of data can answer a billion different questions, I thought I would cover Business Intelligence, Data Retrieval and Text Mining concepts.
"In this paper, business is a collection of activities carried
on for whatever purpose, be it science, technology,
commerce, industry, law, government, defense, et cetera.
The communication facility serving the conduct of a business
(in the broad sense) may be referred to as an intelligence
system. The notion of intelligence is also defined
here, in a more general sense, as the ability to apprehend
the interrelationships of presented facts in such a way as
to guide action towards a desired goal."
Ideally, when you need "Business Intelligence" to help you make a better decision, you perform data retrieval from a structured database for the specific information you are looking for. In other cases, you might be looking for insight, patterns or trends. In that case, you go "data mining" against your structured databases.
Here's a simple example. John runs a fruit stand. One day, he kept track of how many apples and oranges were bought by men and women. How many questions can we ask against this small set of data? Let's count them:
How many apples were sold to men?
How many apples were sold to women?
How many oranges were sold to men?
How many oranges were sold to women?
But wait! For each row and column, we can combine them into totals.
How many apples were sold in total?
How many oranges were sold in total?
How many fruit in total were sold to men?
How many fruit in total were sold to women?
How many fruit in total were sold?
But wait, there's more! Each row and column can be evaluated for relative percentages, as well as percentages of each cell compared to the total. You could make five relevant pie-charts from this data. This results in 16 more questions, such as:
Of the fruit purchased by men, what percentage for apples?
Of all the apples purchased, what percentage by women?
And that's not including more ethereal questions, such as:
Are there gender-specific preferences for different types of fruit?
What type of fruit do men prefer?
This is just for a small set, two market segments (by gender) and two products (apples and oranges). However, if you have many market segments (perhaps by age group, zip code, etc.) and many products, the number of queries that can be supported is huge. For small sets of data, you can easily do this with a spreadsheet program like IBM Lotus Symphony or Microsoft Excel.
But why limit yourself to two dimensions? The above example was just for one day's worth of activity, if John captures this data for every day for historical and seasonal trending, it can be represented as a three-dimensional cube. The number of queries becomes astronomical. This is the basis for Online Analytical Processing (OLAP), and three-dimensional tables are often referred to as [OLAP cubes].
Back in 1970, IBM invented the Structured Query Language [SQL], and today, nearly all modern relational databases support this, including IBM DB2, Informix, Microsoft SQL Server, and Oracle DB. SQL poses two challenges. First, you had to structure the data in advance to the way you expect to perform your ad-hoc queries. Deciding the groups and categories in advance can limit the way information is recorded and captured.
Second, you had to be skilled at SQL to phrase your queries correctly to retrieve the data you are after. What ended up happening was that skilled SQL programmers would develop "canned reports" with fixed SQL parameters, so that less-skilled business decision makers could base their decisions from these reports.
IBM has fully integrated stacks to help process structured data, combining servers, storage, and advanced analytics software into a complete appliance. IBM offers the [Smart Analytics System] for robust, customized deployments, and recently acquired [Netezza] for pre-configured, and more rapid deployments.
However, the bigger problem is that more than 80 percent of information is not structured!
Semi-structured data like email provides some searchable fields like From and Subject. The rest of the information is unstructured, such as text files, photographs, video and audio. To look for specific information in unstructured sources can be like looking for a needle in a haystack, and trying to get insight, patterns or trends involves text mining.
This, in effect, is what IBM Watson was able to perform so well this week. Finding the needle in the haystacks of unstructured data from 200 million pages of text stored in its system, combined with the ability to apprehend the interrelationships of meaning and subtle nuance, resulted in an impressive technology demonstration. Certainly, this new technology will be powerful for a variety of use cases across a broad set of industries!
Sadly, only 70 percent of doctors in the United States use Electronic Medical Record [EMR] systems. My own Primary Care Physician has made the switch, and told me he how much he loves having ready access to the information he needs. EMR systems reduce costs, help manage risk, and improve healthcare outcomes. It is no surprise that the U.S. government has taken a [stick-and-carrot approach] to encourage doctors to use them.
A frequent topic at the Tucson Executive Briefing Center where I work is how to make the most use of IT for healthcare and life sciences. For much of 2011 and 2012, I was also one of the technical advocates assigned to Wellpoint Insurance, in support of their adoption of IBM Watson technology for healthcare.
The Tucson Executive Briefing Center hosted 20 dignitaries from local companies and academia.
This is a historic competition, an exhibition match pitting a computer against the top two celebrated Jeopardy champions:
Brad Rutter, won $3.2 million USD on Jeopardy!, winning 5 days on the show, and then three later tournamets.
Ken Jennings, winning $2.5 million in a 74-day winning streak on Jeopardy!
One of the members of the audience had never seen an episode of Jeopardy! in his life.
(Note: there are NO SPOILERS in this blog post. If you have not yet watched the show, you are safe to continue reading the rest of this post. I will not
disclose the correct responses to any of the clues nor how well each contestant scored.)
Calline Sanchez, IBM Director, Systems Storage Development for Data Protection and Retention, kicked off today's ceremonies.
The IBM Watson computer, named after IBM founder Thomas J. Watson, has been developed over the past 4 years by a team of IBM scientists who set out to accomplish a grand challenge - build a computing system that rivals a human's ability to answer questions posed in natural language with speed, accuracy and confidence. IBM Research labs in the United States, Japan, China and Israel [collaborated with Artificial Intelligence (AI) experts at eight universities], including Massachusetts Institute of Technology (MIT), University of Texas (UT) at Austin, University of Southern California (USC), Rensselaer Polytechnic Institute (RPI), University at Albany (UAlbany), University of Trento (Italy), University of Massachusetts Amherst, and Carnegie Mellon University.
(Disclaimer: I attended the University of Texas at Austin. My father attended Carnegie Mellon University.)
Last week, NOVA on PBS had a special episode on the making of IBM Watson, you can [watch it online] on their website. Delaney Turner, IBM Social Media Communications Manager for Business Analytics Software, has posted [his observations of Nova].
Since IBM Watson is the size of 10 refrigerators and weighs over 14,000 pounds, it was easier to design the Jeopardy! set at the TJ Watson Research lab in Yorktown Heights, NY, than to ship it over to California where the show is normally recorded. Two of the visual designers that worked on this set, as well as on the visual appearance of Watson, live in Tucson and were part of our audience today.
The IBM Challenge consists of a two-game tournament, where the scores of both games will be added to determine winner rankings. The producers of Jeopardy! will give $1 million dollars USD to first place, $300,000 to second place, and $200,000 to third place. Regardless of outcome, [IBM will donate all of its winings to charity]. The two human contestants plan to donate half of their earnings to their favorite charities as well.
Jeopardy! The IBM Challenge
Alex Trebek introduces IBM Watson, explaining that it can neither hear nor see. It will receive all information electronically. Categories and clues will be sent as text files via TCP/IP over Ethernet at the same time the two human contestants see them so that all have the same time to think about the right answer.
Watson has two rows of five racks, back to back. This was done so that cold air could rise up from holes in the tile floors around the unit, and all the hot air would be forced into the center and up to the ceiling return. This technique is known as "hot aisle/cold aisle" design. Alex Trebek opens one of the rack doors to show a series of 4U-high IBM Power 750 servers.
The avatar is a representation of Watson, as the machine itself is too big to fit behind the podium. The avatar is IBM's "Smarter Planet" logo with orbiting streaks and circles. It shows "Green" when it has high confidence, and orange when it gets an answer wrong. When busy thinking, the streaks and circles speed up, the closest we will see to "watching a computer sweat."
During the show, an "Answer panel" shows Watson's top three candidate responses, with confidence level compared to its current "buzz threshold".
Watson knows what it knows, and knows what it doesn't know. Here is an [Interactive Watson Game] on New York Times website to give you an idea of how the answer panel works. I was impressed with how close all three candidate answers were. In a question about Olympic swimmers, all three candidates are Olympic swimmers. In a question about the novel "Les Miserables", all three candidates were characters of that novel.
Well, IBM Watson did well, but missed answered some questions incorrectly. This [parody Slate video] pokes fun at this. Here were some discussions we had after the show ended:
IBM did not do well in categories that required [abductive reasoning]. For example, to identify two or three things that happened in different years, and then postulate that what they all have in common is a specific decade (such as the 1950s) is difficult.
Watson does not hear the wrong answers from the two human contestants. For one question, Ken buzzes in first, guesses wrong, then Watson buzzes in with the same exact response. Alex Trebek rebukes Watson with "No, Ken just said that!" Brad would learn from their mistakes and guess correctly for the score.
Watson is provided the correct answer after a contestant guesses it correctly, or if nobody does, when Alex provides the correct response. This is sent as a text message to Watson immediately, so that it can use this information to adjust its algorithms and machine-learning for future clues in that same category. This was evident in the "Answer panel" on the fourth and fifth attempts on the category of "Decades".
With this demonstration, IBM Research has advanced science by leaps and bounds for the Articial Intelligence community. IBM is a leader in Business Analytics, and this technology will find uses in a variety of industries. The average knowledge worker spends 30 percent of her time looking for information on corporate data repositories. By demonstrating a computer that can provide answers quickly, employees will be more productive, make stronger business decisions, and have greater insight.
Day 1 was only able to cover the first round of Game 1. This allowed more time to talk about the history and technology of IBM Watson. Tomorrow, the contestants will finish Game 1 and head into Game 2.
"When Watson is booted up, the 15TB of total RAM are loaded up, and thereafter the DeepQA processing is all done from memory. According to IBM Research, the actual size of the data (analyzed and indexed text, knowledge bases, etc.) used for candidate answer generation and evidence evaluation is under 1 Terabyte (TB). For performance reasons, various subsets of the data are replicated in RAM on different functional groups of cluster nodes. The entire system is self-contained, Watson is NOT going to the internet searching for answers."
I had several readers ask me to explain the significance of the "Terabyte". I'll work my way up.
A bit is simply a zero (0) or one (1). This could answer a Yes/No or True/False question.
Most computers have standardized a byte as a collection of 8 bits. There are 256 unique combinations of ones and zeros possible, so a byte could be used to storage a 2-digit integer, or a single upper or lower case character in the English alphabet. In pratical terms, a byte could store your age in years, or your middle initial.
The Kilobyte is a thousand bytes, enough to hold a few paragraphs of text. A typical written page could be held in 4 KB, for example.
The IBM Challenge to play on Jeopardy! is being compared to the historic 1969 moon landing. To land on the moon, Apollo 11 had the "Apollo Guidance Computer" (AGC) which had 74KB of fixed read-only memory, and 2KB of re-writeable memory. Over [3500 IBM employees were involved] to get the astronauts to the moon and safely back to earth again.
The importance of this computer was highlighted in a [lecture by astronaut David Scott] who said: "If you have a basketball and a baseball 14 feet apart, where the baseball represents the moon and the basketball represents the Earth, and you take a piece of paper sideways, the thinness of the paper would be the corridor you have to hit when you come back."
The Megabyte is a thousand KB, or a million bytes. The 3.5-inch floppy diskette, mentioned in my post [A Boxfull of Floppies] could hold 1.44MB, or about 360 pages of text.
In the article [Wikipedia as a printed book], the printing of a select 400 articles resulted in a book 29 inches thick. Those 5,000 pages would consume about 20 MB of space.
One of my favorite resources I use to search is the Internet Movie Data Base [IMDB]. Leaving out the photos and videos, the [text-only portion of the IMDB database is just over 600 MB], representing nearly all of the actors, awards, nominations, television shows and movies. A standard CD-ROM can hold 700MB, so the text portion of the IMDB could easily fit on a single CD.
The Gigabyte is a thousand MB, or a billion bytes. My Thinkpad T410 laptop has 4GB of RAM and 320GB of hard disk space. My laptop comes with a DVD burner, and each DVD can hold up to 4.7GB of information.
The popular Wikipedia now has some 17 million articles, of which 3.5 million are in English language. It would only take [14GB of space to hold the entire English portion] of Wikipedia. That is small enough to fit on twenty CDs, three DVDs, an Apple iPad or my cellphone (a Samsung Galaxy S Vibrant).
Perhaps you are thinking, "Someone should offer Wikipedia pre-installed on a small handheld!" Too late. The [The Humane Reader] is able to offer 5,000 books and Wikipedia in a small device that connects to your television. This would be great for people who do not have access to the internet, or for parents who want their kids to do their homework, but not be online while they are doing it.
In the latest 2009 report of [How Much Information?] from the University of California, San Diego, the average American consumes 34 GB of information. This includes all the information from radio, television, newspapers, magazines, books and the internet that a person might look at or listen to throughout the day. This project is sponsored by IBM and others to help people understand the nature of our information-consuption habits.
Back in 1992, I visited a client in Germany. Their 90 GB of disk storage attached to their mainframe was the size of three refrigerators, and took five full-time storage administrators to manage.
The Terabyte is a thousand GB, or a trillion bytes. It is now possible to buy external USB drive for your laptop or personal computer that holds 1TB or more. However, at 40MB/sec speeds that USB 2.0 is capable of, it would take seven hours to do a bulk transfer in or out of the device.
IBM offers 1TB and 2TB disk drives in many of our disk systems. In 2008, IBM was preparing to announce the first 1TB tape drive. However, Sun Microsystems announced their own 1TB drive the day before our big announcement, so IBM had to rephrase the TS1130 announcement to [The World's Fastest 1TB tape drive!]
A typical academic research library will hold about 2TB of information. For the [US Library of Congress] print collection is considered to be about 10TB, and their web capture team has collected 160TB of digital data. If you are ever in the Washington DC, I strongly recommend a visit to the Library of Congress. It is truly stunning!
Full-length computer animated movies, like [Happy Feet], consume about 100TB of disk storage during production. IBM offers disk systems that can hold this much data. For example, the IBM XIV can hold up to 151 TB of usable disk space in the size of one refrigerator.
A Key Performance Indicator (KPI) for some larger companies is the number of TB that can be managed by a full-time employee, referred to as TB/FTE. Discussions about TB/FTE are available from IT analysts including [Forrester Research] and [The Info Pro].
The website [Ancestry.com] claims to have over 540 million names in its genealogical database, with a storage of 600TB, with the inclusion of [US census data from 1790 to 1930]. The US government took nine years to process the 1880 census, so for the 1890 census, it rented equipment from Herman Hollerith's Tabulating Machine Company. This company would later merge with two others in 1911 to form what is now called IBM.
A Petabyte is thousand TB, or a quadrillion bytes. It is estimated that all printed materials on Earth would represent approximately 200 PB of information.
IBM's largest disk system, the Scale-Out Network Attach Storage (SONAS) comprised of up to 7,200 disk drives, which can hold over 11 PB of information. A smaller 10-frame model, the same size as IBM Watson, with six interface nodes and 19 storage pods, could hold over 7 PB of information.
For those of us in the IT industry, 1TB is small potatoes. I for one, was expecting it to be much bigger. But for everyone else, the equivalent of 200 million pages of text that IBM Watson has loaded inside is an incredibly large repository of information. I suspect IBM Watson probably contains the complete works of Shakespeare as well as other fiction writers, the IMDB database, all 3.5 million articles of Wikipedia, religious texts like the Bible and the Quran, famous documents like the Magna Carta and the US Constitution, and reference books like a Dictionary, a Thesaurus, and "Gray's Anatomy". And, of course, lots and lots of lists.
For those on Twitter, follow [@ibmwatson] these next three days during the challenge.
For the longest time, people thought that humans could not run a mile in less than four minutes. Then, in 1954, [Sir Roger Bannister] beat that perception, and shortly thereafter, once he showed it was possible, many other runners were able to achieve this also. The same is being said now about the IBM Watson computer which appeared this week against two human contestants on Jeopardy!
(2014 Update: A lot has happened since I originally wrote this blog post! I intended this as a fun project for college students to work on during their summer break. However, IBM is concerned that some businesses might be led to believe they could simply stand up their own systems based entirely on open source and internally developed code for business use. IBM recommends instead the [IBM InfoSphere BigInsights] which packages much of the software described below. IBM has also launched a new "Watson Group" that has [Watson-as-a-Service] capabilities in the Cloud. To raise awareness to these developments, IBM has asked me to rename this post from IBM Watson - How to build your own "Watson Jr." in your basement to the new title IBM Watson -- How to replicate Watson hardware and systems design for your own use in your basement. I also took this opportunity to improve the formatting layout.)
Often, when a company demonstrates new techology, these are prototypes not yet ready for commercial deployment until several years later. IBM Watson, however, was made mostly from commercially available hardware, software and information resources. As several have noted, the 1TB of data used to search for answers could fit on a single USB drive that you buy at your local computer store.
Take a look at the [IBM Research Team] to determine how the project was organized. Let's decide what we need, and what we don't in our version for personal use:
Do we need it for personal use?
Yes, That's you. Assuming this is a one-person project, you will act as Team Lead.
Yes, I hope you know computer programming!
No, since this version for personal use won't be appearing on Jeopardy, we won't need strategy on wager amounts for the Daily Double, or what clues to pick next. Let's focus merely on a computer that can accept a question in text, and provide an answer back, in text.
Yes, this team focused on how to wire all the hardware together. We need to do that, although this version for personal use will have fewer components.
Optional. For now, let's have this version for personal use just return its answer in plain text. Consider this Extra Credit after you get the rest of the system working. Consider using [eSpeak], [FreeTTS], or the Modular Architecture for Research on speech sYnthesis [MARY] Text-to-Speech synthesizers.
Yes, I will explain what this is, and why you need it.
Yes, we will need to get information for personal use to process
Yes, this team developed a system for parsing the question being asked, and to attach meaning to the different words involved.
No, this team focused on making IBM Watson optimized to answer in 3 seconds or less. We can accept a slower response, so we can skip this.
(Disclaimer: As with any Do-It-Yourself (DIY) project, I am not responsible if you are not happy with your version for personal use I am basing the approach on what I read from publicly available sources, and my work in Linux, supercomputers, XIV, and SONAS. For our purposes, this version for personal use is based entirely on commodity hardware, open source software, and publicly available sources of information. Your implementation will certainly not be as fast or as clever as the IBM Watson you saw on television.)
Step 1: Buy the Hardware
Supercomputers are built as a cluster of identical compute servers lashed together by a network. You will be installing Linux on them, so if you can avoid paying extra for Microsoft Windows, that would save you some money. Here is your shopping list:
Three x86 hosts, with the following:
64-bit quad-core processor, either Intel-VT or AMD-V capable,
8GB of DRAM, or larger
300GB of hard disk, or larger
CD or DVD Read/Write drive
Computer Monitor, mouse and keyboard
Ethernet 1GbE 4-port hub, and appropriate RJ45 cables
Surge protector and Power strip
Local Console Monitor (LCM) 4-port switch (formerly known as a KVM switch) and appropriate cables. This is optional, but will make it easier during the development. Once your implementation is operational, you will only need the monitor and keyboard attached to one machine. The other two machines can remain "headless" servers.
Step 2: Establish Networking
IBM Watson used Juniper switches running at 10Gbps Ethernet (10GbE) speeds, but was not connected to the Internet while playing Jeopardy! Instead, these Ethernet links were for the POWER7 servers to talk to each other, and to access files over the Network File System (NFS) protocol to the internal customized SONAS storage I/O nodes.
The implementation will be able to run "disconnected from the Internet" as well. However, you will need Internet access to download the code and information sources. For our purposes, 1GbE should be sufficient. Connect your Ethernet hub to your DSL or Cable modem. Connect all three hosts to the Ethernet switch. Connect your keyboard, video monitor and mouse to the LCM, and connect the LCM to the three hosts.
Step 3: Install Linux and Middleware
To say I use Linux on a daily basis is an understatement. Linux runs on my Android-based cell phone, my laptop at work, my personal computers at home, most of our IBM storage devices from SAN Volume Controller to XIV to SONAS, and even on my Tivo at home which recorded my televised episodes of Jeopardy!
For this project, you can use any modern Linux distribution that supports KVM. IBM Watson used Novel SUSE Linux Enterprise Server [SLES 11]. Alternatively, I can also recommend either Red Hat Enterprise Linux [RHEL 6] or Canonical [Ubuntu v10]. Each distribution of Linux comes in different orientations. Download the the 64-bit "ISO" files for each version, and burn them to CDs.
Graphical User Interface (GUI) oriented, often referred to as "Desktop" or "HPC-Head"
Command Line Interface (CLI) oriented, often referred to as "Server" or "HPC-Compute"
Guest OS oriented, to run in a Hypervisor such as KVM, Xen, or VMware. Novell calls theirs "Just Enough Operating System" [JeOS].
For this version for personal use, I have chosen a [multitier architecture], sometimes referred to as an "n-tier" or "client/server" architecture.
Host 1 - Presentation Server
For the Human-Computer Interface [HCI], the IBM Watson received categories and clues as text files via TCP/IP, had a [beautiful avatar] representing a planet with 42 circles streaking across in orbit, and text-to-speech synthesizer to respond in a computerized voice. Your implementation will not be this sophisticated. Instead, we will have a simple text-based Query Panel web interface accessible from a browser like Mozilla Firefox.
Host 1 will be your Presentation Server, the connection to your keyboard, video monitor and mouse. Install the "Desktop" or "HPC Head Node" version of Linux. Install [Apache Web Server and Tomcat] to run the Query Panel. Host 1 will also be your "programming" host. Install the [Java SDK] and the [Eclipse IDE for Java Developers]. If you always wanted to learn Java, now is your chance. There are plenty of books on Java if that is not the language you normally write code.
While three little systems doesn't constitute an "Extreme Cloud" environment, you might like to try out the "Extreme Cloud Administration Tool", called [xCat], which was used to manage the many servers in IBM Watson.
Host 2 - Business Logic Server
Host 2 will be driving most of the "thinking". Install the "Server" or "HPC Compute Node" version of Linux. This will be running a server virtualization Hypervisor. I recommend KVM, but you can probably run Xen or VMware instead if you like.
Host 3 - File and Database Server
Host 3 will hold your information sources, indices, and databases. Install the "Server" or "HPC Compute Node" version of Linux. This will be your NFS server, which might come up as a question during the installation process.
Technically, you could run different Linux distributions on different machines. For example, you could run "Ubuntu Desktop" for host 1, "RHEL 6 Server" for host 2, and "SLES 11" for host 3. In general, Red Hat tries to be the best "Server" platform, and Novell tries to make SLES be the best "Guest OS".
My advice is to pick a single distribution and use it for everything, Desktop, Server, and Guest OS. If you are new to Linux, choose Ubuntu. There are plenty of books on Linux in general, and Ubuntu in particular, and Ubuntu has a helpful community of volunteers to answer your questions.
Step 4: Download Information Sources
You will need some documents for your implementation to process.
IBM Watson used a modified SONAS to provide a highly-available clustered NFS server. For this version, we won't need that level of sophistication. Configure Host 3 as the NFS server, and Hosts 1 and 2 as NFS clients. See the [Linux-NFS-HOWTO] for details. To optimize performance, host 3 will be the "official master copy", but we will use a Linux utility called rsync to copy the information sources over to the hosts 1 and 2. This allows the task engines on those hosts to access local disk resources during question-answer processing.
We will also need a relational database. You won't need a high-powered IBM DB2. Your implementation can do fine with something like [Apache Derby] which is the open source version of IBM CloudScape from its Informix acquisition. Set up Host 3 as the Derby Network Server, and Hosts 1 and 2 as Derby Network Clients. For more about structured content in relational databases, see my post [IBM Watson - Business Intelligence, Data Retrieval and Text Mining].
Linux includes a utility called wget which allows you to download content from the Internet to your system. What documents you decide to download is up to you, based on what types of questions you want answered. For example, if you like Literature, check out the vast resources at [FullBooks.com]. You can automate the download by writing a shell script or program to invoke wget to all the places you want to fetch data from. Rename the downloaded files to something unique, as often they are just "index.html". For more on wget utility, see [IBM Developerworks].
Step 5: The Query Panel - Parsing the Question
Next, we need to parse the question and have some sense of what is being asked for. For this we will use [OpenNLP] for Natural Language Processing, and [OpenCyc] for the conceptual logic reasoning. See Doug Lenat presenting this 75-minute video [Computers versus Common Sense]. To learn more, see the [CYC 101 Tutorial].
Unlike Jeopardy! where Alex Trebek provides the answer and contestants must respond with the correct question, we will do normal Question-and-Answer processing. To keep things simple, we will limit questions to the following formats:
Who is ...?
Where is ...?
When did ... happen?
What is ...?
Host 1 will have a simple Query Panel web interface. At the top, a place to enter your question, and a "submit" button, and a place at the bottom for the answer to be shown. When "submit" is pressed, this will pass the question to "main.jsp", the Java servlet program that will start the Question-answering analysis. Limiting the types of questions that can be posed will simplify hypothesis generation, reduce the candidate set and evidence evaluation, allowing the analytics processing to continue in reasonable time.
Step 6: Unstructured Information Management Architecture
The "heart and soul" of IBM Watson is Unstructured Information Management Architecture [UIMA]. IBM developed this, then made it available to the world as open source. It is maintained by the [Apache Software Foundation], and overseen by the Organization for the Advancement of Structured Information Standards [OASIS].
Basically, UIMA lets you scan unstructured documents, gleam the important points, and put that into a database for later retrieval. In the graph above, DBs means 'databases' and KBs means 'knowledge bases'. See the 4-minute YouTube video of [IBM Content Analytics], the commercial version of UIMA.
Starting from the left, the Collection Reader selects each document to process, and creates an empty Common Analysis Structure (CAS) which serves as a standardized container for information. This CAS is passed to Analysis Engines , composed of one or more Annotators which analyze the text and fill the CAS with the information found. The CAS are passed to CAS Consumers which do something with the information found, such as enter an entry into a database, update an index, or update a vote count.
(Note: This point requires, what we in the industry call a small matter of programming, or [SMOP]. If you've always wanted to learn Java programming, XML, and JDBC, you will get to do plenty here. )
If you are not familiar with UIMA, consider this [UIMA Tutorial].
Step 7: Parallel Processing
People have asked me why IBM Watson is so big. Did we really need 2,880 cores of processing power? As a supercomputer, the 80 TeraFLOPs of IBM Watson would place it only in 94th place on the [Top 500 Supercomputers]. While IBM Watson may be the [Smartest Machine on Earth], the most powerful supercomputer at this time is the Tianhe-1A with more than 186,000 cores, capable of 2,566 TeraFLOPs.
To determine how big IBM Watson needed to be, the IBM Research team ran the DeepQA algorithm on a single core. It took 2 hours to answer a single Jeopardy question! Let's look at the performance data:
Number of cores
Time to answer one Jeopardy question
Single IBM Power750 server
< 4 minutes
Single rack (10 servers)
< 30 seconds
IBM Watson (90 servers)
< 3 seconds
The old adage applies, [many hands make for light work]. The idea is to divide-and-conquer. For example, if you wanted to find a particular street address in the Manhattan phone book, you could dispatch fifty pages to each friend and they could all scan those pages at the same time. This is known as "Parallel Processing" and is how supercomputers are able to work so well. However, not all algorithms lend well to parallel processing, and the phrase [nine women can't have a baby in one month] is often used to remind us of this.
Fortuantely, UIMA is designed for parallel processing. You need to install UIMA-AS for Asynchronous Scale-out processing, an add-on to the base UIMA Java framework, supporting a very flexible scale-out capability based on JMS (Java Messaging Services) and ActiveMQ. We will also need Apache Hadoop, an open source implementation used by Yahoo Search engine. Hadoop has a "MapReduce" engine that allows you to divide the work, dispatch pieces to different "task engines", and the combine the results afterwards.
Host 2 will run Hadoop and drive the MapReduce process. Plan to have three KVM guests on Host 1, four on Host 2, and three on Host 3. That means you have 10 task engines to work with. These task engines can be deployed for Content Readers, Analysis Engines, and CAS Consumers. When all processing is done, the resulting votes will be tabulated and the top answer displayed on the Query Panel on Host 1.
Step 8: Testing
To simplify testing, use a batch processing approach. Rather than entering questions by hand in the Query Panel, generate a long list of questions in a file, and submit for processing. This will allow you to fine-tune the environment, optimize for performance, and validate the answers returned.
There you have it. By the time you get your implementation fully operational, you will have learned a lot of useful skills, including Linux administration, Ethernet networking, NFS file system configuration, Java programming, UIMA text mining analysis, and MapReduce parallel processing. Hopefully, you will also gain an appreciation for how difficult it was for the IBM Research team to accomplish what they had for the Grand Challenge on Jeopardy! Not surprisingly, IBM Watson is making IBM [as sexy to work for as Apple, Google or Facebook], all of which started their business in a garage or a basement with a system as small as this version for personal use.