This blog is for the open exchange of ideas relating to IBM Systems, storage and storage networking hardware, software and services.
(Short URL for this blog: ibm.co/Pearson )
Tony Pearson is a Master Inventor, Senior IT Architect and Event Content Manager for [IBM Systems for IBM Systems Technical University] events. With over 30 years with IBM Systems, Tony is a frequent traveler, speaking to clients at events throughout the world.
Lloyd Dean is an IBM Senior Certified Executive IT Architect in Infrastructure Architecture. Lloyd has held numerous senior technical roles at IBM during his 19 plus years at IBM. Lloyd most recently has been leading efforts across the Communication/CSI Market as a senior Storage Solution Architect/CTS covering the Kansas City territory. In prior years Lloyd supported the industry accounts as a Storage Solution architect and prior to that as a Storage Software Solutions specialist during his time in the ATS organization.
Lloyd currently supports North America storage sales teams in his Storage Software Solution Architecture SME role in the Washington Systems Center team. His current focus is with IBM Cloud Private and he will be delivering and supporting sessions at Think2019, and Storage Technical University on the Value of IBM storage in this high value IBM solution a part of the IBM Cloud strategy. Lloyd maintains a Subject Matter Expert status across the IBM Spectrum Storage Software solutions. You can follow Lloyd on Twitter @ldean0558 and LinkedIn Lloyd Dean.
Tony Pearson's books are available on Lulu.com! Order your copies today!
Safe Harbor Statement: The information on IBM products is intended to outline IBM's general product direction and it should not be relied on in making a purchasing decision. The information on the new products is for informational purposes only and may not be incorporated into any contract. The information on IBM products is not a commitment, promise, or legal obligation to deliver any material, code, or functionality. The development, release, and timing of any features or functionality described for IBM products remains at IBM's sole discretion.
Tony Pearson is an active participant in local, regional, and industry-specific interests, and does not receive any special payments to mention them on this blog.
Tony Pearson receives part of the revenue proceeds from sales of books he has authored listed in the side panel.
Tony Pearson is not a medical doctor, and this blog does not reference any IBM product or service that is intended for use in the diagnosis, treatment, cure, prevention or monitoring of a disease or medical condition, unless otherwise specified on individual posts.
Well, it's that Back-To-School time again! Mo's thirteen-year-old reluctantly enters the eighth grade, still upset the summer ended so abruptly. Richard's nephew returns to the University of Arizona for another year. Natalie has chosen to move to Phoenix and pursue a post-grad degree at Arizona State University. They all have two things in common: they all want a new computer, and they are all on a budget.
Fellow blogger Bob Sutor (IBM) pointed me to an excellent article on [How to Build Your Own $200 PC], which reminded me of the [XS server I built] for my 2008 Google Summer of Code project with the One Laptop per Child organization. Now that the project is over, I have upgraded it to Ubuntu Desktop 10.04 LTS, known as Lucid Lynx. Building your own PC with your student is a great learning experience in itself. Of course, this is just the computer itself; you still need to buy the keyboard, mouse and video monitor separately, if you don't already have these.
If you are not interested in building a PC from scratch, consider taking an old Windows-based PC and installing Linux to bring it new life. Many of the older PCs don't have enough processor or memory to run Windows Vista or the latest Windows 7, but they will all run Linux.
(If you think your old system has resale value, try checking out the ["trade-in estimator"] at the BestBuy website to straighten out your misperception. However, if you do decide to sell your system, consider replacing the disk drive with a fresh empty one, or wipe the old drive clean with one of the many free Linux utilities. Jason Striegel on Engadget has a nice [HOWTO Erase your old hard disk drive] article. If you don't have your original manufacturer's Windows installation discs, installing Linux instead may help keep you out of legal hot water.)
Depending on what your school projects require, you want to make sure that you can use a printer or scanner with your Linux system. Don't buy a printer unless it is supported by Linux. The Linux Foundation maintains a [Printer Compatibility database]. Printing was one of the first things I got working for my Linux-based OLPC laptop, which I documented in my December 2007 post [Printing on XO Laptop with CUPS and LPR], and which got a surprising following over at [OLPC News].
To reduce paper, many schools are having students email their assignments, or use Cloud Computing services like Google Docs. Both the University of Arizona and Arizona State University use Google Docs, and the students I have talked with love the idea. Whether they use a Mac, Linux or Windows PC, all students can access Google Docs through their browser. An alternative to Google Docs is Windows Live Skydrive, which has the option to upload and edit the latest Office format documents from the Firefox browser on Linux. Both offer you the option to upload GBs of files, which could be helpful transferring data from an old PC to a new one.
Lastly, there are many free video games for Linux, for when you need to take a break from all that studying. Ever since IBM's [36-page Global Innovation Outlook 2.0] study showed that playing video games made you a better business leader, I have been telling all the students I tutor or mentor that playing games is a more valuable use of their time than watching television. IBM considers video games the [future of learning]. Even [Violent Video Games are Good for Kids]. It is no wonder that IBM provides the technology that runs all the major game platforms, including Microsoft Xbox360, Nintendo Wii and Sony PlayStation.
(FTC disclosure: I work for IBM. IBM has working relationships with Apple, Google, Microsoft, Nintendo and Sony. I use both Google Docs and Microsoft Live Skydrive for personal use, and base my recommendations purely on my own experience. I own stock in IBM, Google and Apple. I have friends and family that work at Microsoft. I own an Apple Mac Mini and Sony PlayStation. I was a Linux developer earlier in my IBM career. IBM considers Linux a strategic operating system for both personal and professional use. IBM has selected Firefox as its standard browser internally for all employees. I run Linux both at home and at the office. I graduated from the University of Arizona, and have friends who either work or take classes there, as well as at Arizona State University.)
Linux skills are marketable and growing more in demand. Linux is used in everything from cellphones to mainframes, as well as many IBM storage devices such as the IBM SAN Volume Controller, XIV and ProtecTIER data deduplication solution. In addition to writing term papers, spreadsheets and presentations with OpenOffice, your Linux PC can help you learn programming skills, web design, and database administration.
To all the students in my life, I wish you all good things in the upcoming school year!
Last week, I presented "An Introduction to Cloud Computing" for two hours to the local Institute of Management Accountants [IMA] for their Continuing Professional Education [CPE]. Since I present IBM's leadership in Cloud Storage offerings, I have had to become an expert in Cloud Computing overall. The audience was a mix of bookkeepers, accountants, auditors, comptrollers, CPAs, and accounting teachers.
Here is a sample of the questions I took during and after my presentation:
If I need to shut down a host machine, do I lose all my virtual machines as well?
No, it is possible to seamlessly move virtual machines from one host to another. If you need to shut down a host machine, move all the VMs to other hosts, then you can shut down the empty host without impacting business.
Does the SaaS provider have to build their own app, can they not buy an app and then rent it out?
Yes, they can buy one, but they won't have competitive differentiation, and the software developer they buy from will want a big cut of the action. SaaS developers that build their own applications can keep more of the profits for themselves.
How do backups work in cloud computing? Do I have to contact someone at the cloud computing company to find the backup tape?
Large datacenters often keep the most recent backups on disk, and older versions on tape in automated tape libraries that can fetch your backup in less than 2 minutes. Because of this, there is no need to talk to anyone; you can schedule or invoke your own backups, and often perform the recovery yourself using self-service tools.
Last month, my sister tried to rent a car during the week of the Tucson Gem Show, but they were out of cars she wanted to drive. Could this happen with Cloud Computing?
Not likely. With rental cars, the cars have to be physically in Tucson to rent them. Rental companies could have brought cars down from Phoenix to satisfy demand. With Cloud Computing, it is all accessible over the global network, so you are not limited to the cloud providers nearest you.
Is there a reason why Amazon Web Services (AWS) charges more for a Windows image than a Linux image?
Yes, Amazon and Microsoft have a patent cross-licensing agreement where Amazon pays Microsoft for the privilege of offering Windows-based images on their EC2 cloud infrastructure. It just makes business sense to pass those costs on to the consumer. Linux is a free open source operating system, and is often the better choice.
So if we rent a machine from Amazon, they send it to my accounting office? What exactly am I getting for 12 cents per hour?
No. The computer remains in their datacenter. You get a virtual machine that runs a 1.2GHz Intel processor, with 1700MB of RAM and 160GB of hard disk space, with the Windows operating system running on it. It is comparable to a machine you can get at the local BestBuy, but instead of running in the next room, it is running in a datacenter somewhere else in the United States with electricity and air conditioning.
You access it remotely from your desktop or laptop PC.
Why would I ever rent more than one computer?
It depends on your workload. For example, Derek Gottfrid at the New York Times needed to convert 11 million articles from TIFF format to PDF format so that he could put them up on the web. This would have taken him months using a single computer, so he rented 100 computers and got the entire stack converted in 24 hours, for a cost of about $240. See the articles [Self-Service, Prorated, Super Computing] and [TimesMachine] for details.
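To make the arithmetic behind that example concrete, here is a minimal sketch using only the figures quoted above (100 machines, about 24 hours, about $240). The per-hour rate it prints is simply implied by those numbers, not a published price, and the single-machine comparison assumes the work divides evenly across machines.

```python
# Figures quoted in the post: 100 rented machines, about 24 hours, about $240 total.
machines = 100
hours = 24
total_cost_usd = 240.0

machine_hours = machines * hours                 # total machine-hours consumed
implied_rate = total_cost_usd / machine_hours    # dollars per machine-hour

# Assumption: the work divides evenly, so one machine needs every machine-hour itself.
single_machine_days = machine_hours / 24

print(f"{machine_hours} machine-hours at ${implied_rate:.2f}/hour")
print(f"One machine would need about {single_machine_days:.0f} days")
```

Under these assumptions the bill is the same whether you rent one machine for a long time or many machines for a day. What renting 100 of them buys you is the shorter elapsed time.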
What about throughput? Won't I need to run cables from my accounting office to this cloud computing data center?
You will need connectivity, most likely from connections provided by your local telephone or cable company, or through the Internet. Certainly, there can be cases where direct privately-owned fiber optic cables, known as "dark fiber", can directly connect consumers to local Cloud service providers, for added security.
What about medical records? Will Cloud Computing help the Healthcare industry?
Yes, hospitals are finding that digitizing their records greatly reduces costs. IBM offers the Grid Medical Archive Solution [GMAS] as a private cloud storage solution to store X-ray images and other electronic medical records on disk and tape, and these records can be accessed from multiple hospitals and clinics, wherever the doctor or patient happens to be.
The advantage of personal computers was individualization, I could put on my own choices of software, and customize my own settings, won't we lose this with Cloud Computing?
Yes, customized software and settings cost companies millions of dollars with help desk calls. Cloud Computing attempts to provide some standardization, reducing the amount of effort to support IT operations.
Won't putting all the computers into a big datacenter make them more vulnerable to hackers?
Security is a well-known concern, but this is being addressed with encryption, access control lists, multi-tenancy isolation, and VPN connections.
My daughter has a BlackBerry or iPod or something, and when we mentioned that someone in Phoenix wore a monkey suit to avoid photo-radar speed cameras, she was able to pull up a picture on her little hand-held thing, is this the future?
Yes, mobile phones and other hand-held devices now have internet access to take advantage of Cloud Computing services. People will be able to access the information they need from wherever they happen to be. (You can see the picture here: [Man Dons Mask for Speed-Camera Photos])
IBM offers a variety of Cloud Computing services, as well as customized solutions and integrated systems that can be deployed on-premises behind your corporate firewall. To learn more, go to [ibm.com/cloud].
The second speaker was local celebrity Dan Ryan presenting the financials for the upcoming [Rosemont Copper] mining operations. Copper is needed for emerging markets, such as hybrid vehicles and wind turbines. Copper is a major industry in Arizona.
Continuing my coverage of the Data Center Conference 2009, held Dec 1-4 in Las Vegas, the title of this session refers to the mess of "management standards" for Cloud Computing.
The analyst quickly reviewed the concepts of IaaS (Amazon EC2, for example), PaaS (Microsoft Azure, for example), and SaaS (IBM LotusLive, for example). The problem is that each provider has developed their own set of APIs.
(One exception was [Eucalyptus], which adopts the Amazon EC2, S3 and EBS style of interfaces. Eucalyptus is an open-source infrastructure that stands for "Elastic Utility Computing Architecture Linking Your Programs To Useful Systems". You can build your own private cloud using the new Cloud APIs included in Ubuntu Linux 9.10 Karmic Koala, termed Ubuntu Enterprise Cloud (UEC). See these instructions in the InformationWeek article [Roll Your Own Ubuntu Private Cloud].)
The analyst went into specific Virtual Infrastructure (VI) and public cloud providers.
Private Clouds can be managed by VMware tools. For remote management of public IaaS clouds, there is [vCloud Express], and for SaaS, a new service called [VMware Go].
Citrix is the Open Service Champion. For private clouds based on Xen Server, they have launched the [Xen Cloud Project] to help manage them. For public clouds, they have [Citrix Cloud Center, C3], including an Amazon-based "Citrix C3 Labs" for developing and testing applications. For SaaS, they have [GoToMyPC] and [GoToAssist].
Amazon offers a set of Cloud computing capabilities called Amazon Web Services [AWS]. For virtual private clouds, use the AWS Management Console. For IaaS (Amazon EC2), use [CloudWatch] which includes Elastic Load Balancing.
If you prefer a common management system independent of cloud provider, or perhaps across multiple cloud providers, you may want to consider one of the "Big 4" instead. These are the top four system management software vendors: IBM, HP, BMC Software, and Computer Associates (CA).
A survey of the audience found the number one challenge was "integration": how to integrate new cloud services into an existing traditional data center. In whom do you have confidence to deliver tools for remote management of external cloud services? The survey shows:
28 percent: VI Providers (VMware, Citrix, Microsoft)
19 percent: Big 4 System Management software vendors (IBM, HP, BMC, CA)
13 percent: Public cloud providers (Amazon, Google)
40 percent: Other/Don't Know
For internal private on-premises Clouds, the results were different:
40 percent: VI Providers (VMware, Citrix, Microsoft)
21 percent: Big 4 System Management software vendors (IBM, HP, BMC, CA)
13 percent: Emerging players (Eucalyptus)
26 percent: Other/Don't Know
The analyst offered some final thoughts. First, nearly a third of all IT vendors disappear after two years, and the cloud will probably have a similar, if not worse, track record. Traditional server, storage and network administrators should not consider Cloud technologies as a death knell for in-house on-premises IT. Companies should probably explore a mix of private and public cloud options.
For the longest time, people thought that humans could not run a mile in less than four minutes. Then, in 1954, [Sir Roger Bannister] beat that perception, and shortly thereafter, once he showed it was possible, many other runners were able to achieve this also. The same is being said now about the IBM Watson computer which appeared this week against two human contestants on Jeopardy!
(2014 Update: A lot has happened since I originally wrote this blog post! I intended this as a fun project for college students to work on during their summer break. However, IBM is concerned that some businesses might be led to believe they could simply stand up their own systems based entirely on open source and internally developed code for business use. IBM recommends instead the [IBM InfoSphere BigInsights] which packages much of the software described below. IBM has also launched a new "Watson Group" that has [Watson-as-a-Service] capabilities in the Cloud. To raise awareness to these developments, IBM has asked me to rename this post from IBM Watson - How to build your own "Watson Jr." in your basement to the new title IBM Watson -- How to replicate Watson hardware and systems design for your own use in your basement. I also took this opportunity to improve the formatting layout.)
Often, when a company demonstrates new technology, these are prototypes not yet ready for commercial deployment until several years later. IBM Watson, however, was made mostly from commercially available hardware, software and information resources. As several have noted, the 1TB of data used to search for answers could fit on a single USB drive that you buy at your local computer store.
Take a look at the [IBM Research Team] to determine how the project was organized. Let's decide what we need, and what we don't in our version for personal use:
Do we need it for personal use?
Yes, that's you. Assuming this is a one-person project, you will act as Team Lead.
Yes, I hope you know computer programming!
No, since this version for personal use won't be appearing on Jeopardy, we won't need strategy on wager amounts for the Daily Double, or what clues to pick next. Let's focus merely on a computer that can accept a question in text, and provide an answer back, in text.
Yes, this team focused on how to wire all the hardware together. We need to do that, although this version for personal use will have fewer components.
Optional. For now, let's have this version for personal use just return its answer in plain text. Consider this Extra Credit after you get the rest of the system working. Consider using [eSpeak], [FreeTTS], or the Modular Architecture for Research on speech sYnthesis [MARY] Text-to-Speech synthesizers.
Yes, I will explain what this is, and why you need it.
Yes, we will need to get information for this version for personal use to process.
Yes, this team developed a system for parsing the question being asked, and to attach meaning to the different words involved.
No, this team focused on making IBM Watson optimized to answer in 3 seconds or less. We can accept a slower response, so we can skip this.
(Disclaimer: As with any Do-It-Yourself (DIY) project, I am not responsible if you are not happy with your version for personal use. I am basing the approach on what I read from publicly available sources, and my work in Linux, supercomputers, XIV, and SONAS. For our purposes, this version for personal use is based entirely on commodity hardware, open source software, and publicly available sources of information. Your implementation will certainly not be as fast or as clever as the IBM Watson you saw on television.)
Step 1: Buy the Hardware
Supercomputers are built as a cluster of identical compute servers lashed together by a network. You will be installing Linux on them, so if you can avoid paying extra for Microsoft Windows, that would save you some money. Here is your shopping list:
Three x86 hosts, with the following:
64-bit quad-core processor, either Intel-VT or AMD-V capable,
8GB of DRAM, or larger
300GB of hard disk, or larger
CD or DVD Read/Write drive
Computer Monitor, mouse and keyboard
Ethernet 1GbE 4-port hub, and appropriate RJ45 cables
Surge protector and Power strip
Local Console Monitor (LCM) 4-port switch (formerly known as a KVM switch) and appropriate cables. This is optional, but will make it easier during the development. Once your implementation is operational, you will only need the monitor and keyboard attached to one machine. The other two machines can remain "headless" servers.
Step 2: Establish Networking
IBM Watson used Juniper switches running at 10Gbps Ethernet (10GbE) speeds, but was not connected to the Internet while playing Jeopardy! Instead, these Ethernet links were for the POWER7 servers to talk to each other, and to access files over the Network File System (NFS) protocol to the internal customized SONAS storage I/O nodes.
Your implementation will be able to run "disconnected from the Internet" as well. However, you will need Internet access to download the code and information sources. For our purposes, 1GbE should be sufficient. Connect your Ethernet hub to your DSL or Cable modem. Connect all three hosts to the Ethernet hub. Connect your keyboard, video monitor and mouse to the LCM, and connect the LCM to the three hosts.
Step 3: Install Linux and Middleware
To say I use Linux on a daily basis is an understatement. Linux runs on my Android-based cell phone, my laptop at work, my personal computers at home, most of our IBM storage devices from SAN Volume Controller to XIV to SONAS, and even on my Tivo at home which recorded my televised episodes of Jeopardy!
For this project, you can use any modern Linux distribution that supports KVM. IBM Watson used Novell SUSE Linux Enterprise Server [SLES 11]. Alternatively, I can also recommend either Red Hat Enterprise Linux [RHEL 6] or Canonical [Ubuntu v10]. Download the 64-bit "ISO" files for each version, and burn them to CDs. Each distribution of Linux comes in different orientations:
Graphical User Interface (GUI) oriented, often referred to as "Desktop" or "HPC-Head"
Command Line Interface (CLI) oriented, often referred to as "Server" or "HPC-Compute"
Guest OS oriented, to run in a Hypervisor such as KVM, Xen, or VMware. Novell calls theirs "Just Enough Operating System" [JeOS].
For this version for personal use, I have chosen a [multitier architecture], sometimes referred to as an "n-tier" or "client/server" architecture.
Host 1 - Presentation Server
For the Human-Computer Interface [HCI], the IBM Watson received categories and clues as text files via TCP/IP, had a [beautiful avatar] representing a planet with 42 circles streaking across in orbit, and text-to-speech synthesizer to respond in a computerized voice. Your implementation will not be this sophisticated. Instead, we will have a simple text-based Query Panel web interface accessible from a browser like Mozilla Firefox.
Host 1 will be your Presentation Server, the connection to your keyboard, video monitor and mouse. Install the "Desktop" or "HPC Head Node" version of Linux. Install [Apache Web Server and Tomcat] to run the Query Panel. Host 1 will also be your "programming" host. Install the [Java SDK] and the [Eclipse IDE for Java Developers]. If you always wanted to learn Java, now is your chance. There are plenty of books on Java if that is not the language you normally write code in.
While three little systems don't constitute an "Extreme Cloud" environment, you might like to try out the "Extreme Cloud Administration Tool", called [xCat], which was used to manage the many servers in IBM Watson.
Host 2 - Business Logic Server
Host 2 will be driving most of the "thinking". Install the "Server" or "HPC Compute Node" version of Linux. This will be running a server virtualization Hypervisor. I recommend KVM, but you can probably run Xen or VMware instead if you like.
Host 3 - File and Database Server
Host 3 will hold your information sources, indices, and databases. Install the "Server" or "HPC Compute Node" version of Linux. This will be your NFS server, which might come up as a question during the installation process.
Technically, you could run different Linux distributions on different machines. For example, you could run "Ubuntu Desktop" for host 1, "RHEL 6 Server" for host 2, and "SLES 11" for host 3. In general, Red Hat tries to be the best "Server" platform, and Novell tries to make SLES be the best "Guest OS".
My advice is to pick a single distribution and use it for everything, Desktop, Server, and Guest OS. If you are new to Linux, choose Ubuntu. There are plenty of books on Linux in general, and Ubuntu in particular, and Ubuntu has a helpful community of volunteers to answer your questions.
Step 4: Download Information Sources
You will need some documents for your implementation to process.
IBM Watson used a modified SONAS to provide a highly-available clustered NFS server. For this version, we won't need that level of sophistication. Configure Host 3 as the NFS server, and Hosts 1 and 2 as NFS clients. See the [Linux-NFS-HOWTO] for details. Host 3 will hold the "official master copy", but to optimize performance we will use a Linux utility called rsync to copy the information sources over to Hosts 1 and 2. This allows the task engines on those hosts to access local disk resources during question-answer processing.
We will also need a relational database. You won't need a high-powered IBM DB2. Your implementation can do fine with something like [Apache Derby] which is the open source version of IBM CloudScape from its Informix acquisition. Set up Host 3 as the Derby Network Server, and Hosts 1 and 2 as Derby Network Clients. For more about structured content in relational databases, see my post [IBM Watson - Business Intelligence, Data Retrieval and Text Mining].
Linux includes a utility called wget which allows you to download content from the Internet to your system. What documents you decide to download is up to you, based on what types of questions you want answered. For example, if you like Literature, check out the vast resources at [FullBooks.com]. You can automate the download by writing a shell script or program that invokes wget for all the places you want to fetch data from. Rename the downloaded files to something unique, as often they are just "index.html". For more on the wget utility, see [IBM Developerworks].
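As one way to handle the "rename to something unique" step, here is a minimal sketch in Python rather than a shell script around wget. It uses the standard library's urllib in place of wget, and the names `unique_name`, `fetch_all`, and the `sources` directory are my own illustrative choices, not part of any tool mentioned above.

```python
import hashlib
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def unique_name(url: str) -> str:
    """Build a file name from the host, a short hash of the URL, and the last path segment."""
    parts = urlparse(url)
    leaf = Path(parts.path).name or "index.html"
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{parts.netloc}_{digest}_{leaf}"


def fetch_all(urls, dest_dir="sources"):
    """Download each URL into dest_dir under a unique name (needs network access)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for url in urls:
        target = dest / unique_name(url)
        with urllib.request.urlopen(url) as response:
            target.write_bytes(response.read())
        print(f"saved {url} -> {target}")
```

Because the hash is computed from the full URL, two pages that are both called "index.html" on different sites, or in different directories of the same site, end up in different files.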
Step 5: The Query Panel - Parsing the Question
Next, we need to parse the question and have some sense of what is being asked for. For this we will use [OpenNLP] for Natural Language Processing, and [OpenCyc] for the conceptual logic reasoning. See Doug Lenat presenting this 75-minute video [Computers versus Common Sense]. To learn more, see the [CYC 101 Tutorial].
Unlike Jeopardy! where Alex Trebek provides the answer and contestants must respond with the correct question, we will do normal Question-and-Answer processing. To keep things simple, we will limit questions to the following formats:
Who is ...?
Where is ...?
When did ... happen?
What is ...?
Host 1 will have a simple Query Panel web interface. At the top, a place to enter your question, and a "submit" button, and a place at the bottom for the answer to be shown. When "submit" is pressed, this will pass the question to "main.jsp", the Java servlet program that will start the Question-answering analysis. Limiting the types of questions that can be posed will simplify hypothesis generation, reduce the candidate set and evidence evaluation, allowing the analytics processing to continue in reasonable time.
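To illustrate how limiting the question formats simplifies the first parsing step, here is a minimal sketch that recognizes only the four formats listed above. It uses plain regular expressions rather than OpenNLP or OpenCyc, and the answer-type labels (PERSON, PLACE, DATE, THING) are hypothetical names I chose for illustration.

```python
import re

# One pattern per supported question format; the captured group is the topic.
PATTERNS = [
    ("PERSON", re.compile(r"^who\s+is\s+(.+?)\?*$", re.IGNORECASE)),
    ("PLACE", re.compile(r"^where\s+is\s+(.+?)\?*$", re.IGNORECASE)),
    ("DATE", re.compile(r"^when\s+did\s+(.+?)\s+happen\?*$", re.IGNORECASE)),
    ("THING", re.compile(r"^what\s+is\s+(.+?)\?*$", re.IGNORECASE)),
]


def classify(question):
    """Return (answer_type, topic) for a supported question, or None otherwise."""
    text = question.strip()
    for answer_type, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return answer_type, match.group(1).strip()
    return None
```

For example, `classify("Who is Roger Bannister?")` yields the pair `("PERSON", "Roger Bannister")`, so later stages know they are looking for a person. Anything outside the four formats returns `None` and can be rejected at the Query Panel.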
Step 6: Unstructured Information Management Architecture
The "heart and soul" of IBM Watson is Unstructured Information Management Architecture [UIMA]. IBM developed this, then made it available to the world as open source. It is maintained by the [Apache Software Foundation], and overseen by the Organization for the Advancement of Structured Information Standards [OASIS].
Basically, UIMA lets you scan unstructured documents, glean the important points, and put that into a database for later retrieval. In the graph above, DBs means 'databases' and KBs means 'knowledge bases'. See the 4-minute YouTube video of [IBM Content Analytics], the commercial version of UIMA.
Starting from the left, the Collection Reader selects each document to process, and creates an empty Common Analysis Structure (CAS) which serves as a standardized container for information. This CAS is passed to Analysis Engines, composed of one or more Annotators which analyze the text and fill the CAS with the information found. The CAS are passed to CAS Consumers which do something with the information found, such as enter an entry into a database, update an index, or update a vote count.
(Note: This point requires, what we in the industry call a small matter of programming, or [SMOP]. If you've always wanted to learn Java programming, XML, and JDBC, you will get to do plenty here. )
If you are not familiar with UIMA, consider this [UIMA Tutorial].
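To get a feel for the flow before diving into the real framework, here is a toy sketch of the Collection Reader, CAS, Annotator, and CAS Consumer roles described above. This is not the UIMA API (UIMA itself is a Java framework). Every class and function name here is a simplified stand-in I made up, and the two annotators are deliberately crude.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CAS:
    """Toy stand-in for a Common Analysis Structure: the text plus annotations."""
    text: str
    annotations: list = field(default_factory=list)


def collection_reader(documents):
    """Collection Reader: yield one empty CAS per document."""
    for text in documents:
        yield CAS(text=text)


def year_annotator(cas):
    """Annotator: record every four-digit year found in the text."""
    for match in re.finditer(r"\b(1[0-9]{3}|20[0-9]{2})\b", cas.text):
        cas.annotations.append(("YEAR", match.group(0)))


def name_annotator(cas):
    """Annotator: record runs of two or more capitalized words as candidate names."""
    for match in re.finditer(r"\b(?:[A-Z][a-z]+\s)+[A-Z][a-z]+\b", cas.text):
        cas.annotations.append(("NAME", match.group(0)))


def vote_consumer(cas_stream):
    """CAS Consumer: tally how often each annotation was seen."""
    votes = Counter()
    for cas in cas_stream:
        votes.update(cas.annotations)
    return votes


def run_pipeline(documents, annotators):
    """Push every document through the annotators, then hand the results to the consumer."""
    def analyzed():
        for cas in collection_reader(documents):
            for annotate in annotators:  # together these play the "Analysis Engine"
                annotate(cas)
            yield cas
    return vote_consumer(analyzed())
```

Calling `run_pipeline(docs, [year_annotator, name_annotator])` on a handful of sentences returns a vote count per annotation. In the real system a CAS Consumer would write these to the database on Host 3 instead of returning them.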
Step 7: Parallel Processing
People have asked me why IBM Watson is so big. Did we really need 2,880 cores of processing power? As a supercomputer, the 80 TeraFLOPs of IBM Watson would place it only in 94th place on the [Top 500 Supercomputers]. While IBM Watson may be the [Smartest Machine on Earth], the most powerful supercomputer at this time is the Tianhe-1A with more than 186,000 cores, capable of 2,566 TeraFLOPs.
To determine how big IBM Watson needed to be, the IBM Research team ran the DeepQA algorithm on a single core. It took 2 hours to answer a single Jeopardy question! Let's look at the performance data:
Configuration: Time to answer one Jeopardy question
Single IBM Power750 server: < 4 minutes
Single rack (10 servers): < 30 seconds
IBM Watson (90 servers): < 3 seconds
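As a quick sanity check on those numbers, here is a small worked calculation using only figures quoted in this post: 2 hours on a single core, under 3 seconds on the full system, and 2,880 cores. Because the full-system time is an upper bound, the results are lower bounds.

```python
single_core_seconds = 2 * 60 * 60   # "2 hours" on one core
watson_seconds = 3                  # "< 3 seconds" on the full system
watson_cores = 2880                 # core count quoted earlier in the post

speedup = single_core_seconds / watson_seconds   # how many times faster
efficiency = speedup / watson_cores              # fraction of ideal linear scaling

print(f"speedup of at least {speedup:.0f}x, parallel efficiency of at least {efficiency:.0%}")
```

In other words, going from one core to 2,880 cores bought a speedup of at least 2,400 times, which is a large fraction of the ideal.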
The old adage applies, [many hands make for light work]. The idea is to divide-and-conquer. For example, if you wanted to find a particular street address in the Manhattan phone book, you could dispatch fifty pages to each friend and they could all scan those pages at the same time. This is known as "Parallel Processing" and is how supercomputers are able to work so well. However, not all algorithms lend themselves well to parallel processing, and the phrase [nine women can't have a baby in one month] is often used to remind us of this.
Fortunately, UIMA is designed for parallel processing. You need to install UIMA-AS for Asynchronous Scale-out processing, an add-on to the base UIMA Java framework, supporting a very flexible scale-out capability based on JMS (Java Message Service) and ActiveMQ. We will also need Apache Hadoop, an open source implementation used by the Yahoo search engine. Hadoop has a "MapReduce" engine that allows you to divide the work, dispatch pieces to different "task engines", and then combine the results afterwards.
Host 2 will run Hadoop and drive the MapReduce process. Plan to have three KVM guests on Host 1, four on Host 2, and three on Host 3. That means you have 10 task engines to work with. These task engines can be deployed for Collection Readers, Analysis Engines, and CAS Consumers. When all processing is done, the resulting votes will be tabulated and the top answer displayed on the Query Panel on Host 1.
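The divide, dispatch, and combine pattern can be tried out on a single machine before you set up Hadoop. The following is a minimal sketch, not Hadoop code: threads from Python's standard library stand in for the 10 task engines, and the "evidence" is just lines of text that mention a keyword. The function names are my own.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_task(chunk, keyword):
    """Map step: one task engine scans its chunk and votes for each matching line."""
    return Counter(line for line in chunk if keyword.lower() in line.lower())


def reduce_votes(partials):
    """Reduce step: merge the per-engine tallies into one."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def top_answer(lines, keyword, engines=10):
    """Split the lines across the engines, run the map step in parallel, then reduce."""
    # Deal the lines out round-robin, one chunk per task engine.
    chunks = [lines[i::engines] for i in range(engines)]
    with ThreadPoolExecutor(max_workers=engines) as pool:
        partials = pool.map(map_task, chunks, [keyword] * engines)
        votes = reduce_votes(partials)
    best = votes.most_common(1)
    return best[0][0] if best else None
```

The map step has no shared state, which is what makes it safe to hand each chunk to a different engine. Only the reduce step needs to see all the partial tallies at once.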
Step 8: Testing
To simplify testing, use a batch processing approach. Rather than entering questions by hand in the Query Panel, generate a long list of questions in a file, and submit it for processing. This will allow you to fine-tune the environment, optimize for performance, and validate the answers returned.
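A batch test harness along these lines might look like the following Python sketch. The tab-separated file format (question, then expected answer), the function names and the canned `fake_answer` stand-in are all assumptions for illustration; you would swap in a call to your own pipeline.

```python
import csv
import io
import time

def run_batch(question_file, answer_fn):
    """Run each 'question<TAB>expected answer' line through answer_fn,
    recording the answer, whether it matched, and the elapsed time."""
    results = []
    for row in csv.reader(question_file, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        question, expected = row
        start = time.perf_counter()
        answer = answer_fn(question)
        elapsed = time.perf_counter() - start
        results.append({
            "question": question,
            "answer": answer,
            "correct": answer.strip().lower() == expected.strip().lower(),
            "seconds": elapsed,
        })
    return results

def summarize(results):
    """Summarize accuracy and worst-case response time for a batch."""
    total = len(results)
    correct = sum(r["correct"] for r in results)
    return {
        "total": total,
        "correct": correct,
        "accuracy": correct / total if total else 0.0,
        "slowest_seconds": max((r["seconds"] for r in results), default=0.0),
    }

def fake_answer(question):
    """Stand-in for the real question-answering pipeline (hypothetical)."""
    canned = {"Capital of France?": "Paris", "2 + 2?": "5"}
    return canned.get(question, "")

# An in-memory file keeps the example self-contained; use open(...) for real runs.
sample = io.StringIO("Capital of France?\tParis\n2 + 2?\t4\n")
summary = summarize(run_batch(sample, fake_answer))
print(summary["correct"], "of", summary["total"], "correct")
```

Tracking both accuracy and the slowest response makes it easier to see whether a tuning change helped performance without hurting the answers.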
There you have it. By the time you get your implementation fully operational, you will have learned a lot of useful skills, including Linux administration, Ethernet networking, NFS file system configuration, Java programming, UIMA text mining analysis, and MapReduce parallel processing. Hopefully, you will also gain an appreciation for how difficult it was for the IBM Research team to accomplish what they had for the Grand Challenge on Jeopardy! Not surprisingly, IBM Watson is making IBM [as sexy to work for as Apple, Google or Facebook], all of which started their business in a garage or a basement with a system as small as this version for personal use.
They say "Great Minds think alike" and that imitation is "the sincerest form of flattery." Both of these quotes came to mind when I read fellow blogger Chuck Hollis' (EMC) excellent April 7th blog post [The 10 Big Ideas That Are Shaping IT Infrastructure Today]. Not surprisingly, some of his thoughts are similar to those I had presented two weeks ago in my March 22nd post [Cloud Computing for Accountants]. Here are two charts that caught my eye:
On page 13 of my deck, I had an old black and white photo of telephone operators, as part of a section on the history of selecting "cloud" as the iconic graphic to represent all networks. Chuck has this same graphic on his chart titled "#1 The Industrialization of IT Infrastructure".
Looks like Chuck and I use the same "stock photo" search facility!
On page 45 of my deck, I had a list of major "arms dealers" that deliver the hardware and software components needed to build Cloud Computing. Chuck has a similar chart, titled "#2 The Consolidation of the IT Industry", but with some interesting differences.
Let's look at some of the key differences:
The left-to-right order is slightly different. I chose a 1-2-4-2-1 symmetrical pattern purely for aesthetic reasons. My presentation was to a bunch of accountants, and so I was trying not to make it sound like an "Infomercial" for IBM products and offerings. My sequence is roughly chronological, in that Oracle announced its intention to acquire Sun, then Cisco, VMware and EMC announced their VCE coalition, followed closely by Cisco, VMware and NetApp announcing that they also work well together, followed by [HP extended alliance with Microsoft] on Jan 13, 2010. As the IT marketplace is maturing, more and more customers are looking for an IBM-like one-stop shopping experience, and certainly various "mini-mall" alliances have formed to try to compete in this space.
I had HP and Microsoft in the same column, referring only to the above-mentioned January announcement. HP is all about private cloud hardware infrastructures, but Microsoft is all about "three screens and the public cloud", so I am not sure how well this alliance will work out from a Cloud Computing perspective. This was not to imply that the other stacks don't work well with Microsoft software. They all do. Perhaps to avoid that controversy, Chuck chose to highlight HP's acquisition of EDS services instead.
I used the vendor logos in their actual colors. Notice that the colors black, blue and red occur most often. These happen to be the three most popular ballpoint pen ink colors found on the very same paper documents these computer companies are trying to eliminate. Paper-less office, anyone? Chuck chose instead to colorize each stack with his own color scheme. While blue for IBM and orange for Sun Microsystems make some sense, it is not clear if he chose green for Cisco/VMware/EMC for any particular reason. Perhaps he was trying to subtly imply that the VCE stack is more energy efficient? Or maybe the green refers to money to indicate that the VCE stack is the most expensive? Either way, I would pit IBM's server/storage/software stack up against anything of comparable price from these other stacks in any energy efficiency bake-off.
What about the Cisco/VMware/NetApp combination? All three got together to assure customers this was a viable combination. IBM is the number one reseller of VMware, and VMware runs great with IBM's N series NAS storage, so I do not dispute Cisco's motivation here. It makes sense for Cisco to two-time EMC in this manner. Why should Cisco limit itself to a single storage supplier? Et tu, VMware? Having VMware choose NetApp over its parent company EMC was a bit of a shock. No surprise that Chuck left NetApp out of his chart.
No love for Dell? I give Dell credit for their work with Virtual Desktop Infrastructure (VDI), and for embracing Ubuntu Linux for their servers. Dell's acquisitions of EqualLogic iSCSI-based disk systems and Perot Systems for services are also worth noting. Dell used to resell some of EMC's gear, but perhaps that relationship continues to fade away, as I [predicted back in 2007]. Chuck's decision to leave Dell off his chart speaks volumes about where this relationship stands, and where it is going.
Perhaps we are all in just one big ["echo chamber"], as we are all coming up with similar observations, talking to similar customers, and reviewing similar market analyst reports. I am glad, at least this time, that Chuck and I for the most part agree where the marketplace is going. We live in interesting times!