Tony Pearson is a Master Inventor and Senior IT Architect for the IBM Storage product line at the
IBM Executive Briefing Center in Tucson, Arizona, and featured contributor
to IBM's developerWorks. In 2016, Tony celebrates his 30th anniversary with IBM Storage. He is
author of the Inside System Storage series of books. This blog is for the open exchange of ideas relating to storage and storage networking hardware, software and services.
(Short URL for this blog: ibm.co/Pearson )
My books are available on Lulu.com! Order your copies today!
Safe Harbor Statement: The information on IBM products is intended to outline IBM's general product direction and it should not be relied on in making a purchasing decision. The information on the new products is for informational purposes only and may not be incorporated into any contract. The information on IBM products is not a commitment, promise, or legal obligation to deliver any material, code, or functionality. The development, release, and timing of any features or functionality described for IBM products remains at IBM's sole discretion.
Tony Pearson is an active participant in local, regional, and industry-specific interests, and does not receive any special payments to mention them on this blog.
Tony Pearson receives part of the revenue proceeds from sales of books he has authored listed in the side panel.
Tony Pearson is not a medical doctor, and this blog does not reference any IBM product or service that is intended for use in the diagnosis, treatment, cure, prevention or monitoring of a disease or medical condition, unless otherwise specified on individual posts.
Since the [IBM System Storage Technical University 2011] runs concurrently with the System x Technical University, attendees are allowed to mix-and-match. I attended several presentations regarding server virtualization and hypervisors.
Matt Archibald is an IT Management Consultant in IBM's Systems Agenda Delivery team. He started with a history of hypervisors, from IBM's early CP/CMS in 1967, through the latest VMware vSphere 5 just announced.
He explained that there are three types of Hypervisor architectures today:
Type 1 - often referred to as "Bare Metal" runs directly on the server host hardware, and allows different operating system virtual machines to run as guests. IBM's System z [PR/SM] and [PowerVM] as well as the popular VMware ESXi are examples of this type.
Type 2 - often referred to as "Hosted" runs above an existing operating system, and allows different operating system virtual machines to run as guests. The popular [Oracle/Sun VirtualBox] is an example of this type.
OS Containers - runs above an existing operating system base, and allows multiple "guests" that all run the same operating system as the base. This affords some isolation between applications. [Parallels Virtuozzo Containers] is an example of this type.
The dominant architecture is Type 1. For x86, IBM is the number one reseller of VMware. VMware recently announced [vSphere 5], which changes its licensing model from CPU-based to memory-based. For example, a virtual machine with 32 virtual CPUs and 1TB of virtual RAM (vRAM) would cost over $73,000 per year to license the VMware "Enterprise Plus" software. The only plus-side to this new licensing is that the "memory" entitlement transfers during Disaster Recovery to the remote location.
"Xen is dead." was the way Matt introduced the section discussing Hybrid Type-1 hypervisors like Xen and Hyper-V. These run bare-metal, but require networking and storage I/O to be processed by a single bottleneck partition referred to as "Dom 0". As such, this hybrid approach does not scale well on larger multi-socket host servers. So, his Xen-is-dead message was referring to all Hybrid-based Hypervisors including Hyper-V, not just those based on Xen itself.
The new up-and-comer is "Linux KVM". Last year, in my blog post about [System x KVM solutions], I mentioned the confusion over the KVM acronym used with two different meanings. Many people use KVM to refer to Keyboard-Video-Mouse switches that allow access to multiple machines. IBM has renamed these switches to Local Console Managers (LCM) and Global Console Managers (GCM). This year, the System x team have adopted the use of "Linux KVM" to refer to the second meaning, the [Kernel-based Virtual Machine] hypervisor.
Linux KVM is not a product, but an open-source project. As such, it is built into every Linux kernel. Red Hat has created two specific deliverables under the name Red Hat Enterprise Virtualization (RHEV):
RHEV-H, a tiny ESXi-like bare-metal hypervisor that fits in 78MB, making it small enough to be on a USB stick, CD-ROM or memory chip.
RHEV-M, a vCenter-like management software to manage multiple virtual machines across multiple hosts.
Personally, I run RHEL 6.1 with KVM on my IBM laptop as my primary operating system, with a Windows XP guest image to run a few Windows-specific applications.
A complaint of the current RHEV 2.2 release from Linux fanboys is that RHEV-M requires a Windows server, and uses Windows Powershell for scripting. The next release of RHEV is likely to provide a Linux-based option for management server.
Of the various hypervisors evaluated, KVM appears to be poised to offer the best scalability for multi-socket host machines. The next release is expected to support up to 4096 threads, 64TB of RAM, and over 2000 virtual machines. Compare that to VMware vSphere 5, which supports only 160 threads, 2TB of RAM and up to 512 virtual machines.
Linux KVM Overview
Matt also presented a session focused on Linux KVM. While IBM is the leading reseller of VMware for the x86 server platform, it has chosen Linux KVM to run all of its internal x86 Cloud Computing facilities, as it can offer 40 to 80 percent savings, based on Total Cost of Ownership (TCO).
Linux KVM can run unmodified Windows and Linux guest operating systems as guest images with less than 5 percent overhead. Since KVM is built into the Linux kernel, any certification testing automatically benefits KVM as well. KVM takes advantage of modern CPU extensions like Intel's VT and AMD's AMD-V.
For high availability, in the event that a host fails, KVM can restart the guest images on other KVM hosts. RHEV offers "prioritized restart order" which allows mission-critical images to be started before less important ones.
RHEV also provides "Virtual Desktop Infrastructure", known as VDI. This allows a lightweight client with a browser to access an OS image running on a KVM host. Matt was able to demonstrate this with Firefox browser running on his Android-based Nexus One smartphone.
RHEV also adds features that make it ideal for cloud deployments, including hot-pluggable CPU, network and storage; Service Level Agreement monitoring for CPU, memory and I/O resources; storage live migrations to move the raw image files while guests are running; and a self-service user portal.
IBM has been doing server virtualization for decades. When I first started at IBM in 1986, I was doing z/OS development and testing on z/VM guest images. Later, around 1999, I started working with the "Linux on z" team, running multiple Linux images under PR/SM and z/VM. While the server virtualization solutions most people are familiar with (VMware, Hyper-V, Xen) have only been around the last five years or so, IBM has a much deeper and robust understanding and long heritage. This helps to set IBM apart from the competition when helping clients.
Bill Bauman, IBM System x Field Technical Support Specialist and System x University celebrity, presented the differences between Grid, SOA and Cloud Computing. I thought this was an odd combination to compare and contrast, but his presentation was well attended.
Grid - this is when two or more independently owned and managed computers are brought together to solve a problem. Some research facilities do this. IBM helped four hospitals connect their computers together into a grid to help analyze breast cancer. IBM also supports the [World Community Grid] which allows your personal computer to be connected to the grid and help process calculations.
SOA - SOA, which stands for Service Oriented Architecture, is an approach to building business applications as a combination of loosely-coupled black-box components orchestrated to deliver a well-defined level of service by linking together business processes. I often explain SOA as the business version of Web 2.0. You can download a free copy of the eBook "SOA for Dummies" at the [IBM Smart SOA] landing page.
Cloud - A Cloud is a dynamic, scalable, expandable, and completely contractible architecture. It may consist of multiple, disparate, on-premise and off-premise hardware and virtualized platforms hosting legacy, fully installed, stateless, or virtualized instances of operating systems and application workloads.
Tom Vezina, IBM Advanced Technical Sales Specialist, presented "Chaos to Cloud Computing". Survey results show that roughly 70 percent of cloud spend will be for private clouds, and 30 percent for public, hybrid or community clouds. Of the key motivations for public cloud, 77 percent of respondents cited reducing costs, 72 percent time to value, and 50 percent improving reliability.
Tom ran over 500 "server utilization" studies for x86 deployments during the past eight years. Of these, the worst was 0.52 percent CPU utilization, the best was 13.4 percent, and the average was 6.8 percent. When IBM mentions that 85 percent of server capacity is idle, it is mostly due to x86 servers. At this rate, it seems easy to put five to 20 guest images onto a machine. However, many companies encounter "VM stall" where they get stuck after only 25 percent of their operating system images virtualized.
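As a back-of-the-envelope check on that consolidation ratio, here is a minimal Python sketch. The target utilization levels (40, 60 and 80 percent) are my own illustrative assumptions, not figures from Tom's studies, and the estimate ignores memory, I/O and hypervisor overhead.

```python
def guests_per_host(avg_utilization_pct, target_utilization_pct):
    """Estimate how many typical guests fit on one host of similar size.

    Assumes each guest brings the same average CPU load it had as a
    physical server, and that CPU is the only constraint (a simplification).
    """
    return int(target_utilization_pct // avg_utilization_pct)


AVERAGE_UTILIZATION = 6.8  # average from the studies cited above

# Hypothetical targets for how busy we are willing to run the host
for target in (40.0, 60.0, 80.0):
    print(f"target {target:.0f}%: about {guests_per_host(AVERAGE_UTILIZATION, target)} guests")
```

Even under these rough assumptions, the estimate lands in the same range as the five to 20 guests mentioned above.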
He feels the problem is with the fact most Physical-to-Virtual (P2V) migrations are manual efforts. There are tools available like Novell [PlateSpin Recon] to help automate and reduce the total number of hours spent per migration.
System x KVM Solutions
Boy, I walked into this one. Many of IBM's cloud offerings are based on the Linux hypervisor called Kernel-based Virtual Machine [KVM] instead of VMware or Microsoft Hyper-V. However, this session was about the "other KVM": keyboard video and mouse switches, which thankfully, IBM has renamed to Console Managers to avoid confusion. Presenters Ben Hilmus (IBM) and Steve Hahn (Avocent) presented IBM's line of Local Console Managers (LCM) and Global Console Managers (GCM) products.
LCM are the traditional KVM switches that people are familiar with. A single keyboard, video and mouse can select among hundreds of servers to perform maintenance or check on status. GCM adds KVM-over-IP capabilities, which means that now you can access selected systems over the Ethernet from a laptop or personal computer. Both LCM and GCM allow for two-level tiering, which means that you can have an LCM in each rack, and an LCM or GCM that points to each rack, greatly increasing the number of servers that can be managed from a single pane of glass.
Many servers have a "service processor" to manage the rest of the machine. IBM RSA II, HP iLO, and Dell DRAC4 are some examples. These allow you to turn on and off selected servers. IBM BladeCenter offers a Management Module that allows the chassis to be connected to a Console Manager and select a specific blade server inside. These can also be used with VMware viewer, Virtual Network Computing (VNC), or Remote Desktop Protocol (RDP).
IBM's offerings are unique in that you can have an optical CD/DVD drive or USB external storage attached at the LCM or GCM, and make it look like the storage is attached to the selected server. This can be used to install or upgrade software, transfer log files, and so on. Another great use, and apparently the motivation for having this session in the "Federal Track", is that the USB can be used to attach a reader for a smart card, known as a Common Access Card [CAC] used by various government agencies. This provides two-factor authentication [TFA]. For example, to log into the system, you enter your password (something you know) and swipe your employee badge smart card (something you have). The combination is validated at the selected server to provide access.
I find it amusing that server people limit themselves to server sessions, and storage people to storage sessions. Sometimes, you have to step "outside your comfort zone" and learn something new, something different. Open your eyes and look around a bit. You might just be surprised what you find.
(FTC note: I work for IBM. IBM considers Novell a strategic Linux partner. Novell did not provide me a copy of PlateSpin Recon, I have no experience using it, and I mention it only in context of the presentation made. IBM resells Avocent solutions, and we use LCM gear in the Tucson Executive Briefing Center.)
Continuing my coverage of the [Data Center 2010 conference], Tuesday afternoon I presented "Choosing the Right Storage for your Server Virtualization". In 2008 and 2009, I attended this conference as a blogger only, but this time I was also a presenter.
The conference asked vendors to condense their presentations down to 20 minutes. I am sure this was inspired by the popular 18-minute lectures from the [TED conference] or perhaps the [Pecha Kucha] night gatherings in Japan where each presenter speaks while showing 20 slides for 20 seconds each. This forces the presenters to focus on their key points and not fill the time slot with unnecessary marketing fluff. This also allows more vendors to have a chance to pitch their point of view.
Well, it's Wednesday, and you know what that means... IBM Announcements!
(Actually most IBM announcements are on Tuesdays, but IBM gave me extra time to recover from my trip to Europe!)
Today, IBM announced [IBM PureSystems], a new family of expert-integrated systems that combine storage, servers, networking, and software, based on IBM's decades of experience in the IT industry. You can register for the [Launch Event] today (April 11) at 2pm EDT, and download the companion "Integrated Expertise" event app for Apple, Android or Blackberry smartphones.
(If you are thinking, "Hey, wait a minute, hasn't this been done before?" you are not alone. Yes, IBM introduced the System/360 back in 1964, and the AS/400 back in 1988, so today's announcement is on schedule for this 24-year cycle. Based on IBM's past success in this area, others have followed, most recently, Oracle, HP and Cisco.)
Initially, there are two offerings:
IBM PureFlex™ System
IBM PureFlex is like IaaS-in-a-box, allowing you to manage the system as a pool of virtual resources. It can be used for private cloud deployments, hybrid cloud deployments, or by service providers to offer public cloud solutions. IBM drinks its own champagne, and will have no problem integrating these into its [IBM SmartCloud] offerings.
To simplify ordering, the IBM PureFlex comes in three tee-shirt sizes: Express, Standard and Enterprise.
IBM PureFlex is based on a 10U-high, 19-inch wide, standard rack-mountable chassis that holds 14 bays, organized in a 7 by 2 matrix. Unlike BladeCenter where blades are inserted vertically, the IBM PureFlex nodes are horizontal. Some of the nodes take up a single bay (half-wide), but a few are full-wide and take up two bays, the full 19-inch width of the chassis. Compute and storage snap in the front, while power supplies, fans, and networking snap in the back. You can fit up to four chassis in a standard 42U rack.
Unlike competitive offerings, IBM does not limit you to x86 architectures. Both x86 and POWER-based compute nodes can be mixed into a single chassis. Out of the box, the IBM PureFlex supports four operating systems (AIX, IBM i, Linux and Windows), four server hypervisors (Hyper-V, Linux KVM, PowerVM, and VMware), and two storage hypervisors (SAN Volume Controller and Storwize V7000).
There are a variety of storage options for this. IBM will offer SSD and HDD inside the compute nodes themselves, direct-attached storage nodes, and an integrated version of the Storwize V7000 disk system. Of course, every IBM System Storage product is supported as external storage. Since Storwize V7000 and SAN Volume Controller support external virtualization, many non-IBM devices will be supported automatically as well.
Networking is also optimized, with options for 10Gb and 40Gb Ethernet/FCoE, 40Gb and 56Gb Infiniband, 8Gbps and 16Gbps Fibre Channel. Much of the networking traffic can be handled within the chassis, to minimize traffic on external switches and directors.
For management, IBM offers the Flex System Manager, that allows you to manage all the resources from a single pane of glass. The goal is to greatly simplify the IT lifecycle experience of procurement, installation, deployment and maintenance.
IBM PureApplication™ System
IBM PureApplication is like PaaS-in-a-box. Based on the IBM PureFlex infrastructure, the IBM PureApplication adds additional software layers focused on transactional web, business logic, and database workloads. Initially, it will offer two platforms: Linux platform based on x86 processors, Linux KVM and Red Hat Enterprise Linux (RHEL); and a UNIX platform based on POWER7 processors, PowerVM and AIX operating system. It will be offered in four tee-shirt sizes (small, medium, large and extra large).
In addition to having IBM's middleware like DB2 and WebSphere optimized for this platform, over 600 companies will announce this week that they will support and participate in the IBM PureSystems ecosystem as well. Already, there are 150 "Patterns of Expertise" ready to deploy from IBM PureSystem Centre, a kind of a "data center app store", borrowing an idea used today with smartphones.
By packaging applications in this manner, workloads can easily shift between private, hybrid and public clouds.
If you are unhappy with the inflexibility of your VCE Vblock, HP Integrity, or Oracle ExaLogic, talk to your local IBM Business Partner or Sales Representative. We might be able to buy your boat anchor off your hands, as part of an IBM PureSystems sale, with an attractive IBM Global Financing plan.
For the longest time, people thought that humans could not run a mile in less than four minutes. Then, in 1954, [Sir Roger Bannister] beat that perception, and shortly thereafter, once he showed it was possible, many other runners were able to achieve this also. The same is being said now about the IBM Watson computer which appeared this week against two human contestants on Jeopardy!
(2014 Update: A lot has happened since I originally wrote this blog post! I intended this as a fun project for college students to work on during their summer break. However, IBM is concerned that some businesses might be led to believe they could simply stand up their own systems based entirely on open source and internally developed code for business use. IBM recommends instead the [IBM InfoSphere BigInsights] which packages much of the software described below. IBM has also launched a new "Watson Group" that has [Watson-as-a-Service] capabilities in the Cloud. To raise awareness to these developments, IBM has asked me to rename this post from IBM Watson - How to build your own "Watson Jr." in your basement to the new title IBM Watson -- How to replicate Watson hardware and systems design for your own use in your basement. I also took this opportunity to improve the formatting layout.)
Often, when a company demonstrates new technology, it is a prototype that will not be ready for commercial deployment until several years later. IBM Watson, however, was made mostly from commercially available hardware, software and information resources. As several have noted, the 1TB of data used to search for answers could fit on a single USB drive that you buy at your local computer store.
Take a look at the [IBM Research Team] to determine how the project was organized. Let's decide what we need, and what we don't in our version for personal use:
Do we need it for personal use?
Yes. That's you. Assuming this is a one-person project, you will act as Team Lead.
Yes, I hope you know computer programming!
No, since this version for personal use won't be appearing on Jeopardy, we won't need strategy on wager amounts for the Daily Double, or what clues to pick next. Let's focus merely on a computer that can accept a question in text, and provide an answer back, in text.
Yes, this team focused on how to wire all the hardware together. We need to do that, although this version for personal use will have fewer components.
Optional. For now, let's have this version for personal use just return its answer in plain text. Consider this Extra Credit after you get the rest of the system working. Consider using [eSpeak], [FreeTTS], or the Modular Architecture for Research on speech sYnthesis [MARY] Text-to-Speech synthesizers.
Yes, I will explain what this is, and why you need it.
Yes, we will need to get information for personal use to process.
Yes, this team developed a system for parsing the question being asked, and to attach meaning to the different words involved.
No, this team focused on making IBM Watson optimized to answer in 3 seconds or less. We can accept a slower response, so we can skip this.
(Disclaimer: As with any Do-It-Yourself (DIY) project, I am not responsible if you are not happy with your version for personal use. I am basing the approach on what I read from publicly available sources, and my work in Linux, supercomputers, XIV, and SONAS. For our purposes, this version for personal use is based entirely on commodity hardware, open source software, and publicly available sources of information. Your implementation will certainly not be as fast or as clever as the IBM Watson you saw on television.)
Step 1: Buy the Hardware
Supercomputers are built as a cluster of identical compute servers lashed together by a network. You will be installing Linux on them, so if you can avoid paying extra for Microsoft Windows, that would save you some money. Here is your shopping list:
Three x86 hosts, with the following:
64-bit quad-core processor, either Intel-VT or AMD-V capable,
8GB of DRAM, or larger
300GB of hard disk, or larger
CD or DVD Read/Write drive
Computer Monitor, mouse and keyboard
Ethernet 1GbE 4-port hub, and appropriate RJ45 cables
Surge protector and Power strip
Local Console Manager (LCM) 4-port switch (formerly known as a KVM switch) and appropriate cables. This is optional, but will make it easier during the development. Once your implementation is operational, you will only need the monitor and keyboard attached to one machine. The other two machines can remain "headless" servers.
Step 2: Establish Networking
IBM Watson used Juniper switches running at 10Gbps Ethernet (10GbE) speeds, but was not connected to the Internet while playing Jeopardy! Instead, these Ethernet links were for the POWER7 servers to talk to each other, and to access files over the Network File System (NFS) protocol to the internal customized SONAS storage I/O nodes.
Your implementation will be able to run "disconnected from the Internet" as well. However, you will need Internet access to download the code and information sources. For our purposes, 1GbE should be sufficient. Connect your Ethernet hub to your DSL or Cable modem. Connect all three hosts to the Ethernet hub. Connect your keyboard, video monitor and mouse to the LCM, and connect the LCM to the three hosts.
Step 3: Install Linux and Middleware
To say I use Linux on a daily basis is an understatement. Linux runs on my Android-based cell phone, my laptop at work, my personal computers at home, most of our IBM storage devices from SAN Volume Controller to XIV to SONAS, and even on my Tivo at home which recorded my televised episodes of Jeopardy!
For this project, you can use any modern Linux distribution that supports KVM. IBM Watson used Novell SUSE Linux Enterprise Server [SLES 11]. Alternatively, I can also recommend either Red Hat Enterprise Linux [RHEL 6] or Canonical [Ubuntu v10]. Each distribution of Linux comes in different orientations. Download the 64-bit "ISO" files for each version, and burn them to CDs.
Graphical User Interface (GUI) oriented, often referred to as "Desktop" or "HPC-Head"
Command Line Interface (CLI) oriented, often referred to as "Server" or "HPC-Compute"
Guest OS oriented, to run in a Hypervisor such as KVM, Xen, or VMware. Novell calls theirs "Just Enough Operating System" [JeOS].
For this version for personal use, I have chosen a [multitier architecture], sometimes referred to as an "n-tier" or "client/server" architecture.
Host 1 - Presentation Server
For the Human-Computer Interface [HCI], the IBM Watson received categories and clues as text files via TCP/IP, had a [beautiful avatar] representing a planet with 42 circles streaking across in orbit, and text-to-speech synthesizer to respond in a computerized voice. Your implementation will not be this sophisticated. Instead, we will have a simple text-based Query Panel web interface accessible from a browser like Mozilla Firefox.
Host 1 will be your Presentation Server, the connection to your keyboard, video monitor and mouse. Install the "Desktop" or "HPC Head Node" version of Linux. Install [Apache Web Server and Tomcat] to run the Query Panel. Host 1 will also be your "programming" host. Install the [Java SDK] and the [Eclipse IDE for Java Developers]. If you always wanted to learn Java, now is your chance. There are plenty of books on Java if that is not the language you normally write code in.
While three little systems don't constitute an "Extreme Cloud" environment, you might like to try out the "Extreme Cloud Administration Tool", called [xCat], which was used to manage the many servers in IBM Watson.
Host 2 - Business Logic Server
Host 2 will be driving most of the "thinking". Install the "Server" or "HPC Compute Node" version of Linux. This will be running a server virtualization Hypervisor. I recommend KVM, but you can probably run Xen or VMware instead if you like.
Host 3 - File and Database Server
Host 3 will hold your information sources, indices, and databases. Install the "Server" or "HPC Compute Node" version of Linux. This will be your NFS server, which might come up as a question during the installation process.
Technically, you could run different Linux distributions on different machines. For example, you could run "Ubuntu Desktop" for host 1, "RHEL 6 Server" for host 2, and "SLES 11" for host 3. In general, Red Hat tries to be the best "Server" platform, and Novell tries to make SLES be the best "Guest OS".
My advice is to pick a single distribution and use it for everything, Desktop, Server, and Guest OS. If you are new to Linux, choose Ubuntu. There are plenty of books on Linux in general, and Ubuntu in particular, and Ubuntu has a helpful community of volunteers to answer your questions.
Step 4: Download Information Sources
You will need some documents for your implementation to process.
IBM Watson used a modified SONAS to provide a highly-available clustered NFS server. For this version, we won't need that level of sophistication. Configure Host 3 as the NFS server, and Hosts 1 and 2 as NFS clients. See the [Linux-NFS-HOWTO] for details. To optimize performance, host 3 will be the "official master copy", but we will use a Linux utility called rsync to copy the information sources over to the hosts 1 and 2. This allows the task engines on those hosts to access local disk resources during question-answer processing.
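The rsync step can be scripted. Here is a minimal Python sketch that pushes the master copy from Host 3 to the other two hosts. The hostnames "host1" and "host2" and the directory "/data/sources" are placeholders of my own, so substitute whatever you chose during installation. It defaults to a dry run that only prints the commands.

```python
import subprocess


def build_rsync_command(source_dir, dest_host, dest_dir):
    """Build the rsync command that mirrors source_dir onto dest_host.

    -a preserves permissions and timestamps, -z compresses in transit,
    --delete removes files on the copy that are gone from the master.
    """
    source = source_dir.rstrip("/") + "/"  # trailing slash: copy contents
    return ["rsync", "-az", "--delete", source, f"{dest_host}:{dest_dir}"]


def sync_to_hosts(source_dir, hosts, dest_dir, dry_run=True):
    """Mirror the master copy to each host; print only when dry_run is True."""
    commands = [build_rsync_command(source_dir, h, dest_dir) for h in hosts]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
    return commands


# Hypothetical hostnames and path; run this on Host 3, the master copy
sync_to_hosts("/data/sources", ["host1", "host2"], "/data/sources")
```

Pushing from the master keeps the copies one-way, so edits made by accident on Hosts 1 or 2 are overwritten on the next run rather than flowing back.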
We will also need a relational database. You won't need a high-powered IBM DB2. Your implementation can do fine with something like [Apache Derby] which is the open source version of IBM CloudScape from its Informix acquisition. Set up Host 3 as the Derby Network Server, and Hosts 1 and 2 as Derby Network Clients. For more about structured content in relational databases, see my post [IBM Watson - Business Intelligence, Data Retrieval and Text Mining].
Linux includes a utility called wget which allows you to download content from the Internet to your system. What documents you decide to download is up to you, based on what types of questions you want answered. For example, if you like Literature, check out the vast resources at [FullBooks.com]. You can automate the download by writing a shell script or program to invoke wget to all the places you want to fetch data from. Rename the downloaded files to something unique, as often they are just "index.html". For more on wget utility, see [IBM Developerworks].
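Here is one way the download-and-rename step might look, as a minimal Python sketch that calls wget. The naming scheme (host name plus a short hash of the URL) and the directory "/data/sources" are my own assumptions. Any scheme works as long as two pages named "index.html" do not overwrite each other.

```python
import hashlib
import subprocess
from urllib.parse import urlparse


def unique_filename(url):
    """Derive a file name from the URL so that same-named pages do not collide."""
    host = urlparse(url).netloc.replace(":", "_") or "unknown"
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{host}-{digest}.html"


def build_wget_command(url, dest_dir):
    """Build a wget command: -q is quiet, -O names the output file."""
    return ["wget", "-q", "-O", f"{dest_dir}/{unique_filename(url)}", url]


def download_all(urls, dest_dir, dry_run=True):
    """Fetch every URL into dest_dir; print the commands when dry_run is True."""
    commands = [build_wget_command(url, dest_dir) for url in urls]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
    return commands


# Placeholder URLs; replace with the sources you want to collect
download_all(
    ["http://example.com/a/index.html", "http://example.com/b/index.html"],
    "/data/sources",
)
```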
Step 5: The Query Panel - Parsing the Question
Next, we need to parse the question and have some sense of what is being asked for. For this we will use [OpenNLP] for Natural Language Processing, and [OpenCyc] for the conceptual logic reasoning. See Doug Lenat presenting this 75-minute video [Computers versus Common Sense]. To learn more, see the [CYC 101 Tutorial].
Unlike Jeopardy! where Alex Trebek provides the answer and contestants must respond with the correct question, we will do normal Question-and-Answer processing. To keep things simple, we will limit questions to the following formats:
Who is ...?
Where is ...?
When did ... happen?
What is ...?
Host 1 will have a simple Query Panel web interface. At the top, a place to enter your question, and a "submit" button, and a place at the bottom for the answer to be shown. When "submit" is pressed, this will pass the question to "main.jsp", the Java servlet program that will start the Question-answering analysis. Limiting the types of questions that can be posed will simplify hypothesis generation, reduce the candidate set and evidence evaluation, allowing the analytics processing to continue in reasonable time.
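To show how restricting the question formats simplifies things, here is a minimal Python sketch that classifies a question into one of the four formats and pulls out its subject. This is a plain regular-expression stand-in, not OpenNLP or OpenCyc, and the answer-type labels (PERSON, PLACE, DATE, DEFINITION) are names I made up for illustration.

```python
import re

# One pattern per supported question format; labels are hypothetical
QUESTION_PATTERNS = [
    ("PERSON", re.compile(r"^who\s+is\s+(.+?)\?*$", re.IGNORECASE)),
    ("PLACE", re.compile(r"^where\s+is\s+(.+?)\?*$", re.IGNORECASE)),
    ("DATE", re.compile(r"^when\s+did\s+(.+?)\s+happen\?*$", re.IGNORECASE)),
    ("DEFINITION", re.compile(r"^what\s+is\s+(.+?)\?*$", re.IGNORECASE)),
]


def classify_question(text):
    """Return (answer_type, subject), or None if the format is not supported."""
    cleaned = " ".join(text.split())  # collapse runs of whitespace
    for answer_type, pattern in QUESTION_PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return answer_type, match.group(1).strip()
    return None


print(classify_question("Who is Roger Bannister?"))
print(classify_question("When did the moon landing happen?"))
print(classify_question("How tall is Mount Everest?"))
```

Knowing the expected answer type up front means the later stages only need to consider candidates of that type, which is where the reduction in the candidate set comes from.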
Step 6: Unstructured Information Management Architecture
The "heart and soul" of IBM Watson is Unstructured Information Management Architecture [UIMA]. IBM developed this, then made it available to the world as open source. It is maintained by the [Apache Software Foundation], and overseen by the Organization for the Advancement of Structured Information Standards [OASIS].
Basically, UIMA lets you scan unstructured documents, glean the important points, and put that into a database for later retrieval. In the graph above, DBs means 'databases' and KBs means 'knowledge bases'. See the 4-minute YouTube video of [IBM Content Analytics], the commercial version of UIMA.
Starting from the left, the Collection Reader selects each document to process, and creates an empty Common Analysis Structure (CAS) which serves as a standardized container for information. This CAS is passed to Analysis Engines, composed of one or more Annotators which analyze the text and fill the CAS with the information found. The CAS are passed to CAS Consumers which do something with the information found, such as enter an entry into a database, update an index, or update a vote count.
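To make the flow concrete, here is a toy Python sketch of the same three stages. It does not use the UIMA API (which is Java), and every class and function name in it is my own invention. It only shows how a container is created per document, filled by annotators, and handed to a consumer that keeps a vote count.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Cas:
    """Toy stand-in for a Common Analysis Structure: text plus annotations."""
    document_text: str
    annotations: list = field(default_factory=list)


def collection_reader(documents):
    """Create one empty CAS per document."""
    for text in documents:
        yield Cas(document_text=text)


def capitalized_phrase_annotator(cas):
    """Annotate runs of two or more capitalized words as candidate names."""
    pattern = r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b"
    for match in re.finditer(pattern, cas.document_text):
        cas.annotations.append(
            ("PhraseCandidate", match.start(), match.end(), match.group(0))
        )


class VoteCountConsumer:
    """CAS consumer that tallies how often each annotated phrase appears."""

    def __init__(self):
        self.votes = Counter()

    def process(self, cas):
        for _kind, _start, _end, covered_text in cas.annotations:
            self.votes[covered_text] += 1


def run_pipeline(documents, annotators, consumer):
    """Reader -> annotators (the analysis engine) -> consumer."""
    for cas in collection_reader(documents):
        for annotate in annotators:
            annotate(cas)
        consumer.process(cas)
    return consumer


documents = [
    "Roger Bannister ran a mile in under four minutes.",
    "In 1954 Roger Bannister broke the record.",
]
result = run_pipeline(documents, [capitalized_phrase_annotator], VoteCountConsumer())
print(result.votes.most_common(1))
```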
(Note: This point requires, what we in the industry call a small matter of programming, or [SMOP]. If you've always wanted to learn Java programming, XML, and JDBC, you will get to do plenty here. )
If you are not familiar with UIMA, consider this [UIMA Tutorial].
Step 7: Parallel Processing
People have asked me why IBM Watson is so big. Did we really need 2,880 cores of processing power? As a supercomputer, the 80 TeraFLOPs of IBM Watson would place it only in 94th place on the [Top 500 Supercomputers]. While IBM Watson may be the [Smartest Machine on Earth], the most powerful supercomputer at this time is the Tianhe-1A with more than 186,000 cores, capable of 2,566 TeraFLOPs.
To determine how big IBM Watson needed to be, the IBM Research team ran the DeepQA algorithm on a single core. It took 2 hours to answer a single Jeopardy question! Let's look at the performance data:
Number of cores (configuration): Time to answer one Jeopardy question
Single IBM Power750 server: < 4 minutes
Single rack (10 servers): < 30 seconds
IBM Watson (90 servers): < 3 seconds
The old adage applies, [many hands make for light work]. The idea is to divide-and-conquer. For example, if you wanted to find a particular street address in the Manhattan phone book, you could dispatch fifty pages to each friend and they could all scan those pages at the same time. This is known as "Parallel Processing" and is how supercomputers are able to work so well. However, not all algorithms lend themselves well to parallel processing, and the phrase [nine women can't have a baby in one month] is often used to remind us of this.
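The performance figures above line up with simple division. Here is a short Python sketch of the arithmetic, assuming perfect linear scaling (which real workloads rarely achieve) and 32 cores per server, which is what 2,880 cores across 90 servers works out to.

```python
SINGLE_CORE_SECONDS = 2 * 60 * 60  # 2 hours on a single core
TOTAL_CORES = 2880
TOTAL_SERVERS = 90
CORES_PER_SERVER = TOTAL_CORES // TOTAL_SERVERS  # 32


def ideal_seconds(servers):
    """Best-case answer time if the work divides evenly across all cores."""
    return SINGLE_CORE_SECONDS / (servers * CORES_PER_SERVER)


for servers in (1, 10, 90):
    print(f"{servers} server(s): {ideal_seconds(servers)} seconds")
```

This gives 225 seconds (3.75 minutes) for one server, 22.5 seconds for a rack of ten, and 2.5 seconds for all 90, each just under the corresponding figure reported above.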
Fortunately, UIMA is designed for parallel processing. You need to install UIMA-AS for Asynchronous Scale-out processing, an add-on to the base UIMA Java framework, supporting a very flexible scale-out capability based on JMS (Java Message Service) and ActiveMQ. We will also need Apache Hadoop, an open source implementation used by Yahoo Search engine. Hadoop has a "MapReduce" engine that allows you to divide the work, dispatch pieces to different "task engines", and then combine the results afterwards.
Host 2 will run Hadoop and drive the MapReduce process. Plan to have three KVM guests on Host 1, four on Host 2, and three on Host 3. That means you have 10 task engines to work with. These task engines can be deployed for Collection Readers, Analysis Engines, and CAS Consumers. When all processing is done, the resulting votes will be tabulated and the top answer displayed on the Query Panel on Host 1.
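The divide, dispatch and combine pattern can be tried on a single machine before you set up Hadoop. Here is a minimal Python sketch that uses a thread pool in place of the 10 task engines. It is not Hadoop, and the function names and sample candidate answers are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

TASK_ENGINES = 10  # three guests on Host 1, four on Host 2, three on Host 3


def split_into_chunks(items, n):
    """Deal the items out round-robin into n chunks."""
    return [items[i::n] for i in range(n)]


def map_count_candidates(chunk):
    """Map step: one task engine counts the candidate answers in its chunk."""
    return Counter(chunk)


def reduce_votes(partial_counts):
    """Reduce step: merge the per-engine counts into one tally."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def top_answer(candidates, engines=TASK_ENGINES):
    """Return (answer, votes) for the most-voted candidate, or None if empty."""
    chunks = split_into_chunks(candidates, engines)
    with ThreadPoolExecutor(max_workers=engines) as pool:
        partials = list(pool.map(map_count_candidates, chunks))
    votes = reduce_votes(partials)
    return votes.most_common(1)[0] if votes else None


# Made-up candidate answers, as if emitted by the analysis stage
candidates = ["Toronto"] * 3 + ["Chicago"] * 7
print(top_answer(candidates))
```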
Step 8: Testing
To simplify testing, use a batch processing approach. Rather than entering questions by hand in the Query Panel, generate a long list of questions in a file, and submit for processing. This will allow you to fine-tune the environment, optimize for performance, and validate the answers returned.
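A batch harness for this can be quite small. Here is a minimal Python sketch, assuming a file format of my own choosing (one question per line, a tab, then the expected answer) and a placeholder answer function standing in for the real pipeline.

```python
def load_questions(lines):
    """Parse 'question<TAB>expected answer' lines; skip blanks and # comments."""
    cases = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        question, _, expected = line.partition("\t")
        cases.append((question.strip(), expected.strip()))
    return cases


def run_batch(cases, answer_fn):
    """Run every question through answer_fn and report how many matched."""
    failures = []
    for question, expected in cases:
        actual = answer_fn(question)
        if actual is None or actual.strip().lower() != expected.lower():
            failures.append((question, expected, actual))
    return {
        "total": len(cases),
        "correct": len(cases) - len(failures),
        "failures": failures,
    }


def placeholder_answer(question):
    """Stand-in for the real question-answering pipeline."""
    return "Roger Bannister" if question.startswith("Who") else None


sample_lines = [
    "# question<TAB>expected answer",
    "Who is the first four-minute miler?\tRoger Bannister",
    "Where is the IBM Executive Briefing Center?\tTucson",
]
print(run_batch(load_questions(sample_lines), placeholder_answer))
```

To read from a real file, pass the open file object to load_questions, since it only needs something that yields lines.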
There you have it. By the time you get your implementation fully operational, you will have learned a lot of useful skills, including Linux administration, Ethernet networking, NFS file system configuration, Java programming, UIMA text mining analysis, and MapReduce parallel processing. Hopefully, you will also gain an appreciation for how difficult it was for the IBM Research team to accomplish what they had for the Grand Challenge on Jeopardy! Not surprisingly, IBM Watson is making IBM [as sexy to work for as Apple, Google or Facebook], all of which started their business in a garage or a basement with a system as small as this version for personal use.