Tony Pearson is a Master Inventor and Senior IT Architect for the IBM Storage product line at the
IBM Systems Client Experience Center in Tucson Arizona, and featured contributor
to IBM's developerWorks. In 2016, Tony celebrates his 30th year anniversary with IBM Storage. He is
author of the Inside System Storage series of books. This blog is for the open exchange of ideas relating to storage and storage networking hardware, software and services.
(Short URL for this blog: ibm.co/Pearson )
My books are available on Lulu.com! Order your copies today!
Safe Harbor Statement: The information on IBM products is intended to outline IBM's general product direction and it should not be relied on in making a purchasing decision. The information on the new products is for informational purposes only and may not be incorporated into any contract. The information on IBM products is not a commitment, promise, or legal obligation to deliver any material, code, or functionality. The development, release, and timing of any features or functionality described for IBM products remains at IBM's sole discretion.
Tony Pearson is a an active participant in local, regional, and industry-specific interests, and does not receive any special payments to mention them on this blog.
Tony Pearson receives part of the revenue proceeds from sales of books he has authored listed in the side panel.
Tony Pearson is not a medical doctor, and this blog does not reference any IBM product or service that is intended for use in the diagnosis, treatment, cure, prevention or monitoring of a disease or medical condition, unless otherwise specified on individual posts.
For the longest time, people thought that humans could not run a mile in less than four minutes. Then, in 1954, [Sir Roger Bannister] beat that perception, and shortly thereafter, once he showed it was possible, many other runners were able to achieve this also. The same is being said now about the IBM Watson computer which appeared this week against two human contestants on Jeopardy!
(2014 Update: A lot has happened since I originally wrote this blog post! I intended this as a fun project for college students to work on during their summer break. However, IBM is concerned that some businesses might be led to believe they could simply stand up their own systems based entirely on open source and internally developed code for business use. IBM recommends instead the [IBM InfoSphere BigInsights] which packages much of the software described below. IBM has also launched a new "Watson Group" that has [Watson-as-a-Service] capabilities in the Cloud. To raise awareness to these developments, IBM has asked me to rename this post from IBM Watson - How to build your own "Watson Jr." in your basement to the new title IBM Watson -- How to replicate Watson hardware and systems design for your own use in your basement. I also took this opportunity to improve the formatting layout.)
Often, when a company demonstrates new techology, these are prototypes not yet ready for commercial deployment until several years later. IBM Watson, however, was made mostly from commercially available hardware, software and information resources. As several have noted, the 1TB of data used to search for answers could fit on a single USB drive that you buy at your local computer store.
Take a look at the [IBM Research Team] to determine how the project was organized. Let's decide what we need, and what we don't in our version for personal use:
Do we need it for personal use?
Yes, That's you. Assuming this is a one-person project, you will act as Team Lead.
Yes, I hope you know computer programming!
No, since this version for personal use won't be appearing on Jeopardy, we won't need strategy on wager amounts for the Daily Double, or what clues to pick next. Let's focus merely on a computer that can accept a question in text, and provide an answer back, in text.
Yes, this team focused on how to wire all the hardware together. We need to do that, although this version for personal use will have fewer components.
Optional. For now, let's have this version for personal use just return its answer in plain text. Consider this Extra Credit after you get the rest of the system working. Consider using [eSpeak], [FreeTTS], or the Modular Architecture for Research on speech sYnthesis [MARY] Text-to-Speech synthesizers.
Yes, I will explain what this is, and why you need it.
Yes, we will need to get information for personal use to process
Yes, this team developed a system for parsing the question being asked, and to attach meaning to the different words involved.
No, this team focused on making IBM Watson optimized to answer in 3 seconds or less. We can accept a slower response, so we can skip this.
(Disclaimer: As with any Do-It-Yourself (DIY) project, I am not responsible if you are not happy with your version for personal use I am basing the approach on what I read from publicly available sources, and my work in Linux, supercomputers, XIV, and SONAS. For our purposes, this version for personal use is based entirely on commodity hardware, open source software, and publicly available sources of information. Your implementation will certainly not be as fast or as clever as the IBM Watson you saw on television.)
Step 1: Buy the Hardware
Supercomputers are built as a cluster of identical compute servers lashed together by a network. You will be installing Linux on them, so if you can avoid paying extra for Microsoft Windows, that would save you some money. Here is your shopping list:
Three x86 hosts, with the following:
64-bit quad-core processor, either Intel-VT or AMD-V capable,
8GB of DRAM, or larger
300GB of hard disk, or larger
CD or DVD Read/Write drive
Computer Monitor, mouse and keyboard
Ethernet 1GbE 4-port hub, and appropriate RJ45 cables
Surge protector and Power strip
Local Console Monitor (LCM) 4-port switch (formerly known as a KVM switch) and appropriate cables. This is optional, but will make it easier during the development. Once your implementation is operational, you will only need the monitor and keyboard attached to one machine. The other two machines can remain "headless" servers.
Step 2: Establish Networking
IBM Watson used Juniper switches running at 10Gbps Ethernet (10GbE) speeds, but was not connected to the Internet while playing Jeopardy! Instead, these Ethernet links were for the POWER7 servers to talk to each other, and to access files over the Network File System (NFS) protocol to the internal customized SONAS storage I/O nodes.
The implementation will be able to run "disconnected from the Internet" as well. However, you will need Internet access to download the code and information sources. For our purposes, 1GbE should be sufficient. Connect your Ethernet hub to your DSL or Cable modem. Connect all three hosts to the Ethernet switch. Connect your keyboard, video monitor and mouse to the LCM, and connect the LCM to the three hosts.
Step 3: Install Linux and Middleware
To say I use Linux on a daily basis is an understatement. Linux runs on my Android-based cell phone, my laptop at work, my personal computers at home, most of our IBM storage devices from SAN Volume Controller to XIV to SONAS, and even on my Tivo at home which recorded my televised episodes of Jeopardy!
For this project, you can use any modern Linux distribution that supports KVM. IBM Watson used Novel SUSE Linux Enterprise Server [SLES 11]. Alternatively, I can also recommend either Red Hat Enterprise Linux [RHEL 6] or Canonical [Ubuntu v10]. Each distribution of Linux comes in different orientations. Download the the 64-bit "ISO" files for each version, and burn them to CDs.
Graphical User Interface (GUI) oriented, often referred to as "Desktop" or "HPC-Head"
Command Line Interface (CLI) oriented, often referred to as "Server" or "HPC-Compute"
Guest OS oriented, to run in a Hypervisor such as KVM, Xen, or VMware. Novell calls theirs "Just Enough Operating System" [JeOS].
For this version for personal use, I have chosen a [multitier architecture], sometimes referred to as an "n-tier" or "client/server" architecture.
Host 1 - Presentation Server
For the Human-Computer Interface [HCI], the IBM Watson received categories and clues as text files via TCP/IP, had a [beautiful avatar] representing a planet with 42 circles streaking across in orbit, and text-to-speech synthesizer to respond in a computerized voice. Your implementation will not be this sophisticated. Instead, we will have a simple text-based Query Panel web interface accessible from a browser like Mozilla Firefox.
Host 1 will be your Presentation Server, the connection to your keyboard, video monitor and mouse. Install the "Desktop" or "HPC Head Node" version of Linux. Install [Apache Web Server and Tomcat] to run the Query Panel. Host 1 will also be your "programming" host. Install the [Java SDK] and the [Eclipse IDE for Java Developers]. If you always wanted to learn Java, now is your chance. There are plenty of books on Java if that is not the language you normally write code.
While three little systems doesn't constitute an "Extreme Cloud" environment, you might like to try out the "Extreme Cloud Administration Tool", called [xCat], which was used to manage the many servers in IBM Watson.
Host 2 - Business Logic Server
Host 2 will be driving most of the "thinking". Install the "Server" or "HPC Compute Node" version of Linux. This will be running a server virtualization Hypervisor. I recommend KVM, but you can probably run Xen or VMware instead if you like.
Host 3 - File and Database Server
Host 3 will hold your information sources, indices, and databases. Install the "Server" or "HPC Compute Node" version of Linux. This will be your NFS server, which might come up as a question during the installation process.
Technically, you could run different Linux distributions on different machines. For example, you could run "Ubuntu Desktop" for host 1, "RHEL 6 Server" for host 2, and "SLES 11" for host 3. In general, Red Hat tries to be the best "Server" platform, and Novell tries to make SLES be the best "Guest OS".
My advice is to pick a single distribution and use it for everything, Desktop, Server, and Guest OS. If you are new to Linux, choose Ubuntu. There are plenty of books on Linux in general, and Ubuntu in particular, and Ubuntu has a helpful community of volunteers to answer your questions.
Step 4: Download Information Sources
You will need some documents for your implementation to process.
IBM Watson used a modified SONAS to provide a highly-available clustered NFS server. For this version, we won't need that level of sophistication. Configure Host 3 as the NFS server, and Hosts 1 and 2 as NFS clients. See the [Linux-NFS-HOWTO] for details. To optimize performance, host 3 will be the "official master copy", but we will use a Linux utility called rsync to copy the information sources over to the hosts 1 and 2. This allows the task engines on those hosts to access local disk resources during question-answer processing.
We will also need a relational database. You won't need a high-powered IBM DB2. Your implementation can do fine with something like [Apache Derby] which is the open source version of IBM CloudScape from its Informix acquisition. Set up Host 3 as the Derby Network Server, and Hosts 1 and 2 as Derby Network Clients. For more about structured content in relational databases, see my post [IBM Watson - Business Intelligence, Data Retrieval and Text Mining].
Linux includes a utility called wget which allows you to download content from the Internet to your system. What documents you decide to download is up to you, based on what types of questions you want answered. For example, if you like Literature, check out the vast resources at [FullBooks.com]. You can automate the download by writing a shell script or program to invoke wget to all the places you want to fetch data from. Rename the downloaded files to something unique, as often they are just "index.html". For more on wget utility, see [IBM Developerworks].
Step 5: The Query Panel - Parsing the Question
Next, we need to parse the question and have some sense of what is being asked for. For this we will use [OpenNLP] for Natural Language Processing, and [OpenCyc] for the conceptual logic reasoning. See Doug Lenat presenting this 75-minute video [Computers versus Common Sense]. To learn more, see the [CYC 101 Tutorial].
Unlike Jeopardy! where Alex Trebek provides the answer and contestants must respond with the correct question, we will do normal Question-and-Answer processing. To keep things simple, we will limit questions to the following formats:
Who is ...?
Where is ...?
When did ... happen?
What is ...?
Host 1 will have a simple Query Panel web interface. At the top, a place to enter your question, and a "submit" button, and a place at the bottom for the answer to be shown. When "submit" is pressed, this will pass the question to "main.jsp", the Java servlet program that will start the Question-answering analysis. Limiting the types of questions that can be posed will simplify hypothesis generation, reduce the candidate set and evidence evaluation, allowing the analytics processing to continue in reasonable time.
Step 6: Unstructured Information Management Architecture
The "heart and soul" of IBM Watson is Unstructured Information Management Architecture [UIMA]. IBM developed this, then made it available to the world as open source. It is maintained by the [Apache Software Foundation], and overseen by the Organization for the Advancement of Structured Information Standards [OASIS].
Basically, UIMA lets you scan unstructured documents, gleam the important points, and put that into a database for later retrieval. In the graph above, DBs means 'databases' and KBs means 'knowledge bases'. See the 4-minute YouTube video of [IBM Content Analytics], the commercial version of UIMA.
Starting from the left, the Collection Reader selects each document to process, and creates an empty Common Analysis Structure (CAS) which serves as a standardized container for information. This CAS is passed to Analysis Engines , composed of one or more Annotators which analyze the text and fill the CAS with the information found. The CAS are passed to CAS Consumers which do something with the information found, such as enter an entry into a database, update an index, or update a vote count.
(Note: This point requires, what we in the industry call a small matter of programming, or [SMOP]. If you've always wanted to learn Java programming, XML, and JDBC, you will get to do plenty here. )
If you are not familiar with UIMA, consider this [UIMA Tutorial].
Step 7: Parallel Processing
People have asked me why IBM Watson is so big. Did we really need 2,880 cores of processing power? As a supercomputer, the 80 TeraFLOPs of IBM Watson would place it only in 94th place on the [Top 500 Supercomputers]. While IBM Watson may be the [Smartest Machine on Earth], the most powerful supercomputer at this time is the Tianhe-1A with more than 186,000 cores, capable of 2,566 TeraFLOPs.
To determine how big IBM Watson needed to be, the IBM Research team ran the DeepQA algorithm on a single core. It took 2 hours to answer a single Jeopardy question! Let's look at the performance data:
Number of cores
Time to answer one Jeopardy question
Single IBM Power750 server
< 4 minutes
Single rack (10 servers)
< 30 seconds
IBM Watson (90 servers)
< 3 seconds
The old adage applies, [many hands make for light work]. The idea is to divide-and-conquer. For example, if you wanted to find a particular street address in the Manhattan phone book, you could dispatch fifty pages to each friend and they could all scan those pages at the same time. This is known as "Parallel Processing" and is how supercomputers are able to work so well. However, not all algorithms lend well to parallel processing, and the phrase [nine women can't have a baby in one month] is often used to remind us of this.
Fortuantely, UIMA is designed for parallel processing. You need to install UIMA-AS for Asynchronous Scale-out processing, an add-on to the base UIMA Java framework, supporting a very flexible scale-out capability based on JMS (Java Messaging Services) and ActiveMQ. We will also need Apache Hadoop, an open source implementation used by Yahoo Search engine. Hadoop has a "MapReduce" engine that allows you to divide the work, dispatch pieces to different "task engines", and the combine the results afterwards.
Host 2 will run Hadoop and drive the MapReduce process. Plan to have three KVM guests on Host 1, four on Host 2, and three on Host 3. That means you have 10 task engines to work with. These task engines can be deployed for Content Readers, Analysis Engines, and CAS Consumers. When all processing is done, the resulting votes will be tabulated and the top answer displayed on the Query Panel on Host 1.
Step 8: Testing
To simplify testing, use a batch processing approach. Rather than entering questions by hand in the Query Panel, generate a long list of questions in a file, and submit for processing. This will allow you to fine-tune the environment, optimize for performance, and validate the answers returned.
There you have it. By the time you get your implementation fully operational, you will have learned a lot of useful skills, including Linux administration, Ethernet networking, NFS file system configuration, Java programming, UIMA text mining analysis, and MapReduce parallel processing. Hopefully, you will also gain an appreciation for how difficult it was for the IBM Research team to accomplish what they had for the Grand Challenge on Jeopardy! Not surprisingly, IBM Watson is making IBM [as sexy to work for as Apple, Google or Facebook], all of which started their business in a garage or a basement with a system as small as this version for personal use.
As we get to larger and larger flash and spinning disk drives, a common question I get is whether to use RAID-5 versus RAID-6. Here is my take on the matter.
A quick review of basic probability statistics
Failure rates are based on probabilities. Take for example a traditional six-sided die, with numbers one through six represented as dots on each face. What are the chances that we can roll the die several times in a row, that we will have no sixes ever rolled? You might think that if there is a 1/6 (16.6 percent) chance to roll a six, then you would guarantee hit a six after six rolls. That is not the case.
# of Rolls
Probability of no sixes (percent)
So, even after 24 rolls, there is more than 1 percent chance of not rolling a six at all. The formula is (1-1/6) to the 24th power.
Let's say that rolling one to five is success, and rolling a six is a failure. Being successful requires that no sixes appear in a sequence of events. This is the concept I will use for the rest of this post. If you don't care for the math, jump down to the "Summary of Results" section below.
Error Correcting Codes (ECC) and Unreadable Read Errors (URE)
When I speak to my travel agent, I have to provide my six-character [Record Locator] code. Pronouncing individual letters can be error prone, so we use a "spelling alphabet".
The International Radiotelephony Spelling Alphabet, sometimes known as the [NATO phonetic alphabet], has 26 code words assigned to the 26 letters of the English alphabet in alphabetical order as follows: Alfa, Bravo, Charlie, Delta, Echo, Foxtrot, Golf, Hotel, India, Juliett, Kilo, Lima, Mike, November, Oscar, Papa, Quebec, Romeo, Sierra, Tango, Uniform, Victor, Whiskey, X-ray, Yankee, Zulu.
Foxtrot Golf Mike Oscar Victor Whiskey
Foxtrot Gold Mine Oscar Vector Whisker
Boxcart Golf Miko Boxcart Victor Whiskey
Having five or so characters to represent a single character may seem excessive, but you can see that this can be helpful when communications link has static, or background noise is loud, as is often the case at the airport!
If spelling words are misheard, either (a) they are close enough like "Gold" for "Golf", or "Whisker" for "Whiskey", that the correct word is known, or (b) not close enough, such that "Boxcart" could refer to either "Foxtrot" or "Oscar" that we can at least detect that the failure occurred.
For data transfers, or data that is written, and later read back, the functional equivalent is an Error Correcting Code [ECC], used in transmission and storage of data. Some basic ECC can correct a single bit error, and detect double bit errors as failures. More sophisticated ECC can correct multiple bit errors up to a certain number of bits, and detect most anything worse.
When reading a block, sector or page of data from a storage device, if the ECC detects an error, but is unable to correct the bits involved, we call this an "Unrecoverable Read Error", or URE for short.
Bit Error Rate (BER)
Different storage devices have different block, sector or page sizes. Some use 512 bytes, 4096 bytes or 8192 bytes, for example. To normalize likelihood of errors, the industry has simplified this to a single bit error rate or BER, represented often as a power of 10.
Bit Error Rate per read (BER)
Consumer HDD (PC/Laptops)
Enterprise 15k/10k/7200 rpm
Solid-State and Flash
IBM TS1150 tape
In other words, the chance that a bit is unreadable on optical media is 1 in 10 trillion (1E13), on enterprise 15k drives is 1 in 10 quadrillion, and on LTO-7 tape is 1 in 10 quintillion.
There are eight bits per byte, so reading 1 GB of data is like rolling the die eight billion times. The chance of successfully reading 1GB on DVD, then would be (1 - 1/1E13) to the 8 billionth power, or 99.92 percent, or conversely a 0.08 percent chance of failure.
In this paper, Google had studied drive failure using an "Annual Failure Rate" or AFR. Here are two graphs from this paper:
This first graph shows AFR by age. Some drives fail in their first 3-6 months, often called "infant mortality". Then they are fairly reliable for a few years, down to 1.7 percent, then as they get older, they start to fail more often, up to 8.3 percent.
This second graph factors in how busy the drives are. Dividing the drive set into quartiles, "Low" represents the least busy drives (the bottom quartile), "Medium" represents the median two quartiles, and "High" represents the busiest drives, the top quartile. Not surprisingly, the busiest drives tend to fail more often than medium-busy drives.
Given an AFR, what are the chances a drive will fail in the next hour? There are 8,766 hours per year, so the success of a drive over the course of a year is like rolling the die 8,766 times. This allows us to calculate a "Drive Error Rate" or DER:
Drive Error Rate per hour (DER)
For example, an AFR=3 drive has a 1 in 287,800 chance of failing in a particular hour. The probability this drive will fail in the next 24 hours would be like rolling the die 24 times. The formula is (1-1/287,800) to the 24th power, resulting in a failure rate of roughly 0.008 percent.
Let's take a typical RAID-5 rank with 600GB drives at 15K rpm, in a 7+P RAID-5 configuration.
During normal processing, if a URE occurs on a individual drive, RAID comes to the rescue. The system can rebuild the data from parity, and correct the broken block of data.
When a drive fails, however, we don't have this rescue, so a URE that occurs during the rebuild process is catastrophic. How likely is this? Data is read from the other seven drives, and written to a spare empty drive. At 8 bits per byte, reading 4200 GB of data is rolling the die 33.6 trillion times. The formula is then (1-1/E16) to the 33.6 trillionth power, or approximately 0.372 percent chance of URE during the rebuild process.
The time to perform the rebuild depends heavily on the speed of the drive, and how busy the RAID rank is doing other work. Under heavy load, the rebuild might only run at 25 MB/sec, and under no workload perhaps 90 MB/sec. If we take a 60 MB/sec moderate rebuild rate, then it would take 10,000 seconds or nearly 3 hours. The chance that any of the seven drives fail during these three hours, at AFR=10 rolling the DER die (7 x 3) 21 times, results in a 0.025 percent chance of failure.
It is nearly 15 times more likely to get a URE failure than a second drive failure. A rebuild failure would happen with either of these, with a probability of 0.397 percent.
The situation gets worse with higher capacity Nearline drives. Let's do a RAID-5 rank with 6TB Nearline drives at 7200 rpm, in a 7+P configuration. The likelihood of URE reading 42 TB of data, is rolling the die 336 trillion times, or approximately 3.66 percent chance of URE failure. Yikes!
The time to rebuild is also going to take longer. A moderate rebuild rate might only be 30 MB/sec, so that rebuilding a 6TB drive would take 55 hours. The chance that one of the other seven drives fail, assuming again AFR=10, during these 55 hours results in a 0.462 percent.
This time, a URE failure is nearly eight times more likely than a double drive failure. The chance of a rebuild failure is 4.12 percent. Good thing you backed up to tape or object storage!
The math can be done easily using modern spreadsheet software. The URE failure rate is based on the quantity of data read from the remaining drives, so a 4+P with 600GB drives is the same as 8+P with 300GB drives. Both read 2.4 TB of data to recalculate from parity. The Double Drive failure rate is based on the number of drives being read times the number of hours during the rebuild. Slower, higher capacity drives take longer to rebuild. However, in both the 15K and 7200rpm examples, the chance of a URE failure was 8 to 15 times more likely than double drive failure.
Many of the problems associated with RAID-5 above can be mitigated with RAID-6.
After a single drive fails, any URE during rebuild can be corrected from parity. However, if a second drive fails during the rebuild process, then a URE on the remaining drives would be a problem.
Let's start with the 600GB 15k drives in a 6+P+Q RAID-6 configuration. The chance of a second drive failing is 0.0252 percent, as we calculated above. The likelihood of a URE is then based on the remaining six drives, 3600 GB of data. Doing the math, that is 0.0319 percent chance. So, the change of a URE during RAID-6 failure is the probability of both occurring, roughly 0.0000806 percent. Far more reliable than RAID-5!
Likewise, we can calculate the probability of a triple drive failure. After the second drive fails, the likelihood of a third drive at AFR=10, results in 0.00000546 percent.
Combining these, the chance of failure of rebuild is 0.000861 percent.
Switching to 6 TB Nearline drives, in a 6+P+Q RAID-6 configuration, we can do the math in the same manner. The likelihood of URE and two drives failing is 0.0145 percent, and for triple drive failure is 0.00183 percent. Chance of rebuild failure is 0.0163 percent.
Summary of Results
Putting all the results in a table, we have the following:
RAID-5 rebuild failure (percent)
RAID-6 rebuild failure (percent)
600GB 15K rpm
6 TB 7200rpm
Hopefully, I have shown you how to calculate these yourself, so that you can plug in your own drive sizes, rebuild rates, and other parameters to convince yourself of this.
In all cases, RAID-6 drastically reduced the probability of rebuild failure. With modern cache-based systems, the write-penalty associated with additional parity generally does not impact application performance. As clients transition from faster 15K drives to slower, higher capacity 10K and 7200 rpm drives, I highly recommend using RAID-6 instead of RAID-5 in all cases.
In the prank, I indicated that I had submitted my video to the [Arizona International Film Festival], of AIFF for short, which coincidently was running April 1-20, and that it had won an award. I invited everyone who read my blog to see me accept the award at a ceremony at 6:00pm on April 1 at the Fox Theater, followed by the 8:00pm showing of another award-winning film.
I didn't submit the video, the video didn't win any award, and I was not invited to the award ceremony. I did, however, plan to see the movie at 8:00pm.
When I got there, I learned that a dozen of my friends, not realizing it was a prank, showed up, asking for me. The AIFF was quite amused, and invited me to award ceremony still going on. The other filmmakers were impressed I had concocted such an elaborate social media campaign!
A slideshow is another style of video, animating still images to music. The [Ken Burns effect] was named after the technique fellow filmmaker Ken Burns used in his documentaries.
In 2010, I worked with the XIV team to address FUD that our competitors were flinging about double drive failures. My blog post [Double Drive Failure Debunked: XIV Two Years Later] set the record straight and put this issue to rest once and for all. XIV sales shot up dramatically after this post went public!
Are you tired of hearing about Cloud Computing without having any hands-on experience? Here's your chance. IBM has recently launched its IBM Development and Test Cloud beta. This gives you a "sandbox" to play in. Here's a few steps to get started:
Generate a "key pair". There are two keys. A "public" key that will reside in the cloud, and a "private" key that you download to your personal computer. Don't lose this key.
Request an IP address. This step is optional, but I went ahead and got a static IP, so I don't have to type in long hostnames like "vm353.developer.ihost.com".
Request storage space. Again, this step is optional, but you can request a 50GB, 100GB and 200GB LUN. I picked a 200GB LUN. Note that each instance comes with some 10 to 30GB storage already. The advantage to a storage LUN is that it is persistent, and you can mount it to different instances.
Start an "instance". An "instance" is a virtual machine, pre-installed with whatever software you chose from the "asset catalog". These are Linux images running under Red Hat Enterprise Virtualization (RHEV) which is based on Linux's kernel virtual machine (KVM). When you start an instance, you get to decide its size (small, medium, or large), whether to use your static IP address, and where to mount your storage LUN. On the examples below, I had each instance with a static IP and mounted the storage LUN to /media/storage subdirectory. The process takes a few minutes.
So, now that you are ready to go, what instance should you pick from the catalog? Here are three examples to get you started:
IBM WebSphere sMASH Application Builder
Base OS server to run LAMP stack
Next, I decided to try out one of the base OS images. There are a lot of books on Linux, Apache, MySQL and PHP (LAMP) which represents nearly 70 percent of the web sites on the internet. This instance let's you install all the software from scratch. Between Red Hat and Novell SUSE distributions of Linux, Red Hat is focused on being the Hypervisor of choice, and SUSE is focusing on being the Guest OS of choice. Most of the images on the "asset catalog" are based on SLES 10 SP2. However, there was a base OS image of Red Hat Enterprise Linux (RHEL) 5.4, so I chose that.
To install software, you either have to find the appropriate RPM package, or download a tarball and compile from source. To try both methods out, I downloaded tarballs of Apache Web Server and PHP, and got the RPM packages for MySQL. If you just want to learn SQL, there are instances on the asset catalog with DB2 and DB2 Express-C already pre-installed. However, if you are already an expert in MySQL, or are following a tutorial or examples based on MySQL from a classroom textbook, or just want a development and test environment that matches what your company uses in production, then by all means install MySQL.
This is where my SSH client comes in handy. I am able to login to my instance and use "wget" to fetch the appropriate files. An alternative is to use "SCP" (also part of PuTTY) to do a secure copy from your personal computer up to the instance. You will need to do everything via command line interface, including editing files, so I found this [VI cheat sheet] useful. I copied all of the tarballs and RPMs on my storage LUN ( /media/storage ) so as not to have to download them again.
Compiling and configuring them is a different matter. By default, you login as an end user, "idcuser" (which stands for IBM Developer Cloud user). However, sometimes you need "root" level access. Use "sudo bash" to get into root level mode, and this allows you to put the files where they need to be. If you haven't done a configure/make/make install in awhile, here's your chance to relive those "glory days".
In the end, I was able to confirm that Apache, MySQL and PHP were all running correctly. I wrote a simple index.php that invoked phpinfo() to show all the settings were set correctly. I rebooted the instance to ensure that all of the services started at boot time.
Rational Application Developer over VDI
This last example, I started an instance pre-installed with Rational Application Developer (RAD), which is a full Integrated Development Environment (IDE) for Java and J2EE applications. I used the "NX Client" to launch a virtual desktop image (VDI) which in this case was Gnome on SLES 10 SP2. You might want to increase the screen resolution on your personal computer so that the VDI does not take up the entire screen.
From this VDI, you can launch any of the programs, just as if it were your own personal computer. Launch RAD, and you get the familiar environment. I created a short Java program and launched it on the internal WebSphere Application Server test image to confirm it was working correctly.
If you are thinking, "This is too good to be true!" there is a small catch. The instances are only up and running for 7 days. After that, they go away, and you have to start up another one. This includes any files you had on the local disk drive. You have a few options to save your work:
Copy the files you want to save to your storage LUN. This storage LUN appears persistent, and continues to exist after the instance goes away.
Take an "image" of your "instance", a function provided in the IBM Developer and Test Cloud. If you start a project Monday morning, work on it all week, then on Friday afternoon, take an "image". This will shutdown your instance, and backup all of the files to your own personal "asset catalog" so that the next time you request an instance, you can chose that "image" as the starting point.
Another option is to request an "extension" which gives you another 7 days for that instance. You can request up to five unique instances running at the same time, so if you wanted to develop and test a multi-host application, perhaps one host that acts as the front-end web server, another host that does some kind of processing, and a third host that manages the database, this is all possible. As far as I can tell, you can do all the above from either a Windows, Mac or Linux personal computer.
Getting hands-on access to Cloud Computing really helps to understand this technology!
Back in June, I mentioned this blog was [Moving to MyDeveloperWorks] which is based on IBM Lotus Connections.
Finally, the move is complete for all bloggers. If you are having problems with the redirects, you might need to unsubscribe and re-subscribe in your RSS feed reader. Here are the new links for several IBM bloggers that have moved over: