Cloud computing and big data analytics are two areas of technology that currently are gaining a lot of traction:
- Cloud computing provides the benefits of elasticity, on-demand access to resources, and utility-like billing.
- Big data processing and analytics using Hadoop provide a framework to take advantage of these resources by distributing the workload across a cluster of computers.
Thanks to the cloud and Hadoop, it is now possible to handle large amounts of structured or unstructured data in a timely manner. Hadoop was not designed for virtualized environments such as those the cloud provides, and a Hadoop job running on physical nodes is likely to outperform the same job on virtualized cloud nodes; nevertheless, the cloud offers an environment that is easy to set up and cost effective. It has opened the doors for any kind of user to run a Hadoop job, and therefore to manipulate big data, something that was not possible in the past.
Currently, there is a shortage of skills in configuring and managing cloud and Hadoop technologies. By following the hands-on instructions in this article, you should be able to get started with these technologies quickly and effectively. This article shows you:
- The process of provisioning three instances on IBM SmartCloud Enterprise to set up a three-node cluster.
- How to verify your cluster is working by stopping and starting all Hadoop components, testing a few commands, and reviewing the web console.
You should be able to follow the same instructions in this article to set up a larger cluster that satisfies your needs.
IBM InfoSphere BigInsights Basic software (BigInsights for short) is used for this article. BigInsights is IBM's distribution of Hadoop with additional features. The basic edition is available at no charge. Find out more in the sidebar.
If you are new to Hadoop, you can take the free online course Hadoop Fundamentals I at BigDataUniversity.com, which includes videos and lab exercises. The course also includes video demonstrations of the setup described in this article and of running some Hadoop commands on the IBM Cloud. This material is provided in "Lesson 1: Hands-on lab: Creating your own Hadoop cluster, Option 3" in the course. If you want a more detailed course, IBM offers the fee-based "InfoSphere BigInsights Essential" class. See the sidebar for links to these resources.
If you prefer to read instructions while trying these hands-on exercises, please continue reading.
Ready to get started? You'll need an IBM Cloud account; if you don't have one, you can take advantage of the free trial available until November 11, 2011 (see the sidebar).
To provision three instances in the IBM Cloud to set up a three-node cluster, and to verify and test your cluster:
- Log on to the IBM Cloud.
- Provision a BigInsights Master node instance.
- Provision a BigInsights Data node instance.
- Verify your Hadoop cluster is working.
- Open the IBM Cloud portal page to sign in.
- Enter your user ID and password and click Submit.
Figure 1. The IBM Cloud Sign in page
- After logging on, the IBM Cloud dashboard opens with the Overview tab selected as shown in Figure 2. This displays the instances you've provisioned in the past. Click the Control panel tab.
Figure 2. The IBM Cloud dashboard
At the time of writing, the IBM Cloud offers two types of images for BigInsights:
- IBM BigInsights Basic 1.1 Hadoop Master node
- IBM BigInsights Basic 1.1 Hadoop Data node
These images run Red Hat Enterprise Linux (RHEL) 5.6, 64-bit, with the "pay as you go" option. As mentioned earlier, there is no charge for the BigInsights Basic edition, but there is a charge of US$0.30/hour (at the time of writing) for using RHEL and the IBM Cloud infrastructure.
Hadoop uses a master-slave architecture: the master node runs the NameNode and JobTracker components, while each slave node runs a DataNode and a TaskTracker.
Hadoop can be configured so you work in one of three different modes:
- standalone mode: Does not start any Hadoop daemons; everything runs in a single Java process on a single node.
- pseudo-distributed mode: Starts all components and works on a single node.
- fully distributed mode: Starts all components and requires you to work on more than one node.
The standalone and pseudo-distributed modes are typically used in development or testing while the fully distributed mode is typically used in production scenarios.
This article assumes you are working in either pseudo-distributed or fully distributed mode depending on whether you provision Hadoop Data nodes in addition to the Hadoop Master node.
- If you only provision a Hadoop Master node, and therefore only work on that single node, you are working in pseudo-distributed mode.
- If you provision one or more Hadoop Data nodes in addition to the Hadoop Master node, you are working in fully distributed mode.
The IBM Cloud BigInsights images have been configured so the cluster is easily built simply by specifying the IP address of the Hadoop Master node when provisioning Hadoop Data nodes. The Hadoop Master node instance must be provisioned first.
If you want to work in standalone mode, you can provision a BigInsights Master node and enable this mode of operation in Hadoop by commenting out all the configuration properties in the files core-site.xml, hdfs-site.xml, and mapred-site.xml.
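As a sketch, that edit amounts to backing up the three files and wrapping each <property> element in XML comments. The configuration directory path below is an assumption based on the BigInsights install path used elsewhere in this article, so verify it on your own instance first:

```shell
# Assumed location of the Hadoop configuration files on the BigInsights image;
# confirm this path on your own instance before editing.
HADOOP_CONF_DIR=/mnt/biginsights/opt/ibm/biginsights/IHC/conf

# Back up the originals so the pseudo-distributed settings can be restored later.
for f in core-site.xml hdfs-site.xml mapred-site.xml; do
    cp "$HADOOP_CONF_DIR/$f" "$HADOOP_CONF_DIR/$f.bak"
done
# Then edit each file and wrap every <property>...</property> element in
# XML comments (<!-- ... -->), leaving an empty <configuration> element.
```

Keeping the .bak copies lets you switch back to pseudo-distributed mode by restoring the original files.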
Let's provision the Hadoop Master node instance.
- From the Control panel tab, click Add instance.
Figure 3. Adding a BigInsights instance
- Select the data center where you want to run your instance.
Figure 4. Select a data center
The BigInsights images should be available in all data centers. In this example, the Markham, Canada data center is selected.
- Once the data center is selected, a list of available images in that data center is displayed. Select the IBM BigInsights Basic 1.1 - Hadoop Master Node image and click Next.
Figure 5. Select BigInsights Basic 1.1 - Hadoop Master Node
- Configure the BigInsights Hadoop Master Node image.
Figure 6. Configure BigInsights Hadoop Master Node image
In this example, the instance is named "Hadoop master".
For the Server configuration option, Copper is probably a large enough size for your instance, assuming you are just trying this out rather than setting it up for production. For production, you should first review the performance of your MapReduce jobs with each of the different configuration sizes (Copper, Bronze, Silver, Gold, Platinum) in the IBM Cloud. You can also try the benchmarks specified in the Hadoop wiki (see Resources). If you do, run them first with only one node (the Hadoop Master node); after you have set up your cluster, repeat the benchmarks. For the BigInsights IBM Cloud images in particular, run the following commands for the benchmark:
cd /mnt/biginsights/opt/ibm/biginsights/IHC
hadoop jar hadoop-*-examples.jar randomwriter rand
hadoop jar hadoop-*-examples.jar sort rand rand-sort
Since I have worked with the IBM Cloud before, I have keys previously generated under Key, so I am reusing one of them. This example uses the key named IBM Cloud Raul.
- Keep the defaults for all the other parameters and click Next.
- A summary of the configurations you specified for your image is displayed. If satisfied, click Next.
Figure 7. Summary of the configuration for your Hadoop Master node image
- A service agreement is presented to you. You must accept the terms of the agreement to continue. Click I agree (if you do!), and then click Next.
Figure 8. Agree to the service agreement to continue
- A successful message panel is displayed after submitting your request to provision the image.
Figure 9. Success message after submitting the request to provision the instance
- A few minutes later your instance is requested, provisioned, and then active, which means it is up and running and ready to use. For this image, all Hadoop components are automatically started as soon as the instance reaches active status. The IP address of the instance should also be displayed, as shown in Figure 10.
Figure 10. Successful provisioning of the Hadoop Master node instance
In this example, the IP address that was assigned to this Hadoop Master node instance is 220.127.116.11. Write down this number; you will use it when provisioning the Hadoop Data nodes.
After the Hadoop Master node instance has been provisioned, you can start provisioning as many Data nodes as you want to use for your cluster. In this case you want to provision two Data nodes to build a three-node Hadoop cluster.
Since the process of provisioning a Hadoop Data node instance is very similar to provisioning a Hadoop Master node instance, only the steps that are different, or require attention are described here.
When working with Hadoop, accessing data across different data centers is the worst-case scenario because of the added network latency. Therefore, in this example the same data center chosen for the Master node (Markham, Canada) is used.
However, for the image choose the IBM BigInsights Basic 1.1 - Hadoop Data node image, as shown in Figure 11.
Figure 11. Select the Hadoop Data Node image in the Markham, Canada data center
- Enter Hadoop slave 1 as the name of the instance. For all the other settings, keep the defaults or choose the same values as you did for the Hadoop Master node.
Figure 12. Configuring the BigInsights Hadoop Data node image
Once you have configured your BigInsights Hadoop Data node, a panel is displayed where you need to enter the IP address of the Hadoop Master node. This is necessary so the Hadoop Data node can be automatically added to the cluster. Enter the IP address and click Next.
In this example, the IP address assigned to the Hadoop Master node earlier is entered.
Figure 13. Input the BigInsights Hadoop Master IP address
- Continue with the next steps, accepting the defaults, to add the Hadoop Data node instance.
- Repeat the exact same process to create another Hadoop Data node; this time, name it Hadoop slave 2.
Figure 14 shows the Hadoop Master instance and the two Hadoop Data node (slave) instances.
Figure 14. Your three-node Hadoop cluster is ready for use
When all instances have been provisioned and are active, your Hadoop cluster is ready to use. Congratulations!
- From the IBM Cloud Control panel tab, click the Hadoop master instance and scroll down. You should see information similar to what is displayed in Figure 15 and Figure 16.
Figure 15. Hadoop Master node summary configuration
Figure 15 shows a summary of what has been configured for the Hadoop Master node.
Figure 16. Hadoop Master node "Getting started" section
Figure 16 shows a list of useful links to monitor your cluster. In particular, take a look at the first link BigInsights Web Console. When you click that link, the Web Console opens in your browser as shown in Figure 17.
Figure 17. The BigInsights Web Console
- From the BigInsights Web Console, confirm that the Hadoop cluster is running by verifying in the Components section that every component has been started. In the Start Stop Summary section, verify that the three nodes are started.
- Let's try some commands. Use ssh to connect to the Master node. As with other instances in the IBM Cloud, specify idcuser as the user.
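A minimal sketch of the connection, assuming you saved the private key from your IBM Cloud key pair as a local file (the key file name and the IP address placeholder below are your own values, not fixed names):

```shell
# Connect to the Master node as idcuser with the private key you downloaded
# when generating your IBM Cloud key pair.
# Replace the key file name and <master-node-ip> with your own values.
chmod 600 ~/.ssh/ibmcloud_key
ssh -i ~/.ssh/ibmcloud_key idcuser@<master-node-ip>
```

The chmod step matters because ssh refuses to use a private key file that is readable by other users.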
- Stop all components of Hadoop with the stop-all.sh command.
Figure 18. Issuing the stop-all.sh command to stop all components
- Now, start all components of Hadoop with the start-all.sh command.
Figure 19. Issuing the start-all.sh command to start all components
- Execute this command to verify things are working well:
hadoop fs -ls /
This verifies that the Hadoop Distributed File System (HDFS) is working by listing all files and directories in the root of HDFS.
This command and its output are shown in Figure 20.
Figure 20. Testing some Hadoop commands and components
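Beyond the single listing above, a few more sanity checks can round-trip a small file through HDFS. These use standard HDFS shell commands of this Hadoop generation; the directory and file names are arbitrary examples:

```shell
# Create a scratch directory in HDFS, round-trip a small file, then clean up.
echo "hello hadoop" > /tmp/hello.txt
hadoop fs -mkdir /tmp/smoketest
hadoop fs -put /tmp/hello.txt /tmp/smoketest/
hadoop fs -cat /tmp/smoketest/hello.txt
hadoop fs -rmr /tmp/smoketest   # -rmr is the recursive delete in this Hadoop version
```

If the -cat step prints the file contents back, data is flowing between the client, the NameNode, and the DataNodes.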
This article provided step-by-step instructions for setting up a three-node Hadoop cluster on the IBM Cloud in minutes. The process is straightforward and can be replicated for a larger cluster. Just ensure the Hadoop Master node is provisioned first, and write down its IP address so you can specify it when provisioning your Hadoop Data nodes.
Find out more about IBM InfoSphere BigInsights software, IBM's distribution of Hadoop with additional features.
If you are new to Hadoop, you can take the free online course Hadoop Fundamentals I at BigDataUniversity.com which includes videos and lab exercises.
For even more detailed instruction into Hadoop/BigInsights, try the InfoSphere BigInsights Essential class from IBM.
If you want to know how MapReduce jobs work with different configuration sizes, you can try the benchmarks specified in the Hadoop wiki.
For more on how to perform tasks in the IBM Cloud, visit these resources:
- Upload and download files from a Windows instance.
- Install IIS web server on Windows 2008 R2.
- Create an IBM Cloud instance with the Linux command line.
- Create an IBM Cloud instance with the Windows command line.
- Extend your corporate network with the IBM Cloud.
- High availability apps in the IBM Cloud.
- Parameterize cloud images for custom instances on the fly.
- Windows-targeted approaches to IBM Cloud provisioning.
- Deploy products using rapid deployment service.
- Integrate your authentication policy using a proxy.
- Configure the Linux Logical Volume Manager.
- Deploy a complex topology using a deployment utility tool.
- Provision and configure an instance that spans a public and private VLAN.
- Secure IBM Cloud access for Android devices.
In the developerWorks cloud developer resources, discover and share knowledge and experience of application and services developers building their projects for cloud deployment.
Find out how to access IBM SmartCloud Enterprise.
Get products and technologies
Download IBM InfoSphere BigInsights Basic software.
See the product images available for IBM SmartCloud Enterprise.
Join a cloud computing group on developerWorks.
Read all the great cloud blogs on developerWorks.
Join the developerWorks community, a professional network and unified set of community tools for connecting, sharing, and collaborating.
Raul F. Chong is a senior DB2, big data, and cloud program manager and technical evangelist based at the IBM Toronto Laboratory. His main responsibility is growing the DB2 and big data communities around the world. Raul is a DB2 Certified Solutions Expert in both DB2 administration and application development. He has held numerous positions at IBM since 1997 and is the lead author of the book Understanding DB2: Learning Visually with Examples, 2nd Edition (ISBN-10: 0131580183). He is also the author of more than 5 books and 30 articles. For a longer biography, visit his developerWorks profile.