Get started with Hadoop-based data analytics on IBM SmartCloud Enterprise

Raul F Chong (rfchong@ca.ibm.com), Big Data and DB2 Senior Program Manager and Technical Evangelist, IBM Software Group
Raul F. Chong is a senior DB2, big data, and cloud program manager and technical evangelist based at the IBM Toronto Laboratory. His main responsibility is growing the DB2 and big data communities around the world. Raul is a DB2 Certified Solutions Expert in both DB2 administration and application development. He has held numerous positions at IBM since 1997 and is the lead author of the book Understanding DB2: Learning Visually with Examples, 2nd Edition (ISBN-10: 0131580183). He is also the author of more than 5 books and 30 articles. For a longer biography, visit his developerWorks profile.

Summary:  Cloud computing and big data analytics go together: the cloud provides elasticity, on-demand access to resources, and utility-like billing, while big data processing and analytics deliver a framework that takes advantage of cloud resources. The combination of cloud and Hadoop makes it possible to handle large amounts of structured and unstructured data. In this article, the author explains how to get started using Hadoop (in the form of InfoSphere® BigInsights Basic) on IBM® SmartCloud Enterprise. Learn how to set up a three-node cluster and verify that your cluster is working.

Date:  11 Oct 2011
Level:  Introductory

Cloud computing and big data analytics are two areas of technology that currently are gaining a lot of traction:

  • Cloud computing provides the benefits of elasticity, on-demand access to resources, and utility-like billing.
  • Big data processing and analytics using Hadoop provides a framework to take advantage of these resources by distributing the workload into a cluster of computers.

Thanks to the cloud and Hadoop, it is now possible to handle large amounts of structured or unstructured data in a timely manner. Hadoop was not designed for virtualized environments such as those the cloud provides, so a job is likely to perform better on physical nodes than on virtualized cloud nodes. Nevertheless, the cloud offers an environment that is easy to set up and cost-effective, and it lets any user run a Hadoop job and manipulate big data, something that was not practical in the past.

Currently, skills in configuring and managing cloud and Hadoop technologies are in short supply. By following the hands-on instructions in this article, you should be able to start using these technologies quickly and effectively. This article shows you:

  • The process of provisioning three instances on IBM SmartCloud Enterprise to set up a three-node cluster.
  • How to verify your cluster is working by stopping and starting all Hadoop components, testing a few commands, and reviewing the web console.

You should be able to follow the same instructions in this article to set up a larger cluster that satisfies your needs.

Before you start

Prerequisites to "play along" with this article

IBM InfoSphere BigInsights Basic software (BigInsights for short) is used for this article. BigInsights is IBM's distribution of Hadoop with additional features. The basic edition is available at no charge. Find out more in the sidebar.

If you are new to Hadoop, you can take the free online course Hadoop Fundamentals I at BigDataUniversity.com, which includes videos and lab exercises. The course also includes a video demonstration of the setup described in this article, as well as a video demonstration of running some Hadoop commands on the IBM Cloud. You can find this material in "Lesson 1: Hands-on lab: Creating your own Hadoop cluster, Option 3" of the course. If you want a more detailed course, IBM offers the fee-based "InfoSphere BigInsights Essential" class. See the sidebar for links to these resources.

If you prefer to read instructions while trying these hands-on exercises, please continue reading.

Ready to get started? You'll need an IBM Cloud account; if you don't have one, you can take advantage of the free trial available until November 11, 2011 (see the sidebar).


Getting started with data analytics on the cloud

To provision three instances in the IBM Cloud to set up a three-node cluster, and to verify and test your cluster:

  1. Log on to the IBM Cloud.
  2. Provision a BigInsights Master node instance.
  3. Provision a BigInsights Data node instance.
  4. Verify your Hadoop cluster is working.

Step 1: Log on to the IBM Cloud

  1. Open the IBM Cloud portal page to sign in.
  2. Enter your user ID and password and click Submit.

    Figure 1. The IBM Cloud Sign in page

  3. After you log on, the IBM Cloud dashboard opens with the Overview tab selected, as shown in Figure 2. This tab displays the instances you've provisioned in the past. Click the Control panel tab.

    Figure 2. The IBM Cloud dashboard

Step 2: Provision a BigInsights Master Node instance

At the time of writing, the IBM Cloud offers two types of images for BigInsights:

  • IBM BigInsights Basic 1.1 Hadoop Master node
  • IBM BigInsights Basic 1.1 Hadoop Data node

These images run Red Hat Enterprise Linux (RHEL) 5.6, 64-bit, with the "pay as you go" option. As mentioned earlier, there is no charge for BigInsights Basic edition, but there is a charge of US$0.30/hour (at the time of writing) for using RHEL and the IBM Cloud infrastructure.
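
To get a feel for what a trial cluster might cost, the arithmetic can be sketched as follows. This is a rough estimate only: it assumes the US$0.30/hour rate quoted above is charged separately for each instance and for every hour the instance is active, which the article does not state explicitly. The function name is illustrative.

```python
from decimal import Decimal

# Rate quoted in the article at the time of writing.
# Assumption: it is charged per instance, per hour of activity.
HOURLY_RATE_USD = Decimal("0.30")


def estimate_cluster_cost(nodes: int, hours: int,
                          rate: Decimal = HOURLY_RATE_USD) -> Decimal:
    """Return the estimated charge in US dollars for `nodes` instances
    that each stay active for `hours` hours."""
    if nodes < 1 or hours < 0:
        raise ValueError("nodes must be >= 1 and hours must be >= 0")
    return rate * nodes * hours


if __name__ == "__main__":
    # A three-node cluster kept active for an eight-hour working day.
    print(estimate_cluster_cost(nodes=3, hours=8))  # prints 7.20
```

Under these assumptions, deleting instances you no longer need is the simplest way to keep the charge down.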

Hadoop uses a master-slave architecture in which the master runs the NameNode and the JobTracker, and each slave runs a DataNode and a TaskTracker.

Hadoop can be configured so you work in one of three different modes:

  • standalone mode: Does not start all components and works on a single node.
  • pseudo-distributed mode: Starts all components and works on a single node.
  • fully distributed mode: Starts all components and requires you to work on more than one node.

The standalone and pseudo-distributed modes are typically used in development or testing while the fully distributed mode is typically used in production scenarios.

This article assumes you are working in either pseudo-distributed or fully distributed mode depending on whether you provision Hadoop Data nodes in addition to the Hadoop Master node.

  • If you only provision a Hadoop Master node, and therefore only work on that single node, you are working in pseudo-distributed mode.
  • If you provision one or more Hadoop Data nodes in addition to the Hadoop Master node, you are working in fully distributed mode.
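
The rule in the two bullets above can be written down as a small function. This is only a restatement of the article's rule for the BigInsights images, not something Hadoop itself exposes; the function name is hypothetical.

```python
def hadoop_mode(data_nodes: int) -> str:
    """Name the mode this article's setup runs in, given the number of
    Hadoop Data node instances provisioned next to the Master node."""
    if data_nodes < 0:
        raise ValueError("data_nodes cannot be negative")
    if data_nodes == 0:
        # Master node only: all components run on that single instance.
        return "pseudo-distributed"
    # Master node plus at least one Data node.
    return "fully distributed"


if __name__ == "__main__":
    print(hadoop_mode(0))  # Master node only
    print(hadoop_mode(2))  # the three-node cluster built in this article
```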

The IBM Cloud BigInsights images have been configured so that you can build the cluster simply by specifying the IP address of the Hadoop Master node when provisioning Hadoop Data nodes. The Hadoop Master node instance must be provisioned first.

If you want to work in the standalone mode, you can provision a BigInsights Master node and set this mode of operation in Hadoop by commenting out any parameters in the files core-site.xml, hdfs-site.xml, and mapred-site.xml.
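
One way to confirm that a configuration file has no active parameters left is to parse it and list the properties that are not commented out. The sketch below does this with Python's standard XML parser. The sample file content is illustrative: fs.default.name is a real Hadoop property, but the value shown is an assumption and not necessarily what the BigInsights image ships with.

```python
import xml.etree.ElementTree as ET

# Illustrative core-site.xml with its only property commented out.
# The value is an assumption, not the one used in the BigInsights image.
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <!--
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  -->
</configuration>
"""


def active_properties(xml_text: str) -> dict:
    """Return {name: value} for every property that is not commented out."""
    root = ET.fromstring(xml_text)
    # The parser discards XML comments, so only live <property> elements
    # are found here.
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


if __name__ == "__main__":
    props = active_properties(SAMPLE_CORE_SITE)
    if props:
        print(f"{len(props)} active parameter(s): {sorted(props)}")
    else:
        print("no active parameters")
```

Running the same check against each of the three files shows whether any parameter is still active.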

Let's provision the Hadoop Master node instance.

  1. From the Control panel tab, click Add instance.

    Figure 3. Adding a BigInsights instance

  2. Select the data center where you want to run your instance.

    Figure 4. Select a data center

    The BigInsights images should be available in all data centers. In this example, the Markham, Canada data center is selected.

  3. Once the data center is selected, a list of available images in that data center is displayed. Select the IBM BigInsights Basic 1.1 - Hadoop Master Node image and click Next.

    Figure 5. Select BigInsights Basic 1.1 - Hadoop Master Node

  4. Configure the BigInsights Hadoop Master Node image.

    Figure 6. Configure BigInsights Hadoop Master Node image

    In this example, the instance is named "Hadoop master".

    MapReduce: Making Hadoop great

    MapReduce is a software framework that supports distributed computing on large data sets on clusters of computers; it was derived from the map and reduce functions commonly used in functional programming (although what they do in MapReduce is not the same as the original functions).

    For MapReduce, in the map step the Master node takes the input, partitions it into smaller segments, and distributes them to Worker nodes; this may continue down the chain, resulting in a multi-branching tree structure. After the problem is processed, it goes back up the chain to the Master node. In the reduce step the Master node collects all the answers and assembles them to solve the original problem.
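
The map and reduce steps described above can be illustrated with the classic word-count problem. The sketch below runs in a single Python process and does not use the Hadoop API; it only shows how work is split into a map step, a grouping step, and a reduce step. In a real cluster, the framework distributes these steps across nodes. All function names are illustrative.

```python
from collections import defaultdict


def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group the emitted values by key, which the framework does
    between the map step and the reduce step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list) -> tuple:
    """Reduce: combine all the values for one key into a single result."""
    return key, sum(values)


def word_count(lines) -> dict:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data on the cloud", "the cloud and big data"]))
```

Each call to map_phase depends only on its own line, and each call to reduce_phase depends only on its own key, which is what allows the work to be spread over many nodes.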

    For the Server configuration option, Copper is probably large enough for your instance, assuming you are just trying this out rather than setting it up for production. If you are setting it up for production, first review the performance of your MapReduce jobs with each of the configuration sizes (Copper, Bronze, Silver, Gold, Platinum) in the IBM Cloud. You can also try the benchmarks specified in the Hadoop wiki (Resources). If you do, run them first with only one node (the Hadoop Master node), and then repeat them after you have set up your cluster. For the BigInsights IBM Cloud images in particular, run the following commands for the benchmark:

    cd /mnt/biginsights/opt/ibm/biginsights/IHC
    hadoop jar hadoop-*-examples.jar randomwriter rand 
    hadoop jar hadoop-*-examples.jar sort rand rand-sort
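
To compare configuration sizes or cluster sizes, it helps to record how long each benchmark command takes. The following is a minimal sketch that wraps the two commands above and reports wall-clock time. It assumes a Python 3 interpreter is available on the node, which the BigInsights image may not provide, and that it is run from the IHC directory shown above. The helper names are hypothetical.

```python
import glob
import subprocess
import time


def benchmark_commands(jar: str) -> list:
    """Build the two benchmark commands from the article for a given jar."""
    return [
        ["hadoop", "jar", jar, "randomwriter", "rand"],
        ["hadoop", "jar", jar, "sort", "rand", "rand-sort"],
    ]


def time_command(argv: list) -> float:
    """Run a command to completion and return its duration in seconds."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if the command exits non-zero.
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # The shell normally expands hadoop-*-examples.jar; do the same here.
    jars = glob.glob("hadoop-*-examples.jar")
    if jars:
        for argv in benchmark_commands(jars[0]):
            print(" ".join(argv), f"took {time_command(argv):.1f} s")
    else:
        print("No examples jar found; run this from the IHC directory.")
```

Hadoop jobs fail when their output directory already exists, so the rand and rand-sort directories need to be removed from HDFS before the benchmark is repeated.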
    

    Because I have worked with the IBM Cloud before, previously generated keys are listed under Key, and I am reusing one of them. This example uses the key named IBM Cloud Raul.

  5. Keep the defaults for all the other parameters and click Next.
  6. A summary of the configurations you specified for your image is displayed. If satisfied, click Next.

    Figure 7. Summary of the configuration for your Hadoop Master node image

  7. A service agreement is presented to you. You must accept the terms of the agreement to continue. Click I agree (if you do!), and then click Submit.

    Figure 8. Agree to the service agreement to continue

  8. A success message panel is displayed after you submit your request to provision the image.

    Figure 9. Success message after submitting the request to provision the instance

  9. A few minutes later your instance is requested, provisioned, and then active, which means it is up and running and ready to use. For this image, all Hadoop components are automatically started as soon as the image is in active status. The IP address of the instance should also be displayed, as shown in Figure 10.

    Figure 10. Successful provisioning of the Hadoop Master node instance

In this example, the IP address that was assigned to this Hadoop Master node instance is 170.224.193.137. Write down this address; you will use it when provisioning the Hadoop Data nodes.
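
Because the Data nodes find the cluster through this address, a typing mistake is worth catching before you paste it into the provisioning panel. The sketch below checks that a string is a well-formed IPv4 address using Python's standard ipaddress module. It only validates the format; it cannot tell whether the address really belongs to your Master node. The function name is illustrative.

```python
import ipaddress


def is_valid_master_ip(text: str) -> bool:
    """Return True if `text` is a well-formed IPv4 address."""
    try:
        ipaddress.IPv4Address(text.strip())
    except ValueError:
        # Raised for missing octets, octets above 255, stray characters, etc.
        return False
    return True


if __name__ == "__main__":
    print(is_valid_master_ip("170.224.193.137"))  # True
    print(is_valid_master_ip("170.224.193"))      # False
```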

Step 3: Provision a BigInsights Data node instance

After the Hadoop Master node instance has been provisioned, you can start provisioning as many Data nodes as you want to use for your cluster. In this case you want to provision two Data nodes to build a three-node Hadoop cluster.

Since the process of provisioning a Hadoop Data node instance is very similar to provisioning a Hadoop Master node instance, only the steps that are different or require attention are described here.

  1. When working with Hadoop, placing nodes in different data centers is the worst-case scenario for data access. Therefore, this example uses the same data center that was chosen for the Master node (Markham, Canada). For the image, however, choose the IBM BigInsights Basic 1.1 - Hadoop Data node image, as shown in Figure 11.

    Figure 11. Select the Hadoop Data Node image in the Markham, Canada data center

  2. Enter Hadoop slave 1 as the name of the instance. For all the other settings, keep the defaults or choose the same values as you did for the Hadoop Master node.

    Figure 12. Configuring the BigInsights Hadoop Data node image

  3. Once you have configured your BigInsights Hadoop Data node, a panel is displayed where you need to enter the IP address of the Hadoop Master node. This is necessary so that the Hadoop Data node can be added to the cluster automatically. Enter the IP address and click Next. In this example, the Hadoop Master node IP address is 170.224.193.137.

    Figure 13. Input the BigInsights Hadoop Master IP address

  4. Continue with the next steps, accepting the defaults, to add the Hadoop Data node instance.
  5. Repeat the same process to create another Hadoop Data node; this time, name it Hadoop slave 2.

Figure 14 shows the Hadoop Master instance and the two Hadoop Data node (slave) instances.


Figure 14. Your three-node Hadoop cluster is ready for use

When all instances have been provisioned and are active, your Hadoop cluster is ready to use. Congratulations!

Step 4: Verify your Hadoop cluster is working correctly

  1. From the IBM Cloud Control panel tab, click the Hadoop master instance. Scroll down. You should see information similar to what is displayed in Figure 15 and Figure 16.

    Figure 15. Hadoop Master node summary configuration

    Figure 15 shows a summary of what has been configured for the Hadoop Master node.



    Figure 16. Hadoop Master node "Getting started" section

    Figure 16 shows a list of useful links for monitoring your cluster. In particular, take a look at the first link, BigInsights Web Console. When you click that link, the Web Console opens in your browser, as shown in Figure 17.



    Figure 17. The BigInsights Web Console

  2. From the BigInsights Web Console, confirm that the Hadoop cluster is running by verifying in the Components section that every component has been started. In the Start Stop Summary section, verify that the three nodes are started.
  3. Let's try some commands. Use PuTTY (or another SSH client) to connect to the Master node over SSH. As with other instances in the IBM Cloud, specify idcuser as the user.
  4. Stop all components of Hadoop with the stop-all.sh command.

    Figure 18. Issuing the stop-all.sh command to stop all components

  5. Now, start all components of Hadoop with the start-all.sh command.

    Figure 19. Issuing the start-all.sh command to start all components

  6. Execute these commands to verify things are working well:
    • hadoop fs -ls /
      This tests that the Hadoop Distributed File System (HDFS) is working by listing all files and directories in the root of HDFS.
    • pig
      grunt> quit;

      This starts Pig and exits.
    • hive
      hive> quit;

      This starts Hive and exits.
    • jaqlshell
      jaql> quit;

      This starts Jaql and exits.

    These same commands and output are shown in Figure 20.



    Figure 20. Testing some Hadoop commands and components
    Testing some Hadoop commands and components
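
The first check in step 6 can also be scripted so that it is easy to repeat after every restart. The sketch below runs hadoop fs -ls / and treats a zero exit status as success. It assumes Python 3.7 or later is available on the Master node, which the image may not provide by default. The function name and the injectable run parameter are illustrative choices that make the check testable without a cluster.

```python
import subprocess

# The first command from step 6: list the root of HDFS.
HDFS_LIST_ROOT = ["hadoop", "fs", "-ls", "/"]


def hdfs_is_responding(run=subprocess.run) -> bool:
    """Return True if listing the HDFS root exits with status 0."""
    try:
        result = run(HDFS_LIST_ROOT, capture_output=True, text=True)
    except FileNotFoundError:
        # The hadoop launcher is not on the PATH of this machine.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("HDFS OK" if hdfs_is_responding() else "HDFS check failed")
```

Pig, Hive, and Jaql are left out of the script because the article exercises them through their interactive shells.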


In conclusion

This article provided step-by-step instructions for setting up a three-node Hadoop cluster in minutes on the IBM Cloud. The process is straightforward and can be repeated for a larger cluster. Be sure to provision the Hadoop Master node first and write down its IP address so you can specify it when provisioning your Hadoop Data nodes.


Resources

Learn

Get products and technologies

Discuss
