Collect and analyze social data without writing a single line of code

Build an ETL workflow with the Node-RED workflow editor and analyze your social data with IBM Analytics for Hadoop

This article was written using the Bluemix classic interface. Given the rapid evolution of technology, some steps and illustrations may have changed.

Do you think about how much you might learn from analyzing social data, but never act because you don't have enough time or resources to build what you need? In this tutorial, we show you how easy it is to use the Node-RED workflow editor in IBM Bluemix™ to capture a social data feed (a Twitter feed) and then build a Hadoop Distributed File System (HDFS) file from that data. We also show you how to use the IBM Analytics for Hadoop service to analyze the data and produce summary charts. You’ll be amazed at how easy it is to turn an unknown data set into information you can use.

What you'll need to build your application

  * An IBM Bluemix account (you can sign up for a free trial)
  * A Twitter account, so that you can authorize the Twitter input node

Step 1. Set up the Bluemix services

To implement the extract, transform, and load (ETL) workflow, you will use the Node-RED capabilities in Bluemix. To develop the flow, you first need to create a Node-RED application and add the IBM Analytics for Hadoop service to it. (The sketch after the following steps shows what binding that service exposes to your application.)

  1. Log in to your Bluemix account (or sign up for a free trial).
  2. Click Catalog.
  3. Search for and then select Node-RED Starter.
  4. On the right, enter a name for your application in the Name field (the name also appears in the Host field), and then click CREATE.

    Wait for your Node-RED application to be started. Before using the application, you also need to add the IBM Analytics for Hadoop service to it.

  5. On the left, click Back to Dashboard, and then click the Node-RED application that you created.
  6. Click Add a Service or API.
  7. On the left, under Category, check Big Data. Then, on the right, select IBM Analytics for Hadoop.
  8. On the right, click App, and then select your Node-RED application.
  9. Click CREATE and, when prompted, click RESTAGE. Wait for the application to be restaged and to be running.
  10. At the top, next to Routes, click the name of your Node-RED application, such as sampleName.mybluemix.net (where sampleName is the name that you used), to open a new browser window with your Node-RED application.
  11. Click the large button labeled Go to your Node-RED flow editor.
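
Although this tutorial requires no code, it can help to see what the service binding in the previous steps actually does. In a Cloud Foundry-based environment such as Bluemix, the credentials of bound services are exposed to the application through the VCAP_SERVICES environment variable. The following Python sketch shows how an application could list them; it is a minimal illustration of the binding mechanism, not something this tutorial requires you to run.

    import json
    import os

    # Cloud Foundry (and therefore Bluemix) exposes the credentials of
    # bound services as a JSON document in the VCAP_SERVICES variable.
    vcap = json.loads(os.environ.get("VCAP_SERVICES", "{}"))

    # List each bound service and the names of its credential fields.
    # The exact service and field names depend on your instance.
    for service, instances in vcap.items():
        for instance in instances:
            creds = instance.get("credentials", {})
            print(service, "->", sorted(creds))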

Step 2. Build the ETL workflow in Node-RED

Next, you will build the ETL workflow by using the Node-RED workflow editor. The flow takes tweets from Twitter and dynamically builds a Hadoop Distributed File System (HDFS) file, which you will use in the next step to analyze the tweets. The completed flow in the Node-RED workflow editor looks like this:

  1. Scroll the palette, and under social, drag a Twitter input node onto the canvas.
  2. Double-click the Twitter node to configure it:
    1. In the Log in as drop-down list, select Add new twitter-credentials, and then click the pencil icon. Click the button to authenticate with Twitter, enter your Twitter credentials, click Authorize App, and then close that window.
    2. Verify that your Twitter ID is displayed, and click Add.
    3. In the for text field, enter cloud.
    4. In the Name field, enter cloud tweets, and then click Ok.
  3. Scroll the palette, and under storage, select the second ibm hdfs node (write) and drag it to the canvas.
  4. Using the mouse, connect the Twitter node to the ibm hdfs node.
  5. Double-click the ibm hdfs node to configure it:
    1. In the Filename field, enter a name for the file that your application dynamically creates (for example, sampleTwitterData/stream). This file includes the tweets that match your criteria.
    2. Click Ok.
  6. In the upper right corner of the Node-RED workflow editor, click Deploy.
  7. Close the browser window.

Your service is now running: Twitter data is being collected and written to the file. The file lives in the HDFS of the Hadoop (BigInsights) service and can grow to 20 GB, which is the HDFS storage limit of the free BigInsights service.
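
The ibm hdfs node hides the mechanics of writing to Hadoop, but conceptually the flow behaves like the following Python sketch: each matching tweet is appended, one line at a time, to the HDFS file over the WebHDFS REST API. The host, credentials, and path are placeholders, and the sketch assumes the file already exists (the Node-RED node creates it for you); treat it as a minimal illustration rather than the node's actual implementation.

    import requests

    # Placeholder endpoint, credentials, and path; substitute the values
    # from your own IBM Analytics for Hadoop service instance.
    WEBHDFS = "https://your-biginsights-host:8443/webhdfs/v1"
    AUTH = ("biblumix", "your-password")
    PATH = "/user/biblumix/sampleTwitterData/stream"

    def append_tweet(text):
        # Step 1: ask the name node where to append. WebHDFS answers with
        # a 307 redirect to a data node instead of taking the payload here.
        r = requests.post(WEBHDFS + PATH, params={"op": "APPEND"},
                          auth=AUTH, allow_redirects=False)
        r.raise_for_status()
        # Step 2: send the tweet text to the redirect target.
        r = requests.post(r.headers["Location"],
                          data=(text + "\n").encode("utf-8"), auth=AUTH)
        r.raise_for_status()

    append_tweet("Just tried the new cloud analytics service!")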

Because HDFS supports linear scale-out, the size of the file is limited only by your budget; you can choose a more advanced plan to get more storage. The largest known HDFS deployment, at Yahoo, runs with 455 PB, so Hadoop really does scale. The advanced Hadoop clusters in Bluemix run on bare-metal hardware in SoftLayer; the smallest configuration is 18 TB, but it can be scaled out to multiple petabytes if needed.

Step 3. Use IBM Analytics for Hadoop to analyze the tweets

Now that the ETL is complete and data is collected, you are ready to analyze the data by using the IBM Analytics for Hadoop console in Bluemix.

  1. Return to Bluemix. In the Services section of your app, click the IBM Analytics for Hadoop service.
  2. On the service page, click LAUNCH to open the BigInsights console.
  3. In IBM InfoSphere BigInsights, click the Files tab, and then browse through the file explorer to locate the file that you created: /user/biblumix/sampleTwitterData/stream
  4. Select Sheet, which is above the file, and then select the file.

    The Sheet button turns on the BigSheets importer. You can think of BigSheets as a spreadsheet-style web application that can analyze multiple petabytes of data. It handles that much data by letting you define the data-processing workflow on a small sample and then pushing that workflow to the Hadoop cluster as a MapReduce job.

  5. Click Save as Master Workbook.
    1. In the Name field, enter tweets.
    2. Click Save, which automatically moves you to the BigSheets tab.
  6. Click Build new workbook. This step is necessary because, by default, you cannot modify data in the initial workbook in BigSheets; the original file that the workbook is based on is never modified. After you create a new workbook, you can modify the underlying data (as in the next two steps).
  7. Select Add sheets from the drop-down list, and then select Function.
    1. In the New sheet: Function dialog, click Categories, and then click Entities.
    2. Scroll the list and click Organization. Selecting Organization invokes a built-in BigInsights (Watson/NLP-based) capability that extracts company names from the data for analysis.
    3. From the Fill in parameter drop-down list, select Header, and then click the green check mark.

      The first column of the sheet (with the default name Header) is used as input to the Watson-based function, which extracts the company names from the tweets.

  8. A list of the different companies that are mentioned in conjunction with the term "cloud" in your tweets is displayed under Organization. The list represents just a subset of the data to help you design and test your analysis. Click Save > Save & Exit, and then click Save.
  9. In the middle of the window, click Run. A MapReduce job now applies the analytics to all collected tweets in your HDFS file. Wait until the progress bar in the upper right of the window shows 100%.
  10. Click Add chart > cloud > Bubble Cloud, and then click the green check mark.

    Initially, the chart is drawn based on the sample data.

  11. Click Run again to calculate aggregates on all your data in HDFS. Wait until the progress bar in the upper right of the window shows 100%.

    The final result shows the count distribution, by organization, of the tweets from the last 10 minutes that mention cloud. Conceptually, the MapReduce job works like the sketch that follows these steps.
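
BigSheets generates and runs the MapReduce job for you, but the computation itself is easy to picture. The following Python sketch mimics it at toy scale: a map phase emits one count for every organization mention, and a reduce phase sums the counts per organization. The extract_organizations function and its word list are simple stand-ins for the Watson/NLP entity extractor, not its real logic.

    from collections import Counter

    # Stand-in for the Watson/NLP entity extractor that BigSheets uses;
    # a real extractor recognizes organizations without a fixed word list.
    KNOWN_ORGS = {"IBM", "Google", "Microsoft", "Amazon"}

    def extract_organizations(tweet):
        return [w.strip("@#.,!") for w in tweet.split()
                if w.strip("@#.,!") in KNOWN_ORGS]

    def map_phase(tweets):
        # Map: emit (organization, 1) for every mention in every tweet.
        for tweet in tweets:
            for org in extract_organizations(tweet):
                yield org, 1

    def reduce_phase(pairs):
        # Reduce: sum the emitted counts per organization.
        counts = Counter()
        for org, n in pairs:
            counts[org] += n
        return counts

    tweets = ["IBM pushes cloud analytics",
              "Google cloud news",
              "cloud chat with IBM"]
    print(reduce_phase(map_phase(tweets)))   # Counter({'IBM': 2, 'Google': 1})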

Here's an example chart from when we ran our application. Your chart will be different because your Twitter stream covers a different time period, and social influences can be very dynamic, which is why good analytics is so critical. As you analyze different time slices, you might see very different results.

For example, the following figure shows the chart after the IBM Interconnect Conference in Las Vegas:

You can now close the IBM InfoSphere BigInsights application and also IBM Bluemix.

Conclusion

This tutorial showed you how to quickly build an ETL workflow by using Node-RED and how to use IBM Analytics for Hadoop to analyze the collected data. The whole project uses IBM Bluemix services, so you didn't have to write a single line of code. You are now ready to build other workflows with Node-RED and to analyze any data you collect with the Hadoop analytics capabilities.

