
Get started with Streaming Analytics and BigInsights on Bluemix using HDFS


Overview

Streams applications can integrate with HDFS in on-premises BigInsights clusters using the streamsx.hdfs toolkit. In the cloud, however, an extra layer of security requires a special toolkit to access the BigInsights service in Bluemix. The HDFS for Bluemix toolkit contains Streams operators that connect through the Knox Gateway. This article shows how to use these operators to read and write files in HDFS on Bluemix. The operators work both in a local Streams install and in the Streaming Analytics Bluemix service.

Why integrate with Hadoop?

Integrating with Hadoop in Bluemix extends what Streams applications can do:

  • Reading from Hadoop lets Streams ingest data that is already in Hadoop, as well as new data as it lands in HDFS.
  • Writing to Hadoop prepares the data for future data-at-rest analytics.

Prerequisites

  • A BigInsights for Apache Hadoop service on Bluemix, along with its username, password, and WebHDFS URL.
  • IBM Streams with Streams Studio installed locally, or the Streaming Analytics service on Bluemix.
  • The HDFS for Bluemix toolkit from the streamsx.hdfs project, added to Streams Studio.

Write to HDFS

Import the HDFSBluemixDemo project into Streams Studio from the streamsx.hdfs/samples/ directory. If the project doesn't build successfully, make sure you have added the HDFS for Bluemix toolkit to Streams Studio. We'll start with the TestWrite application in the demo project to demonstrate writing to HDFS.

In Streams Studio, launch the TestWrite application. Fill in the username, password, and url submission-time values with your BigInsights credentials and WebHDFS URL. The WebHDFS URL will look similar to webhdfs://[Ambari console IP]:8443. Leave the file parameter as-is to write to the /tmp directory in HDFS.

Read more about submission time values here.
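In SPL, submission-time values are read with the built-in getSubmissionTimeValue function. A minimal illustration (the names "username", "password", and "url" are assumed to match the sample's prompts):

// Hypothetical snippet: read values supplied at job submission.
// The names must match what the application prompts for at launch.
rstring user = getSubmissionTimeValue("username");
rstring pass = getSubmissionTimeValue("password");
rstring uri  = getSubmissionTimeValue("url");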

The TestWrite application writes to HDFS:

(Figure: TestWrite operator graph)

  • Input (Beacon) outputs rstring tuples to print to the HDFS file.
  • Sink (HDFS2FileSink) writes received tuples to the remote HDFS file write_test_[timestamp].txt.
  • Log (Custom) prints the name of the file written by the HDFS2FileSink operator.
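To make the flow concrete, here is a minimal SPL sketch of a write pipeline in the same shape as TestWrite. It is an illustration, not the sample's exact source: the HDFS2FileSink parameter names (hdfsUri, hdfsUser, hdfsPassword) follow the HDFS for Bluemix toolkit documentation, but verify them against the toolkit version you have installed.

use com.ibm.streamsx.hdfs::HDFS2FileSink;

composite TestWriteSketch {
    graph
        // Generate ten lines of test data.
        stream<rstring line> Input = Beacon() {
            param iterations : 10u;
            output Input : line = "HDFS and Streams on Bluemix Test: New LINE "
                                  + (rstring)IterationCount();
        }

        // Write the lines through the Knox gateway; %TIME is expanded
        // in the file name. The optional output port reports the name
        // and size of each file written.
        stream<rstring fileName, uint64 size> Written = HDFS2FileSink(Input) {
            param
                hdfsUri      : getSubmissionTimeValue("url");      // webhdfs://host:8443
                hdfsUser     : getSubmissionTimeValue("username");
                hdfsPassword : getSubmissionTimeValue("password");
                file         : "/tmp/write_test_%TIME.txt";
        }

        // Print the file name reported by the sink to the PE console.
        () as Log = Custom(Written) {
            logic onTuple Written :
                printStringLn("Wrote " + (rstring)size + " bytes to file " + fileName);
        }
}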

Viewing PE console output

The TestWrite application prints the name of the file to the Streams PE console.

View the demo’s output by opening the Instance Graph, then showing the PE Console of the Log operator. You should see the name and size of the newly created file:

Wrote 450 bytes to file hdfs_test_20151211_172531.txt

Viewing the file written to HDFS

The demo creates write_test_[timestamp].txt on the Hadoop filesystem in the /tmp directory.

To view the file, log onto the Hadoop system via SSH and run:

$ hadoop fs -ls /tmp
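To print the file’s contents directly from the shell, hadoop fs -cat works as well (the wildcard stands in for the timestamped name):

$ hadoop fs -cat /tmp/write_test_*.txt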

Read from HDFS

After verifying that the HDFS file was written, we’ll read it back into Streams using the TestRead sample.

The TestRead application retrieves a file from HDFS and prints its contents:

(Figure: TestRead operator graph)

  • FromHDFS (HDFS2FileSource) reads the file back into Streams from HDFS, and outputs each line as an rstring tuple.
  • PrintFileContents (Custom) prints each incoming rstring to the PE console.
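A matching SPL sketch of the read side, under the same parameter-name assumptions as the write sketch above:

use com.ibm.streamsx.hdfs::HDFS2FileSource;

composite TestReadSketch {
    graph
        // Read the file back from HDFS, one line per tuple.
        stream<rstring line> FromHDFS = HDFS2FileSource() {
            param
                hdfsUri      : getSubmissionTimeValue("url");
                hdfsUser     : getSubmissionTimeValue("username");
                hdfsPassword : getSubmissionTimeValue("password");
                file         : getSubmissionTimeValue("file"); // name printed by TestWrite
        }

        // Print each line to the PE console.
        () as PrintFileContents = Custom(FromHDFS) {
            logic onTuple FromHDFS :
                printStringLn("Output read from remote file: " + line);
        }
}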

Launch the application from Streams Studio, specifying the same credentials and URL as for the TestWrite application. Set the file parameter to the file name printed by the TestWrite sample.

After launching the TestRead application, view its output in the PE Console of the PrintFileContents operator. You should see the file contents: ten numbered lines, each reading:

Output read from remote file: HDFS and Streams on Bluemix Test: New LINE #

Running the samples on Bluemix Streaming Analytics

The demo applications can also be deployed to the Streaming Analytics cloud service.

  1. Create a Streaming Analytics service by following the “Finding the service” section of Introduction to Bluemix Streaming Analytics.
  2. Click Launch in the Streaming Analytics Bluemix dashboard to launch the Streams console.
  3. From the Streams console, select “Submit Job” under the “play” icon.
  4. Browse for and select the .sab file in your workspace directory: workspace/HDFSBluemixDemo/output/hdfsexample.TestWrite/Distributed
  5. Click Next, enter the HDFS url, username, password, and file name as submission time values, and submit.
  6. After the operators start up, they will show a green circle in the Streams graph view. If they don’t, verify your submission-time values and resubmit.
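If you are working against a local Streams install instead, the same submission-time values can be supplied from the command line with streamtool. A sketch with placeholder values (the .sab path mirrors the workspace layout above):

$ streamtool submitjob workspace/HDFSBluemixDemo/output/hdfsexample.TestWrite/Distributed/hdfsexample.TestWrite.sab \
    -P url=webhdfs://myhost:8443 -P username=myuser -P password=mypass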

After launching the TestWrite application, repeat the steps above to launch the TestRead sample. View its output by loading the Console Log of the PrintFileContents operator.

(Figure: Streaming Analytics log view)

The file appears “backwards” in the logs because the log viewer shows the most recent entries first.

Conclusion

The Bluemix HDFS operators enable Streams to read and write HDFS files on BigInsights. The HDFS for Bluemix toolkit works both with on-premises Streams installs and with the Streaming Analytics Bluemix service.

Comments


chris snow

On BigInsights on cloud, the SSL certificate presented with WebHDFS connections is self-signed. Does Streams bypass HTTPS certificate verification when connecting to WebHDFS?


apogue

By default, Streams will accept all server certificates. If you need to validate SSL certificates, export the certificate and add it to your truststore. See step 4 of the documentation at http://ibmstreams.github.io/streamsx.hdfs/com.ibm.streamsx.hdfs/3.5_bluemix/doc/spldoc/html/tk$com.ibm.streamsx.hdfs/tk$com.ibm.streamsx.hdfs$1.html
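For reference, importing a self-signed certificate into a Java truststore typically looks like this (the alias and file names are placeholders; see the toolkit documentation linked above for where the truststore is configured):

$ keytool -importcert -trustcacerts -alias knox -file knox.crt -keystore truststore.jks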


Jacques Roy

It looks like the API has changed since this article was published. Any plans on updating the article?
