Data Analytics

Get started with Streaming Analytics and BigInsights on Bluemix using HDFS

Overview

Streams applications can integrate with HDFS in on-premises BigInsights clusters using the streamsx.hdfs toolkit. However, an extra layer of security in the cloud requires a special toolkit to access the BigInsights service in Bluemix. The HDFS for Bluemix toolkit contains Streams operators that can connect through the Knox Gateway. This article shows how to use these operators to read and write files to HDFS on Bluemix. The operators work both in a local Streams install and in the Streaming Analytics Bluemix service.
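Under the covers, the toolkit's operators talk to HDFS through the Knox Gateway's WebHDFS REST endpoint. A minimal Python sketch of that URL scheme is below; the gateway path (`gateway/default/webhdfs/v1`) is the Knox default topology and the host is a placeholder, so check your own service credentials rather than taking these values as given.

```python
# Illustrative sketch of the WebHDFS-over-Knox REST access that the
# HDFS for Bluemix toolkit uses. The "gateway/default/webhdfs/v1" path
# is Knox's default topology; your host and credentials will differ.

def knox_webhdfs_url(host, path, op, port=8443):
    """Build a WebHDFS REST URL routed through the Knox gateway."""
    return "https://{}:{}/gateway/default/webhdfs/v1{}?op={}".format(
        host, port, path, op)

# Example (not executed here): list /tmp with your BigInsights credentials.
# import requests
# r = requests.get(knox_webhdfs_url("198.51.100.1", "/tmp", "LISTSTATUS"),
#                  auth=("username", "password"), verify=False)
# print(r.json())
```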

Why integrate with Hadoop?

Integrating with Hadoop in Bluemix empowers Streams applications.

  • Reading from Hadoop enables Streams to ingest both data already in Hadoop and data as it is written to HDFS.
  • Writing to Hadoop prepares the data for future data-at-rest analytics.

Prerequisites

  • IBM Streams with Streams Studio (for local runs), or the Streaming Analytics service on Bluemix.
  • The HDFS for Bluemix toolkit (streamsx.hdfs) added to Streams Studio.
  • A BigInsights service on Bluemix, with its username, password, and webhdfs url from the service credentials.

Write to HDFS

Import HDFSBluemixDemo into Streams Studio from the streamsx.hdfs/samples/ directory. If the project doesn’t build successfully, make sure you have added the HDFS for Bluemix toolkit to Streams Studio.  We’ll start with the TestWrite application in the demo project to demonstrate writing to HDFS.

In Streams Studio, launch the TestWrite application. Fill in the username, password, and url submission time values with your BigInsights credentials and webhdfs url. The webhdfs url will look similar to webhdfs://[Ambari console ip]:8443. Leave the file parameter as-is to write to the /tmp directory in HDFS.

Read more about submission time values in the Streams documentation.

The TestWrite application writes to HDFS:

[Figure: TestWrite operator graph]

  • Input (Beacon) emits the rstring tuples to be written to the HDFS file.
  • Sink (HDFS2FileSink) writes the received tuples to the remote HDFS file write_test_[timestamp].txt.
  • Log (Custom) prints the name of the file written by the HDFS2FileSink operator.
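The write path can be mimicked in a few lines. The following is an illustrative Python sketch, not the SPL from the sample: the tuple text is taken from the TestRead output quoted later in this article, and ten 45-byte lines, numbered from 0 as a Beacon's iteration count would be, account for the 450 bytes reported on the PE console.

```python
# Sketch of the tuples the Input (Beacon) operator produces and the
# timestamped file name the Sink (HDFS2FileSink) writes to. The line
# text matches the TestRead output shown later in this article.
import time

def beacon_tuples(n=10):
    """Generate the rstring tuples written to the HDFS file (numbered from 0)."""
    return ["HDFS and Streams on Bluemix Test: New LINE %d\n" % i
            for i in range(n)]

def sink_filename():
    """Timestamped name used by the sink, e.g. write_test_20151211_172531.txt."""
    return time.strftime("write_test_%Y%m%d_%H%M%S.txt")

lines = beacon_tuples()
# Ten 45-byte lines total the 450 bytes reported on the PE console.
print("Wrote %d bytes to file %s" % (sum(len(l) for l in lines), sink_filename()))
```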

Viewing PE console output

The TestWrite application prints the name of the file to the Streams PE console.

View the demo’s output by opening the instance graph, then showing the PE Console of the Log operator. You should see the name and size of the newly created file:

Wrote 450 bytes to file hdfs_test_20151211_172531.txt

Viewing the file written to HDFS

The demo creates write_test_[timestamp].txt on the Hadoop filesystem in the /tmp directory.

To view the file, log onto the Hadoop system via SSH and run:

$ hadoop fs -ls /tmp

Read from HDFS

After verifying the HDFS file was written, we’ll read it back into Streams using the TestRead sample.

The TestRead application retrieves a file from HDFS and prints its contents:

[Figure: TestRead operator graph]

  • FromHDFS (HDFS2FileSource) reads the file back into Streams from HDFS, and outputs each line as an rstring tuple.
  • PrintFileContents (Custom) prints each incoming rstring to the PE console.
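The read path mirrors the write path. This hedged Python sketch only illustrates what the two operators do with the fetched file; the "Output read from remote file:" prefix matches the PE console output quoted below.

```python
# Sketch of the read path: HDFS2FileSource emits one rstring tuple per
# line of the remote file, and PrintFileContents prefixes and prints
# each incoming tuple on the PE console.

def file_source(content):
    """Split the fetched file into rstring tuples, one per line."""
    return content.splitlines()

def print_file_contents(tuples):
    """Mimic the Custom operator's PE console output."""
    return ["Output read from remote file: %s" % t for t in tuples]

sample = ("HDFS and Streams on Bluemix Test: New LINE 0\n"
          "HDFS and Streams on Bluemix Test: New LINE 1\n")
for line in print_file_contents(file_source(sample)):
    print(line)
```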

Launch the application from Streams Studio, specifying the same credentials and url as for the TestWrite application. Set the file parameter to the file name printed by the TestWrite sample.

After launching the TestRead application, view its output in the PE console of the PrintFileContents operator. You should see the file contents – ten numbered lines, each reading:

Output read from remote file: HDFS and Streams on Bluemix Test: New LINE #

Running the samples on Bluemix Streaming Analytics

The demo applications can also be deployed to the Streaming Analytics cloud service.

  1. Create a Streaming Analytics service by following the “Finding the service” section of Introduction to Bluemix Streaming Analytics.
  2. Click Launch in the Streaming Analytics Bluemix dashboard to launch the Streams console.
  3. From the Streams console, select “Submit Job” under the “play” icon.
    [Figure: Submit Job in the Streams console]
  4. Browse for and select the .sab file in your workspace directory: workspace/HDFSBluemixDemo/output/hdfsexample.TestWrite/Distributed
  5. Click Next, enter the HDFS url, username, password, and file name as submission time values, and submit.
  6. After the operators start up, they will show a green circle in the Streams graph view. If they do not, verify your submission time values and resubmit.

After launching the TestWrite application, repeat the steps above to launch the TestRead sample. View its output by loading the Console Log of the PrintFileContents operator.

[Figure: Streaming Analytics log view]
The file is printed “backwards” in the logs because the log viewer shows most recent logs first.

Conclusion

Bluemix HDFS operators enable Streams to read and write HDFS files on BigInsights. The HDFS for Bluemix toolkit works with Streams on-premises and in the Streaming Analytics Bluemix service.

2 Comments


chris snow:

On BigInsights on cloud, the SSL certificate presented with WebHDFS connections is self-signed. Does Streams bypass HTTPS certificate verification when connecting to WebHDFS?

apogue:

By default, Streams will accept all server certificates. If you need to validate SSL certificates, export the certificate and add it to your truststore. See step 4 of the documentation at http://ibmstreams.github.io/streamsx.hdfs/com.ibm.streamsx.hdfs/3.5_bluemix/doc/spldoc/html/tk$com.ibm.streamsx.hdfs/tk$com.ibm.streamsx.hdfs$1.html