
Get started with Streaming Analytics and BigInsights on Bluemix using HDFS

Overview

Streams applications can integrate with HDFS in on-premises BigInsights clusters using the streamsx.hdfs toolkit. In the cloud, however, an extra layer of security requires a special toolkit to access the BigInsights service in Bluemix. The HDFS for Bluemix toolkit contains Streams operators that can connect through the Knox Gateway. This article shows how to use these operators to read and write files to HDFS on Bluemix. The operators work both in a local Streams installation and in the Streaming Analytics Bluemix service.
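
Knox exposes HDFS through the WebHDFS REST API over HTTPS. Before wiring up Streams, you can sanity-check your gateway url and credentials with curl. This is a sketch, not an exact recipe: the "default" topology segment in the path and the sample credentials are assumptions that may differ on your cluster.

# List /tmp through the Knox Gateway's WebHDFS endpoint.
# -k skips certificate verification for self-signed certificates.
$ curl -k -u myBiUser:myBiPassword \
    "https://[Ambari console ip]:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS"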

Why integrate with Hadoop?

Integrating with Hadoop in Bluemix extends what Streams applications can do:

  • Reading from Hadoop lets Streams ingest data that is already in Hadoop, as well as new data as it arrives in HDFS.
  • Writing to Hadoop prepares the data for future data-at-rest analytics.

Prerequisites

  • A BigInsights for Apache Hadoop service on Bluemix, along with its webhdfs url and credentials.
  • A local InfoSphere Streams installation with Streams Studio, or the Streaming Analytics Bluemix service.
  • The HDFS for Bluemix toolkit, added to Streams Studio.

Write to HDFS

Import HDFSBluemixDemo into Streams Studio from the streamsx.hdfs/samples/ directory. If the project doesn’t build successfully, make sure you have added the HDFS for Bluemix toolkit to Streams Studio.  We’ll start with the TestWrite application in the demo project to demonstrate writing to HDFS.

In Streams Studio, launch the TestWrite application. Fill in the username, password, and url submission time values with your BigInsights credentials and webhdfs url. The webhdfs url will look similar to webhdfs://[Ambari console ip]:8443. Leave the file parameter as-is to write to the /tmp directory in HDFS.
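
When launching outside of Streams Studio, the same submission time values can be supplied on the command line with streamtool. A minimal sketch, assuming a default domain and instance are configured and that the bundle name matches the one produced by the Distributed build:

# Submit TestWrite with its submission time values from the CLI.
# The .sab file name is an assumption; check your output directory.
$ streamtool submitjob \
    -P username=myBiUser \
    -P password=myBiPassword \
    -P url=webhdfs://[Ambari console ip]:8443 \
    output/hdfsexample.TestWrite/Distributed/hdfsexample.TestWrite.sab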

Read more about submission time values in the Streams documentation.

The TestWrite application writes to HDFS (a minimal SPL sketch follows the list):

[Figure: TestWrite application graph]

  • Input (Beacon) generates rstring tuples to be written to the HDFS file.
  • Sink (HDFS2FileSink) writes the received tuples to the remote HDFS file write_test_[timestamp].txt.
  • Log (Custom) prints the name of the file written by the HDFS2FileSink operator.
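
The sketch below shows how such a graph looks in SPL. The operator parameter names (hdfsUri, hdfsUser, hdfsPassword), the %TIME substitution in the file name, and the sink's optional output port schema follow the streamsx.hdfs operator model; they are assumptions here, so check the HDFS for Bluemix toolkit documentation for the exact names.

use com.ibm.streamsx.hdfs::HDFS2FileSink;

composite TestWriteSketch {
    param
        expression<rstring> $url : getSubmissionTimeValue("url");
        expression<rstring> $user : getSubmissionTimeValue("username");
        expression<rstring> $password : getSubmissionTimeValue("password");
        // Default file name pattern is an assumption; %TIME expands to a timestamp
        expression<rstring> $file : getSubmissionTimeValue("file", "/tmp/write_test_%TIME.txt");
    graph
        // Input: generate ten rstring tuples to be written to HDFS
        stream<rstring line> Input = Beacon() {
            param
                iterations : 10u;
            output
                Input : line = "HDFS and Streams on Bluemix Test: New LINE " + (rstring)IterationCount();
        }

        // Sink: write the tuples to the remote HDFS file; the optional
        // output port reports the name and size of each file written
        stream<rstring fileName, uint64 size> Sink = HDFS2FileSink(Input) {
            param
                file : $file;
                hdfsUri : $url;
                hdfsUser : $user;
                hdfsPassword : $password;
        }

        // Log: print the name and size of the file written by the sink
        () as Log = Custom(Sink) {
            logic
                onTuple Sink :
                    printStringLn("Wrote " + (rstring)size + " bytes to file " + fileName);
        }
}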

Viewing PE console output

The TestWrite application prints the name of the file to the Streams PE console.

View the demo’s output by opening the instance graph, then showing the PE Console of the Log operator. You should see the name and size of the newly created file:

Wrote 450 bytes to file hdfs_test_20151211_172531.txt

Viewing the file written to HDFS

The demo creates write_test_[timestamp].txt on the Hadoop filesystem in the /tmp directory.

To view the file, log onto the Hadoop system via SSH and run:

$ hadoop fs -ls /tmp
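
You can also print the file's contents directly (the timestamped file name will differ on your system):

$ hadoop fs -cat /tmp/write_test_*.txt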

Read from HDFS

After verifying that the HDFS file was written, we'll read the file back into Streams using the TestRead sample.

The TestRead application retrieves a file from HDFS and prints its contents (a minimal SPL sketch follows the list):

[Figure: TestRead application graph]

  • FromHDFS (HDFS2FileSource) reads the file back into Streams from HDFS, and outputs each line as an rstring tuple.
  • PrintFileContents (Custom) prints each incoming rstring to the PE console.
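
A corresponding SPL sketch for the read side is shown below, under the same assumptions about parameter names as the TestWrite sketch:

use com.ibm.streamsx.hdfs::HDFS2FileSource;

composite TestReadSketch {
    param
        expression<rstring> $url : getSubmissionTimeValue("url");
        expression<rstring> $user : getSubmissionTimeValue("username");
        expression<rstring> $password : getSubmissionTimeValue("password");
        expression<rstring> $file : getSubmissionTimeValue("file");
    graph
        // FromHDFS: read the remote file, one line per output tuple
        stream<rstring line> FromHDFS = HDFS2FileSource() {
            param
                file : $file;
                hdfsUri : $url;
                hdfsUser : $user;
                hdfsPassword : $password;
        }

        // PrintFileContents: echo each line to the PE console
        () as PrintFileContents = Custom(FromHDFS) {
            logic
                onTuple FromHDFS :
                    printStringLn("Output read from remote file: " + line);
        }
}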

Launch the application from Streams Studio, specifying the same credentials and url as for the TestWrite application. Set the file parameter to the file name printed by the TestWrite sample.

After launching the TestRead application, view its output in the PE console of the PrintFileContents operator. You should see the file contents – ten numbered lines, each reading:

Output read from remote file: HDFS and Streams on Bluemix Test: New LINE #

Running the samples on Bluemix Streaming Analytics

The demo applications can also be deployed to the Streaming Analytics cloud service.

  1. Create a Streaming Analytics service by following the “Finding the service” section of Introduction to Bluemix Streaming Analytics.
  2. Click Launch in the Streaming Analytics Bluemix dashboard to launch the Streams console.
  3. From the Streams console, select “Submit Job” under the “play” icon.
    [Figure: Streaming Analytics submit job]
  4. Browse for and select the .sab file in your workspace directory: workspace/HDFSBluemixDemo/output/hdfsexample.TestWrite/Distributed
  5. Click Next, enter the HDFS url, username, password, and file name as submission time values, and submit.
  6. After the operators start up, they will show a green circle in the Streams graph view. If they do not, verify your submission time values and resubmit.

After launching the TestWrite application, repeat the steps above to launch the TestRead sample. View its output by loading the Console Log of the PrintFileContents operator.

[Figure: Streaming Analytics log view]
The file is printed “backwards” in the logs because the log viewer shows most recent logs first.

Conclusion

The HDFS for Bluemix operators enable Streams to read and write HDFS files on BigInsights. The HDFS for Bluemix toolkit works both with on-premises Streams installations and with the Streaming Analytics Bluemix service.
