February 26, 2016 | Written by: Alexander Pogue
Categorized: Data Analytics | How-tos
Streams applications can integrate with HDFS in on-premise BigInsights clusters using the streamsx.hdfs toolkit. However, an extra layer of security in the cloud requires a special toolkit to access the BigInsights service in Bluemix. The HDFS for Bluemix toolkit contains Streams operators that can connect through the Knox Gateway. This article shows how to use these operators to read files from and write files to HDFS on Bluemix. The operators work both in a local Streams install and in the Streaming Analytics Bluemix service.
Why integrate with Hadoop?
Integrating with Hadoop in Bluemix extends Streams applications in two ways:
- Reading from Hadoop lets Streams ingest data that is already in Hadoop, as well as new data as it is stored to HDFS.
- Writing to Hadoop prepares the data for future data-at-rest analytics.
Write to HDFS
Import HDFSBluemixDemo into Streams Studio from the streamsx.hdfs/samples/ directory. If the project doesn’t build successfully, make sure you have added the HDFS for Bluemix toolkit to Streams Studio. We’ll start with the TestWrite application in the demo project to demonstrate writing to HDFS.
In Streams Studio, launch the TestWrite application. Fill in the username, password, and url submission time values with your BigInsights credentials and webhdfs url. The webhdfs url will look similar to webhdfs://[Ambari console ip]:8443. Leave the file parameter as-is to write to the /tmp directory in HDFS.
Read more about submission time values here.
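The url value is easy to mistype, so it can help to sanity-check it before submitting the job. The following is a minimal Python sketch and not part of the toolkit: check_webhdfs_url is a hypothetical helper, the address in the usage line is a placeholder, and the rule that a port must be present is an assumption based on the example form shown above.

```python
from typing import Tuple
from urllib.parse import urlsplit


def check_webhdfs_url(url: str) -> Tuple[str, int]:
    """Return (host, port) if `url` has the form webhdfs://host:port.

    Hypothetical helper for illustration only; the toolkit performs
    its own validation when the job starts.
    """
    parts = urlsplit(url)
    if parts.scheme != "webhdfs":
        raise ValueError("expected a URL starting with webhdfs://, got: " + url)
    if not parts.hostname:
        raise ValueError("no host found in: " + url)
    # .port raises ValueError by itself if the port is not a valid number.
    if parts.port is None:
        # Assumption: a port is required (the article's example uses 8443).
        raise ValueError("no port found in: " + url)
    return parts.hostname, parts.port


if __name__ == "__main__":
    # 10.0.0.5 is a placeholder, not a real Ambari console address.
    print(check_webhdfs_url("webhdfs://10.0.0.5:8443"))
```

If the check raises an error, fix the value before entering it in the submission dialog.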
The TestWrite application writes to HDFS:
- Input (Beacon) generates the rstring tuples to be written to the HDFS file.
- Sink (HDFS2FileSink) writes the received tuples to the remote HDFS file.
- Log (Custom) prints the name of the file written by the HDFS2FileSink operator.
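To make the data flow between the three operators concrete, here is a plain-Python analogy. It is not SPL and does not use the toolkit: the functions beacon, file_sink, and log are hypothetical stand-ins, the file is written to a local temporary directory instead of HDFS, and the line format (a fixed prefix followed by a counter starting at 0) is an assumption; the real Beacon expression in the sample may differ.

```python
import os
import tempfile
from typing import Iterable, Iterator, Tuple

# Assumed line format; the sample application may generate different text.
LINE_PREFIX = "HDFS and Streams on Bluemix Test: New LINE "


def beacon(count: int = 10) -> Iterator[str]:
    """Stand-in for Input (Beacon): emit `count` numbered lines."""
    for i in range(count):
        yield LINE_PREFIX + str(i)


def file_sink(lines: Iterable[str], path: str) -> Tuple[str, int]:
    """Stand-in for Sink (HDFS2FileSink): write one tuple per line,
    then report the file name and its size in bytes."""
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for line in lines:
            out.write(line + "\n")
    return os.path.basename(path), os.path.getsize(path)


def log(name: str, size: int) -> str:
    """Stand-in for Log (Custom): format the console message."""
    return "Wrote {} bytes to file {}".format(size, name)


def run_demo(directory: str) -> str:
    """Wire the three stages together, as the application graph does."""
    path = os.path.join(directory, "write_test_local.txt")  # hypothetical name
    name, size = file_sink(beacon(), path)
    return log(name, size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_demo(tmp))  # Wrote 450 bytes to file write_test_local.txt
```

The point of the sketch is the shape of the pipeline: a source produces tuples, the sink persists them and emits the file name and size, and a final stage turns that into a log message.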
Viewing PE console output
The TestWrite application prints the name of the file to the Streams PE console.
To view the demo’s output, open the instance graph and show the PE Console of the Log operator. You should see the name and size of the newly created file:
Wrote 450 bytes to file hdfs_test_20151211_172531.txt
Viewing the file written to HDFS
The demo creates write_test_[timestamp].txt on the Hadoop filesystem, in the directory given by the file parameter (/tmp if you left it unchanged).
To view the file, log onto the Hadoop system via SSH and run:
$ hadoop fs -ls /tmp
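If you prefer to check programmatically, WebHDFS offers a LISTSTATUS operation that returns a directory listing as JSON. Whether and at what address your gateway exposes it depends on your cluster, so the sketch below deliberately skips the HTTP call and only shows how to pick the newest demo file out of such a response. The function name newest_demo_file, the file name prefix default, and the sample entries are illustrative assumptions.

```python
import json
from typing import Optional, Tuple


def newest_demo_file(liststatus_json: str,
                     prefix: str = "write_test_") -> Optional[Tuple[str, int]]:
    """Return (name, length) of the most recently modified file whose
    name starts with `prefix`, or None if there is no such file.

    `liststatus_json` is the body of a WebHDFS LISTSTATUS response.
    """
    statuses = json.loads(liststatus_json)["FileStatuses"]["FileStatus"]
    candidates = [
        s for s in statuses
        if s["type"] == "FILE" and s["pathSuffix"].startswith(prefix)
    ]
    if not candidates:
        return None
    newest = max(candidates, key=lambda s: s["modificationTime"])
    return newest["pathSuffix"], newest["length"]


if __name__ == "__main__":
    # Made-up listing for illustration; real responses carry more fields.
    sample = json.dumps({"FileStatuses": {"FileStatus": [
        {"pathSuffix": "write_test_A.txt", "type": "FILE",
         "length": 450, "modificationTime": 1000},
        {"pathSuffix": "write_test_B.txt", "type": "FILE",
         "length": 450, "modificationTime": 2000},
        {"pathSuffix": "hive", "type": "DIRECTORY",
         "length": 0, "modificationTime": 3000},
    ]}})
    print(newest_demo_file(sample))  # ('write_test_B.txt', 450)
```

Pass a different prefix if your file name does not start with write_test_.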
Read from HDFS
After verifying that the HDFS file was written, we’ll read it back into Streams using the TestRead sample.
The TestRead application retrieves a file from HDFS and prints its contents:
- FromHDFS (HDFS2FileSource) reads the file back into Streams from HDFS, and outputs each line as an rstring tuple.
- PrintFileContents (Custom) prints each incoming rstring to the PE console.
Launch the application from Streams Studio, specifying the same credentials and url as the TestWrite application. Set the file parameter to the same value as was printed by the TestWrite sample.
After launching the TestRead application, view its output in the PE console of the PrintContents operator. You should see the file contents – ten numbered lines, each reading:
Output read from remote file: HDFS and Streams on Bluemix Test: New LINE #
Running the samples on Bluemix Streaming Analytics
The demo applications can also be deployed to the Streaming Analytics cloud service.
- Create a Streaming Analytics service by following the “Finding the service” section of Introduction to Bluemix Streaming Analytics.
- Click Launch in the Streaming Analytics Bluemix dashboard to launch the Streams console.
- From the Streams console, select “Submit Job” under the “play” icon.
- Browse for and select the .sab file in your workspace directory.
- Click Next, enter the HDFS url, username, password, and file name as submission time values, and submit.
- After the operators start up, they will show a green circle in the Streams graph view. If they do not, verify your submission time values and resubmit.
After launching the TestWrite application, repeat the steps above to launch the TestRead sample. View its output by loading the Console Log of the PrintContents operator.
The file is printed “backwards” in the logs because the log viewer shows most recent logs first.
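If you copy the log lines out of the viewer and want them back in file order, reversing them is enough. Here is a small Python sketch, assuming each relevant log line contains the "Output read from remote file: " text shown earlier; restore_file_order is a hypothetical helper, and the layout of anything the console prints before that text is not assumed.

```python
from typing import Iterable, List

# Text that precedes each file line in the sample's console output.
PREFIX = "Output read from remote file: "


def restore_file_order(newest_first_lines: Iterable[str]) -> List[str]:
    """Return the file's lines in original order.

    Input is the console log as displayed (most recent entry first).
    Lines that do not contain PREFIX are skipped; whatever the console
    prints before PREFIX (timestamps and so on) is dropped.
    """
    contents = []
    for line in newest_first_lines:
        idx = line.find(PREFIX)
        if idx == -1:
            continue  # not printed by the sample, ignore
        contents.append(line[idx + len(PREFIX):])
    contents.reverse()  # newest-first becomes oldest-first
    return contents


if __name__ == "__main__":
    pasted = [
        "Output read from remote file: third line",
        "some unrelated console message",
        "Output read from remote file: second line",
        "Output read from remote file: first line",
    ]
    for text in restore_file_order(pasted):
        print(text)
```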
Bluemix HDFS operators enable Streams to read and write HDFS files on BigInsights. The HDFS for Bluemix toolkit works with Streams on-premise and in the Streaming Analytics Bluemix service.