Toolkit com.ibm.streamsx.hdfs 5.2.1


General Information

The streamsx.hdfs toolkit provides operators that can read and write data from Hadoop Distributed File System (HDFS) version 2 or later. It also supports copying files from the local file system to the remote HDFS and from HDFS back to the local file system.

HDFS2FileSource: This operator opens a file on HDFS and sends out its contents in tuple format on its output port.

HDFS2FileSink: This operator writes tuples that arrive on its input port to the output file that is named by the file parameter.

HDFS2DirectoryScan: This operator repeatedly scans an HDFS directory and writes the names of new or modified files that are found in the directory to the output port. The operator sleeps between scans.

HDFS2FileCopy: This operator copies files from the local file system to the remote HDFS and from HDFS to the local file system.
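
For example, HDFS2DirectoryScan and HDFS2FileSource can be combined so that every new file found in an HDFS directory is read line by line. The following is a minimal sketch, assuming a hypothetical main composite and placeholder directory, hdfsUri, and hdfsUser values:


  use com.ibm.streamsx.hdfs::HDFS2DirectoryScan;
  use com.ibm.streamsx.hdfs::HDFS2FileSource;

  composite ScanAndRead {
      graph
          // Emit the name of each new or modified file found in the HDFS directory
          stream<rstring fileName> Scanned = HDFS2DirectoryScan()
          {
              param
                  directory : "/user/hdfs/works";
                  hdfsUri   : "hdfs://your-hdfs-host-ip-address:8020";
                  hdfsUser  : "hdfs";
          }

          // Read each file whose name arrives on the input port, one line per output tuple
          stream<rstring line> Lines = HDFS2FileSource(Scanned)
          {
              param
                  hdfsUri  : "hdfs://your-hdfs-host-ip-address:8020";
                  hdfsUser : "hdfs";
          }
  }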

The operators in this toolkit use Hadoop Java APIs to access HDFS and WEBHDFS. The operators support the following versions of Hadoop distributions:
  • Apache Hadoop version 2.7
  • Apache Hadoop version 3.0 or higher
  • Cloudera distribution including Apache Hadoop version 4 (CDH4) and version 5 (CDH5)
  • Hortonworks Data Platform (HDP) version 2.6 or higher
  • Hortonworks Data Platform (HDP) version 3.0 or higher
  • IBM Analytics Engine 1.1 (HDP 2.7)
  • IBM Analytics Engine 1.2 (HDP 3.1)

Note: The reference platforms used for testing are Apache Hadoop 2.7.3 and HDP.

You can access Hadoop remotely by specifying the webhdfs://hdfshost:webhdfsport scheme in the URI that you use to connect to WEBHDFS.

For example:


  () as lineSink1 = HDFS2FileSink(LineIn)
  {
         param
              hdfsUri       : "webhdfs://your-hdfs-host-ip-address:8443";
              hdfsUser      : "clsadmin";
              hdfsPassword  : "PASSWORD";
              file          : "LineInput.txt";
  }

Or "hdfs://your-hdfs-host-ip-address:8020" as hdfsUri


  () as lineSink1 = HDFS2FileSink(LineIn)
  {
         param
              hdfsUri  : "hdfs://your-hdfs-host-ip-address:8020";
              hdfsUser : "hdfs";
              file     : "LineInput.txt";
  }

The following example copies the file test.txt from the local path ./data/ into the /user/hdfs/works directory of HDFS. The credentials parameter is a JSON string that contains the user, the password, and the webhdfs URL; a sketch of a possible value follows the example.


  stream<boolean succeed> copyFromLocal = HDFS2FileCopy()
  {
         param
              localFile                : "test.txt";
              hdfsFile                 : "/user/hdfs/works/test.txt";
              deleteSourceFile         : false;
              overwriteDestinationFile : true;
              direction                : copyFromLocalFile;
              credentials              : $credentials;
  }
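
The value of $credentials can be supplied, for example, as a composite parameter. The following is a minimal sketch, assuming a hypothetical main composite named CopyToHdfs; the user, password, and host values are the placeholders from the first example:


  use com.ibm.streamsx.hdfs::HDFS2FileCopy;

  composite CopyToHdfs {
      param
          // Credentials JSON with the user, password, and webhdfs keys (placeholder values)
          expression<rstring> $credentials :
              "{ \"user\": \"clsadmin\", \"password\": \"PASSWORD\","
            + " \"webhdfs\": \"webhdfs://your-hdfs-host-ip-address:8443\" }";

      graph
          // Copy test.txt from the local file system into /user/hdfs/works on HDFS
          stream<boolean succeed> copyFromLocal = HDFS2FileCopy()
          {
              param
                  localFile                : "test.txt";
                  hdfsFile                 : "/user/hdfs/works/test.txt";
                  deleteSourceFile         : false;
                  overwriteDestinationFile : true;
                  direction                : copyFromLocalFile;
                  credentials              : $credentials;
          }
  }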

Kerberos configuration

For Apache Hadoop 2.x, CDH, and HDP, you can optionally configure these operators to use the Kerberos protocol to authenticate users that read and write to HDFS.

Kerberos authentication provides a more secure way of accessing HDFS by providing user authentication.

To use Kerberos authentication, you must configure the authPrincipal and authKeytab operator parameters at compile time.

The authPrincipal parameter specifies the Kerberos principal, which is typically the principal that is created for the Streams instance owner.

The authKeytab parameter specifies the keytab file that is created for the principal.

Kerberos authentication requires that a principal and a keytab be created for each user.

If you use Ambari to configure your Hadoop server, you can create principals and keytabs via Ambari (Enable Kerberos).

More details about Kerberos configuration:


  https://developer.ibm.com/hadoop/2016/08/18/overview-of-kerberos-in-iop-4-2/

Copy the created keytab to the local Streams server, for example into the etc directory of your SPL application.

Before you start your SPL application, you can check the keytab with the kinit tool:


  kinit -k -t KeytabPath Principal

KeytabPath is the full path to the keytab file.

For example:


  kinit -k -t /home/streamsadmin/workspace/myproject/etc/hdfs.headless.keytab hdfs-hdp2@HDP2.COM

In this case, HDP2.COM is the Kerberos realm and the user is hdfs.

Here is an SPL example that writes a file to the Hadoop server with Kerberos configuration.


    () as lineSink1 = HDFS2FileSink(LineIn)
    {
        param
            authKeytab     : "etc/hdfs.headless.keytab" ;
            authPrincipal  : "hdfs-hdp2@HDP2.COM" ;
            configPath     : "etc" ;
            file           : "LineInput.txt" ;
    }

The HDFS configuration file core-site.xml has to be copied into the local etc directory.

Version: 5.2.1
Required Product Version: 4.2.0.0
