SAS software with Hortonworks Data Platform (HDP) running on IBM POWER8

Access and analyze big data stored on HDP using BASE SAS and SAS/ACCESS Interface to Hadoop

SAS/ACCESS Interface to Hadoop

SAS/ACCESS Interface to Hadoop provides the ability to access data stored in Hadoop natively from SAS. With SAS/ACCESS Interface to Hadoop (a minimal sketch of each interface follows the list):

  • The LIBNAME statement can be used to make Hive tables look like SAS data sets, against which SAS procedures and SAS DATA steps can run.
  • PROC SQL provides the ability to run Hive SQL queries directly on a Hortonworks Data Platform (HDP) cluster.
  • PROC HADOOP provides the ability to submit MapReduce, Apache Pig, and HDFS commands directly from the SAS execution environment to an HDP cluster.
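
A minimal sketch of the three interfaces is shown below. The server name, credentials, and table name (hdp.example.com, some_table) are placeholders, not values from this setup; complete, tested examples appear later in this article.

/* 1. LIBNAME: surface Hive tables as SAS data sets */
libname myhdp hadoop server='hdp.example.com' schema='default' user=hive;

/* 2. PROC SQL: run Hive SQL against the cluster */
proc sql;
   select count(*) from myhdp.some_table;
quit;

/* 3. PROC HADOOP: submit HDFS commands (and MapReduce or Pig jobs) */
proc hadoop username='hdfs' password='xxxxxxxx';
   hdfs mkdir='/user/hdfs/scratch';
run;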

Objectives

Key objectives of the validation and testing of BASE SAS and SAS/ACCESS Interface to Hadoop with an HDP cluster running on an IBM POWER8 processor-based server are:

  • Configure SAS/ACCESS Interface to Hadoop to interface with HDP running on Red Hat Enterprise Linux (RHEL) on IBM® Power Systems™
  • Validate that BASE SAS can connect to HDP through SAS/ACCESS Interface to Hadoop
  • Demonstrate that BASE SAS can access and analyze data stored in HDP using various interfaces described above
  • Demonstrate that traditional SAS applications running on IBM AIX® on Power can leverage the distributed processing architecture and scalability of HDP running on IBM Power® servers

Test environment

High level components of the test environment include:

BASE SAS and SAS/ACCESS Interface to Hadoop

  • BASE SAS 9.4 (TS1M4)
  • SAS/ACCESS for Hadoop 9.4 (TS1M4)
  • AIX 7.2 (7200-00-01-1543)
  • Minimum resources: two virtual processors, 4 GB memory, 35 GB disk space
  • IBM PowerVM®
  • IBM POWER8® processor-based server

Hortonworks Data Platform

  • Hortonworks Data Platform (HDP) version 2.5 (pre-GA release)
  • RHEL 7.2
  • Minimum resources: 16 virtual processors, 48 GB memory, 50 GB disk space
  • IBM PowerKVM™
  • IBM POWER8 processor-based server

Deployment architecture

Figure 1 describes the deployment and high-level architecture used to validate SAS software with HDP running on Power. The SAS software, consisting of BASE SAS and SAS/ACCESS Interface to Hadoop, was installed and configured on a virtual machine [or a logical partition (LPAR) in Power Systems terminology] running the IBM AIX 7.2 OS on an IBM POWER8 processor-based server. A single-node cluster of HDP was installed and configured on a virtual machine on RHEL 7.2 on a second IBM POWER8 processor-based server.

SAS/ACCESS Interface to Hadoop is configured to interface with and access data on HDP. Note that the installation and configuration of BASE SAS and SAS/ACCESS Interface to Hadoop are transparent to the number of nodes in the HDP cluster.

Figure 1. Architecture for BASE SAS and SAS/ACCESS Interface to Hadoop with HDP running on IBM Power servers

Installation and configuration

This section covers the installation and configuration of an HDP cluster and the SAS software: BASE SAS and SAS/ACCESS Interface to Hadoop.

Installing the HDP cluster

High-level steps to install and configure the HDP cluster:

  1. Follow the installation guide for HDP on Power Systems (see References) to install and configure the HDP cluster.
  2. Log in to the Ambari server and ensure that all the services are running (see the example check after this list).
  3. Monitor and manage the HDP cluster, Hadoop, and related services through Ambari.
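
Beyond the Ambari web UI, service states can also be checked from the command line through the Ambari REST API. The following is a hypothetical example; the admin credentials, host name, and cluster name are placeholders for your own values.

# curl -u admin:admin 'http://ambari-server:8080/api/v1/clusters/mycluster/services?fields=ServiceInfo/state'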

Setting up test data and Hive tables

Download the MovieLens and driver test data, copy the data to HDFS, and create Hive tables.

  1. Download the MovieLens data set (see the citation in References).
  2. Follow the instructions (see References) to copy the MovieLens data set to HDFS and set up Hive external tables. Use the hive user ID for these operations.
  3. Download the driver data file (see References).
  4. Copy the driver data to HDFS.
    # su - hive
    # hadoop fs -mkdir -p /user/hive/dataset/drivers
    # hadoop fs -copyFromLocal /home/np/u0014213/Data/truck_event_text_partition.csv /user/hive/dataset/drivers
    # hadoop fs -copyFromLocal /home/np/u0014213/Data/drivers.csv /user/hive/dataset/drivers
    # hadoop fs -ls /user/hive/dataset/drivers
    Found 2 items
    -rw-r--r--   3 hive hdfs       2043 2017-05-21 06:30 /user/hive/dataset/drivers/drivers.csv
    -rw-r--r--   3 hive hdfs    2272077 2017-05-21 06:30 /user/hive/dataset/drivers/truck_event_text_partition.csv
  5. Create Hive tables for driver data.
    # su - hive
    # hive
    hive> create database trucks;
    hive> use trucks;
    
    hive> create table drivers
    (driverId int,
    name string,
    ssn bigint,
    location string,
    certified string,
    wageplan string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    TBLPROPERTIES("skip.header.line.count"="1");
    
    hive> create table truck_events
    (driverId int,
    truckId int,
    eventTime string,
    eventType string,
    longitude double,
    latitude double,
    eventKey string,
    correlationId bigint,
    driverName string,
    routeId int,
    routeName string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    TBLPROPERTIES("skip.header.line.count"="1");
    
    hive> show tables;
    OK
    drivers
    truck_events
  6. Load the data into the tables from the files in HDFS.
    hive> LOAD DATA INPATH '/user/hive/dataset/drivers/truck_event_text_partition.csv' overwrite into table truck_events;
    hive> LOAD DATA INPATH '/user/hive/dataset/drivers/drivers.csv' overwrite into table drivers;
  7. Cross-check the tables to ensure that the data is present by running queries on the tables, as shown below.
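    For example, the following queries from the Hive CLI confirm that both tables are populated (the returned counts depend on your copy of the data):

    hive> use trucks;
    hive> select count(*) from drivers;
    hive> select driverid, name from drivers limit 5;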

Install BASE SAS and SAS/ACCESS Interface to Hadoop

Perform the following steps to install BASE SAS and SAS/ACCESS Interface to Hadoop on AIX 7.2:

  1. Obtain the software and dependencies from SAS. The software order contains all the details on how to download the software, along with the license file.
  2. Install, configure, and start a VNC (Virtual Network Computing) server on the AIX LPAR and connect to the VNC session.
  3. Download the SAS Download Manager for AIX.
  4. Launch the SAS Download Manager and provide the order number and installation key received from SAS. The download manager downloads all the software associated with the order number.
  5. Install the software:
    1. Launch the SAS Deployment Wizard by running setup.sh from the software download directory.
    2. Select SASHome, the location where the SAS software will be installed.
    3. Select the necessary options wherever required and complete the installation.
    4. Note that $SASHOME/SASFoundation/9.4/sas is the executable used to run SAS code.
Figure 2. Installing BASE SAS and SAS/ACCESS Interface to Hadoop using SAS Deployment Wizard
Figure 3. SASHome after completion of software installation

Configure BASE SAS and SAS/ACCESS Interface to Hadoop

After BASE SAS and SAS/ACCESS Interface to Hadoop are installed, follow the SAS 9.4 Hadoop Configuration Guide for Base SAS and SAS/ACCESS for instructions on how to configure SAS/ACCESS Interface to Hadoop to connect to the HDP cluster.

  • Verify the Hadoop environment as described in chapter 1 of the guide.
  • Follow chapter 4 of the guide, Configuring SAS/ACCESS for Hadoop, for detailed instructions.
  • To use SAS/ACCESS with a Hadoop server, a set of Hadoop JAR and configuration files must be available to the SAS client system. Refer to the Making Hadoop JAR and Configuration Files Available to the SAS Client Machine section for the details.
    • Launch the SAS Deployment Manager by running the following command: $SASHOME/SASDeploymentManager/9.4/sasdm.sh
    • In the wizard, select Ambari as the cluster manager and provide the Ambari server and access details.
    • Provide the Hadoop cluster details, along with access information in the form of a Secure Shell (SSH) private key to access the cluster nodes.
    • Specify the locations for the Hadoop configuration files (for example, /hadoop/conf) and Hadoop JAR files (for example, /hadoop/lib) on the SAS client system.
    • Select the Add environment variables option, which updates the SAS configuration file sasv9.cfg with the path names for the Hadoop configuration and JAR files through the SAS_HADOOP_CONFIG_PATH and SAS_HADOOP_JAR_PATH environment variables.
    • Provide the Hive service details, along with the Hive user name and password.
    • Follow the wizard to complete the configuration.
Figure 4. SAS Deployment Manager to configure SAS/ACCESS Interface to Hadoop
Figure 5. Snippet of Hadoop configuration and JAR file paths after completing the configuration
Figure 6. Snippet of SAS configuration file with Hadoop configuration and JAR file paths updated
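
For reference, a hypothetical snippet of the updated sasv9.cfg, using the example paths from the steps above (the actual paths depend on the locations chosen in the wizard):

/* Hypothetical sasv9.cfg excerpt after configuration */
-SET SAS_HADOOP_JAR_PATH "/hadoop/lib"
-SET SAS_HADOOP_CONFIG_PATH "/hadoop/conf"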

Access and analyze data stored on HDP using BASE SAS

After completing the installation and configuration steps described above, you can use BASE SAS to access and analyze data stored on HDP. Because this article covers validation of SAS/ACCESS Interface to Hadoop with HDP, we exercised some of the key interfaces for accessing data and performed basic analysis. You can explore beyond the examples in the following sections.

Accessing data on HDFS

Use SAS PROC HADOOP and HDFS statements to access data stored in HDFS from BASE SAS. The following sample SAS code describes how to connect to Hadoop, write data to HDFS, and access data stored on HDFS. You can run the SAS code using the SAS command-line interface (CLI) on the SAS client system.

# cat proc-hadoop-test.sas

proc hadoop username='hdfs' password='xxxxxxxx' verbose;

   /* Create a directory on HDFS */
   hdfs mkdir='/user/hdfs/sasdata';

   /* Delete a file or directory on HDFS */
   hdfs delete='/user/hdfs/test1';

   /* Copy a file from HDFS to the local file system */
   hdfs copytolocal='/user/hdfs/test/fromHDP.txt'
       out='/home/np/SASCode/fromHDP.txt';

   /* Copy a local file to HDFS */
   hdfs copyfromlocal='/home/np/SASCode/fromSAS.txt'
       out='/user/hdfs/sasdata/fromSAS.txt';
run;

# /SASHome/SASFoundation/9.4/sas proc-hadoop-test.sas

After running the code, check the log file for the status of execution.

# cat proc-hadoop-test.log
1 The SAS System 00:26
Saturday, April 22, 2017
NOTE: Copyright (c) 2002-2012 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software 9.4 (TS1M4)
Licensed to IBM CORP - FOR DEMO PURPOSES ONLY, Site 70219019.
NOTE: This session is executing on the AIX 7.2 (AIX 64) platform.
NOTE: Additional host information:
IBM AIX AIX 64 2 7 00F9C48F4C00
You are running SAS 9. Some SAS 8 files will be automatically converted by the V9 engine; others are incompatible. Please see
http://support.sas.com/rnd/migration/planning/platform/64bit.html
PROC MIGRATE will preserve current SAS file attributes and is
recommended for converting all your SAS libraries from any
SAS 8 release to SAS 9. For details and examples, please see
http://support.sas.com/rnd/migration/index.html
This message is contained in the SAS news file, and is presented upon initialization. Edit the file "news" in the "misc/base" directory to display site-specific news and information in the program log.
The command line option "-nonews" will prevent this display.

NOTE: SAS initialization used:
real time 0.03 seconds
cpu time 0.00 seconds
1 proc hadoop username='hdfs' password=XXXXXXXXXX verbose;
2 hdfs mkdir='/user/hdfs/sasdata';
3
4 hdfs delete='/user/hdfs/test1';
5
6 hdfs copytolocal='/user/hdfs/test/fromHDP.txt'
7 out='/home/np/SASCode/fromHDP.txt';
8
9 hdfs copyfromlocal='/home/np/SASCode/fromSAS.txt'
10 out='/user/hdfs/sasdata/fromSAS.txt';
11 run;
NOTE: PROCEDURE HADOOP used (Total process time):
real time 4.49 seconds
cpu time 0.05 seconds
NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
real time 4.53 seconds
cpu time 0.05 seconds

Check the contents of the local directory and HDFS to confirm that the SAS code ran successfully.

# hadoop fs -ls /user/hdfs/sasdata/
Found 1 items
-rw-r--r-- 3 hdfs hdfs 30 2017-04-22 01:28 /user/hdfs/sasdata/fromSAS.txt
[hdfs@hdpnode1 ~]$ hadoop fs -cat /user/hdfs/sasdata/fromSAS.txt
Hello HDP user! Good morning!

# ls -ltr
total 136
-rw-r--r-- 1 root system 333 Apr 22 00:25 proc-hadoop-test.sas
-rw-r--r-- 1 root system 23 Apr 22 00:26 fromHDP.txt
-rw-r--r-- 1 root system 2042 Apr 22 00:26 proc-hadoop-test.log
l48fvp038_pub[/home/np/SASCode] > cat fromHDP.txt
Hello SAS user! Howdy?

Run MapReduce programs on HDP from BASE SAS

You can run MapReduce programs against data stored on HDP from BASE SAS using the PROC HADOOP procedure. For example, if you have a large data set that needs pre-processing before the actual analysis in SAS, you can write a MapReduce program to perform that pre-processing on the HDP cluster.

The following sample SAS code copies a text file to HDFS and runs the WordCount MapReduce program.

$ cat word-count-mr-ex.sas
proc hadoop username='hdfs' password='xxxxxxx' verbose;

   /* Copy the input text file from the client machine to HDFS */
   hdfs copyfromlocal='/home/np/SASCode/wordcount-input.txt'
             out='/user/hdfs/sasdata/wordcount-input.txt';

   /* MAPREDUCE statement to run the WordCount example */
   mapreduce input='/user/hdfs/sasdata/wordcount-input.txt'
      output='/user/hdfs/sasdata/wcout'
      jar='/hadoop/lib/hadoop-mapreduce-examples-2.7.3.2.5.0.0-1245.jar'
      outputkey='org.apache.hadoop.io.Text'
      outputvalue='org.apache.hadoop.io.IntWritable'
      reduce='org.apache.hadoop.examples.WordCount$IntSumReducer'
      combine='org.apache.hadoop.examples.WordCount$IntSumReducer'
      map='org.apache.hadoop.examples.WordCount$TokenizerMapper';
run;

# /SASHome/SASFoundation/9.4/sas word-count-mr-ex.sas

Run the sample code and check the SAS log file to ensure that the code ran successfully. Also, check the contents of the HDFS directory to verify the input file contents and output from the WordCount MapReduce program.

# hadoop fs -ls /user/hdfs/sasdata/
drwxr-xr-x - hdfs hdfs 0 2017-04-22 01:47 /user/hdfs/sasdata/wcout
-rw-r--r-- 3 hdfs hdfs 268 2017-04-22 01:46 /user/hdfs/sasdata/wordcount-input.txt

# hadoop fs -cat /user/hdfs/sasdata/wordcount-input.txt
This PROC HADOOP example submits a MapReduce program to a Hadoop server. The example uses the Hadoop MapReduce
application WordCount that reads a text input file, breaks each line into words, counts the words, and then writes
the word counts to the output text file.

# hadoop fs -ls /user/hdfs/sasdata/wcout
Found 2 items
-rw-r--r-- 3 hdfs hdfs 0 2017-04-22 01:47 /user/hdfs/sasdata/wcout/_SUCCESS
-rw-r--r-- 3 hdfs hdfs 270 2017-04-22 01:47 /user/hdfs/sasdata/wcout/part-r-00000

# hadoop fs -cat /user/hdfs/sasdata/wcout/part-r-00000
HADOOP 1
Hadoop 2
MapReduce 2
PROC 1
The 1
This 1
WordCount 1
a 3
and 1
application 1
breaks 1
counts 2
each 1
example 2
file, 1
file. 1
input 1
into 1
line 1
output 1
program 1
reads 1
server. 1
submits 1
text 2
that 1
the 4
then 1
to 2
uses 1
word 1
words, 2
writes 1
Figure 7. WordCount MapReduce job history from the HDP Job History server

Access Hive tables as SAS data sets

Use the LIBNAME statement to access Hive tables as SAS data sets, and the PROC SQL procedure to run Hive queries against the HDP cluster through JDBC.

Refer to the following sample SAS code that uses LIBNAME and PROC SQL to access Hive tables and run queries and analysis from BASE SAS.

$ cat sas-hdp-hive-access.sas

/* LIBNAME statement to connect to HiveServer2 on the HDP cluster */
libname myhdp hadoop server='hdpnode1.dal-ebis.ihost.com' schema='default' user=hive;

/* List the tables */
proc datasets lib=myhdp details;
run;

/* Get the schema for all tables */
proc contents data=myhdp._all_;
run;

/* Run queries and analysis using the PROC SQL procedure */
proc sql;
        select count(*) from MYHDP.TRUCK_EVENTS;
quit;

/* Run the PROC FREQ statistical procedure against HDP data
   that is not available as a SAS data set */
proc freq data=MYHDP.TRUCK_EVENTS;
   tables eventtype;
run;
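
Because the LIBNAME engine surfaces Hive tables as SAS data sets, ordinary DATA steps also work against them. A minimal sketch, assuming the myhdp libref defined above (the WHERE value is illustrative only):

/* Sketch: pull a filtered subset of the Hive table into a local SAS data set */
data work.normal_events;
   set myhdp.truck_events;        /* Hive table accessed through the libref */
   where eventtype = 'Normal';    /* filter is pushed to Hive where possible */
run;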

Check the SAS log file for a list of tables. See the following excerpt from the log file listing the tables from the default Hive schema.

3          proc datasets lib=myhdp details;
                                                             
Libref         MYHDP
Engine         HADOOP
Physical Name  jdbc:hive2://hdpnode1.dal-ebis.ihost.com:10000/default
Schema/Owner   default

                 Member    Obs, Entries
#  Name          Type       or Indexes   Vars  Label
1  DRIVERS       DATA         .          6
2  EXPORT_TABLE  DATA         .          2
3  EXTENSION     DATA         .          3
4  HOSTS         DATA         .         16
5  HOSTS1        DATA         .         16
6  HOSTS2        DATA         .         16
7  HOSTS3        DATA         .         16
8  TEST2         DATA         .          2
9  TRUCK_EVENTS  DATA         .         11

Check the listing file (with the .lst extension) for the output of the statistical procedures. For example, the following excerpt from the listing file shows all the table names and the details for the truck_events table.

The SAS System           02:19 Saturday, April 22, 2017   1

		The CONTENTS Procedure

Libref         MYHDP
Engine         HADOOP
Physical Name  jdbc:hive2://hdpnode1.dal-ebis.ihost.com:10000/default
Schema/Owner   default

#  Name          Type
1  DRIVERS       DATA
2  EXPORT_TABLE  DATA
3  EXTENSION     DATA
4  HOSTS         DATA
5  HOSTS1        DATA
6  HOSTS2        DATA
7  HOSTS3        DATA
8  TEST2         DATA
9  TRUCK_EVENTS  DATA


          The CONTENTS Procedure	

Data Set Name        MYHDP.TRUCK_EVENTS Observations          .
Member Type          DATA               Variables             11
Engine               HADOOP             Indexes               0
Created              .                  Observation Length    0
Last Modified        .                  Deleted Observations  0
Protection                              Compressed            NO
Data Set Type                           Sorted                NO
Label
Data Representation  Default
Encoding             Default

Alphabetic List of Variables and Attributes
#    Variable         Type      Len    Format     Informat    Label
8    correlationid    Num         8    20.        20.         correlationid
1    driverid         Num         8    11.        11.         driverid
9    drivername       Char    32767    $32767.    $32767.     drivername
7    eventkey         Char    32767    $32767.    $32767.     eventkey
3    eventtime        Char    32767    $32767.    $32767.     eventtime
4    eventtype        Char    32767    $32767.    $32767.     eventtype
6    latitude         Num         8                           latitude
5    longitude        Num         8                           longitude
10   routeid          Num         8    11.        11.         routeid
11   routename        Char    32767    $32767.    $32767.     routename
2    truckid          Num         8    11.        11.         truckid

The listing file also shows the output from statistical procedure PROC FREQ on the eventtype column of the truck_events table. See the following excerpt from the listing file.

                    The FREQ Procedure

                          eventtype

                                                 Cumulative    Cumulative
eventtype                    Frequency  Percent   Frequency       Percent
--------------------------------------------------------------------------
Lane Departure                      11     0.06          11          0.06
Normal                           17041    99.80       17052         99.87
Overspeed                            9     0.05       17061         99.92
Unsafe following distance            7     0.04       17068         99.96
Unsafe tail distance                 7     0.04       17075        100.00

Accessing Hive tables using explicit pass-through

Use the PROC SQL procedure along with the CONNECT TO HADOOP and EXECUTE statements to run Hive queries in explicit pass-through mode. This bypasses JDBC and is therefore faster. This interface is helpful for streaming reads from Hive/HDFS.

The following sample code demonstrates this interface.

# cat sas-hdp-hive-access-explicit-passthru.sas
/* Enable PROC SQL to pass SQL directly to the DBMS where possible */
options dbidirectexec;

proc sql;
   /* Connect to Hadoop/HiveServer2 */
   connect to hadoop (server='xxxxx.xxxxxxxx.com' user=hive subprotocol=hive2);

   /* Execute a HiveQL query to create an external Hive table */
   execute ( CREATE EXTERNAL TABLE movie_ratings (
        userid INT,
        movieid INT,
        rating INT,
        tstamp STRING
        ) ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '#'
        STORED AS TEXTFILE
        LOCATION '/user/hive/sasdata/movie_ratings') by hadoop;

   disconnect from hadoop;
quit;

proc hadoop username='hive' password='xxxxxx' verbose;

   /* Copy data from the client to the HDFS location for the movie_ratings table */
   hdfs copyfromlocal='/home/np/SASCode/ratings.txt'
        out='/user/hive/sasdata/movie_ratings';
run;

Log in to Hive on the HDP cluster and verify that the movie_ratings table is created and has the data.

# su - hive
# hive

hive> show tables;
OK
movie_ratings

hive> desc movie_ratings;
OK
userid int
movieid int
rating int
tstamp string
Time taken: 0.465 seconds, Fetched: 4 row(s)

hive> select count(*) from movie_ratings;
Query ID = hive_20170422042813_e9e49803-144a-48e9-b0f6-f5cd8595d254
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1492505822201_0034)
Map 1: -/- Reducer 2: 0/1
Map 1: 0/2 Reducer 2: 0/1
Map 1: 0/2 Reducer 2: 0/1
Map 1: 0(+1)/2 Reducer 2: 0/1
Map 1: 0(+2)/2 Reducer 2: 0/1
Map 1: 0(+2)/2 Reducer 2: 0/1
Map 1: 2/2 Reducer 2: 0(+1)/1
Map 1: 2/2 Reducer 2: 1/1
OK
1000209
Time taken: 14.931 seconds, Fetched: 1 row(s)
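
The same check can also be run from BASE SAS through explicit pass-through. A sketch, using the same masked server name as in the example above:

proc sql;
   connect to hadoop (server='xxxxx.xxxxxxxx.com' user=hive subprotocol=hive2);
   /* Read back the row count through the pass-through connection */
   select * from connection to hadoop
      (select count(*) as ratings_count from movie_ratings);
   disconnect from hadoop;
quit;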

References

