SAS software with Hortonworks Data Platform (HDP) running on IBM POWER8
Access and analyze big data stored on HDP using BASE SAS and SAS/ACCESS Interface to Hadoop
SAS/ACCESS Interface to Hadoop
SAS/ACCESS Interface to Hadoop provides the ability to access data sets stored in Hadoop, in SAS natively. With SAS/ACCESS to Hadoop:
- LIBNAME statement can be used to make Hive tables look like SAS data sets, on which SAS procedures and SAS DATA steps can run.
- PROC SQL commands provide the ability to run direct Hive SQL commands on a Hortonworks Data Platform (HDP) cluster.
- PROC HADOOP provides the ability to directly submit MapReduce, Apache Pig, and HDFS commands from the SAS execution environment to a HDP cluster.
Key objectives of the validation and testing of BASE SAS and SAS/ACCESS Interface to Hadoop with HDP cluster running on IBM POWER8 processor-based server are:
- Configure SAS/ACCESS Interface to Hadoop to interface with HDP running on Red Hat Enterprise Linux (RHEL) on IBM® Power Systems™
- Validate that BASE SAS can connect to HDP through SAS/ACCESS Interface to Hadoop
- Demonstrate that BASE SAS can access and analyze data stored in HDP using various interfaces described above
- Demonstrate that traditional SAS applications running on IBM AIX® on Power can leverage the distributed processing architecture and scalability of HDP running on IBM Power® servers
High level components of the test environment include:
BASE SAS and SAS/ACCESS Interface to Hadoop
- BASE SAS 9.4 (TS1M4)
- SAS/ACCESS for Hadoop 9.4 (TS1M4)
- AIX 7.2 (7200-00-01-1543)
- Minimum resources: two virtual processors, 4 GB memory, 35 GB disk space
- IBM PowerVM®
- IBM POWER8® processor-based server
Hortonworks Data Platform
- Hortonworks DataPlatform(HDP) version 2.5 (pre-GA release)
- RHEL 7.2
- Minimum resources: 16 virtual processor, 48 GB memory, 50 GB disk space
- IBM PowerKVM™
- IBM POWER8 processor-based server
Figure 1 describes the deployment and high-level architecture used to validate SAS software with HDP running on Power. The SAS software, consisting of BASE SAS and SAS/ACCESS Interface to Hadoop, was installed and configured on a virtual machine [or a logical partition (LPAR) in Power Systems terminology] running the IBM AIX 7.2 OS on an IBM POWER8 processor-based server. A single-node cluster of HDP was installed and configured on a virtual machine on RHEL 7.2 on a second IBM POWER8 processor-based server.
SAS/ACCESS Interface to Hadoop is configured to interface with and access data on HDP. Note that the installation and configuration of BASE SAS and SAS/ACCESS Interface to Hadoop is transparent to the number of nodes in the HDP cluster.
Figure 1. Architecture for BASE SAS and SAS/ACCESS Interface to Hadoop with HDP running on IBM Power servers
Installation and configuration
The section covers installation and configuration of a HDP cluster and SAS software – BASE SAS and SAS/ACESS Interface to Hadoop.
Installing the HDP cluster
High level steps to install and configure the HDP cluster:
- Follow the installation guide for HDP on Power Systems (given under references) to install and configure the HDP cluster.
- Log in to the Ambari server and ensure that all the services are running.
- Monitor and manage the HDP cluster, Hadoop, and related services through Ambari.
Setting up test data and Hive tables
Download the MovieLens and driver test data, copy the data to HDFS, and create Hive tables.
- Download the MovieLens data set from here (see the citation in References)
- Follow the instructions here to copy the MovieLens dataset data to HDFS and set up Hive external tables. Use hive user ID for the same.
- Download the driver data file from here.
- Copy the driver data to HDFS.
# su – hive # hadoop fs -mkdir -p /user/hive/dataset/drivers # hadoop fs -copyFromLocal /home/np/u0014213/Data/truck_event_text_partition.csv /user/hive/dataset/drivers # hadoop fs -copyFromLocal /home/np/u0014213/Data/drivers.csv /user/hive/dataset/drivers # hadoop fs -ls /user/hive/dataset/drivers Found 2 items -rw-r--r-- 3 hive hdfs 2043 2017-05-21 06:30 /user/hive/dataset/drivers/drivers.csv -rw-r--r-- 3 hive hdfs 2272077 2017-05-21 06:30 /user/hive/dataset/drivers/truck_event_text_partition.csv
- Create Hive tables for driver data.
# su – hive # hive hive>create database trucks; hive> use trucks; hive> create table drivers (driverId int, name string, ssn bigint, location string, certified string, wageplan string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE TBLPROPERTIES("skip.header.line.count"="1"); hive> create table truck_events (driverId int, truckId int, eventTime string, eventType string, longitude double, latitude double, eventKey string, correlationId bigint, driverName string, routeId int, routeName string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE TBLPROPERTIES("skip.header.line.count"="1"); hive> show tables; OK drivers truck_events
- Load the data into the tables from the files in HDFS.
hive> LOAD DATA INPATH '/user/hive/dataset/drivers/truck_event_text_partition.csv' overwrite into table truck_events; hive> LOAD DATA INPATH '/user/hive/dataset/drivers/drivers.csv' overwrite into table drivers;
- Cross check the tables to ensure that the data is present by running queries on the tables.
Install BASE SAS and SAS/ACCESS Interface to Hadoop
Perform the following steps to install BASE SAS and SAS/ACCESS Interface to Hadoop on AIX 7.2
- Obtain the software and dependencies from SAS. The software order contains all the details on how to download the software, along with the license file.
- Install, configure, and start VNC (Virtual Network Computing) Server on AIX LPAR and connect to the VNC session.
- Download SAS Download manager for AIX.
- Launch SAS Download Manager, provide the order number and installation key received from SAS. The download manager will download all the software that is tagged with the order number.
- Install the software:
- Launch the SAS Deployment Wizard by running setup.sh from the software download directory.
- Select SASHome. This is the location where SAS software is installed.
- Select the necessary options wherever required and complete the installation.
- Notice that $SASHOME/SASFoundation/9.4/sas is the executable to run SAS code.
Figure 2. Installing BASE SAS and SAS/ACCESS Interface to Hadoop using SAS Deployment Wizard
Figure 3. SASHome after completion of software installation
Configure BASE SAS and SAS/ACCESS Interface to Hadoop
After BASE SAS and SAS/ACCESS Interface to Hadoop are installed, follow the guide SAS 9.4 Hadoop Configuration Guide for Base SAS and SAS/ACCESS for instructions on how to configure SAS/ACCESS Interface to Hadoop to connect to HDP cluster.
- Verify the Hadoop environment as described in chapter 1 of the guide.
- Follow chapter 4 of the guide, Configuring SAS/ACCESS for Hadoop, for detailed instructions.
- To use SAS/ACCESS with a Hadoop server, a set of Hadoop JAR and
configuration files must be available to the SAS client system. Refer to the
Making Hadoop JAR and Configuration Files Available to the SAS
Client Machine section for the details.
- Launch the SAS Deployment Wizard by running the following command: $SASHOME/SASDeploymentManager/9.4/sasdm.sh
- In the wizard, select AMBARI as the cluster manager and provide AMBARI server and access details.
- Provide Hadoop cluster details, along with access information in the form of Secure Shell (SSH) private key to access cluster nodes.
- Specify the location for Hadoop configuration (for example, /hadoop/conf) and Hadoop JAR files (for example, /hadoop/lib) on the SAS client system.
- Select the Add environment variables option which updates the SAS configuration file sasv9.cfg with path names for Hadoop configuration and JAR files. The sasv9.cfg file will be updated with SAS_HADOOP_JAR_PATH and SAS_HADOOP_CONFIG_PATH environment variables.
- Provide Hive service details along with Hive user name and password.
- Follow the wizard to complete the configuration.
Figure 4. SAS Deployment Manager to configure SAS/ACCESS Interface to Hadoop
Figure 5. Snippet of Hadoop configuration and JAR file paths after completing the configuration
Figure 6. Snippet of SAS configuration file with Hadoop configuration and JAR file paths updated
Access and analyze data stored on HDP using BASE SAS
After completing the installation and configuration instructions described above, use BASE SAS to access and analyze data stored on HDP. Because the article covers validation of SAS/ACCESS Interface to Hadoop with HDP, we tried some of the important interfaces to access data and performed basic analysis. You can explore beyond the examples stated in the following sections.
Accessing data on HDFS
Use SAS PROC HADOOP and HDFS statements to access data stored in HDFS from BASE SAS. The following sample SAS code describes how to connect to Hadoop, write data to HDFS, and access data stored on HDFS. You can run the SAS code using the SAS command-line interface (CLI) on the SAS client system.
# cat proc-hadoop-test.sas proc hadoop username='hdfs' password='xxxxxxxx' verbose; hdfs mkdir='/user/hdfs/sasdata'; hdfs delete='/user/hdfs/test1'; hdfs copytolocal='/user/hdfs/test/fromHDP.txt' out='/home/np/SASCode/fromHDP.txt'; hdfs copyfromlocal='/home/np/SASCode/fromSAS.txt' out='/user/hdfs/sasdata/fromSAS.txt'; run; # /SASHome/SASFoundation/9.4/sas proc-hadoop-test.sas
After running the code, check the log file for the status of execution.
# cat proc-hadoop-test.log 1 The SAS System 00:26 Saturday, April 22, 2017 NOTE: Copyright (c) 2002-2012 by SAS Institute Inc., Cary, NC, USA. NOTE: SAS (r) Proprietary Software 9.4 (TS1M4) Licensed to IBM CORP - FOR DEMO PURPOSES ONLY, Site 70219019. NOTE: This session is executing on the AIX 7.2 (AIX 64) platform. NOTE: Additional host information: IBM AIX AIX 64 2 7 00F9C48F4C00 You are running SAS 9. Some SAS 8 files will be automatically converted by the V9 engine; others are incompatible. Please see http://support.sas.com/rnd/migration/planning/platform/64bit.html PROC MIGRATE will preserve current SAS file attributes and is recommended for converting all your SAS libraries from any SAS 8 release to SAS 9. For details and examples, please see http://support.sas.com/rnd/migration/index.html This message is contained in the SAS news file, and is presented upon initialization. Edit the file "news" in the "misc/base" directory to display site-specific news and information in the program log. The command line option "-nonews" will prevent this display. NOTE: SAS initialization used: real time 0.03 seconds cpu time 0.00 seconds 1 proc hadoop username='hdfs' password=XXXXXXXXXX verbose; 2 hdfs mkdir='/user/hdfs/sasdata'; 3 4 hdfs delete='/user/hdfs/test1'; 5 6 hdfs copytolocal='/user/hdfs/test/fromHDP.txt' 7 out='/home/np/SASCode/fromHDP.txt'; 8 9 hdfs copyfromlocal='/home/np/SASCode/fromSAS.txt' 10 out='/user/hdfs/sasdata/fromSAS.txt'; 11 run; NOTE: PROCEDURE HADOOP used (Total process time): real time 4.49 seconds cpu time 0.05 seconds NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414 NOTE: The SAS System used: real time 4.53 seconds cpu time 0.05 seconds
Check the content of the local directory and HDFS to confirm that the SAS code is run successfully.
# hadoop fs -ls /user/hdfs/sasdata/ Found 1 items -rw-r--r-- 3 hdfs hdfs 30 2017-04-22 01:28 /user/hdfs/sasdata/fromSAS.txt [hdfs@hdpnode1 ~]$ hadoop fs -cat /user/hdfs/sasdata/fromSAS.txt Hello HDP user! Good morning! # ls -ltr total 136 -rw-r--r-- 1 root system 333 Apr 22 00:25 proc-hadoop-test.sas -rw-r--r-- 1 root system 23 Apr 22 00:26 fromHDP.txt -rw-r--r-- 1 root system 2042 Apr 22 00:26 proc-hadoop-test.log l48fvp038_pub[/home/np/SASCode] > cat fromHDP.txt Hello SAS user! Howdy?
Run Map Reduce programs on HDP from BASE SAS
You can run Map Reduce programs against data stored on HDP from BASE SAS. The MapReduce program will be run from BASE SAS using the PROC HADOOP SAS procedure. For example, if you have a large data set that needs some pre-processing before actual analysis using SAS, you can write a MapReduce program to perform the pre-processing on the HDP cluster.
The following sample SAS code copies a text file to HDFS and runs the WordCount Map Reduce program.
$ cat word-count-mr-ex.sas proc hadoop username='hdfs' password='xxxxxxx' verbose; #Copy text file from client machine to HDFS hdfs copyfromlocal='/home/np/SASCode/wordcount-input.txt' out='/user/hdfs/sasdata/wordcount-input.txt'; #Map reduce statement to execute wordcount mapreduce input='/user/hdfs/sasdata/wordcount-input.txt' output='/user/hdfs/sasdata/wcout' jar='/hadoop/lib/hadoop-mapreduce-examples-220.127.116.11.5.0.0-1245.jar' outputkey='org.apache.hadoop.io.Text' outputvalue='org.apache.hadoop.io.IntWritable' reduce='org.apache.hadoop.examples.WordCount$IntSumReducer' combine='org.apache.hadoop.examples.WordCount$IntSumReducer' map='org.apache.hadoop.examples.WordCount$TokenizerMapper'; run; # /SASHome/SASFoundation/9.4/sas word-count-mr-ex.sas
Run the sample code and check the SAS log file to ensure that the code ran successfully. Also, check the contents of the HDFS directory to verify the input file contents and output from the WordCount MapReduce program.
# hadoop fs -ls /user/hdfs/sasdata/ drwxr-xr-x - hdfs hdfs 0 2017-04-22 01:47 /user/hdfs/sasdata/wcout -rw-r--r-- 3 hdfs hdfs 268 2017-04-22 01:46 /user/hdfs/sasdata/wordcount-input.txt # hadoop fs -cat /user/hdfs/sasdata/wordcount-input.txt This PROC HADOOP example submits a MapReduce program to a Hadoop server. The example uses the Hadoop MapReduce application WordCount that reads a text input file, breaks each line into words, counts the words, and then writes the word counts to the output text file. # hadoop fs -ls /user/hdfs/sasdata/wcout Found 2 items -rw-r--r-- 3 hdfs hdfs 0 2017-04-22 01:47 /user/hdfs/sasdata/wcout/_SUCCESS -rw-r--r-- 3 hdfs hdfs 270 2017-04-22 01:47 /user/hdfs/sasdata/wcout/part-r-00000 # hadoop fs -cat /user/hdfs/sasdata/wcout/part-r-00000 HADOOP 1 Hadoop 2 MapReduce 2 PROC 1 The 1 This 1 WordCount 1 a 3 and 1 application 1 breaks 1 counts 2 each 1 example 2 file, 1 file. 1 input 1 into 1 line 1 output 1 program 1 reads 1 server. 1 submits 1 text 2 that 1 the 4 then 1 to 2 uses 1 word 1 words, 2 writes 1
Figure 7. WordCount Map Reduce job history from the Job History server of HDP
Access Hive tables as SAS data sets
Use the LIBNAME statement to access Hive tables as SAS data sets and PROC SQL procedure to run Hive queries against the HDP cluster through JDBC.
Refer to the following sample SAS code that uses LIBNAME and PROC SQL to access Hive tables and run queries and analysis from BASE SAS.
$ cat sas-hdp-hive-access.sas #LIBNAME statement to connect to HIVEService2 on HDP Cluster libname myhdp hadoop server='hdpnode1.dal-ebis.ihost.com' schema='default' user=hive; # Listing of tables proc datasets lib=myhdp details; run; # Get schema for all tables. proc contents data=myhdp._all_; run; # Run queries and analysis using PROC SQL procedure proc sql; select count(*) from MYHDP.TRUCK_EVENTS; run; # Run PROC FREQ statistical procedure against the HDFS data that #is not available as SAS dataset proc freq data=MYHDP.TRUCK_EVENTS; tables eventtype; run;
Check the SAS log file for a list of tables. See the following excerpt from the log file listing the tables from the default Hive schema.
3 proc datasets lib=myhdp details; Libref MYHDP Engine HADOOP Physical Name jdbc:hive2://hdpnode1.dal-ebis.ihost.com:10000/default Schema/Owner default # Name Type or Indexes Vars Label 1 DRIVERS DATA . 6 2 EXPORT_TABLE DATA . 2 3 EXTENSION DATA . 3 4 HOSTS DATA . 16 5 HOSTS1 DATA . 16 6 HOSTS2 DATA . 16 7 HOSTS3 DATA . 16 8 TEST2 DATA . 2 9 TRUCK_EVENTS DATA . 11
Check the listing file (with the
.lst extension) for
output of the statistical procedures. For example, the excerpt from the listing file
shows all the table names and details for table truck_events.
The SAS System 02:19 Saturday, April 22, 2017 1 The CONTENTS Procedure Libref MYHDP Engine HADOOP Physical Name jdbc:hive2://hdpnode1.dal-ebis.ihost.com:10000/default Schema/Owner default # Name Type 1 DRIVERS DATA 2 EXPORT_TABLE DATA 3 EXTENSION DATA 4 HOSTS DATA 5 HOSTS1 DATA 6 HOSTS2 DATA 7 HOSTS3 DATA 8 TEST2 DATA 9 TRUCK_EVENTS DATA The CONTENTS Procedure Data Set Name MYHDP.TRUCK_EVENTS Observations . Member Type DATA Variables 11 Engine HADOOP Indexes 0 Created . Observation Length 0 Last Modified . Deleted Observations 0 Protection Compressed NO Data Set Type Sorted NO Label Data Representation Default Encoding Default Alphabetic List of Variables and Attributes # Variable Type Len Format Informat Label 8 correlationid Num 8 20. 20. correlationid 1 driverid Num 8 11. 11. driverid 9 drivername Char 32767 $32767. $32767. drivername 7 eventkey Char 32767 $32767. $32767. eventkey 3 eventtime Char 32767 $32767. $32767. eventtime 4 eventtype Char 32767 $32767. $32767. eventtype 6 latitude Num 8 latitude 5 longitude Num 8 longitude 10 routeid Num 8 11. 11. routeid 11 routename Char 32767 $32767. $32767. routename 2 truckid Num 8 11. 11. truckid
The listing file also shows the output from statistical procedure PROC FREQ on the eventtype column of the truck_events table. See the following excerpt from the listing file.
The FREQ Procedure eventtype Cumulative Cumulative eventtype Frequency Percent Frequency Percent -------------------------------------------------------------------------- Lane Departure 11 0.06 11 0.06 Normal 17041 99.80 17052 99.87 Overspeed 9 0.05 17061 99.92 Unsafe following distance 7 0.04 17068 99.96 Unsafe tail distance 7 0.04 17075 100.00
Accessing Hive tables using explicit pass-through
Use the PROC SQL procedure along with the CONECT TO HADOOP and EXECUTE statements to run Hive queries in the explicit pass-through mode. This bypasses JDBC, and therefore, it is faster. This interface is helpful for streaming reads from Hive/HDFS.
The following sample code demonstrates this interface.
# cat sas-hdp-hive-access-explicit-passthru.sas options dbidirectexec; options nodbidirectexec; proc sql; #Connect to Hadoop/HiveServer2 connect to hadoop (server=xxxxx.xxxxxxxx.com' user=hive subprotocol=hive2); #Execute HIVEQL query to create an external HIVE table execute ( CREATE EXTERNAL TABLE movie_ratings ( userid INT, movieid INT, rating INT, tstamp STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' STORED AS TEXTFILE LOCATION '/user/hive/sasdata/movie_ratings') by hadoop; disconnect from hadoop; proc hadoop username='hive' password='xxxxxx' verbose; #Copy data from client to the HDFS location for the movie_ratings table. hdfs copyfromlocal='/home/np/SASCode/ratings.txt' out='/user/hive/sasdata/movie_ratings'; quit;
Log in to Hive on the HDP cluster and verify that the movie_ratings table is created and has the data.
# su - hive # hive hive> show tables; OK movie_ratings hive> desc movie_ratings; OK userid int movieid int rating int tstamp string Time taken: 0.465 seconds, Fetched: 4 row(s) hive> select count(*) from movie_ratings; Query ID = hive_20170422042813_e9e49803-144a-48e9-b0f6-f5cd8595d254 Total jobs = 1 Launching Job 1 out of 1 Status: Running (Executing on YARN cluster with App id application_1492505822201_0034) Map 1: -/- Reducer 2: 0/1 Map 1: 0/2 Reducer 2: 0/1 Map 1: 0/2 Reducer 2: 0/1 Map 1: 0(+1)/2 Reducer 2: 0/1 Map 1: 0(+2)/2 Reducer 2: 0/1 Map 1: 0(+2)/2 Reducer 2: 0/1 Map 1: 2/2 Reducer 2: 0(+1)/1 Map 1: 2/2 Reducer 2: 1/1 OK 1000209 Time taken: 14.931 seconds, Fetched: 1 row(s)
- Hortonworks Data Platform: Apache Ambari Installation for IBM Power Systems
- Step by Step: Installation of HDP 2.6 on IBM Power Systems
- SAS® 9.4 Hadoop Configuration Guide for Base SAS® and SAS/ACCESS®, Fourth Edition
- MovieLens data set citation:
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872