SAP HANA and Hortonworks Data Platform (HDP) integration with SAP HANA Spark Controller running on IBM Power Systems

Combine in-memory processing power of SAP HANA with Big Data processing capabilities of Hadoop on IBM Power servers


This article describes how to use the strengths of SAP HANA and Hortonworks Data Platform (HDP) to process data. A new software component called SAP HANA Spark Controller integrates HANA and HDP, giving HANA the ability to access and process data stored in the HDP Hadoop cluster. The entire processing environment runs on IBM® POWER8® processor-based servers with Linux®. The article provides details on the test environment and two deployment models. Instructions for installation and configuration help you set up a similar environment. At the end, SAP HANA Studio is used to test the data flow and validate that the environment is functioning properly.

SAP HANA Spark Controller

SAP HANA Spark Controller supports SAP HANA in-memory access to data stored on a Hadoop cluster as Hadoop Distributed File System (HDFS) data files. Spark Controller allows SAP HANA to access Hadoop data through the SQL interface. It works primarily with Spark SQL and connects to an existing Hive Metastore. Refer to the SAP HANA Spark Controller documentation for more details.

Spark Controller performs the following functions:

  • Facilitates query execution and enables SAP HANA to fetch data in a compressed columnar format
  • Supports SAP HANA-specific query optimizations and secure communication
  • Facilitates data transfer between SAP HANA and executor nodes

Objectives

Key objectives of the validation and testing are:

  • Validate that SAP HANA can be configured to interface with HDP running on Red Hat Enterprise Linux (RHEL) on IBM Power Systems™.
  • Validate that SAP HANA can connect to and access Hive tables (part of Hadoop) on the HDP cluster as virtual tables.
  • Access and visualize data from HDP using SAP HANA Studio as the client.
  • Test two deployment models using scale-up and scale-out IBM POWER8 processor-based servers.

Test environment

For the test environment, we used the following software versions and hardware configurations.

  • SAP HANA 2.0
    • Virtual machine [or a logical partition (LPAR)] on an IBM POWER8 processor-based server, running SUSE Linux Enterprise Server 12 SP1 (ppc64le)
    • 16 virtual processors, 240 GB memory, 500 GB disk space
  • SAP HANA Spark Controller 2.1.0-1
    • Installed on the HDP cluster.
    • In a multi-node cluster, it is installed on a single node, which must be able to reach the Hive Metastore node.
    • For this test, it was installed on the node that runs Apache Ambari, the NameNode, and the YARN ResourceManager.
  • HDP 2.6
    • A four-node cluster was set up with HDP 2.6.
    • Each node is a virtual machine (or an LPAR) on an IBM POWER8 processor-based server, running RHEL 7.2
    • 4 virtual processors (vCPU), 32 GB memory, 100 GB disk space for each node
  • SAP HANA Studio (client)
    • Version 2.3.5
    • Microsoft® Windows® 7 PC

Deployment

The test environment used two independent deployment models. The same software and configuration were used in each deployment, and the test results were identical in both.

  • SAP HANA and HDP each running on its own independent IBM POWER8 scale-out server (as shown in Figure 1).
  • SAP HANA and HDP each running in its own independent VMs on the same enterprise POWER8 scale-up server (as shown in Figure 2).

Figure 1 and Figure 2 show the two deployments and the high-level architecture used to validate SAP HANA with HDP running on IBM Power® servers. SAP HANA was installed and configured on a virtual machine [or a logical partition (LPAR) in Power Systems terminology] running SUSE Linux Enterprise Server 12 SP1 on an IBM POWER8 processor-based server. A four-node HDP cluster was installed and configured on four RHEL 7.2 virtual machines on a second IBM POWER8 processor-based server.

SAP HANA Spark Controller was installed on the master node, which runs Ambari, the NameNode, and the YARN ResourceManager. However, Spark Controller can be installed on any one of the nodes in a multi-node HDP cluster, as long as that node can reach the Hive Metastore node.

Note that the installation and configuration of SAP HANA is independent of the number of nodes in the HDP cluster.

Figure 1. Scale-out deployment model for SAP HANA integrated with Hortonworks HDP using SAP HANA Spark Controller
Figure 2. Scale-up deployment model for SAP HANA integrated with Hortonworks HDP using SAP HANA Spark Controller

Installation and configuration steps

This section covers the installation and configuration of an HDP cluster, SAP HANA Spark Controller, and SAP HANA on IBM Power Systems.

Installing the HDP cluster

You need to perform the following high-level steps to install and configure the HDP cluster:

  • Follow the installation guide for HDP on Power Systems to install and configure the HDP cluster.
  • Log in to the Ambari server and ensure that all the services are running (a command-line check is sketched after this list).
  • Monitor and manage the HDP cluster, Hadoop, and related services through Ambari.
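
If you prefer the command line, you can also query the Ambari REST API for service states. The following is a minimal sketch; the admin/admin credentials and port 8080 are Ambari defaults, and the host name (ambari-server) and cluster name (hdpcluster) are placeholders for your environment:

    # Ask Ambari for the state of every service in the cluster.
    # Each healthy service should report "state" : "STARTED".
    curl -s -u admin:admin \
      'http://ambari-server:8080/api/v1/clusters/hdpcluster/services?fields=ServiceInfo/state'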

Setting up test data and Hive tables

Prepare the test data by following these steps.

  1. Download the MovieLens data set (see References).

    Follow the instructions to copy the MovieLens data to HDFS and set up Hive external tables, performing these operations as the hive user. A minimal sketch of such an external table is shown below.
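
    The following sketch assumes the CSV form of the ratings file and a hypothetical HDFS path (/user/hive/dataset/movielens/ratings); adjust the columns and location to match the files you actually downloaded:

    # su - hive
    # hive
    hive> create database if not exists movielens;
    hive> use movielens;
    hive> create external table ratings
    (userId int,
    movieId int,
    rating double,
    ratingTime bigint)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/user/hive/dataset/movielens/ratings'
    TBLPROPERTIES("skip.header.line.count"="1");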

  2. Download the drivers data and extract it into the local file system.
  3. Copy the driver data to HDFS.
    # su - hive
    # hadoop fs -mkdir -p /user/hive/dataset/drivers
    # hadoop fs -copyFromLocal /home/np/u0014213/Data/truck_event_text_partition.csv /user/hive/dataset/drivers
    # hadoop fs -copyFromLocal /home/np/u0014213/Data/drivers.csv /user/hive/dataset/drivers
    # hadoop fs -ls /user/hive/dataset/drivers
    Found 2 items
    -rw-r--r--   3 hive hdfs       2043 2017-05-21 06:30 /user/hive/dataset/drivers/drivers.csv
    -rw-r--r--   3 hive hdfs    2272077 2017-05-21 06:30 /user/hive/dataset/drivers/truck_event_text_partition.csv
  4. Create Hive tables for the driver data.
    # su - hive
    # hive
    hive> create database trucks;
    hive> use trucks;
    
    hive> create table drivers
    (driverId int,
    name string,
    ssn bigint,
    location string,
    certified string,
    wageplan string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    TBLPROPERTIES("skip.header.line.count"="1");
    
    hive> create table truck_events
    (driverId int,
    truckId int,
    eventTime string,
    eventType string,
    longitude double,
    latitude double,
    eventKey string,
    correlationId bigint,
    driverName string,
    routeId int,
    routeName string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    TBLPROPERTIES("skip.header.line.count"="1");
    
    hive> show tables;
    OK
    drivers
    truck_events
  5. Load the data into the tables from the files in HDFS.
    hive> LOAD DATA INPATH '/user/hive/dataset/drivers/truck_event_text_partition.csv' overwrite into table truck_events;
    hive> LOAD DATA INPATH '/user/hive/dataset/drivers/drivers.csv' overwrite into table drivers;
  6. Cross-check the tables to ensure that the data is present by running queries on them, for example:
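    These sample queries use only the table names defined above:

    hive> use trucks;
    hive> select count(*) from drivers;
    hive> select count(*) from truck_events;
    hive> select * from drivers limit 5;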

Installing and configuring SAP HANA

Install and configure SAP HANA by performing the following steps:

  1. Install and configure SAP HANA on your POWER8 processor-based server or VM. Use the planning guides for Power Systems.
  2. Start the SAP HANA database if it is not already running. The following example checks the running processes, and then uses the HDB command to stop and start the server.
    saphana2:~> su - hndadm
    Password:
    hndadm@saphana2:/usr/sap/HND/HDB00> ps -ef | grep -i hdb
    hndadm     5724      1  0 Mar05 ?        00:00:06 /usr/sap/HND/HDB00/exe/sapstartsrv pf=/usr/sap/HND/SYS/profile/HND_HDB00_saphana2 -D -u hndadm
    hndadm     5848      1  0 Mar05 ?        00:00:00 sapstart pf=/usr/sap/HND/SYS/profile/HND_HDB00_saphana2
    hndadm     5893   5848  0 Mar05 ?        00:00:03 /usr/sap/HND/HDB00/saphana2/trace/hdb.sapHND_HDB00 -d -nw -f /usr/sap/HND/HDB00/saphana2/daemon.ini pf=/usr/sap/HND/SYS/profile/HND_HDB00_saphana2
    hndadm     5962   5893  0 Mar05 ?        00:22:49 hdbnameserver
    hndadm     6044   5893  0 Mar05 ?        00:09:04 hdbcompileserver
    hndadm     6046   5893  0 Mar05 ?        00:12:27 hdbpreprocessor
    hndadm     6078   5893  9 Mar05 ?        07:06:29 hdbindexserver
    hndadm     6080   5893  1 Mar05 ?        01:04:24 hdbxsengine
    hndadm     6705   5893  0 Mar05 ?        00:09:15 hdbwebdispatcher
    hndadm@saphana2:/usr/sap/HND/HDB00> HDB stop
    hndadm@saphana2:/usr/sap/HND/HDB00> HDB start
    hndadm@saphana2:/usr/sap/HND/HDB00> HDB proc
    USER        PID   PPID %CPU    VSZ   RSS COMMAND
    hndadm    39940  39882  0.1   8640  5952  \_ bash
    hndadm    39992  39940 16.6   6912  4736      \_ /bin/sh /usr/sap/HND/HDB00/HDB proc
    hndadm    40021  39992  0.0   6912  1088          \_ /bin/sh /usr/sap/HND/HDB00/HDB proc
    hndadm     7218      1  0.0   9344  4224 sapstart pf=/usr/sap/HND/SYS/profile/HND_HDB00_saphana2
    hndadm     7226   7218  0.0 316928 57984  \_ /usr/sap/HND/HDB00/saphana2/trace/hdb.sapHND_HDB00 -d -nw -f /usr/sap/HND/HDB00/saphana2/daemon.ini pf=/usr/sap/HND/SYS/profile/HND_HDB00_saphana2
    hndadm     7242   7226  0.5 8541440 5080576      \_ hdbnameserver
    hndadm     7322   7226  0.1 6521536 1096128      \_ hdbcompileserver
    hndadm     7324   7226  0.2 6490624 764672      \_ hdbpreprocessor
    hndadm     7356   7226  9.7 28182336 24624640      \_ hdbindexserver
    hndadm     7358   7226  1.6 7534784 2313280      \_ hdbxsengine
    hndadm     7972   7226  0.1 6732480 953600      \_ hdbwebdispatcher
    hndadm     5724      1  0.0 538816 35648 /usr/sap/HND/HDB00/exe/sapstartsrv pf=/usr/sap/HND/SYS/profile/HND_HDB00_saphana2 -D -u hndadm

Installing and configuring SAP HANA Studio

HANA Studio is an integrated development environment (IDE) that provides many capabilities for interacting with SAP HANA, including accessing and visualizing data assets. Install the software on your client system, such as a Windows PC.

Installing and configuring SAP HANA Spark Controller

Perform the following steps to install and configure SAP HANA Spark Controller:

  1. Download Spark Controller from SAP Software Download.
  2. Follow the guide for installing and configuring Spark Controller.
  3. On a multi-node HDP cluster, SAP HANA Spark Controller can be installed on only one node. Consider installing it on the node that runs Ambari, the NameNode, and the YARN ResourceManager. That node must be configured to communicate with the Hive Metastore.
  4. You can install and configure the Spark Controller either manually or by using Ambari.
  5. Start Spark Controller (a sketch for a manual installation is shown below) and check the Spark Controller log (/var/log/hanaes/hana_controller.log) to understand and debug issues, if any.
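    For a manual (non-Ambari) installation, Spark Controller is typically started with its service script as the hanaes user. This sketch assumes the default installation path, /usr/sap/spark/controller; adjust it if you installed elsewhere:

    # su - hanaes
    $ cd /usr/sap/spark/controller/bin
    $ ./hanaes start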
  6. Check that Spark Controller is running and listening on port 7860 (the default).
    [root@hdpnode1 conf]# ps -aef |grep -i ^hanaes
    hanaes    6064     1  0 Mar25 ?        02:37:24 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-1.b15.el7_2.ppc64le//bin/java -cp /usr/hdp/current/spark-client/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar::/etc/hive/conf:/usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar:/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar:/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar:mysql-connector-java.jar:mysql-connector-java.jar:/usr/hdp/2.5.0.0-1245/tez/*:/usr/hdp/2.5.0.0-1245/tez/lib/*:/usr/hdp/2.5.0.0-1245/tez/conf:/usr/sap/spark/controller/bin/../conf:/usr/hdp/2.5.0.0-1245/hadoop/conf:/etc/hive/conf:../*:../lib/*:/usr/hdp/2.5.0.0-1245/hadoop/*:/usr/hdp/2.5.0.0-1245/hadoop/lib/*:/usr/hdp/2.5.0.0-1245/hadoop-hdfs/*:/usr/hdp/2.5.0.0-1245/hadoop-hdfs/lib/* -XX:PermSize=128m -XX:MaxPermSize=256m -Xmx8172m com.sap.hana.spark.network.Launcher
    [root@hdpnode1 conf]# netstat -anp | grep LISTEN |grep 7860
    tcp6       0      0 :::7860                 :::*                    LISTEN      6064/java

Accessing data on HDP from the SAP HANA database for analysis

Follow the steps below to access data stored on HDP from SAP HANA.

  1. Add the SAP HANA 2.0 server as a server in the SAP HANA Studio as shown in Figure 3.
    Figure 3. Configuring SAP HANA Studio with a test SAP HANA server
  2. From HANA Studio, create a new remote source that connects, through Spark Controller, to Hive running on HDP so that you can access Hive tables.
    • In the HANA Studio window, click Open SQL Console for the HANA system. The HANA SQL console is displayed in the right pane.
    • Enter the following SQL query to create a new connection to HDP from SAP HANA. Change the names and IP addresses as per your environment.

      CREATE REMOTE SOURCE "HDP Demo" ADAPTER "sparksql"     
      CONFIGURATION 'port=7860;ssl_mode=disabled;server=xx.xx.xx.xx'        
      WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hanaes;password=hanaes';

      A new remote connection named HDP Demo is created.

    • Click the remote connection to explore the Hive databases and tables within it.
    Figure 4. Create a new HDP remote connection to explore Hive databases and tables
  3. Create a new schema in the HANA database to hold the virtual tables for the Hive tables. Run the following SQL statement in the SQL console to create a schema named hdp.

    CREATE SCHEMA hdp OWNED BY SYSTEM;
    Figure 5. Creating a new schema
  4. Create HANA virtual tables (under the newly created schema) for the Hive tables whose data you want to access and analyze (as shown in Figure 6, Figure 7, and Figure 8). A SQL sketch follows the figures.
    Figure 6. Creating a virtual table on SAP HANA
    Figure 7. Creating a virtual table on SAP HANA
    Figure 8. Creating a virtual table on SAP HANA
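    Virtual tables can also be created from the SQL console. The following is a sketch only: the four-part remote name follows the SAP HANA smart data access pattern <source>.<database>.<schema>.<object>, and with the sparksql adapter the database part is commonly given as "<NULL>". Verify the exact path by browsing the remote source in HANA Studio before running it.

    CREATE VIRTUAL TABLE hdp.drivers
      AT "HDP Demo"."<NULL>"."trucks"."drivers";
    CREATE VIRTUAL TABLE hdp.truck_events
      AT "HDP Demo"."<NULL>"."trucks"."truck_events";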
  5. Access data through the virtual tables and start analyzing. You can perform various kinds of data visualization using the Visualization tab in HANA Studio. A sample query is shown after the figures.
    Figure 9. Access data on Hadoop through HANA virtual tables
    Figure 10. Perform queries on Hive tables through HANA virtual tables
    Figure 11. Perform visualization on data from Hive tables through HANA virtual tables
    Figure 12. Perform visualization on data from Hive tables through HANA virtual tables
    Figure 13. Perform visualization on data from Hive tables through HANA virtual tables
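    As a simple example of the analysis now possible, the following sketch joins the two virtual tables in the HANA SQL console to count events per driver. Hive stores column names in lowercase, so they are quoted here; adjust the names if your virtual tables expose them differently:

    SELECT d."name", COUNT(*) AS event_count
    FROM hdp.truck_events e
    JOIN hdp.drivers d ON e."driverid" = d."driverid"
    GROUP BY d."name"
    ORDER BY event_count DESC;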

Summary

This article explained how to set up a test environment in which SAP HANA interfaces with HDP, running entirely on IBM Power Systems. SAP HANA was able to connect to and access data stored on HDFS through Hive tables. SAP HANA Studio was used as a client to access and visualize data from HDP. Two deployment models were validated, using scale-up and scale-out POWER8 servers.

References


