SAP HANA and Hortonworks Data Platform (HDP) integration with SAP HANA Spark Controller running on IBM Power Systems
Combine in-memory processing power of SAP HANA with Big Data processing capabilities of Hadoop on IBM Power servers
This article describes how to combine the strengths of SAP HANA and Hortonworks Data Platform (HDP) to process data. A new software component, SAP HANA Spark Controller, integrates HANA with HDP, giving HANA the ability to access and process data stored in the HDP Hadoop cluster. The entire processing environment runs on IBM® POWER8® processor-based servers with Linux®. The article provides details on the test environment and two deployment models, and the installation and configuration instructions help you set up a similar environment. At the end, SAP HANA Studio is used to test the data flow and validate that the environment is functioning properly.
SAP HANA Spark Controller
SAP HANA Spark Controller supports SAP HANA in-memory access to data stored on a Hadoop cluster as Hadoop Distributed File System (HDFS) data files. Spark Controller allows SAP HANA to access Hadoop data through an SQL interface. It primarily works with Spark SQL and connects to an existing Hive Metastore. Refer to the SAP HANA Spark Controller documentation for more details.
Spark Controller performs the following functions:
- Facilitates query execution and enables SAP HANA to fetch data in a compressed columnar format
- Supports SAP HANA-specific query optimizations and secure communication
- Facilitates data transfer between SAP HANA and executor nodes
Key objectives of the validation and testing are:
- Validate that SAP HANA can be configured to interface with HDP running on Red Hat Enterprise Linux (RHEL) on IBM Power Systems™.
- Validate that SAP HANA can connect to and access Hive tables (part of Hadoop) on the HDP cluster as virtual tables.
- Access and visualize data from HDP using SAP HANA Studio as the client.
- Test two deployment models using scale-up and scale-out IBM POWER8 processor-based servers.
For the test environment, we used the following software versions and hardware configurations.
- SAP HANA 2.0
- Virtual machine [or a logical partition (LPAR)] on an IBM POWER8 processor-based server, with SUSE Linux Enterprise Server 12 SP1 (ppc64le)
- 16 virtual processors, 240 GB memory, and 500 GB disk space
- SAP HANA Spark Controller 2.1.0-1
- This is installed on the HDP cluster.
- It is installed on only one node of a multi-node cluster and must be able to reach the Hive Metastore node.
- For this test, it was installed on the node that runs Apache Ambari, Node Manager, and YARN Resource Manager.
- HDP 2.6
- A four-node cluster was set up with HDP 2.6.
- Each node is a virtual machine (or an LPAR) on an IBM POWER8 processor-based server, with RHEL 7.2
- 4 virtual processors (vCPU), 32 GB memory, 100 GB disk space for each node
- SAP HANA Studio (client)
- Version 2.3.5
- Microsoft® Windows® 7 PC
The test environment used two independent deployment models. The same software and configuration were used in each deployment environment, and the results of the tests were the same.
- Both SAP HANA and HDP software were running on their own independent IBM POWER8 scale-out server (as shown in Figure 1).
- Both SAP HANA and HDP software were running within their own independent VMs on the same enterprise POWER8 scale-up server (as shown in Figure 2).
Figure 1 and Figure 2 describe the two deployments and the high-level architecture used to validate SAP HANA with HDP running on IBM Power® servers. SAP HANA was installed and configured on a virtual machine [or a logical partition (LPAR) in Power Systems terminology] running SUSE Linux Enterprise Server 12 SP1 on an IBM POWER8 processor-based server. A four-node HDP cluster was installed and configured on four RHEL 7.2 virtual machines on a second IBM POWER8 processor-based server.
SAP HANA Spark Controller was installed on the master node that runs Ambari, Node Manager, and YARN Resource Manager. However, Spark Controller can be installed on any one node of a multi-node HDP cluster, as long as it can reach the Hive Metastore node.
Note that the installation and configuration of SAP HANA is transparent to the number of nodes in the HDP cluster.
Figure 1. Scale-out deployment model for SAP HANA integrated with Hortonworks HDP using SAP HANA Spark Controller
Figure 2. Scale-up deployment model for SAP HANA integrated with Hortonworks HDP using SAP HANA Spark Controller
Installation and configuration steps
This section covers the installation and configuration of an HDP cluster, SAP HANA Spark Controller, and SAP HANA on Power Systems.
Installing the HDP cluster
You need to perform the following high-level steps to install and configure the HDP cluster:
- Follow the installation guide for HDP on Power Systems to install and configure the HDP cluster.
- Log in to the Ambari server and ensure that all the services are running.
- Monitor and manage the HDP cluster, Hadoop, and related services through Ambari.
Setting up test data and Hive tables
Prepare the test data by performing the following steps.
- Download the MovieLens data set (see the citation in Reference).
Follow the instructions to copy the MovieLens data to HDFS and to set up Hive external tables, using the hive user ID.
- Download the drivers data and extract it into the local file system.
- Copy the driver data to HDFS.
# su - hive
# hadoop fs -mkdir -p /user/hive/dataset/drivers
# hadoop fs -copyFromLocal /home/np/u0014213/Data/truck_event_text_partition.csv /user/hive/dataset/drivers
# hadoop fs -copyFromLocal /home/np/u0014213/Data/drivers.csv /user/hive/dataset/drivers
# hadoop fs -ls /user/hive/dataset/drivers
Found 2 items
-rw-r--r--   3 hive hdfs     2043 2017-05-21 06:30 /user/hive/dataset/drivers/drivers.csv
-rw-r--r--   3 hive hdfs  2272077 2017-05-21 06:30 /user/hive/dataset/drivers/truck_event_text_partition.csv
- Create Hive tables for the driver data.
# su - hive
# hive
hive> create database trucks;
hive> use trucks;
hive> create table drivers
      (driverId int, name string, ssn bigint, location string, certified string, wageplan string)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      TBLPROPERTIES("skip.header.line.count"="1");
hive> create table truck_events
      (driverId int, truckId int, eventTime string, eventType string, longitude double, latitude double,
       eventKey string, correlationId bigint, driverName string, routeId int, routeName string)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      TBLPROPERTIES("skip.header.line.count"="1");
hive> show tables;
OK
drivers
truck_events
- Load the data into the tables from the files in HDFS.
hive> LOAD DATA INPATH '/user/hive/dataset/drivers/truck_event_text_partition.csv' overwrite into table truck_events;
hive> LOAD DATA INPATH '/user/hive/dataset/drivers/drivers.csv' overwrite into table drivers;
- Cross-check the tables to ensure that the data is present by running queries on the tables.
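For example, a quick sanity check can be run from the Hive CLI. The queries below are illustrative only; the row counts depend on the data files, so no expected output is shown.

```sql
-- run as the hive user, from the hive CLI
use trucks;

-- confirm that both tables contain rows
select count(*) from drivers;
select count(*) from truck_events;

-- spot-check a few joined rows to verify that the comma delimiter was parsed correctly
select d.driverId, d.name, t.eventType
from drivers d join truck_events t on (d.driverId = t.driverId)
limit 5;
```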
Install and configure SAP HANA
Install and configure SAP HANA by performing the following steps:
- Install and configure SAP HANA on your POWER8 processor-based server or VM. Use the planning guides for Power Systems.
- Start the SAP HANA database if it is not already running. Use the following commands to stop and start the server.
saphana2:~> su - hndadm
Password:
hndadm@saphana2:/usr/sap/HND/HDB00> ps -ef | grep -i hdb
hndadm  5724    1 0 Mar05 ? 00:00:06 /usr/sap/HND/HDB00/exe/sapstartsrv pf=/usr/sap/HND/SYS/profile/HND_HDB00_saphana2 -D -u hndadm
hndadm  5848    1 0 Mar05 ? 00:00:00 sapstart pf=/usr/sap/HND/SYS/profile/HND_HDB00_saphana2
hndadm  5893 5848 0 Mar05 ? 00:00:03 /usr/sap/HND/HDB00/saphana2/trace/hdb.sapHND_HDB00 -d -nw -f /usr/sap/HND/HDB00/saphana2/daemon.ini pf=/usr/sap/HND/SYS/profile/HND_HDB00_saphana2
hndadm  5962 5893 0 Mar05 ? 00:22:49 hdbnameserver
hndadm  6044 5893 0 Mar05 ? 00:09:04 hdbcompileserver
hndadm  6046 5893 0 Mar05 ? 00:12:27 hdbpreprocessor
hndadm  6078 5893 9 Mar05 ? 07:06:29 hdbindexserver
hndadm  6080 5893 1 Mar05 ? 01:04:24 hdbxsengine
hndadm  6705 5893 0 Mar05 ? 00:09:15 hdbwebdispatcher
hndadm@saphana2:/usr/sap/HND/HDB00> HDB stop
hndadm@saphana2:/usr/sap/HND/HDB00> HDB start
hndadm@saphana2:/usr/sap/HND/HDB00> HDB proc
USER     PID   PPID  %CPU VSZ      RSS      COMMAND
hndadm   39940 39882 0.1  8640     5952     \_ bash
hndadm   39992 39940 16.6 6912     4736     \_ /bin/sh /usr/sap/HND/HDB00/HDB proc
hndadm   40021 39992 0.0  6912     1088     \_ /bin/sh /usr/sap/HND/HDB00/HDB proc
hndadm   7218  1     0.0  9344     4224     sapstart pf=/usr/sap/HND/SYS/profile/HND_HDB00_saphana2
hndadm   7226  7218  0.0  316928   57984    \_ /usr/sap/HND/HDB00/saphana2/trace/hdb.sapHND_HDB00 -d -nw -f /usr/sap/HND/HDB00/saphana2/daemon.ini pf=/usr/sap/HND/SYS/profile/HND_HDB00_saphana2
hndadm   7242  7226  0.5  8541440  5080576  \_ hdbnameserver
hndadm   7322  7226  0.1  6521536  1096128  \_ hdbcompileserver
hndadm   7324  7226  0.2  6490624  764672   \_ hdbpreprocessor
hndadm   7356  7226  9.7  28182336 24624640 \_ hdbindexserver
hndadm   7358  7226  1.6  7534784  2313280  \_ hdbxsengine
hndadm   7972  7226  0.1  6732480  953600   \_ hdbwebdispatcher
hndadm   5724  1     0.0  538816   35648    /usr/sap/HND/HDB00/exe/sapstartsrv pf=/usr/sap/HND/SYS/profile/HND_HDB00_saphana2 -D -u hndadm
Install and configure SAP HANA Studio
HANA Studio is an integrated development environment (IDE) that provides many capabilities for interacting with SAP HANA. It can be used to access and visualize data assets in SAP HANA. Install the software on your client system, such as a Windows PC.
Install and configure SAP HANA Spark Controller
Perform the following steps to install and configure SAP HANA Spark Controller:
- Download Spark Controller from SAP Software Download.
- Follow the guide for installing and configuring Spark Controller.
- On a multi-node HDP cluster, SAP HANA Spark Controller needs to be installed on only one node. Consider installing it on the node that runs Ambari, Node Manager, and YARN Resource Manager. That node must be configured to communicate with the Hive Metastore.
- You can install and configure the Spark Controller either manually or by using Ambari.
- Start Spark Controller and check the Spark Controller log (/var/log/hanaes/hana_controller.log) to understand and debug issues, if any.
- Check that Spark Controller is running and listening on port 7860, for example, by running the following commands:
[root@hdpnode1 conf]# ps -aef | grep -i ^hanaes
hanaes 6064 1 0 Mar25 ? 02:37:24 /usr/lib/jvm/java-1.8.0-openjdk-<version>.ppc64le/bin/java -cp /usr/hdp/current/spark-client/lib/spark-assembly-<version>-hadoop<version>.jar::/etc/hive/conf:/usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar:/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar:/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar:mysql-connector-java.jar:mysql-connector-java.jar:/usr/hdp/<HDP-version>-1245/tez/*:/usr/hdp/<HDP-version>-1245/tez/lib/*:/usr/hdp/<HDP-version>-1245/tez/conf:/usr/sap/spark/controller/bin/../conf:/usr/hdp/<HDP-version>-1245/hadoop/conf:/etc/hive/conf:../*:../lib/*:/usr/hdp/<HDP-version>-1245/hadoop/*:/usr/hdp/<HDP-version>-1245/hadoop/lib/*:/usr/hdp/<HDP-version>-1245/hadoop-hdfs/*:/usr/hdp/<HDP-version>-1245/hadoop-hdfs/lib/* -XX:PermSize=128m -XX:MaxPermSize=256m -Xmx8172m com.sap.hana.spark.network.Launcher
[root@hdpnode1 conf]# netstat -anp | grep LISTEN | grep 7860
tcp6       0      0 :::7860       :::*       LISTEN      6064/java
Accessing data on HDP from SAP HANA database for analysis
Follow these steps to access data stored on HDP from SAP HANA.
- Add the SAP HANA 2.0 server as a server in SAP HANA Studio, as shown in Figure 3.
Figure 3. Configuring SAP HANA Studio with a test SAP HANA server
- From HANA Studio, create a new remote connection to connect to Hive Server2 running on HDP and access the Hive tables.
- In the HANA Studio window, click Open SQL Console for HANA system. The HANA SQL console is displayed on the right pane.
- Enter the following SQL query to create a new connection to HDP from SAP HANA. Change the names and IP addresses as per your environment.
CREATE REMOTE SOURCE "HDP Demo" ADAPTER "sparksql"
CONFIGURATION 'port=7860;ssl_mode=disabled;server=xx.xx.xx.xx'
WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hanaes;password=hanaes';
A new remote connection named HDP Demo is created.
- Click the remote connection to explore the Hive databases and tables within it.
Figure 4. Create a new HDP remote connection to explore Hive databases and tables
- Create a new schema under the HANA database for the Hive tables and create virtual tables. Run the following SQL query in the SQL console. A new schema (named hdp) is created.
create SCHEMA hdp OWNED BY SYSTEM;
Figure 5. Creating a new schema
- Create HANA virtual tables (under the newly created schema) for the Hive tables whose data you want to access and analyze (as shown in Figure 6, Figure 7, and Figure 8).
Figure 6. Creating a virtual table on SAP HANA
Figure 7. Creating a virtual table on SAP HANA
Figure 8. Creating a virtual table on SAP HANA
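The virtual tables shown in Figure 6 through Figure 8 can also be created directly from the SQL console. The statements below are a sketch only: they assume the remote source is named HDP Demo, the target schema is hdp, and the Hive tables live in the trucks database. The exact four-part remote path (in particular its second component) depends on the adapter, so copy it from the table's context menu in HANA Studio if it differs in your environment.

```sql
-- hypothetical names; adjust the remote path to match your source
CREATE VIRTUAL TABLE "HDP"."DRIVERS"
    AT "HDP Demo"."hive"."trucks"."drivers";

CREATE VIRTUAL TABLE "HDP"."TRUCK_EVENTS"
    AT "HDP Demo"."hive"."trucks"."truck_events";
```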
- Access data through the virtual tables and start analyzing. You can perform various kinds of data visualization using the Visualization tab in HANA Studio.
Figure 9. Access data on Hadoop through HANA virtual tables
Figure 10. Perform queries on Hive tables through HANA virtual tables
Figure 11. Perform visualization on data from Hive tables through HANA virtual tables
Figure 12. Perform visualization on data from Hive tables through HANA virtual tables
Figure 13. Perform visualization on data from Hive tables through HANA virtual tables
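As an example of the kind of analysis behind Figure 9 through Figure 13, the virtual tables can be queried from the HANA SQL console like any local table. This sketch assumes the virtual tables were created as HDP.DRIVERS and HDP.TRUCK_EVENTS and that the Hive column names (which Hive stores in lowercase) were carried over.

```sql
-- top drivers by number of recorded truck events; HANA delegates the
-- work to Spark Controller, and the data stays in HDFS
SELECT d."name", COUNT(*) AS event_count
FROM "HDP"."TRUCK_EVENTS" t
JOIN "HDP"."DRIVERS" d ON t."driverid" = d."driverid"
GROUP BY d."name"
ORDER BY event_count DESC;
```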
This article explained how to set up a test environment in which SAP HANA interfaces with HDP, running entirely on IBM Power Systems. SAP HANA was able to connect to and access data stored on HDFS through Hive tables. SAP HANA Studio was used as a client to access and visualize data from HDP. Two deployment models, using scale-up and scale-out POWER8 servers, were validated.
- Hortonworks Data Platform: Apache Ambari Installation for IBM Power Systems
- Step by Step: Installation of HDP 2.6 on IBM Power Systems
- SAP HANA Spark Controller Installation Guide
- MovieLens data set citation:
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages.