vStorm Enterprise integrated with Hortonworks Data Platform (HDP) running on IBM Power Systems
Steps for moving multi-sourced data using vStorm to HDP on IBM Power Systems
Veristorm provides a solution called vStorm Enterprise that makes data migration to Hadoop environments flexible, secure, and easy. vStorm already supports data movement to Hadoop solutions running on Linux on IBM® Power Systems™. Validation testing was performed to verify vStorm's ability to integrate with and move data specifically to Hortonworks Data Platform (HDP) 2.6 on IBM POWER8® processor-based servers. Here is a brief introduction to the basic capabilities of vStorm Enterprise that are important for IBM Power Systems, followed by information about the validation test that was completed.
Veristorm vStorm Enterprise
Veristorm offers a solution called vStorm Enterprise for Hadoop that provides fast, seamless point-and-click data migration from many types of data sources to Hadoop clusters and other destinations. The vStorm Connect connector component transfers data securely at high speed and offers batch scheduling options. vStorm can move data to the following Hadoop targets:
- Hadoop Distributed File System (HDFS): The connector interfaces to the HDFS name node and moves data as comma-separated values (CSVs) or Avro files directly into HDFS. Avro files are compressed to reduce storage requirements.
- Hive: Metadata is written to the Hive server, reflecting the source data schema, and file data is moved to HDFS to make the data available on Hadoop for HiveQL queries. Hive is not a data format; rather, it enables a schema to be imposed on existing data in HDFS so that the data can be queried using HiveQL, an SQL-like language. Data that exists in Hive is already stored in HDFS.
- Linux file system: The connector can transfer data directly into a Linux file system. Data written to a local file system can be used by downstream ETL and analytics tools and applications. This flexibility is important when clients want to move data not only to Hadoop but to other environments as well.
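The Linux file system target is the easiest one to sanity-check with ordinary shell tools after a transfer completes. The sketch below is only illustrative: the path, file name, and contents are hypothetical stand-ins for whatever vStorm actually landed on the target system.

```shell
# Stand-in for a CSV file that vStorm transferred to a Linux file system
# target. On a real system, point LANDED at the actual destination path.
LANDED=/tmp/vstorm_landed.csv
printf 'id,city\n1,Austin\n2,Boston\n' > "$LANDED"

head -1 "$LANDED"     # header row should match the source schema
wc -l < "$LANDED"     # line count should be data rows + 1 for the header
```

The same kind of row-count comparison against the source is a quick way to confirm a transfer was complete before downstream ETL tools consume the file.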
For more information about vStorm Enterprise, refer to the Veristorm website.
The key objectives for the validation testing of vStorm Enterprise were to verify whether it can successfully connect to an HDP Hadoop cluster running on an IBM POWER8 processor-based server and successfully move data to and from HDP. More specifically, the following key tests were run:
- Configuring vStorm to connect to HDP 2.6 running on an IBM POWER8 processor-based server
- Moving sample data from a data source on an x86-based server, where vStorm is hosted, to the HDFS of HDP running on a POWER8 processor-based server, and moving data from HDP back to the original data source
- Moving sample data from a Linux file system on the HDP cluster to the HDFS of HDP running on a POWER8 processor-based server, and moving data from HDP back to a Linux file system on HDP
This section lists the high-level components of Veristorm and HDP used in the test environment.
Veristorm vStorm Enterprise
- vStorm Enterprise 3.0, including the vStorm Connect connector component for HDP
- Red Hat Enterprise Linux (RHEL) 7.2
- Virtual machine on an x86 processor-based server
Hortonworks Data Platform
- HDP version 2.6
- RHEL 7.2
- Minimum resources: Eight virtual processors, 24 GB of memory, 50 GB of disk space
- IBM PowerKVM™
- IBM POWER8 processor-based server
Figure 1 describes the high-level architecture for Veristorm vStorm Enterprise. vStorm supports moving data from many types of data sources (shown on the left side) to many data targets (shown on the right side). The tests performed with IBM Power Systems used data in an x86 processor-based Linux file system and a POWER8 processor-based HDP Linux file system as data sources and a POWER8 processor-based HDP HDFS and a POWER8 processor-based HDP Linux file system as the data targets.
Figure 1. High-level architecture for Veristorm vStorm Enterprise
Veristorm vStorm Enterprise has three major components: Management console, vStorm Data Hub, and vStorm Connect.
- Management console is the graphical user interface (GUI) that manages the user interactions with the data source and target systems.
- vStorm Data Hub is the main processing engine of vStorm. It runs on Linux and uses vStorm Connect to access the source data through vStorm Connect agents deployed in the database environment. As data is streamed to vStorm Data Hub, the hub first performs any specified data conversion and then transfers the data to the target, for example, a big data platform such as Hadoop.
- vStorm Connect is the connector that establishes communication between data sources and data targets and communicates with vStorm Data Hub as data is transferred.
Installation and configuration
This section covers the installation and configuration of an HDP cluster and the vStorm software.
Installing the HDP cluster
Refer to the following high-level steps to install and configure the HDP cluster:
- Follow the installation guide for HDP on Power Systems (see Hortonworks Data Platform: Apache Ambari Installation for IBM Power Systems) to install and configure the HDP cluster.
- Log in to the Ambari server and ensure that all the services are running.
- Monitor and manage the HDP cluster, Hadoop, and related services through Ambari.
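The "all services running" check in step 2 can also be scripted against the Ambari REST API instead of the web UI. In this sketch the host address, the cluster name (hdp26), and the admin credentials are placeholders; substitute the values from your own installation.

```shell
# Build the Ambari REST endpoint that lists every service and its state.
# 192.0.2.20 and the cluster name "hdp26" are placeholders.
AMBARI_HOST=192.0.2.20
SERVICES_URL="http://${AMBARI_HOST}:8080/api/v1/clusters/hdp26/services"
echo "$SERVICES_URL"

# On a live cluster, uncomment to confirm each service reports STARTED
# (default credentials shown; use your own Ambari account):
# curl -s -u admin:admin "$SERVICES_URL?fields=ServiceInfo/state" \
#   | grep '"state"'
```

Scripting this check is convenient when the cluster must be validated repeatedly, for example before each scheduled vStorm transfer job.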
Installing prerequisites for vStorm Enterprise
Follow the instructions in the installation guide provided by Veristorm. Ensure that the necessary hardware and software requirements are met. For this test, vStorm was installed on RHEL 7.2 on an x86 processor-based server. The following list shows the requirements at the time the test was run.
- RHEL 6.x, CentOS 6.x, or SLES 11.x
- Java 1.7 or later
- Tomcat 6.0.18 or later (installed as the tomcat user)
- PostgreSQL 8.4 or later with postgresql-contrib and postgresql-jdbc
- Root access to the Linux file system. All vStorm Enterprise installation and setup operations are performed as the root user.
- SELinux disabled (not enforcing).
- Mozilla Firefox 44.x or Microsoft Internet Explorer 11.x for client browser on Microsoft Windows with JRE 1.8 or later.
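The prerequisite list above can be spot-checked from the shell before running the installer. This is only a rough audit sketch; the commented package names follow the RHEL 6.x naming in the list and may differ on your distribution.

```shell
# Rough pre-install audit of the vStorm prerequisites.
JAVA_STATE=$(command -v java >/dev/null 2>&1 && echo present || echo missing)
echo "java: $JAVA_STATE"

# Package checks for a RHEL 6.x system (names may vary by distribution):
# rpm -q postgresql postgresql-contrib postgresql-jdbc tomcat6

# SELinux must not be enforcing:
# getenforce        # expect "Permissive" or "Disabled"
```

Running a check like this up front avoids partial installations that fail midway because a package or setting is missing.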
Configuring and initializing the database
After the vStorm software prerequisites were installed, the database was configured. The following commands were used to configure and start the database:
- Configure the PostgreSQL database using the following command:
# su - postgres
- Check the PostgreSQL configuration files to ensure that three specific settings have the correct values. Use the following two commands to display the three settings and ensure that they match the values shown below.
$ sudo cat /var/lib/pgsql/data/postgresql.conf | grep -e listen -e standard_conforming_strings
listen_addresses = '*'
standard_conforming_strings = off
$ sudo cat /var/lib/pgsql/data/pg_hba.conf | grep -e host
host all all 0.0.0.0/0 trust
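These configuration checks can be wrapped in a small script that confirms each expected value is present. Stand-in configuration files are created here so the sketch is self-contained; on a real system, point PGCONF and PGHBA at the files under /var/lib/pgsql/data instead.

```shell
# Stand-in config files; replace with /var/lib/pgsql/data/postgresql.conf
# and /var/lib/pgsql/data/pg_hba.conf on a real system.
PGCONF=/tmp/postgresql.conf
PGHBA=/tmp/pg_hba.conf
printf "listen_addresses = '*'\nstandard_conforming_strings = off\n" > "$PGCONF"
printf "host all all 0.0.0.0/0 trust\n" > "$PGHBA"

# Each grep prints a confirmation only when the expected value is present.
grep -q "listen_addresses = '\*'" "$PGCONF" && echo "listen_addresses ok"
grep -q "standard_conforming_strings = off" "$PGCONF" && echo "conforming_strings ok"
grep -q "^host all all 0.0.0.0/0 trust" "$PGHBA" && echo "pg_hba ok"
```

If any of the greps prints nothing, edit the corresponding file before initializing the database.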
- Initialize and start the PostgreSQL database using commands similar to the following:
# su - postgres
$ sudo service postgresql initdb
$ sudo service postgresql start
Installing vStorm Enterprise
Install vStorm using the following command:
# yum install vse-2.4-0.x86_64.rpm
Setting up vStorm Enterprise
You can set up vStorm Enterprise in two modes: traditional or standalone. Traditional mode, which was used in this deployment, provides the full UI-driven capabilities for data movement from sources to targets. Standalone mode provides offline job scheduling through a scheduler. In addition, vStorm Enterprise can be deployed as a primary node on a single server, or on multiple servers as secondary nodes that all point to the same primary node, which allows load balancing among the servers. For more details on these options, read the vStorm Enterprise user guide provided by Veristorm.
To set up vStorm in the traditional mode, change to the correct directory, and run the following setup command:
# cd /opt/vse/sbin/
# ./setup_vhub.sh
The command prompts for the IP addresses of the source and destination systems, the database credentials, the management console credentials, and other values. Enter the required values when prompted.
After the setup completes, start vStorm as described in the vStorm Enterprise user guide.
Launching the management console
Once the installation and setup are complete, launch the management console for vStorm by entering the server's IP address along with port 8080 in a browser (Tomcat listens on port 8080).
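The console URL is simply the vStorm server address plus the Tomcat port. The host address below is a placeholder; on a live system, a curl request gives a quick reachability check before opening a browser.

```shell
# 192.0.2.10 is a placeholder; use your vStorm server's IP address.
VSTORM_HOST=192.0.2.10
CONSOLE_URL="http://${VSTORM_HOST}:8080/"
echo "$CONSOLE_URL"

# On a live system, uncomment to verify the console answers (expect 200):
# curl -s -o /dev/null -w '%{http_code}\n' "$CONSOLE_URL"
```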
The screen shown in Figure 2 appears. Log in with the credentials of the admin user, which were created during vStorm setup.
Figure 2. vStorm Enterprise Management Console log in page
If a dialogue box appears (as shown in Figure 3), click Run to start the management console. After this completes, the management console appears as shown in Figure 4.
Figure 3. Prompt before starting the vStorm Enterprise Management Console
Figure 4. vStorm Enterprise Management Console
Test data and data transfer
Two different sample sets of data were used in the testing.
Download and load the data into vStorm using the following steps:
- Using the wget command with the links to the sample data sets listed at the end of this article, download the data to the server where vStorm is installed.
- After loading the data within vStorm, select the required data in the data source view in the vStorm console.
- Right-click the selection and then click Copy to, as shown in Figure 5.
Figure 5. Options for copying the data in vStorm console
- A dialogue box to select the target where the data needs to be transferred (as shown in Figure 6) is displayed. Click Next.
Figure 6. Selecting the target
A dialogue box with the job ID and job description is displayed. The job can also be scheduled for a specific time by providing the start time, as shown in Figure 7.
Figure 7. Job scheduler
- Click Finish to transfer the data to the data target.
You can monitor the data transfer through the vStorm console. Refer to Figure 8 to see how the status of the jobs appears during the transfer process.
Figure 8. Reviewing job status in the vStorm management console
Four test scenarios were considered in this process. Each test was initiated from the vStorm Enterprise Management Console by selecting the data source on the left side of the console and selecting the target on the right side. Each test transferred data from the source to the target system in one direction only.
Refer to Table 1 for the test scenarios.
Table 1. Tests initiated from the vStorm Enterprise Management Console
| Test | Data source | Data target | Reference |
|------|-------------|-------------|-----------|
| Test 1 | Linux file system on an x86 processor-based server | HDFS on the HDP cluster on a POWER8 processor-based server | Management console in Figure 9 |
| Test 2 | HDFS on the HDP cluster on a POWER8 processor-based server | Linux file system on an x86 processor-based server | Management console in Figure 10 |
| Test 3 | Linux file system on a POWER8 processor-based server | HDFS on the HDP cluster on the same POWER8 processor-based server | Management console in Figure 11 |
| Test 4 | HDFS on the HDP cluster on a POWER8 processor-based server | Linux file system on the same POWER8 processor-based server | Management console in Figure 12 |
Figure 9. vStorm Enterprise Management Console for Test 1
Figure 10. vStorm Enterprise Management Console for Test 2
Figure 11. vStorm Enterprise Management Console for Test 3
Figure 12. vStorm Enterprise Management Console for Test 4
Resources
- Hortonworks Data Platform: Apache Ambari Installation for IBM Power Systems
- Veristorm vStorm Enterprise
- Census house database
- Million songs subset database
- ISV solution ecosystem for Hortonworks on IBM Power Systems