This article is co-authored by Chris Snow (chris.snow@uk.ibm.com) and Pierre Regazzoni (pierrer@us.ibm.com).
IBM® BigInsights™ on Cloud provides Hadoop-as-a-service on IBM’s SoftLayer® global cloud infrastructure. It supports a large variety of ways that you can process your data and integrate with other services.
If you are getting started with a cloud data service such as IBM® BigInsights™ on Cloud, you might be asking yourself:
- “how do I programmatically perform action X on my data on service Y?”
- “how do I programmatically move data between service Y and service Z?”
These questions usually need to be addressed early and quickly in the project’s life-cycle. As such, they are usually addressed during sprint zero to create the basic skeleton and plumbing for the project so that future sprints can truly add incremental value in an efficient way [1].
Starting from scratch, each question can easily take anywhere from a few hours to a few days to answer by the time you have researched the different options and developed some skeleton code.
In this blog post, we would like to introduce you to an open source project [2] that provides working examples of over 30 actions and integrations (and growing) that you can see running against your IBM® BigInsights™ cluster and your own services that you wish to integrate with. All you need to do is provide the connection details of your IBM® BigInsights™ cluster and the service(s) you wish to connect to and you can run a single command to see the example running against your environment. It is possible to set up the examples and run them in under five minutes!
The current examples are listed below:
HDFS (using Knox API – WebHDFS)
- List folder contents using Groovy
- Create a folder using Groovy
- Upload a file using Groovy
- List folder contents using cURL
- Create a folder using cURL
- Upload a file using cURL
Ambari
- Get cluster name and then services installed on cluster
- Perform HDFS Service Check via Ambari REST
BigR
- Connect to BigR
BigSQL
- Connect to Big SQL from Groovy
- Insert/Select with Big SQL from Groovy
- Load/Select with Big SQL from Groovy
- Connect to Big SQL from Java
Hive
- Connect to Hive from Groovy
- Connect to Hive from Java
- Start a Hive Beeline session
Spark (run inside an ssh session on the BigInsights cluster)
- Submit a Spark Python job
- Submit a Spark Scala job
Spark Streaming (run inside an ssh session on the BigInsights cluster)
- Submit a Spark Streaming Python job
Oozie (using Knox API)
- Submit a Java MapReduce job using Groovy
- Submit a Java MapReduce job using cURL
- Submit a Java Spark job using Groovy
HBase
- Connect to HBase using Groovy
- Manipulate schema and perform CRUD operations using Groovy
- Connect to HBase using Java
WebHCat/Templeton (using Knox API)
- Execute a MapReduce job using Groovy
- Execute a Pig job using Groovy
- Execute a Hive job using Groovy
Knox
- Run a Knox shell client session
Cloudant
- Pull data from a Cloudant database to HDFS using Spark
- Push data from HDFS to a Cloudant database using Spark
Object Store (Swift, S3)
- Pull data from an object store to HDFS using Spark
- Push data from HDFS to an object store using Spark
dashDB
- Pull data from a dashDB database to HDFS using Spark
- Push data to dashDB database using Spark
- Pull data from a dashDB database using Big SQL
Elasticsearch
- Push data to Elasticsearch using Spark
- Pull data from Elasticsearch to HDFS using Spark
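To give a sense of what the HDFS examples at the top of this list involve, here is a minimal Python sketch of how WebHDFS REST URLs are formed when they are routed through a Knox gateway. The project's own examples use Groovy and cURL, so this is an illustration and not the project's code. The gateway address and the `webhdfs_url` helper are hypothetical, and the `gateway/default` path assumes the default Knox topology; the operation names `LISTSTATUS`, `MKDIRS` and `CREATE` come from the WebHDFS REST API.

```python
from urllib.parse import quote, urlencode

# HTTP method used by each WebHDFS operation shown here.
WEBHDFS_METHODS = {"LISTSTATUS": "GET", "MKDIRS": "PUT", "CREATE": "PUT"}


def webhdfs_url(gateway_url, hdfs_path, op, **params):
    """Build the URL for one WebHDFS operation behind a Knox gateway."""
    # WebHDFS paths are appended after /webhdfs/v1 without a doubled slash.
    path = quote(hdfs_path.lstrip("/"))
    query = urlencode({"op": op, **params})
    return f"{gateway_url.rstrip('/')}/webhdfs/v1/{path}?{query}"


# Hypothetical gateway address; substitute your cluster's Knox URL.
GATEWAY = "https://bi-host.example.com:8443/gateway/default"

for op in ("LISTSTATUS", "MKDIRS"):
    print(WEBHDFS_METHODS[op], webhdfs_url(GATEWAY, "/tmp/demo", op))
```

Listing and creating folders are single requests. Uploading a file with `CREATE` is a two-step exchange in WebHDFS: the first request returns a redirect, and the file content is sent to the redirected location.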
Let’s look at a few use cases for moving data between BigInsights and dashDB. As you can see from the list above, there are three examples you can try:
- Pull data from a dashDB database to HDFS using Spark
- Push data to dashDB database using Spark
- Pull data from a dashDB database using Big SQL
The first two examples use Apache Spark to move data between dashDB and BigInsights, and the third example uses Big SQL.
Running the examples is simple. First, check out and set up the examples:
- Clone this repository:
git clone https://github.com/snowch/biginsight-examples.git
- Copy connection.properties_template to connection.properties
- Edit connection.properties to add your connection details for BigInsights and other optional services such as dashDB
- Export the cluster certificate from your browser
- In your connection.properties uncomment the line # known_hosts:allowAnyHosts
- Set up the driver libraries by downloading them from the cluster:
./gradlew DownloadLibs (unix)
gradlew.bat DownloadLibs (windows)
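Two of the setup steps, copying the template and uncommenting the `known_hosts:allowAnyHosts` line, are easy to script if you set up the examples repeatedly. The following is a minimal Python sketch under that assumption; the helper names are hypothetical, and the sample text contains a placeholder comment, not the real contents of the template. You still need to add your own connection details by hand.

```python
import shutil
from pathlib import Path

KNOWN_HOSTS_LINE = "known_hosts:allowAnyHosts"


def uncomment_property(text, key):
    """Return text with the commented-out line that starts with key uncommented."""
    out = []
    for line in text.splitlines():
        body = line.lstrip().lstrip("#").strip()
        if line.lstrip().startswith("#") and body.startswith(key):
            out.append(body)   # drop the leading '#'
        else:
            out.append(line)   # leave every other line untouched
    return "\n".join(out) + ("\n" if text.endswith("\n") else "")


def prepare_connection_properties(repo_dir):
    """Copy the template if needed, then uncomment the known_hosts line."""
    repo = Path(repo_dir)
    target = repo / "connection.properties"
    if not target.exists():
        shutil.copyfile(repo / "connection.properties_template", target)
    target.write_text(uncomment_property(target.read_text(), KNOWN_HOSTS_LINE))
    return target


# Placeholder content for illustration only.
sample = "# placeholder comment\n# known_hosts:allowAnyHosts\n"
print(uncomment_property(sample, KNOWN_HOSTS_LINE))
```

Calling `prepare_connection_properties(".")` from the root of the cloned repository would apply both steps to the real files.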
Now let’s just run one example:
On Unix, run:
./gradlew -p examples/DashDBIntegrationWithBigSQL Example
On Windows, run:
gradlew.bat -p examples/DashDBIntegrationWithBigSQL Example
The above command creates a table in Hadoop and populates it with data from dashDB. If you developed the code for this from scratch, you could easily burn a few hours or a few days on it. Instead, in around five minutes, you have been able to see some example code running against your own environment.
To run the whole set of examples at once, you can run:
./gradlew test (unix)
gradlew.bat test (windows)
Detailed output for the tests can be found in the folder ./build/test/.
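If you drive the examples from a script that has to work on both Unix and Windows, the choice between `./gradlew` and `gradlew.bat` can be made automatically. This is a minimal Python sketch, assuming the commands are run from the root of the cloned repository; the `gradle_command` helper is hypothetical and only builds the same command lines shown in this post.

```python
import platform


def gradle_command(example_dir=None, task="Example", windows=None):
    """Build the Gradle wrapper command for one example or for the whole test run."""
    if windows is None:
        windows = platform.system() == "Windows"
    wrapper = "gradlew.bat" if windows else "./gradlew"
    if example_dir is None:
        # No example given: run the whole set of examples.
        return [wrapper, "test"]
    return [wrapper, "-p", example_dir, task]


# Print the Unix forms of both commands.
print(" ".join(gradle_command("examples/DashDBIntegrationWithBigSQL", windows=False)))
print(" ".join(gradle_command(windows=False)))
```

The returned list can be passed to `subprocess.run` with the repository root as the working directory.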
For more information and to get the code, visit the project: https://github.com/snowch/biginsight-examples
We encourage you to also look at the code and provide comments and ideas on future examples you’d like to see.
—
[2] https://github.com/snowch/biginsight-examples