IBM Analytics for Apache Spark and IBM Data Science Experience can now access and analyze data stored in the IBM Cloud Object Storage Cross Region service. IBM Cloud Object Storage (COS) offers high-capacity, cost-effective storage for analytics and other applications that is scalable, flexible and simple to use. It offers cross region support to protect data against regional outages without the cost and complexity of maintaining multiple copies.
Apache Spark uses an object storage connector to access data in an object store. The existing Hadoop open source connectors are designed to work with file systems rather than object stores, and therefore perform operations, such as creating temporary directories and files and then renaming them, that are not native to object stores, leading to dozens of useless requests. IBM has created the Stocator connector to overcome these limitations and take advantage of object storage semantics. The connector that allows IBM Analytics for Apache Spark and IBM Data Science Experience to access data in the IBM COS Cross Region service is based on Stocator technology.
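To see why the write-then-rename pattern is wasteful on an object store, here is a toy local sketch (plain Python, file names hypothetical) contrasting the two commit paths. On a real file system the rename is a cheap metadata operation; on an object store it becomes a copy plus a delete, multiplying the number of requests per output file:

```python
import os
import tempfile

# Filesystem-style commit: write to a temporary name, then rename into place.
# On an object store the rename translates into extra requests (copy + delete).
def commit_via_rename(directory, name, data):
    tmp = os.path.join(directory, "_temporary_" + name)
    with open(tmp, "w") as f:
        f.write(data)
    os.rename(tmp, os.path.join(directory, name))  # costly on an object store

# Object-store-native commit (the Stocator idea, sketched): write the object
# directly under its final name, with no temporary files or renames.
def commit_direct(directory, name, data):
    with open(os.path.join(directory, name), "w") as f:
        f.write(data)

with tempfile.TemporaryDirectory() as d:
    commit_via_rename(d, "part-00000", "hello")
    commit_direct(d, "part-00001", "world")
    print(sorted(os.listdir(d)))  # → ['part-00000', 'part-00001']
```

Both paths produce the same final objects; the difference is only in how many operations it took to get there.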
In the remainder of this document we describe how you can get started using IBM COS Cross Region with IBM Analytics for Apache Spark (Bluemix Spark service) and the IBM Data Science Experience (DSX). We assume that you already have a Bluemix account and know how to use the Bluemix Spark service and DSX. IBM Cloud Object Storage documentation with details on how to get started can be found here.
To obtain IBM COS Cross Region storage, your IBM Cloud Bluemix account must have a credit card or other payment authorization associated with it. From your account, subscribe to the IBM Cloud Object Storage Cross Region service S3 API. If you’d like to learn more, you can read about the features and benefits of IBM Cloud Object Storage here. Note that a free tier of 25 GB can be obtained by entering an appropriate promotion code.
After an account is established, a bucket must be created to store objects. A bucket can be created through the Bluemix Infrastructure storage panel or using REST APIs.
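As a rough sketch of the REST route (the endpoint and bucket name below are placeholders, not values from this document), creating a bucket amounts to an authenticated S3-style "PUT Bucket" request against the service endpoint:

```python
# Sketch only: replace the endpoint placeholder with the endpoint shown in
# your Bluemix Infrastructure panel, and sign the request with your COS
# credentials (for example, via an S3-compatible client library).
endpoint = "https://<your-cos-endpoint>"
bucket = "my_bucket"

# An S3-style "PUT Bucket" request creates the bucket:
request_line = "PUT /{} HTTP/1.1".format(bucket)
url = "{}/{}".format(endpoint, bucket)

print(request_line)  # → PUT /my_bucket HTTP/1.1
print(url)
```

Any S3-compatible tool or library can issue this request for you; the point is simply that bucket creation is a single authenticated PUT, with no panel interaction required.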
To access your newly created bucket you will need the credentials for your object storage account. From the Bluemix Infrastructure object storage panel, click on your object storage account. This opens the administration panel for this IBM COS account, which shows the name of the account, the amount of storage currently being consumed, and the public bandwidth used so far. Click on “Show Credentials” on the left side of the panel and then on the “Show” link next to “Credentials”. The figure below shows the panel.
You can run a Spark application that uses data in your CR COS bucket through a notebook in DSX or through Spark submit. Below we show a Spark application that you can run in a notebook. Please note that to access data in CR COS you need to be running Spark 2.0 or higher.
First you need to set up the credentials and the endpoint for the COS Cross Region service in the Hadoop configuration so the COS connector will be able to access the data in your bucket. The endpoint should be the private endpoint for Dallas as shown. The access key and secret key come from the credentials as shown above. Notice that the keys in the Hadoop configuration are prefixed by “fs.s3d”.
# Set up the credentials and endpoint for the IBM Cloud Object Storage connector
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.s3d.service.endpoint", "your private Dallas endpoint goes here")
hconf.set("fs.s3d.service.access.key", "your access key goes here")
hconf.set("fs.s3d.service.secret.key", "your secret key goes here")
Now you can write your application. Here we show an application for WordCount. Notice that the URIs for the input and output datasets use the “s3d” prefix. This indicates that the COS connector should be used to access the data in COS. Both the input and output datasets are in the bucket called “my_bucket”.
# Example code for WordCount
# Input: a text file
# Output: a list of unique words and a count of how many times they appear
inputDataset = "s3d://my_bucket.service/myText.txt"
outputDataset = "s3d://my_bucket.service/wordcountResults"
lines = sc.textFile(inputDataset, 1)
counts = lines.flatMap(lambda x: x.split(' ')) \
.map(lambda x: (x, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile(outputDataset)
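The flatMap/map/reduce pipeline can be checked locally with plain Python, no Spark or COS access needed. This sketch (with a made-up two-line input) reproduces the same word counts the application would compute:

```python
from collections import Counter

lines = ["to be or not to be", "that is the question"]

# flatMap: split each line into words; map: pair each word with 1;
# reduceByKey: sum the 1s per word. Counter performs the same aggregation.
words = [w for line in lines for w in line.split(' ')]
counts = Counter(words)

print(counts["to"])    # → 2
print(counts["be"])    # → 2
print(counts["that"])  # → 1
```

Running this before submitting the Spark job is a quick way to sanity-check the transformation logic on a small sample of your data.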