
Access and Analyze data in IBM Cross Region Cloud Object Storage


IBM Analytics for Apache Spark and IBM Data Science Experience can now access and analyze data stored in the IBM Cloud Object Storage Cross Region service. IBM Cloud Object Storage (COS) offers high-capacity, cost-effective storage for analytics and other applications that is scalable, flexible, and simple to use. Its cross region support protects data against regional outages without the cost and complexity of maintaining multiple copies.

Apache Spark uses an object storage connector to access data in an object store. The existing open source Hadoop connectors are designed to work with file systems rather than object stores, so they perform operations that are not native to object stores, such as creating temporary directories and files and then renaming them, leading to dozens of unnecessary requests. IBM created the Stocator connector to overcome these limitations and take advantage of object storage semantics. The connector that allows IBM Analytics for Apache Spark and IBM Data Science Experience to access data in the IBM COS Cross Region service is based on Stocator technology.

In the remainder of this document we describe how to get started using IBM COS Cross Region with IBM Analytics for Apache Spark (the Bluemix Spark service) and IBM Data Science Experience (DSX). We assume that you already have a Bluemix account and know how to use the Bluemix Spark service and DSX. Details on getting started can be found in the IBM Cloud Object Storage documentation.

To obtain IBM COS Cross Region storage, your IBM Cloud Bluemix account must have a credit card or other payment authorization associated with it. From your account, subscribe to the IBM Cloud Object Storage Cross Region service (S3 API). Note that a free tier of 25 GB can be obtained by entering an appropriate promotion code.

After an account is established, a bucket must be created to store objects. A bucket can be created through the Bluemix Infrastructure storage panel or using the S3 REST API, as sketched below.
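If you prefer the REST route, any S3-compatible client can create the bucket. The sketch below uses the boto3 Python library, which is our own choice of client rather than part of the Bluemix tooling, and the bucket name is a placeholder; the endpoint shown is the private Dallas endpoint used later in this post, so substitute the public endpoint if you are running outside the IBM network.

# A minimal sketch: create a bucket with boto3, an S3-compatible Python client
# (the bucket name is a placeholder; the private endpoint shown here only
# resolves from inside the IBM network, so use the public endpoint elsewhere)
import boto3

cos = boto3.client(
    "s3",
    endpoint_url="http://s3-api.dal-us-geo.objectstorage.service.networklayer.com",
    aws_access_key_id="your access key goes here",
    aws_secret_access_key="your secret key goes here",
)
cos.create_bucket(Bucket="my_bucket")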

To access your newly created bucket, you will need the credentials for your object storage account. From the Bluemix Infrastructure object storage panel, click on your object storage account. This opens the administration panel for this IBM COS account, which shows the name of the account, the amount of storage currently being consumed, and the public bandwidth used so far. Click “Show Credentials” on the left side of the panel and then the “Show” link next to “Credentials”.

You can run a Spark application that uses data in your Cross Region COS bucket through a notebook in DSX or through spark-submit. Below we show a Spark application that you can run in a notebook. Please note that accessing data in Cross Region COS requires Spark 2.0 or higher.

First you need to set up the credentials and the endpoint for the COS Cross Region service in the Hadoop configuration so the COS connector will be able to access the data in your bucket.  The endpoint should be the private endpoint for Dallas as shown.  The access key and secret key come from the credentials as shown above.  Notice that the keys in the Hadoop configuration are prefixed by “fs.s3d”.

# Set up the credentials and endpoint for the IBM Cloud Object Storage connector
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.s3d.service.endpoint", "http://s3-api.dal-us-geo.objectstorage.service.networklayer.com”)
hconf.set("fs.s3d.service.access.key", "your access key goes here")
hconf.set("fs.s3d.service.secret.key", "your secret key goes here")

Now you can write your application.  Here we show an application for WordCount.  Notice that the URIs for the input and output datasets use the “s3d” prefix.  This indicates that the COS connector should be used to access the data in COS.  Both the input and output datasets are in the bucket called “my_bucket”.

# Example code for WordCount
# Input: a text file
# Output: a list of unique words and a count of how many times they appear
from operator import add

inputDataset = "s3d://my_bucket.service/myText.txt"
outputDataset = "s3d://my_bucket.service/wordcountResults"

lines = sc.textFile(inputDataset, 1)
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)
counts.saveAsTextFile(outputDataset)
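To sanity-check the job, the output dataset can be read back through the same connector. This is a minimal sketch under the configuration above; it simply prints the first few (word, count) records that the job wrote to COS.

# Read the word-count results back from COS and print the first few entries
results = sc.textFile(outputDataset)
for entry in results.take(10):
    print(entry)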