
Access and Analyze data in IBM Cross Region Cloud Object Storage


IBM Analytics for Apache Spark and IBM Data Science Experience can now access and analyze data stored in the IBM Cloud Object Storage Cross Region service. IBM Cloud Object Storage (COS) offers high-capacity, cost-effective storage for analytics and other applications that is scalable, flexible, and simple to use. Its Cross Region support protects data against regional outages without the cost and complexity of maintaining multiple copies.

Apache Spark uses an object storage connector to access data in an object store. The existing open source Hadoop connectors are designed to work with file systems rather than object stores, so they perform operations that are not native to object storage, such as creating temporary directories and files and then renaming them, which generates dozens of unnecessary requests. IBM created the Stocator connector to overcome these limitations and take advantage of object storage semantics. The connector that allows IBM Analytics for Apache Spark and IBM Data Science Experience to access data in the IBM COS Cross Region service is based on Stocator technology.

In the remainder of this post we describe how to get started using IBM COS Cross Region with IBM Analytics for Apache Spark (the Bluemix Spark service) and IBM Data Science Experience (DSX). We assume that you already have a Bluemix account and know how to use the Bluemix Spark service and DSX. Details on getting started are available in the IBM Cloud Object Storage documentation.

To obtain IBM COS Cross Region storage, your IBM Cloud Bluemix account must have a credit card or other payment authorization associated with it. From your account, subscribe to the IBM Cloud Object Storage Cross Region service (S3 API). If you'd like to learn more, you can read about the features and benefits of IBM Cloud Object Storage on its product pages. Note that a free tier of 25 GB is available by entering an appropriate promotion code.

After the account is established, you must create a bucket to store objects. A bucket can be created through the Bluemix Infrastructure storage panel or through the S3-compatible REST API, as sketched below.
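If you prefer the REST API route, any S3-compatible client can create the bucket. The sketch below uses the Python boto3 library against the COS S3 API; the endpoint URL is a placeholder, so substitute the endpoint listed for your account, along with your own credentials (retrieving them is covered in the next step).

# Hypothetical sketch: create a bucket through the S3-compatible API with boto3
import boto3

cos = boto3.client(
    "s3",
    endpoint_url="https://s3-api.us-geo.objectstorage.softlayer.net",  # example endpoint; use yours
    aws_access_key_id="your access key goes here",
    aws_secret_access_key="your secret key goes here",
)

cos.create_bucket(Bucket="my_bucket")                      # create the bucket
print([b["Name"] for b in cos.list_buckets()["Buckets"]])  # confirm it exists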

To access your newly created bucket, you will need the credentials for your object storage account. From the Bluemix Infrastructure object storage panel, click on your object storage account. This opens the administration panel for the IBM COS account, which shows the name of the account, the amount of storage currently being consumed, and the public bandwidth used so far. Click "Show Credentials" on the left side of the panel and then the "Show" link next to "Credentials" to reveal your access key and secret key.

You can run a Spark application that uses data in your Cross Region COS bucket through a notebook in DSX or through spark-submit. Below we show a Spark application that you can run in a notebook. Please note that to access data in Cross Region COS you need to be running Spark 2.0 or higher.

First, set up the credentials and the endpoint for the COS Cross Region service in the Hadoop configuration so the COS connector can access the data in your bucket. The endpoint should be the private endpoint for Dallas, as shown. The access key and secret key come from the credentials retrieved above. Notice that the keys in the Hadoop configuration are prefixed with "fs.s3d".

# Set up the credentials and endpoint for the IBM Cloud Object Storage connector
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.s3d.service.endpoint", "http://s3-api.dal-us-geo.objectstorage.service.networklayer.com”)
hconf.set("fs.s3d.service.access.key", "your access key goes here")
hconf.set("fs.s3d.service.secret.key", "your secret key goes here")

Now you can write your application. Here we show a WordCount application. Notice that the URIs for the input and output datasets use the "s3d" scheme, which indicates that the COS connector should be used to access the data in COS. Both the input and output datasets are in the bucket called "my_bucket".

# Example code for WordCount
# Input: a text file
# Output: a list of unique words and a count of how many times they appear
from operator import add

inputDataset = "s3d://my_bucket.service/myText.txt"
outputDataset = "s3d://my_bucket.service/wordcountResults"

lines = sc.textFile(inputDataset, 1)
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)
counts.saveAsTextFile(outputDataset)
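Once the job finishes, you can verify the results from the same notebook by reading the output back through the connector. This short sketch simply prints a few of the (word, count) pairs that Spark wrote to the bucket.

# Read the WordCount results back from COS and show a few entries
results = sc.textFile(outputDataset)
for pair in results.take(10):
    print(pair)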
David Wohlford

WW Cloud Storage Portfolio and Solutions Marketing Manager, IBM Hybrid Cloud
