Analyze data faster using Spark and IBM Cloud Object Storage

Using Stocator: an open-source storage connector that leverages object store semantics


Data is being produced at dizzying rates across industries: by sequencers in genomics, by very high-resolution formats in media and entertainment, and by multitudes of sensors in the Internet of Things (IoT). IBM Cloud Object Storage (IBM COS, formerly Cleversafe) technology offers high-capacity, cost-effective storage for these applications. But it is not enough merely to store the data; you also need to derive value from it, which you can do using Apache Spark, the leading big data analytics processing engine. Spark runs up to 100 times faster than Hadoop MapReduce, and it combines SQL, streaming, and complex analytics.

This article describes how to enable Spark to run analytics over data stored in IBM COS. We describe how to use Stocator, which is open-source software that acts as a driver, and OpenStack Keystone, which provides authentication. Stocator takes advantage of object store semantics and greatly improves performance compared to the previous storage connectors for Spark that were designed to work with file systems. Stocator employs JOSS, an open-source Java client, to produce the HTTP REST commands to access IBM COS through its OpenStack Swift interface.
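
To make the flow concrete, the following sketch shows the kind of Keystone Version 2 token request that a Swift client such as JOSS issues before it can read or write objects. This is purely illustrative (Stocator performs this step for you); the endpoint and credentials are the same placeholders used in the configuration listings later in this article, and the sketch uses the third-party requests library.

# Illustrative only: the Keystone V2 token request a Swift client makes.
# Endpoint and credentials are placeholders; Stocator does this for you.
import requests

auth_url = "http://your.keystone.server.com:5000/v2.0/tokens"
payload = {
    "auth": {
        "tenantName": "service",
        "passwordCredentials": {"username": "swift", "password": "passw0rd"},
    }
}
resp = requests.post(auth_url, json=payload)
resp.raise_for_status()
token = resp.json()["access"]["token"]["id"]   # value for the X-Auth-Token header
print("Obtained token of length %d" % len(token))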

The triangular relationship between IBM COS, Stocator, and OpenStack Keystone is illustrated in the figure below.

Figure 1. IBM COS, Stocator, and OpenStack Keystone making up a triangle

1. Install and configure Spark

Download Spark. Instructions on how to build, install, and configure Spark are available on the Spark website. Depending on your setup, you can configure Spark as a standalone machine or on a cluster using YARN, Mesos, or Spark's standalone cluster manager. In our example, we used IBM COS with Spark 2.0.1.
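
Before wiring in object storage, it is worth confirming that the Spark installation itself can run a trivial job. A minimal sketch, assuming a local standalone setup:

# Sanity check: run a trivial PySpark job in local mode.
from pyspark import SparkContext

sc = SparkContext("local[*]", "sanity-check")
total = sc.parallelize(range(1, 101)).sum()   # 1 + 2 + ... + 100 = 5050
print(total)
sc.stop()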

2. Install and configure IBM COS

Install Cloud Object Storage (COS). In our example, we set up IBM COS with Keystone authentication.
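
If you want to verify the IBM COS and Keystone setup independently of Spark, one option is the python-swiftclient package. A minimal sketch, assuming the same placeholder endpoint and credentials that appear in the listings below:

# Optional check of Swift access through Keystone, outside of Spark.
# Requires: pip install python-swiftclient python-keystoneclient
from swiftclient import client as swift

conn = swift.Connection(
    authurl="http://your.keystone.server.com:5000/v2.0/",
    user="swift",
    key="passw0rd",
    tenant_name="service",
    auth_version="2",
)
headers, containers = conn.get_account()   # succeeds only if authentication works
print([c["name"] for c in containers])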

3. Install and configure Stocator

To access IBM COS from Spark, we used open-source driver software called Stocator. Stocator is a high-performance connector to object storage for Spark that leverages object store semantics. It comes with a complete driver for the OpenStack Swift API and can easily be extended to support other object storage interfaces. We used Stocator's capability to connect Spark with IBM COS through its Swift API.

To use Stocator, complete the following steps.

  1. Fork or clone the source code from https://github.com/SparkTC/stocator using git.
  2. Build Stocator by entering mvn clean package -P all-in-one from Stocator's directory.

To configure Spark to take advantage of Stocator to access IBM COS, define Stocator and its settings. Choose from two ways to configure Stocator:

Add parameters to the configuration file

Create your core-site.xml configuration file by doing one of the following:

Use the core-site.xml.template file from your Stocator/conf directory

Copy the template, which is configured for use with Keystone-based authentication, into Spark's conf directory by entering the following:

Listing 1. Access the core-site.xml.template file
ofer@beginnings:~$ cd ~/stocator/conf
ofer@beginnings:~/stocator/conf$ cp core-site.xml.template ~/spark-2.0.1/conf/core-site.xml

Use our Keystone Version 2 configuration file

For Keystone Version 2, you can use this core-site.xml with the substitutions described below the listing.

Listing 2. Keystone Version 2 configuration file
<configuration>
<property>
    <name>fs.swift2d.impl</name>
    <value>com.ibm.stocator.fs.ObjectStoreFileSystem</value>
</property>

<!-- Keystone based authentication -->
<property>
    <name>fs.swift2d.service.spark.auth.url</name>
    <value>http://your.keystone.server.com:5000/v2.0/tokens</value>
</property>
<property>
    <name>fs.swift2d.service.spark.public</name>
    <value>true</value>
</property>
<property>
    <name>fs.swift2d.service.spark.tenant</name>
    <value>service</value>
</property>
<property>
    <name>fs.swift2d.service.spark.password</name>
    <value>passw0rd</value>
</property>
<property>
    <name>fs.swift2d.service.spark.username</name>
    <value>swift</value>
</property>
<property>
    <name>fs.swift2d.service.spark.auth.method</name>
    <value>keystone</value>
</property>
<property>
    <name>fs.swift2d.service.spark.region</name>
    <value>IBMCOS</value>
</property>

</configuration>
  • Replace your.keystone.server.com with the real address of the Keystone server.
  • Replace all the authentication credentials (tenant, username, and password) with the valid credentials for your object store.
  • When using a global Keystone, the region property needs to be defined according to the Keystone region defined for IBM COS access.

Use our Keystone Version 3 configuration file

For Keystone Version 3, you can use this core-site.xml with the substitutions described below the listing. Remember that for Keystone Version 3, userID and tenantID are used instead of username and tenant values.

Listing 3. Keystone Version 3 configuration file
<configuration>
<property>
    <name>fs.swift2d.impl</name>
    <value>com.ibm.stocator.fs.ObjectStoreFileSystem</value>
</property>

<!-- Keystone based authentication -->
<property>
    <name>fs.swift2d.service.spark.auth.url</name>
    <value>http://your.keystone.server.com:5000/v3/auth/tokens</value>
</property>
<property>
    <name>fs.swift2d.service.spark.public</name>
    <value>true</value>
</property>
<property>
    <name>fs.swift2d.service.spark.tenant</name>
    <value>1c5c9e97c8db488baeca8d667497aef7</value>
</property>
<property>
    <name>fs.swift2d.service.spark.password</name>
    <value>passw0rd</value>
</property>
<property>
    <name>fs.swift2d.service.spark.username</name>
    <value>d2a2adb8bd924c2da1545e2e9ee7c4fe</value>
</property>
<property>
    <name>fs.swift2d.service.spark.auth.method</name>
    <value>keystoneV3</value>
</property>
<property>
    <name>fs.swift2d.service.spark.region</name>
    <value>IBMCOS</value>
</property>

</configuration>
  • Replace your.keystone.server.com with the real address of the Keystone server.
  • Replace all the authentication credentials (tenant, username, and password) with the valid credentials for your object store. Remember that for Keystone Version 3, userID and tenantID are used instead of username and tenant values.
  • When using a global Keystone, the region property needs to be defined according to the Keystone region defined for IBM COS access.

Specify the parameters programmatically in your code

If you prefer to specify the parameters in your code, use the following code sample, where the SERVICE_NAME is spark.

Listing 4. Add configuration parameters to your code
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.swift2d.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
hconf.set("fs.swift2d.service.spark.auth.url", "http://your.authentication.server.com/v2.0/tokens")
hconf.set("fs.swift2d.service.spark.public", "true")
hconf.set("fs.swift2d.service.spark.tenant", "service")
hconf.set("fs.swift2d.service.spark.username", "swift")
hconf.set("fs.swift2d.service.spark.auth.method", "keystone")
hconf.set("fs.swift2d.service.spark.password", "passw0rd")
hconf.set("fs.swift2d.service.spark.region", "IBMCOS")

Table 1 describes each of these parameters.

Table 1. Configuration parameters
Key                                          Description
fs.swift2d.impl                              The path to the driver class inside the jar
fs.swift2d.service.SERVICE_NAME.auth.url     The URL of the Keystone service
fs.swift2d.service.SERVICE_NAME.public       Whether access is over the public (true) or private (false) network. The default is false.
fs.swift2d.service.SERVICE_NAME.tenant       The object store tenant name
fs.swift2d.service.SERVICE_NAME.password     The object store user password
fs.swift2d.service.SERVICE_NAME.username     The object store user name
fs.swift2d.service.SERVICE_NAME.block.size   The data read/write block size in MB. The default is 128 MB.
fs.swift2d.service.SERVICE_NAME.region       The Keystone region, for the global Keystone use case
fs.swift2d.service.SERVICE_NAME.auth.method  The authentication method. Set to keystone for Keystone Version 2 or keystoneV3 for Keystone Version 3.

The SERVICE_NAME should be replaced by your service name, such as spark.
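
The service name is what ties these configuration keys to the URLs you use in your jobs: the segment after the container name in a swift2d URL selects the matching key set. As a sketch, a hypothetical second service (here called archive, purely for illustration) would get its own keys and its own URL form:

# "archive" is a hypothetical second service name, for illustration only;
# "spark" is the service configured in the listings above.
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.swift2d.service.archive.auth.url", "http://your.keystone.server.com:5000/v2.0/tokens")
hconf.set("fs.swift2d.service.archive.tenant", "service")
hconf.set("fs.swift2d.service.archive.username", "swift")
hconf.set("fs.swift2d.service.archive.password", "passw0rd")
hconf.set("fs.swift2d.service.archive.auth.method", "keystone")

df1 = sqlContext.read.json("swift2d://vault.spark/data.json")     # uses the "spark" settings
df2 = sqlContext.read.json("swift2d://vault.archive/data.json")   # uses the "archive" settings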

4. Launch the Stocator-enabled Spark

Before you take advantage of Stocator to access your IBM COS objects from Spark, you need to either statically recompile Spark with Stocator or dynamically pass Stocator's library to Spark.

To recompile Spark from source to include the Stocator driver, see the instructions in the GitHub repository for Stocator.

Alternatively, to use Stocator's standalone jar library without recompiling Spark, run Spark with the --jars option. In our environment, we used version 1.0.8 of Stocator, so the standalone jar library name is stocator-1.0.8-SNAPSHOT-jar-with-dependencies.jar, and the option to pass to Spark is --jars stocator-1.0.8-SNAPSHOT-jar-with-dependencies.jar.

Listing 5. Jars options
ofer@beginnings:~$ ~/spark-2.0.1/bin/spark-shell \
                   --jars stocator-1.0.8-SNAPSHOT-jar-with-dependencies.jar
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.7.0_111)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

5. Access IBM COS objects from Spark

Once Stocator is enabled on Spark, you can access IBM COS objects from Spark using the schema swift2d://<container>.<service>/. The swift2d keyword tells Spark which driver to use to access storage; it indicates that you are accessing an object store through Stocator. The container and service are discussed in more detail in the next section.

For example, the following Python code reads a JSON object named data.json from IBM COS and writes it back as a Parquet object named data.parquet.

Listing 6. Access the IBM COS objects
df = sqlContext.read.json("swift2d://vault.spark/data.json")
df.write.parquet("swift2d://vault.spark/data.parquet")

6. Test the connection between Spark and IBM COS

To test the connection between Spark and IBM COS, we used a simple Python script that distributes a single list of six elements across the Spark cluster, writes the data out as a Parquet object, and finally reads it back in. The name of the Parquet object is passed into the script as a parameter.

The data is displayed twice: the first time, together with its schema, before it is written to the object store; and the second time after it is read back from the object store.

Listing 7. Python script to test the connection
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import sys

sc = SparkContext()
sqlContext = SQLContext(sc)

# The object name is the script's single command-line argument.
if (len(sys.argv) != 2):
    print "ERROR: This program takes object name as input"
    sys.exit(1)

objectName = sys.argv[1]

# Distribute a six-element list across the cluster and build a DataFrame from it.
myList = [[1,'a'],[2,'b'],[3,'c'],[4,'d'],[5,'e'],[6,'f']]
parallelList = sc.parallelize(myList).collect()
schema = StructType([StructField('column1', IntegerType(), False),
                     StructField('column2', StringType(), False)])
df = sqlContext.createDataFrame(parallelList, schema)
df.printSchema()
df.show()

# Write the data to IBM COS as a single Parquet object, then read it back.
dfTarget = df.coalesce(1)
dfTarget.write.parquet("swift2d://vault.spark/" + objectName)
dfRead = sqlContext.read.parquet("swift2d://vault.spark/" + objectName)
dfRead.show()
print "Done!"

To run the script, complete the following steps:

  1. Save the code in Listing 7 as a file called sniff.test.py.
  2. Create a container named vault. (One way to create it is shown in the sketch after this list.)
  3. Set the service (the term that appears after the container name in the url) to be equal to the SERVICE_NAME defined in the core-site.xml file (remember that we used spark for our example).
  4. Issue the following command, where testing.parquet is the name of the object that will be created and then read: spark-submit --jars stocator-1.0.8-SNAPSHOT-jar-with-dependencies.jar sniff.test.py testing.parquet.
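
The container itself can be created with whatever object store tooling you prefer. As one option, a minimal python-swiftclient sketch with the same placeholder credentials as before:

# Create the "vault" container used by the test script; one option among many.
from swiftclient import client as swift

conn = swift.Connection(
    authurl="http://your.keystone.server.com:5000/v2.0/",
    user="swift",
    key="passw0rd",
    tenant_name="service",
    auth_version="2",
)
conn.put_container("vault")   # no error if the container already exists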

You should see a testing.parquet object in IBM COS in addition to the following Spark output:

Listing 8. Spark results confirming connection
 root                                                                            
 |-- column1: integer (nullable = false)
 |-- column2: string (nullable = false)

+-------+-------+
|column1|column2|
+-------+-------+
|      1|      a|
|      2|      b|
|      3|      c|
|      4|      d|
|      5|      e|
|      6|      f|
+-------+-------+

+-------+-------+
|column1|column2|
+-------+-------+
|      1|      a|
|      2|      b|
|      3|      c|
|      4|      d|
|      5|      e|
|      6|      f|
+-------+-------+

Done!

Conclusion

By configuring Spark, Stocator, and IBM Cloud Object Storage to work together, you can access and analyze your stored data faster through a connector that leverages object store semantics, rather than through the older storage connectors that were designed to work with file systems.

