Technical Blog Post
Abstract
Using data connectors to work with IBM Cloud Object Storage in IBM Spectrum Conductor with Spark 2.2.1
Body
Analytics and cognitive computing are driving some of the fastest-growing use cases for consuming large amounts of unstructured data across industries. Spark is one of the most popular open source tools for data processing in advanced analytics environments.
Spark analytics can be run directly against IBM Cloud Object Storage. Traditionally, Spark jobs have run on local HDFS clusters; however, the use of public cloud object storage for analytics use cases is accelerating.
IBM Cloud Object Storage provides a flexible, scalable, and simple way to provide data storage for Spark users, as well as backup and/or archive for Hadoop-based data stores. In addition, it enables a much more cost-effective solution for data consolidation and retention as compared to expensive HDFS deployments.
How data connectors in IBM Spectrum Conductor with Spark 2.2.1 fit in
Before this feature, if you wanted the hosts in your Spark instance group to be able to connect to an external data source (IBM Cloud Object Storage, GPFS, HDFS, etc.), there was a lot of extra configuration required on those hosts. In the context of IBM Spectrum Conductor with Spark 2.2.1, a data connector contains the type, the URI, the authentication method, and all of the required libraries to access the data source. The following is an example of a problem that users might run into:
- You have 3 data sources on 1 Spark instance group. UserA’s application needs data1 and data3. UserB’s application needs data1 and data2.
- Without our data connector implementation, your HADOOP_CONFIG_DIR needs to point to the location with the configuration files for each user’s particular data connector set.
- Every user in this case would need their own configuration set up on the hosts that are used to access the data. This is impossible because there is only 1 configuration location per Spark instance group. A similar gap exists when multiple users access the same data source but their configurations differ.
Advantages of using data connectors
Data connectors greatly simplify data source administration and configuration, and separate the credential management and data usage from applications.
- Instead, we dynamically load the configurations that are specified in the Spark instance group’s data connector and user settings.
- Users select the data sources to which they need access at batch application submission time.
In addition, there are some advantages to using the IBM Cloud Object Storage data connector, specifically with its included “Stocator” connector package.
- Advanced connector uniquely designed for object storage
- Interacts with the object store directly; does not use complex Hadoop modules
- Supports analytic workflows
- Implements the Hadoop FileSystem interface
- No need to modify Spark or Hadoop
- Stocator-COS does not require local HDFS
- As much as 18 times faster for write-intensive workloads than built-in Hadoop object storage connectors*
- Performs up to 30 times fewer operations*
- Fewer operations mean lower overhead and lower cost
(* based on performance testing with the S3a community connector, run with its default parameters: https://arxiv.org/abs/1709.01812)
More information on Stocator can be found at https://arxiv.org/abs/1709.01812.
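As a hedged sketch of what Stocator-based COS access looks like at the Hadoop configuration level (this is roughly what the data connector manages for you), the following Python dictionary collects the Stocator COS properties. The service name `myservice`, the endpoint, and the credential values are all placeholders; substitute your own.

```python
# Hadoop properties for Stocator's COS client. "myservice" is a hypothetical
# service name; the endpoint and credentials below are placeholders.
stocator_cos_conf = {
    "fs.stocator.scheme.list": "cos",
    "fs.cos.impl": "com.ibm.stocator.fs.ObjectStoreFileSystem",
    "fs.stocator.cos.impl": "com.ibm.stocator.fs.cos.COSAPIClient",
    "fs.stocator.cos.scheme": "cos",
    "fs.cos.myservice.endpoint": "https://s3.us-south.cloud-object-storage.appdomain.cloud",
    "fs.cos.myservice.access.key": "<ACCESS_KEY>",
    "fs.cos.myservice.secret.key": "<SECRET_KEY>",
}

# Applied to a running SparkSession (not executed here):
# hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
# for key, value in stocator_cos_conf.items():
#     hadoop_conf.set(key, value)
# df = spark.read.text("cos://mybucket.myservice/README.md")
```

With a data connector, this bookkeeping moves out of your application and into the Spark instance group configuration.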
Data connectors
You can configure multiple data connectors for a Spark instance group and switch between data sources on demand from the cluster management console. When data connectors are configured and deployed with your Spark instance group, you can use them to connect to data sources when you create notebook services and submit Spark batch applications.
| Data Connector Type | Configuration Requirements |
| --- | --- |
| IBM Cloud Object Storage | The access key and secret key are specified inside the application. |
| IBM Spectrum Scale (HDFS Transparency) | |
| HDFS | |
| Kerberos secured HDFS | |
| Kerberos TGT secured HDFS | |
| Multi-User Concurrent access to embedded Derby metastore | This is a special configuration setting that can be turned on or off during Spark instance group registration or modification. When selected, a default data connector is added so that multiple users can concurrently run Spark SQL using Derby as the embedded metastore. If not selected, the embedded Derby metastore works only for the first execution user. Background: Hive requires a metastore for centralized metadata, and by default Spark SQL uses embedded Derby for the Hive metastore, generating db files in the working directory. This causes several problems: embedded Derby does not allow concurrent access, even for the same user; embedded Derby working files have ACL issues with multiple users; and a working directory shared by multiple applications (for example, notebook instances) can cause clashes. |
The available data connectors can be viewed in one of the new RESTful APIs:
[CONDUCTOR_REST_HOST]:[PORT]/platform/rest/conductor/v1/dataconnectors
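As a minimal sketch of calling this endpoint from Python (the host name, port, and credentials below are placeholders, and basic authentication is an assumption; use whatever scheme your cluster's REST services are configured with):

```python
import base64
import urllib.request

def data_connectors_request(host, port, user, password):
    """Build a GET request for the data connector listing endpoint.

    Basic authentication is an assumption here; adjust to match your
    cluster's REST security configuration.
    """
    url = f"http://{host}:{port}/platform/rest/conductor/v1/dataconnectors"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={
        "Authorization": f"Basic {token}",
        "Accept": "application/json",
    })

# Hypothetical host and credentials:
req = data_connectors_request("conductor-host.example.com", 8280, "Admin", "Admin")
print(req.full_url)
# To actually issue the call:
# import json; connectors = json.load(urllib.request.urlopen(req))
```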
Data connectors can be added or removed from Spark instance groups during creation or modification, under a new Data Connectors tab in the wizard.
Here is an example of the creation of an IBM Cloud Object Storage data connector.
Next, users can optionally select which data connectors are active and which one to use as the default. For batch applications, these selections serve as defaults but can be changed at run time. For notebook applications, these values are always used.
You can also run status checks in IBM Spectrum Conductor with Spark 2.2.1 for your data connectors to determine the connection status for your external file system. Status checks can be run from three places in the cluster management console:
- From the Data Connectors tab of the Spark instance group view page.
- From My Applications & Notebooks, select Open Notebook > Notebook data connectors.
- From the Data Connectors heading in the Spark application view. Note that from here, you can only check the status of a data connector before the application has finished running.
Here is what status checking in the cluster management console looks like:
After you test the connection, the status shows as ‘Available’ or ‘Error’. If an error is detected, you can view the error log and retry the status check.
When running a Spark batch application, you can now see a checkbox to enable data connectors. Your default selections from the Spark instance group are applied here, but you can make changes if necessary. You can decide which data connector to use as fs.defaultFS, which can be accessed by “/” in your application code.
The application shown above uses the GPFS data connector for “/”, but multiple file systems can be accessed explicitly in the application. Here is an example of a notebook application. As you can see, the default data connector is being used to access /README.md, but the IBM Cloud Object Storage file system is also being accessed via its own access URI. As noted in the configuration requirement table above, the access key and secret key are specified inside the application.
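The two access styles can be sketched as follows. The bucket and service names are hypothetical examples; the `cos://<bucket>.<service>/<object>` URI shape is Stocator's COS access scheme.

```python
def cos_uri(bucket, service, obj):
    """Compose a Stocator COS access URI: cos://<bucket>.<service>/<object>.

    The bucket and service names passed in below are hypothetical.
    """
    return f"cos://{bucket}.{service}/{obj.lstrip('/')}"

# The default data connector resolves plain paths against fs.defaultFS:
default_path = "/README.md"  # e.g. served by the GPFS connector

# Other configured file systems are addressed explicitly by their own scheme:
explicit_path = cos_uri("mybucket", "myservice", "README.md")
print(explicit_path)  # cos://mybucket.myservice/README.md

# In a notebook, both can be read side by side (not executed here):
# spark.read.text(default_path)
# spark.read.text(explicit_path)
```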
For information on the data connector RESTful APIs, access the RESTful API documentation in your cluster:
http://<hostname>:8280/cloud/apis/explorer/#/DataConnectors
https://<hostname>:8643/cloud/apis/explorer/#/DataConnectors
Key reasons for using IBM Cloud Object Storage with Spark applications
- Capability to store huge amounts of unstructured data in any format
- Resilient data store: stored data will not be lost
- Object store designed to operate through failures
- Various security models: data is secure and protected
- Object stores can be easily accessed for write or read flows
- On-premises, cloud-based, and hybrid deployments
- Analytic job results can be persisted in the object storage
- Object stores make it easy to share data subsets
- Significant data storage TCO savings
If you want to know more about IBM Cloud Object Storage and data connectors, or if you have questions about this blog or its content, post them in our forum! Or, you can also connect with us on our Slack channel!
UID
ibm16163557
