Technical Blog Post
Abstract
Using data connectors to work with IBM Cloud Object Storage in IBM Spectrum Conductor with Spark 2.2.1
Body
Analytics and cognitive computing are driving some of the fastest-growing use cases for consuming large amounts of unstructured data across industries. Spark is one of the most popular open source tools for data processing in advanced analytics environments.
Spark analytics can be run directly against IBM Cloud Object Storage. Traditionally, Spark jobs have run on local HDFS clusters; however, the use of public cloud object storage for analytics use cases is accelerating.
IBM Cloud Object Storage provides a flexible, scalable, and simple way to provide data storage for Spark users, as well as backup and/or archive for Hadoop-based data stores. In addition, it enables a much more cost-effective solution for data consolidation and retention as compared to expensive HDFS deployments.
How data connectors in IBM Spectrum Conductor with Spark 2.2.1 fit in
Before this feature, if you wanted the hosts in your Spark instance group to be able to connect to an external data source (IBM Cloud Object Storage, GPFS, HDFS, etc.), there was a lot of extra configuration required on those hosts. In the context of IBM Spectrum Conductor with Spark 2.2.1, a data connector contains the type, the URI, the authentication method, and all of the required libraries to access the data source. The following is an example of a problem that users might run into:
- You have 3 data sources on 1 Spark instance group. UserA’s application needs data1 and data3. UserB’s application needs data1 and data2.
- Without our data connector implementation, your HADOOP_CONFIG_DIR needs to point to the location with the configuration files for each user’s particular data connector set.
- Every user in this case would need their own configuration set up on the hosts that are used to access the data. This is impossible because there is only 1 configuration location per Spark instance group. A similar gap exists when multiple users access the same data source but their configurations differ.
Advantages of using data connectors
Data connectors greatly simplify data source administration and configuration, and separate the credential management and data usage from applications.
- Instead, we dynamically load the configurations that are specified in the Spark instance group’s data connector and user settings.
- Users select the data sources to which they need access at batch application submission time.
In addition, there are some advantages to using the IBM Cloud Object Storage data connector, specifically with its included “Stocator” connector package.
- Advanced connector uniquely designed for object storage
- Interacts with the object store directly; does not use complex Hadoop modules
- Supports analytic workflows
- Implements the Hadoop FileSystem interface
- No need to modify Spark or Hadoop
- Stocator-COS does not require local HDFS
- As much as 18 times faster for write-intensive workloads than built-in Hadoop object storage connectors*
- Performs up to 30 times fewer operations*
- Fewer operations mean lower overhead and lower cost
(* based on performance testing with the S3a community connector, run with its default parameters: https://arxiv.org/abs/1709.01812)
More information on Stocator can be found at https://arxiv.org/abs/1709.01812.
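As a hedged sketch of what Stocator-based COS access looks like at the Hadoop configuration level (this is roughly what the data connector manages for you), the following Python dictionary collects the Stocator COS properties. The service name `myservice`, the endpoint, and the credential values are all placeholders; substitute your own.

```python
# Hadoop properties for Stocator's COS client. "myservice" is a hypothetical
# service name; the endpoint and credentials below are placeholders.
stocator_cos_conf = {
    "fs.stocator.scheme.list": "cos",
    "fs.cos.impl": "com.ibm.stocator.fs.ObjectStoreFileSystem",
    "fs.stocator.cos.impl": "com.ibm.stocator.fs.cos.COSAPIClient",
    "fs.stocator.cos.scheme": "cos",
    "fs.cos.myservice.endpoint": "https://s3.us-south.cloud-object-storage.appdomain.cloud",
    "fs.cos.myservice.access.key": "<ACCESS_KEY>",
    "fs.cos.myservice.secret.key": "<SECRET_KEY>",
}

# Applied to a running SparkSession (not executed here):
# hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
# for key, value in stocator_cos_conf.items():
#     hadoop_conf.set(key, value)
# df = spark.read.text("cos://mybucket.myservice/README.md")
```

With a data connector, this bookkeeping moves out of your application and into the Spark instance group configuration.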
Data connectors
You can configure multiple data connectors for a Spark instance group and switch between data sources on demand from the cluster management console. When data connectors are configured and deployed with your Spark instance group, you can use them to connect to data sources when you create notebook services and submit Spark batch applications.
| Data Connector Type | Configuration Requirements |
| --- | --- |
| IBM Cloud Object Storage | The access key and secret key are specified inside the application. |
| IBM Spectrum Scale (HDFS Transparency) | |
| HDFS | |
| Kerberos secured HDFS | |
| Kerberos TGT secured HDFS | |
| Multi-User Concurrent access to embedded Derby metastore | This is a special configuration setting that can be turned on or off during Spark instance group registration or modification. When selected, a default data connector is added so that multiple users can concurrently run Spark SQL using Derby as the embedded metastore. If not selected, the embedded Derby metastore works only for the first execution user. Background: Hive requires a metastore for centralized metadata, and by default Spark SQL uses embedded Derby for the Hive metastore, generating db files in the working directory. This causes several problems: embedded Derby does not allow concurrent access, even for the same user; embedded Derby working files have ACL issues with multiple users; and a working directory shared by multiple applications (for example, notebook instances) can cause clashes. |
The available data connectors can be viewed in one of the new RESTful APIs:
[CONDUCTOR_REST_HOST]:[PORT]/platform/rest/conductor/v1/dataconnectors
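As a minimal sketch of calling this endpoint from Python (the host name, port, and credentials below are placeholders, and basic authentication is an assumption; use whatever scheme your cluster's REST services are configured with):

```python
import base64
import urllib.request

def data_connectors_request(host, port, user, password):
    """Build a GET request for the data connector listing endpoint.

    Basic authentication is an assumption here; adjust to match your
    cluster's REST security configuration.
    """
    url = f"http://{host}:{port}/platform/rest/conductor/v1/dataconnectors"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={
        "Authorization": f"Basic {token}",
        "Accept": "application/json",
    })

# Hypothetical host and credentials:
req = data_connectors_request("conductor-host.example.com", 8280, "Admin", "Admin")
print(req.full_url)
# To actually issue the call:
# import json; connectors = json.load(urllib.request.urlopen(req))
```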
Data connectors can be added or removed from Spark instance groups during creation or modification, under a new Data Connectors tab in the wizard.
Here is an example of the creation of an IBM Cloud Object Storage data connector.
Next, users can optionally select which data connectors are active and which one to use as the default. For batch applications, these selections serve as defaults but can be changed at run time. For notebook applications, these values are always used.
You can also run status checks in IBM Spectrum Conductor with Spark 2.2.1 for your data connectors to determine the connection status for your external file system. Status checks can be run from three places in the cluster management console:
- From the Data Connectors tab of the Spark instance group view page.
- From My Applications & Notebooks, select Open Notebook > Notebook data connectors.
- From the Data Connectors heading in the Spark application view. Note that from here, you can only check the status of a data connector before the application has finished running.
Here is what status checking in the cluster management console looks like:
After you test the connection, the status shows as ‘Available’ or ‘Error’. If an error is detected, you can view the error log and retry the status check.
When running a Spark batch application, you can now see a checkbox to enable data connectors. Your default selections from the Spark instance group are applied here, but you can make changes if necessary. You can decide which data connector to use as fs.defaultFS, which can be accessed by “/” in your application code.
The application shown above uses the GPFS data connector for “/”, but multiple file systems can be accessed explicitly in the application. Here is an example of a notebook application. As you can see, the default data connector is being used to access /README.md, but the IBM Cloud Object Storage file system is also being accessed via its own access URI. As noted in the configuration requirement table above, the access key and secret key are specified inside the application.
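The two access styles can be sketched as follows. The bucket and service names are hypothetical examples; the `cos://<bucket>.<service>/<object>` URI shape is Stocator's COS access scheme.

```python
def cos_uri(bucket, service, obj):
    """Compose a Stocator COS access URI: cos://<bucket>.<service>/<object>.

    The bucket and service names passed in below are hypothetical.
    """
    return f"cos://{bucket}.{service}/{obj.lstrip('/')}"

# The default data connector resolves plain paths against fs.defaultFS:
default_path = "/README.md"  # e.g. served by the GPFS connector

# Other configured file systems are addressed explicitly by their own scheme:
explicit_path = cos_uri("mybucket", "myservice", "README.md")
print(explicit_path)  # cos://mybucket.myservice/README.md

# In a notebook, both can be read side by side (not executed here):
# spark.read.text(default_path)
# spark.read.text(explicit_path)
```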
For information on the data connector RESTful APIs, access the RESTful API documentation in your cluster:
http://<hostname>:8280/cloud/apis/explorer/#/DataConnectors
https://<hostname>:8643/cloud/apis/explorer/#/DataConnectors
Key reasons for using IBM Cloud Object Storage with Spark applications
- Capability to store huge amounts of unstructured data in any format
- Resilient data store: stored data will not be lost
- Object store designed to operate through failures
- Various security models: data is secure and protected
- Object stores can be easily accessed for write or read flows
- On-premises, cloud-based, and hybrid deployments
- Analytic job results can be persisted in the object storage
- Object stores make it easy to share data subsets
- Significant data storage TCO savings
If you want to know more about IBM Cloud Object Storage and data connectors, or if you have questions about this blog or its content, post them in our forum! Or, you can also connect with us on our Slack channel!
UID
ibm16163557
