Accessing data from applications
Applies to:
Spark engine
Apache Gluten accelerated Spark engine
When you use the Spark runtimes API, you can store the application runtime files and your data files in storage volumes that you manage by using the IBM Cloud Pak for Data Volumes API. Alternatively, you can provision an instance of IBM Cloud Object Storage.
- Working with files in external volumes
In Spark applications that run on Analytics Engine powered by Apache Spark, a common way to reference the Spark runtime files, the input data, or the output data is through external storage volumes that you manage by using the IBM Cloud Pak for Data Volumes API.
You can work with the following external volumes:
- External NFS storage volume
- See Create a volume on an external NFS server.
- Existing persistent volume claim
- See Create a volume in a persistent volume claim.
- New volume instance
- See Managing persistent volume instances.
To learn how to use a volume instance to create directories and add your application files, see Managing persistent volume instances with the Volumes API.
- Working with files in multiple storage volumes
You can use multiple storage volumes when creating the Spark runtime payload. In the following example, the application is in the customApps directory inside the vol1 volume, which is mounted as /myapp on the Spark cluster. The user data is in the vol2 volume, which is mounted as /data on the Spark cluster.
{
  "application_details": {
    "application": "/myapp/<spark_application>",
    "arguments": [
      ""
    ],
    "conf": {
      "spark.app.name": "JSFVT",
      "spark.executor.extraClassPath": "/myapp/*",
      "spark.driver.extraClassPath": "/myapp/*"
    }
  },
  "volumes": [
    {
      "name": "<project_name>::vol1",
      "mount_path": "/myapp",
      "source_sub_path": "customApps"
    },
    {
      "name": "<project_name>::vol2",
      "source_sub_path": "",
      "mount_path": "/data"
    }
  ]
}
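If you generate such payloads from a script, the multi-volume layout described above can be assembled programmatically. The following is a minimal Python sketch, not an official client: the helper name `build_payload` and its tuple format are hypothetical, and the two checks (absolute, non-duplicated mount paths) are local sanity checks assumed for illustration rather than rules documented for the API.

```python
import json


def build_payload(application, volumes, conf=None, arguments=None):
    """Assemble a Spark application payload that mounts several volumes.

    `volumes` is a list of (name, mount_path, source_sub_path) tuples.
    This helper is hypothetical; it only builds the JSON body.
    """
    mount_paths = [mount_path for _, mount_path, _ in volumes]
    for path in mount_paths:
        # Assumed sanity check: mount points are absolute paths.
        if not path.startswith("/"):
            raise ValueError(f"mount path must be absolute: {path!r}")
    # Assumed sanity check: two volumes should not share a mount point.
    if len(set(mount_paths)) != len(mount_paths):
        raise ValueError("each volume needs its own mount path")

    return {
        "application_details": {
            "application": application,
            "arguments": list(arguments or []),
            "conf": dict(conf or {}),
        },
        "volumes": [
            {"name": name, "mount_path": mount_path, "source_sub_path": sub_path}
            for name, mount_path, sub_path in volumes
        ],
    }


# Same layout as the example: application files in vol1, user data in vol2.
payload = build_payload(
    application="/myapp/<spark_application>",
    volumes=[
        ("<project_name>::vol1", "/myapp", "customApps"),
        ("<project_name>::vol2", "/data", ""),
    ],
    conf={
        "spark.app.name": "JSFVT",
        "spark.executor.extraClassPath": "/myapp/*",
        "spark.driver.extraClassPath": "/myapp/*",
    },
)
print(json.dumps(payload, indent=2))
```

The printed JSON can then be used as the request body when you submit the application.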
- Working with files in Object Storage
You can store the runtime files and your data in an S3-compatible Object Storage bucket. The following steps describe how to do this for an IBM Cloud Object Storage bucket.
- Create your application, for example a Python program file
cosExample.py:
from __future__ import print_function

import calendar
import sys
import time

from pyspark.sql import SparkSession

if __name__ == "__main__":
    if len(sys.argv) != 5:
        print("Usage: cosExample <access-key> <secret-key> <endpoint> <bucket>", file=sys.stderr)
        sys.exit(-1)

    spark = SparkSession.builder.appName("CosExample").getOrCreate()

    # Configure the Object Storage connector for the service name "llservice".
    prefix = "fs.cos.llservice"
    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    hconf.set(prefix + ".endpoint", sys.argv[3])
    hconf.set(prefix + ".access.key", sys.argv[1])
    hconf.set(prefix + ".secret.key", sys.argv[2])

    data = [1, 2, 3, 4, 5, 6]
    distData = spark.sparkContext.parallelize(data)
    distData.count()

    # Write the RDD to a timestamp-named path in the bucket, then read it back.
    path = "cos://{}.llservice/{}".format(sys.argv[4], calendar.timegm(time.gmtime()))
    distData.saveAsTextFile(path)

    rdd = spark.sparkContext.textFile(path)
    print("output rdd count: {}".format(rdd.count()))

    spark.stop()
- Load the runtime file. To load the runtime file from an external volume, upload cosExample.py under the customApps directory in the storage volume vol1, which is mounted as /myapp in the Spark cluster:
{
  "application_details": {
    "application": "/myapp/cosExample.py",
    "arguments": ["<ACCESS_KEY>", "<COS_SECRET_KEY>", "<COS_ENDPOINT>", "<BUCKET_NAME>"],
    "class": "org.apache.spark.deploy.SparkSubmit",
    "conf": {
      "spark.app.name": "Job1",
      "spark.executor.extraClassPath": "/myapp/*",
      "spark.driver.extraClassPath": "/myapp/*"
    }
  },
  "volumes": [
    {
      "name": "<project_name>::vol1",
      "mount_path": "/myapp",
      "source_sub_path": "customApps"
    }
  ]
}
- Alternatively, to load the runtime file from an IBM Cloud Object Storage bucket, upload the runtime file as <OBJECT_NAME> to the bucket <BUCKET_NAME> in the IBM Cloud Object Storage service (<COS_SERVICE_NAME>):
{
  "application_details": {
    "application": "cos://<BUCKET_NAME>.<COS_SERVICE_NAME>/<OBJECT_NAME>",
    "arguments": ["cos://<BUCKET_NAME>.<COS_SERVICE_NAME>/<OBJECT_NAME>"],
    "class": "<main_class>",
    "conf": {
      "spark.app.name": "MyJob",
      "spark.hadoop.fs.cos.<COS_SERVICE_NAME>.endpoint": "<COS_ENDPOINT>",
      "spark.hadoop.fs.cos.<COS_SERVICE_NAME>.secret.key": "<COS_SECRET_KEY>",
      "spark.hadoop.fs.cos.<COS_SERVICE_NAME>.access.key": "<COS_ACCESS_KEY>"
    }
  }
}
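The service name has to match in two places: in the cos:// URI and in the fs.cos.<COS_SERVICE_NAME>.* configuration keys. The following Python sketch shows one way to derive both from the same value so that they cannot drift apart. The helper names, the bucket `my-bucket`, the object `jobs/app.jar`, the class `com.example.Main`, and the environment variable names are all hypothetical and chosen only for illustration; `llservice` is the service name used in cosExample.py.

```python
import os


def cos_uri(bucket, service, object_name):
    """Build a cos://<bucket>.<service>/<object> URI."""
    return f"cos://{bucket}.{service}/{object_name.lstrip('/')}"


def cos_spark_conf(service, endpoint, access_key, secret_key):
    """Return the spark.hadoop.fs.cos.<service>.* settings for one service."""
    prefix = f"spark.hadoop.fs.cos.{service}"
    return {
        f"{prefix}.endpoint": endpoint,
        f"{prefix}.access.key": access_key,
        f"{prefix}.secret.key": secret_key,
    }


def build_cos_payload(bucket, service, object_name, main_class,
                      endpoint, access_key, secret_key, app_name="MyJob"):
    """Assemble a payload whose application file lives in Object Storage."""
    uri = cos_uri(bucket, service, object_name)
    conf = {"spark.app.name": app_name}
    conf.update(cos_spark_conf(service, endpoint, access_key, secret_key))
    return {
        "application_details": {
            "application": uri,
            "arguments": [uri],  # mirrors the example payload
            "class": main_class,
            "conf": conf,
        }
    }


# Hypothetical values; credentials are read from assumed environment
# variables so that they are not hard-coded in the script.
payload = build_cos_payload(
    bucket="my-bucket",
    service="llservice",
    object_name="jobs/app.jar",
    main_class="com.example.Main",
    endpoint=os.environ.get("COS_ENDPOINT", "<COS_ENDPOINT>"),
    access_key=os.environ.get("COS_ACCESS_KEY", "<COS_ACCESS_KEY>"),
    secret_key=os.environ.get("COS_SECRET_KEY", "<COS_SECRET_KEY>"),
)

# Print only the non-secret parts of the payload.
print(payload["application_details"]["application"])
print(sorted(payload["application_details"]["conf"]))
```

Deriving the URI and the configuration keys from one `service` argument is a design choice of this sketch; a mismatch between the two is an easy mistake to make when the payload is written by hand.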