Persisting Spark applications
You can choose where to persist your Spark application job files. You can save the files:
- As an asset in a deployment space
- In Object Storage
- In a supported volume (NFS, Portworx, or OCS) created by using the Volumes API
Persisting Spark applications in a deployment space
You can persist Spark applications in a deployment space only if the Spark advanced features are enabled. See Using advanced features.
Follow these steps to persist Spark applications as an asset in a deployment space:
- Get the deployment space name from the service instance details page. See Managing Analytics Engine powered by Apache Spark instances.
- From the navigation menu in Cloud Pak for Data, click Deployments and select your space.
- From the Assets page of the space, upload your Spark application.
- Run the application as a persisted asset. Use the following Spark job payload as an example (a submission sketch follows the payload):
{
  "application_details": {
    "application": "/home/spark/space/assets/data_asset/<spark_application_name>",
    "arguments": [""],
    "class": "<main_class>"
  }
}
Persisting Spark applications in Object Storage
The application job files can be stored in an S3-compatible Object Storage bucket. Follow these steps to persist a Spark application in an IBM Cloud Object Storage bucket:
- Upload the application job file (<OBJECT_NAME>) to an IBM Cloud Object Storage bucket (<BUCKET_NAME>) in an IBM Cloud Object Storage service (<COS_SERVICE_NAME>).
- Ensure that the following Spark environment properties are passed in the payload:
"spark.hadoop.fs.cos.<COS_SERVICE_NAME>.endpoint": "<COS_ENDPOINT>"
"spark.hadoop.fs.cos.<COS_SERVICE_NAME>.secret.key": "<COS_SECRET_KEY>"
"spark.hadoop.fs.cos.<COS_SERVICE_NAME>.access.key": "<COS_ACCESS_KEY>"
- Run the application persisted in IBM Cloud Object Storage. Use the following Spark job payload as an example (a sketch of the application itself follows the payload):
{
  "application_details": {
    "application": "cos://<BUCKET_NAME>.<COS_SERVICE_NAME>/<OBJECT_NAME>",
    "arguments": ["cos://<BUCKET_NAME>.<COS_SERVICE_NAME>/<OBJECT_NAME>"],
    "class": "<main_class>",
    "conf": {
      "spark.app.name": "MyJob",
      "spark.hadoop.fs.cos.<COS_SERVICE_NAME>.endpoint": "<COS_ENDPOINT>",
      "spark.hadoop.fs.cos.<COS_SERVICE_NAME>.secret.key": "<COS_SECRET_KEY>",
      "spark.hadoop.fs.cos.<COS_SERVICE_NAME>.access.key": "<COS_ACCESS_KEY>"
    }
  }
}
Persisting Spark applications in a service volume instance
You can persist Spark application job files in a volume that is backed by any of the supported IBM Cloud Pak for Data storage types:
- NFS storage
- Portworx
- OCS
To learn how to use a volume instance to create directories and add your application files, see Managing persistent volume instances with the Volumes API.
The following example shows a Spark application that is uploaded under the customApps directory inside the cpd-instance::vol1 volume, which is mounted as /myapp on the Spark cluster. There is an additional volume, cpd-instance::vol2, which is mounted as /data.
{
"application_details": {
"application": "/myapp/<spark_application>",
"arguments": [
""
],
"conf": {
"spark.app.name": "JSFVT",
"spark.executor.extraClassPath": "/myapp/*",
"spark.driver.extraClassPath": "/myapp/*"
}
},
"volumes": [
{
"name": "cpd-instance::vol1",
"mount_path": "/myapp",
"source_sub_path": "customApps"
},
{
"name": "cpd-instance::vol2",
"source_sub_path": "",
"mount_path": "/data"
}
]
}
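As a minimal sketch (not part of the documented payload), assuming a PySpark application and hypothetical file names, the submitted job can use the mounted paths like local file system paths, for example reading input from and writing results to the /data volume:

# Minimal sketch of an application that uses the volumes mounted by the
# payload above: /myapp holds the uploaded application and any extra jars
# on the driver/executor class path, /data is a second persistent volume.
# input.csv and output/ are hypothetical names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JSFVT").getOrCreate()

# Read input that was previously uploaded to the cpd-instance::vol2 volume,
# which the payload mounts at /data.
df = spark.read.csv("/data/input.csv", header=True, inferSchema=True)

# Write results back to the same volume so that they persist after the job ends.
df.write.mode("overwrite").parquet("/data/output")

spark.stop()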
The following example shows how you can persist data by using volumes in Spark interactive applications (kernels). In this case, the data is uploaded inside the cpd-instance::vol1 volume, which is mounted as /data on the Spark cluster.
{
"name": "scala",
"kernel_size": {
"cpu": 1,
"memory": "1g"
},
"engine": {
"type": "spark",
"conf": {
"spark.ui.reverseProxy": "false",
"spark.eventLog.enabled": "false"
},
"size": {
"num_workers": "2",
"worker_size": {
"cpu": 1,
"memory": "1g"
}
},
"volumes": [
{
"name": "cpd-instance::vol1",
"source_sub_path": "",
"mount_path": "/data"
}
]
}
}
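The payload above requests a Scala kernel. For illustration only, and assuming an equivalent Python kernel, the following sketch shows how code running in the kernel could use the volume that is mounted at /data; the file name is hypothetical:

# Minimal sketch of code run inside an interactive kernel that was created
# with a volume mounted at /data (the payload above shows a Scala kernel;
# a Python kernel would access the mount in the same way).
import os

# List what is already stored on the persistent volume.
print(os.listdir("/data"))

# Anything written under /data persists in the cpd-instance::vol1 volume
# after the kernel stops. results.txt is a hypothetical file name.
with open("/data/results.txt", "w") as f:
    f.write("computed results\n")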
Parent topic: Getting started with Spark applications