Persisting Spark applications
You can choose how to persist your Spark application job files.
You can save those files:
- As an asset in a deployment space
- In Object Storage
- In a supported volume (NFS, Portworx, or OCS) created by using the Volumes API
Persisting Spark applications in a deployment space
You can persist Spark applications in a deployment space only if the Spark advanced features are enabled. See Using advanced features.
Follow these steps to persist Spark applications as an asset in a deployment space:
- Get the deployment space name from the service instance details page. See Managing Analytics Engine powered by Apache Spark instances.
- From the navigation menu in Cloud Pak for Data, click Deployments and select your space.
- From the Assets page of the space, upload your Spark application.
- Run the application as a persisted asset. Use the following Spark job payload as an example:
{
    "application_details": {
        "application": "/home/spark/space/assets/data_asset/<spark_application_name>",
        "arguments": [""],
        "class": "<main_class>"
    }
}
Persisting Spark applications in Object Storage
The application job files can be stored in an S3-compatible Object Storage bucket. The following steps describe how to do this for an IBM Cloud Object Storage bucket.
Follow these steps to persist a Spark application in IBM Cloud Object Storage:
- Upload the application job file (<OBJECT_NAME>) to an IBM Cloud Object Storage bucket (<BUCKET_NAME>) in an IBM Cloud Object Storage service (<COS_SERVICE_NAME>).
- Ensure that the following Spark environment properties are passed in the payload:
"spark.hadoop.fs.cos.<COS_SERVICE_NAME>.endpoint":"<COS_ENDPOINT>"
"spark.hadoop.fs.cos.<COS_SERVICE_NAME>.secret.key":"<COS_SECRET_KEY>"
"spark.hadoop.fs.cos.<COS_SERVICE_NAME>.access.key":"<COS_ACCESS_KEY>"
- Run the application persisted in IBM Cloud Object Storage. Use the following Spark job payload as an example:
{
    "application_details": {
        "application": "cos://<BUCKET_NAME>.<COS_SERVICE_NAME>/<OBJECT_NAME>",
        "arguments": [
            "cos://<BUCKET_NAME>.<COS_SERVICE_NAME>/<OBJECT_NAME>"
        ],
        "class": "<main_class>",
        "conf": {
            "spark.app.name": "MyJob",
            "spark.hadoop.fs.cos.<COS_SERVICE_NAME>.endpoint": "<COS_ENDPOINT>",
            "spark.hadoop.fs.cos.<COS_SERVICE_NAME>.secret.key": "<COS_SECRET_KEY>",
            "spark.hadoop.fs.cos.<COS_SERVICE_NAME>.access.key": "<COS_ACCESS_KEY>"
        }
    }
}
Persisting Spark applications in a service volume instance
You can persist Spark application job files in any one of the supported IBM Cloud Pak for Data volumes:
- NFS storage
- Portworx
- OCS
To learn how to use a volume instance to create directories and add your application files, see Managing persistent volume instances with the Volumes API.
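For example, the following Python sketch uploads a local application file into the customApps directory of the cpd-instance::vol1 volume that is used in the example that follows. The upload URL shown here (/zen-volumes/<volume>/v1/volumes/files/<encoded_path>) and the form field name are assumptions about the Volumes API; check the documentation linked above for the exact request format.
import requests
from urllib.parse import quote

# Placeholders for your cluster URL, platform bearer token, and volume name.
CPD_URL = "https://<CloudPakforData_URL>"
TOKEN = "<platform_bearer_token>"
VOLUME = "cpd-instance::vol1"

# Target path inside the volume; the file path is URL-encoded here as an assumption.
target_path = quote("customApps/<spark_application>", safe="")

with open("<spark_application>", "rb") as f:
    response = requests.put(
        f"{CPD_URL}/zen-volumes/{VOLUME}/v1/volumes/files/{target_path}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        files={"upFile": f},  # form field name is an assumption
        verify=False,         # only for clusters with self-signed certificates
    )
response.raise_for_status()
print(response.json())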
The following example shows a Spark application that is uploaded under the customApps directory inside the cpd-instance::vol1 volume, which is mounted as /myapp on the Spark cluster. A second volume, cpd-instance::vol2, is mounted as /data.
{
"application_details": {
"application": "/myapp/<spark_application>",
"arguments": [
""
],
"conf": {
"spark.app.name": "JSFVT",
"spark.executor.extraClassPath": "/myapp/*",
"spark.driver.extraClassPath": "/myapp/*"
}
},
"volumes": [
{
"name": "cpd-instance::vol1",
"mount_path": "/myapp",
"source_sub_path": "customApps"
},
{
"name": "cpd-instance::vol2",
"source_sub_path": "",
"mount_path": "/data"
}
]
}
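An application persisted this way can read and write data directly through the mounted paths. The following PySpark sketch is a minimal illustration that matches the payload above; the input file name is hypothetical and must already exist in the cpd-instance::vol2 volume.
from pyspark.sql import SparkSession

# Matches the payload above: application files are mounted at /myapp, data at /data.
spark = SparkSession.builder.appName("JSFVT").getOrCreate()

# Hypothetical input file that was uploaded to the cpd-instance::vol2 volume beforehand.
df = spark.read.csv("/data/<input_file>.csv", header=True, inferSchema=True)

# Write results back to the volume so that they persist after the job finishes.
df.write.mode("overwrite").parquet("/data/output/")

spark.stop()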
The following example shows how you can persist data using volumes in Spark interactive applications (kernels). In this case, the data is uploaded inside the cpd-instance::vol1 volume, which is mounted as /data on the Spark cluster.
{
"name": "scala",
"kernel_size": {
"cpu": 1,
"memory": "1g"
},
"engine": {
"type": "spark",
"conf": {
"spark.ui.reverseProxy": "false",
"spark.eventLog.enabled": "false"
},
"size": {
"num_workers": "2",
"worker_size": {
"cpu": 1,
"memory": "1g"
}
},
"volumes": [
{
"name": "cpd-instance::vol1",
"source_sub_path": "",
"mount_path": "/data"
}
]
}
}
Parent topic: Getting started with Spark applications