Persisting Spark applications

You can choose how to persist your Spark application job files. You can save the files:

  • In a deployment space
  • In Object Storage
  • In a service volume instance

Persisting Spark applications in a deployment space

You can persist Spark applications in a deployment space only if the Spark advanced features are enabled. See Using advanced features.

Follow these steps to persist Spark applications as an asset in a deployment space:

  1. Get the deployment space name from the service instance details page. See Managing Analytics Engine powered by Apache Spark instances.

  2. From the Cloud Pak for Data navigation menu, click Deployments and select your space.

  3. From the Assets page of the space, upload your Spark application.

  4. Run the application as a persisted asset. Use the following Spark job payload as an example:

    {
       "application_details": {
        "application": "/home/spark/space/assets/data_asset/<spark_application_name>",
        "arguments": [""],
        "class": "<main_class>"
      }
    }
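
If you submit Spark jobs through the Analytics Engine powered by Apache Spark jobs REST API, a request like the following minimal sketch sends that payload. The endpoint path, instance ID, and token shown here are placeholder assumptions, not values from this documentation; get the exact jobs endpoint and authentication details from your service instance details page.

import requests

# Placeholder values (assumptions): replace with the jobs endpoint and access
# token for your Analytics Engine powered by Apache Spark instance. See the
# service instance details page for the exact endpoint.
jobs_endpoint = "https://<cpd_host>/v4/analytics_engines/<instance_id>/spark_applications"
token = "<access_token>"

payload = {
    "application_details": {
        "application": "/home/spark/space/assets/data_asset/<spark_application_name>",
        "arguments": [""],
        "class": "<main_class>"
    }
}

response = requests.post(
    jobs_endpoint,
    json=payload,
    headers={"Authorization": f"Bearer {token}"},
    verify=False  # or supply the cluster CA bundle instead of disabling verification
)
print(response.status_code, response.json())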
    

Persisting Spark applications in Object Storage

Application job files can be stored in an S3-compatible Object Storage bucket. The following steps describe how to do this with an IBM Cloud Object Storage bucket.

Follow these steps to persist a Spark application in IBM Cloud Object Storage:

  1. Upload the application job file (<OBJECT_NAME>) to an IBM Cloud Object Storage bucket (<BUCKET_NAME>) in an IBM Cloud Object Storage service (<COS_SERVICE_NAME>).

  2. Ensure that the following Spark environment properties are passed in the payload:

    "spark.hadoop.fs.cos.<COS_SERVICE_NAME>.endpoint":"<COS_ENDPOINT>"
    "spark.hadoop.fs.cos.<COS_SERVICE_NAME>.secret.key":"<COS_SECRET_KEY>"
    "spark.hadoop.fs.cos.<COS_SERVICE_NAME>.access.key":"<COS_ACCESS_KEY>"
    
  3. Run the application persisted in IBM Cloud Object Storage. Use the following Spark job payload as an example:

    {
      "application_details": {
        "application": "cos://<BUCKET_NAME>.<COS_SERVICE_NAME>/<OBJECT_NAME>",
        "arguments": [
          "cos://<BUCKET_NAME>.<COS_SERVICE_NAME>/<OBJECT_NAME>"
        ],
        "class": "<main_class>",
        "conf": {
          "spark.app.name": "MyJob",
          "spark.hadoop.fs.cos.<COS_SERVICE_NAME>.endpoint": "<COS_ENDPOINT>",
          "spark.hadoop.fs.cos.<COS_SERVICE_NAME>.secret.key": "<COS_SECRET_KEY>",
          "spark.hadoop.fs.cos.<COS_SERVICE_NAME>.access.key": "<COS_ACCESS_KEY>"
        }
      }
    }
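
In this payload, the cos:// URI of the object is passed both as the application to run and as an argument. As a minimal sketch, assuming the job is a PySpark application (in which case the class entry can be omitted), the application can read the object at that URI directly, because the endpoint and keys are supplied through the spark.hadoop.fs.cos.* properties:

import sys

from pyspark.sql import SparkSession

# Minimal sketch: read the object whose cos:// URI is passed as the first
# argument in the payload. The COS endpoint, access key, and secret key come
# from the spark.hadoop.fs.cos.<COS_SERVICE_NAME>.* properties in "conf",
# so no credentials appear in the application code.
spark = SparkSession.builder.appName("MyJob").getOrCreate()

input_uri = sys.argv[1]  # for example: cos://<BUCKET_NAME>.<COS_SERVICE_NAME>/<OBJECT_NAME>
df = spark.read.text(input_uri)  # assumes a text-based object
print(f"Read {df.count()} lines from {input_uri}")

spark.stop()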
    

Persisting Spark applications in a service volume instance

You can persist Spark application job files in any one of the supported IBM Cloud Pak for Data volumes:

  • NFS storage
  • Portworx
  • OCS

To learn how to use a volume instance to create directories and add your application files, see Managing persistent volume instances with the Volumes API.

The following example shows a Spark application that is uploaded under the customApps directory inside the cpd-instance::vol1 volume, which is mounted as /myapp on the Spark cluster. An additional volume, cpd-instance::vol2, is mounted as /data.

{
  "application_details": {
    "application": "/myapp/<spark_application>",
    "arguments": [
      ""
    ],
    "conf": {
      "spark.app.name": "JSFVT",
      "spark.executor.extraClassPath": "/myapp/*",
      "spark.driver.extraClassPath": "/myapp/*"
    }
  },
  "volumes": [
    {
      "name": "cpd-instance::vol1",
      "mount_path": "/myapp",
      "source_sub_path": "customApps"
    },
    {
      "name": "cpd-instance::vol2",
      "source_sub_path": "",
      "mount_path": "/data"
    }
  ]
}
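
Inside the application, the mounted volumes behave like local directories on the driver and the executors. The following minimal sketch assumes a PySpark application and a hypothetical input.csv file that was previously uploaded to the cpd-instance::vol2 volume; anything written under /data is persisted in that volume after the job finishes.

from pyspark.sql import SparkSession

# Minimal sketch: the volume mounts declared in the payload appear as local
# paths (/myapp and /data) on the driver and the executors.
spark = SparkSession.builder.appName("JSFVT").getOrCreate()

# input.csv is a hypothetical file that was uploaded to the vol2 volume.
df = spark.read.csv("file:///data/input.csv", header=True)

# Writing under /data persists the output in the cpd-instance::vol2 volume.
df.write.mode("overwrite").parquet("file:///data/output")

spark.stop()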

The following example shows how you can persist data by using volumes in Spark interactive applications (kernels). In this case, the data is uploaded to the cpd-instance::vol1 volume, which is mounted as /data on the Spark cluster.

{
  "name": "scala",
  "kernel_size": {
    "cpu": 1,
    "memory": "1g"
  },
  "engine": {
    "type": "spark",
    "conf": {
      "spark.ui.reverseProxy": "false",
      "spark.eventLog.enabled": "false"
    },
    "size": {
      "num_workers": "2",
      "worker_size": {
        "cpu": 1,
        "memory": "1g"
      }
    },
    "volumes": [
      {
        "name": "cpd-instance::vol1",
        "source_sub_path": "",
        "mount_path": "/data"
      }
    ]
  }
}

Parent topic: Getting started with Spark applications