Customizing Spark applications in a service volume instance

You can persist packages to use in Spark applications in a service volume instance.

Using custom Python packages in a service volume instance

This section explains how to upload custom Python packages to a service volume instance and use them when you submit PySpark applications as Spark jobs.

Let's assume you want to use the wget Python package, which you upload to a folder in one of the instance volumes, in your PySpark application:

  1. Generate an authorization token. For more information, see Generating an API authorization token.
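
    The cURL commands in the following steps read the token from the TOKEN environment variable. As a minimal sketch, assuming that the ZenApiKey token is the base64 encoding of <username>:<API_key> (see Generating an API authorization token for the authoritative steps), you can compute the value to export as follows:

    import base64

    # Assumption: the ZenApiKey token is base64("<username>:<api_key>").
    # Replace the placeholders with your own credentials.
    username = "<your_username>"
    api_key = "<your_api_key>"
    token = base64.b64encode(f"{username}:{api_key}".encode("utf-8")).decode("utf-8")

    # Export the printed value before running the cURL commands, for example:
    #   export TOKEN=<printed_value>
    print(token)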

  2. Create a new volume named appvol by running the following cURL command. Specify the storageClass as managed-nfs-storage if NFS storage is set up on your cluster, or as ocs-storagecluster-cephfs if OCS storage is set up on your cluster.

    curl -vk -iv -X POST "https://<CloudPakforData_URL>/zen-data/v2/serviceInstance" -H "Authorization: ZenApiKey ${TOKEN}" -H 'Content-Type: application/json' -d '{"createArguments": {"metadata": {"storageClass": "managed-nfs-storage", "storageSize": "2Gi"}, "resources": {}, "serviceInstanceDescription": "volume 1"}, "preExistingOwner": false, "serviceInstanceDisplayName": "appvol", "serviceInstanceType": "volumes", "serviceInstanceVersion": "-", "transientFields": {}}'
    
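    If you prefer to script the call, the following sketch sends the same request with the Python requests library (an assumed choice of HTTP client); <CloudPakforData_URL> and the token are the same placeholders that are used in the cURL command:

    import requests

    base_url = "https://<CloudPakforData_URL>"  # placeholder, as in the cURL command
    token = "<your_token>"                      # token from step 1
    headers = {"Authorization": f"ZenApiKey {token}", "Content-Type": "application/json"}

    payload = {
        "createArguments": {
            "metadata": {"storageClass": "managed-nfs-storage", "storageSize": "2Gi"},
            "resources": {},
            "serviceInstanceDescription": "volume 1",
        },
        "preExistingOwner": False,
        "serviceInstanceDisplayName": "appvol",
        "serviceInstanceType": "volumes",
        "serviceInstanceVersion": "-",
        "transientFields": {},
    }

    # verify=False mirrors the -k flag (skip TLS verification); remove it if your
    # cluster certificate is trusted.
    response = requests.post(
        f"{base_url}/zen-data/v2/serviceInstance",
        headers=headers, json=payload, verify=False,
    )
    print(response.status_code, response.text)
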
  3. Start the file server on volume appvol, to which you want to upload your Python packages, by using the following cURL command:

    curl -v -ik -X POST 'https://<CloudPakforData_URL>/zen-data/v1/volumes/volume_services/appvol' -H "Authorization: ZenApiKey ${TOKEN}" -d '{}' -H 'Content-Type: application/json' -H 'cache-control: no-cache'
    
  4. Upload the Python package from your local workstation to volume appvol at location pippackages:

    curl -v -ik  -X PUT 'https://<CloudPakforData_URL>/zen-volumes/appvol/v1/volumes/files/pippackages%2Fwget.py'  -H "Authorization: ZenApiKey ${TOKEN}" -H 'cache-control: no-cache' -H 'content-type: multipart/form-data' -F 'upFile=@/root/packages/anaconda3/lib/python3.7/site-packages/wget.py'
    

    In the sample code, /root/packages/anaconda3/lib/python3.7/site-packages/wget.py is the location on your local workstation of the package that you want to upload. Note that you must specify the absolute path of the package on your local workstation.
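
    Equivalently, a minimal sketch of the same multipart upload with the Python requests library (an assumed choice of HTTP client); the local path, URL, and token are the same placeholders that are used in the cURL command:

    import requests

    base_url = "https://<CloudPakforData_URL>"  # placeholder
    token = "<your_token>"                      # token from step 1
    headers = {"Authorization": f"ZenApiKey {token}"}

    # %2F is the URL-encoded "/" that separates the target directory (pippackages)
    # from the file name (wget.py) on the volume.
    url = f"{base_url}/zen-volumes/appvol/v1/volumes/files/pippackages%2Fwget.py"

    # Send the local file as the upFile form field; requests sets the multipart
    # content type automatically. verify=False mirrors the -k flag.
    with open("/root/packages/anaconda3/lib/python3.7/site-packages/wget.py", "rb") as f:
        response = requests.put(url, headers=headers, files={"upFile": f}, verify=False)
    print(response.status_code, response.text)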

  5. Alternatively, upload the Python package as a compressed archive (for example, a .tar.gz file) and extract it on volume appvol:

    curl -k -X PUT 'https://<CloudPakforData_URL>/zen-volumes/appvol/v1/volumes/files/pippackages?extract=true' -H "Authorization: ZenApiKey ${TOKEN}" -H 'Content-Type: multipart/form-data' -F upFile='@/Users/test-user/test-data/upload_extract.tar.gz'
    

    In the sample code, /Users/test-user/test-data/upload_extract.tar.gz is the archive on your local workstation that is uploaded and extracted to the pippackages directory on volume appvol.
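
    If your package consists of more than one file, you can build the archive locally before you upload it. A minimal sketch with Python's tarfile module; the source directory is a hypothetical example:

    import tarfile

    # Package the contents of a local directory (hypothetical path) into the
    # .tar.gz archive that is uploaded with the extract=true option shown above.
    with tarfile.open("/Users/test-user/test-data/upload_extract.tar.gz", "w:gz") as tar:
        tar.add("/Users/test-user/test-data/packages", arcname=".")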

  6. Upload your PySpark application to volume appvol at location customApps:

    curl -v -ik  -X PUT 'https://<CloudPakforData_URL>/zen-volumes/appvol/v1/volumes/files/customApps%2Fexample.py'  -H "Authorization: ZenApiKey ${TOKEN}"   -H 'cache-control: no-cache' -H 'content-type: multipart/form-data' -F 'upFile=@/root/jobs/HBCPT/wgetExample/example.py'
    
  7. Submit your PySpark application, which uses the packages that you uploaded to volume appvol at location pippackages, as a Spark job:

    curl -ivk -X POST -d @payload.json -H "Authorization: ZenApiKey ${TOKEN}" "<job_API_endpoint>"
    

    You can get the Spark submit jobs endpoint from the service instance details page. See Managing Analytics Engine powered by Apache Spark instances.

    The payload.json for Python 3.9:

    {
      "application_details": {
        "application": "/myapp/customApps/example.py",
        "arguments": ["<your_application_arguments>"],
        "conf": {
          "spark.app.name": "MyJob",
          "spark.eventLog.enabled": "true"
        },
        "env": {
          "RUNTIME_PYTHON_ENV": "python39",
          "PYTHONPATH": "/myapp/pippackages:/home/spark/space/assets/data_asset:/home/spark/user_home/python-3:/cc-home/_global_/python-3:/home/spark/shared/user-libs/python:/home/spark/shared/conda/envs/python/lib/python/site-packages:/opt/ibm/conda/miniconda/lib/python/site-packages:/opt/ibm/third-party/libs/python3:/opt/ibm/image-libs/python3:/opt/ibm/image-libs/spark2/metaindexmanager.jar:/opt/ibm/image-libs/spark2/stmetaindexplugin.jar:/opt/ibm/spark/python:/opt/ibm/spark/python/lib/py4j-0.10.7-src.zip"
        }
      },
      "volumes": [{
        "name": "appvol",
        "mount_path": "/myapp",
        "source_sub_path": ""
      }]
    }
    

    You can get the value of the PYTHONPATH environment variable by printing it from your application so that it appears in the Spark job driver log. In the sample payload, /myapp/pippackages in PYTHONPATH is the location to which the Python package was uploaded (volume appvol is mounted at /myapp).
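
    For example, a line like the following near the top of your application writes the effective search path to the driver log:

    import os

    # Print the effective PYTHONPATH so that it appears in the Spark job driver log.
    print("PYTHONPATH =", os.environ.get("PYTHONPATH", ""))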

    Instead of setting PYTHONPATH in the job payload, you can make the Python packages visible to your PySpark application by adding the following lines to the top of the application:

    import sys
    sys.path.append('/myapp/pippackages/')
    
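    For example, a minimal example.py along these lines uses the uploaded wget package; the download URL and output path are hypothetical, and the SparkSession is created only to show that the import works inside a Spark job:

    import sys

    # Make the uploaded packages importable (same effect as setting PYTHONPATH in the payload).
    sys.path.append('/myapp/pippackages/')

    import wget
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("MyJob").getOrCreate()

    # Hypothetical example: download a file with the wget package onto the driver
    # and log where it landed; replace this with your own application logic.
    downloaded = wget.download("https://example.com/data.csv", out="/tmp/data.csv")
    print("downloaded to", downloaded)

    spark.stop()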

Parent topic: Using custom packages