Customizing Spark applications in a service volume instance
You can persist packages to use in Spark applications in a service volume instance.
Using custom Python packages in a service volume instance
This section explains how to upload custom Python packages to a service volume instance and use them when you submit PySpark applications as Spark jobs.
Let's assume that you want to use the wget Python package in your PySpark application and that you downloaded the package to your local workstation. To make the package available to your application, upload it to a folder in one of the instance volumes:
- Generate an authorization token. For more information, see Generating an API authorization token.
- Create a new volume named appvol by running the following cURL command. Specify the storageClass as managed-nfs-storage if NFS storage is set up on your cluster, or as ocs-storagecluster-cephfs if OCS storage is set up on your cluster:
curl -vk -iv -X POST "https://<CloudPakforData_URL>/zen-data/v2/serviceInstance" -H "Authorization: ZenApiKey ${TOKEN}" -H 'Content-Type: application/json' -d '{"createArguments": {"metadata": {"storageClass": "managed-nfs-storage", "storageSize": "2Gi"}, "resources": {}, "serviceInstanceDescription": "volume 1"}, "preExistingOwner": false, "serviceInstanceDisplayName": "appvol", "serviceInstanceType": "volumes", "serviceInstanceVersion": "-", "transientFields": {}}'
- Start the file server on volume appvol, to which you want to upload your Python packages, by using the following cURL command:
curl -v -ik -X POST 'https://<CloudPakforData_URL>/zen-data/v1/volumes/volume_services/appvol' -H "Authorization: ZenApiKey ${TOKEN}" -d '{}' -H 'Content-Type: application/json' -H 'cache-control: no-cache'
- Upload the Python package from your local workstation to volume appvol at location pippackages:
curl -v -ik -X PUT 'https://<CloudPakforData_URL>/zen-volumes/appvol/v1/volumes/files/pippackages%2Fwget.py' -H "Authorization: ZenApiKey ${TOKEN}" -H 'cache-control: no-cache' -H 'content-type: multipart/form-data' -F 'upFile=@/root/packages/anaconda3/lib/python3.7/site-packages/wget.py'
In the sample code, /root/packages/anaconda3/lib/python3.7/site-packages/wget.py is the location of the package on your local workstation that you want to upload. Note that you must specify the absolute path of the package that you are uploading. For a way to issue this upload call from Python instead of cURL, see the requests sketch after this procedure.
- Alternatively, upload the Python package as a compressed archive and extract it on volume appvol:
curl -k -X PUT 'https://<CloudPakforData_URL>/zen-volumes/appvol/v1/volumes/files/pippackages?extract=true' -H "Authorization: ZenApiKey ${TOKEN}" -H 'Content-Type: multipart/form-data' -F upFile='@/Users/test-user/test-data/upload_extract.tar.gz'
In the sample code, /Users/test-user/test-data/upload_extract.tar.gz is the archive on your local workstation that is uploaded and extracted to the pippackages directory on volume appvol.
- Upload your PySpark application to volume appvol at location customApps:
curl -v -ik -X PUT 'https://<CloudPakforData_URL>/zen-volumes/appvol/v1/volumes/files/customApps%2Fexample.py' -H "Authorization: ZenApiKey ${TOKEN}" -H 'cache-control: no-cache' -H 'content-type: multipart/form-data' -F 'upFile=@/root/jobs/HBCPT/wgetExample/example.py'
For a sketch of what example.py might contain, see the example after this procedure.
- Submit your PySpark application, which uses the packages that you uploaded to volume appvol at location pippackages, as a Spark job:
curl -ivk -X POST -d @payload.json -H "Authorization: ZenApiKey ${TOKEN}" <job_API_endpoint>
You can get the Spark jobs endpoint from the service instance details page. See Managing Analytics Engine powered by Apache Spark instances. A Python alternative to this submission call is sketched after this procedure.
The payload.json file for Python 3.9:
{
  "application_details": {
    "application": "/myapp/customApps/example.py",
    "arguments": ["<your_application_arguments>"],
    "conf": {
      "spark.app.name": "MyJob",
      "spark.eventLog.enabled": "true"
    },
    "env": {
      "RUNTIME_PYTHON_ENV": "python39",
      "PYTHONPATH": "/myapp/pippackages:/home/spark/space/assets/data_asset:/home/spark/user_home/python-3:/cc-home/_global_/python-3:/home/spark/shared/user-libs/python:/home/spark/shared/conda/envs/python/lib/python/site-packages:/opt/ibm/conda/miniconda/lib/python/site-packages:/opt/ibm/third-party/libs/python3:/opt/ibm/image-libs/python3:/opt/ibm/image-libs/spark2/metaindexmanager.jar:/opt/ibm/image-libs/spark2/stmetaindexplugin.jar:/opt/ibm/spark/python:/opt/ibm/spark/python/lib/py4j-0.10.7-src.zip"
    }
  },
  "volumes": [{
    "name": "appvol",
    "mount_path": "/myapp",
    "source_sub_path": ""
  }]
}
You can get the value of the PYTHONPATH environment variable by dumping it in your Spark job driver log; a short snippet that does this follows this procedure. In the sample payload, /myapp/pippackages in PYTHONPATH is the location that the Python package was uploaded to.
Another way to make the Python packages visible to your PySpark application, instead of setting PYTHONPATH in the job payload, is to add the following lines to the top of your PySpark application:
import sys
sys.path.append('/myapp/pippackages/')
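To show how the pieces fit together, the following is a minimal sketch of what example.py might contain. The download URL, output path, and DataFrame calls are illustrative assumptions and are not part of the procedure; the sys.path.append line is needed only if you do not set PYTHONPATH in the job payload.
# example.py: minimal PySpark application that uses the wget package uploaded to /myapp/pippackages
import sys
sys.path.append('/myapp/pippackages/')  # not needed if PYTHONPATH is set in the job payload

import wget  # resolved from /myapp/pippackages/wget.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyJob").getOrCreate()

# Illustrative usage: download a file to the driver and read it into a DataFrame
local_path = wget.download("https://example.com/data.csv", out="/tmp/data.csv")
df = spark.read.csv(local_path, header=True)
df.show()

spark.stop()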
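To dump the PYTHONPATH value that the runtime actually uses, you can print it from inside your application; the output then appears in the Spark job driver log:
# Print PYTHONPATH so that its value shows up in the driver log
import os
print("PYTHONPATH =", os.environ.get("PYTHONPATH", ""))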
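If you prefer to script the REST calls in this procedure rather than running cURL, the package upload can be issued with the Python requests library. The following is a minimal sketch that mirrors the upload cURL command in the procedure; the Cloud Pak for Data URL and token values are placeholders that you must supply, and the same pattern applies to the other cURL calls.
# Sketch: upload wget.py to the pippackages folder on volume appvol by using the requests library
import requests

CPD_URL = "https://<CloudPakforData_URL>"  # placeholder: your Cloud Pak for Data URL
TOKEN = "<your_authorization_token>"       # placeholder: the token that you generated earlier

upload_url = CPD_URL + "/zen-volumes/appvol/v1/volumes/files/pippackages%2Fwget.py"
with open("/root/packages/anaconda3/lib/python3.7/site-packages/wget.py", "rb") as f:
    response = requests.put(
        upload_url,
        headers={"Authorization": "ZenApiKey " + TOKEN},
        files={"upFile": f},  # sent as multipart/form-data, matching the -F option
        verify=False,         # matches the -k option in the cURL example
    )
print(response.status_code, response.text)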
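The job submission can be scripted in the same way by posting payload.json to the jobs endpoint. As before, the endpoint and token are placeholders.
# Sketch: submit the Spark job by POSTing payload.json to the jobs endpoint
import json
import requests

TOKEN = "<your_authorization_token>"     # placeholder: the token that you generated earlier
JOB_API_ENDPOINT = "<job_API_endpoint>"  # placeholder: from the service instance details page

with open("payload.json") as f:
    payload = json.load(f)

response = requests.post(
    JOB_API_ENDPOINT,
    headers={"Authorization": "ZenApiKey " + TOKEN},
    json=payload,  # sends the payload as application/json
    verify=False,  # matches the -k option in the cURL example
)
print(response.status_code, response.text)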
Parent topic: Using custom packages