Installing on a Spark cluster
Learn how to install Databand on your Spark cluster.
Before you begin, make sure that you can access the Databand server from your Spark cluster.
Python integration
You must define the following variables in your Spark context.
- DBND__CORE__DATABAND_URL - a Databand server URL
- DBND__CORE__DATABAND_ACCESS_TOKEN - a Databand server access token
- DBND__TRACKING=True
- DBND__ENABLE__SPARK_CONTEXT_ENV=True

Install dbnd-spark. See Installing the Python SDK and the Cluster bootstrap section.
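As a quick illustration, the same variables can be set directly in the driver environment before the Spark session is created, for example when testing locally. This is a minimal sketch and not the recommended cluster-level setup described in the following sections; the URL and token values are placeholders.

```python
import os

# Placeholders - replace with your Databand environment values
os.environ["DBND__CORE__DATABAND_URL"] = "REPLACE_WITH_DATABAND_URL"
os.environ["DBND__CORE__DATABAND_ACCESS_TOKEN"] = "REPLACE_WITH_DATABAND_TOKEN"
os.environ["DBND__TRACKING"] = "True"
os.environ["DBND__ENABLE__SPARK_CONTEXT_ENV"] = "True"

from pyspark.sql import SparkSession

# dbnd picks up the configuration from the environment of the driver process
spark = SparkSession.builder.appName("my_job").getOrCreate()
```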
JVM integration
The following environment variables must be defined in your Spark context:
- DBND__CORE__DATABAND_URL - a Databand server URL
- DBND__CORE__DATABAND_ACCESS_TOKEN - a Databand server access token
- DBND__TRACKING=True

For more information about the available parameters, see Installing JVM SDK and agent. Your cluster must have the Databand JAR available to use the listener and other features. For more information about how to configure Spark clusters, see the following section.
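For example, when a job is submitted through Airflow's SparkSubmitOperator, the variables can be passed alongside the listener and agent settings described in the cluster-specific sections below. This is a minimal sketch, assuming the Databand agent JAR has already been placed on the cluster (see Cluster bootstrap); the application path, agent path, and task_id are placeholders, and the import path depends on your Airflow version.

```python
# Airflow 1.x import path; adjust for your Airflow version
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

spark_submit = SparkSubmitOperator(
    task_id="my_spark_job",
    application="/path/to/my_spark_job.jar",  # placeholder
    env_vars={
        "DBND__CORE__DATABAND_URL": "REPLACE_WITH_DATABAND_URL",
        "DBND__CORE__DATABAND_ACCESS_TOKEN": "REPLACE_WITH_DATABAND_TOKEN",
        "DBND__TRACKING": "True",
    },
    conf={
        # listener and agent settings, as described in the cluster sections below
        "spark.sql.queryExecutionListeners": "ai.databand.spark.DbndSparkQueryExecutionListener",
        "spark.extraListeners": "ai.databand.spark.DbndSparkListener",
        "spark.driver.extraJavaOptions": "-javaagent:/home/hadoop/dbnd-agent-REPLACE_WITH_DBND_VERSION-all.jar",
    },
)
```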
Spark clusters
With most clusters, you can set up Spark environment variables through cluster metadata or bootstrap scripts. In the following sections, you can find provider-specific instructions.
EMR cluster
Complete the following steps to install Databand on your EMR cluster:
- Use environment variables to configure Databand:
Define the environment variables at the API call or in the EmrCreateJobFlowOperator Airflow operator. Alternatively, you can provide these variables from the AWS UI when you create a new cluster. EMR clusters don't provide a way to define environment variables in the bootstrap script. Consult the official EMR documentation on Spark configuration if you use a custom operator or if you create clusters outside of Airflow.

```python
from airflow.hooks.base_hook import BaseHook

dbnd_config = BaseHook.get_connection("dbnd_config").extra_dejson
databand_url = dbnd_config["core"]["databand_url"]
databand_access_token = dbnd_config["core"]["databand_access_token"]

emr_create_job_flow = EmrCreateJobFlowOperator(
    job_flow_overrides={
        "Name": "<EMR Cluster Name>",
        # ...
        "Configurations": [
            {
                "Classification": "spark-env",
                "Configurations": [
                    {
                        "Classification": "export",
                        "Properties": {
                            "DBND__TRACKING": "True",
                            "DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
                            "DBND__CORE__DATABAND_URL": databand_url,
                            "DBND__CORE__DATABAND_ACCESS_TOKEN": databand_access_token,
                        },
                    }
                ],
            }
        ],
    },
    # ...
)
```

- Install Databand on your cluster:
Use the following snippet to install Python and JVM integrations:
```bash
#!/usr/bin/env bash

DBND_VERSION=REPLACE_WITH_DBND_VERSION

sudo python3 -m pip install pandas==1.2.0 pydeequ==1.0.1 databand==${DBND_VERSION}
sudo wget https://repo1.maven.org/maven2/ai/databand/dbnd-agent/${DBND_VERSION}/dbnd-agent-${DBND_VERSION}-all.jar -P /home/hadoop/
```

Add this script to your cluster bootstrap actions list. For more information, see the bootstrap actions documentation.
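If you create the cluster from Airflow, you can attach this script as an EMR bootstrap action in the same EmrCreateJobFlowOperator call. The following is a minimal sketch, assuming the script has been uploaded to an S3 bucket of your choosing; the bucket path and task_id are placeholders, not values from the Databand documentation.

```python
# Airflow 1.x import path; adjust for your Airflow version
from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator

emr_create_job_flow = EmrCreateJobFlowOperator(
    task_id="create_emr_cluster",
    job_flow_overrides={
        "Name": "<EMR Cluster Name>",
        # ... "Configurations" block with the DBND__* variables, as shown above ...
        "BootstrapActions": [
            {
                "Name": "install-databand",
                "ScriptBootstrapAction": {
                    # Upload the bootstrap script to S3 first; this path is a placeholder
                    "Path": "s3://<your-bucket>/bootstrap/install-databand.sh",
                },
            }
        ],
    },
)
```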
Databricks cluster
- Use environment variables to configure Databand:
In the cluster configuration screen, click Edit > Advanced Options > Spark. Inside the Environment Variables section, declare the listed configuration variables. Be sure to replace <databand-url> and <databand-access-token> with your environment-specific information. Alternatively, if you create job clusters from Airflow, you can pass the same variables programmatically; see the sketch after this section.

- DBND__TRACKING="True"
- DBND__ENABLE__SPARK_CONTEXT_ENV="True"
- DBND__CORE__DATABAND_URL="REPLACE_WITH_DATABAND_URL"
- DBND__CORE__DATABAND_ACCESS_TOKEN="REPLACE_WITH_DATABAND_TOKEN"
- Install the Python dbnd library in your Databricks cluster. Under the Libraries tab of your cluster's configuration:
  - Click Install New.
  - Choose PyPI.
  - Enter databand==REPLACE_WITH_DBND_VERSION as the Package name.
  - Click Install.
- Install the Python dbnd library for your specific Airflow operator. Do not use this mode in production; use it only for trying out dbnd in specific tasks. Make sure that the dbnd library is installed on the Databricks cluster by adding databand to the libraries parameter of the DatabricksSubmitRunOperator, as shown in the following example:

```python
DatabricksSubmitRunOperator(
    # ...
    json={
        "libraries": [
            {"pypi": {"package": "databand==REPLACE_WITH_DBND_VERSION"}},
        ]
    },
)
```

- Track Scala or Java Spark jobs:
  - Download the dbnd agent and place it into your DBFS working folder.
  - Make sure that you publish the agent to /dbfs/apps/ first. For more configuration options, see the Databricks Runs Submit API documentation.
  - Configure tracking of Spark applications with automatic dataset logging by adding ai.databand.spark.DbndSparkQueryExecutionListener to spark.sql.queryExecutionListeners and ai.databand.spark.DbndSparkListener to spark.extraListeners. This mode works only if you enable the dbnd agent.
  - Use the following configuration of the Databricks job to enable the Databand Java Agent with automatic dataset tracking:

```python
spark_operator = DatabricksSubmitRunOperator(
    json={
        # ...
        "new_cluster": {
            # ...
            "spark_conf": {
                "spark.sql.queryExecutionListeners": "ai.databand.spark.DbndSparkQueryExecutionListener",
                "spark.extraListeners": "ai.databand.spark.DbndSparkListener",
                "spark.driver.extraJavaOptions": "-javaagent:/dbfs/apps/dbnd-agent-0.xx.x-all.jar",
            },
        },
        # ...
    }
)
```
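If you launch Databricks job clusters from Airflow instead of configuring an interactive cluster through the UI, the same environment variables can be passed in the new_cluster specification through the spark_env_vars field of the Databricks Runs Submit API. This is a minimal sketch; replace the URL and token placeholders with your own values.

```python
# Airflow 1.x import path; adjust for your Airflow version
from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator

spark_operator = DatabricksSubmitRunOperator(
    # ...
    json={
        "new_cluster": {
            # ...
            "spark_env_vars": {
                "DBND__TRACKING": "True",
                "DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
                "DBND__CORE__DATABAND_URL": "REPLACE_WITH_DATABAND_URL",
                "DBND__CORE__DATABAND_ACCESS_TOKEN": "REPLACE_WITH_DATABAND_TOKEN",
            },
        },
    },
)
```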
Google Cloud Dataproc cluster
Complete the following steps to set up your Google Cloud Dataproc cluster:
- Set up your cluster. Define environment variables during the cluster setup, or add these variables to your bootstrap script as described in the previous sections about Spark clusters.

```python
from airflow.hooks.base_hook import BaseHook

dbnd_config = BaseHook.get_connection("dbnd_config").extra_dejson
databand_url = dbnd_config["core"]["databand_url"]
databand_access_token = dbnd_config["core"]["databand_access_token"]

cluster_create = DataprocClusterCreateOperator(
    # ...
    properties={
        "spark-env:DBND__TRACKING": "True",
        "spark-env:DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
        "spark-env:DBND__CORE__DATABAND_URL": databand_url,
        "spark-env:DBND__CORE__DATABAND_ACCESS_TOKEN": databand_access_token,
    },
    # ...
)
```
- Use the same operator to install Databand PySpark support:

```python
cluster_create = DataprocClusterCreateOperator(
    # ...
    properties={
        "dataproc:pip.packages": "dbnd==REPLACE_WITH_DATABAND_VERSION",
    },
    # ...
)
```
- The Dataproc cluster supports initialization actions. The following script installs the Databand libraries and sets up the environment variables that are required for tracking:

```bash
#!/usr/bin/env bash

DBND_VERSION=REPLACE_WITH_DBND_VERSION

## use conda-provided python instead of the system one
export PATH=/opt/conda/default/bin:${PATH}

python3 -m pip install pydeequ==1.0.1 databand==${DBND_VERSION}

DBND__CORE__DATABAND_ACCESS_TOKEN=$(/usr/share/google/get_metadata_value attributes/DBND__CORE__DATABAND_ACCESS_TOKEN)
sh -c "echo DBND__CORE__DATABAND_ACCESS_TOKEN=${DBND__CORE__DATABAND_ACCESS_TOKEN} >> /usr/lib/spark/conf/spark-env.sh"

DBND__CORE__DATABAND_URL=$(/usr/share/google/get_metadata_value attributes/DBND__CORE__DATABAND_URL)
sh -c "echo DBND__CORE__DATABAND_URL=${DBND__CORE__DATABAND_URL} >> /usr/lib/spark/conf/spark-env.sh"

sh -c "echo DBND__TRACKING=True >> /usr/lib/spark/conf/spark-env.sh"
sh -c "echo DBND__ENABLE__SPARK_CONTEXT_ENV=True >> /usr/lib/spark/conf/spark-env.sh"
```

Note: Variables like the access token and tracker URL can be passed to the initialization action through cluster metadata properties, as shown in the sketch after this section. For more information, see the official Dataproc documentation.
- Publish your JAR to Google Cloud Storage, then use the following configuration of the PySpark Dataproc job to enable the Databand Spark query listener with automatic dataset tracking:

```python
pyspark_operator = DataProcPySparkOperator(
    # ...
    dataproc_pyspark_jars=["gs://.../dbnd-agent-REPLACE_WITH_DATABAND_VERSION-all.jar"],
    dataproc_pyspark_properties={
        "spark.sql.queryExecutionListeners": "ai.databand.spark.DbndSparkQueryExecutionListener",
        "spark.extraListeners": "ai.databand.spark.DbndSparkListener",
    },
    # ...
)
```
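To wire the initialization script and the metadata-provided credentials together when creating the cluster from Airflow, a cluster create call along the following lines can be used. This is a minimal sketch, assuming the init script above has been uploaded to a GCS bucket of your choosing and that your operator version supports the init_actions_uris and metadata parameters; the bucket path and task_id are placeholders, not values from the Databand documentation.

```python
# Airflow 1.x import path; adjust for your Airflow version
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator
from airflow.hooks.base_hook import BaseHook

dbnd_config = BaseHook.get_connection("dbnd_config").extra_dejson

cluster_create = DataprocClusterCreateOperator(
    task_id="create_dataproc_cluster",
    # ...
    # the init action runs the bootstrap script shown above
    init_actions_uris=["gs://<your-bucket>/bootstrap/install-databand.sh"],
    # the script reads these values via /usr/share/google/get_metadata_value
    metadata={
        "DBND__CORE__DATABAND_URL": dbnd_config["core"]["databand_url"],
        "DBND__CORE__DATABAND_ACCESS_TOKEN": dbnd_config["core"]["databand_access_token"],
    },
)
```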
For the list of all supported operators and extra information, see Tracking remote tasks.
Cluster bootstrap
If you are using a custom cluster installation, you must install the Databand packages and agent, and configure environment variables for tracking.
- Set up your cluster:
  - Add the following commands to your cluster initialization script:

```bash
#!/bin/bash -x

DBND_VERSION=REPLACE_WITH_DBND_VERSION

## Configure your Databand tracking configuration (works only for generic clusters/Dataproc, not for EMR)
sh -c "echo DBND__TRACKING=True >> /usr/lib/spark/conf/spark-env.sh"
sh -c "echo DBND__ENABLE__SPARK_CONTEXT_ENV=True >> /usr/lib/spark/conf/spark-env.sh"

## if you use listeners/agent, download the Databand agent, which includes all JARs
wget https://repo1.maven.org/maven2/ai/databand/dbnd-agent/${DBND_VERSION}/dbnd-agent-${DBND_VERSION}-all.jar -P /home/hadoop/

## install the Databand Python package together with Airflow support
python -m pip install databand==${DBND_VERSION}
```

  - Install the dbnd packages on the Spark main node and Spark workers by running pip install databand at bootstrap or manually.
  - Make sure that you don't install dbnd-airflow on the cluster.
- Provide Databand credentials for your cluster bootstrap. If your cluster type supports configuring environment variables with a bootstrap script, you can use the script to define Databand credentials at the cluster level. Be sure to replace <databand-url> and <databand-access-token> with your environment-specific information.

```bash
sh -c "echo DBND__CORE__DATABAND_URL=REPLACE_WITH_DATABAND_URL >> /usr/lib/spark/conf/spark-env.sh"
sh -c "echo DBND__CORE__DATABAND_ACCESS_TOKEN=REPLACE_WITH_DATABAND_TOKEN >> /usr/lib/spark/conf/spark-env.sh"
```

- Configure the Databand agent path and query listener for Spark operators:
Databand can automatically alter the spark-submit command for various Spark operators, inject the agent JAR into the class path, and enable the query listener. You can use the following options to configure the dbnd_config Airflow connection:

```json
{
    "tracking_spark": {
        "query_listener": true,
        "agent_path": "/home/hadoop/dbnd-agent-latest-all.jar",
        "jar_path": null
    }
}
```

- query_listener - Enables the Databand Spark query listener for auto-capturing dataset operations from Spark jobs.
- agent_path - The path to the Databand Java Agent fat JAR. If this path is provided, Databand includes the agent in the Spark job through the spark.driver.extraJavaOptions configuration option. An agent is required if you want to track Java or Scala jobs that are annotated with @Task. The agent must be placed in the cluster's local file system for proper functioning.
- jar_path - The path to the Databand Java Agent fat JAR. If this path is provided, Databand includes the JAR in the Spark job through the spark.jars configuration option. The JAR can be placed in a local file system or an S3/GCS/DBFS path.
Properties can be configured with environment variables or .cfg files. For more information, see SDK configuration.
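For example, following the DBND__SECTION__KEY naming convention used for the core options earlier in this guide, the same tracking_spark options could be supplied through environment variables. The exact variable names below are an assumption based on that convention rather than values taken from this page.

```python
import os

# Assumed names, following the DBND__SECTION__KEY convention shown above
# (e.g. DBND__CORE__DATABAND_URL maps to [core] databand_url)
os.environ["DBND__TRACKING_SPARK__QUERY_LISTENER"] = "True"
os.environ["DBND__TRACKING_SPARK__AGENT_PATH"] = "/home/hadoop/dbnd-agent-latest-all.jar"
```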
Next steps
See the Tracking Python section for implementing dbnd within your PySpark jobs. See the Tracking Spark and JVM applications section for Spark and JVM jobs.