Installing on a Spark cluster
Learn how to install Databand on your Spark cluster.
General installation
Make sure that you can access the Databand server from your Spark cluster.
Python integration
You must define the following variables in your Spark context:
- DBND__CORE__DATABAND_URL - a Databand server URL
- DBND__CORE__DATABAND_ACCESS_TOKEN - a Databand server access token
- DBND__TRACKING=True
- DBND__ENABLE__SPARK_CONTEXT_ENV=True
Install dbnd-spark. See Installing the Python SDK and the bootstrap example in the Cluster bootstrap section.
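With these variables set on the cluster and dbnd-spark installed, tracking is enabled for your PySpark code. As a minimal sketch (the function name, dataset path, and parameters below are illustrative, not part of the product), a tracked PySpark job could look like this:
from pyspark.sql import SparkSession
from dbnd import task, log_dataframe, log_metric

@task
def count_events(input_path):
    # With DBND__TRACKING=True and the server URL/token present in the
    # environment, this run and the logged values are reported to Databand.
    spark = SparkSession.builder.appName("dbnd-tracked-job").getOrCreate()
    df = spark.read.parquet(input_path)
    log_dataframe("events", df)
    log_metric("row_count", df.count())
    return df.count()

if __name__ == "__main__":
    count_events("s3://your-bucket/events/")  # placeholder path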
JVM integration
The following environment variables must be defined in your Spark context:
- DBND__CORE__DATABAND_URL - a Databand server URL
- DBND__CORE__DATABAND_ACCESS_TOKEN - a Databand server access token
- DBND__TRACKING=True
For more information about the available parameters, see Installing JVM SDK and agent. Your cluster must have the Databand JAR to be able to use the listener and other features. For more information about how to configure Spark clusters, see the following section.
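If you submit jobs from Airflow with the generic SparkSubmitOperator, one possible way to pass these variables and the agent is sketched below. This is an illustrative example rather than the canonical setup: the import path depends on your Airflow version, the application and agent JAR locations are placeholders, and env_vars reaches the driver and executors only in cluster deploy mode.
# Airflow 1.10-style import; in Airflow 2, use the apache-spark provider package instead.
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

submit_job = SparkSubmitOperator(
    task_id="process_data",
    application="s3://your-bucket/jobs/process_data.py",  # placeholder job
    conn_id="spark_default",
    # Databand tracking configuration for the Spark job
    env_vars={
        "DBND__TRACKING": "True",
        "DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
        "DBND__CORE__DATABAND_URL": "REPLACE_WITH_DATABAND_URL",
        "DBND__CORE__DATABAND_ACCESS_TOKEN": "REPLACE_WITH_DATABAND_TOKEN",
    },
    conf={
        # Databand listeners; the agent JAR must already be present on the cluster.
        "spark.sql.queryExecutionListeners": "ai.databand.spark.DbndSparkQueryExecutionListener",
        "spark.extraListeners": "ai.databand.spark.DbndSparkListener",
        "spark.driver.extraJavaOptions": "-javaagent:/home/hadoop/dbnd-agent-REPLACE_WITH_DBND_VERSION-all.jar",
    },
)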
Spark clusters
With most clusters, you can set up Spark environment variables through cluster metadata or bootstrap scripts. In the following sections, you can find provider-specific instructions.
EMR cluster
Setting Databand configuration by using environment variables
You need to define the environment variables in the API call or in the EmrCreateJobFlowOperator Airflow operator. Alternatively, you can provide these variables from the AWS UI when you create a new cluster. An EMR cluster doesn't provide a way to define environment variables in the bootstrap script. Consult the official EMR documentation on Spark configuration if you use a custom operator or if you create clusters outside of Airflow.
# The import path for EmrCreateJobFlowOperator may differ by Airflow version.
from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.hooks.base_hook import BaseHook

# Read the Databand URL and access token from the dbnd_config Airflow connection
dbnd_config = BaseHook.get_connection("dbnd_config").extra_dejson
databand_url = dbnd_config["core"]["databand_url"]
databand_access_token = dbnd_config["core"]["databand_access_token"]

emr_create_job_flow = EmrCreateJobFlowOperator(
job_flow_overrides={
"Name": "<EMR Cluster Name>",
#...
"Configurations": [
{
"Classification": "spark-env",
"Configurations": [
{
"Classification": "export",
"Properties": {
"DBND__TRACKING": "True",
"DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
"DBND__CORE__DATABAND_URL": databand_url,
"DBND__CORE__DATABAND_ACCESS_TOKEN": databand_access_token,
},
}
],
}
]
}
#...
)
Installing Databand on the cluster
Because EMR clusters support bootstrap actions, you can use the following snippet to install the Python and JVM integrations:
#!/usr/bin/env bash
DBND_VERSION=REPLACE_WITH_DBND_VERSION
sudo python3 -m pip install pandas==1.2.0 pydeequ==1.0.1 databand==${DBND_VERSION}
sudo wget https://repo1.maven.org/maven2/ai/databand/dbnd-agent/${DBND_VERSION}/dbnd-agent-${DBND_VERSION}-all.jar -P /home/hadoop/
Add this script to your cluster's bootstrap actions list. For more information, see the bootstrap actions documentation.
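For reference, one possible way to attach such a script to the cluster from the EmrCreateJobFlowOperator shown earlier is sketched below; the S3 path is a placeholder, and you must upload the bootstrap script there first.
emr_create_job_flow = EmrCreateJobFlowOperator(
    job_flow_overrides={
        "Name": "<EMR Cluster Name>",
        # ...
        "BootstrapActions": [
            {
                "Name": "install-databand",
                "ScriptBootstrapAction": {
                    # Placeholder location of the bootstrap script shown above
                    "Path": "s3://your-bucket/bootstrap/install_databand.sh",
                },
            }
        ],
    }
    # ...
)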
Databricks cluster
Setting Databand configuration with environment variables
In the cluster configuration screen, click Edit > Advanced Options > Spark. In the Environment Variables section, declare the following configuration variables. Be sure to replace the placeholder URL and access token values with your environment-specific information:
DBND__TRACKING="True"
DBND__ENABLE__SPARK_CONTEXT_ENV="True"
DBND__CORE__DATABAND_URL="REPLACE_WITH_DATABAND_URL"
DBND__CORE__DATABAND_ACCESS_TOKEN="REPLACE_WITH_DATABAND_TOKEN"
Install the Python dbnd library in the Databricks cluster
Under the Libraries tab of your cluster's configuration:
- Click Install New.
- Choose PyPI.
- Enter databand==REPLACE_WITH_DBND_VERSION as the Package name.
- Click Install.
Install the Python dbnd library for a specific Airflow operator
Do not use this mode in production; use it only for trying out dbnd in specific tasks. Make sure that the dbnd library is installed on the Databricks cluster by adding databand to the libraries parameter of the DatabricksSubmitRunOperator, as shown in the following example:
DatabricksSubmitRunOperator(
#...
json={"libraries": [
{"pypi": {"package": "databand==REPLACE_WITH_DBND_VERSION"}},
]},
)
Tracking Scala or Java Spark jobs
Download the dbnd agent and place it into your DBFS working folder.
To configure tracking of Spark applications with automatic dataset logging, add ai.databand.spark.DbndSparkQueryExecutionListener to spark.sql.queryExecutionListeners and add ai.databand.spark.DbndSparkListener to spark.extraListeners (this mode works only if you enable the dbnd agent).
Use the following configuration of the Databricks job to enable the Databand Java Agent with automatic dataset tracking:
spark_operator = DatabricksSubmitRunOperator(
json={
#...
"new_cluster": {
#...
"spark_conf": {
"spark.sql.queryExecutionListeners": "ai.databand.spark.DbndSparkQueryExecutionListener",
"spark.extraListeners": "ai.databand.spark.DbndSparkListener",
"spark.driver.extraJavaOptions": "-javaagent:/dbfs/apps/dbnd-agent-0.xx.x-all.jar",
},
},
#...
})
Before running the job, make sure that you publish the agent to /dbfs/apps/. For more configuration options, see the Databricks Runs Submit API documentation.
Google Cloud Dataproc cluster
Cluster setup
You can define environment variables during the cluster setup or add these variables to your bootstrap as described in the previous sections about Spark clusters.
# The import path for DataprocClusterCreateOperator may differ by Airflow version.
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator
from airflow.hooks.base_hook import BaseHook

# Read the Databand URL and access token from the dbnd_config Airflow connection
dbnd_config = BaseHook.get_connection("dbnd_config").extra_dejson
databand_url = dbnd_config["core"]["databand_url"]
databand_access_token = dbnd_config["core"]["databand_access_token"]

cluster_create = DataprocClusterCreateOperator(
# ...
properties={
"spark-env:DBND__TRACKING": "True",
"spark-env:DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
"spark-env:DBND__CORE__DATABAND_URL": databand_url,
"spark-env:DBND__CORE__DATABAND_ACCESS_TOKEN": databand_access_token,
},
# ...
)
You can install Databand PySpark support by using the same operator:
cluster_create = DataprocClusterCreateOperator(
#...
properties={
"dataproc:pip.packages": "dbnd==REPLACE_WITH_DATABAND_VERSION",
},
#...
)
Dataproc clusters support initialization actions. The following script installs the Databand libraries and sets up the environment variables that are required for tracking:
#!/usr/bin/env bash
DBND_VERSION=REPLACE_WITH_DBND_VERSION
## to use conda-provided python instead of system one
export PATH=/opt/conda/default/bin:${PATH}
python3 -m pip install pydeequ==1.0.1 databand==${DBND_VERSION}
DBND__CORE__DATABAND_ACCESS_TOKEN=$(/usr/share/google/get_metadata_value attributes/DBND__CORE__DATABAND_ACCESS_TOKEN)
sh -c "echo DBND__CORE__DATABAND_ACCESS_TOKEN=${DBND__CORE__DATABAND_ACCESS_TOKEN} >> /usr/lib/spark/conf/spark-env.sh"
DBND__CORE__DATABAND_URL=$(/usr/share/google/get_metadata_value attributes/DBND__CORE__DATABAND_URL)
sh -c "echo DBND__CORE__DATABAND_URL=${DBND__CORE__DATABAND_URL} >> /usr/lib/spark/conf/spark-env.sh"
sh -c "echo DBND__TRACKING=True >> /usr/lib/spark/conf/spark-env.sh"
sh -c "echo DBND__ENABLE__SPARK_CONTEXT_ENV=True >> /usr/lib/spark/conf/spark-env.sh"
Note: Variables like the access token and tracker URL can be passed to the initialization action through cluster metadata properties, as sketched below. For more information, see the official Dataproc documentation.
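As an illustration, those metadata values can be supplied from the same DataprocClusterCreateOperator. The snippet below is a sketch that reuses the databand_url and databand_access_token variables from the example above; the init action path is a placeholder, and the metadata keys must match the attribute names that the initialization script reads with get_metadata_value.
cluster_create = DataprocClusterCreateOperator(
    # ...
    # Read by the init action via /usr/share/google/get_metadata_value attributes/<KEY>
    metadata={
        "DBND__CORE__DATABAND_URL": databand_url,
        "DBND__CORE__DATABAND_ACCESS_TOKEN": databand_access_token,
    },
    # Placeholder location of the initialization script shown above
    init_actions_uris=["gs://your-bucket/bootstrap/install_databand.sh"],
    # ...
)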
Tracking Python Spark jobs
Use the following configuration of the PySpark Dataproc job to enable the Databand Spark query listener with automatic dataset tracking:
pyspark_operator = DataProcPySparkOperator(
#...
dataproc_pyspark_jars=["gs://.../dbnd-agent-REPLACE_WITH_DATABAND_VERSION-all.jar"],
dataproc_pyspark_properties={
"spark.sql.queryExecutionListeners": "ai.databand.spark.DbndSparkQueryExecutionListener",
"spark.extraListeners": "ai.databand.spark.DbndSparkListener",
},
#...
)
Before running the job, publish your JAR to Google Cloud Storage.
For the list of all supported operators and additional information, see Tracking remote tasks.
Cluster bootstrap
If you are using a custom cluster installation, you must install the Databand packages and agent, and configure the environment variables for tracking.
Add the following commands to your cluster initialization script:
#!/bin/bash -x
DBND_VERSION=REPLACE_WITH_DBND_VERSION
## Configure your Databand Tracking Configuration (works only for generic cluster/dataproc, not for EMR)
sh -c "echo DBND__TRACKING=True >> /usr/lib/spark/conf/spark-env.sh"
sh -c "echo DBND__ENABLE__SPARK_CONTEXT_ENV=True >> /usr/lib/spark/conf/spark-env.sh"
## if you use Listeners/Agent, download Databand Agent which includes all JARs
wget https://repo1.maven.org/maven2/ai/databand/dbnd-agent/${DBND_VERSION}/dbnd-agent-${DBND_VERSION}-all.jar -P /home/hadoop/
## install Databand Python package together with Airflow support
python -m pip install databand==${DBND_VERSION}
- Install the dbnd packages on the Spark main node and Spark workers by running pip install databand at bootstrap or manually.
- Make sure that you don't install "dbnd-airflow" on the cluster.
How to provide Databand credentials with cluster bootstrap
If your cluster type supports configuring environment variables with a bootstrap script, you can use your bootstrap script to define Databand credentials at the cluster level:
sh -c "echo DBND__CORE__DATABAND_URL=REPLACE_WITH_DATABAND_URL >> /usr/lib/spark/conf/spark-env.sh"
sh -c "echo DBND__CORE__DATABAND_ACCESS_TOKEN=REPLACE_WITH_DATABAND_TOKEN >> /usr/lib/spark/conf/spark-env.sh"
Be sure to replace REPLACE_WITH_DATABAND_URL and REPLACE_WITH_DATABAND_TOKEN with your environment-specific information.
Databand agent path and query listener configuration for Spark operators
Databand can automatically alter the spark-submit command for various Spark operators, inject the agent JAR into the classpath, and enable the query listener. You can use the following options to configure the dbnd_config Airflow connection:
{
"tracking_spark": {
"query_listener": true,
"agent_path": "/home/hadoop/dbnd-agent-latest-all.jar",
"jar_path": null
}
}
- query_listener - enables the Databand Spark query listener for auto-capturing dataset operations from Spark jobs.
- agent_path - path to the Databand Java Agent fat JAR. If this path is provided, Databand includes the agent in the Spark job through the spark.driver.extraJavaOptions configuration option. The agent is required if you want to track Java or Scala jobs that are annotated with @Task. The agent must be placed in the cluster's local file system to function properly.
- jar_path - path to the Databand Java Agent fat JAR. If this path is provided, Databand includes the JAR in the Spark job through the spark.jars configuration option. The JAR can be placed in a local file system or an S3/GCS/DBFS path.
Properties can also be configured with environment variables or .cfg files. For more information, see SDK configuration.
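For example, assuming the DBND__&lt;SECTION&gt;__&lt;KEY&gt; environment variable convention used elsewhere in this guide, the same two options could be set from Python before the operator runs (the variable names below are derived from that convention, not taken verbatim from the SDK reference):
import os

# Equivalent of the tracking_spark options in the dbnd_config connection above
os.environ["DBND__TRACKING_SPARK__QUERY_LISTENER"] = "true"
os.environ["DBND__TRACKING_SPARK__AGENT_PATH"] = "/home/hadoop/dbnd-agent-latest-all.jar"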
Next steps
See the Tracking Python section for implementing dbnd within your PySpark jobs. See the Tracking Spark and JVM applications section for Spark and JVM jobs.