Installing on a Spark cluster

Learn how to install Databand on your Spark cluster.

General installation

Make sure that you can access the Databand server from your Spark cluster.

Python integration

You must define the following environment variables in your Spark context.

  • DBND__CORE__DATABAND_URL - a Databand server URL
  • DBND__CORE__DATABAND_ACCESS_TOKEN - a Databand server access token
  • DBND__TRACKING=True
  • DBND__ENABLE__SPARK_CONTEXT_ENV=True

Install dbnd-spark. See Installing the Python SDK and the bootstrap example in the Cluster bootstrap section.
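If you cannot set these variables globally on the cluster, you can pass them per job through the Spark configuration. The following is a minimal sketch for a YARN cluster in cluster deploy mode; the placeholder values and the application script name (your_spark_job.py) are illustrative:

# pass the Databand variables to the driver (cluster deploy mode) and executors
spark-submit \
  --conf spark.yarn.appMasterEnv.DBND__TRACKING=True \
  --conf spark.yarn.appMasterEnv.DBND__ENABLE__SPARK_CONTEXT_ENV=True \
  --conf spark.yarn.appMasterEnv.DBND__CORE__DATABAND_URL=REPLACE_WITH_DATABAND_URL \
  --conf spark.yarn.appMasterEnv.DBND__CORE__DATABAND_ACCESS_TOKEN=REPLACE_WITH_DATABAND_TOKEN \
  --conf spark.executorEnv.DBND__TRACKING=True \
  your_spark_job.py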

JVM integration

The following environment variables must be defined in your Spark context:

  • DBND__CORE__DATABAND_URL - a Databand server URL

  • DBND__CORE__DATABAND_ACCESS_TOKEN - a Databand server access token

  • DBND__TRACKING=True

For more information about the available parameters, see Installing JVM SDK and agent. Your cluster must have the Databand agent JAR available to use the listener and other features. For more information about how to configure Spark clusters, see the following section.
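As an illustration only, assuming the agent JAR is already downloaded to /home/hadoop/ (see the Cluster bootstrap section), a spark-submit command that loads the agent and registers the Databand listeners could look like the following sketch; the main class and application JAR are placeholders:

# load the Databand agent and register the Databand listeners
# the main class and application JAR below are placeholders
spark-submit \
  --conf spark.driver.extraJavaOptions=-javaagent:/home/hadoop/dbnd-agent-REPLACE_WITH_DBND_VERSION-all.jar \
  --conf spark.sql.queryExecutionListeners=ai.databand.spark.DbndSparkQueryExecutionListener \
  --conf spark.extraListeners=ai.databand.spark.DbndSparkListener \
  --class com.example.MySparkJob \
  my-spark-job.jar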

Spark clusters

With most clusters, you can set up Spark environment variables through cluster metadata or bootstrap scripts. In the following sections, you can find provider-specific instructions.

EMR cluster

Setting Databand configuration by using environment variables

Define the environment variables in the API call or in the EmrCreateJobFlowOperator Airflow operator. Alternatively, you can provide these variables from the AWS UI when you create a new cluster. EMR clusters do not support defining environment variables in bootstrap actions. Consult the official EMR documentation on Spark configuration if you use a custom operator or create clusters outside of Airflow.

from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.hooks.base_hook import BaseHook

# read the Databand URL and access token from the dbnd_config Airflow connection
dbnd_config = BaseHook.get_connection("dbnd_config").extra_dejson
databand_url = dbnd_config["core"]["databand_url"]
databand_access_token = dbnd_config["core"]["databand_access_token"]

emr_create_job_flow = EmrCreateJobFlowOperator(
    job_flow_overrides={
        "Name": "<EMR Cluster Name>",
        #...
        "Configurations": [
            {
                "Classification": "spark-env",
                "Configurations": [
                    {
                        "Classification": "export",
                        "Properties": {
                            "DBND__TRACKING": "True",
                            "DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
                            "DBND__CORE__DATABAND_URL": databand_url,
                            "DBND__CORE__DATABAND_ACCESS_TOKEN": databand_access_token,
                        },
                    }
                ],
            }
        ]
    },
    # ...
)

Installing Databand on the cluster

Because EMR clusters support bootstrap actions, you can use the following snippet to install the Python and JVM integrations:

#!/usr/bin/env bash

DBND_VERSION=REPLACE_WITH_DBND_VERSION

# install the Databand Python SDK and commonly used libraries for tracking
sudo python3 -m pip install pandas==1.2.0 pydeequ==1.0.1 databand==${DBND_VERSION}
# download the Databand agent JAR for the JVM integration
sudo wget https://repo1.maven.org/maven2/ai/databand/dbnd-agent/${DBND_VERSION}/dbnd-agent-${DBND_VERSION}-all.jar -P /home/hadoop/

Add this script to your cluster bootstrap actions list. For more information, see the bootstrap actions documentation.
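If you create the cluster outside of Airflow, you can attach the same script as a bootstrap action, for example with the AWS CLI. The following is a sketch only; the S3 path and the cluster options are placeholders and do not form a complete cluster definition:

aws emr create-cluster \
  --name "<EMR Cluster Name>" \
  --release-label emr-6.5.0 \
  --applications Name=Spark \
  --use-default-roles \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --bootstrap-actions Name="Install Databand",Path="s3://<your-bucket>/install-databand.sh"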

Databricks cluster

Setting Databand configuration with environment variables

In the cluster configuration screen, click Edit > Advanced Options > Spark. In the Environment Variables section, declare the following configuration variables. Be sure to replace REPLACE_WITH_DATABAND_URL and REPLACE_WITH_DATABAND_TOKEN with your environment-specific values:

  • DBND__TRACKING="True"
  • DBND__ENABLE__SPARK_CONTEXT_ENV="True"
  • DBND__CORE__DATABAND_URL="REPLACE_WITH_DATABAND_URL"
  • DBND__CORE__DATABAND_ACCESS_TOKEN="REPLACE_WITH_DATABAND_TOKEN"

Installing the Python dbnd library on a Databricks cluster

Under the Libraries tab of your cluster's configuration:

  1. Click Install New.
  2. Choose PyPI.
  3. Enter databand==REPLACE_WITH_DBND_VERSION as the Package name.
  4. Click Install.
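Alternatively, you can install the library programmatically through the Databricks Libraries REST API. The following curl sketch assumes API version 2.0; the workspace URL, token, and cluster ID are placeholders:

# install the databand PyPI package on a running cluster (Libraries API 2.0)
curl -X POST "https://<databricks-instance>/api/2.0/libraries/install" \
  -H "Authorization: Bearer <databricks-access-token>" \
  -H "Content-Type: application/json" \
  -d '{
        "cluster_id": "<cluster-id>",
        "libraries": [{"pypi": {"package": "databand==REPLACE_WITH_DBND_VERSION"}}]
      }'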

Installing the Python dbnd library for a specific Airflow operator

Do not use this mode in production; use it only for trying out dbnd in specific tasks. Make sure that the dbnd library is installed on the Databricks cluster by adding databand to the libraries parameter of the DatabricksSubmitRunOperator, as shown in the following example:

DatabricksSubmitRunOperator(
    # ...
    json={
        "libraries": [
            {"pypi": {"package": "databand==REPLACE_WITH_DBND_VERSION"}},
        ]
    },
)

Tracking Scala or Java Spark jobs

Download the dbnd agent and place it in your DBFS working folder.

To track Spark applications with automatic dataset logging, add ai.databand.spark.DbndSparkQueryExecutionListener to spark.sql.queryExecutionListeners and ai.databand.spark.DbndSparkListener to spark.extraListeners. This mode works only if the dbnd agent is enabled.

Use the following configuration of the Databricks job to enable the Databand Java agent with automatic dataset tracking:

spark_operator = DatabricksSubmitRunOperator(
    json={
        # ...
        "new_cluster": {
            # ...
            "spark_conf": {
                "spark.sql.queryExecutionListeners": "ai.databand.spark.DbndSparkQueryExecutionListener",
                "spark.extraListeners": "ai.databand.spark.DbndSparkListener",
                "spark.driver.extraJavaOptions": "-javaagent:/dbfs/apps/dbnd-agent-0.xx.x-all.jar",
            },
        },
        # ...
    },
)

Before you run the job, make sure that you publish the agent to /dbfs/apps/. For more configuration options, see the Databricks Runs Submit API documentation.
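For example, assuming you already downloaded the agent JAR locally, you can upload it with the legacy Databricks CLI; the local file name is a placeholder:

# copy the agent JAR from your local machine to DBFS
databricks fs cp ./dbnd-agent-REPLACE_WITH_DBND_VERSION-all.jar dbfs:/apps/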

Google Cloud Dataproc cluster

Cluster setup

You can define the environment variables during cluster setup, or add these variables to your bootstrap script as described in the previous sections about Spark clusters.

from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator
from airflow.hooks.base_hook import BaseHook

# read the Databand URL and access token from the dbnd_config Airflow connection
dbnd_config = BaseHook.get_connection("dbnd_config").extra_dejson
databand_url = dbnd_config["core"]["databand_url"]
databand_access_token = dbnd_config["core"]["databand_access_token"]

cluster_create = DataprocClusterCreateOperator(
    # ...
    properties={
        "spark-env:DBND__TRACKING": "True",
        "spark-env:DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
        "spark-env:DBND__CORE__DATABAND_URL": databand_url,
        "spark-env:DBND__CORE__DATABAND_ACCESS_TOKEN": databand_access_token,
    },
    # ...
)

You can install Databand PySpark support by using the same operator:

cluster_create = DataprocClusterCreateOperator(
    # ...
    properties={
        "dataproc:pip.packages": "dbnd==REPLACE_WITH_DATABAND_VERSION",
    },
    # ...
)

The Dataproc cluster supports initialization actions. The following script installs the Databand libraries and sets up the environment variables that are required for tracking:

#!/usr/bin/env bash

DBND_VERSION=REPLACE_WITH_DBND_VERSION

## to use conda-provided python instead of system one
export PATH=/opt/conda/default/bin:${PATH}

python3 -m pip install pydeequ==1.0.1 databand==${DBND_VERSION}

# read the Databand credentials from cluster metadata and persist them in spark-env.sh
DBND__CORE__DATABAND_ACCESS_TOKEN=$(/usr/share/google/get_metadata_value attributes/DBND__CORE__DATABAND_ACCESS_TOKEN)
sh -c "echo DBND__CORE__DATABAND_ACCESS_TOKEN=${DBND__CORE__DATABAND_ACCESS_TOKEN} >> /usr/lib/spark/conf/spark-env.sh"
DBND__CORE__DATABAND_URL=$(/usr/share/google/get_metadata_value attributes/DBND__CORE__DATABAND_URL)
sh -c "echo DBND__CORE__DATABAND_URL=${DBND__CORE__DATABAND_URL} >> /usr/lib/spark/conf/spark-env.sh"
sh -c "echo DBND__TRACKING=True >> /usr/lib/spark/conf/spark-env.sh"
sh -c "echo DBND__ENABLE__SPARK_CONTEXT_ENV=True >> /usr/lib/spark/conf/spark-env.sh"

Note: Variables like the access token and tracker URL can be passed to the initialization action through cluster metadata properties. For more information, see the official Dataproc documentation.
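For example, assuming the initialization script is uploaded to a bucket of your own, a gcloud command that passes the credentials as metadata could look like the following sketch; the cluster name, region, and bucket path are placeholders:

gcloud dataproc clusters create <cluster-name> \
  --region <region> \
  --initialization-actions gs://<your-bucket>/databand-init.sh \
  --metadata DBND__CORE__DATABAND_URL=REPLACE_WITH_DATABAND_URL,DBND__CORE__DATABAND_ACCESS_TOKEN=REPLACE_WITH_DATABAND_TOKEN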

Tracking Python Spark jobs

Use the following configuration of the PySpark Dataproc job to enable the Databand Spark query listener with automatic dataset tracking:

pyspark_operator = DataProcPySparkOperator(
    #...
    dataproc_pyspark_jars=[ "gs://.../dbnd-agent-REPLACE_WITH_DATABAND_VERSION-all.jar"],
    dataproc_pyspark_properties={
        "spark.sql.queryExecutionListeners": "ai.databand.spark.DbndSparkQueryExecutionListener",
        "spark.extraListeners": "ai.databand.spark.DbndSparkListener",
    },
    #...
)
First, publish your JAR to Google Cloud Storage.
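For example, assuming you already downloaded the agent JAR locally, you can upload it with gsutil; the bucket path is a placeholder:

# copy the agent JAR to a GCS bucket that your Dataproc jobs can read
gsutil cp ./dbnd-agent-REPLACE_WITH_DATABAND_VERSION-all.jar gs://<your-bucket>/jars/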

For the list of all supported operators and extra information, see Tracking remote tasks.

Cluster bootstrap

If you are using a custom cluster installation, you must install the Databand packages and agent, and configure the environment variables for tracking.

Add the following commands to your cluster initialization script:

#!/bin/bash -x

DBND_VERSION=REPLACE_WITH_DBND_VERSION

## Configure your Databand Tracking Configuration (works only for generic cluster/dataproc, not for EMR)
sh -c "echo DBND__TRACKING=True >> /usr/lib/spark/conf/spark-env.sh"
sh -c "echo DBND__ENABLE__SPARK_CONTEXT_ENV=True >> /usr/lib/spark/conf/spark-env.sh"


## if you use Listeners/Agent, download Databand Agent which includes all JARs
wget https://repo1.maven.org/maven2/ai/databand/dbnd-agent/${DBND_VERSION}/dbnd-agent-${DBND_VERSION}-all.jar -P /home/hadoop/

## install Databand Python package together with Airflow support
python -m pip install databand==${DBND_VERSION}

  • Install the dbnd packages on the Spark main node and Spark workers by running pip install databand at bootstrap or manually.
  • Make sure that you don't install "dbnd-airflow" on the cluster.

How to provide Databand credentials with cluster bootstrap

If your cluster type supports configuring environment variables with a bootstrap script, you can use the bootstrap script to define Databand credentials at the cluster level:

sh -c "echo DBND__CORE__DATABAND_URL=REPLACE_WITH_DATABAND_URL >> /usr/lib/spark/conf/spark-env.sh"
sh -c "echo DBND__CORE__DATABAND_ACCESS_TOKEN=REPLACE_WITH_DATABAND_TOKEN >> /usr/lib/spark/conf/spark-env.sh"
  • Be sure to replace <databand-url> and <databand-access-token> with your environment-specific information.

Databand agent path and query listener configuration for Spark operators

Databand can automatically alter the spark-submit command for various Spark operators, inject the agent JAR into the classpath, and enable the query listener. You can use the following options to configure the dbnd_config Airflow connection:

{
    "tracking_spark": {
        "query_listener": true,
        "agent_path": "/home/hadoop/dbnd-agent-latest-all.jar",
        "jar_path": null
    }
}

  • query_listener - enables the Databand Spark query listener for auto-capturing dataset operations from Spark jobs.
  • agent_path - path to the Databand Java agent fat JAR. If this path is provided, Databand includes the agent in the Spark job through the spark.driver.extraJavaOptions configuration option. The agent is required if you want to track Java or Scala jobs that are annotated with @Task. The agent must be placed in the cluster's local file system to function properly.
  • jar_path - path to the Databand Java agent fat JAR. If this path is provided, Databand includes the JAR in the Spark job through the spark.jars configuration option. The JAR can be placed in a local file system or at an S3/GCS/DBFS path.

Properties can be configured with environment variables or .cfg files. For more information, see SDK configuration.
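For example, following the DBND__SECTION__KEY naming convention shown earlier in this topic, the same options might be set with environment variables; verify the exact property names against the SDK configuration documentation:

# assumed mapping of the tracking_spark options to environment variables
export DBND__TRACKING_SPARK__QUERY_LISTENER=True
export DBND__TRACKING_SPARK__AGENT_PATH=/home/hadoop/dbnd-agent-latest-all.jar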

Next steps

See the Tracking Python section to learn how to implement dbnd within your PySpark jobs. See the Tracking Spark and JVM applications section for Spark and JVM jobs.