Installing on a Spark cluster

Learn how to install Databand on your Spark cluster.

Before you begin, make sure that you can access the Databand server from your Spark cluster.
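As a quick check, you can run a small script from a cluster node to confirm that the Databand server is reachable. The following is a minimal sketch; REPLACE_WITH_DATABAND_URL is a placeholder for your environment-specific URL.

    import urllib.request

    # Minimal connectivity check (a sketch): run from a cluster node and
    # replace the placeholder with your Databand server URL.
    databand_url = "REPLACE_WITH_DATABAND_URL"
    with urllib.request.urlopen(databand_url, timeout=10) as response:
        print("Databand server responded with HTTP status:", response.status)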

Python integration

You must define the following environment variables in your Spark context:

  • DBND__CORE__DATABAND_URL - a Databand server URL
  • DBND__CORE__DATABAND_ACCESS_TOKEN - a Databand server access token
  • DBND__TRACKING=True
  • DBND__ENABLE__SPARK_CONTEXT_ENV=True

Install dbnd-spark. See Installing the Python SDK and the Cluster bootstrap.
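If you are not sure whether the variables reached your Spark context, a short check like the following sketch can run at the start of a PySpark job. The variable names are the ones listed above.

    import os

    # Fail fast if the Databand tracking variables are not visible to the driver.
    required = [
        "DBND__CORE__DATABAND_URL",
        "DBND__CORE__DATABAND_ACCESS_TOKEN",
        "DBND__TRACKING",
        "DBND__ENABLE__SPARK_CONTEXT_ENV",
    ]
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError("Databand tracking variables are not set: %s" % missing)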

JVM integration

The following environment variables must be defined in your Spark context:

  • DBND__CORE__DATABAND_URL - a Databand server URL

  • DBND__CORE__DATABAND_ACCESS_TOKEN - a Databand server access token

  • DBND__TRACKING=True

    For more information about the available parameters, see Installing JVM SDK and agent. Your cluster must have the Databand JAR available to use the listener and other features (a minimal listener-configuration sketch follows). For more information about how to configure Spark clusters, see the following section.
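
For a quick test, the listener classes that ship with the agent can also be attached when the Spark session is created. The following is a minimal sketch that mirrors the listener configuration used in the provider examples below; the JAR path is an example only and assumes the agent was already downloaded to the driver node.

    from pyspark.sql import SparkSession

    # Attach the Databand listeners to a Spark session (sketch).
    # The JAR path below is an example; point it to wherever you placed the agent.
    spark = (
        SparkSession.builder
        .appName("databand-tracked-job")
        .config("spark.jars", "/home/hadoop/dbnd-agent-REPLACE_WITH_DBND_VERSION-all.jar")
        .config("spark.sql.queryExecutionListeners", "ai.databand.spark.DbndSparkQueryExecutionListener")
        .config("spark.extraListeners", "ai.databand.spark.DbndSparkListener")
        .getOrCreate()
    )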

Spark clusters

With most clusters, you can set up Spark environment variables through cluster metadata or bootstrap scripts. In the following sections, you can find provider-specific instructions.

EMR cluster

Complete the following steps to install Databand on your EMR cluster:

  1. Use environment variables to configure Databand:

    Define the environment variables in the API call or in the EmrCreateJobFlowOperator Airflow operator. Alternatively, you can provide these variables from the AWS UI when you create a new cluster. EMR clusters don't support defining environment variables in the bootstrap script. If you use a custom operator or create clusters outside of Airflow, consult the official EMR documentation on Spark configuration.

    from airflow.hooks.base_hook import BaseHook
    # Import path for Airflow 1.10.x; adjust the import to match your Airflow version.
    from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator

    # Read the Databand URL and access token from the dbnd_config Airflow connection.
    dbnd_config = BaseHook.get_connection("dbnd_config").extra_dejson
    databand_url = dbnd_config["core"]["databand_url"]
    databand_access_token = dbnd_config["core"]["databand_access_token"]

    emr_create_job_flow = EmrCreateJobFlowOperator(
        job_flow_overrides={
            "Name": "<EMR Cluster Name>",
            #...
            "Configurations": [
                {
                    "Classification": "spark-env",
                    "Configurations": [
                        {
                            "Classification": "export",
                            "Properties": {
                                "DBND__TRACKING": "True",
                                "DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
                                "DBND__CORE__DATABAND_URL": databand_url,
                                "DBND__CORE__DATABAND_ACCESS_TOKEN": databand_access_token,
                            },
                        }
                    ],
                }
            ]
        },
        #...
    )
  2. Install Databand on your cluster:

    Use the following snippet to install Python and JVM integrations:

    #!/usr/bin/env bash
    
    DBND_VERSION=REPLACE_WITH_DBND_VERSION
    
    sudo python3 -m pip install pandas==1.2.0 pydeequ==1.0.1 databand==${DBND_VERSION}
    sudo wget https://repo1.maven.org/maven2/ai/databand/dbnd-agent/${DBND_VERSION}/dbnd-agent-${DBND_VERSION}-all.jar -P /home/hadoop/

    Add this script to your cluster's bootstrap actions list (a sketch of how to register it through the operator follows). For more information, see the bootstrap actions documentation.
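
If you create the cluster through EmrCreateJobFlowOperator, the bootstrap script can be registered in the same job_flow_overrides. The following is a minimal sketch; the S3 path and action name are examples, and it assumes you uploaded the script above to S3.

    # Import path for Airflow 1.10.x; adjust to your Airflow version.
    from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator

    # Sketch: register the bootstrap script with the EMR job flow.
    # The S3 path and action name below are examples; replace them with your own.
    emr_create_job_flow = EmrCreateJobFlowOperator(
        job_flow_overrides={
            #...
            "BootstrapActions": [
                {
                    "Name": "install-databand",
                    "ScriptBootstrapAction": {
                        "Path": "s3://REPLACE_WITH_YOUR_BUCKET/bootstrap/install_databand.sh",
                    },
                }
            ],
        },
        #...
    )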

Databricks cluster

  1. Use environment variables to configure Databand:

    In the cluster configuration screen, click Edit > Advanced Options > Spark. In the Environment Variables section, declare the following configuration variables. Be sure to replace REPLACE_WITH_DATABAND_URL and REPLACE_WITH_DATABAND_TOKEN with your environment-specific values:

    • DBND__TRACKING="True"
    • DBND__ENABLE__SPARK_CONTEXT_ENV="True"
    • DBND__CORE__DATABAND_URL="REPLACE_WITH_DATABAND_URL"
    • DBND__CORE__DATABAND_ACCESS_TOKEN="REPLACE_WITH_DATABAND_TOKEN"
  2. Install the Python dbnd library in your Databricks cluster:

    Under the Libraries tab of your cluster's configuration:

    1. Click Install New.
    2. Choose PyPI.
    3. Enter databand==REPLACE_WITH_DBND_VERSION as the Package name.
    4. Click Install.
  3. Install the Python dbnd library for your specific Airflow operator:

    Do not use this mode in production; use it only for trying out dbnd in specific tasks. Make sure that the dbnd library is installed on the Databricks cluster by adding databand to the libraries parameter of the DatabricksSubmitRunOperator, as shown in the following example:

    DatabricksSubmitRunOperator(
         #...
         json={"libraries": [
            {"pypi": {"package": "databand==REPLACE_WITH_DBND_VERSION"}},
        ]},
    )
    
  4. Track Scala or Java Spark jobs:
    1. Download the dbnd agent and place it in your DBFS working folder, for example /dbfs/apps/. For more configuration options, see the Databricks Runs Submit API documentation.
    2. Configure Spark application tracking with automatic dataset logging by adding ai.databand.spark.DbndSparkQueryExecutionListener to spark.sql.queryExecutionListeners and ai.databand.spark.DbndSparkListener to spark.extraListeners. This mode works only if you enable the dbnd agent.
    3. Use the following configuration of the Databricks job to enable the Databand Java agent with automatic dataset tracking (a consolidated example that also sets the environment variables follows this list):
      spark_operator = DatabricksSubmitRunOperator(
          json={
              #...
              "new_cluster": {
                  #...
                  "spark_conf": {
                      "spark.sql.queryExecutionListeners": "ai.databand.spark.DbndSparkQueryExecutionListener",
                      "spark.extraListeners": "ai.databand.spark.DbndSparkListener",
                      "spark.driver.extraJavaOptions": "-javaagent:/dbfs/apps/dbnd-agent-0.xx.x-all.jar",
                  },
              },
              #...
          },
      )
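
The pieces above can also be combined in a single run submission. The following is a minimal sketch, not a complete cluster specification; it assumes the agent JAR was published to /dbfs/apps/ and uses the spark_env_vars field of the Databricks new_cluster specification for the tracking variables. The REPLACE_WITH_* values are placeholders.

    # Import path for Airflow 1.10.x; adjust to your Airflow version.
    from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator

    # Sketch: one DatabricksSubmitRunOperator that installs the library,
    # sets the tracking environment variables, and enables the listeners.
    spark_operator = DatabricksSubmitRunOperator(
        #...
        json={
            "libraries": [
                {"pypi": {"package": "databand==REPLACE_WITH_DBND_VERSION"}},
            ],
            "new_cluster": {
                #...
                "spark_env_vars": {
                    "DBND__TRACKING": "True",
                    "DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
                    "DBND__CORE__DATABAND_URL": "REPLACE_WITH_DATABAND_URL",
                    "DBND__CORE__DATABAND_ACCESS_TOKEN": "REPLACE_WITH_DATABAND_TOKEN",
                },
                "spark_conf": {
                    "spark.sql.queryExecutionListeners": "ai.databand.spark.DbndSparkQueryExecutionListener",
                    "spark.extraListeners": "ai.databand.spark.DbndSparkListener",
                    "spark.driver.extraJavaOptions": "-javaagent:/dbfs/apps/dbnd-agent-REPLACE_WITH_DBND_VERSION-all.jar",
                },
            },
            #...
        },
    )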

Google Cloud Dataproc cluster

Complete the following steps to set up your Google Cloud Dataproc cluster:

  1. Set up your cluster:
    Define the environment variables during cluster setup, or add them to your bootstrap script as described in the previous sections about Spark clusters.
    from airflow.hooks.base_hook import BaseHook
    # Import path for Airflow 1.10.x; adjust the import to match your Airflow version.
    from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator

    # Read the Databand URL and access token from the dbnd_config Airflow connection.
    dbnd_config = BaseHook.get_connection("dbnd_config").extra_dejson
    databand_url = dbnd_config["core"]["databand_url"]
    databand_access_token = dbnd_config["core"]["databand_access_token"]

    cluster_create = DataprocClusterCreateOperator(
        # ...
        properties={
            "spark-env:DBND__TRACKING": "True",
            "spark-env:DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
            "spark-env:DBND__CORE__DATABAND_URL": databand_url,
            "spark-env:DBND__CORE__DATABAND_ACCESS_TOKEN": databand_access_token,
        },
        # ...
    )
  2. Use the same operator to install Databand PySpark support:
    cluster_create = DataprocClusterCreateOperator(
        #...
        properties={
            "dataproc:pip.packages": "dbnd==REPLACE_WITH_DATABAND_VERSION",
        },
        #...
    )
  3. Dataproc clusters support initialization actions. The following script installs the Databand libraries and sets the environment variables that are required for tracking:

    #!/usr/bin/env bash
    
    DBND_VERSION=REPLACE_WITH_DBND_VERSION
    
    ## to use conda-provided python instead of system one
    export PATH=/opt/conda/default/bin:${PATH}
    
    python3 -m pip install pydeequ==1.0.1 databand==${DBND_VERSION}
    
    DBND__CORE__DATABAND_ACCESS_TOKEN=$(/usr/share/google/get_metadata_value attributes/DBND__CORE__DATABAND_ACCESS_TOKEN)
    sh -c "echo DBND__CORE__DATABAND_ACCESS_TOKEN=${DBND__CORE__DATABAND_ACCESS_TOKEN} >> /usr/lib/spark/conf/spark-env.sh"
    DBND__CORE__DATABAND_URL=$(/usr/share/google/get_metadata_value attributes/DBND__CORE__DATABAND_URL)
    sh -c "echo DBND__CORE__DATABAND_URL=${DBND__CORE__DATABAND_URL} >> /usr/lib/spark/conf/spark-env.sh"
    sh -c "echo DBND__TRACKING=True >> /usr/lib/spark/conf/spark-env.sh"
    sh -c "echo DBND__ENABLE__SPARK_CONTEXT_ENV=True >> /usr/lib/spark/conf/spark-env.sh"
    Note:

    Values such as the access token and the tracker URL can be passed to the initialization action through cluster metadata properties, as sketched after this list. For more information, see the official Dataproc documentation.

  4. Publish your JAR to Google Cloud Storage, then use the following configuration of the PySpark Dataproc job to enable the Databand Spark query listener with automatic dataset tracking:
    # Import path for Airflow 1.10.x; adjust to your Airflow version.
    from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

    pyspark_operator = DataProcPySparkOperator(
        #...
        dataproc_pyspark_jars=["gs://.../dbnd-agent-REPLACE_WITH_DATABAND_VERSION-all.jar"],
        dataproc_pyspark_properties={
            "spark.sql.queryExecutionListeners": "ai.databand.spark.DbndSparkQueryExecutionListener",
            "spark.extraListeners": "ai.databand.spark.DbndSparkListener",
        },
        #...
    )
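
To pass the credentials into the initialization action as described in the note above, the cluster-create call can attach the script and the metadata together. The following is a minimal sketch; the GCS path is an example, the parameter names follow the Airflow 1.10 DataprocClusterCreateOperator, and the databand_url and databand_access_token variables are the ones read from the dbnd_config connection in step 1.

    # Import path for Airflow 1.10.x; adjust to your Airflow version.
    from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator

    # Sketch: attach the init action and pass the Databand credentials as
    # cluster metadata (read back in the script with get_metadata_value).
    # The GCS bucket path is an example placeholder.
    cluster_create = DataprocClusterCreateOperator(
        #...
        init_actions_uris=["gs://REPLACE_WITH_YOUR_BUCKET/init/install_databand.sh"],
        metadata={
            "DBND__CORE__DATABAND_URL": databand_url,
            "DBND__CORE__DATABAND_ACCESS_TOKEN": databand_access_token,
        },
        #...
    )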

For the list of all supported operators and extra information, see Tracking remote tasks.

Cluster bootstrap

If you are using a custom cluster installation, you must install the Databand packages and agent, and configure the environment variables for tracking.

  1. Set up your cluster:
    1. Add the following commands to your cluster initialization script:
      #!/bin/bash -x
      
      DBND_VERSION=REPLACE_WITH_DBND_VERSION
      
      ## Configure your Databand Tracking Configuration (works only for generic cluster/dataproc, not for EMR)
      sh -c "echo DBND__TRACKING=True >> /usr/lib/spark/conf/spark-env.sh"
      sh -c "echo DBND__ENABLE__SPARK_CONTEXT_ENV=True >> /usr/lib/spark/conf/spark-env.sh"
      
      
      ## if you use Listeners/Agent, download Databand Agent which includes all JARs
      wget https://repo1.maven.org/maven2/ai/databand/dbnd-agent/${DBND_VERSION}/dbnd-agent-${DBND_VERSION}-all.jar -P /home/hadoop/
      
      ## install Databand Python package together with Airflow support
      python -m pip install databand==${DBND_VERSION}
    2. Install the dbnd packages on the Spark main node and the Spark workers by running pip install databand in the bootstrap script or manually.
    3. Make sure that you don't install "dbnd-airflow" on the cluster.
  2. Provide Databand credentials for your cluster bootstrap:
    If your cluster type supports configuring environment variables with a bootstrap script, you can use that script to define the Databand credentials at the cluster level. Be sure to replace REPLACE_WITH_DATABAND_URL and REPLACE_WITH_DATABAND_TOKEN with your environment-specific values:
    sh -c "echo DBND__CORE__DATABAND_URL=REPLACE_WITH_DATABAND_URL >> /usr/lib/spark/conf/spark-env.sh"
    sh -c "echo DBND__CORE__DATABAND_ACCESS_TOKEN=REPLACE_WITH_DATABAND_TOKEN >> /usr/lib/spark/conf/spark-env.sh"
  3. Configure the Databand agent path and query listener for Spark operators:

    Databand can automatically alter the spark-submit command for various Spark operators, inject the agent JAR into the classpath, and enable the query listener. You can use the following options to configure the dbnd_config Airflow connection:

    {
        "tracking_spark": {
            "query_listener": true,
            "agent_path": "/home/hadoop/dbnd-agent-latest-all.jar",
            "jar_path": null
        }
    }
    query_listener
    Enables the Databand Spark query listener for automatically capturing dataset operations from Spark jobs.
    agent_path
    The path to the Databand Java agent fat JAR. If this path is provided, Databand includes the agent in the Spark job through the spark.driver.extraJavaOptions configuration option. The agent is required if you want to track Java or Scala jobs that are annotated with @Task. The agent must be placed in the cluster's local file system to function properly.
    jar_path
    The path to the Databand Java agent fat JAR. If this path is provided, Databand includes the JAR in the Spark job through the spark.jars configuration option. The JAR can be placed in a local file system or in an S3/GCS/DBFS path.

Properties can be configured with environment variables or .cfg files. For more information, see SDK configuration.
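
For example, under the DBND__<SECTION>__<KEY> convention used throughout this guide, the same tracking_spark options could be set as environment variables. This is a sketch; verify the exact keys against the SDK configuration reference.

    import os

    # Sketch: environment-variable form of the tracking_spark options above,
    # following the DBND__<SECTION>__<KEY> naming convention.
    os.environ["DBND__TRACKING_SPARK__QUERY_LISTENER"] = "True"
    os.environ["DBND__TRACKING_SPARK__AGENT_PATH"] = "/home/hadoop/dbnd-agent-latest-all.jar"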

Next steps

See the Tracking Python section for implementing dbnd in your PySpark jobs. See the Tracking Spark and JVM applications section for Spark and JVM jobs.