Installing JVM SDK and agent

Databand provides a set of Java libraries for tracking JVM applications, such as Spark jobs written in Scala or Java.

Follow these instructions to start tracking JVM applications.

Adding Databand libraries to your JVM application

When you add the Databand library to your Spark project, include the Databand libraries with whichever tool you use to deploy your JVM project and its dependencies to the Spark cluster. For example, include the Databand libraries in your FatJar.

Maven

Add Databand JVM SDK to your Maven project by adding it as a dependency in your POM file.

<dependency>
    <groupId>ai.databand</groupId>
    <artifactId>dbnd-client</artifactId>
    <version>0.xx.x</version>
</dependency>

SBT

Add Databand JVM SDK to your SBT project by adding the following lines to your build.sbt file:

libraryDependencies += "ai.databand" % "dbnd-client" % "0.xx.x"

Gradle

Add Databand JVM SDK to your Gradle project by adding the following line to the dependencies list in your build.gradle file:

compile('ai.databand:dbnd-client:0.xx.x')

On Gradle 7 and later, the compile configuration has been removed; use implementation('ai.databand:dbnd-client:0.xx.x') instead.

Manual

If you don't use a build system, or you're just running a PySpark script and still want to use the Databand JVM libraries for listeners, you can download our JARs and add them to the Spark application manually by using --jars or --packages.

You can use a direct link to the Maven repository. However, if you use the Databand JVM libraries in production, download the JAR to local or remote storage instead of fetching the agent from the Maven repository on every run. Select the version of Databand that you want and download dbnd-agent-0.xx.x-all.jar from the Maven repository. For automation, you can use the following script:

DBND_VERSION=0.XX.X
wget https://repo1.maven.org/maven2/ai/databand/dbnd-agent/${DBND_VERSION}/dbnd-agent-${DBND_VERSION}-all.jar -P /home/hadoop/

Store the agent JAR at a location available to your JVM application. This example uses the /home/hadoop/ folder, but you can use any other folder; for example, if you run locally, use your own user folder. For usage inside your Spark cluster, you can also publish this JAR to remote storage such as Google Cloud Storage or S3.
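For example, after downloading the agent JAR, you can attach it to a PySpark run with --jars, or resolve the client libraries straight from the Maven repository with --packages. A minimal sketch, assuming the /home/hadoop/ path from the script above and a placeholder script name my_script.py:

## via --jars with a locally stored agent JAR
spark-submit --jars /home/hadoop/dbnd-agent-0.xx.x-all.jar my_script.py
## via --packages, resolved from the Maven repository at submit time
spark-submit --packages "ai.databand:dbnd-client:0.xx.x" my_script.py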

Configuration

Databand JVM SDK uses the same properties as the Python SDK. However, not all of them are supported, and the methods of configuration are slightly different.

The JVM SDK is configured through environment variables that are passed to the JVM process. If you use Spark, you can set these variables by using spark.env properties.

## via export
export DBND__CORE__DATABAND_URL=...
## via spark.env
spark-submit ...  --conf "spark.env.DBND__TRACKING=True" ...

Use the --conf approach for distributed Spark execution: when you submit your job to a remote cluster, your local environment variables are not included in the submission.

The following configuration properties are supported in JVM SDK:

Table 1. List of configuration properties, their default values, and descriptions
Variable | Default value | Description
DBND__TRACKING__PROJECT | default | Overrides the project name. The maximum supported length is 100 characters.
DBND__TRACKING | False | Mandatory; explicitly enables tracking. Must be set to True; when unset or set to False, tracking is disabled. Note: when a job runs inside Airflow, you can omit this property.
DBND__CORE__DATABAND_URL | Not set | Mandatory. The URL of the Databand tracker that receives the tracking data.
DBND__CORE__DATABAND_ACCESS_TOKEN | Not set | Mandatory. The access token used to authenticate tracking requests.
DBND__TRACKING__VERBOSE | False | When set to True, enables verbose logging, which can help with debugging agent instrumentation.
DBND__TRACKING__LOG_VALUE_PREVIEW | False | When set to True, previews for Spark datasets are calculated. This can impact performance and must be explicitly enabled.
DBND__LOG__PREVIEW_HEAD_BYTES | 32768 | The size of the task log head in bytes. When the log size exceeds head+tail, the middle of the log is truncated.
DBND__LOG__PREVIEW_TAIL_BYTES | 32768 | The size of the task log tail in bytes. When the log size exceeds head+tail, the middle of the log is truncated.
DBND__TRACKING__JOB | The Spark application name, the main method name, or the @Task annotation value if set | Overrides the job name.
DBND__RUN_INFO__NAME | A randomly generated string from a predefined list | Overrides the run name.

Minimal Spark configuration

You can define the following environment variables in your Spark context or JVM job.

  • DBND__CORE__DATABAND_URL - provides a Databand server URL
  • DBND__CORE__DATABAND_ACCESS_TOKEN - provides a Databand server access token
  • DBND__TRACKING=True - enables JVM and Python in-place tracking
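
If the Spark driver runs on the machine where you set the variables, plain exports are enough. A minimal sketch; the URL and token values are placeholders:

export DBND__CORE__DATABAND_URL=REPLACE_WITH_DATABAND_URL
export DBND__CORE__DATABAND_ACCESS_TOKEN=REPLACE_WITH_DATABAND_TOKEN
export DBND__TRACKING=True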

Configuring a local Spark submit

Use this method to quickly try out and iterate on Databand configuration on a Spark cluster for a proof of concept. This method is not suitable for production. For spark-submit scripts, use spark.env to pass variables:

spark-submit \
    --conf "spark.env.DBND__TRACKING=True" \
    --conf "spark.env.DBND__CORE__DATABAND_URL=REPLACE_WITH_DATABAND_URL" \
    --conf "spark.env.DBND__CORE__DATABAND_ACCESS_TOKEN=REPLACE_WITH_DATABAND_TOKEN"

Context properties of Airflow tracking

AIRFLOW_CONTEXT parameters are supported as a part of the Airflow integration. These properties must be set to properly connect the JVM task run to the parent Airflow task that triggered the execution. For more information, see Tracking remote tasks.
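
As a sketch, Airflow exports its task context to the environment as AIRFLOW_CTX_* variables, and you can forward them to the Spark driver the same way as the DBND__ variables; the exact set of required variables is listed in Tracking remote tasks:

spark-submit ... \
    --conf "spark.env.AIRFLOW_CTX_DAG_ID=${AIRFLOW_CTX_DAG_ID}" \
    --conf "spark.env.AIRFLOW_CTX_TASK_ID=${AIRFLOW_CTX_TASK_ID}" \
    --conf "spark.env.AIRFLOW_CTX_EXECUTION_DATE=${AIRFLOW_CTX_EXECUTION_DATE}"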

Databand listeners

To use and set up Databand listeners, you must bring an extra package ai.databand:dbnd-client into the runtime of your Spark application by using one of the following methods:

  • If your JVM project was built and integrated with your Spark environment, change your JVM project config.
  • If you want to directly connect the JAR to your Spark application, you can use a bootstrap script to add it to the --jars option of your spark-submit command, or use a direct link to the Maven repository.
  • If you want to use the Spark --packages option: spark-submit --packages "ai.databand:dbnd-client:REPLACE_WITH_VERSION".
  • If your agent is already installed and enabled, you don't need to reference any specific Databand JAR in your JVM project. Our JAR agent already contains all relevant SDK libraries.

Enable the Databand listeners explicitly with the Spark command line:

spark-submit ... \
    --conf "spark.sql.queryExecutionListeners=ai.databand.spark.DbndSparkQueryExecutionListener"\
    --conf "spark.extraListeners=ai.databand.spark.DbndSparkListener"

Databand JVM agent

If you want to use the JVM agent, you must manually integrate it into your Java application. Download the JVM agent to a location that is available to the Spark driver process during execution. See the preceding instructions.

Your job must be submitted with the following parameter:

spark-submit ... --conf "spark.driver.extraJavaOptions=-javaagent:/opt/dbnd-agent-latest-all.jar"

If you have an agent, you can enable Databand listeners without explicitly referencing them in your JVM project. The agent has all the required Databand code in its FatJar (-all.jar file).
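
Putting the pieces together, a complete agent-based submission could look like the following sketch. The agent path, the version, the main class, and the application JAR name are placeholders; replace them with your own values:

spark-submit \
    --conf "spark.env.DBND__TRACKING=True" \
    --conf "spark.env.DBND__CORE__DATABAND_URL=REPLACE_WITH_DATABAND_URL" \
    --conf "spark.env.DBND__CORE__DATABAND_ACCESS_TOKEN=REPLACE_WITH_DATABAND_TOKEN" \
    --conf "spark.driver.extraJavaOptions=-javaagent:/home/hadoop/dbnd-agent-0.xx.x-all.jar" \
    --conf "spark.sql.queryExecutionListeners=ai.databand.spark.DbndSparkQueryExecutionListener" \
    --conf "spark.extraListeners=ai.databand.spark.DbndSparkListener" \
    --class com.example.MySparkJob \
    my-spark-job.jar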