Installing JVM SDK and agent
Databand provides a set of Java libraries for tracking JVM-specific applications such as Spark jobs written in Scala or Java.
Follow these instructions to start tracking JVM applications.
Adding Databand libraries to your JVM application
Include the Databand libraries with whichever tool you use to deploy your JVM project and its dependencies to the Spark cluster. For example, include the Databand libraries in your FatJar.
Maven
Add Databand JVM SDK to your Maven project by adding it as a dependency in your POM file.
<dependency>
    <groupId>ai.databand</groupId>
    <artifactId>dbnd-client</artifactId>
    <version>0.xx.x</version>
</dependency>
SBT
Add Databand JVM SDK to your SBT project by adding the following line to your build.sbt file:
libraryDependencies += "ai.databand" % "dbnd-client" % "0.xx.x"
Gradle
Add Databand JVM SDK to your Gradle project by adding the following line to the dependencies list in your build.gradle file:
compile('ai.databand:dbnd-client:0.xx.x')
Manual
If you don't use a build system, or you're just running a PySpark script, and you still want to use the Databand JVM libraries for listeners, you can download our JARs and add them to the Spark application manually by using --jars or --packages.
You can use a direct link to the Maven repository. If you use the Databand JVM SDK in production, download the JAR to local or remote storage instead of pulling the agent JAR from the Maven repository on every run. Select the version of Databand that you want and download dbnd-agent-0.xx.x-all.jar from the Maven repository. For automation, you can use the following script:
DBND_VERSION=0.XX.X
wget https://repo1.maven.org/maven2/ai/databand/dbnd-agent/${DBND_VERSION}/dbnd-agent-${DBND_VERSION}-all.jar -P /home/hadoop/
Store the agent JAR at a location that is available to your JVM application. This example uses the /home/hadoop/ folder, but you can use any other folder; for example, if you run locally, use your own user folder. For use inside your Spark cluster, you can also publish this JAR to remote storage, for example Google Cloud Storage or S3.
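For example, you could publish the agent JAR to S3 and reference it at submit time. This is a minimal sketch, not part of the official instructions: the bucket name and paths are hypothetical, and it assumes the AWS CLI is configured and DBND_VERSION is set as in the preceding script.
# Upload the downloaded agent JAR to a bucket that your cluster can read (hypothetical bucket)
aws s3 cp /home/hadoop/dbnd-agent-${DBND_VERSION}-all.jar s3://my-bucket/jars/
# Reference the published JAR when you submit the Spark application
spark-submit --jars s3://my-bucket/jars/dbnd-agent-${DBND_VERSION}-all.jar ...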
Configuration
Databand JVM SDK uses the same properties as the Python SDK. However, not all of them are supported, and the methods of configuration are slightly different.
The JVM SDK reads its configuration from environment variables passed to the JVM process.
If you use Spark, you can set these variables by using spark.env properties.
## via export
export DBND__CORE__DATABAND_URL=...
## via spark.env
spark-submit ... --conf "spark.env.DBND__TRACKING=True" ...
Use the --conf approach for distributed Spark execution: when you submit your job to a remote cluster, your local environment variables are not included in the submission.
The following configuration properties are supported in the JVM SDK:
| Variable | Default value | Description |
|---|---|---|
| DBND__TRACKING__PROJECT | default | Overrides the project name. The maximum supported length is 100 characters. |
| DBND__TRACKING | False | Mandatory. Explicitly enables tracking. Possible values: True or False; the value must be set to True. When the value is not set or is set to False, tracking is not enabled. Note: when a job runs inside Airflow, you can omit this property. |
| DBND__CORE__DATABAND_URL | Not set | Mandatory. The Databand tracker URL that tracking data is sent to. |
| DBND__CORE__DATABAND_ACCESS_TOKEN | Not set | Mandatory. The access token that is used to authenticate tracking requests. |
| DBND__TRACKING__VERBOSE | False | When set to True, enables verbose logging, which can help with debugging agent instrumentation. |
| DBND__TRACKING__LOG_VALUE_PREVIEW | False | When set to True, previews for Spark datasets are calculated. This can impact performance and must be explicitly enabled. |
| DBND__LOG__PREVIEW_HEAD_BYTES | 32768 | The size of the task log head in bytes. When the log size exceeds head+tail, the middle of the log is truncated. |
| DBND__LOG__PREVIEW_TAIL_BYTES | 32768 | The size of the task log tail in bytes. When the log size exceeds head+tail, the middle of the log is truncated. |
| DBND__TRACKING__JOB | The Spark application name, the main method name, or the @Task annotation value if it was set | Overrides the job name. |
| DBND__RUN_INFO__NAME | A randomly generated string from a predefined list | Overrides the run name. |
Minimal Spark configuration
You can define the following environment variables in your Spark context or JVM job:
- DBND__CORE__DATABAND_URL: the Databand server URL
- DBND__CORE__DATABAND_ACCESS_TOKEN: the Databand server access token
- DBND__TRACKING=True: enables JVM and Python in-place tracking
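For example, where exported variables reach the driver process (such as local runs), a minimal setup could look like the following sketch; the URL and token values are placeholders:
# Minimal Databand tracking configuration via environment variables
export DBND__CORE__DATABAND_URL=REPLACE_WITH_DATABAND_URL
export DBND__CORE__DATABAND_ACCESS_TOKEN=REPLACE_WITH_DATABAND_TOKEN
export DBND__TRACKING=True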
Configuring a local Spark submit
Use this method to quickly try and iterate on Databand configuration on a Spark cluster, for example for a proof of concept. This method is not suitable for production. For spark-submit scripts, use spark.env to pass variables:
spark-submit \
--conf "spark.env.DBND__TRACKING=True" \
--conf "spark.env.DBND__CORE__DATABAND_URL=REPLACE_WITH_DATABAND_URL" \
--conf "spark.env.DBND__CORE__DATABAND_ACCESS_TOKEN=REPLACE_WITH_DATABAND_TOKEN"
Context properties of Airflow tracking
AIRFLOW_CONTEXT parameters are supported as part of the Airflow integration. These properties must be set to properly connect a JVM task run to the parent Airflow task that triggered the execution. For more information, see Tracking remote tasks.
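As a sketch, assuming spark-submit runs from within an Airflow worker, the standard Airflow context variables could be forwarded to the Spark driver through spark.env; the exact set of variables that your setup requires may differ:
# Forward Airflow task context to the Spark driver environment (assumed variable set)
spark-submit ... \
  --conf "spark.env.AIRFLOW_CTX_DAG_ID=${AIRFLOW_CTX_DAG_ID}" \
  --conf "spark.env.AIRFLOW_CTX_TASK_ID=${AIRFLOW_CTX_TASK_ID}" \
  --conf "spark.env.AIRFLOW_CTX_EXECUTION_DATE=${AIRFLOW_CTX_EXECUTION_DATE}" \
  --conf "spark.env.AIRFLOW_CTX_TRY_NUMBER=${AIRFLOW_CTX_TRY_NUMBER}"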
Databand listeners
To use and set up Databand listeners, you must bring the ai.databand:dbnd-client package into the runtime of your Spark application by using one of the following methods:
- If your JVM project was built and integrated with your Spark environment, change your JVM project configuration.
- If you want to connect the JAR directly to your Spark application, you can use bootstrap and add it to the --jars option of your spark-submit command, or use a direct link to Maven.
- If you want to use the Spark --packages option: spark-submit --packages "ai.databand:dbnd-client:REPLACE_WITH_VERSION".
- If your agent is already installed and enabled, you don't need to reference any specific Databand JAR in your JVM project. The agent JAR already contains all relevant SDK libraries.
Enable the Databand listeners explicitly on the Spark command line:
spark-submit ... \
--conf "spark.sql.queryExecutionListeners=ai.databand.spark.DbndSparkQueryExecutionListener" \
--conf "spark.extraListeners=ai.databand.spark.DbndSparkListener"
Databand JVM agent
If you want to use the JVM agent, you must integrate it into your Java application manually. Download the JVM agent to a location that is available to the Spark process during execution. See the preceding instructions.
Your job must be submitted with the following parameter:
spark-submit ... --conf "spark.driver.extraJavaOptions=-javaagent:/opt/dbnd-agent-latest-all.jar"
If you use the agent, you can enable the Databand listeners without explicitly referencing them in your JVM project. The agent contains all the required Databand code in its FatJar (the -all.jar file).
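Putting it all together, a complete agent-based submission could look like the following sketch; the agent path, URL, token, and application JAR are placeholders that depend on your setup:
# Enable tracking, point at the Databand server, and attach the agent to the driver
spark-submit \
  --conf "spark.env.DBND__TRACKING=True" \
  --conf "spark.env.DBND__CORE__DATABAND_URL=REPLACE_WITH_DATABAND_URL" \
  --conf "spark.env.DBND__CORE__DATABAND_ACCESS_TOKEN=REPLACE_WITH_DATABAND_TOKEN" \
  --conf "spark.driver.extraJavaOptions=-javaagent:/home/hadoop/dbnd-agent-0.xx.x-all.jar" \
  your-application.jar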