Spark and JVM applications
Databand offers many options to cover a wide range of Spark applications.
Spark applications can be deployed in many ways, so make sure you use the integration that is best suited for you. Currently, Databand supports Python, Scala, and Java Spark applications. Databand also collects Spark-specific metadata such as data access metrics and Spark execution logs. With Databand, you gain greater visibility into your Spark execution alongside your pipeline or orchestration systems.
Tracking PySpark
The most basic tracking option requires you to include the dbnd package in your application and use dbnd_tracking():
from dbnd import log_metric, task
from pyspark.sql import SparkSession
from operator import add


@task
def calculate_counts(input_file, output_file):
    spark = SparkSession.builder.appName("PythonWordCount").getOrCreate()
    lines = spark.read.text(input_file).rdd.map(lambda r: r[0])
    counts = (
        lines.flatMap(lambda x: x.split(" ")).map(lambda x: (x, 1)).reduceByKey(add)
    )
    counts.saveAsTextFile(output_file)
    output = counts.collect()
    # Report the number of distinct words as a Databand metric.
    log_metric("counts", len(output))
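If the script is launched directly rather than from an orchestrator, the decorated task can be wrapped in dbnd_tracking() so the run is reported to Databand. The snippet below is a minimal sketch; the input and output paths are placeholders.

from dbnd import dbnd_tracking

if __name__ == "__main__":
    # Placeholder paths for illustration; any storage accessible to Spark works.
    with dbnd_tracking():
        calculate_counts("s3://my-bucket/words.txt", "s3://my-bucket/word_counts")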
For more information about advanced configuration settings, see the Tracking PySpark guide.
Tracking Scala and Java Spark applications
To get insights into your Scala and Java applications, include the dbnd-client JAR in your application and use DbndLogger.logMetric, DbndLogger.logDatasetOperation, or other useful methods:
...
object GenerateReports {
  @Task
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("example").getOrCreate
    val path = "/data/daily_data"
    val df: DataFrame = spark.read.option("header", "true").csv(path)
    // Report the read of the daily data as a dataset operation.
    DbndLogger.logDatasetOperation(path, READ, df)
  }
}
For the full guide, see Tracking Spark with Scala or Java.
Automatic tracking of dataset operations with the Databand query execution listener
Databand can capture Spark I/O operations. The Databand query execution listener captures every read and write operation performed by Spark and tracks the data path, schema, and row count as a dataset operation. See Dataset tracking for more details, and Installing JVM SDK and agent for installation details.
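As a minimal sketch, the listener can be registered through Spark's standard spark.sql.queryExecutionListeners property once the Databand JARs are on the classpath. The listener class name below is indicative; take the exact value and the required JARs from Installing JVM SDK and agent.

from pyspark.sql import SparkSession

# Assumes the Databand SDK/agent JARs are already on the driver classpath
# (see Installing JVM SDK and agent); the listener class name is indicative.
spark = (
    SparkSession.builder.appName("tracked-app")
    .config(
        "spark.sql.queryExecutionListeners",
        "ai.databand.spark.DbndSparkQueryExecutionListener",
    )
    .getOrCreate()
)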
Data sources supported by the Databand query execution listener
The Databand query execution listener supports the following data sources (see the example after the list):
- Regular files on any storage (local file system, S3, GCS, or any other)
- Hive tables
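For example, a read from a Hive table followed by a write to a file is captured as dataset operations in the same way; the table and path names below are illustrative.

daily_df = spark.table("analytics.daily_data")             # Hive table read
daily_df.write.parquet("s3://my-bucket/daily_data_copy/")  # file write on object storage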
Dataset operations that are not automatically tracked by the Databand query execution listener
The Databand query execution listener does not track debugging and development dataset operations that only print data to standard output. It tracks only production code that stores results in output files.
- Example snippet of code that is not tracked by the Databand query execution listener. Neither the read nor the output to standard output is reported by Databand:

  source_path = 'dbfs:/FileStore/MyData.json'
  mydata = spark.read.json(source_path)
  mydata.show()

- Example snippet of code that is tracked by the Databand query execution listener. The read and write operations are reported by Databand:

  source_path = 'dbfs:/FileStore/MyData.json'
  target_path = 'dbfs:/FileStore/MyOutput.json'
  mydata = spark.read.json(source_path)
  mydata.write.format('json').save(target_path)
Track your data quality by using the Deequ library
Deequ is a library for measuring data quality that is built on top of Spark. Deequ provides a DSL for "unit-testing" your data. Databand can capture any metrics that are produced during Deequ profiling, and histograms that Deequ generates during profiling are also reported to Databand.
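As a minimal PySpark sketch, Deequ metrics can be produced with the pydeequ package; the column name and analyzers below are illustrative, and the Deequ JAR must be available on the cluster (see the guides linked below for how Databand captures these metrics).

import pydeequ
from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Completeness, Size
from dbnd import log_metric
from pyspark.sql import SparkSession

# The Deequ JAR must be on the Spark classpath; pydeequ exposes its Maven coordinate.
spark = (
    SparkSession.builder.config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate()
)
mydata = spark.read.json("dbfs:/FileStore/MyData.json")  # input path reused from the examples above

# Profile the data; the metrics produced here are what Databand can capture.
result = (
    AnalysisRunner(spark)
    .onData(mydata)
    .addAnalyzer(Size())
    .addAnalyzer(Completeness("id"))  # "id" is an illustrative column name
    .run()
)
metrics_df = AnalyzerContext.successMetricsAsDataFrame(spark, result)
log_metric("deequ_metric_rows", metrics_df.count())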
For PySpark, see Tracking PySpark, and for JVM Spark, see Tracking Spark (Scala/Java).
Deployment-specific guides
See Installing JVM SDK and agent.
Databand also supports advanced tracking for the following cluster types:
- EMR
- Databricks
- Dataproc
For more information, see Installing on a Spark cluster.