Apache Airflow

Databand provides monitoring, alerting, and analytical functions that give you visibility into the health and reliability of your Airflow DAGs (Airflow pipelines). Databand can monitor multiple Airflow instances, providing a centralized tracking system for company-wide DAGs.

You can use DAG tracking functions for more visibility into:

  • Metadata from operators
  • Task code, logs, and errors
  • Data processing engines such as dbt and Spark

To check which metadata is tracked and how metadata tracking can be configured, see Collected metadata in the data collection cheat sheet.

Architecture of Airflow tracking by Databand

Databand tracks all operators and can capture runtime information from every .execute() call within any Airflow operator. Everything that happens within the boundaries of the .execute() function is tracked, for example:

  • Operator start and end time
  • User metrics emitted from the code
  • User exceptions
  • Source code (optional)
  • Logs (optional)
  • Return value (optional)

As soon as Databand is integrated with your cluster, you can use all of the Python SDK functions inside your operator implementation (for more information, see Python).
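For example, you can emit a custom metric from an operator's Python callable with the dbnd SDK's log_metric function. The following is a minimal sketch; the DAG, task, and metric names are illustrative:

  # A minimal sketch of emitting a user metric from operator code.
  # Requires the dbnd package on the Airflow cluster; all names are illustrative.
  from datetime import datetime

  from airflow import DAG
  from airflow.operators.python import PythonOperator
  from dbnd import log_metric


  def process_records():
      records = list(range(100))  # placeholder workload
      # Captured by Databand during the operator's .execute() call
      log_metric("records_processed", len(records))


  with DAG(
      dag_id="example_tracked_dag",
      start_date=datetime(2024, 1, 1),
      schedule_interval=None,
  ) as dag:
      process = PythonOperator(
          task_id="process_records",
          python_callable=process_records,
      )

Because the metric is logged inside the callable, it is captured together with the operator's start and end time, logs, and errors.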

You can also use the Airflow syncer, which syncs execution metadata from the Airflow database.

Some operators trigger "remote" execution, so a connection between the Airflow operator and the subprocess execution must be established. Databand supports multiple Spark-related operators, the Bash operator, and several others. For more information, see Tracking remote tasks.
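As an illustration, with a Spark-related operator the tracking context can be passed to the subprocess through dbnd configuration environment variables. The following is a minimal sketch, assuming a SparkSubmitOperator; the application path, URL, and token are placeholders, and Tracking remote tasks describes the exact configuration for each operator type:

  # A minimal sketch of connecting an Airflow operator to its Spark subprocess.
  # The application path, URL, and access token are placeholders.
  from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

  submit_spark_app = SparkSubmitOperator(
      task_id="submit_spark_app",
      application="/path/to/spark_app.py",  # placeholder path
      env_vars={
          # dbnd configuration variables that let the subprocess report
          # its metadata back to the same Databand run as the operator
          "DBND__TRACKING": "True",
          "DBND__CORE__DATABAND_URL": "https://your-databand-url",
          "DBND__CORE__DATABAND_ACCESS_TOKEN": "<access-token>",
      },
  )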

Figure: Databand architecture with Airflow Sync as a DAG.

The diagram shows how tracking works in Airflow with the Databand package installed. Information from Airflow is tracked with two methods: an Airflow Sync DAG (a DAG defined in databand_airflow_monitor.py) and the DBND Client, a set of tools used in each DAG execution. The Airflow Sync DAG continuously queries the Airflow database for DAG runs and other crucial information, and sends the data to the Databand web server. The DBND Client is a Python package that wraps user code and runs during DAG execution. As a result, real-time data such as logs, dbt data, and Spark data is sent to Databand.

Setting up an Airflow integration

To integrate Databand with your Airflow environment:

  1. Install dbnd-airflow-auto-tracking, Databand's runtime tracking Python package, on your Airflow cluster. For more information, see Integrating with managed Airflow providers.
  2. Install the Airflow monitor DAG (a minimal sketch follows this list). For more information, see Creating the Databand Airflow monitor DAG.
  3. Add and configure the Airflow integration in the Databand application. For more information, see Adding and configuring an Airflow integration.
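For step 2, the monitor DAG is typically a single file placed in your Airflow dags folder. The following is a minimal sketch; the import path can differ between dbnd versions, so verify it against Creating the Databand Airflow monitor DAG:

  # databand_airflow_monitor.py -- a minimal sketch of the monitor DAG file.
  # The import path may differ between dbnd versions.
  from airflow_monitor.monitor_as_dag import get_monitor_dag

  # This DAG continuously syncs execution metadata from the Airflow
  # database to the Databand web server.
  dag = get_monitor_dag()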

Managed Airflow providers

Databand integrates with Airflow providers such as self-managed Apache Airflow, Astronomer, Amazon Managed Workflows for Apache Airflow, and Google Cloud Composer. When you use a managed Airflow provider, you need to know which URL to use as the Airflow URL in Databand and how to install Python libraries. For more information, see Installing the dbnd-airflow-auto-tracking package on an Airflow cluster and Special considerations for managed Airflow providers. This document explains how to add an Airflow integration by using self-managed Airflow as an example. Regardless of the provider that you select, the remaining steps of integrating Databand with Airflow are the same.

Learn more