Apache Airflow
Databand provides monitoring, alerting, and analytical functions that help you track the health and reliability of your Airflow DAGs (Airflow pipelines). Databand can monitor multiple Airflow instances, providing a centralized tracking system for company-wide DAGs.
You can use DAG tracking functions for more visibility into:
- Metadata from operators
- Task code, logs, and errors
- Data processing engines such as dbt and Spark
To check which metadata is tracked and how metadata tracking can be configured, go to the data collection cheat sheet and see Collected metadata.
Architecture of Airflow tracking by Databand
Databand tracks all operators and can capture runtime information from every .execute() call within any Airflow operator. Everything that happens within the boundaries of the .execute() function is tracked, for example:
- Operator start and end time
- User metrics emitted from the code
- User exceptions
- Source code (optional)
- Logs (optional)
- Return value (optional)
Once Databand is integrated with your cluster, you can use all of Databand's Python functions inside your operator implementation (for more information, see Python).
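For example, here is a minimal sketch of a custom operator whose .execute() body Databand could track, assuming the dbnd packages are installed on the cluster. log_metric comes from the dbnd Python API; the operator name, metric name, and data-loading logic are hypothetical:

```python
from airflow.models import BaseOperator

from dbnd import log_metric  # available once the dbnd packages are installed


class ScoreRowsOperator(BaseOperator):
    """Hypothetical operator; Databand tracks everything inside execute()."""

    def execute(self, context):
        rows = self._load_rows()

        # User metric: shows up in Databand next to the operator's
        # start and end time.
        log_metric("rows_scored", len(rows))

        if not rows:
            # User exceptions raised inside execute() are captured as task errors.
            raise ValueError("no rows to score")

        # The return value can optionally be tracked as well.
        return len(rows)

    def _load_rows(self):
        # Placeholder for real data-loading logic.
        return [{"id": 1}, {"id": 2}]
```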
You can also use the Airflow syncer, which syncs execution metadata from the Airflow database.
Some operators trigger "remote" execution, so a connection between the Airflow operator and the subprocess execution must be established. Databand supports multiple Spark-related operators, the Bash operator, and several others. For more information, see Tracking remote tasks.
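As a sketch of one way that connection can be made: dbnd reads its configuration from DBND__-prefixed environment variables, so passing them to the spark-submit subprocess lets the remote job report into the same Databand run. The variable names below follow dbnd's configuration convention, but treat the exact set as an assumption and confirm it in Tracking remote tasks:

```python
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Propagate Databand configuration to the remote Spark job through
# environment variables (placeholder values).
process_data = SparkSubmitOperator(
    task_id="process_data",
    application="/jobs/process_data.py",
    env_vars={
        "DBND__TRACKING": "True",
        "DBND__CORE__DATABAND_URL": "https://your-databand-environment",
        "DBND__CORE__DATABAND_ACCESS_TOKEN": "<access-token>",
    },
)
```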
Setting up an Airflow integration
To integrate Databand with your Airflow environment:
- Install dbnd-airflow-auto-tracking, Databand's runtime tracking Python package, on your Airflow cluster. See Integrating with managed Airflow providers.
- Install the Airflow monitor DAG (a minimal sketch of this file follows the list). See Creating the Databand Airflow monitor DAG.
- Add and configure the Airflow integration in the Databand application. See Adding and configuring an Airflow integration.
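The monitor DAG is usually a one-file DAG that wraps the syncer. The snippet below follows the pattern from Databand's documentation; verify the package name and import path against Creating the Databand Airflow monitor DAG:

```python
# databand_airflow_monitor.py - place this file in your Airflow dags/ folder.
# Assumes the monitor package is installed, e.g. pip install dbnd-airflow-monitor.
from airflow_monitor.monitor_as_dag import get_monitor_dag

# Builds the DAG that syncs execution metadata from the Airflow database
# to Databand.
dag = get_monitor_dag()
```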
Managed Airflow providers
Databand integrates with Airflow providers such as Apache Airflow, Astronomer, Amazon Managed Workflows for Apache Airflow (MWAA), and Google Cloud Composer. When you use a managed Airflow provider, you need to know which URL to use as the Airflow URL in Databand and how to install Python libraries. For more information, see Installing the dbnd-airflow-auto-tracking package on an Airflow cluster and Special considerations for managed Airflow providers. This document explains how to add an Airflow integration by using the example of self-managed Airflow; regardless of the selected provider, the remaining steps of integrating Databand with Airflow are performed in the same way.
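Whatever the provider, the tracking package is pointed at your Databand environment through the same DBND__-prefixed environment variables used in the Spark example above, set at the cluster level (for example, as worker environment variables in MWAA or Cloud Composer). A small sanity check you can run on a worker, assuming those variable names:

```python
import os

# Confirm that Databand's configuration is visible to Airflow workers.
# The variable names follow dbnd's DBND__<SECTION>__<KEY> convention
# (assumed here; check your provider-specific instructions).
for key in (
    "DBND__TRACKING",
    "DBND__CORE__DATABAND_URL",
    "DBND__CORE__DATABAND_ACCESS_TOKEN",
):
    print(key, "=", os.environ.get(key, "<not set>"))
```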