First, let me define what I mean by each of these terms.
Observability is an organization’s ability to easily understand the health of its data pipelines and the quality of the data moving through them. Data observability is a blanket term covering many measures of health, most notably pipeline infrastructure and pipeline metadata tracking.
Data quality involves making sure the data is truly accurate. There are plenty of data quality tools out there to help build extremely granular checks for data.
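For instance, a granular check can be as simple as a query that measures the null rate of a critical column. Here’s a minimal sketch, assuming a hypothetical `orders` table reachable through SQLAlchemy (the connection string, table, column and threshold are all placeholders):

```python
# Minimal data quality check sketch -- connection string, table and column are hypothetical.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@warehouse:5432/analytics")  # assumed warehouse

def null_rate(table: str, column: str) -> float:
    """Return the fraction of rows where `column` is NULL."""
    query = text(f"SELECT AVG(CASE WHEN {column} IS NULL THEN 1.0 ELSE 0.0 END) FROM {table}")
    with engine.connect() as conn:
        result = conn.execute(query).scalar()
    return float(result) if result is not None else 0.0

# Fail loudly if more than 1% of order IDs are missing.
if null_rate("orders", "order_id") > 0.01:
    raise ValueError("Data quality check failed: too many NULL order_id values")
```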
Alerting is, you guessed it, alerting on issues. Operating in a “silence is good” world is important for productivity and peace of mind. Sending a Slack alert or email when a critical issue arises allows the team to react quickly.
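In Airflow, one common way to wire this up is a failure callback that posts to a Slack incoming webhook. A rough sketch, assuming the webhook URL lives in an environment variable:

```python
# Sketch of a Slack alert on Airflow task failure -- the webhook URL is a placeholder.
import os
import requests

def notify_slack_on_failure(context):
    """Airflow on_failure_callback: post the failing DAG/task and run date to Slack."""
    task_instance = context["task_instance"]
    message = (
        f":rotating_light: Task `{task_instance.task_id}` in DAG `{task_instance.dag_id}` "
        f"failed for run {context['ds']}."
    )
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": message}, timeout=10)

# Attach it per task, or put it in default_args to cover a whole DAG:
# PythonOperator(task_id="load", python_callable=load, on_failure_callback=notify_slack_on_failure)
```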
Observability is key to quickly understanding whether pipeline execution is within the realm of what’s expected, at every part of the pipeline. However, relying solely on observability for critical issues can result in missed information. After all, humans aren’t always “observing”; they need to sleep eventually.
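One way to make “in the realm of what’s expected” concrete, without a human staring at a dashboard, is to compare task durations against a threshold or historical baseline. A sketch, assuming direct read access to Airflow’s metadata database (Postgres syntax; the connection string and threshold are placeholders):

```python
# Sketch: flag successful task runs whose duration fell outside the expected range.
# Assumes read access to Airflow's metadata database (Postgres); connection string is a placeholder.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://airflow:airflow@metadata-db:5432/airflow")  # assumed

RECENT_DURATIONS = text("""
    SELECT dag_id, task_id, duration
    FROM task_instance
    WHERE state = 'success'
      AND start_date > NOW() - INTERVAL '1 day'
""")

with engine.connect() as conn:
    for dag_id, task_id, duration in conn.execute(RECENT_DURATIONS):
        # Placeholder threshold: in practice, compare against a per-task historical baseline.
        if duration and duration > 3600:
            print(f"{dag_id}.{task_id} ran for {duration:.0f}s -- outside the expected range")
```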
In the case of data pipelines, observability, data quality and alerting contribute most to success when used together.
While the tools mentioned above address data quality, they don’t address issues with the infrastructure that actually updates the data. Even if there’s no bug in the code, data won’t be fresh if the infrastructure it runs on isn’t healthy. If that infrastructure is Airflow, guard rails need to be in place to monitor it in parallel with the data itself.
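One lightweight guard rail is polling the webserver’s `/health` endpoint, which reports the status of the metadata database and the scheduler heartbeat, from a monitor that runs outside Airflow itself. A minimal sketch (the base URL is a placeholder):

```python
# Sketch of an external guard rail hitting Airflow's /health endpoint.
# Run this from a monitor outside Airflow, so it still fires when Airflow itself is down.
import requests

AIRFLOW_BASE_URL = "http://airflow.internal:8080"  # assumed internal webserver URL

def unhealthy_components() -> list[str]:
    """Return the Airflow components that are not reporting 'healthy'."""
    response = requests.get(f"{AIRFLOW_BASE_URL}/health", timeout=10)
    response.raise_for_status()
    return [
        component
        for component, details in response.json().items()
        if details.get("status") != "healthy"
    ]

try:
    problems = unhealthy_components()
except requests.RequestException:
    problems = ["webserver"]  # the webserver itself isn't responding

if problems:
    print(f"Airflow components unhealthy: {', '.join(problems)}")  # or route to Slack/email
```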
Measuring the health of Airflow relies on understanding a few key concepts: DAGs, which are entire processes made up of subcomponents, and tasks, which are those subcomponents themselves (a minimal DAG sketch follows the list below). The three key infrastructure pieces of Airflow are as follows:
- The worker, which actually does the heavy lifting of executing the tasks.
- The scheduler, which controls which tasks are running, which are queued and what’s up next.
- The webserver, which runs the Airflow UI. (Losing it isn’t great for observability, but the webserver can come down at any time and tasks will still run.)
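To make the DAG-versus-task distinction concrete, here’s a minimal, purely illustrative DAG with two tasks (the names and schedule are made up):

```python
# Minimal illustrative DAG: the DAG is the whole process, the tasks are its subcomponents.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="example_pipeline",        # hypothetical pipeline name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: print("extracting"))
    load = PythonOperator(task_id="load", python_callable=lambda: print("loading"))

    # The scheduler queues `load` once `extract` succeeds; a worker executes each task.
    extract >> load
```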
Let’s talk about possible points of failure when it comes to Airflow. Just to name a few:
- Airflow runs on a virtual machine that has run out of memory or other resources.
- Instead of a single virtual machine, Airflow runs on infrastructure that auto-scales (great!), like AWS Elastic Container Service (ECS), but the auto-scaling threshold is set too low or the cluster is already at maximum capacity.
- The number of tasks allowed to run concurrently across the entire Airflow instance is too low, so tasks are stuck in a queued state (see the sketch after this list).
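A rough way to catch that last failure mode is to watch how many task instances are sitting in the `queued` state. A sketch, again assuming read access to the metadata database (the connection string and threshold are placeholders):

```python
# Sketch: alert when too many task instances are stuck in the `queued` state.
# Assumes read access to Airflow's metadata database; connection string and threshold are placeholders.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://airflow:airflow@metadata-db:5432/airflow")  # assumed

with engine.connect() as conn:
    queued = conn.execute(
        text("SELECT COUNT(*) FROM task_instance WHERE state = 'queued'")
    ).scalar()

if queued and queued > 50:  # placeholder threshold; tune to your instance's normal load
    # In practice, route this through the same Slack/email alerting as everything else.
    print(f"{queued} tasks queued -- check concurrency settings and worker capacity")
```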
All of these points of failure have one thing in common: they impact data freshness. What they don’t have in common are the responses that would remediate the issue. Pinpointing the single point of failure in Airflow, just like troubleshooting data freshness, expedites addressing the issue.
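Because every one of these failure modes ultimately shows up as stale data, a simple freshness check on the destination table makes a useful backstop. A sketch with a hypothetical `orders` table and timezone-aware `updated_at` column:

```python
# Freshness backstop sketch -- table, column, connection string and threshold are hypothetical.
from datetime import datetime, timedelta, timezone

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@warehouse:5432/analytics")  # assumed warehouse

with engine.connect() as conn:
    latest = conn.execute(text("SELECT MAX(updated_at) FROM orders")).scalar()

# Assumes updated_at is stored with a timezone; adjust if your warehouse uses naive timestamps.
if latest is None or datetime.now(timezone.utc) - latest > timedelta(hours=2):
    # Stale data: the next question is *which* failure mode caused it.
    print("orders hasn't been updated in over 2 hours -- check the pipeline and its infrastructure")
```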
Now that we’ve explored Airflow’s moving pieces and the tools you’ll need to ensure their health, let’s dive into what to actually measure in more detail.