For a data monitoring system to be useful, it must have the following characteristics:
- Granular: It should pinpoint exactly where an issue is occurring and which code is responsible.
- Persistent: You should be able to monitor metrics as a time series so you can trace the lineage of datasets and errors.
- Automatic: You should be able to set thresholds and use machine learning and anomaly detection, reducing the need for active attention (see the sketch after this list).
- Ubiquitous: It should cover the entire pipeline, not just a single stage.
- Timely: Alerts should arrive close to when problems occur; late alerts are of little use.
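As a rough illustration of the "Automatic" and "Timely" characteristics, here is a minimal sketch of a threshold-plus-anomaly check on a single pipeline metric. The row-count series, the three-sigma rule and the alerting print statement are assumptions for illustration, not the behavior of any particular tool.

```python
from statistics import mean, stdev

def is_anomalous(history, latest, sigmas=3.0):
    """Flag `latest` if it falls more than `sigmas` standard deviations
    from the mean of the recent history (a simple z-score rule)."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return latest != mu
    return abs(latest - mu) / sd > sigmas

# Hypothetical time series of daily row counts for one pipeline task.
row_counts = [10_120, 9_980, 10_340, 10_055, 10_210]
todays_count = 4_200

if is_anomalous(row_counts, todays_count):
    # In a real system this would page on-call or post to a channel
    # as soon as the run completes, so the alert stays timely.
    print(f"ALERT: row count {todays_count} deviates from baseline")
```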
What about using an existing APM tool?
If you’re planning to start monitoring pipelines and are considering using your existing application performance management (APM) tool, think again. Pipelines are a significantly different kind of workload, and you’re not going to get the granularity of data or the metrics you need to understand all four factors of data health. You will be able to extract duration, uptime and some logging information, but you’ll be missing the necessary, actionable information such as data schema changes, granular task information, query costs and other pipeline-specific metrics.
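To make the contrast concrete, below is a minimal sketch of one check an APM tool won't give you: detecting a schema change by comparing a dataset's current columns and types against a stored baseline. The baseline file path and the sample orders dataset are assumptions for illustration.

```python
import json
from pathlib import Path

import pandas as pd

BASELINE_PATH = Path("schemas/orders_baseline.json")  # hypothetical location

def current_schema(df: pd.DataFrame) -> dict:
    """Capture column names and dtypes as a comparable snapshot."""
    return {col: str(dtype) for col, dtype in df.dtypes.items()}

def detect_schema_drift(df: pd.DataFrame) -> list[str]:
    """Return human-readable differences between the dataset and the baseline."""
    if not BASELINE_PATH.exists():
        BASELINE_PATH.parent.mkdir(parents=True, exist_ok=True)
        BASELINE_PATH.write_text(json.dumps(current_schema(df), indent=2))
        return []  # first run: record the baseline, nothing to compare yet

    baseline = json.loads(BASELINE_PATH.read_text())
    now = current_schema(df)
    changes = []
    for col in baseline.keys() - now.keys():
        changes.append(f"column dropped: {col}")
    for col in now.keys() - baseline.keys():
        changes.append(f"column added: {col}")
    for col in baseline.keys() & now.keys():
        if baseline[col] != now[col]:
            changes.append(f"type changed: {col} {baseline[col]} -> {now[col]}")
    return changes

# Example: orders dataset loaded by the pipeline (hypothetical data).
orders = pd.DataFrame({"order_id": [1, 2], "amount": [9.99, 24.50]})
for change in detect_schema_drift(orders):
    print("SCHEMA ALERT:", change)
```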
The challenge with large-scale data pipelines
More complicated transformations, more operators touching the pipelines, and little coordination between those operators beget vastly more complex DataOps systems. That's where we are today: too many cooks and no prix fixe menu to define what's allowed and what isn't.
Among the greatest challenges is how many non-technical participants now rely on data pipelines to do their jobs. Demands pour in from the business side, from executives, analysts and data scientists who, through no fault of their own, lack an understanding of the data pipeline architecture. They don't know the quirks of how the data is moved and stored, yet they're the ones deciding what must ultimately be delivered.
This is a major reason why most data science projects fail to make it into production: the teams involved lack a common language and fail to bring the data engineer in early, during the requirements phase, when making fixes is still cost-effective.
It’s a similar story for machine learning pipelines: running and maintaining a model becomes more difficult as more people get involved without a common language or adequate inter-group processes.
All this makes a case for data pipelines that are modular, more easily debugged and well-monitored; hence the need for data monitoring software.
Data monitoring best practices
To explain the order of operations you should follow to monitor your data pipeline, we’ve created what we call the data observability pyramid of needs, as pictured. It’s your first data monitoring best practice.
The pyramid begins at the bottom, with the physical layer (are the pipelines executing? did the Spark job run?) and proceeds up into the increasingly abstract realm. More advanced teams tend to deal with the higher-order issues toward the top.
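As a concrete example of a bottom-of-the-pyramid, physical-layer check, here is a minimal sketch that verifies a job actually produced fresh output. The output path and the four-hour freshness window are assumptions; the same idea applies to querying a scheduler's run history or a warehouse table's last load time.

```python
import time
from pathlib import Path

OUTPUT_PATH = Path("/data/warehouse/daily_sales/_SUCCESS")  # hypothetical marker file
MAX_AGE_SECONDS = 4 * 60 * 60  # expect a successful run within the last 4 hours

def job_ran_recently(path: Path, max_age: int) -> bool:
    """Physical-layer check: did the job run and land its output on time?"""
    if not path.exists():
        return False  # no output at all: the job likely never ran
    age = time.time() - path.stat().st_mtime
    return age <= max_age

if not job_ran_recently(OUTPUT_PATH, MAX_AGE_SECONDS):
    # At this layer the question is simply "did the job run?"
    print("ALERT: expected output is missing or stale")
```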