In the past decade, DevOps engineers started noticing that they’d get urgent calls from their CEO if the application ever went down. This is how they knew they had become vital. Nowadays, many DataOps engineers are receiving that same honor—and the PagerDuty alerts to match.
This reliance on DataOps and data monitoring is going to increase. With the rise of analytics and machine learning, and with data central to the functioning of all software as a service, data powers the internet. Pipelines power that data. Yet too few engineers are building those pipelines with data monitoring in mind. When things go down, many DataOps teams are left searching blindly in the dark. Even when things aren’t down, they live in a perpetual state of job failure anxiety.
If we’re going to define it, data quality monitoring is the ongoing process of measuring your data’s fitness for use. It isn’t taking action to address the issues it uncovers; that’s beyond the scope of monitoring. Monitoring is simply knowing, in great detail, what’s happening within your data pipelines.
Why is data quality monitoring important?
Monitoring data quality is important because issues with data will propagate through the pipeline and the negative effects can cascade. If the source data is tainted, everything that follows will be too. Without the right tools, it’s very difficult to identify the source of the corruption and trace any upstream or downstream processes that have been affected.
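To make the cascade concrete, here is a minimal sketch of tracing what sits downstream of a tainted source. The lineage map and dataset names are hypothetical, invented for illustration. A real monitoring tool would collect lineage from the pipeline itself rather than from a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage map: each dataset points to the datasets built from it.
LINEAGE = {
    "raw_orders": ["clean_orders"],
    "clean_orders": ["daily_revenue", "customer_ltv"],
    "daily_revenue": ["exec_dashboard"],
    "customer_ltv": [],
    "exec_dashboard": [],
    "raw_clicks": ["sessions"],
    "sessions": [],
}


def downstream_of(dataset, lineage):
    """Return every dataset reachable from `dataset`, in breadth-first order."""
    seen = []
    queue = deque(lineage.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(lineage.get(current, []))
    return seen


# If raw_orders is corrupted, everything derived from it is suspect.
affected = downstream_of("raw_orders", LINEAGE)
print(affected)
```

In this toy graph, a problem in raw_orders reaches four downstream datasets, while the unrelated raw_clicks branch is untouched.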
Observability is the umbrella term for all the actions around understanding and improving the health of your pipeline, such as tracking, alerting and recommendations. Yet the monitoring part, and the accuracy of that monitoring, is crucial.
Without the awareness that monitoring provides, you can’t take action to influence data quality. Not in any scientific way, at least. A pipeline without an integrated monitoring tool is tough to troubleshoot because it is similar to a black box: you know what goes in and what comes out, but that’s it. Data monitoring software is what detects the errors or strange transformations and tells you where they’re occurring.
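One way to picture opening up the black box is to record what goes in and what comes out of every step, not just the pipeline as a whole. The sketch below does that with a plain Python decorator. The step functions, the `monitored` decorator and the in-memory `STEP_METRICS` list are all assumptions made for illustration, not the API of any particular monitoring product.

```python
import functools

# In-memory log of per-step metrics; a real tool would persist these.
STEP_METRICS = []


def monitored(step):
    """Record rows in, rows out and any error for one pipeline step."""
    @functools.wraps(step)
    def wrapper(rows):
        record = {
            "step": step.__name__,
            "rows_in": len(rows),
            "rows_out": None,
            "error": None,
        }
        try:
            result = step(rows)
            record["rows_out"] = len(result)
            return result
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so every attempt is logged.
            STEP_METRICS.append(record)
    return wrapper


@monitored
def drop_missing_amounts(rows):
    return [r for r in rows if r.get("amount") is not None]


@monitored
def keep_positive_amounts(rows):
    return [r for r in rows if r["amount"] > 0]


rows = [{"amount": 10}, {"amount": None}, {"amount": -5}, {"amount": 3}]
result = keep_positive_amounts(drop_missing_amounts(rows))

for record in STEP_METRICS:
    print(record)
```

With per-step counts in hand, a sudden drop in rows can be pinned to the step where it happened instead of being inferred from the final output.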
Qualities of an effective data monitoring system
For a data monitoring system to be useful, it must have the following characteristics:
Granular: It should specifically indicate where an issue is occurring and with what code.
Persistent: You should be able to monitor things in a time-series to understand the lineage of datasets or errors.
Automatic: You should have the freedom to set thresholds and utilize machine learning and anomaly detection, reducing the need for active attention.
Ubiquitous: It should cover the entire pipeline, not just one part of it.
Timely: Alerts should arrive soon enough to act on, because late alerts are of little use.
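As a rough illustration of the persistent and automatic qualities, the sketch below keeps a short history of row counts and flags the latest run if it breaks a fixed threshold or sits far from the recent average. The z-score rule, the `check_row_count` function and its default limits are assumptions chosen for simplicity. Production tools typically use more robust anomaly detection.

```python
from statistics import mean, stdev


def check_row_count(history, latest, min_rows=1, z_limit=3.0):
    """Return a list of alert messages for the latest row count.

    history: row counts from previous runs (the time series).
    min_rows: fixed threshold set by the user.
    z_limit: how many standard deviations from the mean counts as anomalous.
    """
    alerts = []
    # Fixed threshold: a rule the team sets by hand.
    if latest < min_rows:
        alerts.append(f"row count {latest} is below the fixed threshold {min_rows}")
    # Simple anomaly detection: compare against recent history.
    if len(history) >= 2:
        mu = mean(history)
        sigma = stdev(history)
        if sigma > 0 and abs(latest - mu) / sigma > z_limit:
            alerts.append(f"row count {latest} deviates from the recent mean {mu:.0f}")
    return alerts


history = [1000, 1020, 980, 1010, 990]
print(check_row_count(history, 1005))  # a typical run
print(check_row_count(history, 400))   # a sharp drop
```

The fixed threshold catches outright failures, while the comparison against history catches runs that technically succeed but look nothing like previous ones.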
What about using an existing APM tool?
If you’re planning on starting to monitor pipelines and are considering using your existing application performance management (APM) tool, think again. Pipelines are a significantly different entity and you’re not going to get the granularity of data or the metrics you need to understand all four factors of data health. You will be able to extract duration, uptime and some logging information, but you’ll be missing all the necessary and actionable information like data schema changes, granular task information, query costs and other specific metrics.
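A schema change is a good example of something an APM tool will not surface on its own. The sketch below compares the column names and types seen in two runs. The `diff_schema` function and the sample schemas are hypothetical, and a real tool would read schemas from the data store rather than from hard-coded dictionaries.

```python
def diff_schema(previous, current):
    """Compare two {column: type} mappings and report what changed."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    retyped = sorted(
        col for col in set(previous) & set(current)
        if previous[col] != current[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical schemas captured from two consecutive runs.
yesterday = {"order_id": "int", "amount": "float", "country": "str"}
today = {"order_id": "str", "amount": "float", "currency": "str"}

changes = diff_schema(yesterday, today)
print(changes)
```

None of these changes would necessarily affect duration or uptime, which is why they tend to stay invisible to application-level monitoring.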
The challenge with large-scale data pipelines
More complicated transformations, more operators touching the pipelines and little coordination between those operators beget vastly more complex DataOps systems. That’s where we are today: too many cooks and no set menu to define what’s allowed and what isn’t.
Among the greatest challenges is how many non-technical participants are now reliant upon data pipelines to do their job. Demands pour in from the business side from people—executives, analysts and data scientists—who, through no fault of their own, lack an understanding of the data pipeline architecture. They don’t know the quirks of how the data is moved and stored. Yet they’re the ones deciding what must ultimately be delivered.
This is the major reason why most data science projects fail to make it into production. The teams involved lack a common language and fail to involve the data engineer early in the process, during the requirements phase, when making fixes is still cost-effective.
It’s a similar story for machine learning pipelines: running and maintaining the model is more difficult when more people are involved, there is no common language and there are too few inter-group processes.
All this makes a case for data pipelines that are modular, more easily debugged and well-monitored; hence the need for data monitoring software.
Data monitoring best practices
To explain the order of operations you should follow to monitor your data pipeline, we’ve created what we call the data observability pyramid of needs, as pictured. It’s your first data monitoring best practice.
The pyramid begins at the bottom with the physical layer (are the pipelines executing? Did the Spark job run?) and proceeds up into the increasingly theoretical realm. More advanced teams tend to be dealing with the higher-order issues at the top.
Putting best practices into action
To put this pyramid into practice, your data observability system should be checking for these issues in this order:
1. Is data flowing?
2. Is the data arriving within a useful window of time?
3. Is the data complete, accurate and fit?
4. How has it been changing over time? (Also called data lineage)
5. Are the people who require the data actually receiving it?
To manage all of this automatically, data monitoring tools are, of course, available.
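To show why the order matters, here is a minimal sketch of the first three checks run bottom-up, stopping at the first failure. The batch structure, the `run_checks` function, the one-hour freshness window and the required column names are all assumptions for illustration. Checks four and five (change over time and delivery to consumers) need history and access data that this toy example does not model.

```python
from datetime import datetime, timedelta, timezone


def run_checks(batch, now, max_age=timedelta(hours=1), required=("order_id", "amount")):
    """Run checks in pyramid order and stop at the first failure.

    Returns the name of the failed check, or None if all checks pass.
    """
    rows = batch["rows"]
    # 1. Is data flowing?
    if not rows:
        return "no_data"
    # 2. Did it arrive within a useful window of time?
    if now - batch["arrived_at"] > max_age:
        return "stale"
    # 3. Is it complete? (Only missing values are checked here.)
    for row in rows:
        if any(row.get(col) is None for col in required):
            return "incomplete"
    return None


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
good_rows = [{"order_id": 1, "amount": 9.5}]

fresh = {"arrived_at": now - timedelta(minutes=30), "rows": good_rows}
stale = {"arrived_at": now - timedelta(hours=3), "rows": good_rows}
empty = {"arrived_at": now, "rows": []}
partial = {"arrived_at": now, "rows": [{"order_id": 2, "amount": None}]}

for name, batch in [("fresh", fresh), ("stale", stale), ("empty", empty), ("partial", partial)]:
    print(name, run_checks(batch, now))
```

Stopping at the first failure mirrors the pyramid: there is little point reporting on completeness if no data arrived at all.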
Advice on data monitoring tools
Similar to infrastructure as a service in DevOps, it’s best to buy rather than build monitoring tools. There’s a lot that goes into data monitoring, and having a data monitoring system that’s maintained and improved for you can be a big time-saver and free you to actually manage the pipeline.
Monitoring is most often one feature of a data monitoring service or platform. These data monitoring apps tend to also provide tools for awareness and remediation, such as tracking, alerts and machine learning for anomaly detection.
Which is the best data manager app?
We’re biased, but for data engineers, IBM® Databand® is certainly on the list. We built it to provide full observability for data and machine learning pipelines for all the reasons covered in this article, because when your CEO suddenly cares to know whether the pipeline is up, it pays to monitor it.
See how IBM Databand provides data pipeline monitoring to quickly detect data incidents like failed jobs and runs so you can handle pipeline growth. If you’re ready to take a deeper look, book a demo today.