In the past decade, DevOps engineers started noticing that they’d get urgent calls from their CEO if the application ever went down. This is how they knew they had become vital. Nowadays, many DataOps engineers are receiving that same honor—and the PagerDuty alerts to match.
This reliance on DataOps and data monitoring is only going to increase. Between analytics, machine learning and the data-dependence of nearly every software-as-a-service product, data powers the internet. Pipelines power that data. Yet too few engineers are building those pipelines with data monitoring in mind. When things go down, many DataOps teams are left searching blindly in the dark. Even when things aren't down, they live in a perpetual state of job-failure anxiety.
In this guide, we explore why data monitoring is vital to DataOps and why it becomes such a challenge with large-scale or complex pipelines, then share a handful of best practices.
Let's define it: data quality monitoring is the ongoing process of measuring your data's fitness for use. It isn't taking action to address issues; that's beyond the scope of monitoring. Monitoring is simply knowing, in great detail, what's happening within your data pipelines.
Monitoring data quality is important because issues with data will propagate through the pipeline and the negative effects can cascade. If the source data is tainted, everything that follows will be too. Without the right tools, it’s very difficult to identify the source of the corruption and trace any upstream or downstream processes that have been affected.
The terms “monitoring” and “observability” are often used interchangeably, but there’s a distinction: Monitoring is just one piece of observability.
Observability is the umbrella term for all the actions around understanding and improving the health of your pipeline, such as tracking, alerting and recommendations. Yet the monitoring component, and the accuracy of that monitoring, is crucial.
Without the awareness that monitoring provides, you can't take action to influence data quality, at least not in any scientific way. Troubleshooting is tough, and a pipeline without an integrated monitoring tool is a black box: you know what goes in and what comes out, but nothing about what happens in between. Data monitoring software is what detects errors and strange transformations and tells you where they're occurring.
For a data monitoring system to be useful, it must capture pipeline-specific metrics at a useful level of granularity.
If you're planning to start monitoring pipelines and are considering using your existing application performance management (APM) tool, think again. Pipelines are a significantly different entity, and you won't get the granularity of data or the metrics you need to understand all four factors of data health. You'll be able to extract duration, uptime and some logging information, but you'll be missing necessary, actionable information such as data schema changes, granular task information, query costs and other pipeline-specific metrics.
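To make "data schema changes" concrete, one thing a pipeline monitor can do that a generic APM tool can't is diff the observed schema of each incoming batch against the expected one. The field names, types and record shape below are illustrative assumptions for the sketch, not part of any particular tool:

```python
def schema_of(records):
    """Infer a simple schema: field name -> Python type name."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, type(value).__name__)
    return schema

def schema_changes(expected, observed):
    """Report added, removed and retyped fields between two schemas."""
    added = sorted(set(observed) - set(expected))
    removed = sorted(set(expected) - set(observed))
    retyped = sorted(f for f in set(expected) & set(observed)
                     if expected[f] != observed[f])
    return {"added": added, "removed": removed, "retyped": retyped}

expected = {"id": "int", "amount": "float", "currency": "str"}
observed = schema_of([{"id": 7, "amount": "12.50"}])  # amount arrived as a string

print(schema_changes(expected, observed))
# {'added': [], 'removed': ['currency'], 'retyped': ['amount']}
```

A real monitor would run a check like this on every batch and raise an alert the moment an upstream producer silently drops or retypes a field, rather than letting the corruption propagate downstream.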
More complicated transformations, more operators touching the pipelines and little coordination between those operators beget vastly more complex DataOps systems. That's where we are today: too many cooks and no prix fixe menu to define what's allowed and what isn't.
Among the greatest challenges is how many non-technical participants now rely on data pipelines to do their jobs. Demands pour in from business stakeholders: executives, analysts and data scientists who, through no fault of their own, lack an understanding of the data pipeline architecture. They don't know the quirks of how the data is moved and stored. Yet they're the ones deciding what must ultimately be delivered.
This is the major reason why most data science projects fail to make it into production. They lack a common language and fail to involve the data engineer early in the process, during the requirements phase, when making fixes is still cost-effective.
It's a similar story for machine learning pipelines: running and maintaining a model is more difficult with more people involved, no common language and too few inter-group processes.
All this makes the case for data pipelines that are modular, easily debugged and well-monitored; hence the need for data monitoring software.
To explain the order of operations you should follow to monitor your data pipeline, we’ve created what we call the data observability pyramid of needs, as pictured. It’s your first data monitoring best practice.
The pyramid begins at the bottom, with the physical layer (are the pipelines executing? Did the Spark job run?) and proceeds up into an increasingly theoretical realm. More advanced teams tend to deal with the higher-order issues toward the top.
To put this pyramid into practice, your data observability system should be checking for these issues in this order:
1. Is data flowing?
2. Is the data arriving within a useful window of time?
3. Is the data complete, accurate and fit?
4. How has it been changing over time? (Also called data lineage)
5. Are the people who require the data actually receiving it?
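As a rough illustration, the first three layers of the pyramid can be expressed as simple checks over a batch of records. The record shape, field names and one-hour freshness threshold here are assumptions made for the sketch, not prescriptions:

```python
from datetime import datetime, timedelta, timezone

def data_is_flowing(records):
    """Layer 1: did any data arrive at all?"""
    return len(records) > 0

def data_is_timely(records, max_age=timedelta(hours=1)):
    """Layer 2: did the newest record arrive within a useful window?"""
    now = datetime.now(timezone.utc)
    newest = max(r["ingested_at"] for r in records)
    return now - newest <= max_age

def data_is_complete(records, required_fields=("id", "amount")):
    """Layer 3: are the required fields populated in every record?"""
    return all(r.get(f) is not None for r in records for f in required_fields)

# A hypothetical batch of freshly ingested records.
batch = [
    {"id": 1, "amount": 9.99, "ingested_at": datetime.now(timezone.utc)},
    {"id": 2, "amount": 4.50, "ingested_at": datetime.now(timezone.utc)},
]

checks = [data_is_flowing(batch), data_is_timely(batch), data_is_complete(batch)]
print(all(checks))  # True for this healthy batch
```

Layers 4 and 5 (lineage and delivery) need metadata beyond the batch itself, which is part of why purpose-built tooling matters; the point of the sketch is only that the checks are ordered, with each layer meaningful only once the one below it passes.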
To manage all of this automatically, data monitoring tools are, of course, available.
Similar to infrastructure as a service in DevOps, it's generally better to buy monitoring tools than to build them. A lot goes into data monitoring, and a system that someone else maintains and improves can be a big time-saver, freeing you to actually manage the pipeline.
Monitoring is most often one feature of a broader data observability service or platform. These platforms tend to also provide tools for awareness and remediation, such as tracking, alerting and machine learning for anomaly detection.
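To give a hedged sense of what the anomaly-detection piece can look like under the hood, here is one simple statistical approach: flag a day's row count when it falls far outside the recent historical norm. Production platforms use more sophisticated models; the z-score threshold and the sample row counts below are illustrative assumptions.

```python
import statistics

def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it lies more than z_threshold standard
    deviations from the mean of the historical values."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold

# Daily row counts for a hypothetical pipeline output table.
row_counts = [10_120, 9_980, 10_340, 10_050, 10_210, 9_890, 10_150]

print(is_anomalous(row_counts, 10_300))  # False: within normal variation
print(is_anomalous(row_counts, 1_250))   # True: sudden drop, likely a broken job
```

The value of automating this check is that a partially failed job, one that ran "successfully" but loaded a tenth of the usual rows, gets caught by the monitor instead of by a downstream analyst.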
We’re biased, but for data engineers, IBM® Databand® is certainly on the list. We built it to provide full observability for data and machine learning pipelines for all the reasons covered in this article—because when suddenly, your CEO cares to know whether the pipeline is up, it pays to monitor it.
See how IBM Databand provides data pipeline monitoring to quickly detect data incidents like failed jobs and runs so you can handle pipeline growth. If you’re ready to take a deeper look, book a demo today.