Data observability has become one of the hottest topics of the year, and for good reason.
Data observability provides an end-to-end, real-time view into exactly what’s happening with data pipelines across an organization’s data fabric. That means instead of the CEO getting a frantic call at 3 AM that something is broken, teams can fix issues proactively before they become bigger problems.
In a recent webinar, we dug into why data observability is so important, what’s needed for data observability and how IBM® Databand® can help. Here’s what you need to know.
Why data observability is so important
Plain and simple, the survey results say it all: Most organizations believe their data is unreliable.
Unfortunately, it’s traditionally been difficult to identify bad data until it’s too late. When an application goes down, thousands of users are affected immediately; a business, by contrast, can operate on bad data unknowingly for quite some time. Even internally, your sales team would know right away if a Salesforce dashboard wasn’t loading, but how long would it take them to realize the dashboard was showing incorrect data? Decisions made on bad data can have serious consequences (Unity’s example is just one of many that make this all too real).
Data observability solves this challenge. It monitors data pipelines to ensure complete, accurate and timely delivery of data so that data teams can meet data SLAs and the entire business can trust the data they see.
What’s needed for data observability
Importantly, for data observability to fulfill its promise, organizations need a solution that can provide real-time alerting and work with a number of different tools across the entire data fabric.
And to be clear, data monitoring is not the same as data observability. Data monitoring is static and reactive, offering only a limited view of an isolated failure. Data observability, on the other hand, is holistic and proactive, uncovering the root cause of an issue and its impact on downstream systems.
Achieving this holistic, proactive view comes down to five key steps of data observability:
Pipeline execution: Is data flowing? This is table stakes, and you need to be able to confirm the answer for all the hundreds or thousands of pipelines you’re observing.
Pipeline latency: Is data arriving on time? If you expect a run to take two minutes but it actually took five hours (or even five minutes), you’re in breach of SLAs.
Data structure: Is the data shape valid and complete? Ending up with six columns when you only expected five is a serious issue.
Data content: Are there significant changes in the data profile? If the data includes an old record or an incorrect value, then it’s not accurate and can lead to faulty decision-making.
Data validation: Does the data conform to how it’s being used? If not, then there’s a mismatch in terms of what teams need and what they’re getting.
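The five steps above can be sketched as simple checks over pipeline run metadata. This is a minimal, hand-rolled illustration of the idea, not Databand’s implementation — all class, field, and function names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PipelineRun:
    """Hypothetical run metadata an observability tool might collect."""
    completed: bool            # did the pipeline execute at all?
    duration_seconds: float    # how long the run took
    columns: list              # observed schema (column names)
    null_fraction: float       # share of null values in the output
    row_count: int             # rows delivered by the run

def check_run(run, expected_columns, max_duration=120,
              max_null_fraction=0.05, min_rows=1):
    """Return a list of issues, one per failed observability check."""
    issues = []
    if not run.completed:                        # 1. pipeline execution
        issues.append("pipeline did not complete")
    if run.duration_seconds > max_duration:      # 2. pipeline latency
        issues.append(f"run took {run.duration_seconds}s, SLA is {max_duration}s")
    if run.columns != expected_columns:          # 3. data structure
        issues.append(f"schema mismatch: expected {expected_columns}, got {run.columns}")
    if run.null_fraction > max_null_fraction:    # 4. data content
        issues.append(f"null fraction {run.null_fraction:.0%} exceeds threshold")
    if run.row_count < min_rows:                 # 5. data validation
        issues.append("no rows delivered")
    return issues

# A slow run that also delivered an unexpected sixth column.
run = PipelineRun(completed=True, duration_seconds=128.0,
                  columns=["bronx", "brooklyn", "manhattan",
                           "queens", "staten_island", "unknown"],
                  null_fraction=0.01, row_count=10_000)
print(check_run(run, expected_columns=["bronx", "brooklyn", "manhattan",
                                       "queens", "staten_island"]))
```

In practice, each failed check would feed an alerting system rather than a printed list, but the decision logic is the same.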
The first three of these steps should trigger real-time alerts to data platform and engineering teams when anything goes wrong, so they can monitor data SLAs. The last two should deliver custom metrics that data analytics and data science teams can manage to ensure reliability.
How Databand can help
Databand empowers data platform teams to deliver reliable and trustworthy data. In other words, it allows you to catch bad data before it impacts your business.
Specifically, Databand collects metadata from all key solutions in the modern data stack, builds a historical baseline from typical pipeline behavior, alerts on anomalies and rule deviations, and helps resolve incidents through smart triage and communication workflows. In doing so, the Databand platform supports process quality (pipeline states, pipeline job performance, pipeline latency), data quality (data structure, data content, data freshness) and impact analysis and lineage (relationships between data and pipelines, maps of causes and impacts). As a result, Databand empowers teams to:
Detect earlier: Pinpoint unknown data incidents and reduce mean time to detection (MTTD) from days to minutes.
Resolve faster: Reduce mean time to resolution (MTTR) from weeks to hours with incident alerts and routing.
Deliver trusted data: Enhance reliability and data delivery SLAs by providing visibility into pipeline quality issues.
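The “historical baseline” idea can be illustrated with a simple statistical check: flag a run whose duration deviates far from the mean of recent runs. This is a hand-rolled sketch of the concept, not Databand’s actual anomaly model.

```python
import statistics

def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it deviates more than `z_threshold` standard
    deviations from the mean of past observations."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold

# Durations (in seconds) of recent runs form the baseline.
baseline = [118, 121, 119, 122, 120, 117, 123]
print(is_anomalous(baseline, 310))  # a five-minute run stands out
print(is_anomalous(baseline, 120))  # a typical run does not
```

A real system would maintain baselines per pipeline and per metric (duration, row counts, null rates) and route each anomaly to the right team, but the core comparison is this simple.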
Now, the power of Databand will be available via a new integration with DataStage CP4D, the IBM data fabric’s cloud-native ingestion service. This integration will support:
Real-time detection and alerting on incidents in DataStage flows
360-degree impact analysis using Databand’s runtime incident lineage to view how DataStage incidents impact downstream data
Historical trends of different DataStage processes to detect anomalies and incidents, removing bad data surprises
Databand in action
To bring it all together, let’s take a look at how Databand makes proactive data observability a reality with two real-life use cases.
Diagnosing a data quality issue with Databand and Apache Airflow
In this case, let’s say we pull in data from New York City, with a column for each borough. But when the data comes through, we see six columns. This is an issue since we know there are actually five boroughs. While this instance is obvious, in the day-to-day handling of large datasets, issues like these typically won’t be so glaring. Databand flags this as a data quality issue, and we can investigate by clicking on the alert.
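A data quality check like this one boils down to comparing the columns that arrived against the columns we expect. The sketch below is illustrative only — the column names and the `Unnamed: 5` placeholder are assumptions, not taken from a real NYC dataset.

```python
# The five columns we expect: one per NYC borough.
EXPECTED_BOROUGHS = {"Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"}

def validate_borough_columns(columns):
    """Compare incoming columns against the expected five boroughs."""
    observed = set(columns)
    return {
        "unexpected": sorted(observed - EXPECTED_BOROUGHS),
        "missing": sorted(EXPECTED_BOROUGHS - observed),
    }

# A run arrives with a surprise sixth column.
incoming = ["Bronx", "Brooklyn", "Manhattan", "Queens",
            "Staten Island", "Unnamed: 5"]
print(validate_borough_columns(incoming))
# {'unexpected': ['Unnamed: 5'], 'missing': []}
```

With five known columns the mistake is easy to spot by eye; the point of automating the check is that it works just as well on a table with five hundred columns.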
This will take us to a new screen that shows a pipeline overview, including a DAG view and the associated alerts. From a graphical perspective, this view is the same in Databand as what we see in Airflow.
Databand also provides metrics in histogram form so we can dive deeper and figure out what’s happening. In this case, we see that even though we were only expecting five columns (one for each of the five boroughs), an unspecified sixth column was added.
Thanks to Databand, we were alerted right away and can fix the issue before it continues.
Identifying a run duration issue with Databand and DataStage
Next, we have a DataStage process on which we had previously set an alert to fire if a run exceeds 120 seconds. This time, the run took 128 seconds, so Databand issued an alert.
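The alert logic in this scenario is a simple threshold check on run duration. A minimal sketch, with hypothetical function and field names (not Databand’s API):

```python
def check_duration_sla(run_seconds, sla_seconds=120):
    """Return an alert dict when a run exceeds its duration SLA,
    or None when the run is within SLA."""
    if run_seconds > sla_seconds:
        return {
            "severity": "warning",
            "message": (f"run took {run_seconds}s, exceeding the "
                        f"{sla_seconds}s SLA by {run_seconds - sla_seconds}s"),
        }
    return None

print(check_duration_sla(128))  # the 128-second run trips the alert
print(check_duration_sla(110))  # within SLA, no alert
```

The value of the alert is less the threshold itself than what comes next: it prompts us to look at the run’s history and resource usage.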
Looking into the DAG view, everything is green, which means all the data was accurate. But we do see that every time we run the process, it collects more data (and we can see the same processes and tasks within both Databand and DataStage). This increase in data collection could indicate why the run stopped delivering on time.
From there, we can see within Databand that there are not enough resources to keep up with the amount of data being ingested constantly. In response, we can create a new template in DataStage and apply it to the same process to increase the computing power.
When we run the process again, the duration comes in under 120 seconds and the alert is resolved. Because Databand pinpointed the need for more compute power in DataStage, we can now have a high level of confidence in delivering this data as expected.
Getting started with proactive data observability
Proactive data observability has never been more important, and its value cannot be overstated.