By the time you realize you have a problem, it’s typically too late. A small problem has grown into a big one, large enough that you and others actually notice. And whatever you’ve noticed is, unfortunately, only a downstream effect, and probably one of many. To catch the root issue and keep it from recurring, you have to trace the error all the way back through your data pipeline architecture, and that’s where things get tricky.
Without a firm sense of the leading indicators of data pipeline errors, you’ll catch those errors late. The longer errors live in your pipeline, the more problems they cause. The more problems they cause, the busier you are addressing all the downstream issues, which keeps you from addressing the root causes. You will forever be fighting fires.
Take this simple example. If it takes an hour to process one day’s worth of data, and you notice an error three days late, you’re now backed up not just one hour, but three hours. If you’re lucky, the issue is reversible, and only internal employees are complaining. If you’re not so lucky, as is the case if your pipeline feeds an e-commerce system that’s now generated a stream of irreversible transactions, you also have customers to deal with. Problems beget more problems.
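To make the arithmetic concrete, here’s a minimal sketch of that backlog calculation. The constant and function name are illustrative, not from any particular tool; they just mirror the numbers above.

```python
# Illustrative sketch: detection lag turns into a processing backlog.
PROCESSING_HOURS_PER_DAY = 1  # one hour to process one day's worth of data

def reprocessing_backlog_hours(days_until_detected: int) -> int:
    """Hours of pipeline work queued up by the time the error is noticed."""
    return days_until_detected * PROCESSING_HOURS_PER_DAY

print(reprocessing_backlog_hours(1))  # caught the same day: 1 hour to catch up
print(reprocessing_backlog_hours(3))  # caught three days late: 3 hours
```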
As we will argue, the cost of catching errors late is substantial. But the biggest casualty by far will be everyone’s trust in the data itself, and, by extension, in you.
Scenarios where leading indicators can save your pipeline & your reputation
Errors in your data pipeline architecture have real costs. If your “consumer” is an internal sales team whose dashboards are all down, that’s one kind of cost. But if your data customer is external, the cost is greater and can cut into your brand reputation. If you’re the upstream data provider and customers suddenly can’t run their machine learning models, you’ve exported your data error to other companies like a bug, and they now face that same compounding cost of error.
We might call this the catchup multiplier effect: The longer it takes to fix a data pipeline architecture issue, the more the cleanup cost compounds.
IBM discovered this problem in the early 2000s. In a study published by the National Institute of Standards and Technology (NIST), engineers demonstrated that an issue that would take one hour to fix if caught at the design phase would take 15 hours to fix if not caught until testing, and 100 hours if not caught until after launch.
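As a rough sketch, those findings can be read as a cost-multiplier lookup. The stage labels and the code below are illustrative; only the 1x/15x/100x ratios come from the study cited above.

```python
# Illustrative sketch of the NIST/IBM finding: the later a defect is
# caught, the more the fix costs. Stage labels are our shorthand.
FIX_COST_MULTIPLIER = {
    "design": 1,          # caught at the design phase: baseline cost
    "testing": 15,        # not caught until testing: 15x the baseline
    "post-launch": 100,   # not caught until after launch: 100x the baseline
}

def hours_to_fix(baseline_hours: float, stage: str) -> float:
    """Estimated fix time if the issue surfaces at a given stage."""
    return baseline_hours * FIX_COST_MULTIPLIER[stage]

for stage in FIX_COST_MULTIPLIER:
    print(f"{stage}: {hours_to_fix(1, stage):.0f} hour(s)")
```

The exact hours matter less than the shape of the curve: each stage an error survives multiplies the cleanup bill, which is the catchup multiplier effect in miniature.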