Data lineage is a recurring topic for data platform teams. In this post, we walk through how IBM® Databand® provides automated data lineage so you can diagnose pipeline failures and analyze downstream impacts.
Watch the video to see Databand in action or continue reading for detailed information.
Analyze alerts
Using automated data lineage typically begins with an alert. You can jump straight into a lineage graph, but it helps to know first why the graph is relevant.
For example, on the Databand alert screen, you can see all data incidents and their alerts in a consolidated view.
This example shows a critical alert fired in our “daily_sales_ingestion” pipeline, a vital business pipeline that processes our daily sales from SAP, applies transformations for different regions, and then sends the results to a BI layer.
Needless to say, this pipeline is critical for our business since it processes sales from around the world and eventually shows the results to the business.
To diagnose the alert, select ‘View Details’ to open the alert overview screen.
Understand impacted datasets
Before seeing the lineage graph, you can see the impact analysis across your affected datasets, pipelines and operations.
View data lineage
Once you’ve seen what has been impacted, you can visualize these impacts by selecting the data lineage tab. The graph shows the dependency relationships between the pipeline that failed and every other pipeline, task, or dataset it impacts.
For example, the graph shows a task writing to a particular dataset and a subsequent task reading that same dataset. Red text in each pipeline marks anything that was impacted by the initial failed task.
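To make the write-then-read relationship concrete, here is a minimal Python sketch of how lineage edges could be derived from read and write records. It is an illustration only and does not use or represent Databand's API: the record layout, the function name `build_lineage_edges`, and the reader task names `load_na_sales` and `publish_sales` are assumptions made up for this example. Only the pipeline, failed-task, and dataset names come from the walkthrough.

```python
from collections import defaultdict

NA_EXTRACT = "S3 – North America Daily SAP Sales Extract"

# Hypothetical read/write records: (pipeline, task, operation, dataset).
# "load_na_sales" and "publish_sales" are invented task names.
OPERATIONS = [
    ("daily_sales_ingestion", "extract_regional_sales_to_S3", "write", NA_EXTRACT),
    ("na_sentiment_impact_analysis", "load_na_sales", "read", NA_EXTRACT),
    ("serve_sales_results_to_bi", "publish_sales", "read", NA_EXTRACT),
]


def build_lineage_edges(operations):
    """Return (writer, dataset, reader) edges wherever one task
    writes a dataset that another task reads."""
    writers = defaultdict(set)
    readers = defaultdict(set)
    for pipeline, task, op, dataset in operations:
        key = (pipeline, task)
        if op == "write":
            writers[dataset].add(key)
        elif op == "read":
            readers[dataset].add(key)

    edges = set()
    for dataset, writing_tasks in writers.items():
        for writer in writing_tasks:
            for reader in readers.get(dataset, ()):
                if reader != writer:  # ignore a task reading its own output
                    edges.add((writer, dataset, reader))
    return edges


if __name__ == "__main__":
    for writer, dataset, reader in sorted(build_lineage_edges(OPERATIONS)):
        print(f"{writer[0]}.{writer[1]} -> [{dataset}] -> {reader[0]}.{reader[1]}")
```

Each edge links a writing task to a reading task through the dataset they share, which is the kind of cross-pipeline relationship the lineage graph draws.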
Let’s zoom in on the specific pipeline that failed. Here you can see that the task named “extract_regional_sales_to_S3” caused the pipeline to fail.
By selecting the failed task, you can see which specific downstream datasets or tasks are impacted: each one is highlighted with a red box.
Each time you select a different task or dataset, the graph changes which boxes are highlighted.
For example, if you select the dataset named “S3 – North America Daily SAP Sales Extract”, much of the red text remains, but the red boxes change.
This indicates that the “S3 – North America Daily SAP Sales Extract” dataset impacts only the downstream items now highlighted with red boxes.
You’ll notice that no downstream pipeline in the EU or Asia depends on this dataset, but two pipelines do: the North America pipeline labeled “na_sentiment_impact_analysis” and the “serve_sales_results_to_bi” pipeline that serves our BI layer.
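The way the highlighted boxes change with each selection can be thought of as a downstream traversal from the selected node. The sketch below is an illustrative assumption about how such an impact set could be computed, not Databand's implementation. The EU and Asia dataset and pipeline names are hypothetical, and the graph shape is simplified from the walkthrough.

```python
from collections import deque

FAILED_TASK = "task: extract_regional_sales_to_S3"
NA_EXTRACT = "dataset: S3 – North America Daily SAP Sales Extract"
EU_EXTRACT = "dataset: eu_sales_extract"      # hypothetical name
ASIA_EXTRACT = "dataset: asia_sales_extract"  # hypothetical name

# Simplified lineage graph: node -> nodes that directly depend on it.
LINEAGE = {
    FAILED_TASK: [NA_EXTRACT, EU_EXTRACT, ASIA_EXTRACT],
    NA_EXTRACT: [
        "pipeline: na_sentiment_impact_analysis",
        "pipeline: serve_sales_results_to_bi",
    ],
    EU_EXTRACT: ["pipeline: eu_sales_reporting"],      # hypothetical name
    ASIA_EXTRACT: ["pipeline: asia_sales_reporting"],  # hypothetical name
}


def downstream(graph, start):
    """Breadth-first search returning every node reachable from `start`,
    excluding `start` itself."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child != start and child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


if __name__ == "__main__":
    print("Selected failed task:", sorted(downstream(LINEAGE, FAILED_TASK)))
    print("Selected NA dataset: ", sorted(downstream(LINEAGE, NA_EXTRACT)))
```

Selecting the failed task reaches every regional extract and the pipelines behind them, while selecting the North America dataset reaches only the two pipelines that read it. That mirrors the red text staying put while the red boxes narrow.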
Quickly debug data incidents
To make debugging easier, you can jump directly to a task from the data lineage graph and see the error that caused the pipeline to fail.
This allows you to quickly debug errors and resolve them before any downstream impacts occur.
Achieving automated data lineage
See how Databand helps break communication silos and get the whole data story with end-to-end data lineage. If you’re ready to take a deeper look, book a demo today.