How to Import Custom Pieces of Lineage
Goal
It is often necessary to import pieces of custom lineage into IBM Manta Data Lineage to provide lineage information on data movement implemented by technologies for which Manta Data Lineage does not yet provide scanners.
This article explains how to connect two supported technologies (Teradata and PostgreSQL) using an unsupported technology (Pentaho) to represent the flow Teradata → Pentaho → PostgreSQL.
Instructions
Create custom objects representing data assets (files, database objects, systems, etc.) or transformation assets (scripts, ETL flows, replications, etc.). The high-level process is:
- Create "connections" for import scenarios.
- Create custom objects.
- Connect them to the rest of the assets and to each other with custom links.
- Load them into the Manta Data Lineage repository.
- Review the results in the Manta Data Lineage UI and log files.
- Fix any issues by repeating the steps above.
Creating Connections for Imports
You can prepare several pieces of custom lineage and process them separately with different parameters. To do so, create a separate connection in the Admin UI for each such import (that is, for each group of data).
- Create a new connection for the Open Manta Extensions (import scenario). The CONNECTION_ID (manta.import.id property) should be specified there, and it will be needed later.
- Create a new connection for the Open Manta Direct Links (links import scenario). The CONNECTION_ID (manta.import.links.id property) should be specified there, and it will be needed later.
Creating Custom Objects
Using custom metadata, you can create custom objects in the Manta repository. These can be used to represent transformation objects or create brand new assets as needed.
Here is an example of how to create a structure representing a transformation object.
- Go to the folder mantaflow/cli/input/import/<CONNECTION_ID>. (Create it if it does not exist yet. Use the <CONNECTION_ID> that you specified when creating the Open Manta Extensions connection for object import.)
- Create the file layer.csv with contents as follows. (The meaning of each column is described in Open Manta Extensions: Files and Formats.)
    "1","Physical","Physical"
- Create the file resource.csv with contents as follows.
    "2","Pentaho","Pentaho","Pentaho","1"
- Create the file node.csv with contents as follows.
    /Pentaho/MyProject,,MyProject,Pentaho,2
    /Pentaho/MyProject/MyFolder,/Pentaho/MyProject,MyFolder,Pentaho Folder,2
    /Pentaho/MyProject/MyFolder/MyJob,/Pentaho/MyProject/MyFolder,MyJob,Pentaho Job,2
    /Pentaho/MyProject/MyFolder/MyJob/col1,/Pentaho/MyProject/MyFolder/MyJob,col1,Pentaho Expression,2
    /Pentaho/MyProject/MyFolder/MyJob/col2,/Pentaho/MyProject/MyFolder/MyJob,col2,Pentaho Expression,2
    /Pentaho/MyProject/MyFolder/MyJob/col3,/Pentaho/MyProject/MyFolder/MyJob,col3,Pentaho Expression,2
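When a job has many columns, writing node.csv by hand becomes error-prone. The following is a minimal Python sketch, not part of Manta Data Lineage, that generates the three files shown above. The function names build_nodes and write_import_files are hypothetical, and the node.csv column order (path, parent path, name, type, resource ID) is inferred from the example rows rather than taken from the format reference.

```python
import csv
from pathlib import Path

# Rows copied from the layer.csv and resource.csv examples in this article.
LAYERS = [("1", "Physical", "Physical")]
RESOURCES = [("2", "Pentaho", "Pentaho", "Pentaho", "1")]


def build_nodes(project, folder, job, columns, resource_id="2"):
    """Return node.csv rows for project > folder > job > columns.

    Assumed column order (inferred from the example):
    path, parent path, name, type, resource ID.
    """
    root = f"/Pentaho/{project}"
    folder_path = f"{root}/{folder}"
    job_path = f"{folder_path}/{job}"
    rows = [
        (root, "", project, "Pentaho", resource_id),
        (folder_path, root, folder, "Pentaho Folder", resource_id),
        (job_path, folder_path, job, "Pentaho Job", resource_id),
    ]
    for col in columns:
        rows.append((f"{job_path}/{col}", job_path, col,
                     "Pentaho Expression", resource_id))
    return rows


def write_import_files(target_dir, nodes):
    """Write layer.csv, resource.csv and node.csv into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # layer.csv and resource.csv are fully quoted in the example;
    # node.csv is written unquoted, as in the example.
    files = (
        ("layer.csv", LAYERS, csv.QUOTE_ALL),
        ("resource.csv", RESOURCES, csv.QUOTE_ALL),
        ("node.csv", nodes, csv.QUOTE_MINIMAL),
    )
    for name, rows, quoting in files:
        with open(target / name, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f, quoting=quoting, lineterminator="\n")
            writer.writerows(rows)
    return target
```

For example, calling write_import_files("mantaflow/cli/input/import/my_pentaho", build_nodes("MyProject", "MyFolder", "MyJob", ["col1", "col2", "col3"])) should reproduce the files from this section, where my_pentaho is a placeholder for your own CONNECTION_ID.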
Creating Custom Links
To connect custom objects to each other and to existing resources in the Manta Data Lineage repository, use the direct links module. Here's an example that shows how to connect the custom object created above to Teradata and PostgreSQL in order to create the flow Teradata → Pentaho → PostgreSQL. Note that if you want to try using this example, it expects the same column names for the source, custom object, and target. Replace the Teradata and PostgreSQL path locations with valid objects that exist in your Manta repository.
- Go to mantaflow/cli/input/links/<CONNECTION_ID>. (Create it if it does not exist yet. Use the <CONNECTION_ID> that you specified when creating the Open Manta Direct Links connection for link import.)
- Create the file links.csv with "<source>","<target>" link pairs, as in the following example.
    "/Teradata/TDPROD1.my.com/PARTY_PKG/contact","/Pentaho/MyProject/MyFolder/MyJob"
    "/Pentaho/MyProject/MyFolder/MyJob/col1","/PostgreSQL/greenplum.my.com/gp_prod/masterdata/contact"
The object path comes from the Manta Data Lineage UI.
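If the link pairs come from a spreadsheet or another tool, links.csv can also be produced by a small script. The sketch below is only an illustration: write_links is a hypothetical helper, and the check that every path starts with "/" is an assumption based on the example paths, not a documented rule.

```python
import csv
from pathlib import Path

# The two link pairs from the example in this article. Replace the Teradata
# and PostgreSQL paths with objects that exist in your repository.
EXAMPLE_LINKS = [
    ("/Teradata/TDPROD1.my.com/PARTY_PKG/contact",
     "/Pentaho/MyProject/MyFolder/MyJob"),
    ("/Pentaho/MyProject/MyFolder/MyJob/col1",
     "/PostgreSQL/greenplum.my.com/gp_prod/masterdata/contact"),
]


def write_links(target_dir, pairs):
    """Write (source, target) pairs to links.csv, one quoted pair per line."""
    pairs = list(pairs)
    # Sanity check (assumption): object paths in the example all start with "/".
    for source, dest in pairs:
        if not source.startswith("/") or not dest.startswith("/"):
            raise ValueError(
                f"expected absolute object paths: {source!r} -> {dest!r}")
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    path = target / "links.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL, lineterminator="\n")
        writer.writerows(pairs)
    return path
```

Calling write_links("mantaflow/cli/input/links/my_links", EXAMPLE_LINKS) should produce the links.csv shown above, where my_links is a placeholder for the CONNECTION_ID of your Open Manta Direct Links connection.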
Loading Custom Metadata and Lineage into Manta Data Lineage
The files created in the previous steps are automatically ingested during the lineage analysis run triggered in Process Manager in Admin UI.
However, for a quick test, it is often more convenient to run only the import, either by running the following scenarios manually or by creating a workflow with them in Process Manager.
- New Minor Revision Scenario: opens a new minor revision and adds the custom lineage to the last existing revision.
- Import Dataflow Scenario: ingests layer.csv, resource.csv, and node.csv.
- Import Links Dataflow Scenario: ingests links.csv.
- Commit Revision Scenario: persists data into the repository. If you see any errors coming from the previous steps, you can instead run Rollback Revision Scenario to revert and start again.
- Review the custom lineage import logs.
If you are not happy with the result and need to repeat the above steps, run Delete Revision Scenario to remove the newly added minor revision and start again. Note that both the deletion and the rollback may take some time to complete, depending on the repository size.
Review Logs
Any errors reported during lineage import will help you identify issues with the input file format. The logs are located under the LogViewer tab in Admin UI or directly on the filesystem:
- mantaflow/cli/logs/importDataflowMasterScenario.log: Review the errors for node_attribute.csv parsing errors. The most common issues are related to invalid file structure or typos in the file.
- mantaflow/cli/logs/importLinksDataflowMasterScenario.log: Review the errors for custom link creation. The most common issues are invalid syntax or incorrect (non-existent) object paths for a source or target of the linked object.
- mantaflow/server/manta-dataflow-server-dir/logs/manta-dataflow.log: Review the errors for object references. The most common issues are related to the import of attributes to objects that do not exist in the Manta Data Lineage repository (e.g., typos in the object path, objects that no longer exist).
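Some of the issues listed above, such as invalid file structure or typos in custom object paths, can be caught before the import is run. The following Python sketch is a hypothetical pre-check and not a Manta Data Lineage feature. It assumes node.csv has five columns with the object path first and the parent path second (as in the example in this article), and it can only verify link endpoints under the custom /Pentaho/ prefix, because Teradata and PostgreSQL objects live in the repository rather than in these files.

```python
import csv


def load_nodes(node_csv_path, expected_columns=5):
    """Return (known_paths, problems) for a node.csv file."""
    with open(node_csv_path, newline="", encoding="utf-8") as f:
        # Keep the record number for messages; skip completely empty lines.
        rows = [(n, r) for n, r in enumerate(csv.reader(f), start=1) if r]
    known = {r[0] for _, r in rows if len(r) == expected_columns}
    problems = []
    for line_no, row in rows:
        if len(row) != expected_columns:
            problems.append(
                f"line {line_no}: expected {expected_columns} columns, "
                f"found {len(row)}")
        elif row[1] and row[1] not in known:
            # A non-empty parent path must itself be defined as a node.
            problems.append(
                f"line {line_no}: parent {row[1]!r} is not defined in node.csv")
    return known, problems


def check_links(links_csv_path, known_paths, custom_prefix="/Pentaho/"):
    """Return problems for link endpoints that point at unknown custom nodes."""
    problems = []
    with open(links_csv_path, newline="", encoding="utf-8") as f:
        for line_no, row in enumerate(csv.reader(f), start=1):
            if not row:
                continue
            if len(row) != 2:
                problems.append(
                    f"line {line_no}: expected 2 columns, found {len(row)}")
                continue
            for endpoint in row:
                if endpoint.startswith(custom_prefix) and endpoint not in known_paths:
                    problems.append(
                        f"line {line_no}: {endpoint!r} is not defined in node.csv")
    return problems
```

An empty problem list from both functions does not guarantee a clean import. It only rules out the structural mistakes checked here, so the logs above remain the authoritative source.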