How to Import Custom Pieces of Lineage

Goal

You often need to import pieces of custom lineage into IBM Manta Data Lineage to cover data movement implemented by technologies for which Manta Data Lineage does not yet provide scanners.

This article explains how to connect two supported technologies (Teradata and PostgreSQL) using an unsupported technology (Pentaho) to represent the flow Teradata → Pentaho → PostgreSQL.

Instructions

Create custom objects representing data assets (files, database objects, systems, etc.) or transformation assets (scripts, ETL flows, replications, etc.). The high-level process is:

  1. Create "connections" for import scenarios.
  2. Create custom objects.
  3. Connect them to the rest of the assets and to each other with custom links.
  4. Load them into the Manta Data Lineage repository.
  5. Review the results in the Manta Data Lineage UI and log files.
  6. Fix any issues by repeating steps 1–5.

Creating Connections for Imports

You can prepare several pieces of custom lineage and process them separately with different parameters. To do so, create a separate connection in the Admin UI for each import, that is, one connection per group of data.

  1. Create a new connection for Open Manta Extensions (import scenario). Specify the CONNECTION_ID (manta.import.id property) there; you will need it later.
  2. Create a new connection for Open Manta Direct Links (links import scenario). Specify the CONNECTION_ID (manta.import.links.id property) there; you will need it later.

Creating Custom Objects

Using custom metadata, you can create custom objects in the Manta repository. These can be used to represent transformation objects or create brand new assets as needed.

Here is an example of how to create a structure representing a transformation object.

  1. Go to the folder mantaflow/cli/input/import/<CONNECTION_ID>. (Create it if it does not exist yet. Use the <CONNECTION_ID> that you specified when creating the Open Manta Extensions connection for object import.)

  2. Create the file layer.csv with contents as follows. (The meaning of each column is described in Open Manta Extensions: Files and Formats.)

    layer.csv

    "1","Physical","Physical"
    
  3. Create the file resource.csv with contents as follows.

    resource.csv

    "2","Pentaho","Pentaho","Pentaho","1"
    
  4. Create the file node.csv with contents as follows.

    node.csv

    /Pentaho/MyProject,,MyProject,Pentaho,2
    /Pentaho/MyProject/MyFolder,/Pentaho/MyProject,MyFolder,Pentaho Folder,2
    /Pentaho/MyProject/MyFolder/MyJob,/Pentaho/MyProject/MyFolder,MyJob,Pentaho Job,2
    /Pentaho/MyProject/MyFolder/MyJob/col1,/Pentaho/MyProject/MyFolder/MyJob,col1,Pentaho Expression,2
    /Pentaho/MyProject/MyFolder/MyJob/col2,/Pentaho/MyProject/MyFolder/MyJob,col2,Pentaho Expression,2
    /Pentaho/MyProject/MyFolder/MyJob/col3,/Pentaho/MyProject/MyFolder/MyJob,col3,Pentaho Expression,2
    
Make sure that the file is delimited by commas.
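For larger hierarchies, writing node.csv by hand is error-prone, so it can help to generate it. The following is a minimal Python sketch, not part of Manta Data Lineage. It assumes that the columns are node path, parent path, name, type, and resource ID, as the example above suggests; the authoritative definition is in Open Manta Extensions: Files and Formats. The function names build_node_rows and to_csv_text are hypothetical.

```python
import csv
import io


def build_node_rows(resource, containers, leaves, leaf_type, resource_id):
    """Build node.csv rows (assumed columns: path, parent path, name, type, resource ID).

    containers is an ordered list of (name, type) pairs from the top-level
    node downward; leaves are the names placed under the last container.
    """
    rows = []
    parent = ""  # the top-level node has an empty parent path
    for name, node_type in containers:
        path = f"{parent or '/' + resource}/{name}"
        rows.append([path, parent, name, node_type, resource_id])
        parent = path
    for name in leaves:
        rows.append([f"{parent}/{name}", parent, name, leaf_type, resource_id])
    return rows


def to_csv_text(rows):
    """Serialize rows as comma-delimited text, one row per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    rows = build_node_rows(
        "Pentaho",
        [("MyProject", "Pentaho"), ("MyFolder", "Pentaho Folder"), ("MyJob", "Pentaho Job")],
        ["col1", "col2", "col3"],
        "Pentaho Expression",
        "2",
    )
    # Prints the six rows of the node.csv example above.
    print(to_csv_text(rows), end="")
```

Save the printed text as node.csv in mantaflow/cli/input/import/<CONNECTION_ID>. Because every parent path is derived from the row before it, a child cannot refer to a parent that is missing from the file.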

To connect custom objects to each other and to existing resources in the Manta Data Lineage repository, use the direct links module. The following example connects the custom object created above to Teradata and PostgreSQL to create the flow Teradata → Pentaho → PostgreSQL. The example expects the same column names in the source, the custom object, and the target. To try it, replace the Teradata and PostgreSQL paths with valid objects that exist in your Manta repository.

  1. Go to the folder mantaflow/cli/input/links/<CONNECTION_ID>. (Create it if it does not exist yet. Use the <CONNECTION_ID> that you specified when creating the Open Manta Direct Links connection for link import.)

  2. Create the file links.csv with "<source>","<target>" link pairs, as in the following example.

    links.csv

    "/Teradata/TDPROD1.my.com/PARTY_PKG/contact","/Pentaho/MyProject/MyFolder/MyJob"
    "/Pentaho/MyProject/MyFolder/MyJob/col1","/PostgreSQL/greenplum.my.com/gp_prod/masterdata/contact"
    

    The object path comes from the Manta Data Lineage UI.

Make sure that all the locations exist and the file is delimited by commas.
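Part of this check can be done before the import. The following Python sketch, which is an illustration rather than a Manta Data Lineage tool, verifies that every row of links.csv has exactly two fields and that every endpoint under the custom resource is defined in node.csv. It assumes that the first column of node.csv is the node path. The function name check_links and the custom_root parameter are hypothetical. Paths outside the custom resource (here Teradata and PostgreSQL) cannot be checked offline and still have to exist in the repository.

```python
import csv
import io


def check_links(links_text, node_text, custom_root="/Pentaho/"):
    """Return a list of problems found in links.csv content (empty if none)."""
    # Assumption: the first column of node.csv holds the node path.
    custom_paths = {row[0] for row in csv.reader(io.StringIO(node_text)) if row}
    problems = []
    for row_no, row in enumerate(csv.reader(io.StringIO(links_text)), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            problems.append(f"row {row_no}: expected 2 fields, found {len(row)}")
            continue
        for endpoint in row:
            # Only endpoints under the custom resource can be verified here.
            if endpoint.startswith(custom_root) and endpoint not in custom_paths:
                problems.append(f"row {row_no}: {endpoint} is not defined in node.csv")
    return problems
```

Read both files with open(...).read() and pass their contents to check_links. An empty result means that the structure is consistent; it does not guarantee that the import succeeds.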

Loading Custom Metadata and Lineage into Manta Data Lineage

The files created in the previous steps are automatically ingested during a lineage analysis run triggered in Process Manager in the Admin UI.

However, for a quick test, it is often more convenient to run only the import. To do so, run the following scenarios manually in this order, or create a workflow in Process Manager that contains them.

  1. New Minor Revision Scenario: opens a new minor revision so that the custom lineage is added to the last existing revision.
  2. Import Dataflow Scenario: ingests layer.csv, resource.csv, and node.csv.
  3. Import Links Dataflow Scenario: ingests links.csv.
  4. Commit Revision Scenario: persists the data into the repository. If you see any errors from the previous steps, you can instead run Rollback Revision Scenario to revert the changes and start again.
  5. Review the custom lineage import logs.

If you are not happy with the result and need to repeat the above steps, run Delete Revision Scenario to remove the newly added minor revision and start again. Note that both deletion and rollback may take some time to complete, depending on the repository size.

Review Logs

Any errors reported during lineage import will help you identify issues with the input file format. The logs are located under the LogViewer tab in Admin UI or directly on the filesystem: