Manta Flow Base Export User Documentation

This document describes the format of metadata exported from the Manta Flow application and how it can be used for further metadata analysis.

Export Execution

An export can be executed using the Export Repository Scenario in Process Manager and Orchestration API scripts. All data flows are also exported to CSV files stored in the <MANTA_DIR_HOME>/temp/processmanager/${workflowExecutionId}/csv folder and can be downloaded from Process Manager with the same structure as defined in Open Manta Extensions: Files and Formats and Open Manta Extensions: Usage.

Metadata Format

Exported metadata can be stored in CSV files or as tables in a database. This is its database schema.

All objects are exported into a single CSV file located in the mantaflow\cli\temp\csv directory. This single CSV file contains a varying number of columns based on the first object type column. Based on the object type column, the single output file can be split into seven original files. The previous version of the export created (7) CSV files which housed each of the different object types.

The following table lists the translations from file to object type.

Object type File type
layer DF_Layer
resource DF_Resource
node DF_Node
edge DF_Edge
node_attribute DF_Node_Attribute
edge_attribute DF_Edge_Attribute
source_code DF_Source_Code

Each object type has N number of attributes associated with it, which correlate to the columns described in their corresponding file schemas. For example, the layer object type includes attributes (columns) that are described in the preceding DF_Layer CSV file.

Metadata Analysis

This section contains some queries that can be useful for various metadata analyses. Note that all queries are written in Teradata SQL dialect. Also note that all of these queries are provided as examples of how the exports can be queried.

Finding a Fully-Qualified Object

When it is necessary to find an object defined by a fully-qualified name, self-join the DF_Node table with the Parent_Node_Id property. The object resource name is also necessary. For example, it is possible to find the column ExampleDB.ExampleTable.ExampleColumn in a Teradata resource using this query.

select Col.*
from DF_Node Col
  inner join DF_Node Tab on Col.Parent_Node_Id = Tab.Node_Id
  inner join DF_Node Dat on Tab.Parent_Node_Id = Dat.Node_Id
  inner join DF_Resource Res on Dat.Resource_Id = Res.Resource_Id
where
  Res.Resource_Name = 'Teradata' and
  Dat.Node_Name = 'ExampleDB' and
  Tab.Node_Name = 'ExampleTable' and
  Col.Node_Name = 'ExampleColumn'
;

Here is the result.

3      2      ExampleColumn Column 1

Listing All Object Attributes

When the object ID is known (e.g., from a query for finding a fully-qualified object), it is easy to list all its attributes using this query.

select Atr.Attribute_Name, Atr.Attribute_Value
from DF_Node_Attribute Atr
where Atr.Node_Id = 3
;

Here is the result.

ORDER               1
COLUMN_CHARSET      UNICODE
COLUMN_LENGTH       50
COLUMN_TYPE         VARCHAR

Finding All Direct Flows from an Object (One Step)

When the object ID is known (e.g., from a query for finding a fully-qualified object), it is possible to find all objects that a direct flow leads to (from the identified object) in one step. Note that Automatic Data Lineage exports edges on the attribute level, meaning that flow exists between attribute-level nodes—not between any other objects (such as tables).

select Tgt.*
from DF_Edge Edg
inner join DF_Node Src on Src.Node_Id = Edg.Source_Node_Id
inner join DF_Node Tgt on Tgt.Node_Id = Edg.Target_Node_Id
where
Edg.EdgeType = 'DIRECT' and
Src.Node_Id = 3
;

Here is the result.

6      5      1 ExampleColumn      BTEQ ColumnFlow     2
9      8      1 ExampleColumn      BTEQ ColumnFlow     2

Finding All Direct Flows from an Object (Recursive)

If it is necessary to find all direct flows from an object recursively, a recursive statement like this can be created.

with recursive Temp_Node(Node_Id, Depth) as (
  select Col.Node_Id, cast(0 as integer) as Depth
  from DF_Node Col
    inner join DF_Node Tab on Col.Parent_Node_Id = Tab.Node_Id
    inner join DF_Node Dat on Tab.Parent_Node_Id = Dat.Node_Id
    inner join DF_Resource Res on Dat.Resource_Id = Res.Resource_Id
  where
    Res.Resource_Name = 'Teradata' and
    Dat.Node_Name = 'SourceDB' and
    Tab.Node_Name = 'SourceTable' and
    Col.Node_Name = 'SourceColumn' 
union all
  select Tgt.Node_Id, Src.Depth + 1
  from Temp_Node Src
    inner join DF_Edge Edg on Edg.Source_Node_Id = Src.Node_Id
    inner join DF_Node Tgt on Edg.Target_Node_Id = Tgt.Node_Id
  where
    Edg.Edge_Type = 'DIRECT' and
    Src.Depth <= 10 
)
select distinct Node_Id
from Temp_Node;
;

However, this statement must be limited to a maximum number of steps (depth). Note that even this limitation will not necessarily prevent the statement from taking a very long time. Therefore, it is better to write the same function using a stored procedure (which also must be limited) or a recursive BTEQ script.