Open Manta Annotated Script Scanner Usage

The Annotated Script Scanner scans input files for annotations that carry custom metadata and produces lineage from that metadata.

The Annotated Script Scanner lets you keep custom metadata that describes lineage directly in the source code, where it is easy to define and maintain. It can replace the custom metadata import described in Open Manta Extensions: Usage in cases where that import is used to gather lineage for unsupported script-based technologies.

Descriptions of the annotations, their usage, and the edge cases are provided in the following sections.

Annotation Global Rules

Escaping

Each key or value in the annotation, except the annotation keyword itself, can contain special characters. The following characters do not need to be escaped.

All other characters are considered special characters. You must escape them by using one of the following methods.

@MANTASQL Annotation

The @MANTASQL annotation marks queries that are to be analyzed by IBM Automatic Data Lineage database scanners (for example, the Hive scanner or the Redshift scanner).

Arguments

The scanner looks for the associated query starting on the line directly after the end of the comment that contains the @MANTASQL annotation. The format in which the query is written in the input file is defined by the queryLocators configuration in JSON, according to Language Format Configuration JSON File Structure. Only SELECT queries without input parameters are supported.

Example

# script file location: /home/manta/input/analyzedscript.py
def extract_from_db():
    # @MANTASQL name=customernames conn_id=MyOracle
    query = """select firstname, lastname from customers"""
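
To illustrate how the annotation header breaks down into a keyword and key=value arguments, here is a minimal Python sketch. It is not the scanner's parser: the function name parse_annotation is hypothetical, and the shell-like quoting handled by shlex is an assumption that stands in for the scanner's own escaping rules.

```python
import shlex


def parse_annotation(comment_line):
    """Split an annotation comment into its keyword and key=value arguments."""
    # Drop the comment marker ('#' in Python; other languages use other markers).
    text = comment_line.strip().lstrip("#").strip()
    # Assumption: shell-like quoting, not the scanner's real escaping rules.
    tokens = shlex.split(text)
    keyword, args = tokens[0], {}
    for token in tokens[1:]:
        key, _, value = token.partition("=")
        args[key] = value
    return keyword, args


keyword, args = parse_annotation("# @MANTASQL name=customernames conn_id=MyOracle")
```

In this sketch, keyword holds the annotation name and args maps name and conn_id to the values from the example above.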

Generated Dataflow Graph

Connection ID Configuration

Connection IDs, along with their parameters, are defined in an external INI connection configuration file located in Manta Flow Client in the Annotated Script Scanner input directory: /input/annotatedscript/connectionsConfiguration.prm.

The purpose is that developers do not have to know the specific connection values; they can refer to a connection by its ID alone. For example:

[Connection ID]
Type=Type of resource
Connection_String=Connection string
Server_Name=Default server name
Database_Name=Default database name
Schema_Name=Default schema name
User_Name=User name

The specifics of the file correspond to the configuration from the Connection Definition Settings section in Informatica PowerCenter Resource Configuration.
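
As a rough illustration of how a connection ID maps to its parameters, the INI layout above can be read with Python's standard configparser module. This is only a sketch: the section name MyOracle and every value in SAMPLE are placeholders, and resolve_connection is a hypothetical helper, not part of the product.

```python
import configparser

# Hypothetical file contents; every value below is a placeholder.
SAMPLE = """
[MyOracle]
Type=Oracle
Connection_String=jdbc:oracle:thin:@dbhost:1521/ORCL
Server_Name=dbhost
Database_Name=ORCL
Schema_Name=SALES
User_Name=scanner
"""

parser = configparser.ConfigParser(interpolation=None)
parser.optionxform = str  # keep key case; configparser lowercases keys by default
parser.read_string(SAMPLE)


def resolve_connection(conn_id):
    """Return the parameters stored under a connection ID as a plain dict."""
    if not parser.has_section(conn_id):
        raise KeyError(f"unknown connection ID: {conn_id}")
    return dict(parser[conn_id])


params = resolve_connection("MyOracle")
```

With this layout, an annotation only needs conn_id=MyOracle, and the server, database, and schema defaults come from the file.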

@MANTALineage Annotation

The @MANTALineage annotation serves as an abstraction over the standard format of Automatic Data Lineage custom metadata. For more information, see Open Manta Extensions: Usage. You can use it to define lineage manually where no scanner exists; maintaining that lineage is entirely up to the developer.

Arguments

The annotation has to precede a compact block of lines containing separate mappings. (See the subsection regarding mapping definitions for more information.)

Example

# @MANTALineage identifier=ident default_tgt_type=OracleTable default_tgt_Database=MyOracleDB default_tgt_Schema=DefaultSchema
# Filesystem:exports/customers.txt/fullname ..> MyOracleDB/DefaultSchema/Customers/FULLNAME
# QueryRef:customernames/FIRSTNAME -> OracleTable:Customers/FIRSTNAME

The mapping in the example represents:

Mapping Definition

Each mapping is written in a single commented line with the following structure.

[SOURCE_RESOURCE_TYPE : ] SOURCE_PATH LINEAGE_EDGE_TYPE [TARGET_RESOURCE_TYPE : ] TARGET_PATH

Where:

The resource is generated based on the resource type provided. If the names contain "/" (the separator used in Manta’s custom metadata format), ">" (the lineage definition symbol), or "=" (the parameter-value separator for the annotation header), they have to be escaped or put in quotes.

When you're connecting lineage from a query by using the QueryRef resource type, the results of the query must be unique.
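
The mapping structure can be sketched as a regular expression in Python. This is an illustration under stated assumptions, not the scanner's grammar: it recognizes only the two edge symbols that appear in the earlier example (-> and ..>), assumes a # comment marker, and ignores quoted or escaped names. The name parse_mapping is hypothetical.

```python
import re

# Assumption: only the edge symbols seen in the example ("->" and "..>") are handled,
# and names contain no whitespace, quotes, or escaped characters.
MAPPING = re.compile(
    r"^\s*#\s*(?:(?P<src_type>\w+)\s*:\s*)?(?P<src_path>\S+)"
    r"\s+(?P<edge>->|\.\.>)\s+"
    r"(?:(?P<tgt_type>\w+)\s*:\s*)?(?P<tgt_path>\S+)\s*$"
)


def parse_mapping(line):
    """Split one commented mapping line into its five parts."""
    match = MAPPING.match(line)
    if match is None:
        raise ValueError(f"not a mapping line: {line!r}")
    return match.groupdict()


mapping = parse_mapping(
    "# QueryRef:customernames/FIRSTNAME -> OracleTable:Customers/FIRSTNAME"
)
```

When a resource type is omitted, as on the target side of the first mapping in the example, the corresponding src_type or tgt_type entry is None and the type would have to come from a default.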

Default Values

Meaningful defaults can be provided for resource types that do not contain an optional element in the hierarchy (for example, (Directory)* in the Filesystem resource type) and do not have repeated segment types (for example, Directory/Directory/File/Column). The arguments are named default_[src|tgt]_${node_type}, where src or tgt defines whether the default value is usable as the source or the target of the mapping, and ${node_type} is the name of the type in the hierarchy (for example, Database or Table). The name used for ${node_type} is case-sensitive.

If defaults are provided, a path only needs to contain its non-default end; the beginning of the hierarchy is filled in from the defaults. If only some leading types of the qualified path (for example, the first two) are defaulted, only the remaining values have to be specified, as shown in the example. Values cannot be defaulted out of order; for example, you cannot default only the first and the third node types defined in the hierarchy.

The type keyword in lowercase is reserved for defaulting a resource type.
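
The way defaults fill in the beginning of a path can be sketched in a few lines of Python. This is a simplified model rather than the scanner's logic: expand_path is a hypothetical helper, the hierarchy is assumed to have no optional or repeated levels, and names are assumed to contain no escaped or quoted "/" characters.

```python
def expand_path(path, hierarchy, defaults):
    """Prefix a partial path with defaults for the leading hierarchy levels."""
    levels = hierarchy.split("/")
    given = path.split("/")  # assumption: no escaped or quoted "/" in names
    missing = len(levels) - len(given)
    if missing < 0:
        raise ValueError("path has more segments than the hierarchy")
    prefix = []
    for level in levels[:missing]:
        # Lookup is case-sensitive, like the ${node_type} names.
        if level not in defaults:
            raise ValueError(f"no default for {level}")
        prefix.append(defaults[level])
    return "/".join(prefix + given)


defaults = {"Database": "MyOracleDB", "Schema": "DefaultSchema"}
full = expand_path("Customers/FIRSTNAME", "Database/Schema/Table/Column", defaults)
```

Because only the leading levels are ever filled in, the sketch cannot default the first and third levels while leaving the second to the path, which mirrors the ordering restriction described above.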

Generated Dataflow Graph

Each @MANTALineage annotation generates the following dataflow graph.

Example:

# script file location: /home/manta/analyzedscript.py

# @MANTALineage identifier=Customerpush
# OracleTable:DB/Schema/Table/MyColumn -> OracleTable:OracleDB/OracleSchema/OracleTable/TgtColumn

This example generates the following dataflow: DB/Schema/Table/MyColumn -> home/manta/analyzedscript.py/Customerpush/1 TgtColumn -> OracleDB/OracleSchema/OracleTable/TgtColumn.

Resource Types

To resolve the differences between how various database technologies store data, resource types have been introduced to the @MANTALineage annotation. These types provide the Annotated Script Scanner with the necessary hierarchy metadata.

The idea behind this is that the developer does not have to know the specific values and has the option to use only a particular resource type.

The resource types and their required hierarchy are defined in an external CSV configuration file that is located in Manta Flow Client in /scenarios/manta-dataflow-cli/etc/annotatedscriptResourceTypesConfiguration.csv.

File format and examples:

"Entity Type";"Resource Name";"Resource Type";"Hierarchy"
"OracleTable";"Oracle";"Oracle";"Database/Schema/Table/Column"
"Filesystem";"Filesystem";"Filesystem";"Server/(Directory)*/File/Column"

The "/" character is reserved for denoting a path in a hierarchy. You can’t use it to define the name of a resource type.
The "()*" characters are reserved for defining a “Kleene star” operator - a type in the hierarchy that repeats from 0 to n times.
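
To make the file format and the Kleene star concrete, the following Python sketch reads the two sample rows with the standard csv module and labels the segments of a path with their hierarchy levels. The helpers load_hierarchies and assign_levels are hypothetical, the sample path is invented, and the sketch assumes at most one starred level per hierarchy.

```python
import csv
import io

SAMPLE = '''"Entity Type";"Resource Name";"Resource Type";"Hierarchy"
"OracleTable";"Oracle";"Oracle";"Database/Schema/Table/Column"
"Filesystem";"Filesystem";"Filesystem";"Server/(Directory)*/File/Column"
'''


def load_hierarchies(text):
    """Map each entity type to its list of hierarchy levels."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";", quotechar='"')
    return {row["Entity Type"]: row["Hierarchy"].split("/") for row in reader}


def assign_levels(segments, levels):
    """Pair path segments with level names; a (X)* level absorbs the surplus."""
    starred = [i for i, lvl in enumerate(levels)
               if lvl.startswith("(") and lvl.endswith(")*")]
    if not starred:
        if len(segments) != len(levels):
            raise ValueError("path length does not match the hierarchy")
        return list(zip(levels, segments))
    i = starred[0]  # assumption: at most one starred level
    repeat = len(segments) - (len(levels) - 1)
    if repeat < 0:
        raise ValueError("path is shorter than the required levels")
    name = levels[i][1:-2]  # "(Directory)*" becomes "Directory"
    expanded = levels[:i] + [name] * repeat + levels[i + 1:]
    return list(zip(expanded, segments))


hierarchies = load_hierarchies(SAMPLE)
labeled = assign_levels(
    ["host", "exports", "2024", "customers.txt", "fullname"],
    hierarchies["Filesystem"],
)
```

Here the two middle segments are both labeled Directory; a path with no directories at all would still be accepted, since the starred level may repeat zero times.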

@MANTAInclude

The @MANTAInclude annotation designates another file containing annotations that needs to be imported. This is useful for queries with placeholders: because such queries are not known statically, they are logged in a file together with the @MANTASQL annotation, and that file is then included.

The Annotated Script Scanner scans the referenced file. Syntactically, the contents of this file have to follow the prerequisites previously defined (annotations in comments and SQL queries in the correct format according to the configuration). The files do not, however, have to be valid source files in the input language. They are also read directly, so it is not possible to define any pre-processing.

The lineage is displayed as flowing through the script that was originally analyzed (not the one referenced/included), as that is the physical position of the data flow. The context created by the included file (for example, the queries analyzed) is accessible from the file that was originally analyzed. As a result, mappings that reference the logged query can be defined in the originally analyzed script file and do not have to be generated in the included file.

The included file cannot contain another @MANTAInclude annotation.

Arguments

Example

One example of a use case is as follows.

def extract_from_db(customer_name):
    query = """select birthdate from customers where customer_name = %s"""
    # @MANTAInclude path='customerquery.py'
    append_mantasql(query % customer_name, '/home/mantauser/loggedqueries/customerquery.py', 'queryname', 'hive_conn')  # Append into a file to be scanned, provide connection id
    db_cursor.execute(query, (customer_name, ))  # Prepared statement - customer_name needs to be a tuple
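
The append_mantasql helper called above is not defined in the example. A minimal sketch of what such a helper might look like follows; the signature is inferred from the call, and the layout of the logged query (a triple-quoted Python assignment on the line after the annotation) is an assumption that would have to match the queryLocators configuration in use.

```python
def append_mantasql(query, path, name, conn_id):
    """Append an annotated copy of a rendered query to a file for later scanning."""
    # Assumption: name and conn_id contain no characters that need escaping,
    # and the query text contains no triple quotes.
    with open(path, "a", encoding="utf-8") as log:
        log.write(f"# @MANTASQL name={name} conn_id={conn_id}\n")
        log.write(f'query = """{query}"""\n\n')
```

Each call appends one annotation comment followed directly by the query, so the logged file satisfies the rule that the query starts on the line after the comment.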