Open Manta Annotated Script Scanner Usage

The Annotated Script Scanner scans input files for its own annotations, which carry custom metadata, and produces lineage from that metadata.

With the Annotated Script Scanner, custom metadata that contains lineage information is kept directly in the source code, where it is easy to define and maintain. It can replace Open Manta Extensions: Usage in cases where custom metadata import is used to gather the lineage for unsupported script-based technologies.

Descriptions of the annotations, their usage, and the edge cases are provided in the following sections.

Annotation Global Rules

Annotations for this scanner are unrelated to any constructs of the specific language. They are represented as regular comments by using language-specific syntax. For example:
- Python: # @MANTA
- Visual Basic: '@MANTA
- Ruby: # @MANTA
Each Annotated Script Scanner annotation starts with the "@MANTA" prefix in a comment.
Annotated Script Scanner annotations are written in standard comment blocks.
- The format of comments is defined in the JSON configuration file in the commentLocators section according to Language Format Configuration JSON File Structure.
- One annotation must be a compact block of commented lines.
  - An uncommented line ends the block.
  - User comments cannot be added directly after the annotation block as they would be taken as part of the annotation.
- Annotation arguments can be written on separate lines.
  - They can be on the first line with the annotation itself.
  - Multiple arguments can be on the same line.
  - One argument cannot be split between multiple lines.
  - The annotation block can contain empty lines if they are commented.
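
The block rules above can be sketched in Python as follows. This is only an illustration, not the scanner's implementation: the function name find_annotation_blocks is hypothetical, the comment prefix is passed in directly instead of being read from commentLocators, and the sketch assumes that every commented line following an @MANTA line belongs to the same block.

```python
def find_annotation_blocks(lines, comment_prefix="#", marker="@MANTA"):
    """Return annotation blocks, each as a list of comment bodies.

    Hypothetical helper: a block starts at a commented line whose body
    begins with the marker and ends at the first uncommented line.
    """
    blocks = []
    current = None
    for raw in lines:
        stripped = raw.strip()
        if stripped.startswith(comment_prefix):
            body = stripped[len(comment_prefix):].strip()
            if current is not None:
                # Any commented line continues the block, including an
                # empty commented line or a trailing user comment.
                current.append(body)
            elif body.startswith(marker):
                current = [body]
        elif current is not None:
            # An uncommented line ends the block.
            blocks.append(current)
            current = None
    if current is not None:
        blocks.append(current)
    return blocks
```

Note that a user comment placed directly after the annotation ends up inside the block, which is why the rules above forbid it.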

Escaping

Each key or value in the annotation, except the annotation keyword itself, can contain special characters. The following characters do not need to be escaped.

Letters: a..z, A..Z
Numbers: 0..9
Symbols: '_', '.', '#', '$', '@', '-'
Unicode characters: \u0080..\u009F, \u00A1..\u00AB, \u00AD..\uFEFE, \uFF00..\uFFFE

All other characters are considered special characters. You must escape them by using one of the following methods.

Prefix any of the characters :, ", ', >, =, /, \, and (space) with a backslash (\).
Alternatively, put the whole key or value in double quotation marks. If the name contains a double quotation mark, escape it by using \. Everything else within the double quotation marks is taken as is.
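
As a rough illustration of the second method, the following Python sketch wraps a key or value in double quotation marks only when it contains a special character. The function name quote_value is hypothetical, and the sketch does not cover values that end with a backslash, because the text does not say how those are treated inside quotation marks.

```python
import re

# Characters listed above as safe without escaping.
_SAFE = re.compile(
    r"[A-Za-z0-9_.#$@\-"
    r"\u0080-\u009F\u00A1-\u00AB\u00AD-\uFEFE\uFF00-\uFFFE]*"
)

def quote_value(value):
    """Return value unchanged if it is safe, else wrapped in double quotes."""
    if _SAFE.fullmatch(value):
        return value
    # Inside the quotes only the double quotation mark needs a backslash.
    return '"' + value.replace('"', '\\"') + '"'
```

For instance, a path element such as customers.txt is returned as is, while a name that contains a space or a slash is returned in double quotation marks.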

@MANTASQL Annotation

The @MANTASQL annotation serves as an annotation for queries that are to be analyzed by IBM Automatic Data Lineage database scanners (e.g., the Hive scanner or Redshift scanner).

Arguments

name
- Defines the name of the result of the query (case-sensitive).
- Used to connect the lineage result with other objects using the @MANTALineage annotation.
- Has to be unique within the script file and any script files included via the @MANTAInclude annotation.
conn_id
- Name of the connection ID.
- Describes a connection that Manta has configured and that will be used for the analysis.
- See the subsection regarding connection IDs for more information.

The associated query is searched for starting on the line directly after the end of the comment that contains the @MANTASQL annotation. How the query is written in the input file is defined by the queryLocators configuration in JSON, according to Language Format Configuration JSON File Structure. Only SELECT queries without input parameters are supported.

Example

# script file location: /home/manta/input/analyzedscript.py
def extract_from_db():
    # @MANTASQL name=customernames conn_id=MyOracle
    query = """select firstname, lastname from customers"""

Generated Dataflow Graph

The query defined under the @MANTASQL annotation will be analyzed by using Automatic Data Lineage scanners, generating a dataflow graph as usual.
The query node is named as given by the name argument and is placed under the script node that contains the annotation.
Query column nodes are generated under the query node. The column nodes have the same names as the columns of the query result, prefixed with an ordinal as per Automatic Data Lineage standard behavior. Depending on the database specifics, names can be case sensitive or uppercase, which is typical for Oracle configurations.
The previous example generates the following data flow: OracleDB/Schema/CUSTOMERS/FIRSTNAME -> home/manta/input/analyzedscript.py/customernames/1 FIRSTNAME. A similar flow is generated for lastname.

Connection ID Configuration

Connection IDs, along with their parameters, are defined in an external INI connection configuration file located in Manta Flow Client in the Annotated Script Scanner input directory - /input/annotatedscript/connectionsConfiguration.prm.

This way, the developer does not have to know the specific values and can refer to a connection by its ID alone. For example:

[Connection ID]
Type=Type of resource
Connection_String=Connection string
Server_Name=Default server name
Database_Name=Default database name
Schema_Name=Default schema name
User_Name=User name

The specifics of the file correspond to the configuration from the Connection Definition Settings section in Informatica PowerCenter Resource Configuration.
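
To show how such a file maps connection IDs to parameters, here is a minimal Python sketch that reads INI text with the standard configparser module. It assumes the file is plain INI as shown above; the function name load_connections and all sample values are made up for illustration, and the scanner itself may read the file differently.

```python
import configparser

def load_connections(text):
    """Parse INI text into {connection_id: {parameter: value}}."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep parameter names case-sensitive
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}

# Made-up values, for illustration only.
SAMPLE = """
[MyOracle]
Type=Oracle
Server_Name=dbhost
Schema_Name=SALES
"""

connections = load_connections(SAMPLE)
print(connections["MyOracle"]["Schema_Name"])
```

An annotation such as @MANTASQL name=customernames conn_id=MyOracle would then refer to the MyOracle section.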

@MANTALineage Annotation

The @MANTALineage annotation serves as an abstraction over the standard format of Automatic Data Lineage custom metadata. For more information, see Open Manta Extensions: Usage. You can use it to define lineage manually where no scanner exists. Maintaining this lineage is entirely up to the developer.

Arguments

identifier
- Defines the name of the lineage block.
- Has to be unique within the script file and any script files included via the @MANTAInclude annotation.
- Mandatory.
default_src_type
- Default resource type for the source of the mapping if not defined explicitly by the mapping.
- Can be QueryRef or one of the configured resource types (see the subsection regarding resource types for more information).
- Optional.
default_tgt_type
- Default resource type for the target of the mapping if not defined explicitly by the mapping.
- Can be QueryRef or one of the configured resource types (see the subsection regarding resource types for more information).
- Optional.
default_[src|tgt]_${node_type}
- Default values for the omitted mapping path segments.
- See the subsection regarding default values for more information.
- Repeating, optional.

The annotation has to precede a compact block of lines containing separate mappings. (See the subsection regarding mapping definitions for more information.)

Example

# @MANTALineage identifier=ident default_tgt_type=OracleTable default_tgt_Database=MyOracleDB default_tgt_Schema=DefaultSchema
# Filesystem:exports/customers.txt/fullname ..> MyOracleDB/DefaultSchema/Customers/FULLNAME
# QueryRef:customernames/FIRSTNAME -> OracleTable:Customers/FIRSTNAME

The mapping in the example represents:

On line 2
- Indirect lineage.
- From the column fullname in the file exports/customers.txt (the resource type Filesystem needs to be defined).
- To the column FULLNAME in the Oracle database table MyOracleDB/DefaultSchema/Customers. The resource type OracleTable needs to be defined, which is used from default_tgt_type.
On line 3
- Direct lineage.
- From the FIRSTNAME column of the result of the @MANTASQL query named customernames, which has to be defined (the name is case-sensitive and has to follow database specifics; for example, in Oracle, names are typically in uppercase).
- To the column FIRSTNAME in the Oracle database table MyOracleDB/DefaultSchema/Customers. The resource type OracleTable needs to be defined and uses default_tgt_Database and default_tgt_Schema.

Mapping Definition

Each mapping is written in a single commented line with the following structure.

[SOURCE_RESOURCE_TYPE : ] SOURCE_PATH LINEAGE_EDGE_TYPE [TARGET_RESOURCE_TYPE : ] TARGET_PATH

Where:

SOURCE_PATH and TARGET_PATH elements have the (PATH_ELEMENT "/" )* PATH_ELEMENT structure.
LINEAGE_EDGE_TYPE is "->" for direct data lineage or "..>" for indirect data lineage.

The resource is generated based on the resource type provided. If the names contain "/" (the separator used in Manta’s custom metadata format), ">" (the lineage definition symbol), or "=" (the parameter-value separator for the annotation header), they have to be escaped or put in quotes.

When you're connecting lineage from a query by using the QueryRef resource type, the results of the query must be unique.
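
The mapping structure can be illustrated with a small Python parser. This is a simplified sketch under stated assumptions, not the scanner's grammar: the function name parse_mapping is hypothetical, the edge symbol is assumed to be surrounded by whitespace, and escaped or quoted names are not handled.

```python
import re

# [TYPE :] PATH  (-> | ..>)  [TYPE :] PATH, plain names only.
_MAPPING = re.compile(
    r"^\s*(?:(?P<src_type>[^:/\s]+)\s*:\s*)?(?P<src_path>\S+)"
    r"\s+(?P<edge>\.\.>|->)\s+"
    r"(?:(?P<tgt_type>[^:/\s]+)\s*:\s*)?(?P<tgt_path>\S+)\s*$"
)

def parse_mapping(line):
    """Split one mapping line (comment prefix already removed) into parts."""
    match = _MAPPING.match(line)
    if match is None:
        raise ValueError(f"not a mapping: {line!r}")
    return {
        "src_type": match["src_type"],          # None if omitted
        "src_path": match["src_path"].split("/"),
        "edge": "DIRECT" if match["edge"] == "->" else "INDIRECT",
        "tgt_type": match["tgt_type"],          # None if omitted
        "tgt_path": match["tgt_path"].split("/"),
    }
```

A resource type that comes back as None would be taken from default_src_type or default_tgt_type of the enclosing annotation.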

Default Values

For resource types that do not contain an optional element in the hierarchy (for example, (Directory)* in the Filesystem resource type) and do not have repeated segment types (for example, Directory/Directory/File/Column), meaningful defaults can be provided. The argument names have the form default_[src|tgt]_${node_type}, where src or tgt states whether the default value applies to the source or the target of the mapping, and ${node_type} is the name of the type in the hierarchy (for example, Database or Table). The name used for ${node_type} is case-sensitive.

If defaults are provided, a mapping needs only the trailing part of the path that is not covered by defaults. The leading part of the hierarchy is filled in from the defaults. If only some types of the qualified path (for example, the first two) are defaulted, only the remaining values have to be specified, as shown in the @MANTALineage example. It is not possible to default the values out of order. For example, you cannot default only the first and the third node types defined in the hierarchy.

The type keyword in lowercase is reserved for defaulting a resource type.
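
The defaulting rule can be sketched as follows. The function name resolve_path is hypothetical, and the sketch covers only hierarchies without optional or repeated types, as required above. It assumes the defaults have already been collected into a dictionary keyed by node type.

```python
def resolve_path(hierarchy, path_segments, defaults):
    """Fill omitted leading path segments from defaults.

    hierarchy     -- node types in order, e.g. ["Database", "Schema", ...]
    path_segments -- the segments written in the mapping (trailing part)
    defaults      -- {node_type: value} from default_[src|tgt]_${node_type}
    """
    missing = len(hierarchy) - len(path_segments)
    if missing < 0:
        raise ValueError("path has more segments than the hierarchy")
    prefix = []
    for node_type in hierarchy[:missing]:
        if node_type not in defaults:
            raise ValueError(f"no default for omitted segment {node_type!r}")
        prefix.append(defaults[node_type])
    return prefix + list(path_segments)
```

With the hierarchy Database/Schema/Table/Column and defaults for Database and Schema, the path Customers/FIRSTNAME resolves to the fully qualified four-segment path. Because only leading segments are filled, defaulting the first and third types while writing the second is not expressible.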

Generated Dataflow Graph

Each @MANTALineage annotation generates the following dataflow graph.

The source objects and target objects are created if necessary.
An intermediary Custom Transformation node is created implicitly. It represents the @MANTALineage annotation where the mapping was found, and the lineage flows through it. This node has underlying ColumnFlow nodes that describe the lineage defined by the annotation.
The lineage flows from the source column that is provided by the script -> @MANTALineage annotation -> mapping -> source attribute node into the target node.
The name of the transformation node in the script is as provided by the identifier attribute.
The names of the columns in the intermediary script node are the same as those of the target columns, prefixed with an ordinal as per Automatic Data Lineage standard behavior.
The lineage is created from the source columns to the intermediary columns to the target columns. If there is indirect lineage, an indirect edge flows from the source to the intermediary node and a direct edge flows from the intermediary node to the target node, as per Automatic Data Lineage standard behavior.

Example:

# script file location: /home/manta/analyzedscript.py

# @MANTALineage identifier=Customerpush
# OracleTable:DB/Schema/Table/MyColumn -> OracleTable:OracleDB/OracleSchema/OracleTable/TgtColumn

This example generates the following dataflow: DB/Schema/Table/MyColumn -> home/manta/analyzedscript.py/Customerpush/1 TgtColumn -> OracleDB/OracleSchema/OracleTable/TgtColumn.
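
The shape of the generated flow can be modeled with a short Python sketch. It is an assumption-laden illustration, not the scanner's output format: the function name lineage_edges is hypothetical, edges are plain tuples, and the ordinal is assumed to follow the order of the mappings within the annotation.

```python
def lineage_edges(script_path, identifier, mappings):
    """Return (from, to, edge_type) tuples for the mappings of one annotation.

    mappings -- list of (source_path, edge_type, target_path), where
                edge_type is "DIRECT" or "INDIRECT".
    """
    edges = []
    for ordinal, (source, edge_type, target) in enumerate(mappings, start=1):
        column = target.rsplit("/", 1)[-1]
        # Intermediary column under script/identifier, named after the
        # target column and prefixed with an ordinal (assumed ordering).
        middle = f"{script_path.lstrip('/')}/{identifier}/{ordinal} {column}"
        edges.append((source, middle, edge_type))  # may be indirect
        edges.append((middle, target, "DIRECT"))   # always direct
    return edges
```

Called with the script path, the identifier Customerpush, and the single mapping from the example, the sketch yields the two hops shown in the data flow above.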

Resource Types

To resolve the differences between how various database technologies store data, resource types have been introduced to the @MANTALineage annotation. These types provide the Annotated Script Scanner with the necessary hierarchy metadata.

The idea behind this is that the developer does not have to know the specific values and has the option to use only a particular resource type.

The resource types and their required hierarchy are defined in an external CSV configuration file that is located in Manta Flow Client in /scenarios/manta-dataflow-cli/etc/annotatedscriptResourceTypesConfiguration.csv.

File format and examples:

"Entity Type";"Resource Name";"Resource Type";"Hierarchy"
"OracleTable";"Oracle";"Oracle";"Database/Schema/Table/Column"
"Filesystem";"Filesystem";"Filesystem";"Server/(Directory)*/File/Column"

The "/" character is reserved for denoting a path in a hierarchy. You can’t use it to define the name of a resource type.
The "()*" characters are reserved for defining a “Kleene star” operator: a type in the hierarchy that repeats from 0 to n times.
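
As an illustration of the file format, the following Python sketch reads the sample rows above with the standard csv module and splits each hierarchy into its node types. The function name load_resource_types and the shape of the returned dictionary are assumptions made for this example.

```python
import csv
import io

SAMPLE = '''"Entity Type";"Resource Name";"Resource Type";"Hierarchy"
"OracleTable";"Oracle";"Oracle";"Database/Schema/Table/Column"
"Filesystem";"Filesystem";"Filesystem";"Server/(Directory)*/File/Column"
'''

def load_resource_types(text):
    """Parse the semicolon-separated configuration into a dictionary."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";", quotechar='"')
    return {
        row["Entity Type"]: {
            "resource_name": row["Resource Name"],
            "resource_type": row["Resource Type"],
            # "/" separates the node types of the hierarchy.
            "hierarchy": row["Hierarchy"].split("/"),
        }
        for row in reader
    }

resource_types = load_resource_types(SAMPLE)
```

The "(Directory)*" entry stays a single hierarchy element here; expanding it to zero or more Directory segments is left out of the sketch.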

@MANTAInclude

The @MANTAInclude annotation enables you to designate that another file containing annotations needs to be imported. This is useful for queries with placeholders: because such queries are not known statically, they are logged in a file with the @MANTASQL annotation, and that file is then included.

The Annotated Script Scanner scans the referenced file. Syntactically, the contents of this file have to follow the prerequisites previously defined (annotations in comments and SQL queries in the correct format according to the configuration). The files, however, do not have to be valid source files of the input language. They are also read directly, so it is not possible to define any pre-processing.

The lineage is displayed as flowing through the script that was originally analyzed (not the one referenced/included), as that is the physical position of the data flow. The context created by the included file (for example, the queries analyzed) is accessible from the file that was originally analyzed. As a result, mappings that reference the logged query can be defined in the script file that was originally analyzed and do not have to be generated in the included file.

The included file cannot contain another @MANTAInclude annotation.

Arguments

path
- Reference to the included file.
- The base path is taken from the configured includeFileBasePath. The generated files do not need to be scanned with the rest of the processed scripts, so they should not be placed in the input directory.

Example

One example of a use case is as follows.

def extract_from_db(customer_name):
    query = """select birthdate from customers where customer_name = %s"""
    # @MANTAInclude path=customerquery.py
    append_mantasql(query % customer_name, '/home/mantauser/loggedqueries/customerquery.py', 'queryname', 'hive_conn')  # Append into a file to be scanned, provide connection id
    db_cursor.execute(query, (customer_name, ))  # Prepared statement - customer_name needs to be a tuple
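
The example calls append_mantasql without showing it. A minimal sketch of such a helper follows. Its body is an assumption: it presumes that the comment and query locators are configured for Python-style comments and triple-quoted strings, and it does not handle queries that themselves contain triple quotation marks.

```python
def append_mantasql(query, log_path, name, conn_id):
    """Append a resolved query with its @MANTASQL annotation to log_path.

    Hypothetical helper: writes the annotation as a comment and the query
    on the line directly after it, so the scanner can pair them.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"# @MANTASQL name={name} conn_id={conn_id}\n")
        log.write(f'query = """{query}"""\n\n')
```

Because name has to be unique within the script file and its included files, a real helper would also need to avoid logging the same name twice. The sketch leaves that out.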