Deduction: How IBM Manta Data Lineage Handles Unresolved References to Database Objects

Overview

When scanning source systems (databases, ETL tools, reporting tools, etc.), Manta Data Lineage relies on information stored in its database dictionary to properly understand what objects and types exist in the database (tables, views, columns, functions, schemas, etc.). This database dictionary is populated during the database scanners' extraction phase and is later used in the dataflow analysis phase by the database and other scanners (ETL, reporting tools, etc.). In cases where the database dictionary is not available (or is outdated), Manta Data Lineage deduction process is utilized. This article explains what the deduction process is, how it works, and its current limitations. It is critical that the database dictionary is available for the databases that are being accessed by ETL, reporting, or other tools or scripts; otherwise, the lineage cannot be automatically stitched together and the results may not be 100% accurate and/or complete as the deduction process cannot be 100% accurate. (See the details below for further explanation as to why this occurs.)

Deduction may also opearte in cases where the target database dictionary is available but Manta Data Lineage was not able to automatically match the connection reference in the particular tool (ETL, reporting, etc.) to the respective Manta Data Lineage connection (and the associated database dictionary). Refer to Alias (Manual) Connection Mappings Explained in such cases.

Description of the Deduction Process

When scanning the source systems, Manta Data Lineage might run into sources that reference objects it doesn’t recognize. In an SQL script, it might appear as in the example below.

SELECT surname FROM core.clients;

If during the extraction there was no object called clients in the schema core, Manta Data Lineage might not know what objects the surname and core.clients names refer to.

This can happen for many different reasons: an outdated component might refer to objects that are no longer present in a database; a new component might refer to objects that weren’t scanned by Manta Data Lineage yet; the input might work with dynamically-created objects that aren’t present in the database most of the time (create - use - drop); the names might refer to parts of the system excluded from Manta scans; or one of many other reasons.

When this happens, Manta Data Lineage does not simply skip the unknown objects —instead, it tries to deduce what the unresolved reference is most likely to point to. Objects created this way are marked in the data lineage with the attribute Definition Source set to “DEDUCTION”. As of R42.1, the deduced objects are also marked by the “D” symbol in the graph visualization.

No alt text provided

The key takeaway here is that while deduction can allow Manta Data Lineage to provide lineage even with incomplete information, the outputs are generally not as reliable as when the lineage is based on properly recognized, existing objects.

Limitations

As mentioned above, the deduction engine often needs to choose between multiple possible object types to make the deduction. This can lead to a disconnect in the lineage through the deduced object or the deduced object being completely missing. The most common cases are described below.

Incorrectly Deduced Type of an Object

Sometimes, the deduction is rather straightforward.

SELECT surname FROM core.clients;

Here, it’s quite clear that core.clients refers to a table in the schema core.

In other cases, it might not be as clear.

SELECT surname FROM clients;

Here, clients might refer to either a table, a view, or a synonym, and there’s no way to tell for sure unless there’s a known table, view, or synonym of that name. We have to decide based on heuristics. Here, we have deduced that clients is a table, based on our experience that in most databases, there are more tables than views or synonyms.

The decisions made by the deduction engine are sometimes even more complex and depend on other factors like known related objects or input syntax. In the following Oracle snippet, the utl_xml.create() is likely a function call.

SELECT utl_xml.create(xmltext) FROM input_queue;

However, in Oracle functions can be part of a schema or a package, and the deduction outcome depends on whether we can resolve ult_xml to an existing schema or an existing package.

Insert into an Unknown Table with Implicit Columns (Also Known as a “Blind Insert”)

It is tricky to insert into an unknown table — that is, a table that is not in the database dictionary that Manta Data Lineage created during the extraction process or in cases where no dictionary is available because the connection could not be mapped automatically and no manual connection mapping was performed. In general cases, there is no guarantee that the column names in the query output will be unique or represent the actual table columns. (The insert happens by column order, not by column name.) In the example below, it might look like some column names from the deduced table rep_client are present, but first_name and surname are the names of candidates columns and not necessarily also the names of the INSERT’s target; and for the last two columns, there are simply no column names to use.

The target object will be deduced.

No alt text provided

However, on the column level (as described above), it is not possible to reliably deduce column names. Manta Data Lineage will deduce a single TABLE COLUMNS UNKNOWN column with all the column-level flows going into it.

No alt text provided

Incorrectly Deduced Type of an Object Causing Disconnected Lineage

In the below input, we have deduced clients as a table and surname as a column in this table (the upper part of the screenshot). Later, however, we might process a second input (the same or different script) that creates a view.

SELECT surname FROM core.clients;

CREATE OR REPLACE VIEW core.clients AS
SELECT surname FROM core.parties WHERE party_type = 'client';

In the second input, it’s clear that clients is a view, so the lineage will lead from core.parties to the view core.clients. But the former input resulted in lineage from a (deduced) table core.clients. And what was supposed to be a single CLIENTS object resulted in two separate disconnected deduced objects.

No alt text provided

Unqualified Object Names/References

Objects, such as tables, views, procedures, functions, etc., used in SQL queries can be fully-, semi-, or unqualified.

In cases where the object name is semi- or unqualified and there is no other piece of information providing additional context, Manta Data Lineage may run into ambiguity that often leads to disconnected lineage. This is usually not the case for scripts that live in the database (such as views and stored procedures) when extracted by Manta Data Lineage (any semi- or unqualified references to the object have the context of current schema/database coming from their location within the database. However, especially for manually provided script files, the context is missing unless specified in the input folder hierarchy. See the details for individual scanners; for example, Oracle PL/SQL Manual Inputs for Oracle. The logic for handling unqualified object references is technology dependent, and Manta Data Lineage reflects that in the individual scanners. For example, MS SQL searches for objects in the default dbo schema if not found in the current schema; Oracle has public synonyms that may be referred to by unqualified names; and Snowflake has a configurable search path that determines which schemas are searched. Manta Data Lineage generally deduces all unqualified names as the current schema, which can lead to disconnected lineage if a mix of qualified and unqualified names is used. For example:

SET SCHEMA core;

INSERT INTO clients (surname)
SELECT surname FROM source_table;

INSERT INTO target_table (surname)
SELECT surname FROM dbo.clients;

In the first INSERT statement, the clients table is deduced in the current schema (core), but in the second one, a qualified name makes it clear that the clients table should be deduced in the schema dbo.

No alt text provided

Impossible to Determine Object at All

There are cases where the deduction engine won’t be able to deduce an appropriate object. Here are some examples.

Deduction in Connections between Source Systems

When a source system (like an ETL or reporting tool) uses a database, some or all the information needed to resolve the database queries may be missing. In such a case, the upper levels of the object hierarchy (server, database, schema) may need to be deduced as default values.

If the information about the server name is missing, DEFAULT_SERVER or __dummy__ is often used as a deduced default server name. For a database, DEFAULT_DATABASE might be used, etc. In some cases where reasonable defaults exist (like dbo for MS SQL or PUBLIC for PostgreSQL), these default names are used where the information is missing.

The missing information can often be provided by adding a manual connection mapping. See Alias (Manual) Connection Mappings Explained.