Deduction: How IBM Manta Data Lineage Handles Unresolved References to Database Objects
Overview
When scanning source systems (databases, ETL tools, reporting tools, etc.), Manta Data Lineage relies on information stored in its database dictionary to properly understand what objects and types exist in the database (tables, views, columns, functions, schemas, etc.). This database dictionary is populated during the database scanners' extraction phase and is later used in the dataflow analysis phase by the database and other scanners (ETL, reporting tools, etc.). In cases where the database dictionary is not available (or is outdated), Manta Data Lineage deduction process is utilized. This article explains what the deduction process is, how it works, and its current limitations. It is critical that the database dictionary is available for the databases that are being accessed by ETL, reporting, or other tools or scripts; otherwise, the lineage cannot be automatically stitched together and the results may not be 100% accurate and/or complete as the deduction process cannot be 100% accurate. (See the details below for further explanation as to why this occurs.)
Deduction may also opearte in cases where the target database dictionary is available but Manta Data Lineage was not able to automatically match the connection reference in the particular tool (ETL, reporting, etc.) to the respective Manta Data Lineage connection (and the associated database dictionary). Refer to Alias (Manual) Connection Mappings Explained in such cases.
Description of the Deduction Process
When scanning the source systems, Manta Data Lineage might run into sources that reference objects it doesn’t recognize. In an SQL script, it might appear as in the example below.
SELECT surname FROM core.clients;
If during the extraction there was no object called clients
in the schema core
, Manta Data Lineage might not know what objects the surname
and
core.clients
names refer to.
This can happen for many different reasons: an outdated component might refer to objects that are no longer present in a database; a new component might refer to objects that weren’t scanned by Manta Data Lineage yet; the input might work with dynamically-created objects that aren’t present in the database most of the time (create - use - drop); the names might refer to parts of the system excluded from Manta scans; or one of many other reasons.
When this happens, Manta Data Lineage does not simply skip the unknown objects —instead, it tries to deduce what the unresolved reference is most likely to point to. Objects created this way are marked in the data lineage with the attribute Definition Source set to “DEDUCTION”. As of R42.1, the deduced objects are also marked by the “D” symbol in the graph visualization.
The key takeaway here is that while deduction can allow Manta Data Lineage to provide lineage even with incomplete information, the outputs are generally not as reliable as when the lineage is based on properly recognized, existing objects.
Limitations
As mentioned above, the deduction engine often needs to choose between multiple possible object types to make the deduction. This can lead to a disconnect in the lineage through the deduced object or the deduced object being completely missing. The most common cases are described below.
Incorrectly Deduced Type of an Object
Sometimes, the deduction is rather straightforward.
SELECT surname FROM core.clients;
Here, it’s quite clear that core.clients
refers to a table in the schema core
.
In other cases, it might not be as clear.
SELECT surname FROM clients;
Here, clients
might refer to either a table, a view, or a synonym, and there’s no way to tell for sure unless there’s a known table, view, or synonym of that name. We have to decide based on heuristics. Here, we have deduced that clients
is a table, based on our experience that in most databases, there are more tables than views or synonyms.
The decisions made by the deduction engine are sometimes even more complex and depend on other factors like known related objects or input syntax. In the following Oracle snippet, the utl_xml.create()
is likely a function call.
SELECT utl_xml.create(xmltext) FROM input_queue;
However, in Oracle functions can be part of a schema or a package, and the deduction outcome depends on whether we can resolve ult_xml
to an existing schema or an existing package.
Insert into an Unknown Table with Implicit Columns (Also Known as a “Blind Insert”)
It is tricky to insert into an unknown table — that is, a table that is not in the database dictionary that Manta Data Lineage created during the extraction process or in cases where no dictionary is available because the connection could not be
mapped automatically and no manual connection mapping was performed. In general cases, there is no guarantee that the column names in the query output will be unique or represent the actual table columns. (The insert happens by column order, not
by column name.) In the example below, it might look like some column names from the deduced table rep_client
are present, but first_name
and surname
are the names of candidates
columns and not
necessarily also the names of the INSERT’s target; and for the last two columns, there are simply no column names to use.
The target object will be deduced.
However, on the column level (as described above), it is not possible to reliably deduce column names. Manta Data Lineage will deduce a single
TABLE COLUMNS UNKNOWN
column with all the column-level flows going into it.
Incorrectly Deduced Type of an Object Causing Disconnected Lineage
In the below input, we have deduced clients
as a table and surname
as a column in this table (the upper part of the screenshot). Later, however, we might process a second input (the same or different script) that creates
a view.
SELECT surname FROM core.clients;
CREATE OR REPLACE VIEW core.clients AS
SELECT surname FROM core.parties WHERE party_type = 'client';
In the second input, it’s clear that clients
is a view, so the lineage will lead from core.parties
to the view core.clients
. But the former input resulted in lineage from a (deduced) table
core.clients
. And what was supposed to be a single CLIENTS
object resulted in two separate disconnected deduced objects.
Unqualified Object Names/References
Objects, such as tables, views, procedures, functions, etc., used in SQL queries can be fully-, semi-, or unqualified.
-
An example of a fully-qualified reference to the table/view
customer
where the database (dwh_db
) and schema (core
) are specified when referencing the table in the query.SELECT * FROM dwh_db.core.customer
Note that Hive, Oracle, and Teradata do not support this three-level hierarchy. See the next bullet point.
-
An example of a semi-qualified reference to the table/view
customer
where the schema (core
) is specified when referencing the table in the query but the database specification is missing. The current database will be used when executing the query.SELECT * FROM core.customer
Note that for Hive, Teradata, and Oracle this syntax would actually represent a fully-qualified reference as Teradata does not have schemas. (Objects are placed directly in a database.) Oracle uses different syntax (DB Links) to reference another database.
-
An example of an unqualified reference to the table/view
customer
where no schema or database is specified. The current database and schema will be used when executing the query.SELECT * FROM customer
In cases where the object name is semi- or unqualified and there is no other piece of information providing additional context, Manta Data Lineage may run into ambiguity that often leads to disconnected lineage. This is usually not the case for
scripts that live in the database (such as views and stored procedures) when extracted by Manta Data Lineage (any semi- or unqualified references to the object have the context of current schema/database coming from their location within
the database. However, especially for manually provided script files, the context is missing unless specified in the input folder hierarchy. See the details for individual scanners; for example, Oracle PL/SQL Manual Inputs for Oracle. The logic for handling unqualified object references is technology dependent, and Manta Data Lineage reflects that in the individual scanners. For example, MS SQL searches for objects in the default dbo
schema if not found
in the current schema; Oracle has public synonyms that may be referred to by unqualified names; and Snowflake has a configurable search path that determines which schemas are searched. Manta Data Lineage generally deduces all unqualified
names as the current schema, which can lead to disconnected lineage if a mix of qualified and unqualified names is used. For example:
SET SCHEMA core;
INSERT INTO clients (surname)
SELECT surname FROM source_table;
INSERT INTO target_table (surname)
SELECT surname FROM dbo.clients;
In the first INSERT statement, the clients
table is deduced in the current schema (core
), but in the second one, a qualified name makes it clear that the clients
table should be deduced in the schema dbo
.
Impossible to Determine Object at All
There are cases where the deduction engine won’t be able to deduce an appropriate object. Here are some examples.
-
Selecting from an unknown table using a wildcard
SELECT * FROM employees;
There is no information to base the deduction on in this statement — no column names, not even the column count. Note that in this specific case, some scanners produce a single deduced column named
TABLE COLUMNS UNKNOWN
in the source table in an attempt to preserve at least table-level lineage.This is how
TABLE COLUMNS UNKNOWN
might look for the following input.INSERT INTO target_table (name, surname) SELECT * FROM employees
Table-level lineage:
Column-level lineage:
-
Using an unqualified column name when selecting a column from multiple unknown tables
SELECT name FROM employees LEFT JOIN offices ON (employees.office_id = offices.id)
Although the name of the column is known here, there’s no information as to whether the
name
column should be deduced in theemployees
or theoffices
table.
Deduction in Connections between Source Systems
When a source system (like an ETL or reporting tool) uses a database, some or all the information needed to resolve the database queries may be missing. In such a case, the upper levels of the object hierarchy (server, database, schema) may need to be deduced as default values.
If the information about the server name is missing, DEFAULT_SERVER
or
__dummy__
is often used as a deduced default server name. For a database, DEFAULT_DATABASE
might be used, etc. In some cases where reasonable defaults exist (like dbo
for MS SQL or PUBLIC
for
PostgreSQL), these default names are used where the information is missing.
The missing information can often be provided by adding a manual connection mapping. See Alias (Manual) Connection Mappings Explained.