Lineage Missing for Tables Created by CTAS Statement
Problem
An automated scan of database lineage does not show lineage for tables created using CREATE TABLE AS SELECT (CTAS) statements. For example, if one has the code
CREATE TABLE SALES_BY_REGION AS
SELECT
REGION,
SUM(AMOUNT) AS AMOUNT
FROM DAILY_TRANSACTIONS
WHERE TRANSACTION_TYPE = 'SALES'
GROUP BY REGION;
and scans the tables SALES_BY_REGION
and DAILY_TRANSACTIONS
with Manta, one expects to see the data lineage
DAILY_TRANSACTIONS -> SALES_BY_REGION
. However, this is not the case.
More Details
There is no issue with how Manta Data Lineage processes CTAS statements. If Manta Data Lineage has access to them, they are processed correctly and included in the lineage.
The reason the lineage for these tables is missing is that the CTAS statements are not persisted anywhere in the database as metadata for the table. Even a command such as GET_DDL() (or its equivalent for a specific DB vendor) returns only a CREATE TABLE statement with individual columns and datatypes that will be inferred from the original source table rather than the CTAS statement.
For example:
CREATE TABLE SALES_BY_REGION ()
REGION VARCHAR(200),
AMOUNT DOUBLE
);
As a side note, please be aware of potential issues that may arise with CTAS statements. For example, if a user first creates the table
create table ctas1 as select * from t1
and then modifies it using
alter table ctas1 drop column a
, what should the semantics of the
original DDL be? What if we use drop table ctas1
? The whole concept of original DDL quickly becomes self-defeating. Hence, the logic of get_ddl()
which, as documentation states, Returns a DDL
statement that can be used to recreate the specified object. Once created, t1
and ctas1
can be modified independently without influencing one another. As such, they no longer have any connection, regardless of how they were
created.
Solution
Since the original CTAS statement is not stored as metadata for the table, there is no direct solution to this problem. However, there are a few methods that can be used to establish the lineage.
-
Run CTAS statements in a way that they can be automatically extracted and analyzed by Manta Data Lineage. This is the easiest and the best way to get the lineage covered.
Include the CTAS statement in code that is automatically extracted and analyzed by Manta Data Lineage, for example in a stored procedure. -
Avoid using CTAS statements, and replace them with regular INSERT INTO or CTE (Common Table Expression) as part of your code.
-
Approach #1 with Dynamic SQL
Some platforms (e.g., Oracle) do not allow users to directly place CREATE (TEMPORARY) TABLE statements in procedures and function code. It is possible to work around this by executing the CREATE TABLE or CTAS statement as dynamic SQL via EXECUTE IMMEDIATE. Please note that dynamic SQL generally is not analyzed by Manta Data Lineage; there is limited support for dynamic SQL implemented in the Oracle scanner. For other dialects, it is necessary to convert dynamically-executed SQL to static directly in the code extracted by Manta Data Lineage from the database prior to the dataflow analysis phase or to use the approach described in #4. -
Provide Manta Data Lineage with the CTAS statements as manual input for the database scanner.
Log every CTAS statement in a database table or a file, and then, provide it to Manta Data Lineage in the form of a file as manual input to be scanned. -
Use Open Manta Extensions to ingest the missing lineage.
Fill in the missing lineage with Open Manta Extension import. This approach works best for cases where the transformation logic in the CTAS statement is minimal — ideally just a copy of the table so that the metadata is easy to design.