PostgreSQL lineage configuration

To import lineage metadata from PostgreSQL, create a connection, a data source definition, and a metadata import job.

This information applies to the IBM Manta Data Lineage service.

PostgreSQL is an open source and customizable object-relational database.

Supported PostgreSQL versions

  • PostgreSQL version 8.2 or newer (remote)

Processed metadata

The following PostgreSQL metadata is processed and displayed on lineage:

  • Data dictionaries
  • Scripts
  • Views
  • Functions (SQL and PL/pgSQL)
  • Custom types

Limitations

The following limitations apply:

  • Assets that use dynamically executed code through the EXECUTE IMMEDIATE statement are not processed.
  • EDB assets with source code obfuscated using the EDB*Wrap utility are not processed (see Obfuscating source code for more details). Unobfuscated source code might be provided manually as PostgreSQL input to enable lineage analysis.

Prerequisite configuration

Before you import lineage metadata, ensure that the following prerequisites are met:

  • You must have the CONNECT rights to connect to each extracted database.
  • You must have the SELECT rights (granted implicitly as part of the PUBLIC role) on the following tables and views in pg_catalog:
    • pg_database
    • pg_namespace
    • pg_class
    • pg_attribute
    • pg_type
    • pg_proc
    • pg_language
    • pg_description
    • pg_constraint
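
The prerequisites above can be verified before you run the import. The following is a minimal sketch, not part of the product: the helper `privilege_check_queries` and the role name `lineage_user` are hypothetical. It only builds SQL text that calls PostgreSQL's built-in `has_database_privilege` and `has_table_privilege` functions; you would run the resulting statements with a client of your choice.

```python
from typing import List

# Catalog relations listed in the prerequisites above.
CATALOG_RELATIONS = [
    "pg_database", "pg_namespace", "pg_class", "pg_attribute", "pg_type",
    "pg_proc", "pg_language", "pg_description", "pg_constraint",
]


def quote_literal(value: str) -> str:
    """Quote a value as a SQL string literal by doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def privilege_check_queries(role: str, database: str) -> List[str]:
    """Build one CONNECT check plus one SELECT check per catalog relation.

    Hypothetical helper: the role and database names are illustrative.
    """
    queries = [
        "SELECT has_database_privilege("
        f"{quote_literal(role)}, {quote_literal(database)}, 'CONNECT');"
    ]
    for relation in CATALOG_RELATIONS:
        qualified = quote_literal("pg_catalog." + relation)
        queries.append(
            f"SELECT has_table_privilege({quote_literal(role)}, {qualified}, 'SELECT');"
        )
    return queries


if __name__ == "__main__":
    for query in privilege_check_queries("lineage_user", "myDB"):
        print(query)
```

Each statement returns a boolean, so any `false` result points at a privilege that still needs to be granted.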

Creating a metadata import asset

Data source connection

To connect to the data source from which you want to import lineage metadata, you need to select a data source definition and a connection. You can create them before you start creating the metadata import, or you can create them when you create and configure the metadata import asset.

Data source definition

Select PostgreSQL as the data source type.

Connection

For connection details, see PostgreSQL connection.

Connection mode

You can connect to PostgreSQL by using one of the following connection modes:

Include and exclude lists

You can include or exclude assets up to the schema level. Provide databases and schemas in the format database/schema. Each part is evaluated as a regular expression. Assets that are added to the data source later are also included or excluded if they match the conditions specified in the lists. Example values:

  • myDB/: all schemas in myDB database.
  • myDB2/.*: all schemas in myDB2 database.
  • myDB3/mySchema1: mySchema1 schema from myDB3 database.
  • myDB4/mySchema[1-5]: any schema in the myDB4 database with a name that starts with mySchema and ends with a digit between 1 and 5.
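
To preview which schemas a list entry would select, the example values above can be modeled as follows. This is a sketch under stated assumptions, not the service's actual matching code: it assumes each part must match the whole name, that an empty schema part (as in myDB/) matches every schema, and that an exclude entry overrides an include entry. The function names `matches_entry` and `is_selected` are hypothetical.

```python
import re


def matches_entry(entry: str, database: str, schema: str) -> bool:
    """Check one database/schema entry against a database and schema name.

    Assumptions: each part must match the whole name, and an empty
    schema part (as in "myDB/") matches every schema.
    """
    db_pattern, _, schema_pattern = entry.partition("/")
    if re.fullmatch(db_pattern, database) is None:
        return False
    return schema_pattern == "" or re.fullmatch(schema_pattern, schema) is not None


def is_selected(database, schema, include, exclude=()):
    """Apply include and exclude lists; exclusion wins (an assumption)."""
    included = any(matches_entry(e, database, schema) for e in include)
    excluded = any(matches_entry(e, database, schema) for e in exclude)
    return included and not excluded


if __name__ == "__main__":
    print(matches_entry("myDB4/mySchema[1-5]", "myDB4", "mySchema3"))
    print(matches_entry("myDB4/mySchema[1-5]", "myDB4", "mySchema6"))
```

Under these assumptions, the first call reports a match and the second does not, because 6 is outside the range 1 to 5.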

External inputs

If you use external PostgreSQL SQL scripts, you can add them in a .zip file as an external input. You can organize the structure of a .zip file as subfolders that represent databases and schemas. After the scripts are scanned, they are added under respective databases and schemas in the selected catalog or project. The .zip file can have the following structure:

    <database_name>
        <schema_name>
           <script_name.sql>
    <database_name>
        <script_name.sql>
    <script_name.sql>
    replace.csv

The replace.csv file contains placeholder replacements for the scripts that are added in the .zip file. For more information about the format, see Placeholder replacements.
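
The archive layout above can be produced with any zip tool. As one possible approach, the following sketch uses Python's standard `zipfile` module. The function `build_external_input` and all script and database names are hypothetical, and the content of replace.csv is passed through unchanged because its format is described in Placeholder replacements, not here.

```python
import io
import zipfile


def build_external_input(scripts: dict, replacements_csv: str = "") -> bytes:
    """Pack scripts into a .zip archive held in memory.

    `scripts` maps an archive path such as "myDB/mySchema/view.sql" to the
    script text. All names here are hypothetical examples.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path, sql_text in scripts.items():
            archive.writestr(path, sql_text)
        if replacements_csv:
            # replace.csv sits at the root of the archive, as in the layout above.
            archive.writestr("replace.csv", replacements_csv)
    return buffer.getvalue()


if __name__ == "__main__":
    data = build_external_input({
        "myDB/mySchema/create_view.sql": "SELECT 1;",  # database and schema level
        "myDB/load.sql": "SELECT 2;",                  # database level only
        "adhoc.sql": "SELECT 3;",                      # no database or schema
    })
    with open("external_input.zip", "wb") as handle:
        handle.write(data)
```

Building the archive in memory keeps the helper free of side effects; writing it to disk is left to the caller.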

Advanced import options

Extract extended attributes
You can extract extended attributes such as primary key, unique, and referential integrity constraints on columns. By default, these attributes are not extracted.
Extraction mode
You can decide which extraction mode to run for the imported metadata. You have the following options:
  • Prefetch: use it for relational databases.
  • Parallel bulk: use it for analytical processing engines.
  • Single-thread: use it to avoid parallelism and large queries during extraction. When you select this mode, performance might be low.
Performance profile
For selected data sources you can choose a performance profile. Depending on your current needs, the lineage metadata import might be faster or more complete. You can choose between the following profiles:
  • Fast: Low time and memory consumption are the priorities in this profile. If your input is large, lineage might not be complete.
  • Balanced: Both performance and lineage completeness are important. This profile is a compromise between lineage completeness and the time and memory that are spent on the lineage import.
  • Complete: Lineage completeness is the priority in this profile. If your input is large, the lineage import might take a significant amount of resources and time.
  • Custom profile: You can create your own performance profile by providing values for the following properties:
    • Dataflow Analysis Timeout Limit: Specifies the maximum estimated time (in seconds) after which the dataflow analysis of a single input is stopped. The time is checked when each node is added, or in some cases when edges are created. Therefore, in some cases, the timeout might slightly exceed the specified limit. If you set the value to 0, the analysis is not stopped. Example value: 60.
    • Dataflow Analysis Edge Limit: Specifies the maximum number of edges that are allowed for a single input during the dataflow analysis. If this limit is exceeded, all filter edges are removed and no more filter edges are added. If the limit is still exceeded even after that, the analysis is stopped and the input fails. To disable the limit, set the value to 0. Example value: 2500.
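
The edge limit behavior described above can be pictured with a small model. This is an illustrative sketch only, not the product's implementation: the function `apply_edge_limit`, the edge kinds "direct" and "filter", and the status strings are all assumptions made for the example.

```python
def apply_edge_limit(edges, edge_limit):
    """Return (kept_edges, status) for a list of (kind, source, target) edges.

    Illustrative model only: `kind` is "direct" or "filter"; these labels
    and the status strings are assumptions, not product values.
    """
    # A limit of 0 disables the check.
    if edge_limit == 0 or len(edges) <= edge_limit:
        return list(edges), "ok"
    # Limit exceeded: drop all filter edges first.
    without_filters = [edge for edge in edges if edge[0] != "filter"]
    if len(without_filters) <= edge_limit:
        return without_filters, "filter edges removed"
    # Still over the limit: the analysis stops and the input fails.
    return [], "failed"


if __name__ == "__main__":
    sample = [("direct", "a", "b"), ("filter", "a", "c"), ("direct", "b", "c")]
    print(apply_edge_limit(sample, 2))
```

With a limit of 2, the sample keeps its two direct edges and loses the filter edge; with a limit of 1, the model reports a failure.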
Transformation logic extraction
You can enable building transformation logic descriptions from SQL code in SQL scripts.
