Getting ETL job lineage (Watson Knowledge Catalog)

Capture end-to-end data lineage for ETL jobs. Add business data lineage to data integration assets and, optionally, to imported data assets in your catalogs, and access detailed technical data lineage in MANTA Automated Data Lineage.

Lineage for ETL jobs depicts the order of activities within a job, optionally including the database tables that the job reads from or writes to. It also shows how data flows to or from a selected data asset through a job into databases and business intelligence (BI) reports.

The import option is not available for projects that are marked as sensitive.

Before you import metadata, design your metadata import to ensure that you understand all your options and make appropriate choices for your goals. For more information, see Designing metadata imports.

Instead of using the user interface, you can use APIs to retrieve the list of supported connections or to create a metadata import asset. The links to these APIs are listed in the Learn more section.

To run metadata enrichment on data assets that were added with an ETL job lineage import, make the data assets available in a project. For more information, see Adding catalog assets to a project.

Asset types

The import creates data integration assets that represent components of ETL jobs and, optionally, data assets that serve as the source or target in an ETL job. For more information, see Asset types created through metadata import.

These asset types are created starting in Cloud Pak for Data 4.7.2.

Supported connections

See the Metadata import (lineage) column and the Other data sources section in Supported connectors.

Required permissions

To create, manage, or run a metadata import, you must have the following roles and permissions:

  • The Manage asset discovery user permission.
  • The Admin or the Editor role in the project.
  • The Admin or the Editor role in the catalog to which you want to import the assets.
  • Access to the connections to the data sources of the data assets to be imported and the SELECT or a similar permission on the corresponding databases.
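The permission list above can be read as a simple checklist. The following Python sketch is illustrative only: the UserAccess class, its field names, and the missing_requirements function are hypothetical and are not part of any Cloud Pak for Data API. It also assumes that the catalog role matters only when a catalog is the import target.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Role names as listed in the Required permissions section.
ALLOWED_ROLES = {"Admin", "Editor"}
REQUIRED_PERMISSION = "Manage asset discovery"


@dataclass
class UserAccess:
    """Hypothetical summary of what a user holds (not a product object)."""
    permissions: Set[str] = field(default_factory=set)
    project_role: str = ""
    # None means that no catalog is used as the import target (assumption).
    catalog_role: Optional[str] = None
    can_access_connections: bool = False


def missing_requirements(user: UserAccess) -> List[str]:
    """Return the requirements from the list above that the user lacks."""
    missing = []
    if REQUIRED_PERMISSION not in user.permissions:
        missing.append("Manage asset discovery user permission")
    if user.project_role not in ALLOWED_ROLES:
        missing.append("Admin or Editor role in the project")
    if user.catalog_role is not None and user.catalog_role not in ALLOWED_ROLES:
        missing.append("Admin or Editor role in the target catalog")
    if not user.can_access_connections:
        missing.append("access to the data source connections")
    return missing


if __name__ == "__main__":
    viewer_only = UserAccess({"Manage asset discovery"}, "Viewer", "Admin", True)
    for item in missing_requirements(viewer_only):
        print("Missing:", item)
```

An empty result means that every requirement in the list is met. Database-level permissions such as SELECT are not modeled here.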

Prerequisites

Before you generate and import an ETL job lineage, complete the prerequisite tasks:

  • For InfoSphere DataStage, Talend, or Informatica PowerCenter ETL jobs, create an ETL job file and upload it to your project. For more information, see Preparing ETL job files.
  • For DataStage flows (DataStage on Cloud Pak for Data), select flows from your project.

Creating a metadata import asset and generating or importing lineage metadata for ETL jobs

To create a metadata import asset and a job for generating and importing technical and lineage metadata for ETL jobs:

  1. Open a project, go to the project's Assets page, and click New asset > Metadata Import.

  2. Select the Get ETL lineage option. If you don't see this option, the Advanced metadata import feature is not enabled or no license key is installed. For more information, see Installed features and license requirements.

  3. Specify a name for the metadata import. Optionally, provide a description.

  4. Optional: To simplify searching, select tags to be assigned to the metadata import asset. To create new tags, enter the tag name and press Enter.

  5. In Cloud Pak for Data 4.7.0 and 4.7.1, you can't select a target for this type of import. The resulting lineage information is available in MANTA Automated Data Lineage.

    Starting with Cloud Pak for Data 4.7.2, you can select a catalog as the import target. For this type of import, the import target can be only a catalog. For more information, see Scope of import. Select a catalog from the list.

    If your project is marked as sensitive, you can't create and run ETL job lineage imports.

  6. Define a scope for the metadata import. For more information, see Scope of import.

    1. Select the ETL job input for the import.

      1. Click Select file to pick an ETL job file from your project for the import. You can select only one file at a time. For more information, see Preparing ETL job files.

        An ETL job file is static. You can't update its content for a later rerun of the metadata import. You must create a new metadata import to work with a new version of the ETL job file.

      2. After you select an ETL job file, select the data integration tool that the file comes from so that its structure is interpreted correctly.

      If you want to capture the lineage of DataStage flows that exist in your project, you have the following options:

      • To select individual flows, click Select file and select the flows that you want to import. You can select more than one flow for a single metadata import. In Cloud Pak for Data 4.7.0 and 4.7.1, use this option only for DataStage flows that don't have any dependencies other than connections. Starting in Cloud Pak for Data 4.7.2, all DataStage flow dependencies are automatically included in the scope.
      • To select all DataStage flows, use the Select all DataStage flows and their dependencies in the project option. By using this option, you include all DataStage flows in your project in the scope and skip the step for individual selection.
    2. Optional: Select the source and target assets that are associated with the ETL job to include technical and lineage metadata for them.

      You can select connections that exist in the project. Also, you can click Create a new connection and create a connection asset. You can import metadata and lineage from the data sources that are listed in Supported connectors.

    3. Review the selected scope.

  7. Define whether you want to run scheduled import jobs. If you don't set a schedule, you run the import when you save the metadata import asset. You can rerun the import manually at any time. For more information, see Scheduling options.

  8. Optional: Customize the import behavior. You can choose to prevent specific properties from being updated and to delete existing assets that are not included in the reimport. For more information, see Advanced import options.

    You can set advanced options for this type of metadata import starting in Cloud Pak for Data 4.7.2.

  9. Review the metadata import configuration. To make changes, click the edit icon on the tile and update the settings.

  10. Click Create. The metadata import asset is added to the project and a metadata import job is created. If you didn't configure a schedule, the import runs immediately. If you configured a schedule, the import runs on the defined schedule.

    The following information applies only if you work with Cloud Pak for Data 4.7.2 or later.

    Important: If the ETL job or set of DataStage flows was already imported through a different metadata import, it is not imported again but is updated. The data integration assets no longer show up in the initial metadata import. Only the most recently run metadata import contains the assets.

    If you chose to include data assets in your ETL job lineage import, the same applies to those data assets.
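The version-dependent rules in the procedure above can be summarized in a small validation sketch. The following Python code is only an illustration under stated assumptions: the EtlLineageImport class, its field names, and the validate function are hypothetical and do not correspond to any product API. Only the rules themselves (no import in sensitive projects, catalog target and advanced options starting in 4.7.2) are taken from this topic.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# First Cloud Pak for Data version with a catalog target and advanced options.
CATALOG_TARGET_SINCE = (4, 7, 2)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a version string such as '4.7.2' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


@dataclass
class EtlLineageImport:
    """Hypothetical description of a planned import (not a product object)."""
    cpd_version: str
    project_is_sensitive: bool = False
    target_catalog: Optional[str] = None
    uses_advanced_options: bool = False


def validate(cfg: EtlLineageImport) -> List[str]:
    """Return the rules from this topic that the planned import violates."""
    problems = []
    if cfg.project_is_sensitive:
        problems.append("ETL job lineage imports are not available in sensitive projects")
    if parse_version(cfg.cpd_version) < CATALOG_TARGET_SINCE:
        if cfg.target_catalog is not None:
            problems.append("a catalog target requires Cloud Pak for Data 4.7.2 or later")
        if cfg.uses_advanced_options:
            problems.append("advanced import options require Cloud Pak for Data 4.7.2 or later")
    return problems


if __name__ == "__main__":
    plan = EtlLineageImport("4.7.1", target_catalog="Sales catalog")
    for problem in validate(plan):
        print("Problem:", problem)
```

Comparing versions as integer tuples avoids the pitfalls of string comparison, for example "4.7.10" sorting before "4.7.2".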

Import results

Lineage imports are long-running processes. Don't expect immediate results.

Import results in Cloud Pak for Data 4.7.0 and 4.7.1

When the import job is complete, lineage information is available in MANTA Automated Data Lineage. You can access that information in the MANTA Automated Data Lineage UI. Depending on the type of the ETL job, you might need different information:

  1. Get the necessary information:
    • For DataStage flows, you need the ID of your metadata import asset to locate the lineage information. To identify this ID, open the metadata import asset. The asset ID is part of the URL: https://<hostname>/gov/metadata-imports/<asset_ID>?project_id=<project_ID>.
    • For legacy DataStage or Talend ETL jobs, you need the asset ID of the ETL job file that is used for the import. To identify this ID, open the ETL job file by clicking its name on the project's Assets page. The asset ID is part of the URL: https://<hostname>/projects/<project_ID>/data-assets/<asset_ID>.
  2. To open the lineage viewer in MANTA Automated Data Lineage, open the metadata import asset and click the link in the result area. Alternatively, you can enter the following URL in a new browser window. Replace <hostname> with the hostname of your Cloud Pak for Data deployment.
    https://<hostname>/manta-dataflow-server/viewer
    
  3. Expand the entry for your data source, for example, DataStage.
  4. Locate the entry for your metadata import asset or your ETL job file, which is the asset ID with the suffix _lineage, and expand it.
  5. Select the elements for which you want to view lineage information and click Visualize.
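The lookup in the steps above can be sketched in Python. The code extracts the asset ID from either of the two URL patterns shown in step 1, appends the _lineage suffix to get the entry name to look for in the viewer, and builds the viewer URL. The function names and the example hostname and IDs are hypothetical and serve only as an illustration.

```python
from urllib.parse import urlparse

# Path of the lineage viewer as given in step 2.
VIEWER_PATH = "/manta-dataflow-server/viewer"


def asset_id_from_url(url: str) -> str:
    """Return the path segment that follows 'metadata-imports' or 'data-assets'."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    for marker in ("metadata-imports", "data-assets"):
        if marker in segments:
            index = segments.index(marker)
            if index + 1 < len(segments):
                return segments[index + 1]
    raise ValueError(f"no asset ID found in URL: {url}")


def lineage_entry_name(asset_id: str) -> str:
    """Name of the viewer entry: the asset ID with the suffix _lineage."""
    return f"{asset_id}_lineage"


def viewer_url(hostname: str) -> str:
    """URL of the lineage viewer for the given deployment hostname."""
    return f"https://{hostname}{VIEWER_PATH}"


if __name__ == "__main__":
    # Example values are made up for illustration.
    url = "https://cpd.example.com/gov/metadata-imports/1234abcd?project_id=5678efgh"
    print(lineage_entry_name(asset_id_from_url(url)))
    print(viewer_url("cpd.example.com"))
```

Because urlparse separates the query string from the path, the project_id parameter does not interfere with locating the asset ID.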

Import results in Cloud Pak for Data 4.7.2 and later

When the import is complete, you can view the list of imported assets with the following information:

  • The asset name, which provides a link to the asset in the catalog.
  • The asset type, such as Data integration job. For data assets, also the format, such as Relational table, is displayed.
  • The date and time that the asset was last imported.
  • The import status, which can be Imported for successfully imported data, In progress, or Removed if the asset couldn't be reimported.

When the import is complete, the imported assets and their business data lineage are available in the catalog that you selected as target. The imported lineage is available on the asset's Lineage tab. Extra lineage information is available in MANTA Automated Data Lineage. You can access that information through the Go to asset's technical data lineage link in the About the asset panel.

Depending on the outcome of the metadata import job run, a completion message or an error notification is displayed.

A completion message is displayed when the job run completed successfully, completed with warnings, or completed with errors. An error notification is displayed if the entire job run failed. Either type of notification contains a link to the job run log that provides details about the specific job run.
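The mapping from job run outcome to notification type described above is small enough to state as code. In this Python sketch, the RunOutcome enumeration and the notification_type function are hypothetical names chosen for illustration; they are not identifiers from the product.

```python
from enum import Enum


class RunOutcome(Enum):
    """Possible outcomes of a metadata import job run, as described above."""
    COMPLETED = "completed successfully"
    COMPLETED_WITH_WARNINGS = "completed with warnings"
    COMPLETED_WITH_ERRORS = "completed with errors"
    FAILED = "failed"


def notification_type(outcome: RunOutcome) -> str:
    """Only a run that failed entirely produces an error notification."""
    if outcome is RunOutcome.FAILED:
        return "error notification"
    return "completion message"


if __name__ == "__main__":
    for outcome in RunOutcome:
        print(f"{outcome.value}: {notification_type(outcome)}")
```

Note that a run that completed with errors still produces a completion message, so the job run log is the place to check for partial failures.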

Learn more
