Microsoft Azure Databricks lineage configuration

To import lineage metadata from Microsoft Azure Databricks, create a data source definition, a connection, and a metadata import.

This information applies to the IBM Manta Data Lineage service.
Lineage import from a Microsoft Azure Databricks data source is available in Cloud Pak for Data 5.1.1 and later.

To import lineage metadata for Microsoft Azure Databricks, complete these steps:

  1. Create a data source definition.
  2. Create a connection to the data source in a project.
  3. Create a metadata import.

Creating a data source definition

Create a data source definition. Select Microsoft Azure Databricks as the data source type.

Creating a connection to Microsoft Azure Databricks

Create a connection to the data source in a project. For connection details, see Microsoft Azure Databricks connection.

Creating a metadata import

Create a metadata import. The following options are specific to the Microsoft Azure Databricks data source:

Include and exclude lists

You can include or exclude assets up to the schema level. Provide catalogs and schemas in the format catalog/schema. Each part is evaluated as a regular expression. Assets that are added to the data source later are also included or excluded if they match the conditions that are specified in the lists. Example values:

  • myCatalog/: all schemas in myCatalog
  • myCatalog/.*: all schemas in myCatalog
  • myCatalog3/mySchema1: mySchema1 from myCatalog3
  • myCatalog4/mySchema[1-5]: any schema in myCatalog4 with a name that starts with mySchema and ends with a digit between 1 and 5
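The matching of list entries can be illustrated with a minimal Python sketch. It is not the product implementation. The function name matches_entry is hypothetical, and the sketch assumes that each part must match the whole catalog or schema name and that an empty schema part (as in myCatalog/) matches every schema.

```python
import re


def matches_entry(entry: str, catalog: str, schema: str) -> bool:
    """Check a catalog and schema against one include or exclude list entry."""
    # Split "catalog/schema" at the first slash into its two patterns.
    catalog_pattern, _, schema_pattern = entry.partition("/")
    # Assumption: an empty schema part, as in "myCatalog/", matches all schemas.
    if not schema_pattern:
        schema_pattern = ".*"
    # Assumption: each pattern must match the full name, not only a prefix.
    return (
        re.fullmatch(catalog_pattern, catalog) is not None
        and re.fullmatch(schema_pattern, schema) is not None
    )


# Try the example values from the list above against a few schema names.
for entry in ("myCatalog/", "myCatalog3/mySchema1", "myCatalog4/mySchema[1-5]"):
    catalog = entry.split("/")[0]
    for schema in ("mySchema1", "mySchema7"):
        print(entry, schema, matches_entry(entry, catalog, schema))
```

Testing entries this way before you save them can help you confirm that a pattern selects only the schemas that you intend.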

External inputs

If you use external Microsoft Azure Databricks dll archives, you can add them in a .zip file as an external input. You can organize the structure of the .zip file as the dll folder with subfolders or archives that represent the workspace structure. The .zip file can have the following structure:

<dll>
    <catalog_name_folder>
      <schema_name_folder>
        <tables>
          <table_name.sql>
        <views>
          <view_name.sql>
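One way to produce such an archive is sketched below in Python, using only the standard library. The function name build_external_input and the catalog, schema, table, and view names are hypothetical. The sketch assumes that the top-level folder is named dll, as in the structure above.

```python
import io
import zipfile


def build_external_input(objects, root="dll"):
    """Pack SQL definitions into a .zip as root/catalog/schema/kind/name.sql."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for catalog, schema, kind, name, sql in objects:
            # Only the two object folders from the documented structure are used.
            if kind not in ("tables", "views"):
                raise ValueError(f"unsupported folder: {kind}")
            archive.writestr(f"{root}/{catalog}/{schema}/{kind}/{name}.sql", sql)
    return buffer.getvalue()


# Hypothetical objects for illustration only.
objects = [
    ("myCatalog", "mySchema1", "tables", "orders",
     "CREATE TABLE orders (id INT);"),
    ("myCatalog", "mySchema1", "views", "recent_orders",
     "CREATE VIEW recent_orders AS SELECT id FROM orders;"),
]
archive_bytes = build_external_input(objects)

# List the entries to confirm the folder layout before uploading the file.
for entry in zipfile.ZipFile(io.BytesIO(archive_bytes)).namelist():
    print(entry)
```

To upload the result, write archive_bytes to a file with the .zip extension and add that file as the external input.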

Advanced import options

Performance profile
For selected data sources, you can choose a performance profile. Depending on the profile, the lineage metadata import is either faster or more complete. You can choose between the following profiles:
  • Fast: Low time and memory consumption are the priorities in this profile. If your input is large, lineage might not be complete.
  • Balanced: Both performance and lineage completeness are important. This profile is a compromise between lineage completeness and the time and memory that are spent on lineage import.
  • Complete: Lineage completeness is the priority in this profile. If your input is large, the lineage import might take a significant amount of resources and time.
  • Custom profile: You can create your own performance profile by providing values for the following properties:
    • Dataflow Analysis Timeout Limit: Specifies the maximum estimated time (in seconds) after which the dataflow analysis of a single input is stopped. The time is checked when each node is added, or in some cases when edges are created. Therefore, in some cases, the timeout might slightly exceed the specified limit. If you set the value to 0, the analysis is not stopped. Example value: 60.
    • Dataflow Analysis Edge Limit: Specifies the maximum number of edges that are allowed for a single input during the dataflow analysis. If this limit is exceeded, all filter edges are removed and no more filter edges are added. If the limit is still exceeded even after that, the analysis is stopped and the input fails. To disable the limit, set the value to 0. Example value: 2500.
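The behavior of the Dataflow Analysis Edge Limit property can be pictured with the following Python sketch. It is a simplified model of the description above, not the product's code. The class names Edge, EdgeBudget, and EdgeLimitExceeded are hypothetical, and the sketch assumes that the limit is exceeded when the edge count is greater than the configured value.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    source: str
    target: str
    is_filter: bool = False


class EdgeLimitExceeded(Exception):
    """Raised when the analysis of a single input must be stopped."""


class EdgeBudget:
    def __init__(self, limit: int):
        self.limit = limit  # 0 disables the limit
        self.edges = []
        self.filter_edges_dropped = False

    def add(self, edge: Edge) -> None:
        # After the limit was hit once, no more filter edges are added.
        if self.filter_edges_dropped and edge.is_filter:
            return
        self.edges.append(edge)
        if self.limit == 0 or len(self.edges) <= self.limit:
            return
        # First reaction to exceeding the limit: remove all filter edges.
        if not self.filter_edges_dropped:
            self.edges = [e for e in self.edges if not e.is_filter]
            self.filter_edges_dropped = True
        # Still over the limit: stop the analysis and fail the input.
        if len(self.edges) > self.limit:
            raise EdgeLimitExceeded(f"more than {self.limit} edges")


budget = EdgeBudget(limit=3)
budget.add(Edge("a", "b"))
budget.add(Edge("b", "c"))
budget.add(Edge("x", "c", is_filter=True))
budget.add(Edge("c", "d"))  # exceeds the limit, so the filter edge is removed
print(len(budget.edges), budget.filter_edges_dropped)
```

In this model, a lower limit trades filter edges for a completed analysis first, and fails the input only when the remaining edges alone exceed the limit.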
Display table lineage
Generate edges between tables for which the column-level lineage information was not found.
