Databricks Resource Configuration

Source System Properties

This configuration can be set up by creating a new connection on the Admin UI > Connections tab or by editing an existing connection in Admin UI > Connections > Databases > Databricks > specific connection. A new connection can also be created via the Manta Orchestration API.

The granularity of the IBM Automatic Data Lineage connection for Databricks is one Databricks instance. Using multiple connections against a single Databricks instance may lead to within-system lineage not being connected properly.

Property name

Description

Example

databricks.system.id

Name of a resource representing this Databricks system. This is an arbitrary string that distinguishes the analyzed Databricks instance from others.

my-databricks-system

databricks.instance.url

Server hostname of the Databricks instance. This refers to the URL that is used to log in to the Databricks instance.

Both the hostname and the full URL are supported. If the URL contains a port number, the port is ignored.

dbc-a6ca523-ab83.cloud.databricks.com

https://dbc-a6ca523-ab83.cloud.databricks.com

databricks.instance.authToken

Authorization token for the Databricks instance. This token is used to access Databricks APIs to retrieve data about notebooks or tables.

To generate an authorization token in the Databricks UI, go to User Settings > Access Tokens. You can also generate the token using the Databricks Token API.

dapi07f3ab2c4b081568db69a598d5359ab3

databricks.instance.port

Port for the Databricks cluster. This value is used when establishing the JDBC connection with the Databricks cluster.

To find the required value from the Databricks UI, go to Compute > Cluster > Configuration > Advanced Options > JDBC/ODBC.

Default value: 443

443

databricks.instance.httpPath

HTTP path for the Databricks cluster. This value is used when establishing the JDBC connection with the Databricks cluster.

To find the required value from the Databricks UI, go to Compute > Cluster > Configuration > Advanced Options > JDBC/ODBC.

sql/protocolv1/o/7702183245859201/0133-144536-ba0caa3m
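For orientation, the URL, port, and HTTP path typically combine into a Databricks JDBC URL along these lines. This is a sketch based on the standard Databricks JDBC driver format; the exact connection string built by the scanner may differ:

  jdbc:databricks://dbc-a6ca523-ab83.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/7702183245859201/0133-144536-ba0caa3m;AuthMech=3;UID=token;PWD=<databricks.instance.authToken>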

databricks.filter.workspace.path.root

A comma-separated list of root paths in the Databricks workspace. The workspace tree is searched for objects (e.g., notebooks) under these paths.

Setting this value to something other than the default can improve performance if you are only interested in scanning specific notebooks rather than all notebooks in the workspace. Set it to a value obtained from the workspace tree in the Databricks UI. Note that the only way to display the complete workspace tree in Automatic Data Lineage is to scan all notebooks. The property currently supports setting only a single value.

Default value: /

/

/path/to/some/notebooks,/path/to/other/notebooks

databricks.filter.schemas.extracted

Limit the extracted catalog schemas by explicitly specifying a comma-separated list of catalogs and schemas to be extracted, provided in the format catalog/schema. Each part is evaluated as a regular expression. Leave blank to extract all catalogs and schemas.

Default value: <blank>

catalog1/schema1,catalog2/schema2,catalog3
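Since each part is evaluated as a regular expression, a hypothetical filter such as the following would extract every schema starting with stg_ in the sales catalog, plus (following the bare catalog3 pattern in the example above) all schemas of the finance catalog. The catalog and schema names here are invented for illustration:

  databricks.filter.schemas.extracted=sales/stg_.*,finance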

databricks.filter.schemas.excluded

Limit the extracted catalog schemas by explicitly specifying a comma-separated list of catalogs and schemas to be excluded, provided in the format catalog/schema. Each part is evaluated as a regular expression.

Default value: .*/information_schema

catalog1/schema1,catalog2/schema2,catalog3

databricks.extraction.method

Selects the extraction method:

  • Agent:default — the default Manta Extractor Agent is used for extraction.

  • Agent:{remote_agent_name} — a remote Agent is used for extraction.

  • Git:{git.dictionary.id} — the Git ingest method is used for extraction.

For more information on setting up a remote extractor Agent, refer to the Manta Flow Agent Configuration for Extraction documentation. For additional details on configuring the Git ingest method, refer to the Manta Flow Agent Configuration for Extraction: Git Source documentation.

Agent:default

Agent:{remote_agent_name}

Git:{git.dictionary.id}
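Putting the connection-level properties together, a hypothetical connection might be configured with the illustrative values from the table above. It is shown here in properties form; in practice these values are entered as fields on the Admin UI connection form:

  databricks.system.id=my-databricks-system
  databricks.instance.url=dbc-a6ca523-ab83.cloud.databricks.com
  databricks.instance.authToken=dapi07f3ab2c4b081568db69a598d5359ab3
  databricks.instance.port=443
  databricks.instance.httpPath=sql/protocolv1/o/7702183245859201/0133-144536-ba0caa3m
  databricks.filter.workspace.path.root=/
  databricks.filter.schemas.excluded=.*/information_schema
  databricks.extraction.method=Agent:default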

Common Scanner Properties

This configuration is common to all Databricks source systems and all Databricks scenarios, and is configured in Admin UI > Configuration > CLI > Databricks > Databricks Common. It can be overridden at the individual connection level.

Property name

Description

Example

databricks.dictionary.mappingFile

Path to automatically generated mappings for Databricks instances

Default value: ${manta.dir.temp}/databricks/databricksDictionaryMantaMapping.csv

${manta.dir.temp}/databricks/databricksDictionaryMantaMapping.csv

databricks.dictionary.mappingManualFile

Path to mappings provided manually for Databricks instances

Default value: ${manta.dir.scenario}/conf/databricksDictionaryMantaMappingManual.csv

${manta.dir.scenario}/conf/databricksDictionaryMantaMappingManual.csv

databricks.extraction.lineage.enabled

Set to true if lineage provided by the Unity Catalog API should be extracted. Otherwise, the Unity Catalog API is not invoked at all.

Only change this property when instructed to by support. Disable it only when no download of Unity Catalog lineage is expected.

Default value: true

true

false

databricks.analyzer.sql.enabled

Set to true if the notebook commands written in SQL should be parsed and analyzed.

Only change this property when instructed to by support. Enable it only if some expected lineage is missing from the graph.

Default value: false

true

false

databricks.analyzer.unity.catalog.enabled

Set to true if lineage provided by the Unity Catalog API can be used for analysis. This setting only takes effect when Unity Catalog lineage has been extracted (databricks.extraction.lineage.enabled=true).

Only change this property when instructed to by support. Disable it only if it is undesirable to display Unity Catalog lineage in the graph.

Default value: true

true

false

databricks.analyzer.unity.catalog.table.lineage.enabled

Set to true to generate edges between tables for which no column-level lineage information is available. This only applies to lineage provided by the Unity Catalog API.

Only change this property when instructed to by support. Enable it only if some expected lineage is missing from the graph.

Default value: false

true

false

databricks.analyzer.dispatcher.mode

Mode customizing the results of the analysis:

  • Filtered results — Only those analysis results that are considered the most precise are used. A typical use case is that the analyzer tries to scan the notebook's code first, and only if it fails, for whatever reason, is the lineage provided by the Unity Catalog API used.

  • All results — All analysis results are used, regardless of their quality. Using this option may cause the produced lineage to be less precise. However, it also decreases the chance that any lineage will be omitted (due to missing runtime values).

Only change this property when instructed to by support.

Default value: All results

Filtered results

All results
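For reference, the documented defaults of the common scanner properties above can be summarized in properties form. The Admin UI stores this configuration under Databricks Common, so the actual storage format may differ:

  databricks.dictionary.mappingFile=${manta.dir.temp}/databricks/databricksDictionaryMantaMapping.csv
  databricks.dictionary.mappingManualFile=${manta.dir.scenario}/conf/databricksDictionaryMantaMappingManual.csv
  databricks.extraction.lineage.enabled=true
  databricks.analyzer.sql.enabled=false
  databricks.analyzer.unity.catalog.enabled=true
  databricks.analyzer.unity.catalog.table.lineage.enabled=false
  databricks.analyzer.dispatcher.mode=All results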

Stream Connection Placeholder

Sometimes dataflow analysis of source code in Databricks notebooks produces many possible string values as a result of more complex string operations. This is particularly problematic when the string contains a stream connection such as a file path. When that happens, a file node is created for each value. If there are too many file nodes, the graph becomes unreadable.

To help with this problem, we have introduced a stream connection placeholder. Instead of creating many nodes with a path that could be incomplete, we use a placeholder (artificial) node that contains all found values in its attribute.
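For instance (an invented illustration), instead of three separate file nodes for dbfs:/mnt/eu/orders.csv, dbfs:/mnt/us/orders.csv, and dbfs:/mnt/apac/orders.csv, the graph would contain one placeholder node whose attribute lists all three values:

  Possible paths: dbfs:/mnt/eu/orders.csv, dbfs:/mnt/us/orders.csv, dbfs:/mnt/apac/orders.csv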

File Path Mapping

File path mapping allows you to map a placeholder node to its real file path, which makes the resulting graph more accurate. For an explanation of how to correctly set the file path mapping, see Filesystem Resource Configuration. To make it easier to configure, Databricks provides mapping values as node attributes. In the Databricks notebook graph below, note that it was not possible to determine the file path for either the input or the output; we will focus on mapping the input file path to the correct value. The graph contains two TOO_MANY_POSSIBLE_PATHS placeholder nodes. We can also see that the File path node has a warning saying: There were too many possible paths detected. See "Possible paths" attribute for found values. See "Source Path Prefix" attribute for file path mapping RegExp.

[Figure: Databricks notebook graph with two TOO_MANY_POSSIBLE_PATHS placeholder nodes and the File path warning]

How to Set Up File Path Mapping

The attribute detail view shows escaped backslashes. If you copy values from this view, make sure that in the Admin UI there are only single and double backslashes, as described below.

To create a file path mapping, go to the Admin UI, select Configuration from the top menu, and then, in the left menu, go to CLI > Common > File path mapping. In the top right corner, select Add line. A form is displayed; fill it in as described below.

Using the example above, the mapping configuration of the input file would look as follows. Note that it is also possible to map files to an Amazon S3 resource. When done, click the Save button in the top right corner. After configuring the mapping, run the analysis again for the mapping to take effect.

[Figure: file path mapping form in the Admin UI, filled in for the input file]
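In outline, the mapping for the input file consists of the RegExp copied from the placeholder node and the real path it should resolve to. This is a hypothetical sketch; the exact form fields are described in Filesystem Resource Configuration, and the target path here is invented for illustration:

  Source path prefix (RegExp):  copied from the node's "Source Path Prefix" attribute
  Target path:                  /data/input/orders.csv
  Target resource:              File system (or Amazon S3)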

The resulting graph with the mapped input resource looks as follows. The same instructions can be used for mapping the output file.

[Figure: resulting graph with the input file mapped to its real path]