Databricks Resource Configuration
Source System Properties
This configuration can be set up by creating a new connection on the Admin UI > Connections tab or by editing an existing connection in Admin UI > Connections > Databases > Databricks > specific connection. A new connection can also be created via the Manta Orchestration API.
The granularity of the IBM Automatic Data Lineage connection for Databricks is one Databricks instance. Using multiple connections against a single Databricks instance may lead to within-system lineage not being connected properly.
| Property name | Description | Example |
|---|---|---|
| databricks.system.id | Name of a resource representing this Databricks system. This is an arbitrary string that distinguishes the analyzed Databricks instance from others. | |
| databricks.instance.url | Server hostname of the Databricks instance. This is the URL used to log in to the Databricks instance. Both the hostname and the URL are supported. If the URL contains a port number, the port number is omitted. | |
| databricks.instance.authToken | Authorization token for the Databricks instance. This token is used to access Databricks APIs to retrieve data about notebooks or tables. To find the authorization token in the Databricks UI, go to User Settings > Access Tokens. You can also generate the token using the Databricks Token API. | |
| databricks.instance.port | Port of the Databricks cluster. This value is used when establishing the JDBC connection with the Databricks cluster. To find the required value in the Databricks UI, go to Compute > Cluster > Configuration > Advanced Options > JDBC/ODBC. Default value: | |
| databricks.instance.httpPath | HTTP path of the Databricks cluster. This value is used when establishing the JDBC connection with the Databricks cluster. To find the required value in the Databricks UI, go to Compute > Cluster > Configuration > Advanced Options > JDBC/ODBC. | |
| databricks.filter.workspace.path.root | A comma-separated list of root paths to the Databricks workspace. The workspace tree is searched for objects (e.g., notebooks) starting from these paths. Setting this value to something other than the default can improve performance if you are only interested in scanning specific notebooks rather than all notebooks in the workspace. Set it to the value obtained from the workspace tree in the Databricks UI. The only way to display the complete workspace tree in Automatic Data Lineage is to scan all notebooks. The property currently supports setting only a single value. Default value: | |
| databricks.filter.schemas.extracted | Limits the extracted catalog schemas by explicitly specifying a comma-separated list of catalogs and schemas to be extracted, provided in the format. Default value: <blank> | |
| databricks.filter.schemas.excluded | Limits the extracted catalog schemas by explicitly specifying a comma-separated list of catalogs and schemas to be excluded, provided in the format. Default value: | |
| databricks.extraction.method | Set to Agent:default to use the default Manta Extractor Agent; set to Agent:{remote_agent_name} to use a remote Agent; or set to Git:{git.dictionary.id} to use the Git ingest method. For more information on setting up a remote extractor Agent, refer to the Manta Flow Agent Configuration for Extraction documentation. For details on configuring a Git ingest method, refer to the Manta Flow Agent Configuration for Extraction: Git Source documentation. | default Git agent |
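To illustrate how the instance properties fit together, here is a minimal sketch of the JDBC URL a client assembles from databricks.instance.url, databricks.instance.port, and databricks.instance.httpPath. The host, port, and HTTP path values below are placeholders, and the URL layout follows the general Databricks JDBC driver conventions; your driver version may expect slightly different parameters.

```python
# Hypothetical sketch: assembling a Databricks JDBC URL from the
# databricks.instance.* connection properties. All values are placeholders.

def build_jdbc_url(host: str, port: int, http_path: str) -> str:
    """Build a Databricks JDBC URL from the connection properties."""
    return (
        f"jdbc:databricks://{host}:{port}/default;"
        f"transportMode=http;ssl=1;"
        f"httpPath={http_path};"
        f"AuthMech=3;UID=token"  # PWD=<databricks.instance.authToken> is appended at runtime
    )

url = build_jdbc_url(
    "adb-1234567890123456.7.azuredatabricks.net",  # databricks.instance.url (no scheme, no port)
    443,                                            # databricks.instance.port
    "sql/protocolv1/o/1234567890123456/0123-456789-abcde123",  # databricks.instance.httpPath
)
```

Note that the hostname is given without a scheme or port, which matches the rule above that a port number contained in the URL is omitted.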
Common Scanner Properties
This configuration is common to all Databricks source systems and all Databricks scenarios, and is configured in Admin UI > Configuration > CLI > Databricks > Databricks Common. It can be overridden at the individual connection level.
| Property name | Description | Example |
|---|---|---|
| databricks.dictionary.mappingFile | Path to the automatically generated mappings for Databricks instances. Default value: | |
| databricks.dictionary.mappingManualFile | Path to the manually provided mappings for Databricks instances. Default value: | |
| databricks.extraction.lineage.enabled | Set to true if the lineage provided by the Unity Catalog API should be extracted; otherwise, the Unity Catalog API is not invoked at all. Only change the property when instructed to by support. Only disable it when no download of Unity Catalog lineage is expected. Default value: | |
| databricks.analyzer.sql.enabled | Set to true if notebook commands written in SQL should be parsed and analyzed. Only change the property when instructed to by support. Only disable it if some expected lineage is missing from the graph. Default value: | |
| databricks.analyzer.unity.catalog.enabled | Set to true if the lineage provided by the Unity Catalog API can be used for analysis. This setting only takes effect when Unity Catalog lineage has been extracted (see databricks.extraction.lineage.enabled). Only change the property when instructed to by support. Only disable it if it is undesirable to display Unity Catalog lineage in the graph. Default value: | |
| databricks.analyzer.unity.catalog.table.lineage.enabled | Set to true if edges should be generated between tables for which no column-level lineage information is available. Only applies to lineage provided by the Unity Catalog API. Only change the property when instructed to by support. Only enable it if some expected lineage is missing from the graph. Default value: | |
| databricks.analyzer.dispatcher.mode | Mode customizing the results of the analysis. Only change the property when instructed to by support. Default value: | |
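As an illustration of the Unity Catalog lineage download controlled by databricks.extraction.lineage.enabled, the sketch below builds (but does not send) a request to the Databricks table-lineage REST endpoint. The host, token, and table name are placeholders, and the exact query parameters may vary between Databricks API versions.

```python
# Hypothetical sketch: constructing a request to the Unity Catalog
# table-lineage endpoint. Host, token, and table name are placeholders.
import urllib.request


def table_lineage_request(host: str, token: str, table_name: str) -> urllib.request.Request:
    """Build (without sending) a GET request for a table's lineage."""
    url = (
        f"https://{host}/api/2.0/lineage-tracking/table-lineage"
        f"?table_name={table_name}&include_entity_lineage=true"
    )
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


req = table_lineage_request(
    "adb-1234567890123456.7.azuredatabricks.net",  # placeholder workspace host
    "dapiXXXXXXXX",                                # placeholder personal access token
    "main.sales.orders",                           # three-level Unity Catalog table name
)
```

Disabling the property simply means no such requests are issued during extraction.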
Stream Connection Placeholder
Sometimes dataflow analysis of source code in Databricks notebooks produces many possible string values as a result of more complex string operations. This is particularly problematic when the string contains a stream connection such as a file path. When that happens, a file node is created for each value. If there are too many file nodes, the graph becomes unreadable.
To help with this problem, we have introduced a stream connection placeholder. Instead of creating many nodes with a path that could be incomplete, we use a placeholder (artificial) node that contains all found values in its attribute.
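The explosion of possible values is easiest to see with a small sketch. In this hypothetical example, a notebook builds a file path from two variables that each may hold several values, so the set of candidate strings multiplies; the variable names and paths are illustrative only.

```python
# Hypothetical illustration of why dataflow analysis can produce many
# possible file paths: each variable in a concatenation may hold several
# values, so the candidate set grows multiplicatively.
regions = ["us", "eu", "apac"]            # possible values of one variable
days = ["2023-01-01", "2023-01-02"]       # possible values of another
candidate_paths = [
    f"/mnt/data/{r}/{d}/sales.csv" for r in regions for d in days
]
# 3 * 2 = 6 candidates already; with more variables this grows quickly,
# which is why a single placeholder node listing all values is used
# instead of one file node per candidate.
```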
File Path Mapping
File path mapping allows you to map a placeholder node to its real file path, which makes the resulting graph more accurate. To correctly set the file path mapping, see Filesystem Resource Configuration for an explanation. To make configuration easier, Databricks provides the mapping values as node attributes. In the Databricks notebook graph below, note that it was not possible to determine the file path for either the input or the output; we will focus on mapping the input file path to the correct value. The graph contains two TOO_MANY_POSSIBLE_PATHS placeholder nodes. We can also see that the File path node has a Warning saying:
There were too many possible paths detected. See "Possible paths" attribute for found values. See "Source Path Prefix" attribute for file path mapping RegExp.

How to Set Up File Path Mapping
To create a file path mapping, go to the Admin UI, select Configuration from the top menu, and then, in the left menu, go to CLI > Common > File path mapping. In the top right corner, select Add line. A form is shown; fill it in as follows:
- Source Technology should be set to DATABRICKS.
- Source Connection ID is the unique ID of your Databricks notebook that created the node (provided in the node detail).
- Source Hostname is localhost.
- Source Path Prefix contains a regular expression that should match the file prefix. You should use the path of the file node starting with TOO_MANY_POSSIBLE_PATHS. Then, you need to escape special path characters using the following rules.
  - Any of these characters < ( [ { ^ - = $ ! | ] } ) ? * + . > should be prefixed with a backslash; for example, . → \., [ → \[.
  - The backslash itself should be converted to four backslashes: \ → \\\\.
  - Example: For the file node path /Filesystem/localhost/TOO_MANY_POSSIBLE_PATHS/GLOBAL 43:0 [-1030055803], you should set this value to:
    - Linux installations: TOO_MANY_POSSIBLE_PATHS/GLOBAL 43:0 \[\-1030055803\]
    - Windows installations: TOO_MANY_POSSIBLE_PATHS\\GLOBAL 43:0 \[\-1030055803\]
  - This value is provided in the Source Path Prefix attribute (the Source Path… attribute that can be seen in the following screenshot) for that particular file node.
- Target Resource is the resource type of the mapped node (usually set to FILESYSTEM).
- Target Hostname is the hostname of the target. This value can be left empty.
- Target Path Prefix is a path in the resulting graph; for example, folder_name/filename.txt.
- More information about these attributes can be found in Filesystem Resource Configuration.
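The escaping rules above can be sketched as a small helper. Note that escape_path_prefix is a hypothetical function written only to demonstrate the rules, not part of the product, and it assumes the rules apply independently to each character.

```python
# Hypothetical helper demonstrating the Source Path Prefix escaping rules.
SPECIALS = set('<([{^-=$!|]})?*+.>')  # characters that get a backslash prefix


def escape_path_prefix(path: str) -> str:
    """Escape a file node path for use as a Source Path Prefix RegExp."""
    out = []
    for ch in path:
        if ch == '\\':
            out.append('\\' * 4)   # one backslash becomes four backslashes
        elif ch in SPECIALS:
            out.append('\\' + ch)  # prefix special characters with a backslash
        else:
            out.append(ch)
    return ''.join(out)


# Reproduces the Linux example above:
print(escape_path_prefix('TOO_MANY_POSSIBLE_PATHS/GLOBAL 43:0 [-1030055803]'))
# TOO_MANY_POSSIBLE_PATHS/GLOBAL 43:0 \[\-1030055803\]
```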
Using the example above, the mapping configuration of the input file would look as follows. Note that it is also possible to map files to an Amazon S3 resource. When done, click the Save button in the top right corner. After changing the configuration, you must run the analysis again for the mapping to take effect.

The resulting graph with the mapped input resource looks as follows. The same instructions can be used for mapping the output file.
