Databricks Resource Configuration

Source System Properties

This configuration can be set up by creating a new connection on the Admin UI > Connections tab or by editing an existing connection in Admin UI > Connections > Databases > Databricks > specific connection. A new connection can also be created via the Manta Orchestration API.

The granularity of the IBM Automatic Data Lineage connection for Databricks is one Databricks instance. Using multiple connections against a single Databricks instance may lead to within-system lineage not being connected properly.

Property name

Description

Example

databricks.system.id

Name of a resource representing this Databricks system. This is an arbitrary string that distinguishes the analyzed Databricks instance from others.

my-databricks-system

databricks.instance.url

Server hostname of the Databricks instance. This refers to the URL that is used to log in to the Databricks instance.

Both the hostname and the full URL are supported. If the URL contains a port number, the port is ignored.

dbc-a6ca523-ab83.cloud.databricks.com

https://dbc-a6ca523-ab83.cloud.databricks.com

databricks.instance.authToken

Authorization token for the Databricks instance. This token is used to access Databricks APIs to retrieve data about notebooks or tables.

To generate an authorization token in the Databricks UI, go to User Settings > Access Tokens. You can also generate the token using the Databricks Token API.

dapi07f3ab2c4b081568db69a598d5359ab3

databricks.instance.port

Port for the Databricks cluster. This value is used when establishing the JDBC connection with the Databricks cluster.

To find the required value from the Databricks UI, go to Compute > Cluster > Configuration > Advanced Options > JDBC/ODBC.

Default value: 443

443

databricks.instance.httpPath

HTTP path for the Databricks cluster. This value is used when establishing the JDBC connection with the Databricks cluster.

To find the required value from the Databricks UI, go to Compute > Cluster > Configuration > Advanced Options > JDBC/ODBC.

sql/protocolv1/o/7702183245859201/0133-144536-ba0caa3m
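For orientation, the URL, port, and HTTP path typically combine into a Databricks JDBC URL along these lines. This is a sketch based on the standard Databricks JDBC driver format; the exact connection string built by the scanner may differ:

  jdbc:databricks://dbc-a6ca523-ab83.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/7702183245859201/0133-144536-ba0caa3m;AuthMech=3;UID=token;PWD=<databricks.instance.authToken>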

databricks.filter.workspace.path.root

A comma-separated list of root paths in the Databricks workspace. The workspace tree is searched for objects (e.g., notebooks) under these paths.

Setting this value to something other than the default can improve performance if you are only interested in scanning specific notebooks rather than all notebooks in the workspace. Set it to a value obtained from the workspace tree in the Databricks UI. Note that the only way to display the complete workspace tree in Automatic Data Lineage is to scan all notebooks. The property currently supports setting only a single value.

Default value: /

/

/path/to/some/notebooks,/path/to/other/notebooks

databricks.filter.schemas.extracted

Limit the extracted catalog schemas by explicitly specifying a comma-separated list of catalogs and schemas to be extracted, provided in the format catalog/schema. Each part is evaluated as a regular expression. Leave blank to extract all catalogs and schemas.

Default value: <blank>

catalog1/schema1,catalog2/schema2,catalog3
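Since each part is evaluated as a regular expression, a hypothetical filter such as the following would extract every schema starting with stg_ in the sales catalog, plus (following the bare catalog3 pattern in the example above) all schemas of the finance catalog. The catalog and schema names here are invented for illustration:

  databricks.filter.schemas.extracted=sales/stg_.*,finance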

databricks.filter.schemas.excluded

Limit the extracted catalog schemas by explicitly specifying a comma-separated list of catalogs and schemas to be excluded, provided in the format catalog/schema. Each part is evaluated as a regular expression.

Default value: .*/information_schema

catalog1/schema1,catalog2/schema2,catalog3

databricks.extraction.method

Selects the extraction method:

  • Agent:default — the default Manta Extractor Agent is used for extraction.

  • Agent:{remote_agent_name} — a remote Agent is used for extraction.

  • Git:{git.dictionary.id} — the Git ingest method is used for extraction.

For more information on setting up a remote extractor Agent, refer to the Manta Flow Agent Configuration for Extraction documentation. For additional details on configuring the Git ingest method, refer to the Manta Flow Agent Configuration for Extraction: Git Source documentation.

Agent:default

Agent:{remote_agent_name}

Git:{git.dictionary.id}
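Putting the connection-level properties together, a hypothetical connection might be configured with the illustrative values from the table above. It is shown here in properties form; in practice these values are entered as fields on the Admin UI connection form:

  databricks.system.id=my-databricks-system
  databricks.instance.url=dbc-a6ca523-ab83.cloud.databricks.com
  databricks.instance.authToken=dapi07f3ab2c4b081568db69a598d5359ab3
  databricks.instance.port=443
  databricks.instance.httpPath=sql/protocolv1/o/7702183245859201/0133-144536-ba0caa3m
  databricks.filter.workspace.path.root=/
  databricks.filter.schemas.excluded=.*/information_schema
  databricks.extraction.method=Agent:default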

Common Scanner Properties

This configuration is common to all Databricks source systems and all Databricks scenarios, and is configured in Admin UI > Configuration > CLI > Databricks > Databricks Common. It can be overridden at the individual connection level.

Property name

Description

Example

databricks.dictionary.mappingFile

Path to automatically generated mappings for Databricks instances

Default value: ${manta.dir.temp}/databricks/databricksDictionaryMantaMapping.csv

${manta.dir.temp}/databricks/databricksDictionaryMantaMapping.csv

databricks.dictionary.mappingManualFile

Path to mappings provided manually for Databricks instances

Default value: ${manta.dir.scenario}/conf/databricksDictionaryMantaMappingManual.csv

${manta.dir.scenario}/conf/databricksDictionaryMantaMappingManual.csv

databricks.extraction.lineage.enabled

Set to true if lineage provided by the Unity Catalog API should be extracted. Otherwise, the Unity Catalog API is not invoked at all.

Only change this property when instructed to by support. Disable it only when no download of Unity Catalog lineage is expected.

Default value: true

true

false

databricks.analyzer.sql.enabled

Set to true if the notebook commands written in SQL should be parsed and analyzed.

Only change this property when instructed to by support. Enable it only if some expected lineage is missing from the graph.

Default value: false

true

false

databricks.analyzer.unity.catalog.enabled

Set to true if lineage provided by the Unity Catalog API can be used for analysis. This setting only takes effect when Unity Catalog lineage has been extracted (databricks.extraction.lineage.enabled=true).

Only change this property when instructed to by support. Disable it only if it is undesirable to display Unity Catalog lineage in the graph.

Default value: true

true

false

databricks.analyzer.unity.catalog.table.lineage.enabled

Set to true to generate edges between tables for which no column-level lineage information is available. This only applies to lineage provided by the Unity Catalog API.

Only change this property when instructed to by support. Enable it only if some expected lineage is missing from the graph.

Default value: false

true

false

databricks.analyzer.dispatcher.mode

Mode customizing the results of the analysis:

  • Filtered results — Only those analysis results that are considered the most precise are used. A typical use case is that the analyzer tries to scan the notebook's code first, and only if it fails, for whatever reason, is the lineage provided by the Unity Catalog API used.

  • All results — All analysis results are used, regardless of their quality. Using this option may cause the produced lineage to be less precise. However, it also decreases the chance that any lineage will be omitted (due to missing runtime values).

Only change this property when instructed to by support.

Default value: All results

Filtered results

All results
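For reference, the documented defaults of the common scanner properties above can be summarized in properties form. The Admin UI stores this configuration under Databricks Common, so the actual storage format may differ:

  databricks.dictionary.mappingFile=${manta.dir.temp}/databricks/databricksDictionaryMantaMapping.csv
  databricks.dictionary.mappingManualFile=${manta.dir.scenario}/conf/databricksDictionaryMantaMappingManual.csv
  databricks.extraction.lineage.enabled=true
  databricks.analyzer.sql.enabled=false
  databricks.analyzer.unity.catalog.enabled=true
  databricks.analyzer.unity.catalog.table.lineage.enabled=false
  databricks.analyzer.dispatcher.mode=All results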

Stream Connection Placeholder

Sometimes dataflow analysis of source code in Databricks notebooks produces many possible string values as a result of more complex string operations. This is particularly problematic when the string contains a stream connection such as a file path. When that happens, a file node is created for each value. If there are too many file nodes, the graph becomes unreadable.

To help with this problem, we have introduced a stream connection placeholder. Instead of creating many nodes with a path that could be incomplete, we use a placeholder (artificial) node that contains all found values in its attribute.
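For instance (an invented illustration), instead of three separate file nodes for dbfs:/mnt/eu/orders.csv, dbfs:/mnt/us/orders.csv, and dbfs:/mnt/apac/orders.csv, the graph would contain one placeholder node whose attribute lists all three values:

  Possible paths: dbfs:/mnt/eu/orders.csv, dbfs:/mnt/us/orders.csv, dbfs:/mnt/apac/orders.csv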

File Path Mapping

File path mapping allows you to map a placeholder node to its real file path, which makes the resulting graph more accurate. For an explanation of how to correctly set the file path mapping, see Filesystem Resource Configuration. To make it easier to configure, Databricks provides mapping values as node attributes. In the Databricks notebook graph below, note that it was not possible to determine the file path for either the input or the output; we will focus on mapping the input file path to the correct value. The graph contains two TOO_MANY_POSSIBLE_PATHS placeholder nodes. We can also see that the File path node has a warning saying: There were too many possible paths detected. See "Possible paths" attribute for found values. See "Source Path Prefix" attribute for file path mapping RegExp.

[Figure: Databricks notebook graph with two TOO_MANY_POSSIBLE_PATHS placeholder nodes and the File path warning]

How to Set Up File Path Mapping

The attribute detail view shows escaped backslashes. If you copy values from this view, make sure that in the Admin UI there are only single and double backslashes, as described below.

To create a file path mapping, go to the Admin UI, select Configuration from the top menu, and then, in the left menu, go to CLI > Common > File path mapping. In the top right corner, select Add line. A form is displayed; fill it in as described below.

Using the example above, the mapping configuration of the input file would look as follows. Note that it is also possible to map files to an Amazon S3 resource. When done, click the Save button in the top right corner. After configuring the mapping, run the analysis again for the mapping to take effect.

[Figure: file path mapping form in the Admin UI, filled in for the input file]
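In outline, the mapping for the input file consists of the RegExp copied from the placeholder node and the real path it should resolve to. This is a hypothetical sketch; the exact form fields are described in Filesystem Resource Configuration, and the target path here is invented for illustration:

  Source path prefix (RegExp):  copied from the node's "Source Path Prefix" attribute
  Target path:                  /data/input/orders.csv
  Target resource:              File system (or Amazon S3)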

The resulting graph with the mapped input resource looks as follows. The same instructions can be used for mapping the output file.

[Figure: resulting graph with the input file mapped to its real path]