Databricks Integration Requirements
The following are the prerequisites necessary for IBM Automatic Data Lineage to connect to this third-party system, which you may choose to do at your sole discretion. Note that while these are usually sufficient to connect to this third-party system, we cannot guarantee the success of the connection or integration since we have no control, liability, or responsibility for third-party products or services, including their performance.
The Manta Databricks scanner uses the Databricks API to connect to the Databricks instance. The Automatic Data Lineage instance must have network access to the Databricks API (hosted by Databricks). To access the Databricks API, it is necessary to provide a personal access token (PAT). The token can be obtained through the Databricks UI and is used to authenticate the Manta Databricks scanner to the scanned Databricks instance. To extract all assets, the simplest approach is for the user that the PAT belongs to to be a metastore admin. Otherwise, the following privileges are needed to extract individual entities.
- Catalogs — Only catalogs that the user owns or on which the user has the USE_CATALOG privilege are extracted
- Schemas — Only schemas that the user owns or on which the user has the USE_SCHEMA privilege are extracted; the user must also own the parent catalog or have the USE_CATALOG privilege on it
- Tables/Views — Only tables/views that the user owns or on which the user has the SELECT privilege are extracted; the user must also own the parent catalog or have the USE_CATALOG privilege on it, and own the parent schema or have the USE_SCHEMA privilege on it
- Functions — Only functions that the user owns or on which the user has the EXECUTE privilege are extracted; the user must also own the parent catalog or have the USE_CATALOG privilege on it, and own the parent schema or have the USE_SCHEMA privilege on it
- Notebooks — Only notebooks that the user owns or for which the user has (at least) view access are extracted
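As a quick connectivity check, the PAT and API access described above can be exercised directly against the Databricks REST API. The sketch below is a minimal, hypothetical example and is not part of the Manta scanner itself; the workspace host and token in the usage comment are placeholders. It lists the catalogs visible to the PAT's user through the Unity Catalog endpoint, which reflects the USE_CATALOG/ownership rules above.

```python
import json
import urllib.request


def pat_headers(token: str) -> dict:
    """Build the Authorization header used for Databricks PAT authentication."""
    return {"Authorization": f"Bearer {token}"}


def list_catalogs(host: str, token: str) -> list:
    """Return the catalogs visible to the PAT's user via the
    Unity Catalog REST API (GET /api/2.1/unity-catalog/catalogs)."""
    req = urllib.request.Request(
        f"https://{host}/api/2.1/unity-catalog/catalogs",
        headers=pat_headers(token),
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("catalogs", [])


# Example (hypothetical workspace host and token):
# for catalog in list_catalogs("my-workspace.cloud.databricks.com", "dapi..."):
#     print(catalog["name"])
```

If the call returns fewer catalogs than expected, the PAT's user is likely missing ownership or the USE_CATALOG privilege on the absent catalogs.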
Hive Metastore assets are also accessed with the personal access token. Here, the SELECT privilege is needed for all the assets to be extracted (e.g., schemas, tables). The required driver must be placed in the <MANTA_AGENT_DIR_HOME>/manta-flow-agent-dir/lib-ext folder; otherwise, the extraction from Hive Metastore won't be performed. For more information, go to IBM Support. As of Automatic Data Lineage R42, if the driver is not provided, the extraction will always produce an error reminding the user about the missing driver. If the driver was intentionally not provided — for example, if nothing from Hive Metastore should be extracted — then the hive_metastore catalog should be included in the excluded catalogs list in the connection configuration.
Requirements to Extract Unity Catalog Lineage
- The workspace must have Unity Catalog enabled (https://docs.databricks.com/en/data-governance/unity-catalog/enable-workspaces.html#enable-your-workspace-for-unity-catalog).
- Tables must be registered in a Unity Catalog metastore.
- Queries must use the Spark DataFrame (for example, Spark SQL functions that return a DataFrame) or Databricks SQL interfaces.
- To view lineage for a table or view, users must have the SELECT privilege on the table or view.
- To view lineage information for notebooks and workflows, users must have permissions on these objects as defined by the access control settings in the workspace.
- To view lineage for a Unity Catalog-enabled workflow, you must have CAN_VIEW permissions on the pipeline.
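For reference, the lineage that these requirements gate can also be retrieved directly from the Databricks lineage-tracking REST API. The sketch below is an illustration only, not part of the Automatic Data Lineage product; the host and table name are placeholders, and it assumes the table-lineage endpoint with `table_name` and `include_entity_lineage` request parameters. The PAT's user still needs the SELECT privilege on the table, as listed above.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint path for table-level lineage.
API_PATH = "/api/2.0/lineage-tracking/table-lineage"


def lineage_url(host: str, table_name: str) -> str:
    """Build the request URL for table-level lineage of a Unity Catalog table."""
    params = urllib.parse.urlencode(
        {"table_name": table_name, "include_entity_lineage": "true"}
    )
    return f"https://{host}{API_PATH}?{params}"


def table_lineage(host: str, token: str, table_name: str) -> dict:
    """Fetch upstream/downstream lineage; the PAT's user needs SELECT on the table."""
    req = urllib.request.Request(
        lineage_url(host, table_name),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example (hypothetical host, token, and three-level table name):
# lineage = table_lineage("my-workspace.cloud.databricks.com", "dapi...",
#                         "main.default.sales")
# print(lineage.get("upstreams"), lineage.get("downstreams"))
```

An empty response here usually means the table has no recorded lineage or the queries that produced it did not go through the Spark DataFrame or Databricks SQL interfaces.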
Supported Extraction Features
- Fetching of information about notebooks in the Databricks instance through Databricks APIs
- Fetching of information about standalone queries in the Databricks instance through Databricks APIs
- Fetching of information about database assets (e.g., tables, views, functions) in the Databricks instance through Databricks APIs
- Extraction of dictionaries from Hive Metastore
- Extraction of dictionaries from Unity Catalog
Supported Data Flow Analysis Features
- Visualization of lineage information for Unity Catalog views and functions by scanning extracted SQL definitions
- Visualization of lineage information for standalone queries as returned by the Unity Catalog API
- Visualization of lineage information for notebooks as returned by the Unity Catalog API
- Visualization of lineage information for jobs and workflows as returned by the Unity Catalog API
Supported SQL Features
- SELECT and INSERT and basic expression handling
- CREATE|ALTER|DROP CATALOG
- CREATE|ALTER|DROP SCHEMA
- CREATE|ALTER|DROP DATABASE
- CREATE|ALTER|DROP TABLE
- CREATE|ALTER|DROP VIEW
- USE CATALOG
- USE DATABASE
- USE SCHEMA
- UPDATE — see https://docs.databricks.com/sql/language-manual/delta-update.html
- LOAD DATA — see https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-dml-load.html
- CREATE FUNCTION ... AS ... USING JAR
- CREATE FUNCTION ... RETURNS ... — see https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
- Basic parsing and resolution of lambda function parameters
Known Unsupported Features
Automatic Data Lineage does not support the following Databricks features. This list includes all of the features that IBM is aware are unsupported, but it might not be comprehensive.
- Scanning of notebook Scala commands (depending on the use case, the OpenLineage Scanner could be a solution)
- Scanning of notebook R commands
- Analysis of custom libraries
- Extraction of definitions of functions and views from Hive Metastore