Databricks Integration Requirements
The following are the prerequisites necessary for IBM Automatic Data Lineage to connect to this third-party system, which you may choose to do at your sole discretion. Note that while these are usually sufficient to connect to this third-party system, we cannot guarantee the success of the connection or integration since we have no control, liability, or responsibility for third-party products or services, including their performance.
The Manta Databricks scanner uses the Databricks API to connect to the Databricks instance. The Automatic Data Lineage instance must have network access to the Databricks API (hosted by Databricks). To access the Databricks API, it is necessary to provide a personal access token (PAT). The token can be obtained through the Databricks UI and is used to authenticate the Manta Databricks scanner to the scanned Databricks instance. To extract all assets, the simplest approach is for the user that the PAT belongs to to be a metastore admin. Otherwise, the following privileges are needed to extract individual entities.
- Catalogs — Only catalogs that the user owns or on which the user has the USE_CATALOG privilege are extracted
- Schemas — Only schemas that the user owns or on which the user has the USE_SCHEMA privilege are extracted; the user must also own the parent catalog or have the USE_CATALOG privilege on it
- Tables/Views — Only tables/views that the user owns or on which the user has the SELECT privilege are extracted; the user must also own the parent catalog or have the USE_CATALOG privilege on it, and own the parent schema or have the USE_SCHEMA privilege on it
- Functions — Only functions that the user owns or on which the user has the EXECUTE privilege are extracted; the user must also own the parent catalog or have the USE_CATALOG privilege on it, and own the parent schema or have the USE_SCHEMA privilege on it
- Notebooks — Only notebooks that the user owns or for which the user has (at least) view access are extracted
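As a quick connectivity check, the PAT and API access described above can be exercised directly against the Databricks REST API. The sketch below is a minimal, hypothetical example and is not part of the Manta scanner itself; the workspace host and token in the usage comment are placeholders. It lists the catalogs visible to the PAT's user through the Unity Catalog endpoint, which reflects the USE_CATALOG/ownership rules above.

```python
import json
import urllib.request


def pat_headers(token: str) -> dict:
    """Build the Authorization header used for Databricks PAT authentication."""
    return {"Authorization": f"Bearer {token}"}


def list_catalogs(host: str, token: str) -> list:
    """Return the catalogs visible to the PAT's user via the
    Unity Catalog REST API (GET /api/2.1/unity-catalog/catalogs)."""
    req = urllib.request.Request(
        f"https://{host}/api/2.1/unity-catalog/catalogs",
        headers=pat_headers(token),
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("catalogs", [])


# Example (hypothetical workspace host and token):
# for catalog in list_catalogs("my-workspace.cloud.databricks.com", "dapi..."):
#     print(catalog["name"])
```

If the call returns fewer catalogs than expected, the PAT's user is likely missing ownership or the USE_CATALOG privilege on the absent catalogs.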
Hive Metastore assets are also accessed with the personal access token. Here, the SELECT privilege is needed for all the assets to be extracted (e.g., schemas, tables). The required driver must be placed in the <MANTA_AGENT_DIR_HOME>/manta-flow-agent-dir/lib-ext folder; otherwise, the extraction from Hive Metastore won't be performed. For more information, go to IBM Support. As of Automatic Data Lineage R42, if the driver is not provided, the extraction will always produce an error reminding the user about the missing driver. If the driver was intentionally not provided — for example, if nothing from Hive Metastore should be extracted — then the hive_metastore catalog should be included in the excluded catalogs list in the connection configuration.
Requirements to Extract Unity Catalog Lineage
- The workspace must have Unity Catalog enabled (https://docs.databricks.com/en/data-governance/unity-catalog/enable-workspaces.html#enable-your-workspace-for-unity-catalog).
- Tables must be registered in a Unity Catalog metastore.
- Queries must use the Spark DataFrame (for example, Spark SQL functions that return a DataFrame) or Databricks SQL interfaces.
- To view lineage for a table or view, users must have the SELECT privilege on the table or view.
- To view lineage information for notebooks and workflows, users must have permissions on these objects as defined by the access control settings in the workspace.
- To view lineage for a Unity Catalog-enabled workflow, you must have CAN_VIEW permissions on the pipeline.
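For reference, the lineage that these requirements gate can also be retrieved directly from the Databricks lineage-tracking REST API. The sketch below is an illustration only, not part of the Automatic Data Lineage product; the host and table name are placeholders, and it assumes the table-lineage endpoint with `table_name` and `include_entity_lineage` request parameters. The PAT's user still needs the SELECT privilege on the table, as listed above.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint path for table-level lineage.
API_PATH = "/api/2.0/lineage-tracking/table-lineage"


def lineage_url(host: str, table_name: str) -> str:
    """Build the request URL for table-level lineage of a Unity Catalog table."""
    params = urllib.parse.urlencode(
        {"table_name": table_name, "include_entity_lineage": "true"}
    )
    return f"https://{host}{API_PATH}?{params}"


def table_lineage(host: str, token: str, table_name: str) -> dict:
    """Fetch upstream/downstream lineage; the PAT's user needs SELECT on the table."""
    req = urllib.request.Request(
        lineage_url(host, table_name),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example (hypothetical host, token, and three-level table name):
# lineage = table_lineage("my-workspace.cloud.databricks.com", "dapi...",
#                         "main.default.sales")
# print(lineage.get("upstreams"), lineage.get("downstreams"))
```

An empty response here usually means the table has no recorded lineage or the queries that produced it did not go through the Spark DataFrame or Databricks SQL interfaces.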
Supported Extraction Features
- Fetching of information about notebooks in the Databricks instance through Databricks APIs
- Fetching of information about standalone queries in the Databricks instance through Databricks APIs
- Fetching of information about database assets (e.g., tables, views, functions) in the Databricks instance through Databricks APIs
- Extraction of dictionaries from Hive Metastore
- Extraction of dictionaries from Unity Catalog
Supported Data Flow Analysis Features
- Visualization of lineage information for Unity Catalog views and functions by scanning extracted SQL definitions
- Visualization of lineage information for standalone queries as returned by the Unity Catalog API
- Visualization of lineage information for notebooks as returned by the Unity Catalog API
- Visualization of lineage information for jobs and workflows as returned by the Unity Catalog API
Supported SQL Features
- SELECT and INSERT and basic expression handling
- CREATE|ALTER|DROP CATALOG
- CREATE|ALTER|DROP SCHEMA
- CREATE|ALTER|DROP DATABASE
- CREATE|ALTER|DROP TABLE
- CREATE|ALTER|DROP VIEW
- USE CATALOG
- USE DATABASE
- USE SCHEMA
- UPDATE — see https://docs.databricks.com/sql/language-manual/delta-update.html
- LOAD DATA — see https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-dml-load.html
- CREATE FUNCTION ... AS ... USING JAR
- CREATE FUNCTION ... RETURNS ... — see https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html
- Basic parsing and resolution of lambda function parameters
Known Unsupported Features
Automatic Data Lineage does not support the following Databricks features. This list includes all of the features that IBM is aware are unsupported, but it might not be comprehensive.
- Scanning of notebook Scala commands (depending on the use case, the OpenLineage Scanner could be a solution)
- Scanning of notebook R commands
- Analysis of custom libraries
- Extraction of definitions of functions and views from Hive Metastore