Databricks Scanner Guide
Databricks is a cloud-based platform that handles data tasks such as data ingestion; generating dashboards and visualizations; and data discovery, annotation, and exploration. IBM Automatic Data Lineage is a data lineage platform that simplifies data management by supporting Databricks lineage through Unity Catalog. Databricks Unity Catalog captures runtime lineage in Databricks, including lineage for all languages and column-level lineage. Automatic Data Lineage then extends that lineage by connecting it to lineage outside the Databricks environment.
Follow these steps to configure a connection to Databricks.
Step 1: Create a Databricks Account
- Get started by creating a Databricks account and setting up a workspace.
- Set up Unity Catalog. The following articles provide instructions on how to enable your Databricks account to use Unity Catalog.
Step 2: Configure the Connection
Create a new connection in the Admin UI (http://localhost:8281/manta-admin-gui/app/index.html?#/platform/connections/) to enable automated extraction of Databricks by Automatic Data Lineage. The connection requirements are listed in Databricks Integration Requirements.
Note that the Databricks scanner uses Agent. Read Manta Flow Agent Configuration for Extraction for more details.
Properties That Must Be Configured
- Databricks system name: User-defined field that identifies the Databricks connection.
- Databricks server hostname: Server host name of the Databricks instance. This is the URL used to log in to the Databricks instance.
- Databricks authorization token: The authorization token can be obtained in the Databricks User Interface under User Settings → Access Token. This token is used to access Databricks APIs to retrieve data about notebooks or tables.
- Databricks cluster port: The value for this field can be located in Compute → Cluster → Configure → Advanced Options → JDBC/ODBC in the Databricks User Interface. This is used when establishing the JDBC connection with the Databricks cluster.
- Databricks cluster HTTP path: The value for this field can be located in Compute → Cluster → Configure → Advanced Options → JDBC/ODBC in the Databricks User Interface. This is used when establishing the JDBC connection with the Databricks cluster.
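To sanity-check the values before entering them in the Admin UI, it can help to see how the properties above typically fit together. The following is a minimal Python sketch, not the scanner's actual implementation. The `DatabricksConnection` class and the helper functions are hypothetical names introduced for illustration. The JDBC URL layout follows the publicly documented Databricks JDBC driver format, and the request targets the public Workspace API `list` endpoint. Whether the scanner builds the same strings is an assumption.

```python
from dataclasses import dataclass
from urllib.parse import urlencode, urlparse
from urllib.request import Request


@dataclass(frozen=True)
class DatabricksConnection:
    """Hypothetical container mirroring the connection properties above."""
    system_name: str  # user-defined label for the connection
    hostname: str     # server host name, or the full login URL
    token: str        # authorization (personal access) token
    port: int         # cluster port from the JDBC/ODBC tab
    http_path: str    # cluster HTTP path from the JDBC/ODBC tab


def normalize_host(value: str) -> str:
    """Reduce a login URL such as https://host/?o=1 to the bare host name."""
    value = value.strip()
    # urlparse only recognizes the host if a scheme or a leading // is present.
    parsed = urlparse(value if "://" in value else "//" + value)
    return parsed.hostname or ""


def jdbc_url(conn: DatabricksConnection) -> str:
    """Build a token-authenticated JDBC URL (assumed driver URL format)."""
    host = normalize_host(conn.hostname)
    return (
        f"jdbc:databricks://{host}:{conn.port};"
        f"httpPath={conn.http_path};AuthMech=3;UID=token;PWD={conn.token}"
    )


def workspace_list_request(conn: DatabricksConnection, path: str = "/") -> Request:
    """Prepare (but do not send) a REST call that lists workspace objects."""
    host = normalize_host(conn.hostname)
    url = f"https://{host}/api/2.0/workspace/list?" + urlencode({"path": path})
    # The token is passed as a bearer token in the Authorization header.
    return Request(url, headers={"Authorization": f"Bearer {conn.token}"}, method="GET")
```

Passing the prepared request to `urllib.request.urlopen` would actually call the API. A successful response suggests that the host name and token are valid, while the port and HTTP path only matter for the JDBC connection.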