Working with Delta Lake catalog
This topic describes how to run a Spark application that ingests data into a Delta Lake catalog.
Before you begin
You must have the Metastore admin role. Without the Metastore admin privilege, you cannot ingest data to storage using the Native Spark engine. To enable your Spark application to work with the watsonx.data catalog and storage, add the following configuration to your application payload:

spark.hadoop.wxd.apiKey=Basic base64(ibmlhapikey_ibmcloudid:apikey)
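The base64(...) value above can be computed with any base64 tool. A minimal Python sketch, using hypothetical placeholder credentials (replace them with your own IBM Cloud ID and API key):

```python
import base64

# Hypothetical placeholder credentials; substitute your own values.
ibm_cloud_id = "user@example.com"
api_key = "my-api-key"

# Encode "ibmlhapikey_<ibmcloudid>:<apikey>" as base64, per the payload format above.
raw = f"ibmlhapikey_{ibm_cloud_id}:{api_key}"
encoded = base64.b64encode(raw.encode("utf-8")).decode("ascii")
print(f"spark.hadoop.wxd.apiKey=Basic {encoded}")
```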
Procedure
- Create a storage with Delta Lake catalog to ingest and manage data in delta table format. To create storage with Delta Lake catalog, see Adding a storage-catalog pair.
- Associate the storage with the Native Spark engine. For more information, see Associating a catalog with an engine.
- Create Cloud Object Storage (COS) to store the Spark application. To create Cloud Object Storage and a bucket, see Creating a storage bucket.
- Register the Cloud Object Storage in watsonx.data. For more information, see Adding a storage-catalog pair.
- Save the following Spark application (Python file) to your local machine. In this example, the file is named delta_demo.py. The Python Spark application demonstrates the following functionality:
  - It creates a database (here named iae) inside the Delta Lake catalog that you created to store data.
  - It creates a table named employee inside the iae database.
  - It inserts data into the employee table and runs a SELECT query on it.
  - It drops the table and the schema after use.

  from pyspark.sql import SparkSession

  def init_spark():
      spark = SparkSession.builder.appName("lh-hms-cloud") \
          .enableHiveSupport().getOrCreate()
      return spark

  def main():
      spark = init_spark()
      spark.sql("show databases").show()
      spark.sql("create database if not exists spark_catalog.iae LOCATION 's3a://delta-connector-test/'").show()
      spark.sql("create table if not exists spark_catalog.iae.employee (id bigint, name string, location string) USING DELTA").show()
      spark.sql("insert into spark_catalog.iae.employee VALUES (1, 'Sam', 'Kochi'), (2, 'Tom', 'Bangalore'), (3, 'Bob', 'Chennai'), (4, 'Alex', 'Bangalore')").show()
      spark.sql("select * from spark_catalog.iae.employee").show()
      spark.sql("drop table spark_catalog.iae.employee").show()
      spark.sql("drop schema spark_catalog.iae CASCADE").show()
      spark.stop()

  if __name__ == '__main__':
      main()
- Upload the Spark application to the COS, see Uploading data.
- To submit the Spark application with data residing in Cloud Object Storage, specify the parameter values and run the following curl command:

  curl --request POST \
  --url https://<wxd_host_name>/lakehouse/api/v2/spark_engines/<spark_engine_id>/applications \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --header 'LhInstanceId: <instance_id>' \
  --data '{
    "application_details": {
      "conf": {
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        "spark.sql.catalog.spark_catalog.type": "hive",
        "spark.hadoop.wxd.apiKey": "ZenApiKey <user-authentication-string>"
      },
      "application": "s3a://delta-connector-test/delta_demo.py"
    }
  }'
  Parameter values:
  - <wxd_host_name>: The hostname of your watsonx.data instance.
  - <spark_engine_id>: The engine ID of the Native Spark engine.
  - <token>: The bearer token. For more information about generating the token, see Generating a bearer token.
  - <instance_id>: The instance ID from the watsonx.data cluster instance URL. For example, 1609968977179454.
  - <user-authentication-string>: The value must be generated in the format: echo -n "<username>:<your Zen API key>" | base64. The Zen API key here is the API key of the user accessing the Object store bucket. To generate an API key, log in to the watsonx.data console, navigate to Profile > Profile and Settings > API Keys, and generate a new API key. Note: If you generate a new API key, your old API key becomes invalid.
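The <user-authentication-string> and the request body can also be assembled programmatically before you substitute them into the curl command. A minimal Python sketch, using hypothetical placeholder credentials:

```python
import base64
import json

# Hypothetical placeholders; substitute your own username and Zen API key.
username = "myuser"
zen_api_key = "my-zen-api-key"

# Equivalent of: echo -n "<username>:<your Zen API key>" | base64
auth_string = base64.b64encode(f"{username}:{zen_api_key}".encode("utf-8")).decode("ascii")

# The request body for the application submission, as given in the curl command above.
payload = {
    "application_details": {
        "conf": {
            "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
            "spark.sql.catalog.spark_catalog.type": "hive",
            "spark.hadoop.wxd.apiKey": f"ZenApiKey {auth_string}",
        },
        "application": "s3a://delta-connector-test/delta_demo.py",
    }
}
print(json.dumps(payload, indent=2))
```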
- After you submit the Spark application, you receive a confirmation message with the application ID and Spark version. Save it for reference.
- Log in to the watsonx.data cluster and access the Engine details page. In the Applications tab, use the application ID to list the application and track its stages. For more information, see View and manage applications.
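The confirmation message returned by the submit step is a JSON document containing the application ID and Spark version. The field names below are assumptions for illustration only and may differ in your deployment; a minimal sketch of extracting the application ID for later tracking:

```python
import json

# Hypothetical response body; check the actual response from your submission.
response_text = '{"id": "a1b2c3d4", "spark_version": "3.4.1", "state": "accepted"}'

info = json.loads(response_text)
application_id = info["id"]  # save this ID to locate the application in the Applications tab
print(f"Submitted application {application_id} (Spark {info['spark_version']})")
```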