Spark ingestion through the ibm-lh tool command line
You can run the ibm-lh tool to ingest data into IBM® watsonx.data through the command line interface (CLI) using the IBM Analytics Engine (Spark). The commands to run the ingestion job are listed in this topic.
watsonx.data Developer edition
watsonx.data on IBM Software Hub
Before you begin
- You must have the Administrator role and privileges in the catalog to do ingestion through the web console.
- Add and register IBM Analytics Engine (Spark). See Adding a Spark engine.
- Add a bucket for the target catalog. See Adding storage.
- Create a schema and a table in the catalog for the data to be ingested. See Creating schemas and Creating tables.
Procedure
- Set the mandatory environment variable ENABLED_INGEST_MODE to SPARK_LEGACY before starting an ingestion job by running the following command:

  export ENABLED_INGEST_MODE=SPARK_LEGACY
- Set the following environment variables before starting an ingestion job by running the following commands:

  export IBM_LH_BEARER_TOKEN=<token>
  export IBM_LH_SPARK_JOB_ENDPOINT=https://<cpd_url>/v4/analytics_engines/<instance_id>/spark_applications
  export MDS_CLIENT_USER=lakehouse
  export MDS_CLIENT_PASSWORD=<instance secret>
  export SOURCE_S3_CREDS="AWS_ACCESS_KEY_ID=*******,AWS_SECRET_ACCESS_KEY=*******,ENDPOINT_URL=<endpoint_url>,AWS_REGION=<region>,BUCKET_NAME=<bucket_name>"
  export TARGET_S3_CREDS="AWS_ACCESS_KEY_ID=*******,AWS_SECRET_ACCESS_KEY=*******,ENDPOINT_URL=<endpoint_url>,AWS_REGION=<region>,BUCKET_NAME=<bucket_name>"
  export IBM_LH_SPARK_EXECUTOR_CORES=<value>
  export IBM_LH_SPARK_EXECUTOR_MEMORY=<value>
  export IBM_LH_SPARK_EXECUTOR_COUNT=<value>
  export IBM_LH_SPARK_DRIVER_CORES=<value>
  export IBM_LH_SPARK_DRIVER_MEMORY=<value>
  export INSTANCE_ID=<instance_id>
  export IBM_LH_URL=https://<lh_hostname>
  export USE_NATIVE_SPARK=<true/false>
  export USE_EXTERNAL_SPARK=<true/false>
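The exports above can be collected into a single script that you source before each ingestion job. This is a minimal sketch for a native-Spark run; every value (token, endpoint, instance ID, bucket details) is an illustrative placeholder, not a real credential.

```shell
#!/bin/sh
# Hypothetical values for illustration only -- replace each placeholder
# with the details of your own CPD cluster and buckets.
export ENABLED_INGEST_MODE=SPARK_LEGACY
export IBM_LH_BEARER_TOKEN="example-token"            # see Get authorization token
export IBM_LH_SPARK_JOB_ENDPOINT="https://cpd.example.com/v4/analytics_engines/0000-1111/spark_applications"
export MDS_CLIENT_USER=lakehouse
export MDS_CLIENT_PASSWORD="example-instance-secret"  # from the oc exec command in the table below
export SOURCE_S3_CREDS="AWS_ACCESS_KEY_ID=EXAMPLEKEY,AWS_SECRET_ACCESS_KEY=EXAMPLESECRET,ENDPOINT_URL=https://s3.example.com,AWS_REGION=us-east-1,BUCKET_NAME=source-bucket"
export TARGET_S3_CREDS="AWS_ACCESS_KEY_ID=EXAMPLEKEY,AWS_SECRET_ACCESS_KEY=EXAMPLESECRET,ENDPOINT_URL=https://s3.example.com,AWS_REGION=us-east-1,BUCKET_NAME=target-bucket"
# Optional Spark sizing; keep vCPU:memory at 1:2, 1:4, or 1:8 for serverless IAE.
export IBM_LH_SPARK_EXECUTOR_CORES=1
export IBM_LH_SPARK_EXECUTOR_MEMORY=4G
export IBM_LH_SPARK_EXECUTOR_COUNT=2
export IBM_LH_SPARK_DRIVER_CORES=1
export IBM_LH_SPARK_DRIVER_MEMORY=4G
export INSTANCE_ID=0000-1111
export USE_NATIVE_SPARK=true
export USE_EXTERNAL_SPARK=false
```

Source this file (for example, `. ./ingest-env.sh`) in the shell where you run `ibm-lh data-copy` so the variables are visible to the tool.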
Note: If an IBM Analytics Engine serverless instance on IBM Cloud is registered as external Spark on watsonx.data, the Spark driver and executor vCPU and memory combinations must be in a 1:2, 1:4, or 1:8 ratio. See Default limits and quotas for Analytics Engine instances.

The environment variables are described as follows:

- IBM_LH_BEARER_TOKEN: Authorization bearer token. For more information, see Get authorization token.
- IBM_LH_SPARK_JOB_ENDPOINT: Spark applications v4 endpoint for CPD and v3 endpoint for SaaS. To retrieve the CPD Spark endpoint, see Step 1 in Managing Analytics Engine Powered by Apache Spark instances.
- MDS_CLIENT_USER: User for the Metadata Service client. The CPD Spark implementation uses lakehouse.
- MDS_CLIENT_PASSWORD: Password for the Metadata Service client. For CPD, run the following command: oc exec -it <lhconsole-api-pod-name> -- cat /mnt/infra/ibm-lh-secrets/LH_INSTANCE_SECRET
- SOURCE_S3_CREDS: S3 credentials for the source file bucket, in the format "AWS_ACCESS_KEY_ID=<access_key>,AWS_SECRET_ACCESS_KEY=<secret_key>,ENDPOINT_URL=<endpoint_url>,AWS_REGION=<region>,BUCKET_NAME=<bucket_name>".
- TARGET_S3_CREDS: S3 credentials for the target table bucket, in the format "AWS_ACCESS_KEY_ID=<access_key>,AWS_SECRET_ACCESS_KEY=<secret_key>,ENDPOINT_URL=<endpoint_url>,AWS_REGION=<region>,BUCKET_NAME=<bucket_name>".
- IBM_LH_SPARK_EXECUTOR_CORES: Optional Spark engine configuration setting for executor cores.
- IBM_LH_SPARK_EXECUTOR_MEMORY: Optional Spark engine configuration setting for executor memory.
- IBM_LH_SPARK_EXECUTOR_COUNT: Optional Spark engine configuration setting for executor count.
- IBM_LH_SPARK_DRIVER_CORES: Optional Spark engine configuration setting for driver cores.
- IBM_LH_SPARK_DRIVER_MEMORY: Optional Spark engine configuration setting for driver memory.
- INSTANCE_ID: Identifies a unique instance. The instance ID is available in the URL of the instance or on the instance details page (click the information icon on the watsonx.data instance screen). This is a mandatory parameter.
- USE_NATIVE_SPARK: Must be set to true when native Spark is used for ingestion. This is a mandatory parameter.
- USE_EXTERNAL_SPARK: Must be set to true when external Spark is used for ingestion.
- IBM_LH_URL: Used only when USE_EXTERNAL_SPARK=true. The value is https://<lh_hostname>, where <lh_hostname> is the hostname of the CPD instance.

- You can run ingestion jobs in two ways: with a simple command line or with a config file.
- Run the following command to ingest data from single or multiple source data files:

  ibm-lh data-copy --source-data-files s3://path/to/file/or/folder \
    --target-table <catalog>.<schema>.<table> \
    --ingestion-engine-endpoint "hostname=<hostname>,port=<port>,type=spark" \
    --trust-store-password <truststore password> \
    --trust-store-path <truststore path> \
    --log-directory /tmp/mylogs \
    --partition-by "<columnname1>, <columnname2>" \
    --cert-file-path /root/ibm-lh-manual/ibm-lh/cert.pem \
    --target-catalog-uri 'thrift://<hms_thrift_uri>'
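A command this long is easier to review if you assemble it into a variable and echo it before running. This sketch uses hypothetical catalog, bucket, and metastore names; the `eval` line stays commented out so nothing is submitted until you have checked the output.

```shell
# Build the ibm-lh invocation as a string so it can be inspected or logged
# before submission. All names below are illustrative placeholders.
CMD="ibm-lh data-copy \
  --source-data-files s3://source-bucket/orders/ \
  --target-table iceberg_data.sales.orders \
  --ingestion-engine-endpoint \"hostname=,port=,type=spark\" \
  --trust-store-password changeit \
  --trust-store-path file:///opt/ibm/jdk/lib/security/cacerts \
  --log-directory /tmp/mylogs \
  --create-if-not-exist \
  --target-catalog-uri 'thrift://metastore.example.com:9083'"

# Review the assembled command first:
echo "$CMD"
# Then submit it:
# eval "$CMD"
```

Note that the source folder path ends with `/`, as required for folder ingestion, and `--create-if-not-exist` is included because the example assumes the target table does not exist yet.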
  Where the parameters are as follows:

  - --source-data-files: Path to an S3 parquet or CSV file or folder. Folder paths must end with "/". File names are case-sensitive.
  - --target-table: Target table in the format <catalogname>.<schemaname>.<tablename>.
  - --ingestion-engine-endpoint: Ingestion engine endpoint, in the format hostname='',port='',type=spark. Type must be set to spark.
  - --log-directory: Specifies the location of log files.
  - --partition-by: Supports the years, months, days, and hours functions for timestamps in the partition-by list. If the target table already exists or the --create-if-not-exist parameter is not specified, partition-by has no effect on the data.
  - --trust-store-password: Password of the truststore certificate inside the Spark job pod. The current password for Spark in CPD and SaaS is changeit.
  - --trust-store-path: Path of the truststore certificate inside the Spark job pod. The current path for Spark in CPD and SaaS is file:///opt/ibm/jdk/lib/security/cacerts.
  - --target-catalog-uri: MDS thrift endpoint. CPD endpoint example: thrift://<metastore_host_value>. SaaS endpoint example: thrift://<metastore_host_value>. <metastore_host_value> is taken from the details tab of the catalog on the Infrastructure page.
  - --cert-file-path: Path to the certificate used to verify the CPD certificate.
  - --create-if-not-exist: Use this option if the target schema or table is not yet created. Do not use it if the target schema or table already exists.
  - --schema: Use this option with a value in the format path/to/csvschema/config/file: the path to a schema.cfg file that specifies header and delimiter values for a CSV source file or folder.
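For CSV sources, the --schema option points at a schema.cfg file describing the header and delimiter. The key names below (HEADER, DELIMITER) are an assumption based on the description above; check your release's documentation for the exact keys it accepts.

```shell
# Write a minimal schema.cfg for a comma-delimited CSV source with a
# header row. HEADER and DELIMITER key names are assumed, not confirmed.
cat > /tmp/schema.cfg <<'EOF'
[CSV]
DELIMITER:,
HEADER:TRUE
EOF

# Show the file that would be passed via: --schema /tmp/schema.cfg
cat /tmp/schema.cfg
```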
- Run the following command to ingest data from a config file:

  ibm-lh data-copy --ingest-config /<your_ingest_configfilename>

  Where the config file contains the following information:

  [global-ingest-config]
  target-tables:<catalog>.<schema>.<table>
  ingestion-engine:hostname='',port='',type=spark
  create-if-not-exist:true/false

  [ingest-config1]
  source-files:s3://path/to/file/or/folder
  target-catalog-uri:thrift://<hms_thrift_uri>
  trust-store-path:<truststore path>
  trust-store-password:<truststore password>
  log-directory:/tmp/mylogs
  partition-by:"<columnname1>, <columnname2>"
  cert-file-path:/root/ibm-lh-manual/ibm-lh/cert.pem
  schema:/path/to/csvschema/config/file [Optional]
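The config file above can be generated with a heredoc so the whole job definition lives in one reviewable script. Catalog, bucket, and metastore names here are hypothetical placeholders; the final `ibm-lh` line is only echoed, not executed.

```shell
# Generate an ingest config file, then point ibm-lh at it. The trust-store
# and thrift details must match your own cluster; these are placeholders.
cat > /tmp/my_ingest.cfg <<'EOF'
[global-ingest-config]
target-tables:iceberg_data.sales.orders
ingestion-engine:hostname='',port='',type=spark
create-if-not-exist:true

[ingest-config1]
source-files:s3://source-bucket/orders/
target-catalog-uri:thrift://metastore.example.com:9083
trust-store-path:file:///opt/ibm/jdk/lib/security/cacerts
trust-store-password:changeit
log-directory:/tmp/mylogs
EOF

# Review, then run the echoed command to start the job:
echo "ibm-lh data-copy --ingest-config /tmp/my_ingest.cfg"
```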
  The parameters used in the config file ingestion job are as follows:

  - source-files: Path to an S3 parquet or CSV file or folder. Folder paths must end with "/". File names are case-sensitive.
  - target-tables: Target table in the format <catalogname>.<schemaname>.<tablename>.
  - ingestion-engine: Ingestion engine endpoint, in the format hostname='',port='',type=spark. Type must be set to spark.
  - partition-by: Supports the years, months, days, and hours functions for timestamps in the partition-by list. If the target table already exists or the create-if-not-exist parameter is not specified, partition-by has no effect on the data.
  - trust-store-password: Password of the truststore certificate inside the Spark job pod. The current password for Spark in CPD and SaaS is changeit.
  - trust-store-path: Path of the truststore certificate inside the Spark job pod. The current path for Spark in CPD and SaaS is file:///opt/ibm/jdk/lib/security/cacerts.
  - target-catalog-uri: MDS thrift endpoint. CPD endpoint example: thrift://<metastore_host_value>. SaaS endpoint example: thrift://<metastore_host_value>. <metastore_host_value> is taken from the details tab of the catalog on the Infrastructure page.
  - cert-file-path: Path to the certificate used to verify the CPD certificate.
  - create-if-not-exist: Use this option if the target schema or table is not yet created. Do not use it if the target schema or table already exists.
  - schema: Use this option with a value in the format path/to/csvschema/config/file: the path to a schema.cfg file that specifies header and delimiter values for a CSV source file or folder.
  - log-directory: Specifies the location of log files.
Note: The ability to handle special characters in table and schema names during ingestion is constrained by the underlying engines (Spark, Presto) and their special character support.

- Regular syntax: --target-tables <catalogname>.<schemaname>.<tablename>
- Syntax with special character, option 1: --target-tables <catalogname>.<schemaname>."table\.name". With this syntax, the escape character \ is used within double quotes to escape the period (.). The escape character \ is used only when the period (.) appears in the table name.
- Syntax with special character, option 2: --target-tables <catalogname>.<schemaname>."'table.name'". With this syntax, the period (.) does not need to be escaped, because additional single quotes are used.
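Because these table names pass through the shell before reaching ibm-lh, the quoting itself needs one more layer of protection. This sketch (with a hypothetical catalog and schema) shows both special-character syntaxes as shell strings, echoed so you can verify what the tool would actually receive.

```shell
# Option 1: backslash-escaped period, protected by single quotes so the
# shell passes the backslash and double quotes through literally.
OPT1='iceberg_data.myschema."table\.name"'

# Option 2: additional single quotes inside escaped double quotes; the
# period needs no escaping in this form.
OPT2="iceberg_data.myschema.\"'table.name'\""

# What ibm-lh would see as the --target-tables value in each case:
echo "$OPT1"
echo "$OPT2"
```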