You can run the ibm-lh utility to ingest data into IBM® watsonx.data through the command-line interface (CLI) by using the IBM Analytics Engine (Spark) REST API. This CLI-based ingestion calls a REST endpoint to run the ingestion and is the default mode of ingestion. The commands to run an ingestion job are listed in this topic.
Applies to: watsonx.data Developer edition and watsonx.data on IBM Software Hub
Before you begin
- You must have the Administrator role and privileges in the catalog to do ingestion.
- Add and register IBM Analytics Engine (Spark). See Adding a Spark engine.
- Add a bucket for the target catalog. See Adding storage.
- Create a schema and table in the catalog for the data to be ingested. See Creating schemas and Creating tables.
Procedure
- Set the mandatory environment variable ENABLED_INGEST_MODE to SPARK before starting an ingestion job by running the following command:
export ENABLED_INGEST_MODE=SPARK
- Set the following environment variables before starting an ingestion job by running the
following commands:
export IBM_LH_SPARK_EXECUTOR_CORES=1
export IBM_LH_SPARK_EXECUTOR_MEMORY=2G
export IBM_LH_SPARK_EXECUTOR_COUNT=1
export IBM_LH_SPARK_DRIVER_CORES=1
export IBM_LH_SPARK_DRIVER_MEMORY=2G
export CPD_VERIFY_CERTIFICATE=<true/false>
| Environment variable name | Description |
| --- | --- |
| IBM_LH_SPARK_EXECUTOR_CORES | Optional Spark engine configuration setting for executor cores. |
| IBM_LH_SPARK_EXECUTOR_MEMORY | Optional Spark engine configuration setting for executor memory. |
| IBM_LH_SPARK_EXECUTOR_COUNT | Optional Spark engine configuration setting for executor count. |
| IBM_LH_SPARK_DRIVER_CORES | Optional Spark engine configuration setting for driver cores. |
| IBM_LH_SPARK_DRIVER_MEMORY | Optional Spark engine configuration setting for driver memory. |
| CPD_VERIFY_CERTIFICATE | To turn on certificate verification, set the variable to true. |
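To confirm that the variables are set in the current shell session, you can list them before you start the ingestion job, for example:
env | grep IBM_LH_SPARK
echo $ENABLED_INGEST_MODE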
- Run the following command to ingest data from one or more source data files:
ibm-lh data-copy --target-table iceberg_data.ice_schema.ytab \
--source-data-files "s3://lh-ingest/hive/warehouse/folder_ingestion/" \
--user admin \
--password **** \
--url https://cpd-cpd-instance.apps.cpd-ocp-wxd-fips.cp.fyre.ibm.com \
--instance-id 1719823250083405 \
--schema /home/nz/config/schema.cfg \
--engine-id spark214 \
--log-directory /tmp/mylogs \
--partition-by "<columnname1>, <columnname2>" \
--cert-file-path /root/ibm-lh-manual/ibm-lh/cert.pem \
--create-if-not-exist
Where the parameters used are as follows:

| Parameter | Description |
| --- | --- |
| --cert-file-path | Path of the certificate file used to verify the CPD certificate. This parameter is used when CPD_VERIFY_CERTIFICATE=true. |
| --create-if-not-exist | Use this option if the target schema or table is not yet created. Do not use it if the target schema or table already exists. |
| --engine-id | Engine ID of the Spark engine when using REST API based Spark ingestion. |
| --instance-id | Identifies unique instances. In SaaS environments, the CRN is the instance ID. |
| --log-directory | Specifies the location of log files. |
| --password | Password of the user connecting to the instance. In SaaS, the API key to the instance is used. |
| --partition-by | Supports the years, months, days, and hours functions for timestamps in the partition-by list. If the target table already exists or the --create-if-not-exist parameter is not specified, partition-by has no effect on the data. |
| --schema | Use this option with a value in the format path/to/csvschema/config/file, that is, the path to a schema.cfg file that specifies the header and delimiter values for a CSV source file or folder. See the illustrative schema.cfg sketch after this table. |
| --source-data-files | Path to an S3 Parquet or CSV file or folder. Folder paths must end with "/". File names are case-sensitive. |
| --target-table | Target table in the format <catalogname>.<schemaname>.<tablename>. |
| --user | User name of the user connecting to the instance. |
| --url | Base URL of the IBM® watsonx.data cluster. |
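As an illustration, a minimal schema.cfg for a comma-delimited CSV source with a header row could look like the following sketch; the section and key names shown here are assumptions based on the parameter description above, not a definitive format:
[CSV]
DELIMITER:,
HEADER:1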
Tip: ibm-lh data-copy returns the value 0 when the ingestion job is completed successfully. When the ingestion job fails, ibm-lh data-copy returns a nonzero value.
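For example, a wrapper shell script can branch on that exit code; this is an illustrative sketch with the ingestion arguments elided:
ibm-lh data-copy --target-table iceberg_data.ice_schema.ytab ...   # remaining arguments as shown above
if [ $? -eq 0 ]; then
    echo "Ingestion job completed successfully"
else
    echo "Ingestion job failed"
fi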
- Run the following command to get the status of the ingestion job:
ibm-lh get-status --job-id <Job-id> --instance-id <instance-id> --url <url> --user <user> --password <password>
Where the parameter used is as follows:

| Parameter | Description |
| --- | --- |
| --job-id <Job-id> | The job ID is generated when a REST API or UI based ingestion is initiated and is used to get the status of the ingestion job. This parameter is used only with the ibm-lh get-status command. The short form of this parameter is -j. |
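For example, reusing the illustrative instance details from the ingestion command above, where <Job-id> is the job ID returned when the ingestion was initiated:
ibm-lh get-status --job-id <Job-id> \
  --instance-id 1719823250083405 \
  --url https://cpd-cpd-instance.apps.cpd-ocp-wxd-fips.cp.fyre.ibm.com \
  --user admin \
  --password ****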
- Run the following command to get the history of all ingestion jobs:
ibm-lh get-status --all-jobs --instance-id <instance-id> --url <url> --user <user> --password <password>
Where the parameter used is as follows:

| Parameter | Description |
| --- | --- |
| --all-jobs | Returns the history of all ingestion jobs. This parameter is used only with the ibm-lh get-status command. |
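For example, again with the illustrative instance details used earlier:
ibm-lh get-status --all-jobs \
  --instance-id 1719823250083405 \
  --url https://cpd-cpd-instance.apps.cpd-ocp-wxd-fips.cp.fyre.ibm.com \
  --user admin \
  --password ****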
Note: get-status is supported with ibm-lh only in the interactive mode of ingestion.