Ingesting data through command line - Presto ingestion mode
This topic provides step-by-step instructions to ingest data into IBM® watsonx.data by using the command line with the Presto ingestion mode.
Applies to: watsonx.data Developer edition and watsonx.data on Red Hat® OpenShift®.
Before you begin
- Set the mandatory environment variable `ENABLED_INGEST_MODE` to `PRESTO` before starting an ingestion job by running the following command:

  ```
  export ENABLED_INGEST_MODE=PRESTO
  ```

- Set the environment variables `SOURCE_S3_CREDS` and `STAGING_S3_CREDS` based on your requirements before starting an ingestion job by running the following commands:

  ```
  export SOURCE_S3_CREDS="AWS_ACCESS_KEY_ID=xxxxxxx,AWS_SECRET_ACCESS_KEY=yyyyyyyy,ENDPOINT_URL=https://s3.jp-tok.cloud-object-storage.appdomain.cloud,AWS_REGION=us-east-1,BUCKET_NAME=cust-bucket"
  export STAGING_S3_CREDS="AWS_ACCESS_KEY_ID=xxxxxxx,AWS_SECRET_ACCESS_KEY=yyyyyyyy,ENDPOINT_URL=https://s3.jp-tok.cloud-object-storage.appdomain.cloud,AWS_REGION=us-east-1,BUCKET_NAME=cust-bucket"
  ```

- To ingest data, you need values for the parameters that are used, such as the access key, secret key, endpoint, and hostname. See Ingestion options and parameters supported in ibm-lh utility for details of the parameters and values applicable to your environment. A sketch that checks these variables before a run follows this list.
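Before starting a job, it can help to confirm that the prerequisite variables are actually set. The following is a minimal, hypothetical pre-flight sketch; it is plain shell scripting, not part of the ibm-lh utility, and checks only the variables named above:

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check, not part of the ibm-lh utility: confirm the
# prerequisite variables are set before starting a Presto-mode ingestion job.
for var in ENABLED_INGEST_MODE SOURCE_S3_CREDS STAGING_S3_CREDS; do
  # ${!var:-} expands the variable whose name is stored in var (bash only).
  if [ -z "${!var:-}" ]; then
    echo "error: $var is not set" >&2
    exit 1
  fi
done

# Presto mode requires exactly this value.
if [ "$ENABLED_INGEST_MODE" != "PRESTO" ]; then
  echo "error: ENABLED_INGEST_MODE must be PRESTO" >&2
  exit 1
fi
echo "environment looks ready for ingestion"
```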
About this task
The following command line options ingest data files from an S3 or local location into a watsonx.data Iceberg table:
- Ingest a single CSV or Parquet file from an S3 location.

  To ingest a single CSV or Parquet file from an S3 location, run the following command:

  ```
  ibm-lh data-copy --source-data-files SOURCE_DATA_FILE \
  --staging-location s3://lh-target/staging \
  --target-table TARGET_TABLES \
  --ingestion-engine-endpoint INGESTION_ENGINE_ENDPOINT \
  --dbuser DBUSER \
  --dbpassword DBPASSWORD \
  --create-if-not-exist
  ```

  For example:

  ```
  ibm-lh data-copy --source-data-files s3://cust-bucket/warehouse/a_source_file.parquet \
  --staging-location s3://cust-bucket/warehouse/staging/ \
  --target-table iceberg_target_catalog.ice_schema.cust_tab1 \
  --ingestion-engine-endpoint "hostname=localhost,port=8080" \
  --create-if-not-exist
  ```
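  The same call can be parameterized in a small wrapper script, which is convenient when the file name changes per run. This is an illustrative sketch; `SOURCE_FILE`, `STAGING`, and `TARGET_TABLE` are hypothetical shell variables, not ibm-lh options:

  ```bash
  # Illustrative wrapper around the single-file command above; the three
  # shell variables are hypothetical names, not ibm-lh options.
  SOURCE_FILE="s3://cust-bucket/warehouse/a_source_file.parquet"
  STAGING="s3://cust-bucket/warehouse/staging/"
  TARGET_TABLE="iceberg_target_catalog.ice_schema.cust_tab1"

  ibm-lh data-copy --source-data-files "$SOURCE_FILE" \
    --staging-location "$STAGING" \
    --target-table "$TARGET_TABLE" \
    --ingestion-engine-endpoint "hostname=localhost,port=8080" \
    --create-if-not-exist
  ```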
- Ingest multiple CSV or Parquet files and CSV folders from an S3 location.

  To ingest multiple CSV or Parquet files from an S3 location, run the following command:

  ```
  ibm-lh data-copy --source-data-files SOURCE_DATA_FILE \
  --staging-location s3://lh-target/staging \
  --target-table TARGET_TABLES \
  --ingestion-engine-endpoint INGESTION_ENGINE_ENDPOINT \
  --dbuser DBUSER \
  --dbpassword DBPASSWORD \
  --create-if-not-exist
  ```

  For example:

  ```
  ibm-lh data-copy --source-data-files s3://cust-bucket/warehouse/a_source_file1.csv,s3://cust-bucket/warehouse/a_source_file2.csv \
  --staging-location s3://cust-bucket/warehouse/staging/ \
  --target-table iceberg_target_catalog.ice_schema.cust_tab1 \
  --ingestion-engine-endpoint "hostname=localhost,port=8080" \
  --create-if-not-exist
  ```

  ```
  ibm-lh data-copy --source-data-files s3://cust-bucket/warehouse/ \
  --staging-location s3://cust-bucket/warehouse/staging/ \
  --target-table iceberg_target_catalog.ice_schema.cust_tab1 \
  --ingestion-engine-endpoint "hostname=localhost,port=8080" \
  --create-if-not-exist
  ```
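  When the file list is long, the comma-separated `--source-data-files` value can be assembled programmatically instead of typed by hand. A sketch, assuming the AWS CLI is installed and configured for the same endpoint and credentials; the bucket and prefix are the example values used above:

  ```bash
  # Sketch, assuming a configured AWS CLI: collect every CSV key under the
  # prefix and join the keys with commas for --source-data-files.
  files=$(aws s3 ls s3://cust-bucket/warehouse/ --recursive \
    | awk '{print $4}' \
    | grep '\.csv$' \
    | sed 's|^|s3://cust-bucket/|' \
    | paste -sd, -)

  ibm-lh data-copy --source-data-files "$files" \
    --staging-location s3://cust-bucket/warehouse/staging/ \
    --target-table iceberg_target_catalog.ice_schema.cust_tab1 \
    --ingestion-engine-endpoint "hostname=localhost,port=8080" \
    --create-if-not-exist
  ```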
- Ingest all Parquet files in a folder from an S3 location.

  To ingest all Parquet files in a folder from an S3 location, run the following command:

  ```
  ibm-lh data-copy --source-data-files SOURCE_DATA_FILE \
  --target-table TARGET_TABLES \
  --ingestion-engine-endpoint INGESTION_ENGINE_ENDPOINT \
  --dbuser DBUSER \
  --dbpassword DBPASSWORD \
  --create-if-not-exist
  ```

  For example:

  ```
  ibm-lh data-copy --source-data-files s3://cust-bucket/warehouse/ \
  --target-table iceberg_target_catalog.ice_schema.cust_tab1 \
  --ingestion-engine-endpoint "hostname=localhost,port=8080" \
  --create-if-not-exist
  ```

  Note: In general, this option does not require a staging location. However, in a few exceptional scenarios a staging location must be specified. When the staging location is not used, make sure that the Hive catalog that is configured with Presto can access the `--source-data-files` location. A staging location is required in the following cases (see the sketch after this list):

  - Any or all Parquet files in the folder are huge.
  - Any or all Parquet files in the folder have special columns, such as TIME.
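  To decide whether a staging location is needed, one option is to check the Parquet file sizes up front. A sketch, assuming the AWS CLI can list the bucket and using an arbitrary 1 GiB cutoff for "huge"; tune the threshold for your environment:

  ```bash
  # Sketch: add a staging location only if any Parquet file under the prefix
  # exceeds a cutoff. The 1 GiB threshold is an arbitrary assumption.
  # Requires bash (process substitution) and a configured AWS CLI.
  THRESHOLD=$((1024 * 1024 * 1024))
  STAGING_ARGS=""

  while read -r _date _time size key; do
    case "$key" in
      *.parquet)
        if [ "$size" -gt "$THRESHOLD" ]; then
          STAGING_ARGS="--staging-location s3://cust-bucket/warehouse/staging/"
          break
        fi
        ;;
    esac
  done < <(aws s3 ls s3://cust-bucket/warehouse/ --recursive)

  # $STAGING_ARGS is intentionally unquoted so it expands to nothing when empty.
  ibm-lh data-copy --source-data-files s3://cust-bucket/warehouse/ \
    --target-table iceberg_target_catalog.ice_schema.cust_tab1 \
    --ingestion-engine-endpoint "hostname=localhost,port=8080" \
    $STAGING_ARGS \
    --create-if-not-exist
  ```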
- Ingest a CSV or Parquet file or folder from a local file system.

  To ingest a single Parquet file or a folder from a local location, run the following command:

  ```
  ibm-lh data-copy --source-data-files SOURCE_DATA_FILE \
  --staging-location s3://lh-target/staging \
  --target-table TARGET_TABLES \
  --ingestion-engine-endpoint INGESTION_ENGINE_ENDPOINT \
  --dbuser DBUSER \
  --dbpassword DBPASSWORD \
  --create-if-not-exist
  ```

  For example:

  ```
  ibm-lh data-copy --source-data-files /cust-bucket/warehouse/a_source_file1.parquet \
  --staging-location s3://cust-bucket/warehouse/staging/ \
  --target-table iceberg_target_catalog.ice_schema.cust_tab1 \
  --ingestion-engine-endpoint "hostname=localhost,port=8080" \
  --create-if-not-exist
  ```

  ```
  ibm-lh data-copy --source-data-files /cust-bucket/warehouse/ \
  --staging-location s3://cust-bucket/warehouse/staging/ \
  --target-table iceberg_target_catalog.ice_schema.cust_tab1 \
  --ingestion-engine-endpoint "hostname=localhost,port=8080" \
  --create-if-not-exist
  ```
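  If each file must be ingested individually, for example to isolate a failing file, a plain shell loop over the folder works. An illustrative sketch that reuses the paths from the example above; the loop is shell scripting, not an ibm-lh feature:

  ```bash
  # Illustrative sketch: ingest local Parquet files one at a time so a single
  # bad file does not abort the whole batch. Paths follow the example above.
  for f in /cust-bucket/warehouse/*.parquet; do
    ibm-lh data-copy --source-data-files "$f" \
      --staging-location s3://cust-bucket/warehouse/staging/ \
      --target-table iceberg_target_catalog.ice_schema.cust_tab1 \
      --ingestion-engine-endpoint "hostname=localhost,port=8080" \
      --create-if-not-exist \
      || echo "warning: ingestion failed for $f" >&2
  done
  ```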
- Ingest any data file from a local file system.

  To ingest any data file from a local location, run the following command.

  Note: To ingest any type of data file from a local file system, the data files must first be copied to the ~/ibm-lh-client/localstorage/volumes/ibm-lh directory. You can then access the data files from the /ibmlhdata/ directory by using the `ibm-lh data-copy` command. A sketch of this copy step follows the example.

  ```
  ibm-lh data-copy --source-data-files SOURCE_DATA_FILE \
  --staging-location s3://lh-target/staging \
  --target-table TARGET_TABLES \
  --staging-hive-catalog <catalog_name> \
  --schema <SCHEMA> \
  --ingestion-engine-endpoint INGESTION_ENGINE_ENDPOINT \
  --trust-store-path <TRUST_STORE_PATH> \
  --trust-store-password <TRUST_STORE_PASSWORD> \
  --dbuser DBUSER \
  --dbpassword DBPASSWORD \
  --create-if-not-exist
  ```

  For example:

  ```
  ibm-lh data-copy --source-data-files /ibmlhdata/reptile.csv \
  --staging-location s3://watsonx.data/staging \
  --target-table iceberg_data.ivt_sanity_test_1.reptile \
  --staging-hive-catalog hive_test \
  --schema /ibmlhdata/schema.cfg \
  --ingestion-engine-endpoint "hostname=ibm-lh-lakehouse-presto-01-presto-svc-cpd-instance.apps.ivt384.cp.fyre.ibm.com,port=443" \
  --trust-store-path /mnt/infra/tls/aliases/ibm-lh-lakehouse-presto-01-presto-svc-cpd-instance.apps.ivt384.cp.fyre.ibm.com:443.crt \
  --trust-store-password changeit \
  --dbuser xxxx \
  --dbpassword xxxx \
  --create-if-not-exist
  ```
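  The copy step from the note is an ordinary file copy into the client volume directory. A minimal sketch; /path/to/reptile.csv is a placeholder for your own local file:

  ```bash
  # Sketch of the copy step from the note above; the source path is a
  # placeholder for your local file.
  cp /path/to/reptile.csv ~/ibm-lh-client/localstorage/volumes/ibm-lh/

  # The utility then addresses the file under /ibmlhdata/, for example:
  #   ibm-lh data-copy --source-data-files /ibmlhdata/reptile.csv ...
  ```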
- Ingest CSV, local Parquet, or S3 Parquet files that use a staging location.

  To ingest CSV, local Parquet, or S3 Parquet files that use a staging location, run the following command:

  ```
  ibm-lh data-copy --source-data-files SOURCE_DATA_FILE \
  --staging-location s3://lh-target/staging \
  --target-table TARGET_TABLES \
  --staging-hive-catalog <catalog_name> \
  --staging-hive-schema <schema_name> \
  --ingestion-engine-endpoint INGESTION_ENGINE_ENDPOINT \
  --trust-store-path <TRUST_STORE_PATH> \
  --trust-store-password <TRUST_STORE_PASSWORD> \
  --dbuser DBUSER \
  --dbpassword DBPASSWORD \
  --create-if-not-exist
  ```

  For example:

  ```
  ibm-lh data-copy --source-data-files s3://watsonx-data-0823-2/test_icos/GVT-DATA-C.csv \
  --staging-location s3://watsonx.data-staging \
  --target-table iceberg_data.test_iceberg.gvt_data_v \
  --staging-hive-catalog staging_catalog \
  --staging-hive-schema staging_schema \
  --ingestion-engine-endpoint "hostname=ibm-lh-lakehouse-presto-01-presto-svc-cpd-instance.apps.ivt384.cp.fyre.ibm.com,port=443" \
  --trust-store-path /mnt/infra/tls/aliases/ibm-lh-lakehouse-presto-01-presto-svc-cpd-instance.apps.ivt384.cp.fyre.ibm.com:443.crt \
  --trust-store-password changeit \
  --dbuser xxxx \
  --dbpassword xxxx \
  --create-if-not-exist
  ```

  Here, `--staging-location` is `s3://watsonx.data-staging`. The `--staging-hive-catalog` value `staging_catalog` must be associated with the bucket `watsonx.data-staging`.
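  After a job completes, the target table can be checked with a quick row count. A sketch assuming the Presto CLI is installed and pointed at the same engine (shown here against the localhost example endpoint used earlier in this topic); this step is separate from the ibm-lh utility:

  ```bash
  # Verification sketch, assuming the Presto CLI is available and can reach
  # the same engine endpoint; not part of the ibm-lh utility.
  presto --server localhost:8080 \
    --catalog iceberg_data \
    --schema test_iceberg \
    --execute 'SELECT count(*) FROM gvt_data_v'
  ```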