Overview of data ingestion in watsonx.data

Data ingestion is the process of importing and loading data into IBM® watsonx.data. You can use the Ingestion jobs tab on the Data manager page to securely and easily load data through the watsonx.data console. Alternatively, you can load or ingest local data files to create tables by using the Create table option.


When you ingest a data file into watsonx.data, the table schema is inferred and generated when a query is run.

Data ingestion in watsonx.data supports CSV and Parquet formats. All files in a single ingestion job must have the same format and the same schema. watsonx.data automatically discovers the schema based on the source file that is being ingested.
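For example, if a job ingests several CSV files, each file must contain the same columns as this hypothetical orders file (the file name, columns, and values are illustrative only), and watsonx.data infers the table schema from them:

    order_id,order_date,customer_id,total_amount
    1001,2024-01-15,C-204,59.90
    1002,2024-01-16,C-311,120.00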

The following are requirements and limitations of the tool:

  • Schema evolution is not supported.
  • The target table must be an Iceberg format table.
  • Partitioning is not supported.
  • IBM Storage Ceph, IBM Cloud Object Storage (COS), AWS S3, and MinIO object storage are supported.
  • The pathStyleAccess property for object storage is not supported.
  • Only Parquet and CSV file formats are supported as source data files.

Loading or ingesting data through the CLI

An ingestion job in watsonx.data can be run with the ibm-lh tool. The tool must be pulled from the ibm-lh-client package and installed on the local system to run ingestion jobs through the CLI. For instructions to install the ibm-lh-client package and to use the ibm-lh tool for ingestion, see Installing ibm-lh-client and Setting up the ibm-lh command-line utility.
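As a minimal sketch, a CLI ingestion job can look like the following command. The bucket, table, and endpoint values are placeholders, and the option names can vary by release; confirm them in Setting up the ibm-lh command-line utility before you run the command.

    # Ingest a single Parquet file from S3 into an Iceberg table (all values are placeholders)
    ibm-lh data-copy \
      --source-data-files s3://my-bucket/sales/orders.parquet \
      --target-tables iceberg_data.sales_schema.orders \
      --ingestion-engine-endpoint "hostname=<presto-host>,port=<presto-port>" \
      --create-if-not-exist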

The ibm-lh tool supports the following features:
  • Auto-discovery of the schema based on the source file or target table.
  • Advanced table configuration options for CSV files:
    • Delimiter
    • Header
    • File encoding
    • Line delimiter
    • Escape characters
  • Ingestion of a single file, multiple files, or a single folder (without subfolders) of S3 or local Parquet files.
  • Ingestion of a single file, multiple files, or a single folder (without subfolders) of S3 or local CSV files (see the example after this list).
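The following sketch shows folder ingestion, where every CSV file in one S3 folder (subfolders are not read) is loaded into a single Iceberg table. The bucket, schema, table, and endpoint values are placeholders, and the option names should be verified against Setting up the ibm-lh command-line utility for your release.

    # Ingest all CSV files in one S3 folder into one Iceberg table (all values are placeholders)
    ibm-lh data-copy \
      --source-data-files "s3://my-bucket/landing/customers/" \
      --target-tables iceberg_data.crm_schema.customers \
      --ingestion-engine-endpoint "hostname=<presto-host>,port=<presto-port>" \
      --create-if-not-exist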