Overview of data ingestion in watsonx.data

Data ingestion is the process of importing and loading data into IBM® watsonx.data. In the user interface (UI) of watsonx.data, you can use the Ingest data module from the Data manager page to load data securely and easily. Alternatively, you can ingest local or remote data files to create tables by using the Create table from file option.

watsonx.data on IBM Software Hub

When you ingest a data file into watsonx.data, the table schema is discovered automatically from the source file, and the table schema is generated when a query is run. All ingested files must be in the same format and have the same schema.
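
Because the files must share one format and one schema, it can help to check a batch before you ingest it. The following is a minimal sketch, not part of watsonx.data: the function names are hypothetical, it compares only file extensions and CSV header rows (not column types), and it assumes UTF-8 files with a comma delimiter.

```python
import csv
from pathlib import Path


def csv_header(path, delimiter=","):
    """Return the first row of a CSV file as a tuple of column names."""
    with open(path, newline="", encoding="utf-8") as handle:
        return tuple(next(csv.reader(handle, delimiter=delimiter), ()))


def check_same_format_and_header(paths):
    """Return a list of problems; an empty list means the batch looks consistent.

    Hypothetical helper: it checks file extensions and, for .csv files,
    whether every header row matches the first file's header row.
    """
    paths = [Path(p) for p in paths]
    problems = []
    if not paths:
        return problems

    # All files must use the same format (approximated here by extension).
    suffixes = {p.suffix.lower() for p in paths}
    if len(suffixes) > 1:
        problems.append(f"mixed file formats: {sorted(suffixes)}")
        return problems

    # For CSV files, compare each header with the first file's header.
    if suffixes == {".csv"}:
        reference = csv_header(paths[0])
        for path in paths[1:]:
            if csv_header(path) != reference:
                problems.append(
                    f"{path.name}: header differs from {paths[0].name}"
                )
    return problems
```

An empty result only means that the extensions and CSV headers agree; it does not guarantee that watsonx.data infers identical column types for every file.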

Data ingestion has the following requirements and behavior:

  • Schema evolution is not supported.
  • The target table must be in Iceberg format.
  • IBM Storage Ceph, IBM Cloud Object Storage (COS), AWS S3, and MinIO object storage are supported.
  • The pathStyleAccess property for object storage is not supported.
  • Supported source file formats are .txt, .csv, Parquet, JSON, ORC, and Avro.
  • For local ingestion, the cumulative file size must not exceed 500 MB.
  • Parquet, JSON, ORC, and Avro files larger than 2 MB cannot be previewed, but they are ingested successfully.
  • JSON files with complex nested objects and arrays cannot be previewed in the UI.
  • Complex JSON files are ingested as is, which results in arrays as table entries. This format is not recommended for optimal data visualization and analysis.
  • Keys in JSON files must be enclosed in quotation marks for proper parsing and interpretation.
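
Two of the requirements above, the 500 MB cumulative limit for local ingestion and quoted JSON keys, can be checked locally before you start. This is a minimal sketch with hypothetical helper names. It assumes that 1 MB means 1024 × 1024 bytes, which the requirement does not state, and it uses Python's strict JSON parser as a stand-in for the parser in watsonx.data.

```python
import json
from pathlib import Path

# Assumption: 1 MB = 1024 * 1024 bytes. The documented limit is "500 MB".
MAX_LOCAL_BYTES = 500 * 1024 * 1024


def total_size(paths):
    """Return the cumulative size in bytes of the given local files."""
    return sum(Path(p).stat().st_size for p in paths)


def within_local_limit(paths, limit=MAX_LOCAL_BYTES):
    """Return True if the cumulative file size does not exceed the limit."""
    return total_size(paths) <= limit


def json_parse_error(text):
    """Return None if the text is strict JSON, otherwise the parser's message.

    Strict JSON requires every object key to be in double quotation marks,
    so a document such as {id: 1} is reported as an error.
    """
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return str(exc)
    return None
```

For example, `json_parse_error('{"id": 1}')` returns `None`, while `json_parse_error('{id: 1}')` returns an error message because the key is not quoted.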

Loading or ingesting data by using the CLI

Use the ./cpdctl wx-data ingestion command in IBM cpdctl for all ingestion use cases. For more information about how to use the IBM cpdctl CLI, see IBM cpdctl.

The ./cpdctl wx-data ingestion command supports the following features:

  • Auto-discovery of schema based on the source file or target table.
  • Advanced table configuration options for CSV files:
    • Delimiter
    • Header
    • File encoding
    • Line delimiter
    • Escape characters
  • Ingest a single file, multiple files, or a single folder (no subfolders) of S3 and local Parquet files.
  • Ingest a single file, multiple files, or a single folder (no subfolders) of S3 and local CSV files.
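
To illustrate what the CSV options and the single-folder rule mean in practice, the following sketch parses CSV data with Python's standard csv module and lists the files in one folder without descending into subfolders. It is an illustration only: the function and parameter names are hypothetical and are not cpdctl flags, and the simple line splitting assumes that no quoted field contains the line delimiter.

```python
import csv
from pathlib import Path


def parse_csv_bytes(data, *, delimiter=",", header=True, encoding="utf-8",
                    line_delimiter="\n", escape_char="\\"):
    """Parse raw CSV bytes and return (column_names, rows).

    Each keyword mirrors one of the CSV options listed above. The names
    are illustrative and do not correspond to cpdctl options.
    """
    text = data.decode(encoding)                        # file encoding
    lines = [ln for ln in text.split(line_delimiter) if ln]  # line delimiter
    reader = csv.reader(lines, delimiter=delimiter, escapechar=escape_char)
    rows = list(reader)
    if header and rows:
        return rows[0], rows[1:]                        # first row is the header
    return None, rows


def files_in_folder(folder, suffix):
    """Return the files directly inside a folder that have the given suffix.

    Subfolders are not traversed, matching the "single folder (no
    subfolders)" behavior described above.
    """
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == suffix.lower()
    )
```

For example, parsing `b"id|note\r\n1|a\\|b\r\n"` with `delimiter="|"` and `line_delimiter="\r\n"` yields the header `["id", "note"]` and one row, `["1", "a|b"]`, because the escape character keeps the second `|` as data instead of treating it as a field separator.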