Registering external data into watsonx.data

If you have pre-existing data (such as Iceberg, Delta, or Hudi tables) in an object store bucket, you can register it into IBM® watsonx.data and use it for running queries. To enable this feature, you must attach the appropriate catalog to the storage.

You can register tables in all three formats. For Iceberg tables, you can register pre-existing data at the bucket level. For Delta and Hudi tables, registration is currently supported only at the table level.

If other systems make external changes to Iceberg tables, you might need to sync the data on the watsonx.data side. To do this, use the sync feature.

For Hudi and Delta tables, explicit sync is unnecessary because the metadata pointer refers to the metadata folder, not to an individual metadata file. (By contrast, Iceberg requires a reference to the latest metadata.json file.)
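To illustrate why an Iceberg catalog can fall behind after an external write, the following minimal sketch picks the newest metadata file from a list of object keys. It assumes the common Iceberg naming patterns (`<version>-<uuid>.metadata.json` or `v<version>.metadata.json`); the keys and the function name `latest_metadata_file` are hypothetical and are not part of watsonx.data.

```python
import re

# Matches Iceberg-style metadata file names such as
# "00002-<uuid>.metadata.json" or "v3.metadata.json" (assumed naming).
_METADATA_RE = re.compile(r"^(?:v(\d+)|(\d+)-[^/]+)\.metadata\.json$")


def latest_metadata_file(keys):
    """Return the key with the highest metadata version, or None if none match."""
    best_key, best_version = None, -1
    for key in keys:
        name = key.rsplit("/", 1)[-1]  # file name without the folder prefix
        match = _METADATA_RE.match(name)
        if not match:
            continue  # skip manifests, snapshots, and other files
        version = int(match.group(1) or match.group(2))
        if version > best_version:
            best_key, best_version = key, version
    return best_key


# Hypothetical object keys for one table in a bucket.
keys = [
    "sales/orders/metadata/00000-aaa.metadata.json",
    "sales/orders/metadata/00001-bbb.metadata.json",
    "sales/orders/metadata/snap-123.avro",
    "sales/orders/metadata/00002-ccc.metadata.json",
]
print(latest_metadata_file(keys))
```

Each external commit writes a new metadata file, so a catalog that still points at an older file does not see the change until a sync moves the pointer. A folder-level pointer, as used for Hudi and Delta, does not have this problem.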

Registering and syncing external Iceberg data

To register and sync external Iceberg data into watsonx.data, complete the following steps:
  1. On the Infrastructure manager page, click Add component.
  2. Select the storage from the Storage section.
  3. Enter the storage details.
  4. Select Activate now.
  5. Select Catalog type as Apache Iceberg.
  6. Enter the catalog name.
  7. Click Create to create the storage.
  8. To pull changed data from a storage bucket into watsonx.data, go to the Infrastructure manager page, hover over the Apache Iceberg catalog, and click Sync metadata. Three options are available for the Mode, each shown with its corresponding possibility of metadata loss.

    The following are the three sync options:

    1. Register new objects only: Schemas, tables, and metadata created by external applications since the last sync operation are added to this catalog. Existing schemas and tables in this catalog are not modified.
    2. Update existing objects only: Schemas, tables, and metadata already present in this catalog are updated to match the current state found in the associated bucket, but are not deleted. Any other schemas, tables, and metadata in the associated bucket are ignored.
    3. Sync all objects: Synchronizes all the data, including updates to existing tables that were promoted earlier. Removal of objects is not synced.
  9. After the synchronization is complete, go to the Data manager to see the catalog that you created and the tables that are pulled from the bucket.
  10. Go to the Query workspace and use these tables to run SELECT queries and to insert data into the existing tables.
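The three sync modes in step 8 can be pictured as simple operations on two mappings. The following sketch is a simplified model only, not the watsonx.data implementation: `catalog` and `bucket` are hypothetical dictionaries that map a table name to a metadata version, and the `sync` function and `SyncMode` names are illustrative.

```python
from enum import Enum


class SyncMode(Enum):
    REGISTER_NEW = "Register new objects only"
    UPDATE_EXISTING = "Update existing objects only"
    SYNC_ALL = "Sync all objects"


def sync(catalog, bucket, mode):
    """Return the catalog state after a sync (table name -> metadata version)."""
    result = dict(catalog)  # start from the current catalog; never mutate the input
    for table, version in bucket.items():
        is_new = table not in catalog
        if is_new and mode in (SyncMode.REGISTER_NEW, SyncMode.SYNC_ALL):
            result[table] = version  # add a table created externally
        elif not is_new and mode in (SyncMode.UPDATE_EXISTING, SyncMode.SYNC_ALL):
            result[table] = version  # refresh a table the catalog already knows
    # Tables that are missing from the bucket stay in the result:
    # removal of objects is not synced in any mode.
    return result


catalog = {"orders": 1, "customers": 1}
bucket = {"orders": 2, "returns": 1}  # "customers" was removed externally

for mode in SyncMode:
    print(mode.value, "->", sync(catalog, bucket, mode))
```

In this model, `customers` remains in the catalog under every mode, which matches the statement that removal of objects is not synced.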

Related API: For information about the related API, see External Iceberg table registration.

Registering external Hudi and Delta Lake data

To register external Hudi and Delta Lake data into watsonx.data, complete the following steps:
  1. On the Infrastructure manager page, click Add component.
  2. Select the storage from the Storage section.
  3. Enter the storage details.
  4. Select Activate now.
  5. Based on the table format, select one of the following Catalog types:
    • Apache Hudi
    • Delta Lake
  6. Enter the catalog name.
  7. Click Create to create the storage.
  8. Register and load the tables by using the Register table and Load table metadata APIs.
    Note: To register the tables, you must provide the exact location of the metadata folder. The schema is inferred based on the path in the location URL.
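As a rough illustration of the note above, the following sketch builds a metadata folder location from a table location and splits the path into its parts. It rests on several assumptions that the source does not confirm: that the metadata folder is the format's standard one (`_delta_log` for Delta Lake, `.hoodie` for Hudi), that the path is laid out as `<schema>/<table>`, and that the schema is the segment before the table folder. The `describe_location` helper and the example bucket are hypothetical; check the API documentation for the actual rules.

```python
from urllib.parse import urlparse

# Standard metadata folder names used by each table format.
METADATA_FOLDER = {"delta": "_delta_log", "hudi": ".hoodie"}


def describe_location(table_location, table_format):
    """Split a table location into bucket, schema, table, and metadata folder."""
    parsed = urlparse(table_location)
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        raise ValueError("expected at least <schema>/<table> in the path")
    # Assumption: the last two path segments are the schema and the table.
    schema, table = parts[-2], parts[-1]
    metadata = f"{table_location.rstrip('/')}/{METADATA_FOLDER[table_format]}"
    return {
        "bucket": parsed.netloc,
        "schema": schema,
        "table": table,
        "metadata_location": metadata,
    }


# Hypothetical table location in an object store bucket.
print(describe_location("s3a://my-bucket/sales/orders", "delta"))
```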