Ingesting data from an object storage bucket

Important: IBM Cloud Pak® for Data Version 4.7 will reach end of support (EOS) on 31 July 2025. For more information, see the Discontinuance of service announcement for IBM Cloud Pak for Data Version 4.X.

Upgrade to IBM Software Hub Version 5.1 before IBM Cloud Pak for Data Version 4.7 reaches end of support. For more information, see Upgrading IBM Software Hub in the IBM Software Hub Version 5.1 documentation.

In this tutorial, you learn how to move data into a data lake or an object storage bucket and load the data files into Presto. You also learn how to optimize the file format, choose a table format, and run complex SQL queries in watsonx.data.

Before you begin

This tutorial requires:

  • A subscription to watsonx.data on cloud.
  • The configuration details of the data bucket that you bring. These details are required to establish a connection with watsonx.data.
  • A data bucket that contains data.

About this task

Scenario: You need to run SQL queries on data files that are in your object storage bucket. To do so, you must attach the data files in your object storage bucket to Presto. You can also convert the data to an analytics-optimized format such as Parquet or ORC to enhance query performance and reduce server and storage resource consumption. You can then run SQL queries against the tables that you create.
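
For example, after the data is attached and the tables are created, you can run standard SQL against them. The following query is illustrative; the catalog, schema, and table names match the examples that are used later in this tutorial:

  SELECT "CITY", COUNT(*) AS branch_count
  FROM hive.gosales.branch
  GROUP BY "CITY"
  ORDER BY branch_count DESC;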

The objectives of this tutorial are as follows:
  • Creating infrastructure within the watsonx.data service.

  • Establishing a connection with the customer data bucket.

  • Querying data from the bucket.

Procedure

  1. Uploading data into an object storage bucket and attaching it to Presto

    In this section of the tutorial, you manage data in an object storage bucket, attach the bucket to the Hive metastore (HMS), and associate it with the Presto engine.

    1. Access an object storage tool, such as S3 Browser, the AWS S3 console, direct S3 APIs, or one of the various CLI or UI object storage tools.

    2. Load data files into your object storage bucket by using the tool.

    3. Register and attach the object storage bucket to HMS and associate it with the Presto engine by using the watsonx.data UI.

    4. Alternatively, you can register and attach an object storage bucket that already contains data to HMS.
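
    After the bucket is attached, you can verify that its catalog and schemas are visible to Presto. The following statements are a minimal check; they assume that the catalog that is associated with the bucket is named hive, as in the examples later in this tutorial:

      SHOW CATALOGS;
      SHOW SCHEMAS FROM hive;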

  2. Load data files into Presto

    After you attach the object storage bucket to HMS, you need to load the data files into Presto by creating a schema and external tables through the Hive connector.

    1. Run the following command to create a schema for the data that you want to access.
      CREATE SCHEMA <SCHEMA_NAME> WITH ( location = '<SCHEMA_LOCATION>' );

      For example:

      CREATE SCHEMA hive.gosales WITH ( location = 's3a://lhbeta/gosales' );
    2. Run the following command to create a table by using an external location that points to an uploaded data file.
      CREATE TABLE IF NOT EXISTS <TABLE_NAME> ("<COLUMN_NAMES>" <DATA_TYPE>) WITH ( format = '<DATA_FORMAT>', external_location = '<DATA_FILE_LOCATION>' );

      For example:

      CREATE TABLE IF NOT EXISTS hive.gosales.branch ("BRANCH_CODE" int, "ADDRESS1" varchar, "ADDRESS1_MB" varchar, "ADDRESS2" varchar, "ADDRESS2_MB" varchar, "CITY" varchar, "CITY_MB" varchar, "PROV_STATE" varchar, "PROV_STATE_MB" varchar, "POSTAL_ZONE" varchar, "COUNTRY_CODE" int, "ORGANIZATION_CODE" varchar, "WAREHOUSE_BRANCH_CODE" int) WITH ( format = 'CSV', external_location = 's3a://lhbeta/gosales/branch' );
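
      To confirm that the external table reads the uploaded data, you can run a quick test query against it. This check is illustrative:

      SELECT * FROM hive.gosales.branch LIMIT 10;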
  3. Generate statistics with the ANALYZE command

    If you want to use the data as-is, without creating a new copy in a different table format or applying more table optimizations, you can generate statistics alone by using the ANALYZE command.

    1. To generate statistics, run the following ANALYZE command:
      ANALYZE <TABLE_NAME>;

      For example:

      ANALYZE hive.gosales.branch;
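
      To inspect the statistics that were collected, you can use the SHOW STATS statement, shown here against the example table:

      SHOW STATS FOR hive.gosales.branch;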
  4. Convert data to analytics-optimized formats (Optional)

    You can use the data to create a different table format and apply more table optimizations. It is recommended that you convert the data files to an analytics-optimized format, such as Parquet or ORC, to improve query performance and reduce server and storage resource consumption. A table format like Iceberg can provide further performance improvements and features such as snapshots, time travel, and transactional support for insert, update, and delete operations.

    1. To convert a table with data in CSV format to Parquet format, run the CREATE TABLE AS command:

      CREATE TABLE IF NOT EXISTS
      <NEW_TABLE_NAME>
      WITH ( format = 'PARQUET')
      AS
      SELECT *
      FROM <SOURCE_TABLE_NAME>;

      For example:

      CREATE TABLE IF NOT EXISTS
      hive.default.branch
      WITH ( format = 'PARQUET')
      AS
      SELECT *
      FROM hive.gosales.branch;
    2. To change the table format to Iceberg, run the CREATE TABLE AS command with a target table in an Iceberg catalog:
      CREATE TABLE IF NOT EXISTS
      <ICEBERG_CATALOG>.<SCHEMA_NAME>.<TABLE_NAME>
      WITH ( format = 'PARQUET')
      AS
      SELECT *
      FROM <SOURCE_TABLE_NAME>;

      For example:

      CREATE TABLE IF NOT EXISTS
      "iceberg-beta".default.branch
      WITH ( format = 'PARQUET')
      AS
      SELECT *
      FROM hive.gosales.branch;
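
      After the Iceberg table is created, you can explore features such as its snapshot history through the connector's metadata tables. The following query is a sketch that assumes the Presto Iceberg connector's $snapshots metadata table is available:

      SELECT * FROM "iceberg-beta".default."branch$snapshots";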
    Note:

    You can also include additional SQL in the SELECT clause for transformation or conversion business logic, or sort the data for optimized access. You can also add column partitions for more performance improvements, as shown in the example after this note.
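
    For example, the following statement is a minimal sketch of a partitioned conversion. It assumes the Hive connector's partitioned_by table property, and the table name hive.default.branch_by_country is illustrative. In the Hive connector, partition columns must be the last columns in the table, so the partition column is selected last:

      CREATE TABLE IF NOT EXISTS
      hive.default.branch_by_country
      WITH ( format = 'PARQUET', partitioned_by = ARRAY['country_code'] )
      AS
      SELECT "BRANCH_CODE" AS branch_code, "CITY" AS city, "COUNTRY_CODE" AS country_code
      FROM hive.gosales.branch;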

    Note: Statistics are generated automatically when the new table is created.