Important: IBM Cloud Pak® for Data Version 4.7 will reach end of support (EOS) on 31 July 2025. For more information, see the Discontinuance of service announcement for IBM Cloud Pak for Data Version 4.X.
Upgrade to IBM Software Hub Version 5.1 before IBM Cloud Pak for Data Version 4.7 reaches end of support. For more information, see Upgrading IBM Software Hub in the IBM Software Hub Version 5.1 documentation.
In this tutorial, you learn to move data into a data lake or an object storage bucket and load the data files into Presto. You also learn to optimize the file format, choose a table format, and run complex SQL queries in watsonx.data.
Before you begin
This tutorial requires:
- A subscription to watsonx.data on cloud.
- The configuration details of the data bucket that you bring. These details are required to establish a connection with watsonx.data.
- A data bucket that contains data.
About this task
Scenario: You need to run SQL queries on data files that are in your object storage bucket. To do this, you must attach the data files in your object storage bucket to Presto. You can also convert the data to an analytics-optimized format, such as Parquet or ORC, to enhance query performance and reduce server and storage resource consumption. You can then run SQL queries against the tables that you create.
The objectives of this tutorial are as follows:
- Upload data into an object storage bucket and attach it to Presto.
- Load data files into Presto.
- Generate statistics with the analyze command.
- Optionally, convert data to analytics-optimized formats.
Procedure
- Upload data into an object storage bucket and attach it to Presto
In this section of the tutorial, you manage data in an object storage bucket, attach the bucket to the Hive Metastore (HMS), and associate it with a Presto engine.
- Access an object storage tool, such as S3 Browser, the AWS S3 console, direct S3 APIs, or other CLI or UI object storage tools.
- Load data files into your object storage bucket by using the tool.
- Register and attach the object storage bucket to HMS and associate it with a Presto engine by using the watsonx.data UI.
- Alternatively, you can register and attach an object storage bucket that contains pre-existing data to HMS.
- Load data files into Presto
After you attach the object storage bucket to HMS, load the data files into Presto by creating schemas and external tables through the Hive connector.
- Run the following command to create a schema for the data that you want to access.
CREATE SCHEMA <SCHEMA_NAME> WITH ( location = '<SCHEMA_LOCATION>' );
For example:
CREATE SCHEMA hive.gosales WITH ( location = 's3a://lhbeta/gosales' );
- Run the following command to create a table by using an external location that points to an uploaded data file.
CREATE TABLE IF NOT EXISTS <TABLE_NAME> ("<COLUMN_NAMES>" <DATA_TYPE>) WITH ( format = '<DATA_FORMAT>', external_location = '<DATA_FILE_LOCATION>' );
For example:
CREATE TABLE IF NOT EXISTS hive.gosales.branch ("BRANCH_CODE" int, "ADDRESS1" varchar, "ADDRESS1_MB" varchar, "ADDRESS2" varchar, "ADDRESS2_MB" varchar, "CITY" varchar, "CITY_MB" varchar, "PROV_STATE" varchar, "PROV_STATE_MB" varchar, "POSTAL_ZONE" varchar, "COUNTRY_CODE" int, "ORGANIZATION_CODE" varchar, "WAREHOUSE_BRANCH_CODE" int) WITH ( format = 'CSV', external_location = 's3a://lhbeta/gosales/branch' );
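Typing a long column list by hand is error-prone for wide tables. As an illustration only (the helper name and the all-varchar column typing are assumptions, not part of watsonx.data), a small Python sketch can derive the CREATE TABLE statement from a CSV header row:

```python
import csv
import io


def csv_ddl(table_name: str, csv_text: str, location: str) -> str:
    """Build a CREATE TABLE statement from the header row of a CSV file.

    Every column defaults to varchar; adjust individual types
    (for example, int for numeric codes) before running the statement.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    cols = ", ".join(f'"{name}" varchar' for name in header)
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} ({cols}) "
        f"WITH ( format = 'CSV', external_location = '{location}' )"
    )


# Sample two-column header for illustration (not the full tutorial data set):
sample = "BRANCH_CODE,CITY\n1,Amsterdam\n"
print(csv_ddl("hive.gosales.branch", sample, "s3a://lhbeta/gosales/branch"))
```

Run the printed statement in Presto as in the preceding example, after adjusting the column types as needed.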
- Generate statistics with the analyze command
If you want to use the data as is, without creating a new copy in a different table format or applying more table optimizations, you can generate statistics alone with the analyze command.
- To generate statistics, run the following analyze command:
analyze <TABLE_NAME>;
For example:
analyze hive.gosales.branch;
- Convert data to analytics optimized formats (Optional)
You can use the data to create a different table format and apply more table optimizations. It is recommended that you convert the data files to an analytics-optimized format, such as Parquet or ORC, to improve query performance and reduce server and storage resource consumption. A table format like Iceberg can provide further performance improvements and features such as snapshots, time travel, and transactional support for insert, update, and delete operations.
- To convert a table with data in CSV format to Parquet format, run the CREATE TABLE AS command:
CREATE TABLE IF NOT EXISTS
<TARGET_TABLE_NAME>
WITH ( format = 'PARQUET')
AS
SELECT *
FROM <SOURCE_TABLE_NAME>;
For example:
CREATE TABLE IF NOT EXISTS
hive.default.branch
WITH ( format = 'PARQUET')
AS
SELECT *
FROM hive.gosales.branch;
- To change the table format to Iceberg, run the CREATE TABLE AS command:
CREATE TABLE IF NOT EXISTS
<TARGET_TABLE_NAME>
WITH ( format = 'PARQUET')
AS
SELECT *
FROM <SOURCE_TABLE_NAME>;
For example:
CREATE TABLE IF NOT EXISTS
"iceberg-beta".default.branch
WITH ( format = 'PARQUET')
AS
SELECT *
FROM hive.gosales.branch;
Note: You can also include additional SQL in the SELECT clause for any transformation or conversion business logic, or sort the data for optimized access. You can also add column partitions for further performance improvements.
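As an illustration of adding column partitions during conversion, the following sketch composes a partitioned CREATE TABLE AS statement. The helper name, the catalog name iceberg_data, and the choice of partition column are assumptions for this example; the `partitioning` table property is specific to the Iceberg connector (the Hive connector uses `partitioned_by` instead, and requires partition columns to come last in the column list).

```python
def partitioned_ctas(target: str, source: str, partition_cols: list[str]) -> str:
    """Compose a CREATE TABLE AS statement that adds column partitions.

    Assumes the Iceberg connector's `partitioning` table property,
    which takes an ARRAY of column names to partition by.
    """
    parts = ", ".join(f"'{col}'" for col in partition_cols)
    return (
        f"CREATE TABLE IF NOT EXISTS {target} "
        f"WITH ( format = 'PARQUET', partitioning = ARRAY[{parts}] ) "
        f"AS SELECT * FROM {source}"
    )


# Hypothetical catalog and partition column for illustration:
print(partitioned_ctas("iceberg_data.default.branch",
                       "hive.gosales.branch",
                       ["COUNTRY_CODE"]))
```

Choose partition columns with low to moderate cardinality so that queries filtering on them can skip whole partitions without creating many small files.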
Note: Statistics are automatically generated as part of the ingestion of data into the new table.