Analyzing data stored on S3 or S3 compatible object store services (Db2 Big SQL)

You can use Db2 Big SQL to access data that is stored on private or public cloud object stores that are accessible through an S3 compatible interface, and then run SQL analytics on that data.

Before you begin

You must have a data file that is stored in a bucket in an object store service such as IBM Cloud Object Storage.

Configuring Db2 Big SQL to connect to an S3 or S3 compatible object store service

The Db2 Big SQL instance must be configured to access data that is stored on an S3 compatible object store service. This configuration is done when an IBM® Software Hub administrator provisions the Db2 Big SQL instance.

When the provisioning process reaches the stage for connecting to a remote data source, ask the administrator to complete the following steps.

  1. Select the Configure object store checkbox.
  2. In the Object store service endpoint box, enter an endpoint for the S3 compatible object store service that you want to connect to.

    For example, s3.eu-gb.cloud-object-storage.appdomain.cloud.

  3. In the Object store access ID box, enter an access key for the S3 compatible object store service.
  4. In the Object store service secret box, enter the matching secret key.
  5. If you want to restrict access to a single bucket, select the Specify the object store bucket name checkbox, and provide the bucket name.

    If the checkbox is not selected, Db2 Big SQL can access any bucket that is accessible through the provided endpoint and credentials.

Figure 1. Specifying object store configuration information
Object store dialog where configuration information is entered.
Important: The administrator who provisions the Db2 Big SQL instance does not have JDBC access to it. The administrator must grant access to other users, and those users can then access the instance.

Identifying the JDBC URL to connect to Db2 Big SQL

Do the following steps to identify the JDBC URL that is needed to connect to Db2 Big SQL.

  1. As a user who was granted access to the Db2 Big SQL instance, log in to the IBM Software Hub web interface.
  2. Open the Instances page and locate the Db2 Big SQL instance that was configured to access the object store.
    Tip: If you do not see the Db2 Big SQL instance in the table of instances, you probably have not yet been granted access by the IBM Software Hub administrator or by the user who created the Db2 Big SQL instance.
  3. Click the Db2 Big SQL instance name.

    The Db2 Big SQL instance details page opens.

  4. In the Access information section, check that the JDBC URL field shows the External host name • Non-SSL parameters.

    If you do not see External host name • Non-SSL, click the Edit icon and update the parameters.

    Figure 2. Db2 Big SQL instance page
    Page that shows details about the Db2 Big SQL instance, such as instance name, version, and access information.
  5. Copy the JDBC URL.

    The URL is needed in the next section.

  6. Log out of the IBM Software Hub web interface.

Creating a table from data that is stored on an object store service

From an interactive JDBC client, such as a Jupyter notebook, you can create a table from data that is stored in a bucket on the object store service that the Db2 Big SQL instance is connected to. Connect to Db2 Big SQL by using the JDBC URL that you identified in the previous section.

In the following example, the statement expects a bucket named s3bucket to contain a directory tpcds/customer that holds the data files for the customer table that is being created. The data files are expected to be in a delimited text format that uses the | character as the field separator.

CREATE EXTERNAL HADOOP TABLE customer
(
    c_customer_sk             int                  not null,
    c_customer_id             varchar(16)          not null,
    c_current_cdemo_sk        int                          ,
    c_current_hdemo_sk        int                          ,
    c_current_addr_sk         int                          ,
    c_first_shipto_date_sk    int                          ,
    c_first_sales_date_sk     int                          ,
    c_salutation              varchar(10)                  ,
    c_first_name              varchar(20)                  ,
    c_last_name               varchar(30)                  ,
    c_preferred_cust_flag     varchar(1)                   ,
    c_birth_day               bigint                       ,
    c_birth_month             bigint                       ,
    c_birth_year              bigint                       ,
    c_birth_country           varchar(20)                  ,
    c_login                   varchar(13)                  ,
    c_email_address           varchar(50)                  ,
    c_last_review_date        int
)
STORED AS TEXTFILE
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3a://s3bucket/tpcds/customer'
TBLPROPERTIES('serialization.null.format'='');
Note: The LOCATION clause must always specify a directory that contains the files for the table that is being created, and all of those files must have the same structure. If you have files with different structures and you want to create a separate table for each structure, the files must be in separate directories.
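
For example, the following sketch extends the scenario with a second, differently structured set of files in a separate directory, tpcds/store_sales, and creates a separate table for it. The directory name and the column definitions are illustrative assumptions, not part of the original example.

CREATE EXTERNAL HADOOP TABLE store_sales
(
    ss_sold_date_sk           int                          ,
    ss_item_sk                int                  not null,
    ss_customer_sk            int                          ,
    ss_quantity               int                          ,
    ss_net_paid               decimal(7,2)
)
STORED AS TEXTFILE
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3a://s3bucket/tpcds/store_sales'
TBLPROPERTIES('serialization.null.format'='');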

You can now query the table by using SQL. For example,

[localhost] [user1] 1> SELECT c_birth_country, count (c_customer_sk) AS count
[localhost] [user1] 2> FROM customer
[localhost] [user1] 3> GROUP BY c_birth_country
[localhost] [user1] 4> HAVING count > 500
[localhost] [user1] 5> ORDER BY count DESC;
+-----------------+-------+
| C_BIRTH_COUNTRY | COUNT |
+-----------------+-------+
| [NULL]          | 3439  |
| VANUATU         |  532  |
| BERMUDA         |  516  |
| LESOTHO         |  506  |
| MAURITIUS       |  506  |
| LIBERIA         |  501  |
+-----------------+-------+
6 rows in results (first row: 2.836s; total: 2.838s)
[localhost] [user1] 1>

If no data files exist in the path that is specified in the LOCATION clause, the table creation succeeds, but the table is empty. You can use Db2 Big SQL to insert data into the table.
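
For example, the following minimal sketch inserts a single row into the customer table that was created earlier. The values are illustrative assumptions, and columns that are not listed are left NULL.

-- Illustrative sketch: insert one row into the (initially empty) customer table.
-- Each INSERT typically writes new data files to the table's LOCATION directory,
-- so row-at-a-time inserts are best kept to small amounts of data.
INSERT INTO customer
    (c_customer_sk, c_customer_id, c_first_name, c_last_name, c_birth_country)
VALUES
    (1, 'AAAAAAAABAAAAAAA', 'Javier', 'Lewis', 'CANADA');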

Inferring the table structure from a remote file

You can use the CREATE TABLE (HADOOP) statement to infer the table structure from a data file in a supported file format.

When you use the LIKE clause on the CREATE TABLE (HADOOP) statement, the columns of the new table are given names and types that most closely match the columns in a file at the specified URL. If the URL identifies a directory, a file is selected at random from within that directory. Use the STORED AS clause to specify the file format, which must be one of ORC, PARQUET, PARQUETFILE, TEXTFILE, or JSONFILE.

By default, the new table is stored separately from the directory that is identified by the specified URL. However, you can use the LOCATION clause to store the table in that same directory. In that case, the file format of the source file and the new table should be the same.

Note: Review the data types of the columns in the newly created table to ensure that accurate and optimal data types were inferred. If needed, use the ALTER TABLE (HADOOP) statement to adjust the table for accuracy and performance, which includes eliminating STRING types. Use of the STRING type in Db2 Big SQL is discouraged because it increases storage requirements and can reduce performance. See Data types and performance. See also STRING. In particular, ensure that the data types of partitioning columns are properly defined, and alter them as needed.
For example:
CREATE EXTERNAL HADOOP TABLE customer
  LIKE 's3a://monsoon/customer/customer.parquet'
  STORED AS PARQUET
  LOCATION 's3a://monsoon/customer';
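
As a sketch of the adjustment that the preceding note describes, the following statement assumes that the c_customer_id column was inferred as a STRING and changes it to a bounded VARCHAR. The column name, the target length, and whether this particular type change is allowed for your table are assumptions that depend on your data and your Db2 Big SQL version.

-- Sketch only: replace an inferred STRING column with a bounded VARCHAR.
-- The column name and the length 16 are assumptions; adjust them for your data.
ALTER TABLE customer
  ALTER COLUMN c_customer_id SET DATA TYPE VARCHAR(16);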

For more information, see CREATE TABLE (HADOOP) statement.