Integrating Ceph Object Gateway with Trino

Integrate the Ceph Object Gateway with Trino, a distributed SQL query engine that enables users to run SQL queries on S3 objects up to 9x faster.

Before you begin

  • A running IBM Storage Ceph cluster with Ceph Object Gateway installed.
  • Docker or Podman installed.
  • Buckets created.
  • Objects uploaded (see the example after this list).
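
For example, a test bucket and CSV object can be prepared with the AWS CLI, assuming it is installed and configured with your Ceph Object Gateway credentials. The bucket name csvbkt1 and the file polariondata.csv are illustrative and match the verification step later in this procedure; replace rgw_ip:port with your S3 endpoint.
  [cephuser@host01 ~]$ aws s3 mb s3://csvbkt1 --endpoint-url http://rgw_ip:port
  [cephuser@host01 ~]$ aws s3 cp polariondata.csv s3://csvbkt1/ --endpoint-url http://rgw_ip:port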

About this task

You can integrate Ceph Object Gateway with Trino for S3 select operations.

The following are some benefits of using Trino:
  • Trino is a complete SQL engine.
  • Pushes down S3 select requests: the Trino engine identifies the parts of a SQL statement that are cost effective to run on the server side (see the example query after this list).
  • Uses the optimization rules of Ceph/S3select to enhance performance.
  • Leverages IBM Storage Ceph scalability by dividing the original object into multiple equal parts, running S3 select requests against the parts, and merging the results.
Important: If the s3select syntax does not work when querying through Trino, use the SQL syntax.
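
For example, with a table like the one created in the verification step below, a query of the following form is a candidate for S3 select pushdown, because the column projection and the row filter can run on the server side (the filter value here is illustrative):
  trino> select c1, c4 from hive.csvbkt1schema.polariondatacsv where c2 = 'some-value';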

Procedure

  1. Deploy Trino and Hive.
    1. Clone the repository.
      [cephuser@host01 ~]$ git clone https://github.com/ceph/s3select.git
      [cephuser@host01 ~]$ cd s3select
    2. Modify the hms_trino.yaml file with S3 endpoint, access key, and secret key.
      [cephuser@host01 s3select]$ cat container/trino/hms_trino.yaml
      version: '3'
      services:
        hms:
          image: galsl/hms:dev
          container_name: hms
          environment:
            # S3_ENDPOINT is the Ceph/RGW endpoint URL
            - S3_ENDPOINT=http://rgw_ip:port
            - S3_ACCESS_KEY=abc
            - S3_SECRET_KEY=abc
          # the container starts by booting the Hive Metastore
          command: sh -c '. ~/.bashrc; start_hive_metastore'
          ports:
            - 9083:9083
          networks:
            - trino_hms
      
        trino:
          image: trinodb/trino:405
          container_name: trino
          volumes:
            # the trino directory contains the necessary configuration
            - ./trino:/etc/trino
          ports:
            - 8080:8080
          networks:
            - trino_hms
      
      networks:
        trino_hms:
      
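      Note: Optionally, before starting the containers, you can confirm that the S3 endpoint configured above is reachable from the host, for example with curl. This check is not part of the procedure itself; replace rgw_ip:port with your endpoint.
      [cephuser@host01 s3select]$ curl http://rgw_ip:port
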
    3. Modify the hive.properties file with S3 endpoint, access key, and secret key.
      [cephuser@host01 s3select]$ cat container/trino/trino/catalog/hive.properties
      connector.name=hive
      hive.metastore.uri=thrift://hms:9083 
      
      #hive.metastore.warehouse.dir=s3a://hive/
      
      hive.allow-drop-table=true
      hive.allow-rename-table=true
      hive.allow-add-column=true
      hive.allow-drop-column=true
      hive.allow-rename-column=true
      
      hive.non-managed-table-writes-enabled=true
      hive.s3select-pushdown.enabled=true
      hive.s3.aws-access-key=abc
      hive.s3.aws-secret-key=abc
      
      # modify to match your S3 endpoint URL
      hive.s3.endpoint=http://rgw_ip:port
      #hive.s3.max-connections=1
      #hive.s3select-pushdown.max-connections=1
      
      hive.s3.connect-timeout=100s
      hive.s3.socket-timeout=100s
      hive.max-splits-per-second=10000
      hive.max-split-size=128MB
      
  2. Start the Hive Metastore and Trino containers to integrate with the Ceph Object Gateway.
    [cephuser@host01 s3select]$ sudo docker compose -f ./container/trino/hms_trino.yaml up -d
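    Note: Optionally, confirm that both containers are running and that the Trino container sees the mounted catalog configuration. The container names hms and trino are set in hms_trino.yaml; this check is not part of the procedure itself.
    [cephuser@host01 s3select]$ sudo docker ps --filter name=hms --filter name=trino
    [cephuser@host01 s3select]$ sudo docker exec trino cat /etc/trino/catalog/hive.properties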
  3. Verify integration.
    [cephuser@host01 s3select]$ sudo docker exec -it trino /bin/bash
    trino@66f753905e82:/$ trino
    trino> create schema hive.csvbkt1schema;
    trino> create table hive.csvbkt1schema.polariondatacsv(c1 varchar, c2 varchar, c3 varchar, c4 varchar, c5 varchar, c6 varchar, c7 varchar, c8 varchar, c9 varchar) WITH (external_location = 's3a://csvbkt1/', format = 'CSV');
    trino> select * from hive.csvbkt1schema.polariondatacsv;
    Note: The external location must point to the bucket name or a directory, not to an individual file.
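  4. Optional: Clean up the test resources. Dropping the table and schema relies on hive.allow-drop-table=true set in hive.properties above, and docker compose down stops and removes both containers. This teardown step is a suggested addition, not part of the upstream procedure.
    trino> drop table hive.csvbkt1schema.polariondatacsv;
    trino> drop schema hive.csvbkt1schema;
    [cephuser@host01 s3select]$ sudo docker compose -f ./container/trino/hms_trino.yaml down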