Integrate the Ceph Object Gateway with Trino, a distributed SQL query engine that enables you to run SQL queries up to 9x faster on S3 objects.
Before you begin
- A running IBM Storage Ceph cluster with the Ceph Object Gateway installed.
- Docker or Podman installed.
- Buckets created.
- Objects uploaded to the buckets, for example as shown below.
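For example, a test bucket and a sample CSV object can be created with the AWS CLI, assuming the CLI is configured with the gateway user's access and secret keys. The endpoint URL, bucket name, and file name below are placeholders and should match your environment:
[cephuser@host01 ~]$ aws --endpoint-url http://rgw_ip:port s3 mb s3://csvbkt1
[cephuser@host01 ~]$ aws --endpoint-url http://rgw_ip:port s3 cp ./sample.csv s3://csvbkt1/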
About this task
You can integrate Ceph Object Gateway with Trino for S3 select operations.
The following are some benefits of using Trino:
- Trino is a complete SQL engine.
- Pushes down S3 select requests: the Trino engine identifies the parts of an SQL statement that are cost effective to run on the server side.
- Uses the optimization rules of Ceph/S3select to enhance performance.
- Leverages IBM Storage Ceph scalability by dividing the original object into multiple equal parts, running S3 select requests against them, and merging the results.
Important: If the s3select syntax does not work while querying through Trino, use the SQL syntax.
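For example, with the external table that is created in the verification step later in this procedure, a query of the following form lets Trino push the WHERE filter down to the Ceph Object Gateway as an S3 select request. The schema, table, and column names are illustrative:
trino> select c1, c2 from hive.csvbkt1schema.polariondatacsv where c3 = 'value';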
Procedure
- Deploy Trino and Hive.
- Clone the s3select repository.
[cephuser@host01 ~]$ git clone https://github.com/ceph/s3select.git
[cephuser@host01 ~]$ cd s3select
- Modify the hms_trino.yaml file with the S3 endpoint, access key, and secret key.
[cephuser@host01 s3select]$ cat container/trino/hms_trino.yaml
version: '3'
services:
  hms:
    image: galsl/hms:dev
    container_name: hms
    environment:
      # S3_ENDPOINT the CEPH/RGW end-point-url
      - S3_ENDPOINT=http://rgw_ip:port
      - S3_ACCESS_KEY=abc
      - S3_SECRET_KEY=abc
    # the container starts with booting the hive metastore
    command: sh -c '. ~/.bashrc; start_hive_metastore'
    ports:
      - 9083:9083
    networks:
      - trino_hms
  trino:
    image: trinodb/trino:405
    container_name: trino
    volumes:
      # the trino directory contains the necessary configuration
      - ./trino:/etc/trino
    ports:
      - 8080:8080
    networks:
      - trino_hms
networks:
  trino_hms:
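Optionally, before starting the containers, confirm that the Ceph Object Gateway endpoint configured above is reachable from the host. The URL below is a placeholder for your gateway address:
[cephuser@host01 s3select]$ curl http://rgw_ip:port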
- Modify the hive.properties file with the S3 endpoint, access key, and secret key.
[cephuser@host01 s3select]$ cat container/trino/trino/catalog/hive.properties
connector.name=hive
hive.metastore.uri=thrift://hms:9083
#hive.metastore.warehouse.dir=s3a://hive/
hive.allow-drop-table=true
hive.allow-rename-table=true
hive.allow-add-column=true
hive.allow-drop-column=true
hive.allow-rename-column=true
hive.non-managed-table-writes-enabled=true
hive.s3select-pushdown.enabled=true
hive.s3.aws-access-key=abc
hive.s3.aws-secret-key=abc
# should modify per s3-endpoint-url
hive.s3.endpoint=http://rgw_ip:port
#hive.s3.max-connections=1
#hive.s3select-pushdown.max-connections=1
hive.s3.connect-timeout=100s
hive.s3.socket-timeout=100s
hive.max-splits-per-second=10000
hive.max-split-size=128MB
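If the gateway is not reachable through virtual-hosted-style bucket addressing, the Trino Hive connector also supports a path-style access setting. The following line is an optional, assumed addition and is not part of the file shown above:
hive.s3.path-style-access=true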
- Start the Trino and Hive Metastore containers to integrate with the Ceph Object Gateway.
[cephuser@host01 s3select]$ sudo docker compose -f ./container/trino/hms_trino.yaml up -d
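Optionally, confirm that both the hms and trino containers are running; container IDs and uptimes differ in each environment:
[cephuser@host01 s3select]$ sudo docker ps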
- Verify the integration.
[cephuser@host01 s3select]$ sudo docker exec -it trino /bin/bash
trino@66f753905e82:/$ trino
trino> create schema hive.csvbkt1schema;
trino> create table hive.csvbkt1schema.polariondatacsv(c1 varchar,c2 varchar, c3 varchar, c4 varchar, c5 varchar, c6 varchar, c7 varchar, c8 varchar, c9 varchar) WITH ( external_location = 's3a://csvbkt1/',format = 'CSV');
trino> select * from hive.csvbkt1schema.polariondatacsv;
Note: The external_location must point to the bucket or a directory, and not to an individual file.
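When the integration environment is no longer needed, stop and remove the containers with the same compose file:
[cephuser@host01 s3select]$ sudo docker compose -f ./container/trino/hms_trino.yaml down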