Integrating Ceph Object Gateway with Trino

Integrate the Ceph Object Gateway with Trino, a distributed SQL query engine that enables users to run SQL queries on S3 objects up to 9x faster.

Before you begin

  • A running IBM Storage Ceph cluster with Ceph Object Gateway installed.
  • Docker or Podman installed.
  • Buckets created.
  • Objects uploaded (see the example after this list).
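
For example, a test bucket and CSV object can be prepared with the AWS CLI, assuming it is installed and configured with your Ceph Object Gateway credentials. The bucket name csvbkt1 and the file polariondata.csv are illustrative and match the verification step later in this procedure; replace rgw_ip:port with your S3 endpoint.
  [cephuser@host01 ~]$ aws s3 mb s3://csvbkt1 --endpoint-url http://rgw_ip:port
  [cephuser@host01 ~]$ aws s3 cp polariondata.csv s3://csvbkt1/ --endpoint-url http://rgw_ip:port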

About this task

You can integrate Ceph Object Gateway with Trino for S3 select operations.

The following are some benefits of using Trino:
  • Trino is a complete SQL engine.
  • Pushes down S3 select requests: the Trino engine identifies the parts of a SQL statement that are cost effective to run on the server side (see the example query after this list).
  • Uses the optimization rules of Ceph/S3select to enhance performance.
  • Leverages IBM Storage Ceph scalability by dividing the original object into multiple equal parts, running S3 select requests against the parts, and merging the results.
Important: If the s3select syntax does not work when querying through Trino, use the SQL syntax.
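
For example, with a table like the one created in the verification step below, a query of the following form is a candidate for S3 select pushdown, because the column projection and the row filter can run on the server side (the filter value here is illustrative):
  trino> select c1, c4 from hive.csvbkt1schema.polariondatacsv where c2 = 'some-value';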

Procedure

  1. Deploy Trino and Hive.
    1. Clone the repository.
      [cephuser@host01 ~]$ git clone https://github.com/ceph/s3select.git
      [cephuser@host01 ~]$ cd s3select
    2. Modify the hms_trino.yaml file with S3 endpoint, access key, and secret key.
      [cephuser@host01 s3select]$ cat container/trino/hms_trino.yaml
      version: '3'
      services:
        hms:
          image: galsl/hms:dev
          container_name: hms
          environment:
            # S3_ENDPOINT is the Ceph/RGW endpoint URL
            - S3_ENDPOINT=http://rgw_ip:port
            - S3_ACCESS_KEY=abc
            - S3_SECRET_KEY=abc
          # the container starts by booting the Hive Metastore
          command: sh -c '. ~/.bashrc; start_hive_metastore'
          ports:
            - 9083:9083
          networks:
            - trino_hms
      
        trino:
          image: trinodb/trino:405
          container_name: trino
          volumes:
            # the trino directory contains the necessary configuration
            - ./trino:/etc/trino
          ports:
            - 8080:8080
          networks:
            - trino_hms
      
      networks:
        trino_hms:
      
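      Note: Optionally, before starting the containers, you can confirm that the S3 endpoint configured above is reachable from the host, for example with curl. This check is not part of the procedure itself; replace rgw_ip:port with your endpoint.
      [cephuser@host01 s3select]$ curl http://rgw_ip:port
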
    3. Modify the hive.properties file with S3 endpoint, access key, and secret key.
      [cephuser@host01 s3select]$ cat container/trino/trino/catalog/hive.properties
      connector.name=hive
      hive.metastore.uri=thrift://hms:9083 
      
      #hive.metastore.warehouse.dir=s3a://hive/
      
      hive.allow-drop-table=true
      hive.allow-rename-table=true
      hive.allow-add-column=true
      hive.allow-drop-column=true
      hive.allow-rename-column=true
      
      hive.non-managed-table-writes-enabled=true
      hive.s3select-pushdown.enabled=true
      hive.s3.aws-access-key=abc
      hive.s3.aws-secret-key=abc
      
      # modify to match your S3 endpoint URL
      hive.s3.endpoint=http://rgw_ip:port
      #hive.s3.max-connections=1
      #hive.s3select-pushdown.max-connections=1
      
      hive.s3.connect-timeout=100s
      hive.s3.socket-timeout=100s
      hive.max-splits-per-second=10000
      hive.max-split-size=128MB
      
  2. Start the Hive Metastore and Trino containers to integrate with the Ceph Object Gateway.
    [cephuser@host01 s3select]$ sudo docker compose -f ./container/trino/hms_trino.yaml up -d
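    Note: Optionally, confirm that both containers are running and that the Trino container sees the mounted catalog configuration. The container names hms and trino are set in hms_trino.yaml; this check is not part of the procedure itself.
    [cephuser@host01 s3select]$ sudo docker ps --filter name=hms --filter name=trino
    [cephuser@host01 s3select]$ sudo docker exec trino cat /etc/trino/catalog/hive.properties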
  3. Verify integration.
    [cephuser@host01 s3select]$ sudo docker exec -it trino /bin/bash
    trino@66f753905e82:/$ trino
    trino> create schema hive.csvbkt1schema;
    trino> create table hive.csvbkt1schema.polariondatacsv(c1 varchar, c2 varchar, c3 varchar, c4 varchar, c5 varchar, c6 varchar, c7 varchar, c8 varchar, c9 varchar) WITH (external_location = 's3a://csvbkt1/', format = 'CSV');
    trino> select * from hive.csvbkt1schema.polariondatacsv;
    Note: The external location must point to the bucket name or a directory, not to an individual file.
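  4. Optional: Clean up the test resources. Dropping the table and schema relies on hive.allow-drop-table=true set in hive.properties above, and docker compose down stops and removes both containers. This teardown step is a suggested addition, not part of the upstream procedure.
    trino> drop table hive.csvbkt1schema.polariondatacsv;
    trino> drop schema hive.csvbkt1schema;
    [cephuser@host01 s3select]$ sudo docker compose -f ./container/trino/hms_trino.yaml down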