S3 select content from an object
The select object content API filters the content of an object through the structured query language (SQL).
Requests
In the request, you must specify the data serialization format as, comma-separated values (CSV) of the object to retrieve the specified content. Amazon Web Services (AWS) command-line interface (CLI) select object content uses the CSV format to parse object data into records and returns only the records specified in the query.
For more information and an example of the description of what should reside in the inventory object, see Metadata collected by Inventory, within the AWS Systems Manager User Guide.
s3:GetObject permission for this operation.POST /BUCKET/KEY?select&select-type=2 HTTP/1.1\r\n
POST /testbucket/sample1csv?select&select-type=2 HTTP/1.1\r\n
| Request | Description | Type | Required |
|---|---|---|---|
Bucket |
The bucket to select object content from. | String | Yes |
Key |
The object key.
Required minimum length of 1. |
String | Yes |
SelectObjectContentRequest |
Root level tag for the select object content request parameters. | String | Yes |
Expression |
The expression that is used to query the object. | String | Yes |
ExpressionType |
The type of the provided expression for example SQL.
Valid value: |
String | Yes |
InputSerialization |
Describes the format of the data in the object that is being queried. | String | Yes |
OutputSerialization |
Format of data returned in comma separator and new-line. | String | Yes |
Responses
If the action is successful, the service sends back HTTP 200 response. Data is returned in XML format by the service:
| Response | Description | Type | Required |
|---|---|---|---|
Payload |
Root level tag for the payload parameters. | String | Yes |
Records |
The records event. | Base64-encoded binary data object | No |
Stats |
The stats event. | Long | No |
Ceph Object Gateway supports the following response type:
{:event-type,records} {:content-type,application/octet-stream} {:message-type,event}
CSV requests
aws --endpoint-URL http://localhost:80 s3api select-object-content
--bucket BUCKET NAME
--expression-type SQL
--input-serialization
{"CSV": {"FieldDelimiter": "," , "QuoteCharacter": "\"" , "RecordDelimiter" : "\n" , "QuoteEscapeCharacter" : "\\" , "FileHeaderInfo": "USE" }, "CompressionType": "NONE"}
--output-serialization {"CSV": {}}
--key OBJECT_NAME.csv
--expression "select count(0) from s3object where int(_1)<10;" output.csv
--key OBJECT_NAME do not require .csv, .json, or .parquet.aws --endpoint-url http://localhost:80 s3api select-object-content
--bucket testbucket
--expression-type 'SQL'
--input-serialization
'{"CSV": {"FieldDelimiter": "," , "QuoteCharacter": "\"" , "RecordDelimiter" : "\n" , "QuoteEscapeCharacter" : "\\" , "FileHeaderInfo": "USE" }, "CompressionType": "NONE"}'
--output-serialization '{"CSV": {}}'
--key testobject.csv
--expression "select count(0) from s3object where int(_1)<10;" output.csv
Parquet requests
.csv.aws --endpoint-url http://localhost:80 s3api select-object-content
--bucket BUCKET NAME
--expression-type 'SQL'
--input-serialization
'{"Parquet": {}, {"CompressionType": "NONE"}'
--output-serialization '{"CSV": {}}}'
--key OBJECT_NAME.parquet
--expression "select count(0) from s3object where int(_1)<10;" output.csv
aws --endpoint-url http://localhost:80 s3api select-object-content
--bucket testbucket
--expression-type 'SQL'
--input-serialization
'{"Parquet":{}, "CompressionType": "NONE"}'
--output-serialization '{"CSV": {}}}'
--key testobject.parquet
--expression "select count(0) from s3object where int(_1)<10;" output.csv
JSON requests
aws --endpoint-url http://localhost:80 s3api select-object-content
--bucket BUCKET_NAME
--key OBJECT_NAME.json
--expression-type 'SQL'
--input-serialization
'{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}'
--output-serialization '{"CSV": {}}'
--expression "select count(*) from S3Object[*];" /dev/stdout
aws s3api --endpoint-url http://localhost:80
select-object-content --bucket test
--key testobject.json
--expression-type 'SQL'
--input-serialization
'{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}'
--output-serialization '{"CSV": {}}'
--expression "select count(*) from S3Object[*];" /dev/stdout
BOTO3 requests
import pprint
import boto3
from botocore.exceptions import ClientError
def run_s3select(bucket,key,query,column_delim=",",row_delim="\n",quot_char='"',esc_char='\\',csv_header_info="NONE"):
s3 = boto3.client('s3',
endpoint_url=endpoint,
aws_access_key_id=access_key,
region_name=region_name,
aws_secret_access_key=secret_key)
result = ""
try:
r = s3.select_object_content(
Bucket=bucket,
Key=key,
ExpressionType='SQL',
InputSerialization = {"CSV": {"RecordDelimiter" : row_delim, "FieldDelimiter" : column_delim,"QuoteEscapeCharacter": esc_char, "QuoteCharacter": quot_char, "FileHeaderInfo": csv_header_info}, "CompressionType": "NONE"},
OutputSerialization = {"CSV": {}},
Expression=query,
RequestProgress = {"Enabled": progress})
except ClientError as c:
result += str(c)
return result
for event in r['Payload']:
if 'Records' in event:
result = ""
records = event['Records']['Payload'].decode('utf-8')
result += records
if 'Progress' in event:
print("progress")
pprint.pprint(event['Progress'],width=1)
if 'Stats' in event:
print("Stats")
pprint.pprint(event['Stats'],width=1)
if 'End' in event:
print("End")
pprint.pprint(event['End'],width=1)
return result
run_s3select(
"my_bucket",
"my_csv_object",
"select int(_1) as a1, int(_2) as a2 , (a1+a2) as a3 from s3object where a3>100 and a3<300;")
Supported features
Currently, only part of the AWS s3 select command is supported. For a full list of supported features, see Table 3.
| Features | Details | Description | Example |
|---|---|---|---|
| Arithmetic operators | ^ * % / + - ( ) |
N/A | select (int(_1)+int(_2))*int(_9) from s3object; |
| Arithmetic operators | % modulo |
N/A | select count(*) from s3object where cast(_1 as int)%2 = 0; |
| Arithmetic operators | ^ power-of |
N/A | select cast(2^10 as int) from s3object; |
| Compare operators | > < >= <= == != |
N/A | select _1,_2 from s3object where (int(_1)+int(_3))>int(_5); |
| logical operator | AND OR NOT |
N/A | select count(*) from s3object where not (int(1)>123 and int(_5)<200); |
| logical operator | is null |
Returns true/false for null indication in expression | N/A |
| logical operator and NULL | is not null |
Returns true/false for null indication in expression | N/A |
| logical operator and NULL | unknown state | Review null-handle and observe the results of logical operations with NULL. The query returns 0. |
select count(*) from s3object where null and (3>2); |
| Arithmetic operator with NULL | unknown state | Review null-handle and observe the results of binary operations with NULL. The query returns 0. |
select count(*) from s3object where (null+1) and (3>2); |
| Compare with NULL | unknown state | Review null-handle and observe results of compare operations with NULL. The query returns 0. |
select count(*) from s3object where (null*1.5) != 3; |
| missing column | unknown state | N/A | select count(*) from s3object where _1 is null; |
| projection column | Similar to if or then or else | N/A | select case when (1+1==(2+1)*3) then 'case_1' when ((4*3)==(12)) then 'case_2' else 'case_else' end, age*2 from s3object; |
| projection column | Similar to switch/case default | N/A | select case cast(_1 as int) + 1 when 2 then “a” when 3 then “b” else “c” end from s3object; |
| logical operator | N/A | coalesce returns first non-null argument |
select coalesce(nullif(5,5),nullif(1,1.0),age+12) from s3object; |
| logical operator | N/A | nullif returns null in case both arguments are equal, or else the first one,nullif(1,1)=NULL nullif(null,1)=NULL nullif(2,1)=2 |
select nullif(cast(_1 as int),cast(_2 as int)) from s3object; |
| logical operator | N/A | {expression} in ( .. {expression} ..) |
select count(*) from s3object where 'ben' in (trim(_5),substring(_1,char_length(_1)-3,3),last_name); |
| logical operator | N/A | {expression} between {expression} and {expression} |
select _1 from s3object where cast(_1 as int) between 800 and 900;; select count(*) from stdin where substring(_3,char_length(_3),1) between “x” and trim(_1) and substring(_3,char_length(_3)-1,1) = “:”; |
| logical operator | N/A | {expression} like {match-pattern} |
select count(*) from s3object where first_name like '%de_'; select count(*) from s3object where _1 like "%a[r-s]; |
| casting operator | N/A | N/A | select cast(123 as int)%2 from s3object; |
| casting operator | N/A | N/A | select cast(123.456 as float)%2 from s3object; |
| casting operator | N/A | N/A | select cast('ABC0-9' as string),cast(substr('ab12cd',3,2) as int)*4 from s3object; |
| casting operator | N/A | N/A | select cast(substring('publish on 2007-01-01',12,10) as timestamp) from s3object; |
| non AWS casting operator | N/A | N/A | select int(_1),int( 1.2 + 3.4) from s3object; |
| non AWS casting operator | N/A | N/A | select float(1.2) from s3object; |
| non AWS casting operator | N/A | N/A | select to_timestamp('1999-10-10T12:23:44Z') from s3object; |
| Aggregation Function | sun |
N/A | select sum(int(_1)) from s3object; |
| Aggregation Function | avg |
N/A | select avg(cast(_1 as float) + cast(_2 as int)) from s3object; |
| Aggregation Function | min |
N/A | select avg(cast(_1 a float) + cast(_2 as int)) from s3object; |
| Aggregation Function | max |
N/A | select max(float(_1)),min(int(_5)) from s3object; |
| Aggregation Function | count |
N/A | select count(*) from s3object where (int(1)+int(_3))>int(_5); |
| Timestamp Functions | extract |
N/A | select count(*) from s3object where extract(year from to_timestamp(_2)) > 1950 and extract(year from to_timestamp(_1)) < 1960; |
| Timestamp Functions | dateadd |
N/A | select count(0) from s3object where date_diff(year,to_timestamp(_1),date_add(day,366,to_timestamp(_1))) = 1; |
| Timestamp Functions | datediff |
N/A | select count(0) from s3object where date_diff(month,to_timestamp(_1),to_timestamp(_2)) = 2; |
| Timestamp Functions | utcnow |
N/A | select count(0) from s3object where date_diff(hour,utcnow(),date_add(day,1,utcnow())) = 24 |
| Timestamp Functions | to_string |
N/A | select to_string( to_timestamp(“2009-09-17T17:56:06.234567Z”), “yyyyMMdd-H:m:s”) from s3object; |
| String Functions | substring |
N/A | select count(0) from s3object where int(substring(_1,1,4))>1950 and int(substring(_1,1,4))<1960; |
| String Functions | substring |
substring with from negative number is valid considered as first | select substring(“123456789” from -4) from s3object; |
| String Functions | substring |
substring with from zero for out-of-bound number is valid just as (first,last) | select substring(“123456789” from 0 for 100) from s3object; |
| String Functions | trim |
N/A | select trim(' foobar ') from s3object; |
| String Functions | trim |
N/A | select trim(trailing from ' foobar ') from s3object; |
| String Functions | trim |
N/A | select trim(leading from ' foobar ') from s3object; |
| String Functions | trim |
N/A | select trim(both '12' from '1112211foobar22211122') from s3object; |
| String Functions | lower or upper |
N/A | select lower('ABcD12#$e') from s3object; |
| String Functions | char_length, character_length |
N/A | select count(*) from s3object where char_length(_3)=3; |
| Complex queries | N/A | N/A | select sum(cast(_1 as int)),max(cast(_3 as int)), substring('abcdefghijklm', (2-1)*3+sum(cast(_1 as int))/sum(cast(_1 as int))+1, (count() + count(0))/count(0)) from s3object; |
| alias support | N/A | N/A | select int(_1) as a1, int(_2) as a2 , (a1+a2) as a3 from s3object where a3>100 and a3<300; |