S3 JSON parsing
The values in a JSON document can be nested within objects or arrays without limitations. When
querying a specific value in a JSON document in the S3 select engine, the location of the value is
specified via a path in the SELECT statement.
The generic structure of a JSON document does not have a row and column structure like CSV and Parquet. Instead, it is the SQL statement itself that defines the rows and columns when querying a JSON document.
The S3 select engine's JSON parser parses S3-objects as follows:
- The
FROMclause in theSELECTstatement defines the row boundaries - A row in a JSON document is similar to how the row delimiter is used to define rows for CSV objects, and how row groups are used to define rows for Parquet objects
- Consider the following example:
{ "firstName": "Joe", "lastName": "Jackson", "gender": "male", "age": "twenty", "address": { "streetAddress": "101", "city": "San Diego", "state": "CA" }, "firstName": "Joe_2", "lastName": "Jackson_2", "gender": "male", "age": 21, "address": { "streetAddress": "101", "city": "San Diego", "state": "CA" }, "phoneNumbers": [ { "type": "home1", "number": "734928_1","addr": 11 }, { "type": "home2", "number": "734928_2","addr": 22 } ], "key_after_array": "XXX", "description" : { "main_desc" : "value_1", "second_desc" : "value_2" } } # the from-clause define a single row. # _1 points to root object level. # _1.age appears twice in Documnet-row, the last value is used for the operation. query = "select _1.firstname,_1.key_after_array,_1.age+4,_1.description.main_desc,_1.description.second_desc from s3object[*];"; expected_result = Joe_2,XXX,25,value_1,value_2 # the from-clause points the phonenumbers array (it defines the _1) # each element in phoneNumbers array define a row. # in this case each element is an object contains 3 keys/values. # the query "can not access" values outside phonenumbers array, the query can access only values appears on _1.phonenumbers path. query = "select cast(substring(_1.number,1,6) as int) *10 from s3object[*].phonenumbers where _1.type='home2';"; expected_result = 7349280- The statement instructs the reader to search for the path aa.bb.cc and defines the row boundaries based on the occurrence of this path.
- A row begins when the reader encounters the path, and it ends when the reader exits the innermost part of the path, which in this case is the object cc.