S3 JSON parsing

The values in a JSON document can be nested within objects or arrays without limitations. When querying a specific value in a JSON document in the S3 select engine, the location of the value is specified via a path in the SELECT statement.

The generic structure of a JSON document does not have a row and column structure like CSV and Parquet. Instead, it is the SQL statement itself that defines the rows and columns when querying a JSON document.

The S3 select engine's JSON parser parses S3-objects as follows:
  • The FROM clause in the SELECT statement defines the row boundaries
  • A row in a JSON document is similar to how the row delimiter is used to define rows for CSV objects, and how row groups are used to define rows for Parquet objects
  • Consider the following example:
    {
    "firstName": "Joe",
    "lastName": "Jackson",
    "gender": "male",
    "age": "twenty",
    "address":
      {
       "streetAddress": "101",
       "city": "San Diego",
       "state": "CA"
      },
    "firstName": "Joe_2",
    "lastName": "Jackson_2",
    "gender": "male",
    "age": 21,
    "address": 
     {
      "streetAddress": "101",
      "city": "San Diego",
      "state": "CA"
     },
    "phoneNumbers": 
     [
      { "type": "home1", "number": "734928_1","addr": 11 },
      { "type": "home2", "number": "734928_2","addr": 22 }
     ],
    "key_after_array": "XXX",
    "description" : 
     {
      "main_desc" : "value_1",
      "second_desc" : "value_2"
     }
    }
    
     # the from-clause define a single row.
     # _1 points to root object level.
     # _1.age appears twice in Documnet-row, the last value is used for the operation.
     query = "select _1.firstname,_1.key_after_array,_1.age+4,_1.description.main_desc,_1.description.second_desc from s3object[*];";
     expected_result = Joe_2,XXX,25,value_1,value_2
    
    
     # the from-clause points the phonenumbers array (it defines the _1)
     # each element in phoneNumbers array define a row.
     # in this case each element is an object contains 3 keys/values.
     # the query "can not access" values outside phonenumbers array, the query can access only values appears on _1.phonenumbers path.
     query = "select cast(substring(_1.number,1,6) as int) *10 from s3object[*].phonenumbers where _1.type='home2';";
     expected_result = 7349280
    • The statement instructs the reader to search for the path aa.bb.cc and defines the row boundaries based on the occurrence of this path.
    • A row begins when the reader encounters the path, and it ends when the reader exits the innermost part of the path, which in this case is the object cc.