S3

Db2® Big SQL tables can be created in a location that is specified as an S3a Object Storage URI. Such tables store their data in the Object Storage service, which is typically remote to the Db2 Big SQL cluster and managed separately, for example by the IBM® Cloud Object Storage - S3 API service that is available through IBM Cloud.

About this task

Simple Storage Service (S3) is an object storage interface protocol that was developed by Amazon. S3a is the Hadoop component that implements the S3 protocol; it provides an interface for Hadoop services, such as Db2 Big SQL, to consume S3-hosted data.

Before you can use an S3a uniform resource identifier (URI) as the location of a Hive or Db2 Big SQL table, you must add connection information for the S3a Object Storage service to the core-site.xml file. Obtain the authentication and connection property values from your Object Storage service, and add the properties in Table 1 for the S3a file system configuration. After you add these properties, restart the HDFS, MapReduce, YARN, Hive, and Db2 Big SQL services.

Table 1. Properties for the S3a file system configuration

fs.s3a.impl
    org.apache.hadoop.fs.s3a.S3AFileSystem

fs.s3a.endpoint
    The endpoint from which the S3 service is provided. IBM's Cloud Object Storage - S3 API is available from many regions. For example, the San Jose region's endpoint is s3-api.sjc-us-geo.objectstorage.service.networklayer.com.

fs.s3a.secret.key
    The secret key that is required by the S3 endpoint.

fs.s3a.access.key
    The access key that is required by the S3 endpoint.

fs.s3a.bucket.<bucket-name>.endpoint
    The endpoint for connections to a specific S3 bucket, where <bucket-name> is the name of the bucket that is used in the URI. For example, for the URI s3a://s3atables/country, the property name is fs.s3a.bucket.s3atables.endpoint. You can also use a longer form of this property, fs.s3a.bucket.<bucket-name>.fs.s3a.endpoint.

fs.s3a.bucket.<bucket-name>.secret.key
    The secret key for authentication to a specific S3 bucket, where <bucket-name> is the name of the bucket that is used in the URI. For example, for the URI s3a://s3atables/country, the property name is fs.s3a.bucket.s3atables.secret.key. You can also use a longer form of this property, fs.s3a.bucket.<bucket-name>.fs.s3a.secret.key.

fs.s3a.bucket.<bucket-name>.access.key
    The access key for authentication to a specific S3 bucket, where <bucket-name> is the name of the bucket that is used in the URI. For example, for the URI s3a://s3atables/country, the property name is fs.s3a.bucket.s3atables.access.key. You can also use a longer form of this property, fs.s3a.bucket.<bucket-name>.fs.s3a.access.key.
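
For reference, a minimal sketch of how the global properties might appear in the core-site.xml file. The endpoint is the San Jose example from Table 1; the access key and secret key values are placeholders for the values from your own Object Storage service.

<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3-api.sjc-us-geo.objectstorage.service.networklayer.com</value>
</property>
<!-- myaccesskey and mysecretkey are placeholders for your own credentials -->
<property>
  <name>fs.s3a.access.key</name>
  <value>myaccesskey</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>mysecretkey</value>
</property>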

When you use IBM Cloud Object Storage (COS), generate HMAC credentials, which contain the access_key_id and secret_access_key values that are needed for the S3 connection. You can create HMAC credentials through the IBM Cloud COS UI; make sure that {"HMAC":true} is specified in the parameters field when you create the credential.

Use the value of access_key_id for fs.s3a.access.key and the value of secret_access_key for fs.s3a.secret.key in the core-site.xml file.
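
If you prefer the command line to the UI, the IBM Cloud CLI can also create such a credential. The following is a sketch only; the credential name cos-hmac-key and the instance name my-cos are hypothetical, and the available CLI plug-ins can vary by environment.

# Create a service credential with HMAC keys for a Cloud Object Storage instance
# (cos-hmac-key and my-cos are hypothetical names)
ibmcloud resource service-key-create cos-hmac-key Writer --instance-name my-cos --parameters '{"HMAC":true}'

The generated credential contains the access_key_id and secret_access_key values to copy into the core-site.xml file.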

Limitations:

Adding these fs.s3a.* properties to the core-site.xml file can expose the credentials in plain text to anyone with access to the core-site.xml file. If you prefer not to expose this information, see Enabling Db2 Big SQL to use a credential keystore file.

Having the credentials globally available through the core-site.xml file makes data that is stored on the object storage server accessible to anyone with access to the Hadoop cluster. The data does not have an owner, group, or permissions set. Db2 Big SQL can use GRANT statements to restrict access to a table, but the data is still accessible to everyone through Hadoop shell commands. Note that Db2 Big SQL impersonation is not applicable to these tables.
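
For example, a minimal sketch of such SQL-level access control for the country table that is created later in this topic (the user name newton is hypothetical):

-- SQL access is restricted, but the files in the bucket remain readable through Hadoop shell commands
REVOKE SELECT ON TABLE country FROM PUBLIC;
GRANT SELECT ON TABLE country TO USER newton;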

Accessing S3a buckets and files through Hadoop

About this task

S3a Object Storage has a one-layer hierarchy: a bucket is a top-level container that holds files. From an S3 client (available as an external download), you can list the files in a bucket. For example, the following listing shows the files in bucket s3atables, using the IBM Cloud San Jose S3 endpoint.

# aws s3 ls s3atables --recursive --endpoint-url=https://s3-api.sjc-us-geo.objectstorage.softlayer.net
2016-07-27 14:51:16 922 country/country.csv
2016-07-27 18:11:22 367 staff/job=Clerk/l1469234296227-6-r-00000
2016-07-27 18:03:29 286 staff/job=Mgr/l1469234296227-6-r-00003
2016-07-27 18:13:50 362 staff/job=Sales/l1469234296227-6-r-00003
You can use the Hadoop or HDFS shell commands and the S3a file system URI to show file and bucket information. The Hadoop shell commands (-ls, -mkdir, -put, -get, -cp, -mv, -cat, and others) can be used with the S3a file system on your Hadoop cluster. The format of the S3a file system URI is s3a://<bucket>/<path>. For example:

# hdfs dfs -ls s3a://s3atables/
Found 1 items
drwxrwxrwx - 0 2016-07-27 14:39 s3a://s3atables/country

# hdfs dfs -ls s3a://s3atables/country
Found 1 items
-rw-rw-rw- 1 922 2016-07-27 14:51 s3a://s3atables/country/country.csv

By using an S3 client, you can create buckets and add files to the S3 Object Storage service.
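
For example, the bucket and file from the earlier listing could have been created with the AWS CLI as follows (a sketch; the local path /tmp/country.csv is hypothetical):

# Create the bucket at the IBM Cloud San Jose endpoint
aws s3 mb s3://s3atables --endpoint-url=https://s3-api.sjc-us-geo.objectstorage.softlayer.net
# Upload a local file (hypothetical path) under the country/ prefix
aws s3 cp /tmp/country.csv s3://s3atables/country/country.csv --endpoint-url=https://s3-api.sjc-us-geo.objectstorage.softlayer.net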

The Hadoop S3a file system implementation allows zero-length files to be treated as directories, and file names that contain a forward slash (/) are treated as nested directories. For example, the URI s3a://s3atables/staff refers to a zero-length file named staff in bucket s3atables. The zero-length file staff/job=Clerk is treated as a nested directory, and staff/job=Clerk/l1469234296227-6-r-00000 is treated as a file in that directory.

# hdfs dfs -ls -R s3a://s3atables/staff
drwxrwxrwx - 0 2016-07-27 18:10 s3a://s3atables/staff/job=Clerk
-rw-rw-rw- 1 367 2016-07-27 18:11 s3a://s3atables/staff/job=Clerk/l1469234296227-6-r-00000
drwxrwxrwx - 0 2016-07-27 17:57 s3a://s3atables/staff/job=Mgr
-rw-rw-rw- 1 286 2016-07-27 18:03 s3a://s3atables/staff/job=Mgr/l1469234296227-6-r-00003
drwxrwxrwx - 0 2016-07-27 18:08 s3a://s3atables/staff/job=Sales
-rw-rw-rw- 1 362 2016-07-27 18:13 s3a://s3atables/staff/job=Sales/l1469234296227-6-r-00003
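
The same directory-style layout can be produced with Hadoop shell commands. A minimal sketch, assuming a hypothetical local file /tmp/data.csv and a new prefix named newdir:

# Create a zero-length "directory" object, copy a local file (hypothetical path) into it, and list the result
hdfs dfs -mkdir s3a://s3atables/newdir
hdfs dfs -put /tmp/data.csv s3a://s3atables/newdir/data.csv
hdfs dfs -ls s3a://s3atables/newdir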

Creating a Db2 Big SQL table with S3a Object Storage location

About this task

You can create a Db2 Big SQL table by specifying a location that is an S3a file system URI. A zero-length file is created in the S3a bucket for the table. For example:

CREATE EXTERNAL HADOOP TABLE country (
  SALESCOUNTRYCODE INT,
  COUNTRY VARCHAR(40),
  ISOTHREELETTERCODE VARCHAR(3),
  ISOTWOLETTERCODE VARCHAR(2),
  ISOTHREEDIGITCODE VARCHAR(3),
  CURRENCYNAME VARCHAR(50),
  EUROINUSESINCE TIMESTAMP
) LOCATION 's3a://s3atables/country';
Note: The LOCATION clause must always specify a directory that contains the files for the table that is being created. The files must have the same structure. If you have files with different structures and you want to create a different table for each structure, the files must be in separate directories.
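Because the table is external and the country/country.csv file already exists in the bucket, the table can be queried as soon as it is created. A minimal check, assuming the file's format matches the table's row format:

SELECT SALESCOUNTRYCODE, COUNTRY
FROM country
FETCH FIRST 5 ROWS ONLY;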
You can also create a partitioned table. For example:
CREATE HADOOP TABLE staff (
  ID SMALLINT,
  NAME VARCHAR(9),
  DEPT SMALLINT,
  YEARS SMALLINT,
  SALARY DECIMAL(7,2),
  COMM DECIMAL(7,2)
) PARTITIONED BY (JOB VARCHAR(5))
  LOCATION 's3a://s3atables/staff';
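The partitioning column is declared separately but is referenced like any other column, and each distinct value is stored under a job=<value> directory in the table location, as in the staff listing shown earlier. A minimal sketch with sample values (an INSERT is convenient for small tests; use LOAD for bulk data, as shown next):

INSERT INTO staff (ID, NAME, DEPT, YEARS, SALARY, COMM, JOB)
VALUES (10, 'Sanders', 20, 7, 98357.50, NULL, 'Mgr');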
Here is an example of loading data from an S3a source file:

LOAD HADOOP USING FILE URL 's3a://data-files/staff.csv'
  WITH SOURCE PROPERTIES (
    'field.indexes'='1,2,3,5,6,7,4',
    'replace.with.null'='NULL',
    'replace.string.with.null'='NULL')
  INTO TABLE staff;

Enabling Db2 Big SQL to use a credential keystore file

About this task

When an Object Storage URI is used as the location for a Db2 Big SQL table, the connection secret key and access key can be stored in an encrypted credential keystore file instead of the core-site.xml file. Two credential keystore files are required for Db2 Big SQL and the other Hadoop services to work: the first is created with the Hadoop credential command, and the second is created with the Db2 Big SQL $BIGSQL_HOME/libexec/credential.sh script.

Procedure

  1. Complete the steps in Enabling Hadoop services to use a credential keystore file. This will enable all the other services to also use the credential keystore file.
  2. Create a credential keystore on the HDFS by using the $BIGSQL_HOME/libexec/credential.sh script:
    $BIGSQL_HOME/libexec/credential.sh create fs.s3a.access.key -provider jceks://<path>/<jceks_filename>.jceks -v <access key>
    For example, if the jceks file is created on HDFS as /tmp/s3a_bigsql.jceks, the provider URI is jceks://hdfs/tmp/s3a_bigsql.jceks:
    $BIGSQL_HOME/libexec/credential.sh create fs.s3a.access.key -provider jceks://hdfs/tmp/s3a_bigsql.jceks -v 'myaccesskey'
    $BIGSQL_HOME/libexec/credential.sh create fs.s3a.secret.key -provider jceks://hdfs/tmp/s3a_bigsql.jceks -v 'mysecretkey'
  3. You can verify the credential keystore file by using the list command:
    $BIGSQL_HOME/libexec/credential.sh list -provider jceks://hdfs/tmp/s3a_bigsql.jceks
    Listing aliases for CredentialProvider: jceks://hdfs/tmp/s3a_bigsql.jceks
    fs.s3a.access.key
    fs.s3a.secret.key
    
  4. Set read permissions for the keystore file so that all users can read the file:
    hadoop fs -chmod ugo+r /tmp/s3a_bigsql.jceks
    hadoop fs -ls /tmp/s3a_bigsql.jceks
     -rwxr--r-- 3 hdfs hdfs 1045 2016-08-15 17:01 /tmp/s3a_bigsql.jceks
    
  5. Add or update the fs.s3a.security.credential.provider.path property in the bigsql-conf.xml file on every Db2 Big SQL node to include the provider URI from Step 2 above; see the sketch after this procedure. In this example, the provider is jceks://hdfs/tmp/s3a_bigsql.jceks. Db2 Big SQL uses the JCEKS files that are listed in the bigsql-conf.xml file, whereas the other services (for example, Hive, YARN, and MapReduce) use the JCEKS files that are listed in the core-site.xml file.
  6. Restart the HDFS, MapReduce, YARN, Hive, HBase, Db2 Big SQL, and any other affected service.
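
For reference, the property entry from step 5 might look like the following sketch in the bigsql-conf.xml file, using the example keystore path from step 2:

<property>
  <name>fs.s3a.security.credential.provider.path</name>
  <value>jceks://hdfs/tmp/s3a_bigsql.jceks</value>
</property>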