Optim High Performance Unload configuration for Big Data or external Db2 destinations

This chapter describes the necessary configuration file and parameters for Big Data or external Db2® destinations.

Configuration file description

Optim™ High Performance Unload can interact with Big Data or external Db2 destinations specified through the LOADDEST clause, for transferring data extracted with Optim High Performance Unload to these destinations. These interactions can consist in migrating data automatically, or in generating a data transfer command to be executed later. Among the Big Data destinations supported, there are NoSQL, Hadoop or Object Storage ones. The external Db2 destinations supported are the remote databases cataloged locally, and the Db2 Warehouse ones.

In order to perform a task interacting with a supported destination, some information might need to be set through the usage of the configuration file called db2hpu.dest. It is true when the destination considered is a Big Data environment, or when it is a Db2 Warehouse database, or when it is a remote Db2 database and the corresponding LOADDEST clause contains the specification of a node name or of its WITH STANDARD AUTH option. This configuration file is located in the Optim High Performance Unload configuration directory.

The db2hpu.dest file must be organized by sections, each one corresponding to a given destination. Each section must be organized like the following:

a section name, specified between brackets, identifying the associated destination. Note that the section name value is case sensitive. The possible values for the section name are:
- Cloudant: this is the value to be used for a section relative to a NoSQL environment corresponding to a Cloudant database
- CouchDB: this is the value to be used for a section relative to a NoSQL environment corresponding to a CouchDB database
- MongoDB: this is the value to be used for a section relative to a NoSQL environment corresponding to a MongoDB database
- HDFS: this is the value to be used for a section relative to an Hadoop environment corresponding to an HDFS file system
- Hive: this is the value to be used for a section relative to an Hadoop environment corresponding to an Hive database
- HBase: this is the value to be used for a section relative to an Hadoop environment corresponding to an HBase database
- MapReduce: this is the value to be used for a section relative to a customized MapReduce utility for an Hadoop environment
- RemoteDB2: this is the value to be used for a section relative to a remote Db2 database cataloged locally
- Warehouse: this is the value to be used for a section relative to a Db2 Warehouse environment
- AWS_S3: this is the value to be used for a section relative to an Amazon S3 environment
- Swift: this is the value to be used for a section relative to a Swift environment
a list of parameters, among which some are mandatory ones, each specified with a pattern of the kind name=value.

Creating and filling such a file without the specification of a LOADDEST clause is useless. But, for a given destination specified through the LOADDEST clause, when an associated data transfer command must be generated, it is mandatory that a section corresponding to this destination be existing into the db2hpu.dest file.

Parameters specific to NoSQL destinations

url: It is a mandatory one. It allows to specify the url of the NoSQL destination concerned. If necessary, it is also possible to add a port number after the url, with the following pattern: <url>:<port>.
dbname: It is a mandatory one for the Cloudant and MongoDB destinations, for other destinations it is not required. It allows to specify the database name to be used for the JSON documents uploaded.
collectionname: It can only be used for the MongoDB destination, for which it is a mandatory one. It allows to specify the collection name to be used for the JSON documents uploaded.

Parameters specific to Hadoop destinations

hdfspath

It can be used along with whatever destination keyword, and it is a mandatory one. It must be used to specify a location on the associated HDFS file system. For the HDFS destination keyword, this location corresponds to the final one where the file containing the data extracted has to be copied. For the other keywords, it corresponds to the temporary location where the file containing the data extracted is copied for a subsequent upload command against this copy. The location specified must exist before executing the copying step, otherwise an error is returned by the HDFS file system when this step is performed. Also this location must correspond to an absolute HDFS path, an HDFS path being compulsorily a UNIX-like one.

pig

It can only be used along with the HBASE destination keyword, and it is an optional one. Its value can be yes or no. By default, if this parameter is not set, the value is no. In order to upload data to an HBase destination, there are two tools which can be used (hadoop or pig). The tool used, when the pig parameter is set to no or when it is not specified, is hadoop, along with the usage of a MapReduce program provided and installed with Optim High Performance Unload, and called “HBaseMapReduce.jar”. Otherwise, when the pig parameter is set to yes, the tool used is pig.

url

It can only be used along with the HIVE destination keyword, and it is a mandatory one in this case. It must be used to indicate a JDBC url corresponding to an Hive database. The url to be specified here must strictly correspond to an already functioning one. The pattern expected for the specification of this url varies, depending if an associated authentication method has been specified or not:

without authentication:
```
jdbc:hive2://<hivehost>:<port>/<db>
```

with a Kerberos authentication:

jdbc:hive2://<hivehost>:<port>/<db>;principal=<Hive_Principal>

In order to get more details about these JDBC url specifications, one can have a look to the Hive documentation.

When a Kerberos authentication is involved, there are two distinct Kerberos principals to be considered and specified:

the one to be used for the preliminary authentication step before the upload command, which must be specified in the db2hpu.dest configuration file, through its user parameter,
the one at the end of the JDBC url specification, corresponding to the Hive administrator.

command

It can only be used along with the MAPREDUCE keyword, and it is a mandatory one in this case. It allows to specify a partial command line based on the usage of a MapReduce program of one's own, along with optional command line arguments to be passed to it. This string specification is then used for the generation of the effective upload command.

Parameters specific to Db2 Warehouse destinations

curl: It is an optional one. Its value can be yes or no. By default, if this parameter is not set, the value is no. When the 'curl' parameter is set to no or unspecified, the upload is based on the CLPPlus tool for an upload command generation, or the Db2 Load API for an automatic migration. Otherwise, when the 'curl' parameter is set to yes, the tool used is cURL, for both upload command generation or automatic migration.
url: It is a mandatory one. It allows to specify the url of the Db2 Warehouse server concerned.
port: It is a mandatory one only when the 'curl' parameter is unspecified or set to no, it is useless instead. It allows to specify the port number of the Db2 Warehouse server concerned.
dbname: It is a mandatory one only when the 'curl' parameter is unspecified or set to no, it is useless instead. It allows to specify the database name to be used for the upload.

When using the CLPPlus tool or the Db2 Load API, in order to obtain the url, port and dbname values to be configured, one must log into the Db2 Warehouse web console concerned. On the web console, one must then choose the ‘Connect’ menu item, and then the ‘Connection Information’ menu item. The information searched can be found on either the ‘Connection without SSL’ tab or the ‘Connection with SSL’ tab, associated to the fields called ‘Host name’, ‘Port number’ and ‘Database name’ respectively

Parameters for Big Data and Db2 Warehouse destinations

alias: It is an optional one. Its value is case sensitive. It must be set if an alias is specified into a LOADDEST clause within a control file. In this case, this parameter is the criteria allowing to determine that its associated section is the one to be taken into account. For a migration task towards a NoSQL or a Db2 Warehouse destination, if a standard authentication method is specified into a LOADDEST clause, it must be set too. In this case, its value must match with the name of a section created in the credentials file for the destination type considered.
binpath: It is an optional parameter. When considering a Db2 Warehouse destination, it is useless if migrating data calling the Db2 Load API. It allows to specify a directory which is used when building the upload command, by prepending it to the name of the upload tool, in order to avoid any problem with finding it when trying to execute it for the effective upload.
host: It is a mandatory parameter when migrating data only, otherwise it is useless. When considering a Db2 Warehouse destination, it is useless if migrating data calling the Db2 Load API. It allows to specify the name of the machine on which the upload command to be used for the data migration is going to be generated and executed.

Parameters specific to RemoteDB2 destinations

dbname: It is a mandatory one. It allows to specify the database name to be used for the Db2 Load.
node: It is an optional one. It must be set with a remote Db2 node name. Its value is case sensitive. For a migration task, it must be set, and it is used to get the associated credentials. In this case, its value must match with the name of a section of remote type created in the credentials file. For a Db2 Load command generation, if an authentication method and a node name are specified into a LOADDEST clause within a control file, it must be set too. In this case, this parameter is the criteria allowing to determine that its associated section is the one to be taken into account.

Parameters specific to Amazon S3 destinations

curl: It is an optional one. Its value can be yes or no. By default, if this parameter is not set, the value is no. When the 'curl' parameter is set to no or unspecified, the tool considered is AWS CLI for both upload command generation or automatic migration. Otherwise, when the 'curl' parameter is set to yes, the tool used is cURL, for both upload command generation or automatic migration.

bucket: It is a mandatory one. It allows to specify the name of the Amazon bucket concerned.

relativepath: It is a mandatory one. It allows to specify the relative path where the files generated by Optim High Performance Unload will be uploaded.

filemaxsize: It is an optional one. It allows to specify the file size limit of the output files generated. By default, the size limit applying to the files for an Amazon S3 destination is of 100MB, which corresponds to the size recommended by Amazon. This parameter must be set to a numerical value which unit is the megabyte. Its value must be between 4 and 5242880 (5TB), the latter being the upper limit for the Amazon S3 objects size.

accesskey: It is a mandatory one only when the 'curl' parameter is set to yes, it is useless instead. It allows to specify the access key of the user account.

secretkey: It is a mandatory one only when the 'curl' parameter is set to yes, it is useless instead. It allows to specify the secret key of the user account.

version: It is an optional one if the 'curl' parameter is set to yes, it is useless instead. It corresponds to the signature version which will be used for uploading the data towards the Amazon bucket. Its value can be 2 or 4. By default, if this parameter is not set, the value is 4. When specifying the signature version 2, the underlying hashing algorithm used is SHA-1. With the signature version 4, the SHA-256 algorithm is used for the hashing.
Note: Depending on the creation date of the bucket considered, the signature version 2 is not necessarily supported. More details about this can be found within the official Amazon S3 documentation.

encrypt: It is an optional one which corresponds to the Amazon Server Side Encryption (SSE). When the 'curl' parameter is set to yes, it is only consistent with the signature version 4 (see ‘version’ parameter). Its value can be AES256 or aws:kms and is case sensitive.

region: It is a mandatory one only when the 'curl' parameter is set to yes and when the 'version' parameter is unspecified or set to 4, it is useless instead. It allows to specify the bucket region name concerned.

profile: It is an optional one if the 'curl' parameter is unspecified or set to no, it is useless instead. It allows to specify a named profile which corresponds to AWS CLI credentials.

When considering an Amazon S3 destination, the authentication relies on two parameters, which are the access key and the secret key associated to the Amazon account considered. In the case where Optim High Performance Unload is configured to generate an upload command based on the cURL tool, these keys must be set within the db2hpu.dest configuration file through the usage of the ‘accesskey’ and ‘secretkey’ parameters. If the tool to be considered is AWS CLI, these keys must be stored into it before executing any upload command generated with Optim High Performance Unload, by the user aimed to execute the upload command. This step can be performed by executing the ‘aws configure’ command, as follows:

aws configure
AWS Access Key ID [None]: USER_ACCESS_KEY
AWS Secret Access Key [None]: USER_SECRET_KEY
Default region name [None]: BUCKET_REGION_NAME
Default output format [None]:

Note: When one wants to upload data towards Amazon S3 using the cURL tool, there are several rules or requirements:

it is not supported to generate such an upload command for a Windows platform.
if the signature version is 4, the machine where the upload command will be executed must have the openssl command installed and accessible through the PATH environment variable.
if the signature version is 2, the machine where the upload command will be executed must have the openssl and base64 commands installed and accessible through the PATH environment variable.

Parameters specific to Swift destinations

auth_version

It is an optional one. It corresponds to the authentication version to be used. The possible values are 1, 2 and 3. If this parameter is not set, Optim High Performance Unload will try to get the value of the “ST_AUTH_VERSION” environment variable. If this environment variable is not set too, the authentication version 3 will be used by default.

Note: the authentication version 1 corresponds to the Swift TempAuth mechanism, the authentication versions 2 and 3 correspond to the Swift Keystone Auth mechanism.

url

It is an optional one. It allows to specify the Swift server url. This url must be specified according to the authentication version chosen. The patterns depending on the authentication version are the following ones:

version 1: http://<url>:<port>/auth/v1.0
version 2: http://<url>:<port>/v2.0
version 3: http://<url>:<port>/v3

If this parameter is not set, Optim High Performance Unload will try to get either the value of the “ST_AUTH” environment variable when using the authentication version 1, or the value of the “OS_AUTH_URL” environment variable when using the authentication version 2 or 3.

container

It is a mandatory one. It corresponds to the container where Optim High Performance Unload output files will be uploaded.

project

It is an optional one. It can be specified only for the authentication versions 2 and 3, it is useless instead. Depending on the authentication version, the parameter has different meanings:

for version 2, it corresponds to a swift tenant name.
for version 3, it corresponds to a swift project name.

If this parameter is not set, Optim High Performance Unload will try to get either the value of the “OS_TENANT_NAME” environment variable when using the authentication version 2, or the value of the “OS_PROJECT_NAME” environment variable when using the authentication version 3.

domain

It is an optional one. It can be specified only for the authentication version 3. It corresponds to a high-level container for projects, users and groups. If this parameter is not set, Optim High Performance Unload will try to get the value of the “OS_DOMAIN_NAME” environment variable.

region

It is an optional one. It can be specified only for the authentication version 3. It corresponds to the region in which the objects are stored. If this parameter is not set, Optim High Performance Unload will try to get the value of the “OS_REGION_NAME” environment variable.

id

It is an optional one. Its value can be yes or no. When it is set to yes, it allows to consider a user ID instead of a user name into the swift upload command generated by Optim High Performance Unload. This parameter is only taken into account with the authentication version 3, it is ignored instead.

filemaxsize

It is an optional one. It allows to specify the file size limit of the output files generated. By default, the size limit applying to the files for a Swift destination is of 5GB. This parameter must be set to a numerical value which unit is the megabyte. Its value must be between 4 and 5120 (5GB), the latter being the upper limit for the Swift objects size.

Authentication parameters - standard authentication method

In order to rely on such an authentication method, the WITH STANDARD AUTH option must be specified into the LOADDEST clause. This standard method must be used when the authentication to the destination considered is based on the usage of a user and its associated password. This is the unique method which can be specified for a Db2 destination, for the Cloudant and CouchDB NoSQL destinations and for the Swift destination. It can be specified too for the MongoDB destination.

user

It allows to specify the user to be considered. This user must have the appropriate permissions to perform the expected data transfer command, and it is taken into account when generating it. The way its associated password is handled depends on the task performed:

if the task performed is a standard unload one towards a remote Db2 destination, accompanied by a Db2 Load command generation, this parameter is an optional one. If the associated LOADDEST clause contains the specification of its WITH STANDARD AUTH option, it is used into the preliminary connection step generated before the Db2 Load command itself.
if the task performed is a standard unload one towards a Big Data destination, accompanied by an upload command generation, this command is generated with a preliminary step created for an obfuscated prompting of the password. When executing the command in question, the password must be typed first, and then the effective upload command is executed, taking into account the password typed.
if the task performed is an automatic migration one, this parameter is a mandatory one. The underlying data transfer command generated must internally contain a reference to a password. This password is read from the credentials file associated to the user executing the automatic migration. It must be stored beforehand into this file, creating a credentials section of the appropriate type: the remote, cloudant, couchdb, mongodb, warehouse and swift types must be respectively considered, depending if the destination chosen is a remote Db2 one, a Cloudant one, a CouchDB one, a MongoDB one, a Db2 Warehouse one or a Swift one. If the destination is a remote Db2 one, the user specified must be the same as the one set to the user parameter into the credentials section considered.
when considering a Swift destination, there are several rules applying to this parameter. It is an optional one. In most cases, this parameter must be set with a user name. But, if the ‘id’ parameter is set to yes, and the authentication version chosen is 3, this parameter must be set with a user ID. If this parameter is not set, Optim High Performance Unload will try to get the value of an environment variable depending on the authentication version chosen. If the authentication version chosen is 1, the environment variable considered is the “ST_USER” one. If the authentication version chosen is 2, or if the authentication version chosen is 3 and the ‘id’ parameter is not set to yes, the environment variable considered is the “OS_USERNAME” one. If the authentication version chosen is 3 and the ‘id’ parameter is set to yes, the environment variable considered is the “OS_USER_ID” one.
Note: in a Swift environment, one can set a password associated to a user through an environment variable, the “ST_KEY” one for the authentication version 1, the “OS_PASSWORD” one for the authentication version 2 or 3. But, these environments variables are never taken into account by Optim High Performance Unload when performing an automatic migration. Such a password must be stored into the Optim High Performance Unload credentials file instead.

Authentication parameters - Kerberos authentication method

In order to use such an authentication method, the WITH KERBEROS AUTH option must be specified into the LOADDEST clause. This Kerberos method must be used when the authentication to the destination considered is based on the usage of a Kerberos principal. This authentication method can be specified for the MongoDB destination and is the unique one which can be specified for all the Hadoop destinations.

user

It allows to specify the Kerberos principal to be considered. This Kerberos principal must have the appropriate permissions to perform the expected data transfer command. It is taken into account when generating the data transfer command, by adding a preliminary step for the obtainment of a Kerberos ticket by this principal. This step is based on a kinit command, which is part of the Kerberos client. It must be installed before trying to obtain a ticket. Obtaining such a ticket also relies on the usage of a password. The way this password is handled depends on the task performed:

If the task performed is a standard unload one accompanied by an data transfer command generation, the associated keytab parameter can be configured too. If it is configured, the preliminary step for the obtainment of the Kerberos ticket is generated so that it directly refers to the keytab file configured. If it is not configured, the preliminary step is generated so that it will prompt for a password when being executed.
If the task performed is an automatic migration one, it is mandatory to configure an associated keytab parameter.

keytab

It allows to specify an absolute path to a keytab file in which the principal to be taken into account must have its password stored beforehand.