Preparing to install Db2 Big SQL

The following information shows you how to plan and prepare to install the Db2® Big SQL service.

Remote Hadoop cluster or public or private object store

Db2 Big SQL connects to an existing remote big data storage system. The system can be:

  • an object store
  • a Hadoop cluster on Cloudera Data Platform (CDP) 7.1.7, including SP1, on x86-64 hardware

You provide the connection information when you provision Db2 Big SQL.

To connect to a remote Hadoop cluster, you must meet the following requirements:
  1. The HDFS NameNode and Hive metastore services must be accessible from the Db2 Big SQL service. Specifically, the services must be bound to an IP address that is accessible outside the Hadoop cluster.

    To make the HDFS NameNode accessible, do the following steps:

    1. In Cloudera Manager, navigate to the HDFS service.
    2. Click the Configuration tab.
    3. Locate HDFS Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml and click Add.
    4. Create the property dfs.namenode.rpc-bind-host and set its value to 0.0.0.0.
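    After you save the change and restart the HDFS service, the generated hdfs-site.xml contains an entry like the following one (shown here for illustration):

      <property>
        <name>dfs.namenode.rpc-bind-host</name>
        <value>0.0.0.0</value>
      </property>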
  2. Data nodes on the remote Hadoop cluster must be accessible to Db2 Big SQL, and the associated ports must be open on each data node to the Db2 Big SQL service. The following components must also be accessible:
    • Cloudera Manager server
    • Hive server(s)
    • Zookeeper
    • Hive metastore
    • Ranger (if configured)
  3. In HDFS, the /user/db2inst1 directory must be created. An HDFS administrator must run the following commands to create the directory, make the db2inst1 user the owner, and allow the db2inst1 user to write to it:
    hdfs dfs -mkdir /user/db2inst1
    hdfs dfs -chmod 755 /user/db2inst1
    hdfs dfs -chown -R db2inst1 /user/db2inst1
  4. To enable automatic syncing of the Db2 Big SQL catalog and the Hive metastore, an additional subdirectory is needed. Run the following commands:
    hdfs dfs -mkdir /user/db2inst1/sync
    hdfs dfs -chmod 777 /user/db2inst1/sync
    hdfs dfs -chown -R db2inst1:hdfs /user/db2inst1
    hdfs dfs -chown -R db2inst1:hdfs /user/db2inst1/sync
  5. Ensure that the DNS for the Red Hat® OpenShift® cluster is configured so that it can resolve hosts in the Hadoop cluster. This configuration means that a DNS entry exists for each host in the Hadoop cluster that maps the hostname to an IP address. There is also an additional requirement for reverse DNS entries when Kerberos security is enabled. For more information, see Updating the key distribution center (KDC).
To connect to an object store, you must meet the following requirements:
  1. The object store must be one of the following Hadoop S3a compatible object stores: Amazon Web Services S3, IBM® Cloud Object Storage, IBM Storage Scale Object Storage, or Red Hat OpenShift Data Foundation/Ceph®.
  2. The credentials must permit read and write access on the bucket(s) that Db2 Big SQL interacts with.
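As a sketch, a connection to an S3a-compatible endpoint is typically described by a small set of Hadoop S3A properties like the following (the endpoint and credential values here are placeholders; you supply the actual values when you provision Db2 Big SQL rather than editing configuration files):

  fs.s3a.endpoint=s3.us-east-1.amazonaws.com
  fs.s3a.access.key=<access-key>
  fs.s3a.secret.key=<secret-key>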

Giving Db2 Big SQL access to the Hive warehouse directory

For Db2 Big SQL to read and write data in HDFS, the db2inst1 user must be granted access privileges on the Hive warehouse directory. The specific privileges that are required depend on whether Db2 Big SQL impersonation is enabled.

If Db2 Big SQL impersonation is not enabled, the db2inst1 user requires permissions on the warehouse directories. These permissions can be granted via a policy in the Ranger HDFS plugin for CDP or via an HDFS ACL. For example, if tables in all schemas are to be created, populated with data, altered, or dropped, run the following commands to grant the required permissions by using HDFS ACLs:
hdfs dfs -setfacl -R -m user:db2inst1:rwx /warehouse/tablespace/managed/hive
hdfs dfs -setfacl -R -m default:user:db2inst1:rwx /warehouse/tablespace/managed/hive
hdfs dfs -setfacl -R -m user:db2inst1:rwx /warehouse/tablespace/external/hive
hdfs dfs -setfacl -R -m default:user:db2inst1:rwx /warehouse/tablespace/external/hive
Note: If there are many tables in the Hive warehouse, this set of commands might take some time to run.
However, if Db2 Big SQL is to have access to only a subset of the schemas, permissions can be granted at the schema level. For example, the following commands grant all permissions on only the SALES schema to the db2inst1 user:
hdfs dfs -setfacl -R -m user:db2inst1:rwx /warehouse/tablespace/external/hive/sales.db
hdfs dfs -setfacl -R -m default:user:db2inst1:rwx /warehouse/tablespace/external/hive/sales.db

If Db2 Big SQL is to have access to tables outside of the warehouse (that is, when the LOCATION clause is specified), grant the required permissions on the specified directories either via a policy that is defined in the Ranger HDFS plugin or via HDFS ACLs similar to those shown in this section.
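If you use the Ranger HDFS plugin instead of HDFS ACLs, an equivalent policy might look like the following sketch (the service name, policy name, and path shown here are illustrative; adjust the users and accesses for your environment):

  {
    "service": "cm_hdfs",
    "name": "db2inst1-warehouse-access",
    "resources": { "path": { "values": [ "/warehouse/tablespace/external/hive" ], "isRecursive": true } },
    "policyItems": [ {
      "users": [ "db2inst1" ],
      "accesses": [
        { "type": "read", "isAllowed": true },
        { "type": "write", "isAllowed": true },
        { "type": "execute", "isAllowed": true }
      ]
    } ]
  }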

Updating the key distribution center (KDC)

If you are connecting to a Hadoop cluster on which Kerberos security is enabled, you must update your KDC for Db2 Big SQL.

Starting with version 7.2.3, Db2 Big SQL uses the version of Kerberos that is shipped with Red Hat Enterprise Linux® 8.4, which supports only strong encryption ciphers. When you connect Db2 Big SQL to a Hadoop cluster with MIT Kerberos security enabled, you can use the following encryption types:

  • aes256-cts
  • aes128-cts
  • aes256-cts-hmac-sha1-96

For more information about how to set the ciphers to be used by Kerberos, see the Cloudera documentation page Managing Kerberos credentials using Cloudera Manager.
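On an MIT KDC, for example, the encryption types that are used for new keys are typically controlled by the supported_enctypes setting in the realm section of kdc.conf. The following line is an illustrative sketch; refer to the Cloudera and MIT Kerberos documentation for the exact settings for your environment:

  supported_enctypes = aes256-cts-hmac-sha1-96:normal aes128-cts-hmac-sha1-96:normal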

  1. Edit the /var/kerberos/krb5kdc/kadm5.acl file. Change the asterisk (*) to e*. For example:
    cat /var/kerberos/krb5kdc/kadm5.acl
    */admin@IBM.COM e*
  2. Restart the KDC by running the following commands:
    sudo service krb5kdc restart
    sudo service kadmin restart
Notes:

Db2 Big SQL obtains a new Kerberos ticket approximately every 12 hours. Therefore, the ticket lifetime defined in the Kerberos configuration (KDC, krb5) must be set to at least 16 hours.
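For example, on an MIT KDC a ticket lifetime of 16 hours can be set in the [libdefaults] section of /etc/krb5.conf. This is an illustrative sketch; the effective lifetime is also capped by the maximum ticket life that is configured on the KDC and on the individual principal:

  [libdefaults]
    ticket_lifetime = 16h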

If you are using Active Directory Kerberos or would prefer to manually pre-create the principal and keytab required for Db2 Big SQL, see Enabling Active Directory as the Kerberos server.

Ensure that the DNS for the OpenShift cluster is configured so that the Kerberos server and Hive metastore hosts in the Hadoop cluster have valid forward and reverse DNS entries. This configuration means that a DNS entry exists that maps the hostname (of the KDC and of each Hive metastore) to an IP address, and a reverse entry exists for that IP address that maps it back to the original hostname.

Enabling Active Directory as the Kerberos server

If you are connecting Db2 Big SQL to a Hadoop cluster that is using Active Directory (AD) as its Kerberos server, generate custom keytabs for the db2inst1 user.

  1. Log in to the Active Directory server and create the db2inst1 user.
  2. Generate the keytab for the db2inst1 user, and copy it to a workstation that can access the Red Hat OpenShift cluster.
    ktpass -princ db2inst1@REALM -out db2inst1.keytab -pass Password -mapuser db2inst1 -crypto all -ptype KRB5_NT_PRINCIPAL

When you provision a Db2 Big SQL instance, you upload the keytab file when you set up a connection to the Hadoop cluster.

Connecting to a TLS (SSL) enabled Hadoop cluster

To connect Db2 Big SQL to a Hadoop cluster that uses the TLS (SSL) protocol, it is recommended that you create a secret that contains your company's CA certificate. When this secret is created, the certificate is used to connect to all Hadoop services. For more information, see Using a CA certificate to connect to internal servers from the platform.
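As a sketch, a generic secret that holds a CA certificate in PEM format might be defined as follows (the secret name and certificate content are placeholders; use the name and namespace that the documented procedure expects):

  apiVersion: v1
  kind: Secret
  metadata:
    name: hadoop-ca-cert
  type: Opaque
  stringData:
    ca.crt: |
      -----BEGIN CERTIFICATE-----
      <certificate body>
      -----END CERTIFICATE-----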

What to do next

Installing Db2 Big SQL