Preparing to install Db2 Big SQL

Use this information to plan and prepare to install the Db2® Big SQL service.

System requirements

For information about the minimum system requirements for the Db2 Big SQL service, see System requirements for services.

Storage requirements

Even though Db2 Big SQL data is stored on HDFS or object stores, Db2 Big SQL requires persistent volumes to store its SQL catalog, configuration files, log files, and temporary files created during SQL processing.

Db2 Big SQL supports dynamic storage provisioning. The cluster administrator provides a storage class name, which is used to provision storage for all nodes of the service (the head node and each worker node).
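
For example, the cluster administrator can list the storage classes that are available on the Red Hat OpenShift cluster before choosing one. This is a standard OpenShift CLI command and is not specific to Db2 Big SQL:
oc get storageclass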

Note: Worker nodes use dynamic provisioning so that the service can be scaled up and down efficiently. The storage class name that you provide is also used when the service is scaled.

For more information about storage requirements for Cloud Pak for Data, see Storage considerations.

Required access on persistent storage

The storage class that you use for Db2 Big SQL must support the ReadWriteMany access mode.

No specific file system access needs to be granted. The Db2 Big SQL service manages the persistent volume, and creates two directories (versioned and unversioned) where all metadata is stored.

Remote Hadoop cluster or public or private object store

Db2 Big SQL connects to an existing remote big data storage system. The system can be an object store, or a Hadoop cluster on the following platforms:

  • Hortonworks Data Platform (HDP) 2.6.5 or 3.1
  • Cloudera Data Platform (CDP) 7.1.3
  • Cloudera Distribution Including Apache Hadoop (CDH) 5.12 to 5.16, and 6.3.

You provide the connection information when you provision Db2 Big SQL.

To connect to a remote Hadoop cluster, you must meet the following requirements:
  1. The HDFS NameNode and Hive metastore services must be accessible from the Db2 Big SQL service. Specifically, the services must be bound to an IP address that is accessible outside the Hadoop cluster.

    To make the HDFS NameNode accessible, do the following steps:

    1. In Ambari (HDP) or Cloudera Manager (CDP and CDH), navigate to the HDFS service.
    2. Click the CONFIGS (HDP) or Configuration (CDP and CDH) tab.
    3. Expand Custom hdfs-site (HDP) or locate HDFS Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml (CDP and CDH), and click Add Property (HDP) or Add (CDP and CDH).
    4. Create the property dfs.namenode.rpc-bind-host and set its value to 0.0.0.0.
  2. Data nodes on the remote Hadoop cluster must be accessible to the Db2 Big SQL service, and the associated ports must be open on each data node (see the connectivity checks after this list). The following components must also be accessible:
    • On HDP clusters, the Ambari server, Hive server(s), Zookeeper, Hive metastore, and Ranger (if configured).
    • On CDP clusters, the Cloudera Manager server, Hive server(s), Zookeeper, Hive metastore, and Ranger (if configured).
    • On CDH clusters, the Cloudera Manager server, Hive server(s), Zookeeper, Hive metastore, and Sentry (if configured).
  3. On CDH clusters, if Sentry is configured, the Sentry service must be accessible from the Db2 Big SQL service and the service must be bound to an IP address that is accessible outside the Hadoop cluster. In the Sentry Service Advanced Configuration Snippet (Safety Valve) for sentry-site.xml, create the property sentry.service.server.rpc-address and set its value to 0.0.0.0.
  4. In HDFS, the /user/bigsql directory must be created. An HDFS administrator must run the following commands to create the directory, make the bigsql user the owner, and allow the bigsql user to write to it:
    hdfs dfs -mkdir /user/bigsql
    hdfs dfs -chmod 755 /user/bigsql
    hdfs dfs -chown -R bigsql /user/bigsql
  5. If automatic syncing of the Db2 Big SQL catalog and the Hive metastore is enabled, an additional subdirectory is required. Run the following commands:
    hdfs dfs -mkdir /user/bigsql/sync
    hdfs dfs -chmod 777 /user/bigsql/sync
    hdfs dfs -chown -R bigsql:hdfs /user/bigsql
    hdfs dfs -chown -R bigsql:hdfs /user/bigsql/sync
  6. Ensure that the DNS for the Red Hat® OpenShift® cluster is configured so that it can resolve hosts in the Hadoop cluster. This configuration means that a DNS entry exists for each host in the Hadoop cluster that maps the hostname to an IP address. There is an additional requirement for reverse DNS entries when Kerberos security is enabled. For more information, see Updating the key distribution center (KDC).
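
For example, from a host that can reach the same network as the OpenShift cluster, you might verify that the Hadoop hostnames resolve and that the key service ports are reachable. The hostnames are placeholders, the ports shown are the Hadoop defaults, and the nc utility might not be installed in your environment:
# Confirm that the NameNode hostname resolves through DNS
nslookup namenode.hadoop.example.com
# Confirm that the HDFS NameNode RPC port (default 8020) is reachable
nc -vz namenode.hadoop.example.com 8020
# Confirm that the Hive metastore port (default 9083) is reachable
nc -vz metastore.hadoop.example.com 9083
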
To connect to a remote CDH cluster with Sentry enabled, you must set the following permissions on the cluster before you can use Db2 Big SQL with IBM Db2 Data Management Console.
  1. On all Hadoop nodes, create the user dmcuser in the dmcuser group.
  2. Run the following statements to grant the required permissions in Sentry:
    create role dmc_role;
    grant dmc_role to group dmcuser;
    grant all on server server1 to role dmc_role;
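
For example, these statements might be run at the Beeline prompt by a user that has Sentry administrative privileges. The HiveServer2 connection URL shown is a placeholder for your environment:
beeline -u "jdbc:hive2://hiveserver.hadoop.example.com:10000/default"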

After you install Db2 Big SQL and Db2 Data Management Console, you must create the user dmcuser in the dmcuser group in IBM Cloud Pak for Data. For more information, see Requirements for Sentry integration.

To connect to an object store, you must meet the following requirements:
  1. The object store must be one of the following Hadoop S3a-compatible object stores: Amazon Web Services S3, IBM Cloud Object Storage, or Red Hat OpenShift Container Storage/Ceph®.
  2. The credentials must permit read and write access on the bucket(s) that Db2 Big SQL interacts with.
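
For example, with an S3-compatible object store, you might confirm that the credentials allow both read and write access on a bucket by using the AWS CLI. The endpoint URL and bucket name are placeholders for your environment:
aws --endpoint-url https://s3.example.com s3 ls s3://bigsql-bucket
aws --endpoint-url https://s3.example.com s3 cp test.txt s3://bigsql-bucket/test.txt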

Giving Db2 Big SQL access to the Hive warehouse directory

For Db2 Big SQL to be able to read and write data on HDFS, the bigsql user must be granted access privileges on the Hive warehouse directory. The specific privileges that are required depend on whether Db2 Big SQL impersonation is enabled.

If Db2 Big SQL impersonation is not enabled, the bigsql user requires certain permissions on the warehouse directories. These permissions can be granted through a policy in the Ranger HDFS plugin for HDP or CDP, through a Sentry grant for CDH, or through an HDFS ACL. For example, on HDP 3.1 and CDP, if tables in all schemas are to be created, populated with data, altered, or dropped, run the following commands to grant the required permissions by using HDFS ACLs:
hdfs dfs -setfacl -R -m user:bigsql:rwx /warehouse/tablespace/managed/hive
hdfs dfs -setfacl -R -m default:user:bigsql:rwx /warehouse/tablespace/managed/hive
hdfs dfs -setfacl -R -m user:bigsql:rwx /warehouse/tablespace/external/hive
hdfs dfs -setfacl -R -m default:user:bigsql:rwx /warehouse/tablespace/external/hive

On CDH, the warehouse directories are under /user/hive/warehouse. On HDP 2.6.5, the warehouse directories are under /apps/hive/warehouse. Adjust these directory paths in Ranger or the HDFS ACLs as required.
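
For example, on CDH, equivalent ACLs might be set on the default warehouse location:
hdfs dfs -setfacl -R -m user:bigsql:rwx /user/hive/warehouse
hdfs dfs -setfacl -R -m default:user:bigsql:rwx /user/hive/warehouse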

Note: If there are many tables in the Hive warehouse, this set of commands might take some time to run.
However, if Db2 Big SQL is to have access to only a subset of the schemas, permissions can be granted at the schema level. For example, the following commands grant all permissions on only the SALES schema to the bigsql user:
hdfs dfs -setfacl -R -m user:bigsql:rwx /warehouse/tablespace/external/hive/sales.db
hdfs dfs -setfacl -R -m default:user:bigsql:rwx /warehouse/tablespace/external/hive/sales.db

If Db2 Big SQL is to have access to tables outside of the warehouse (that is, when the LOCATION clause is specified), the required permissions can be granted through a policy that is defined in the Ranger HDFS plugin, or by setting file access control lists (FACLs), similar to those shown above, on the specified directories.
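
For example, if external tables are created under a hypothetical /data/sales directory, ACLs similar to the following might be set on that path:
hdfs dfs -setfacl -R -m user:bigsql:rwx /data/sales
hdfs dfs -setfacl -R -m default:user:bigsql:rwx /data/sales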

If you plan to enable impersonation, set the hadoop.proxyuser.bigsql.hosts and hadoop.proxyuser.bigsql.groups properties in Ambari (HDFS > Advanced settings > custom core-site). The hadoop.proxyuser.bigsql.hosts property must include a list of the hostnames, or the IP address range, of the OpenShift compute nodes on which Db2 Big SQL pods can run. The hadoop.proxyuser.bigsql.groups property must include a list of the user groups that the Db2 Big SQL service is allowed to impersonate.
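
For example, the properties might be set to values similar to the following; the CIDR range and group names are placeholders for your environment:
hadoop.proxyuser.bigsql.hosts=10.17.64.0/18
hadoop.proxyuser.bigsql.groups=analysts,datascientists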

All actions are performed as the connected user when impersonation is enabled.

Updating the key distribution center (KDC)

If you are connecting to a Hadoop cluster on which Kerberos security is enabled, you must update your KDC for Db2 Big SQL.

  1. Edit the /var/kerberos/krb5kdc/kadm5.acl file. Change the asterisk (*) to e*. For example:
    cat /var/kerberos/krb5kdc/kadm5.acl
    */admin@IBM.COM e*
  2. Restart the KDC by running the following commands:
    sudo service krb5kdc restart
    sudo service kadmin restart
Notes:

Db2 Big SQL obtains a new Kerberos ticket approximately every 12 hours. Therefore, the ticket lifetime defined in the Kerberos configuration (KDC, krb5) must be set to at least 16 hours.
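
For example, on an MIT KDC, the maximum ticket life for the realm might be set in the [realms] section of /var/kerberos/krb5kdc/kdc.conf; the value shown is illustrative:
max_life = 24h 0m 0s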

If you are using Active Directory Kerberos or would prefer to manually pre-create the principal and keytab required for Db2 Big SQL, see Enabling Active Directory as the Kerberos server.

Ensure that the DNS for the OpenShift cluster is configured so that the Kerberos server and Hive metastore hosts in the Hadoop cluster have valid forward and reverse DNS entries. This configuration means that a DNS entry exists that maps the hostname (of the KDC and the Hive metastores) to an IP address, and a reverse entry exists for that IP address that maps it back to the original hostname.
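
For example, forward and reverse resolution for the KDC host might be verified with commands like the following; the hostname and IP address are placeholders:
nslookup kdc.hadoop.example.com
nslookup 10.1.2.3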

Semaphore requirements

If the Linux® kernel version on the nodes is less than 4.6, the kernel semaphore settings are too low for Db2 to perform properly, and you must update the settings. For more information, see Updating kernel semaphore settings.
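
For example, you can check the kernel version and the current semaphore limits on a node as follows:
uname -r
cat /proc/sys/kernel/sem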

Security Context Constraint requirements

Db2 Big SQL requires a specific Red Hat OpenShift Security Context Constraint (SCC) to be applied to the identity that is running the Db2 Big SQL pod. The SCC is applied automatically as part of a Db2 Big SQL installation.

The SCC grants the following capabilities to the Db2 Big SQL pods:

  • MKNOD, NET_BIND_SERVICE, and NET_RAW are required by low-level management tools, such as ping, that are used to check on pod health and perform other maintenance tasks.
  • elevatePrivilege is required to run a few legacy commands with sudo inside Db2 Big SQL pods.
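
For example, after installation, a cluster administrator can list the SCCs on the cluster to review what was applied; the exact SCC name depends on your installation and is not shown here:
oc get scc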

What to do next

Installing the Db2 Big SQL service