Connecting to an HBase data source

To access HBase data sources, you must define a connection by using the properties in the Connection section on the Properties page. One instance of the HBase connector is always linked to exactly one table; that is, with a single connector instance you can read data from or write data to a single HBase table.

Before you begin

  • From the cluster that hosts the HBase database to which you want to connect, copy the core-site.xml and hbase-site.xml files and distribute them to all player nodes. (The sketch after this list shows how a client reads these files.)
  • Copy the HBase client JAR files that the HBase connector uses to connect to the target database, and distribute them to all player nodes. Always use HBase client libraries that are compatible with the version of the target database.
  • For BigIntegrate, you do not have to copy core-site.xml, hbase-site.xml, or the HBase client JAR files if they are already available on all nodes in uniform locations. However, to connect to an HBase database in a different Hadoop cluster, you must copy all of these files from that cluster and distribute them to all player nodes.
  • Define a job that contains the HBase Connector stage.
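
The copied site files are what any HBase client reads its connection details from. The following minimal Java sketch (the /opt/hbase-conf directory is a hypothetical location for the copied files) shows how a client builds its configuration from them and opens a connection; the connector reads the same details from these files.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class SiteFileSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical local directory that holds the copied files.
            String confDir = "/opt/hbase-conf";

            Configuration conf = HBaseConfiguration.create();
            conf.addResource(new Path(confDir, "core-site.xml"));
            conf.addResource(new Path(confDir, "hbase-site.xml"));

            // The connection details come entirely from the site files.
            System.out.println("quorum = " + conf.get("hbase.zookeeper.quorum"));
            System.out.println("port   = " + conf.get("hbase.zookeeper.property.clientPort"));
            System.out.println("znode  = " + conf.get("zookeeper.znode.parent"));

            // Succeeds only when the site files and client JARs are consistent.
            try (Connection connection = ConnectionFactory.createConnection(conf)) {
                System.out.println("connection open = " + !connection.isClosed());
            }
        }
    }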

Procedure

  1. In the job design canvas, double-click the HBase Connector stage icon to open the stage editor.
  2. On the Properties page, specify values for the connection properties.
    1. Provide the Hadoop Identity, a human-readable name for your cluster. This property is optional in Designer and mandatory in IMAM. It appears as the name of the HostSystem that hosts the database after you import metadata in IMAM.
    2. Provide the HBase Identity, a human-readable name for your database. This property is optional in Designer and mandatory in IMAM. It appears as the name of the database from which you import metadata in IMAM.
    3. Specify the path to the core-site.xml and hbase-site.xml files that you copied from the target cluster into a local directory. All of the information that is required to connect to the target HBase database (such as the ZooKeeper quorum, port, and parent znode) is read from these files.
    4. Specify the HBase client JAR files. Provide a semicolon-separated list of the locations of hbase-client.jar and its dependencies. Each entry can be a directory or a single JAR file; each directory is traversed recursively, so all child directories are included. The connector requires these JARs to be available at the specified locations on every node where it runs. The list is platform-dependent.

      Cloudera – use the JARs from /opt/cloudera/parcels/your-active-parcel/lib/hbase/lib, or alternatively download the shaded client JAR from the Maven repository at https://mvnrepository.com/artifact/org.apache.hbase/hbase-client. Additionally, use the JARs from /opt/cloudera/parcels/your-active-parcel/lib/hadoop/client.

      Hortonworks – use the JARs from /usr/hdp/current/hbase-client/lib/ and /usr/hdp/current/hadoop-client/.

      MapR – use the JARs from /opt/mapr/hbase/hbase-1.x.x/lib, which among others contains the HBase client libraries. In addition, /opt/mapr/hadoop/hadoop-2.x/share/hadoop/common contains the required Hadoop libraries.

    5. Select Authentication method. The options are:
      1. None – works only with clusters that are not secured. Specify the user that will be used to access HBase in the Simple authentication user name property.
      2. Kerberos using password
        1. Specify the krb5.conf location that is accessible on each node. This file contains the Kerberos configuration information and is typically found in the /etc directory. If you are connecting from outside the cluster, copy it from the cluster in the same way as the site .xml files.
        2. Provide Principal in the name@REALM format.
        3. Provide Password.
        4. Set Use ticket cache to Yes if you want to use an existing ticket that is stored in the credential cache, and provide a ticket cache location that is accessible on each node. You must run kinit on each node instead of copying the cache. If you leave the location empty, the connector uses the default location that is specified in krb5.conf. If login from the cache fails, the connector falls back to login with the password (the first sketch after this procedure illustrates this cache-then-password flow).
      3. Kerberos using keytab
        1. Specify the krb5.conf location that is accessible on each node. This file contains the Kerberos configuration information and is typically found in the /etc directory. If you are connecting from outside the cluster, copy it from the cluster in the same way as the site .xml files.
        2. Provide Principal in the name@REALM format.
        3. Provide a Keytab location that is accessible on each node (the second sketch after this procedure illustrates a keytab login). Alternatively, keytabs can be distributed by the engine: copy the keytab to any location on the edge node and set the job environment variable APT_YARN_HBASE_DEFAULT_KEYTAB_PATH to that exact location. Ensure that you leave the Keytab property empty, because the property takes priority over the variable. The variable has job scope, so the same keytab is used by every HBase stage in the job; therefore, keytabs must be merged by using the ktutil tool. If this cannot be done for any reason, you can always distribute the keytabs under separate paths.
    6. In the HBase Namespace and Target table properties, specify the name of the table to which you want to connect and the namespace in which it was created (if it is different from the default namespace). The third sketch after this procedure shows how this pair identifies a single table.
      Note: In MapR, only the default namespace can be used.
  3. Click OK to save.
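
Examples

The cache-then-password fallback that is described for the Kerberos using password method mirrors the behavior of the standard Java Krb5LoginModule. The following sketch illustrates that flow with plain JAAS; the principal, password, and application name are hypothetical, and this is not the connector's internal code.

    import java.util.HashMap;
    import java.util.Map;
    import javax.security.auth.Subject;
    import javax.security.auth.callback.Callback;
    import javax.security.auth.callback.NameCallback;
    import javax.security.auth.callback.PasswordCallback;
    import javax.security.auth.login.AppConfigurationEntry;
    import javax.security.auth.login.Configuration;
    import javax.security.auth.login.LoginContext;

    public class PasswordLoginSketch {
        public static void main(String[] args) throws Exception {
            final String principal = "etluser@EXAMPLE.COM";   // hypothetical
            final char[] password = "secret".toCharArray();   // hypothetical

            // Point the JVM at the Kerberos configuration file.
            System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");

            // Try the ticket cache first; if that fails, fall back to the
            // password callbacks, which is the order the procedure describes.
            Configuration jaasConf = new Configuration() {
                @Override
                public AppConfigurationEntry[] getAppConfigurationEntry(String name) {
                    Map<String, String> opts = new HashMap<>();
                    opts.put("useTicketCache", "true");
                    opts.put("doNotPrompt", "false");
                    return new AppConfigurationEntry[] {
                        new AppConfigurationEntry(
                            "com.sun.security.auth.module.Krb5LoginModule",
                            AppConfigurationEntry.LoginModuleControlFlag.REQUIRED,
                            opts)
                    };
                }
            };

            LoginContext login = new LoginContext("hbase-client", new Subject(),
                callbacks -> {
                    for (Callback cb : callbacks) {
                        if (cb instanceof NameCallback) {
                            ((NameCallback) cb).setName(principal);
                        } else if (cb instanceof PasswordCallback) {
                            ((PasswordCallback) cb).setPassword(password);
                        }
                    }
                },
                jaasConf);
            login.login();
            System.out.println("Authenticated: " + login.getSubject().getPrincipals());
        }
    }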
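
For the Kerberos using keytab method, the underlying Hadoop mechanism is a keytab login through UserGroupInformation. A minimal sketch, assuming a hypothetical principal and hypothetical keytab and configuration paths that must exist at the same location on every node:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KeytabLoginSketch {
        public static void main(String[] args) throws Exception {
            System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");

            // Hypothetical directory that holds the copied site files.
            Configuration conf = HBaseConfiguration.create();
            conf.addResource(new Path("/opt/hbase-conf/core-site.xml"));
            conf.addResource(new Path("/opt/hbase-conf/hbase-site.xml"));
            conf.set("hadoop.security.authentication", "kerberos");

            // Log in from the keytab; HBase connections created later in
            // this JVM use the resulting Kerberos credentials.
            UserGroupInformation.setConfiguration(conf);
            UserGroupInformation.loginUserFromKeytab(
                "etluser@EXAMPLE.COM", "/opt/keytabs/etluser.keytab");
            System.out.println("Logged in as: "
                + UserGroupInformation.getLoginUser().getUserName());
        }
    }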
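
Because one connector instance is linked to exactly one table, the HBase Namespace and Target table properties together identify everything that the stage reads or writes. The following sketch, with a hypothetical sales namespace and orders table, shows how that pair resolves to a single table handle:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Table;

    public class NamespaceTableSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical namespace and table; on MapR the namespace
            // must be the default one, per the note in the procedure.
            TableName name = TableName.valueOf("sales", "orders");

            try (Connection connection =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = connection.getTable(name)) {
                System.out.println("Opened table: " + table.getName());
            }
        }
    }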