Apache HBase

Apache HBase is an open-source, distributed, column-oriented store. HBase is the Hadoop database and provides Bigtable-like capabilities on top of Hadoop and HDFS.

Before you begin

Ensure that the MapReduce framework in IBM® Spectrum Symphony is set to use Hadoop or Cloudera's Distribution including Hadoop (CDH). For the supported versions, see Supported distributed file systems for MapReduce or YARN integration. To run HBase with Cloudera, download the required packages from the Cloudera web site. For the supported versions of HBase that the MapReduce framework in IBM Spectrum Symphony has been qualified with, see Supported third-party applications for MapReduce.

About this task

Follow these steps to use HBase within the MapReduce framework in IBM Spectrum Symphony:

Procedure

  1. Download and install HBase.
  2. Untar the package to HBASE_HOME.
    For example:
    $ tar -zxvf hbase-0.98.4-hadoop2-bin.tar.gz -C /root
    or
    $ tar -zxvf hbase-0.96.1-hadoop2-bin.tar.gz -C /root
  3. Set up the user environment on all HBase hosts:
    export HBASE_HOME=/root/hbase
    export PATH=$PATH:$HBASE_HOME/bin
    export JAVA_HOME=/root/jre
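Before continuing, it can help to confirm that these variables resolve to real paths on each host. A minimal sketch, assuming the install locations above (the check_env function name is illustrative, not part of the product):

```shell
#!/bin/sh
# Sanity-check the HBase environment on a host (sketch; helper name is illustrative).
check_env() {
  # HBASE_HOME must be set and point at a real HBase install
  [ -n "$HBASE_HOME" ] || { echo "HBASE_HOME is not set"; return 1; }
  [ -x "$HBASE_HOME/bin/hbase" ] || { echo "no executable at $HBASE_HOME/bin/hbase"; return 1; }
  # JAVA_HOME must provide a java binary for the HBase scripts
  [ -x "$JAVA_HOME/bin/java" ] || { echo "no java at $JAVA_HOME/bin/java"; return 1; }
  echo "environment looks ok"
}

# Typical use, after sourcing the exports above:
# check_env || exit 1
```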
  4. Modify HBase configuration files:
    1. Edit the $HBASE_HOME/conf/hbase-env.sh file as follows:
      export JAVA_HOME=/root/jre
    2. Edit the $HBASE_HOME/conf/hbase-site.xml file as follows:
      <configuration>
        <property>
          <name>hbase.rootdir</name>
          <value>hdfs://qa1:7020/hbase</value>
          <!--value>gpfs:///hbase</value-->
          <description>The directory shared by RegionServers.
          </description>
        </property>
        <property>
          <name>hbase.cluster.distributed</name>
          <value>true</value>
          <description>The mode the cluster will be in. Possible values are
          false: standalone and pseudo-distributed setups with managed ZooKeeper
          true: fully-distributed with unmanaged ZooKeeper quorum (see hbase-env.sh)
          </description>
        </property>
        <property>
          <name>hbase.zookeeper.quorum</name>
          <value>qa2,qa3,qa4,qa5,qa6,qa7</value>
          <description>Comma-separated list of servers in the ZooKeeper quorum.
          </description>
        </property>
        <property>
          <name>hbase.zookeeper.property.dataDir</name>
          <value>/opt/export/zookeeper</value>
          <description>Property from ZooKeeper's config zoo.cfg.
          The directory where the snapshot is stored.
          </description>
        </property>
      </configuration>
    3. Edit the $HBASE_HOME/conf/regionservers file so that it lists the region server hosts, one host name per line: qa2, qa3, qa4, qa5, qa6, and qa7.
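Because the regionservers file is simply one host name per line, it can also be generated from a script. A hedged sketch using the example cluster's host names (the write_regionservers helper is illustrative; in practice the output path would be $HBASE_HOME/conf/regionservers):

```shell
#!/bin/sh
# Write a regionservers file: one region server host name per line.
# Helper name and host list follow the example cluster above; adjust for your site.
write_regionservers() {
  out="$1"; shift
  # printf emits each remaining argument on its own line
  printf '%s\n' "$@" > "$out"
}

write_regionservers ./regionservers qa2 qa3 qa4 qa5 qa6 qa7
```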
  5. Create a symbolic link to the real hbase directory.
    For example:
    # ln -s hbase-0.98.4-hadoop2 /root/hbase
  6. For IBM Spectrum Scale FPO, do the following (assuming that hbase is installed under the $HBASE_HOME directory):
    1. Run:
      cp /usr/hadoop/etc/hadoop/core-site.xml $HBASE_HOME/conf
    2. Update the $HBASE_HOME/conf/hbase-site.xml file as follows:
      <property>
        <name>hbase.rootdir</name>
        <value>gpfs:///hbase</value>
      </property>
    3. Run:
      cp hadoop-2.4.0-gpfs.jar $HBASE_HOME/lib/
    4. Run:
      • For Linux x86-64: mkdir $HBASE_HOME/lib/native/Linux-amd64-64
      • For PowerLinux: mkdir $HBASE_HOME/lib/native/Linux-ppc64-64
    5. Run:
      • For Linux x86-64:
        cp libgpfshadoop.64.so $HBASE_HOME/lib/native/Linux-amd64-64/libgpfshadoop.so
      • For PowerLinux:
        cp libgpfshadoop.64.so $HBASE_HOME/lib/native/Linux-ppc64-64/libgpfshadoop.so
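The architecture-specific directory names in substeps 4 and 5 can be chosen from the output of uname -m. A minimal sketch, assuming only these two architectures matter (the native_dir helper is an assumption, not part of IBM Spectrum Scale):

```shell
#!/bin/sh
# Map the machine architecture to the HBase native-library subdirectory
# used in substeps 4 and 5 (helper name is illustrative).
native_dir() {
  case "$1" in
    x86_64) echo "Linux-amd64-64" ;;   # Linux x86-64
    ppc64*) echo "Linux-ppc64-64" ;;   # PowerLinux
    *)      echo "unsupported" ;;
  esac
}

# Typical use (assumes $HBASE_HOME is set and libgpfshadoop.64.so is present):
# d=$(native_dir "$(uname -m)")
# mkdir -p "$HBASE_HOME/lib/native/$d"
# cp libgpfshadoop.64.so "$HBASE_HOME/lib/native/$d/libgpfshadoop.so"
```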
  7. Start the HBase daemons by running the following command:
    # $HBASE_HOME/bin/start-hbase.sh
  8. Add the following environment variables to $PMR_HOME/conf/pmr-env.sh:
    1. Export the USER_CLASSPATH:
      export USER_CLASSPATH=/root/hbase/lib/hbase-client-0.98.4-hadoop2.jar:/root/hbase/lib/hbase-common-0.98.4-hadoop2.jar:/root/hbase/lib/hbase-protocol-0.98.4-hadoop2.jar:/root/hbase/lib/hbase-hadoop-compat-0.98.4-hadoop2.jar:/root/hbase/lib/htrace-core-2.04.jar:/root/hbase/lib/zookeeper-3.4.6.jar:/root/hbase/lib/guava-12.0.1.jar
    2. Export PMR_EXTERNAL_CONFIG_PATH:
      export PMR_EXTERNAL_CONFIG_PATH=/root/hadoop/etc/hadoop:$HBASE_HOME/conf
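The USER_CLASSPATH value above hard-codes version-specific jar names; an alternative is to assemble the list from the lib directory. A sketch under that assumption (build_classpath is illustrative, and you may still need to filter down to the exact jars listed above):

```shell
#!/bin/sh
# Join the jars in a directory into a colon-separated classpath
# (helper name is illustrative; the exact jar set depends on the HBase version).
build_classpath() {
  dir="$1"; cp=""
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue         # directory may contain no jars
    cp="${cp:+$cp:}$jar"              # append with ':' separator
  done
  echo "$cp"
}

# Typical use:
# export USER_CLASSPATH=$(build_classpath "$HBASE_HOME/lib")
```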

Example of using the HBase RowCounter tool

RowCounter is a MapReduce job that counts the rows in the indicated table. For example, if there is an HBase table named testhbase1, you can use RowCounter to count its rows within the MapReduce framework in IBM Spectrum Symphony, like this:
# HADOOP_CLASSPATH=$(${HBASE_HOME}/bin/hbase classpath) \
mrsh jar /root/hbase/lib/hbase-server-0.98.4-hadoop2.jar \
rowcounter 'testhbase1'
You are using Hadoop API with 2.4.x version.
... ...
2014-10-31 18:11:17,803 INFO  [main] internal.MRJobSubmitter: Connected to JobTracker(SSM)
2014-10-31 18:11:17,827 INFO  [main] Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2014-10-31 18:11:18,083 INFO  [main] Configuration.deprecation: mapred.output.key.comparator.class is deprecated. Instead, use mapreduce.job.output.key.comparator.class
2014-10-31 18:11:18,091 WARN  [main] internal.MRJobSubmitter: !Not using C++ framework, because comparator class < org.apache.hadoop.hbase.io.ImmutableBytesWritable > for map output key class < org.apache.hadoop.hbase.io.ImmutableBytesWritable$Comparator > is not supported in C++!
2014-10-31 18:11:20,879 INFO  [main] internal.MRJobSubmitter: Job <rowcounter_testhbase1> submitted, job id <703>
2014-10-31 18:11:20,879 INFO  [main] internal.MRJobSubmitter: Job will not verify intermediate data integrity using checksum.
2014-10-31 18:11:20,882 INFO  [main] mapreduce.Job: Running job: job_ssm_0703
2014-10-31 18:14:07,940 INFO  [main] mapreduce.Job: Job job_ssm_0703 running in uber mode : false
2014-10-31 18:14:07,944 INFO  [main] mapreduce.Job: map 0% reduce 0%
2014-10-31 18:14:14,950 INFO  [main] mapreduce.Job: map 100% reduce 100%
2014-10-31 18:14:14,951 INFO  [main] mapreduce.Job: Job job_ssm_0703 completed successfully
2014-10-31 18:14:16,181 INFO  [main] mapreduce.Job: Counters: 20
        Map-Reduce Framework
                Map input records=4
                Map output records=0
                Input split bytes=40
                GC time elapsed (ms)=67
        File System Counters
                GPFS: Number of bytes read=0
                GPFS: Number of bytes written=0
                GPFS: Number of large read operations=0
                GPFS: Number of read operations=0
                GPFS: Number of write operations=0
        HBase Counters
                BYTES_IN_REMOTE_RESULTS=128
                BYTES_IN_RESULTS=128
                MILLIS_BETWEEN_NEXTS=476
                NOT_SERVING_REGION_EXCEPTION=0
                NUM_SCANNER_RESTARTS=0
                REGIONS_SCANNED=1
                REMOTE_RPC_CALLS=3
                REMOTE_RPC_RETRIES=0
                RPC_CALLS=3
                RPC_RETRIES=0
        org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters
                ROWS=4

The ROWS counter in this output shows that the table contains 4 rows.
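For reference, a 4-row table like the testhbase1 used above could be created from the HBase shell before running RowCounter. A hedged sketch (the column family name cf, the row keys, and the values are assumptions, not taken from the example):

```shell
# Create a test table and insert 4 rows via the HBase shell
# (column family 'cf' and the row keys/values are illustrative).
$HBASE_HOME/bin/hbase shell <<'EOF'
create 'testhbase1', 'cf'
put 'testhbase1', 'row1', 'cf:a', 'value1'
put 'testhbase1', 'row2', 'cf:a', 'value2'
put 'testhbase1', 'row3', 'cf:a', 'value3'
put 'testhbase1', 'row4', 'cf:a', 'value4'
count 'testhbase1'
EOF
```

This requires a running HBase cluster, so it is shown only as a configuration-style fragment.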