Apache Hive

Apache Hive is a data warehouse system that summarizes data, facilitates ad-hoc queries, and analyzes large data sets stored in Hadoop-compatible file systems. It provides a mechanism to project structure on this data and to query data using a SQL-like language called HiveQL. This language also allows you to plug in custom mappers and reducers when it is inconvenient or inefficient to express logic in HiveQL.

Before you begin

Ensure that the MapReduce framework in IBM® Spectrum Symphony is set to use Hadoop or Cloudera's Distribution including Hadoop (CDH). For the supported versions, see Supported distributed file systems for MapReduce or YARN integration. To run Hive with Cloudera, download the required packages from the Cloudera website. For the versions of Hive that the MapReduce framework in IBM Spectrum Symphony has been qualified with, see Supported third-party applications for MapReduce.

About this task

Follow these steps to run Hive applications with IBM Spectrum Symphony.

Procedure

  1. Install Hive.
    1. Download Hive. The MapReduce framework in IBM Spectrum Symphony is qualified with Hive 0.13.1.
    2. Extract the Hive package.

      $ tar -zxvf hive-0.13.1.tar.gz -C /opt/

  2. Configure Hive.
    1. Set the environment variables for running Hive within the MapReduce framework in IBM Spectrum Symphony. Include HIVE_HOME, PATH, HADOOP_HOME, and HBASE_HOME.
      For example:
      export HIVE_HOME=/bi211/hive
      export PATH=$PATH:$HIVE_HOME/bin
      export HADOOP_HOME=/root/hadoop
      #export HBASE_HOME=/root/hbase
    2. Edit the $HIVE_HOME/bin/hive file: add the HADOOP=$PMR_BINDIR/mrsh line between lines 202 and 205, after the Hadoop version check and before the HBase detection block.
      For example:
      if [ "$hadoop_major_ver" -lt "1" -a  "$hadoop_minor_ver$hadoop_patch_ver" -lt "201" ]; then
          echo "Hive requires Hadoop 2.4.x (x >= 1)."
          echo "'hadoop version' returned:"
          echo `$HADOOP version`
          exit 6
      fi
      
      HADOOP=$PMR_BINDIR/mrsh       #* add this one line *#
      
      # HBase detection. Need bin/hbase and a conf dir for building classpath entries.
      # Start with BigTop defaults for HBASE_HOME and HBASE_CONF_DIR.
      HBASE_HOME=${HBASE_HOME:-"/usr/lib/hbase"}
      HBASE_CONF_DIR=${HBASE_CONF_DIR:-"/etc/hbase/conf"}
      if [[ ! -d $HBASE_CONF_DIR ]] ; then
        # not explicitly set, nor in BigTop location. Try looking in HBASE_HOME.
        HBASE_CONF_DIR="$HBASE_HOME/conf"
      fi
    3. In the hive-site.xml file in $HIVE_HOME/conf (create the file if it does not exist), add the hadoop.bin.path property to set the path to the mrsh utility in your installation.
      For example:
      <configuration>
        <property>
          <name>hadoop.bin.path</name>
          <value>$PMR_BINDIR/mrsh</value>
        </property>
      </configuration>
      
  3. Run Hive from the command line interface. For example, run the following commands to create a Hive table, load some data, and select certain records from it:

    hive

    Use USER_CLASSPATH instead of HADOOP_CLASSPATH to set the user's classpath.

    hive> create table pokes(foo INT, bar STRING);

    hive> show tables;

    hive> LOAD DATA LOCAL INPATH 'hive-0.7.1/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

    hive> SELECT a.foo FROM pokes a WHERE a.foo > 490;

    Hive cannot use subcommands that mrsh does not support (for example, the fs command used with Hadoop).
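
If Hive does not start as expected, it may help to confirm that both edits from step 2 are in place. The following is a minimal sketch, not part of Hive or IBM Spectrum Symphony: the function name check_hive_setup is hypothetical, and it assumes that the edited launcher is at bin/hive and the site configuration is at conf/hive-site.xml under the directory you pass in. It only looks for the expected text; it does not verify that mrsh itself exists or works.

```shell
#!/bin/sh
# Illustrative pre-flight check for the Hive configuration in step 2.
# check_hive_setup is a hypothetical helper, not a Hive or Symphony command.
# Usage: check_hive_setup <hive_home>
# Returns 0 if both edits are found, 1 otherwise.

check_hive_setup() {
    hive_home="$1"
    status=0

    # Step 2.2: the launcher script should assign HADOOP to the mrsh utility.
    # \$ matches a literal dollar sign, so the unexpanded variable name is searched for.
    if grep -q '^[[:space:]]*HADOOP=\$PMR_BINDIR/mrsh' "$hive_home/bin/hive" 2>/dev/null; then
        echo "OK: bin/hive sets HADOOP to \$PMR_BINDIR/mrsh"
    else
        echo "MISSING: HADOOP=\$PMR_BINDIR/mrsh line in $hive_home/bin/hive"
        status=1
    fi

    # Step 2.3: hive-site.xml should define the hadoop.bin.path property.
    # -F treats the pattern as a fixed string, so the dots are literal.
    if grep -qF '<name>hadoop.bin.path</name>' "$hive_home/conf/hive-site.xml" 2>/dev/null; then
        echo "OK: conf/hive-site.xml defines hadoop.bin.path"
    else
        echo "MISSING: hadoop.bin.path property in $hive_home/conf/hive-site.xml"
        status=1
    fi

    return "$status"
}
```

After you source or paste the function into your shell, a call such as check_hive_setup "$HIVE_HOME" prints one line per check and returns a nonzero status if either edit is missing.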