Generalized Insert and Update Operations for Hive
IraAgrawal 2700057H9K Visits (13483)
(Authors: Ira Agrawal, Harsha Hulageri, Abhishek Pratap Singh)
Introduction to Data Warehousing Tools
Data warehousing tools allow users to analyze information both historical and incremental to find meaningful patterns among data. The areas, where this will be helpful, include business reporting, trend analysis, data mining, and other related areas. It will help business users to slice and dice the data to gain more insight about the available information. It can also be used to understand the customer behavior and use that information for targeted marketing, personalization and context aware benefits for customer facing applications.
DW tools enable provision to extract data from various sources and
store them in a normalized unified format. The data that will be stored
and analyzed are often multi-dimensional and complex in nature.
Limitations of existing traditional tools
Though the existing DW tools are performing well and are flexible, they tend to suffer as the data grows to a size where it will be impossible to store in a single server. This is because these tools operate on the data that is stored in relational databases. They don’t scale well. Though there are several modern relational distributed databases that have tried to overcome this limitation by using massive parallel processing architecture along with sharding and partition capabilities, they still suffer from the point that they can’t scale horizontally. In addition, the databases that provide these distributed capabilities are expensive and have high maintenance cost involved.
Why NoSQL ?
This is the situation where NoSQL databases come into picture. These NoSQL databases are considered as structured storage and traditional RDBMS becomes a subset of it. These databases generally store the data as key-value pair, big table, document store, graphs etc. They are open source, distributed and scale out unlike the relational databases. They scale horizontally and can handle huge amount of data by seamlessly taking advantage of the new nodes unlike the relational databases. They are designed in manner to consider the low-cost hardware in mind and to give high scalability, better performance, easy replication, greater optimization in data querying and insertions. However, these databases do come with few tradeoffs. They are less flexible, often have limited querying capabilities and as the name suggests does not have any query language and the database does not have any fixed table schema.
Comparison of NoSQL databases as DW solution
Considering the massive scalability feature and cost savings
involved by using open source NoSQL solution instead of proprietary
extremely expensive data warehouse solutions, organizations often
accept the limiting factors of NoSQL solutions. With the help of NoSQL
solution, an organization can scale its data warehousing infrastructure
with affordable cost.
While migrating to NoSQL solution from traditional relational
databases, the main concern of users and analysts would be losing the
SQL querying capabilities and the need of using code based map-reduce
jobs. However with the Hadoop based framework such as Hive, it is now
lot easier to migrate and leverage power of Hadoop ecosystem for data
Hive as DW solution
Hive is a distributed data warehouse engine built on top of Hadoop. It uses HDFS to store data and provides SQL like constructs for data queries. Hive Query Language (HQL) a SQL like query language for Hive enables users familiar with SQL to query the data easily.
The queries fired on Hive are translated to a series of map-reduce
jobs which are executed on a Hadoop cluster. It offers high flexibility
by allowing custom mappers/reducers to be plugged into the framework.
This extensibility of plugging custom map-reduce scripts allows hive to
perform code based custom analysis for more complex requirements. Hive
comes with command line tools, thrift server interface as well as JDBC
interface. User can write queries directly in Hive CLI and execute
them. Hive also supports partitioning of tables. By default a table
corresponds to a directory in HDFS and partitions correspond to the
sub-directories. It provides bucketing feature in which the data can be
bucketed into files. The schema is stored in a meta-data store. Hive
supports joins which are typically avoided in other NoSQL solutions.
It allows querying data from various source formats such as text files, binary files or sequence files. In our approach, our primary usage of Hive would be for data summarization and ad hoc analysis which are key features of any data warehousing tool.
In this paper, we will explain in detail as how hive can be used as data warehousing framework for analyzing huge volume of data in a reasonable amount of time. We will see how using hive we read the data from a file and load it into the hive table for further slicing and dicing to get business specific useful information. This paper also explains about the implementation of update feature which is presently not available in Hive framework.
To run any query over Hive we need to establish a connection between
thrift server and a thrift client. The below given code does the same.
It makes a client which has a connection with the hive server and
returns the client to be used for executing the queries over the hive
Client getClient(String hiveServer, Integer hivePort)
Now to explain the insert and the update feature we will refer to a table details model mentioned below which we are going to use across this whole article to explain the operations.
For e.g. we have a table EmployeeData: empId, empName, empCity.
data file input.txt present at /hom
We take few objects which hold the table information to be used for the querying purpose:
The Insert Feature
The incoming data from an input data file needs to be loaded in the already existing tables. The insert feature is simple compared to the update functionality. It appends new records to the existing tables.
Insert can happen into un-partitioned or partitioned table depending upon the requirement.
a. Insert into Un-Partitioned table:
Inserting into un-partitioned table is pretty straightforward. The
following is a code to insert into un-partitioned table in which you
take the input data file path and load it into the un-partitioned
table. The records in the file will get appended to the already
existing records in the table.
insert into un-partitioned table(tableName,
The query will look like:
Once the query is executed, the three records present in the
input.txt file get appended to the EmployeeData table.
b. Insert into Partitioned table:
Unlike un-partitioned table insertion, inserting into partitioned
table involves multiple steps. The below given information is used
while inserting the data in a partitioned main table:
TableFields will hold:
fieldname: empId, fieldType: int
Below is the algorithm to insert the data in partitioned main table:
Algorithm-1: Insert into Partitioned
Step-2: Insert into partitioned staging
table(call as stagingTable2)
stagingTable2, mainTable, table partitions, folder names list
The Update Feature
The update feature is complex as it involves modifying of existing
data instead of simple append. Like insert operation, the first step in
update is to load the data into an un-partitioned staging table from
local data source.
a. Update into Un-Partitioned table:
For update, we use another staging table where the latest records are merged with existing table records using left outer join. The following is the algorithm for the update operation on un-partitioned table.
Algorithm-2: Update into un-partitioned table
Step-2: Load the data by overwriting from
stagingTable3 to mainTable by
b. Update into Partitioned table:
Update operation in partitioned table works bit different than
un-partitioned table. In order to perform update, as first step, latest
data that is present in un-partitioned staging (stagingTable1) table is
put into a temporary partitioned table (stagingTable2). This step uses
the ‘insert into partitioned table’ logic, mentioned earlier, to insert
latest data into a temporary partitioned table.
The following is algorithm for update operation in partitioned table:
Update into partitioned table
- UNION ALL
Once the merged table has updated data of both temp table and
destination table, overwrite destination table with the merged table