Build a data warehouse with Hive

Data warehousing on a budget

The data warehouse has been an ongoing battle within organizations for years. How do you build it? What data can you integrate? Should you follow Kimball or Inmon, the corporate information factory (CIF), or data marts? The list could go on for days, even decades. With big data, the questions get harder still: Is a data warehouse even enough? The answer depends on the enterprise. People claim that Hive is the data warehouse of Hadoop. That is true on one level, but it's also something of a false claim. Sometimes, however, you have to use the tools available to you, and for that, Hive can be a data warehouse.

Peter J. Jamack, Big Data Analytics Consultant, Peter J Jamack

Peter J Jamack is a big data analytics consultant who has more than 13 years of business intelligence, data warehousing, analytics, big data, and information management experience. He has integrated structured and unstructured data into innovative integrated analytic solutions, working with various big data and MPP platforms to deliver large-scale, integrated analytics platforms for clients in such industries as insurance, government, media, finance, retail, social media, marketing, and software. You can reach Peter at info@peterjamack.com.



25 June 2013


Three objects approach an enterprise. The first — the data warehouse — is massive: It brings history and experience, and it talks a good game. And most of it is true. But it's also bloated in many ways, expensive in a lot of other ways, and people are tired of the cost for varied results. Apache Hadoop walks into the same building and throws out claims of taking over the market. It preaches big data, velocity, volume, variety, and a bunch of other v words that don't mean much outside of a marketing plan. It throws out analytics, predictions, and much more. And it's cheap. People stop and listen.

Apache Hive steps outside of the box, but it does not attempt to beat the other objects. It wants to work with Hadoop, but unlike Hadoop, it doesn't want to throw the data warehouse to the curb. Hive has data warehouse capabilities, but with business intelligence (BI) and analytic limitations. It has database possibilities, but with relational database management system (RDBMS) and Structured Query Language (SQL) limitations. It is more open and honest. It relates to the data warehouse. It relates to the RDBMS. But it never comes out and claims that it's more than meets the eye. Hadoop interrupts and proclaims that Hive is the data warehouse for the Hadoop world. Hadoop seems to have sent its best marketing and public relations rep, and what began as a simple conversation turns into Hive and Hadoop saving the world. It's intriguing. It's interesting. But is it really true? Sort of.

Data warehouses

Building a true data warehouse can be a massive project. There are different appliances, methodologies, and theories. What is the lowest common value? What are the facts, and what subjects relate back to those facts? And how do you mix, match, merge, and integrate systems that might have been around for decades with systems that came to fruition only a few months ago? And that was before big data and Hadoop. Add unstructured data, NoSQL, and Hadoop to the mix, and suddenly you have a massive data-integration project on your hands.

The simplest way to describe a data warehouse is to realize that it comes down to star schemas, facts, and dimensions. How you go about creating those elements is really up to you, whether through staging databases; on-the-fly extract, transform, load (ETL) processes; or integrating secondary indices. Certainly, you can build a data warehouse with star schemas, facts, and dimensions using Hive as the core technology, but it won't be easy. Outside the Hadoop world, it becomes an even bigger challenge. Hive is far more an integration, transformation, and quick-lookup tool than a legitimate data warehouse. The schema might say data warehouse, but the usefulness doesn't even say RDBMS. So, why use it?

What is a star schema?

Imagine a star: a center and several "hands" that point in different directions. The center is the heartbeat, or fact table. The hands all point to different dimensions. Many data warehouses have one fact table and multiple dimensions.

A fact table contains any data you can weigh or calculate on. In this example, you have baseball statistics such as runs, home runs, and batting average. You can calculate, add to, subtract from, or multiply against these columns.

Dimensions are more subject-based. In this example, you have the player information dimension, time and date dimension, etc. You typically do not have calculated or weighted columns in dimensions.

In this example, the key that joins a dimensional table to a fact table is the playerID.
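To make that concrete, here is a minimal, hypothetical HiveQL sketch of the shape: a query that joins a player dimension to the fact table on playerID and sums a measure. The table and column names mirror the ones built later in this article, but the query itself is only an illustration.

SELECT d.nameFirst, d.nameLast, SUM(f.hr) AS career_hr
FROM fact_player_stats f
JOIN dim_players d ON (f.playerID = d.playerID)
GROUP BY d.nameFirst, d.nameLast;

The measure (hr) lives in the fact table, the descriptive attributes (the player's name) live in the dimension, and playerID is the key that ties them together.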

Simply put, sometimes you have to use the tools put before you.

Anyone who has been in IT for any amount of time can tell you that the right tool for a job isn't always available. Or, the right tool is available, but cost-cutting factors are in play. Sometimes corporate politics play a huge role. Whatever the reason, most of us have been in situations where we are forced to build, design, and develop using a tool that might not be the best fit for the job.

I've been on numerous projects where we had to use Hive as a database, as a data warehouse, and as a slowly changing system. It was challenging; it was occasionally annoying. Sometimes, you had to just shake your head and wonder why. But at the end of the day, you still needed to make it work. And if a data warehouse had to be built and used in Hive, and you needed slowly changing dimensions and updates as well as reconciliation of old data, that's what needed to be done. It is not always about the best tool but about making the tool you have work the best.


Hive

Hive opens the big data Hadoop ecosystem to nonprogrammers because of its SQL-like capabilities and database-like functionality. It is often described as a data warehouse infrastructure built on top of Hadoop. That description is partially true: you can transform source data into a star schema, but creating a fact table and dimension tables is more about design than technology.

Still, Hive is not really a data warehouse. It's not really even a database. You can build and design a data warehouse with Hive, and you can build and design database tables with Hive, but certain limitations exist that require many workarounds and will pose challenges.

For example, indexing is limited in Hive. How do you overcome this issue? You can create an index in Hive by using the org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler index handler. Slowly changing dimensions aren't exactly possible in Hive, either, but if you build staging tables and use a certain number of joins (and you plan to add a new table, dropping the old one and keeping only the most recent, updated table for comparison), they become a possibility.
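Here is a rough HiveQL sketch of those two workarounds. The index name, key column, and staging table are placeholders of my own choosing: a compact index must be rebuilt explicitly before it holds anything, and the closest thing to an update is overwriting the dimension from a staging table that already holds the reconciled rows.

CREATE INDEX idx_player_stats
ON TABLE fact_player_stats (playerID)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- The index is empty until it is rebuilt
ALTER INDEX idx_player_stats ON fact_player_stats REBUILD;

-- Type-1-style "update": replace the dimension with the staged, reconciled rows
INSERT OVERWRITE TABLE dim_players
SELECT * FROM staging_players;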

External reporting or analytic systems connecting to Hive have been a huge problem. Even with JDBC connections, you are limited to connecting to the default database only. There has been a push for more and improved metadata, and tools like Apache HCatalog are helping to connect various services to the Hive metastore. In the future, this could be a great addition if utilized properly.

So, although Hive is not a hardcore data warehouse or database, ways exist in which you can use it as that data warehouse or database. It just takes effort and several workarounds to make Hive that system. Why would you go through all that? Because you have to use the tool you have available to you and make it work.


Example: Build a data warehouse for baseball information

InfoSphere BigInsights Quick Start Edition

InfoSphere BigInsights Quick Start Edition is a complimentary, downloadable version of InfoSphere BigInsights, IBM's Hadoop-based offering. Using Quick Start Edition, you can try out the features that IBM has built to extend the value of open source Hadoop, like Big SQL, text analytics, and BigSheets. Guided learning is available to make your experience as smooth as possible, including step-by-step, self-paced tutorials and videos to help you start putting Hadoop to work for you. With no time or data limit, you can experiment on your own time with large amounts of data. Watch the videos, follow the tutorials (PDF), and download BigInsights Quick Start Edition.

The following baseball data example shows how to design and build a data warehouse in Hive using baseball data from Sean Lahman's website. I liked the challenge of denormalizing that data and building a data warehouse from it. In "Build a data library with Hive," I created an IBM InfoSphere® BigInsights™ virtual machine (VM) using VMware Fusion on my Apple MacBook. This was a simple test, so my VM had 1 GB of RAM and 20 GB of solid-state disk storage space. The operating system was a CentOS 6.4 64-bit distro of Linux®.

To get started with this example, download the IBM InfoSphere BigInsights Basic Edition (see Resources). You will need to have an IBM Universal ID or register to get an ID before you can download InfoSphere BigInsights Basic Edition.

Import the data

Begin by downloading the ZIP file of CSV data that contains statistics about baseball and baseball players (see Download). From within Linux, create a directory, then run:

$ sudo mkdir /user/baseball
$ sudo wget http://seanlahman.com/files/database/lahman2012-csv.zip

The example contains four main tables (Master, Batting, Pitching, and Fielding), each with a unique playerID column, plus several secondary tables.

Design the data warehouse

This data is structured for a data library, but for a data warehouse, you have to figure out the facts and dimensions. The data warehouse design is simple: You denormalize this database and create a fact table based on player statistics. Then you create dimensions based on certain subject areas related to those statistics. Hive is not terribly good when it comes to joins, and MapReduce isn't much better, so having a denormalized star schema helps with certain queries.

The design consists of a fact table called fact_Player_Stats, which includes every statistical column found in the various CSV files and tables. You need data from the core tables (Batting, Pitching, and Fielding), as well as from some of the supplemental tables, which also contain statistics. Therefore, you must add the statistical columns from the following tables:

  • AllStarFull
  • HallOfFame
  • BattingPost
  • PitchingPost
  • FieldingOF
  • Salaries
  • AwardsPlayers
  • AwardsSharePlayers
  • Appearances
  • SchoolsPlayers

Some of the tables have only a few statistical columns. For example, from the FieldingOF table, you need to add only the stint, Glf, Gcf, and Grf columns to the fact_Player_Stats fact table. For the SchoolsPlayers table, take only the yearMin and yearMax columns. Take similar steps with the other tables. Only statistical columns are needed in the fact table.
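For example, the slice of those two tables that feeds the fact table looks roughly like the following sketch. The table names assume the data library built in the earlier article, and the join deliberately ignores year and stint for brevity:

SELECT f.playerID, f.stint, f.Glf, f.Gcf, f.Grf, s.yearMin, s.yearMax
FROM baseball.fieldingof f
JOIN baseball.schoolsplayers s ON (f.playerID = s.playerID);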

Note: You will not be using any data from the tables Managers, Teams, TeamsHalf, SeriesPost, etc.

The only keys in the fact_Player_Stats fact table are playerID, FranchID, yearID, and SchoolID. For the dimension tables, you must take out the statistics (if any exist) and keep only the subject-related columns:

  • The dim_Players dimensional table takes the data (player names, date of birth, biographical information, etc.) from the Master table. The primary key is playerID.
  • The dim_TeamFranchise dimensional table takes all the data from the TeamFranchise table. The primary key is FranchID.
  • The dim_Schools dimensional table takes all data from the Schools table.
  • The dim_Year dimensional table is a time dimension based on months and years (1871-2012); one simple way to seed it is sketched after this list.
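One simple way to seed the year portion of dim_Year is to pull the distinct years from a base table you already have. This is only a starting point; the month breakdown and any years missing from the 1871-2012 range would still need to be filled in:

CREATE TABLE baseball_stats.dim_year AS
SELECT DISTINCT yearID
FROM baseball.batting;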

Use the data library for the data warehouse

If you haven't already created the baseball data library, I recommend doing so now, then deriving the data warehouse from those base tables. You could write complicated scripts to grab certain columns from a flat file and reuse that same flat file for another table, but for this article, I chose to use the data library created previously in "Build a data library with Hive."
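If you need a refresher, a single base table from that library looks roughly like the following. This is a sketch only: the column list is abbreviated for illustration (a real load needs every column of the CSV, in order), and the load path is whichever directory you unzipped the archive into.

CREATE DATABASE IF NOT EXISTS baseball;

-- Column list abbreviated; see "Build a data library with Hive" for the full definition
CREATE TABLE IF NOT EXISTS baseball.batting (
  playerID STRING, yearID INT, stint INT, teamID STRING, lgID STRING,
  g INT, ab INT, r INT, h INT, hr INT, rbi INT, sb INT, bb INT, so INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load the CSV extracted from lahman2012-csv.zip (path is illustrative)
LOAD DATA LOCAL INPATH '/user/baseball/Batting.csv'
OVERWRITE INTO TABLE baseball.batting;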

Build the data warehouse with Hive

With the data analysis and design complete, it's time to build the data warehouse based on your star schema design. In the Hive shell, create the baseball_stats database, create the tables, load the tables, and verify that the tables are correct. (This process is provided in "Build a data library with Hive.") Next, create the data warehouse fact table. Listing 1 shows the code.

Listing 1. Create the data warehouse fact table
$ hive

Create Database baseball_stats;

Create table baseball_stats.fact_player_stats as
SELECT B.playerID, FranchID, yearID, SchoolID, stint, g,
g_batting, ab, r, h, 2b, 3b, hr, rbi, sb,
cs, bb, so, ibb, hbp, sh, sf, gidp, w,
l, pg, gs, cg, sho, sv, ipouts, ph, er,
phr, pbb, pso, baopp, era, pibb, wp, phbp,
bk, bfp, gf, pr, psh, psf, pgidp, fg,
fgs, innouts, po, a, e, dp, pb, fwp, fsb,
fcs, zr, gamenum, allstargp, ballots, needed, votes,
playoff_g, playoff_ab, playoff_r, playoff_h,
playoff_2b, playoff_3b, playoff_hr, playoff_rbi, playoff_sb,
playoff_cs, playoff_bb, playoff_so, playoff_ibb, playoff_hbp,
playoff_sh, playoff_sf, playoff_gidp, pitchplayoff_w,
pitchplayoff_l, pitchplayoff_g, pitchplayoff_gs, pitchplayoff_cg,
pitchplayoff_sho, pitchplayoff_sv, pitchplayoff_ipouts,
pitchplayoff_h, pitchplayoff_er, pitchplayoff_hr, pitchplayoff_bb,
pitchplayoff_so, pitchplayoff_baopp, pitchplayoff_era, pitchplayoff_ibb,
pitchplayoff_wp, pitchplayoff_hbp, pitchplayoff_bk, pitchplayoff_bfp,
pitchplayoff_gf, pitchplayoff_r, pitchplayoff_sh, pitchplayoff_sf,
pitchplayoff_gidp, glf, grf, gcf, salary, award,
fieldplayoffs_g, fieldplayoffs_gs, fieldplayoffs_innouts,
fieldplayoffs_po, fieldplayoffs_a, fieldplayoffs_e, fieldplayoffs_dp,
fieldplayoffs_tp, fieldplayoffs_pb, fieldplayoffs_sb,
fieldplayoffs_cs, appearances_g_all, appearances_gs,
appearances_g_batting, appearances_defense,
appearances_g_p, appearances_g_c, appearances_g_1b,
appearances_g_2b, appearances_g_3b, appearances_g_ss,
appearances_g_lf, appearances_g_cf, appearances_g_rf, appearances_dh,
appearances_ph, appearances_pr, yearMin, yearMax
from baseball.Batting B JOIN baseball.Pitching P ON B.playerID = P.playerID
JOIN baseball.Fielding F ON B.playerID = F.playerID
JOIN baseball.Teams T ON B.teamID = T.teamID JOIN baseball.TeamFranchises TF ON
T.franchID = TF.franchID ...;
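A quick way to confirm that the CREATE TABLE ... AS SELECT statement produced the columns you expect is to describe the new table before moving on:

Describe baseball_stats.fact_player_stats;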

Now, create the data warehouse dimension tables. Listing 2 shows the code.

Listing 2. Create the data warehouse dimension tables
$ hive

Create table baseball_stats.dim_Players AS
         SELECT lahmanID, playerID, managerID, hofID, birthYear,
           birthMonth, birthDay, birthCountry, birthState,
           birthCity, deathYear, deathMonth, deathDay,
           deathCountry, deathState, deathCity,
           nameFirst, nameLast, nameNote, nameGiven,
           nameNick, weight, height, bats,
           throws, debut, finalGame,
           college, lahman40ID, lahman45ID, retroID,
           holtzID, hbrefID
           FROM baseball.master ....;

Run a query

Let's run a few queries to make sure the data looks right. First, select all the data from the fact table (limit it to the first 10 rows). You can run a couple of other queries to ensure that the dimensional tables look right, that they connect to the fact table, and so on. You also can run a count on the fact table to make sure the total rows line up. Of course, you would have to correlate that back to the original base tables and add them up. Listing 3 shows the code to test whether data exists and is correct and whether dimensions connect to the fact table.

Listing 3. Test to see whether data exists and is correct and whether dimensions connect to the fact table
$ hive
        Use baseball_stats;
        Select * from fact_player_stats limit 10;

        Select A.playerID, A.nameFirst, A.nameLast, B.FranchID, B.AB, B.R, B.H,
        B.2B, B.3B, B.HR, B.RBI
        FROM dim_players A JOIN fact_player_stats B ON A.playerID = B.playerID;

        Select count(*) from fact_player_stats;

        Select count(*) from dim_players;

        Select max(r) from fact_player_stats where playerid=1234;

If you want to do a more thorough validation of the data in the Hive data warehouse, you can use a minimum, maximum, or average on certain columns or all of the columns and compare the results to the original base tables. You would be looking for exact matches.
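For example, a spot check on a single column compares the same aggregate in the warehouse and in the base table it came from. The column and table names here assume the schema built above:

Select max(hr) from baseball_stats.fact_player_stats;
Select max(hr) from baseball.batting;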


Conclusion

Obviously, a lot of design work goes into creating a simple star schema. You can go back and create a fact table based on teams, for example. The benefit of this data warehouse schema is that you won't have to join a lot of tables. And for this example, there are few updates besides yearly stats, and, in that case, overwriting the data warehouse tables or adding another year and recalculating shouldn't be a problem.

Hive certainly has its limitations, but if you're working on a budget or the tools have been mandated from on high, with a bit of work, Hive can give you the data warehouse you need.


Download

Description: Sample CSV file
Name: lahman2012-csv.zip
Size: 11MB

Resources
