Pig versus Hive: Benchmarking high level query languages

Results of benchmarking studies run on small clusters of nodes

This article presents the results of two benchmarking studies applied to Hive and Pig running on Hadoop 0.14.1. The first study replicated the Apache Pig benchmark (Apache Foundation, 11/07/2007). The second study applied the TPC-H benchmarks (TPC-H is a decision support benchmark published by the Transaction Processing Performance Council, an organization founded to define global database benchmarks). The two studies showed conflicting results.

Benjamin Jakobus, Software Engineer, IBM

Benjamin Jakobus graduated with a BSc in computer science from University College Cork in 2011, after which he cofounded an Irish startup. He returned to university one year later and graduated with an MSc in advanced computing from Imperial College London in 2013. Since graduation, he has worked as a software engineer in the IBM Software Group in Dublin, Ireland.



Peter McBrien, Dr., Senior Lecturer, Imperial College London, UK

Dr. Peter McBrien graduated with a BA in computer science from Cambridge University in 1986. After some time working at Racal and ICL, he joined the Department of Computing at Imperial College as an RA in 1989, working on the Tempora Esprit Project. He obtained his PhD, "Implementing Graph Rewriting By Graph Rewriting", in 1992, under the supervision of Chris Hankin. In 1994, he joined the Department of Computing at King's College London as a lecturer and returned to the Department of Computing at Imperial College in August 1999 as a lecturer. Since then, he has been promoted to Senior Lecturer.



27 May 2014

To determine whether Pig or Hive performs better in database benchmarks, two studies were conducted on small clusters of six and nine nodes:

  • Study 1: Replicated the Apache Pig benchmark (Apache Foundation, 11/07/2007)
  • Study 2: Applied TPC-H benchmarks

In Study 1, Pig seemed to outperform Hive on most operations. However, in Study 2, evidence suggested that Hive is significantly faster than Pig. The article analyzes the two benchmarks, describes the differences, and justifies the results.

The article assumes a basic knowledge of Hadoop and big data, and some experience working with benchmarking data.

See Resources for relevant links.


Download

Description | Name | Size
Full text of benchmarking article | pighivebenchmarking.pdf | 577KB

Resources

Learn

Get products and technologies

  • Download InfoSphere BigInsights Quick Start Edition, a free, downloadable non-production version of BigInsights that enables new solutions to cost-effectively turn large, complex volumes of data into insight by combining Apache Hadoop with unique, enterprise-ready technologies and capabilities from across IBM.
  • Download InfoSphere Streams Quick Start Edition, a free, downloadable, non-production version of InfoSphere Streams, a high-performance analytic platform that allows user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources.

Discuss

  • Get involved in the developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
