September 9, 2019 By Jim Wankowski 4 min read

A look at IBM Db2 Big SQL, Apache Hive, and how they can help you start working with Hadoop.

The challenge

The idea of the traditional data center being centered on relational database technology is quickly evolving. Many data sources exist today that did not exist as little as five years ago. Active sensors on machinery, automobiles, and aircraft; medical sensors; RFID tags; and social media and web click-through activity generate tremendous volumes of mostly unstructured data that cannot practically be stored or analyzed in traditional relational database management systems (RDBMSs).

These new data sources are pushing companies to explore big data and the Hadoop architecture, which creates a new set of problems for corporate IT. Hadoop development and administration can be complicated and time-consuming. Writing the complex MapReduce programs needed to mine this data is a very specialized skill, so companies must invest in training their existing personnel or hire people who specialize in MapReduce programming and administration. This is the very reason many enterprises have been hesitant to invest in big data applications.

Leveraging existing SQL skills

A solution to this problem is to leverage existing SQL skills for analyzing Hadoop data. Apache Hive was the original answer: it provides an open source SQL interface to Hadoop, allowing a person with basic SQL skills to run analytics on Hadoop data using a SQL dialect called HiveQL, without the complexity of MapReduce.
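To illustrate, here is what a basic HiveQL aggregation might look like. The table and columns (web_clicks, user_id, click_date) are hypothetical, but the point is that the query is ordinary SQL, with no MapReduce code in sight:

```sql
-- hypothetical clickstream table: count clicks per user for one day
select user_id, count(*) as clicks
from web_clicks
where click_date = '2019-09-01'
group by user_id
order by clicks desc
limit 10;
```

Under the covers, Hive compiles a query like this into one or more MapReduce jobs that scan the underlying HDFS files.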

Apache Hive

So, what exactly is Apache Hive? Apache Hive is a data warehouse system for Hadoop. Hive is not a database; it is a metastore that holds the table structure definitions you supply when you create a Hive table. This repository is known as the HCatalog and is actually a relational database, typically MySQL, PostgreSQL, or Oracle.

It is important to understand that Hive does not provide OLTP-style capability for Hadoop queries. HiveQL is translated into MapReduce jobs under the covers, so Hive is best suited to long-running, batch-style queries given the performance limitations of that process.

Hive is essentially three things:

  1. A MapReduce execution engine
  2. A storage model
  3. A metastore

Hive tables can be partitioned to help improve performance.
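As a sketch, a partitioned Hive table might be declared like this (the table and partition column are hypothetical; note that in Hive the partitioning column is declared separately, not in the main column list):

```sql
-- hypothetical clickstream table, partitioned by day
create table clicks
(
  user_id  int,
  url      string
)
partitioned by (click_date string)
row format delimited
 fields terminated by '|'
stored as textfile;
```

Each distinct click_date value gets its own directory in HDFS, so a query that filters on click_date reads only the matching partitions instead of scanning the whole table.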

IBM Db2 Big SQL

IBM’s Db2 Big SQL takes Hive’s SQL capability to a higher level. Big SQL is based on Db2’s MPP (massively parallel processing) architecture and sits on top of Hive. It replaces MapReduce with the MPP engine, which is much faster and provides high concurrency, coming much closer to an OLTP experience.

For someone with a Db2 background, Big SQL makes Hadoop data easily accessible. Its SQL is fully ANSI-compliant, and the syntax for DDL and DML is nearly identical to native Db2.

Here is an example of a create table statement:

create hadoop table users
(
  id        int           not null primary key,
  office_id int           null,
  fname     varchar(30)   not null,
  lname     varchar(30)   not null)
row format delimited
 fields terminated by '|'
stored as textfile;

Notice the use of not null and primary key in the definitions. This syntax is unique to Big SQL. These constraints are not actually enforced in Hadoop, but because Big SQL is Db2 at its core, the information is fed to the optimizer, and queries undergo rewrite and optimization very similar to native Db2.

Defining referential integrity (RI) relationships on these tables lets the optimizer make smarter join-order decisions, just as in Db2. Once this DDL is executed, the metadata is stored in the Hive HCatalog just the same as for native Hive tables.
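For instance, an informational foreign key (one the optimizer can exploit but that is never checked) could be added with standard Db2 constraint syntax. This sketch assumes a hypothetical offices table with key id:

```sql
-- hypothetical: tell the optimizer that office_id references offices(id),
-- without asking Hadoop to enforce it
alter table users
  add constraint fk_users_office
  foreign key (office_id) references offices (id)
  not enforced;
```

Because the constraint is declared not enforced, loading data that violates it will not fail; the clause exists purely to inform query rewrite and optimization.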

So, now you may want to create a view. Here, again, the syntax is identical to Db2:

create view my_users as
select fname, lname from myschema.users where id > 100;  

Other features that should look familiar to you

  1. “Native Tables” with full transactional support on the head node:
    • Row-oriented, traditional Db2 tables
    • BLU columnar, in-memory tables (on head node only)
    • Materialized query tables
  2. GET SNAPSHOT/snapshot table functions.
  3. RUNSTATS command (Db2), equivalent to the ANALYZE command (Big SQL).
  4. Row and column security.
  5. Federation/fluid query.
  6. Views.
  7. SQL PL stored procedures and UDFs.
  8. Workload manager.
  9. System temporary table spaces to support sort overflows.
  10. User temporary table spaces for declared global temporary tables.
  11. HADR for head node.
  12. Oracle PL/SQL support.
  13. Declared global temporary tables.
  14. Time travel queries.
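As one concrete example from the list above, collecting statistics in Big SQL uses ANALYZE where native Db2 uses RUNSTATS. A sketch against the users table defined earlier:

```sql
-- roughly the Big SQL counterpart of RUNSTATS in Db2
analyze table myschema.users
  compute statistics for columns id, office_id, fname, lname;
```

As in Db2, up-to-date statistics are what let the cost-based optimizer choose good join orders and access plans.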

Get started with IBM Db2 Big SQL

Currently, there are myriad SQL engines for Hadoop available, and different engines solve different problems. It is likely that no single SQL engine will address all your modern data warehousing needs or use cases.

Depending on how your organization is planning on using Hadoop, you will most likely use a combination of SQL engines. For long running batch queries, you may want to use native Hive; for simple ad-hoc queries, you may use native Apache Spark SQL; and for complex BI type of queries, Big SQL fills the bill. 

Whether your company is starting to dabble with Hadoop or already runs full-blown production clusters, these SQL engines can help you leverage your Db2 skills to start working with Hadoop. Hopefully this short article helps kick-start your exploration of big data!

If you would like to get some hands-on experience with IBM Db2 Big SQL, please visit IBM’s new demo page: Db2 Big SQL. You can see demo videos, go through a click-through guided demo, and get actual hands-on experience by requesting a live Big SQL/HDP cluster in IBM Cloud. This is a great way to get some experience with Hadoop, Hive, and IBM Db2 Big SQL without having to download or install anything.
