The benefits of using object storage to store relational data.
There are many different classes of data—voice recordings, text, pictures, and tabular, among others. Tabular data, which is also known as rectangular or structured data, is of critical importance to businesses, but also to individuals. This class of data is pervasive. It is the records the bank keeps of our transactions, a history of purchases in a store, the grades a teacher gave to a class, or the rate of transmission of COVID-19 by location. In all of these cases, we can have a table of records that defines a relationship.
For the example of a teacher’s grades, the table may have columns for the student’s name, the grade on the first exam, the grade on the second exam, and the final grade for the class. Each record in the table defines a relation among a set of values, where the value of each column has a particular type and each row is an n-tuple with a value for each column. For instance, the grades may be whole numbers and the name a string. There is no inherent ordering to the rows in the table, but one can query them based upon various properties. Returning to the example of grades, the teacher might want to find all of the students whose final grade was above 90. Structured Query Language (SQL) is the way these tables are typically queried.
The relational model
This model of organizing data that has relations was first introduced by Edgar Codd, from IBM, fifty years ago. Initially, it was realized in integrated databases that were responsible for everything from the management of the storage up through the processing of the queries. Such a top-to-bottom integrated system provides the maximum opportunities for performance optimization but limits the ability to share data between different tools.
Over the past decade or so—first with open source frameworks such as Map Reduce, then with Spark, and now with more traditional databases, such as IBM Db2 Event Store—engines that are supporting the processing of relational tables are delegating the storage and management of the data to an object store, typically supporting the S3 API. These engines use a range of formats to organize the tabular data, such as Parquet, ORC, or CSV, and then support the same SQL queries that were supported by the original integrated databases.
Object storage and relational data
This use of object storage for storing the relational data brings several benefits (although, at the price of providing less opportunities for optimization):
- It reduces the cost of developing the overall system by separation of concerns; the query engine no longer needs to address storing the data.
- It maps naturally to cloud deployments, as all clouds have highly scalable cost-efficient object stores.
- It enables much higher scalability in data size, as object stores are essentially infinitely scalable.
- If an open format is used, it allows having different engines and tools to process the same data, allowing you to use the right tool for the job.
IBM has a range of such tools that help data professionals wrangle data and fully harness the power of ANSI SQL over this effectively limitless, yet economical, storage medium. On the IBM Cloud, SQL Query provides serverless execution of SQL for rectangular data stored in IBM Cloud Object Storage. SQL Query is a great tool for data exploration, preparation, transformation, and analytics.
For self-managed capacity hybrid or private cloud workloads, Db2 Big SQL provides a Db2-compatible SQL interface to Object Storage APIs and other data stores through the IBM Cloud Pak for Data on Red Hat OpenShift. And Spark, available with Cloud Pak for Data and via IBM Analytics Engine on the IBM Cloud, has built-in integration with object storage.
To get you started, we have provided free data lake architecture patterns for Big Data Modernization, Logging, and IoT use cases.