Reduce I/O and accelerate SQL performance by orders of magnitude using data skipping.
In this post, we’re going to show you how data skipping can boost the performance of your queries with IBM Cloud SQL Query and reduce cost by skipping over up to 99% of your dataset. SQL Query is a serverless SQL service for querying structured data on IBM Cloud Object Storage (COS) with less specialized expertise and at a lower cost than traditional DBMSs. SQL Query uses Apache Spark SQL as its query engine and charges on a per-query basis, based on the amount of data scanned. Reducing the number of bytes scanned per query saves money as well as improving performance.
One way to reduce the amount of data scanned is to use best practices for data layout, such as using the Parquet format and adopting Hive style partitioning. Data skipping, a SQL Query feature, complements these best practices and provides significant additional cost savings and performance gains. In this blog post, we explain the notion of data skipping and provide examples of how to create, use, and delete data skipping indexes.
What is data skipping?
Our data skipping indexes apply to structured datasets in IBM Cloud Object Storage (COS), and they store summary metadata for each object in the dataset. For example, if the dataset has a temp column, the summary metadata could include the minimum and maximum temperatures in the object. This summary metadata, which is significantly smaller than the data itself, can then be indexed. SQL queries which apply a predicate on the temp column—for example, queries looking for temperatures >30C—can then benefit from the index by skipping over all objects whose metadata does not overlap with the predicate (e.g., we can skip over all objects whose maximum temp is lower than 30C since their data cannot possibly satisfy the query).
Our sample Parquet dataset originates from the meter_gen data generator that was developed by Gridpocket in the context of the IOStack project. It generates electricity, water, and gas meter readings along with their associated timestamps, geospatial locations, and additional information. This dataset is 18 GB and is publicly available for use with SQL Query at cos://us-geo/sql/metergen. The queries in this blog are available as SQL Query samples under the “Data Skipping” heading. (See below for instructions to sign up for the data skipping feature.)
An example row in json format is shown here:
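The row below is illustrative: the field names follow the schema discussed in this post (vid, date, index, temp, city, lat, lng), but the specific values are made up for illustration.

```json
{
  "vid": "METER000001",
  "date": "2017-09-16T11:30:00.000Z",
  "index": 1248.56,
  "temp": 28.2,
  "city": "Vidauban",
  "lat": 43.427,
  "lng": 6.433
}
```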
Our sample dataset contains data for 10 days, so we use Hive-style partitioning and partition according to date. For this purpose, we added a derived column called “dt” to the meter_gen data which contains the calendar date portion of the date timestamp. This means that for the row above we add a column "dt":"2017-09-16". Additional schema information can be found here.
Depending on the data properties and the query workload, it is often advisable to pay attention to the data layout (i.e., the way records are assigned to objects). Since our query workload is geospatial in nature, we laid out the data according to the lat and lng columns so that each object contains geographically close records. If the data cannot be directly ingested in this way, this requires reading the dataset and writing a new version of it. We did this for each of the 10 days, using an internal data layout tool.
Note that all the SQL Query samples are located at cos://us-geo/sql/, which is a cross-region COS bucket. If cross-region resiliency is not needed, using a regional bucket is recommended for analytics use cases.
Assigning a base location for data skipping indexes
Data skipping indexes are stored in Cloud Object Storage (COS). Before creating indexes, define the base location in COS to store them by using the following command:
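The following is an illustrative sketch only: the METAINDEX keyword and exact statement form should be verified against the SQL Query documentation, and the bucket path is a placeholder.

```sql
-- Placeholder bucket/path: use a COS location where you have write access.
-- (Illustrative syntax; consult the SQL Query reference for the exact form.)
ALTER METAINDEX SET LOCATION cos://us-south/mybucket/metaindex/
```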
To avoid schema conflicts, metadata should not be stored under the path where the data itself is stored. Make sure you use a path where you have write access if you need to create, refresh, or delete indexes.
Creating data skipping indexes
When creating a data skipping index on a dataset, you need to decide which columns to index and choose an index type for each column. These choices are workload and data dependent. In general, you should index the columns which are queried most often in the WHERE clause. Currently, three index types are supported:

- MinMax: stores the minimum and maximum values of the column per object
- ValueList: stores the list of distinct values of the column per object
- BloomFilter: stores a bloom filter of the column’s values per object
Note that value list should be chosen for a column when the number of distinct values for that column per object is typically much smaller than the total number of values for that column per object (otherwise the index could be as big as that column of the dataset). Recall that we laid out the sample data according to geospatial location. Since city is correlated with geospatial location, we expect a relatively small number of distinct city values per object. When the number of distinct column values per object is high, use a bloom filter index instead.
We now create a data skipping index on the metergen dataset using the following syntax:
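A sketch of the CREATE statement, assuming a METAINDEX-style syntax (the columns and index types match the discussion above; verify the exact keywords against the SQL Query reference):

```sql
-- Illustrative: min/max indexes on temp, lat, and lng, a value list on city,
-- and a bloom filter on vid, built over the root of the dataset.
CREATE METAINDEX
  MINMAX FOR temp,
  MINMAX FOR lat,
  MINMAX FOR lng,
  VALUELIST FOR city,
  BLOOMFILTER FOR vid
ON cos://us-geo/sql/metergen STORED AS PARQUET
```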
The COS URL should specify the top level (i.e., root) of the dataset. A unique index is built per dataset (identified by its root path) and per SQL Query account, so two users with different SQL Query accounts create independent indexes on the same dataset.
Note that indexes can be shared across SQL Query accounts. Users who have read access to the base location for an index can use it by setting their base location appropriately. However, care should be taken to avoid multiple users writing indexes for the same dataset to the same base location. Alternatively, users can avoid sharing indexes by using different base locations.
Describing an index
Use the DESCRIBE operation to retrieve data skipping index statistics and metadata. For example:
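An illustrative sketch, following the same assumed METAINDEX syntax as above:

```sql
-- Illustrative: describe the index over the same dataset root used at creation.
DESCRIBE METAINDEX ON cos://us-geo/sql/metergen STORED AS PARQUET
```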
The result includes how many objects were indexed, whether the index is up to date, the base location for the indexes, and the index types that were generated.
Using data skipping indexes
Once a data skipping index has been created, it is automatically used by SQL Query when running queries. We will demonstrate a query benefiting from each one of the index types mentioned above, and then show a geospatial query. The times shown below represent a single invocation of each query and will likely vary.
Our first query retrieves electricity meters and associated information when the temperature is over 30C, ordered from highest to lowest. This query makes use of the min/max index on the temp column:
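A sketch of such a query; the `type` column used to select electricity meters is an assumption about the schema, not confirmed by this post:

```sql
-- Query 1 (illustrative): electricity meters with temperature above 30C,
-- ordered from highest to lowest. Skipping uses the min/max index on temp.
SELECT vid, city, temp
FROM cos://us-geo/sql/metergen STORED AS PARQUET
WHERE temp > 30 AND type = 'electricity'
ORDER BY temp DESC
```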
The result set contains 28 rows, covering three cities—Vidauban, Roquebrune-sur-Argens, and Le Muy—and reports the highest recorded temperature as 30.08. METER873676, located in Vidauban, is one of the three electricity meters with this record temperature.
By clicking on Query details, one can see various data skipping statistics, such as the number and percentage of objects skipped.
For formats which don’t support min-max data skipping—such as CSV, JSON, and Avro—data skipping will typically give a significant performance boost and cost reduction. Even though Parquet supports min-max data skipping, our centralized metadata approach boosts performance for large datasets. A follow-up performance blog will provide explanations and apply data skipping to larger datasets.
We now retrieve the minimum, maximum and average electricity usage in the city of Vidauban. This query makes use of the valuelist index on the city column. Since the index column contains meter readings, we take the difference between two readings to calculate electricity usage:
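A sketch of this query; it assumes the cumulative meter reading lives in a column named `index` (backtick-quoted because it is a reserved word) and that a `type` column identifies electricity meters:

```sql
-- Query 2 (illustrative): usage per meter is the difference between its
-- highest and lowest cumulative reading. Skipping uses the value list on city.
SELECT MIN(usage) AS min_usage,
       MAX(usage) AS max_usage,
       AVG(usage) AS avg_usage
FROM (
  SELECT vid, MAX(`index`) - MIN(`index`) AS usage
  FROM cos://us-geo/sql/metergen STORED AS PARQUET
  WHERE city = 'Vidauban' AND type = 'electricity'
  GROUP BY vid
) AS per_meter
```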
This query returns an average usage of 14.571 kWh, although there is a wide range of usage across Vidauban.
For this query, significantly less data is scanned when data skipping is used (50.6 MB instead of 6.3 GB), resulting in cost reductions. The performance improvement grows as the dataset size increases. When using Parquet, only relevant columns need to be read, which also reduces the data scanned. This is not the case for CSV and JSON formats, where the entire dataset would be scanned for this query unless data skipping is used (see also the section on “Other Formats” below).
We now retrieve more information about the first three meters returned by Query 1. This query benefits from the bloom filter index on the vid column:
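A sketch of this query; only one of the three meter IDs is named in this post, so the other two appear as placeholders:

```sql
-- Query 3 (illustrative): details for specific meters by vid.
-- Skipping uses the bloom filter index on vid. The second and third
-- vids are placeholders for the other meters returned by Query 1.
SELECT *
FROM cos://us-geo/sql/metergen STORED AS PARQUET
WHERE vid IN ('METER873676', '<second-vid>', '<third-vid>')
```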
As for Query 2, significantly less data is scanned when data skipping is used (390.58 MB instead of 7.6 GB), resulting in cost reductions, and the performance improvement grows as the dataset size increases.
We would now like to compare the electricity usage of METER873676 with its neighbors within a radius of 1km. This query benefits from the min-max indexes on the lat and lng columns. The query uses functions from the IBM geospatial toolkit, integrated into SQL Query, where the ST_Distance function returns the distance between two geospatial points.
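A sketch of the geospatial query; lat0/lng0 stand in for METER873676’s coordinates (not given in this post), and the use of ST_Point to build the two points is an assumption about the toolkit’s API:

```sql
-- Query 4 (illustrative): usage of meters within 1 km of METER873676.
-- Skipping uses the min/max indexes on lat and lng.
SELECT vid, MAX(`index`) - MIN(`index`) AS usage
FROM cos://us-geo/sql/metergen STORED AS PARQUET
WHERE type = 'electricity'
  AND ST_Distance(ST_Point(lng, lat), ST_Point(lng0, lat0)) < 1000.0
GROUP BY vid
ORDER BY usage DESC
```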
The result set contains 28 meters and their usage. We see that meter METER873676 is ranked in fifth place compared with its neighbors—its usage is 2.904 kWh, which is much lower than the average for Vidauban that we saw in Query 2.
Data skipping provides significant performance improvements and some cost reduction for this query. The performance boost increases in proportion to the dataset size. We will follow up with a performance blog to discuss this in more detail.
We integrated IBM’s geospatial toolkit with data skipping so that many of the library’s functions benefit—ST_Distance is just one example of the functions supported by data skipping.
More complex queries
Query predicates like the ones shown above can be combined using AND, OR, and NOT to create more complex queries. Data skipping can also provide benefit to queries with JOINs as long as query predicates can be pushed down by Spark SQL to the base tables.
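An illustrative sketch of combining predicates on indexed columns (column names as in the queries above, with the same schema assumptions):

```sql
-- Illustrative: predicates on indexed columns combined with AND/OR.
-- An object is skipped when its metadata rules out the whole predicate.
SELECT vid, city, temp
FROM cos://us-geo/sql/metergen STORED AS PARQUET
WHERE (temp > 30 AND city = 'Vidauban') OR vid = 'METER873676'
```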
Using other formats
Data skipping can be used with all of the formats supported by SQL Query. Note that for CSV and JSON formats, the entire dataset must be scanned as a first step to infer the schema, prior to running any query; to avoid this, use a Hive Metastore. Note also that, unlike Parquet and ORC, these formats have no built-in data skipping capabilities, so they can potentially benefit even more from data skipping when comparing data scanned.
Refreshing data skipping indexes
If data is added to a dataset after a data skipping index is created, the new data will not be skipped during queries. Once the amount of new data becomes significant, it is advisable to refresh the index incrementally as follows:
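An illustrative sketch, following the same assumed METAINDEX syntax as above:

```sql
-- Illustrative incremental refresh: only newly added objects are indexed.
REFRESH METAINDEX ON cos://us-geo/sql/metergen STORED AS PARQUET
```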
Deleting data skipping indexes
A data skipping index can be deleted as follows:
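An illustrative sketch, again assuming the METAINDEX-style syntax used above:

```sql
-- Illustrative: remove the data skipping index for this dataset.
DROP METAINDEX ON cos://us-geo/sql/metergen STORED AS PARQUET
```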