Data Skipping for IBM Cloud SQL Query

By: Paula Ta-Shma

Boost the performance of IBM Cloud SQL Query with data skipping

In this post, we’re going to show you how data skipping can boost the performance of your queries with IBM Cloud SQL Query and reduce cost at the same time. SQL Query is a serverless SQL service which allows for the querying structured data on IBM Cloud Object Storage (COS) with less specialized expertise and at a lower cost than traditional DBMSs. SQL Query uses Apache Spark SQL as its query engine and charges on a per-query basis, based on the amount of data scanned. Reducing the number of bytes scanned per query saves money as well as improving performance.

One way to reduce the amount of data scanned is to use best practices for data layout, such as using the Parquet format and adopting Hive style partitioning. Data skipping, a beta SQL Query feature announced at Think 2019, complements these best practices and provides significant additional cost savings and performance gains. In this blog post, we explain the notion of data skipping and provide examples of how to create, use, and delete data skipping indexes.

What is data skipping?

Our data skipping indexes apply to structured datasets in IBM Cloud Object Storage (COS), and they store summary metadata for each object in the dataset. For example, if the dataset has a temp column, the summary metadata could include the minimum and maximum temperatures in the object. This summary metadata, which is significantly smaller than the data itself, can then be indexed. SQL queries which apply a predicate on the temp column—for example, queries looking for temperatures >30C—can then benefit from the index by skipping over all objects whose metadata does not overlap with the predicate (e.g., we can skip over all objects whose maximum temp is lower than 30C since their data cannot possibly satisfy the query).

Sample data

Our sample CSV dataset originates from the meter_gen data generator that was developed by Gridpocket in the context of the IOStack project. It generates electricity, water, and gas meter readings along with their associated timestamps, geospatial locations, and additional information. This dataset is 18GB and is publicly available for use with SQL Query at cos://us-geo/sql/metergen. The queries in this blog are available as SQL Query samples under the “Data Skipping” heading. (See below for instructions to sign up for the data skipping feature.)

An example row in json format is shown here:

{"vid":"METER753724",
"date":"2017-09-16T21:00:00.000Z",
"index":8366.24,
"sumHC":1.22201,
"sumHP":7.144230000000001,
"type":"elec",
"size":70,
"temp":14.93,
"city":"Paris",
"region":75,
"lat":48.826252,
"lng":2.36259}

Our sample dataset contains data for 10 days, so we use Hive-style partitioning and partition according to date. For this purpose, we added a derived column called “dt” to the meter_gen data which contains the calendar date portion of the date timestamp. This means for the row above we add a column "dt":"2017-09-16". Additional schema information can be found here.

Depending on the data properties and the query workload, it is often advisable to pay attention to the data layout (i.e., the way records are assigned to objects). Since our query workload is geospatial in nature, we applied data layout according to the lat and lng columns so that each object contains geographically close records. If the data cannot be directly ingested in this way, this requires reading the dataset and writing a new version of it. We did this for each of the 10 days, using an internal data layout tool.

Note that all the SQL Query samples are located at cos://us-geo/sql/, which is a cross-region COS bucket. If cross-region resiliency is not needed, using a regional bucket is recommended for analytics uses cases.

Creating data skipping indexes

When creating a data skipping index on a dataset, one needs to decide which columns to index and choose an index type for each column. These choices are workload and data dependent. Typically, one should choose columns to which predicates are applied in many queries. Currently, three index types are supported:

  1. Min/max: Stores the minimum and maximum values for a column. Applies to numeric column types (IntegerTypeLongTypeDoubleType and FloatType).

  2. Value list: Stores the list of values appearing in a column. Applies to strings and numeric column types.

  3. Geospatial: Stores geospatial bounding boxes for a pair of numeric columns (assumed to be latitude and longitude).

Note that value list should be chosen when the number of distinct values in an object is typically much smaller than the total number of values in that object (otherwise the index could be as big as that column in the dataset). Recall that we laid out the sample data according to geospatial location. Since city is correlated with geospatial location, we expect a relatively small number of distinct city values per object.

We now create a data skipping index on the metergen dataset using the following syntax:

CREATE METAINDEX
MINMAX FOR temp,
GEOSPATIAL FOR lat, lng,
VALUELIST FOR city
ON cos://us-geo/sql/metergen STORED AS parquet

The COS url should specify the top level (i.e., root) of the dataset. A unique index is built per dataset (identified by its root path) and SQL Query account. Two users with different SQL Query accounts create independent indexes on the same dataset.

Creating data skipping indexes

Using data skipping indexes

Once a data skipping index has been created, it is automatically used by SQL Query when running queries. We will demonstrate a query benefiting from each one of the index types mentioned above. The times shown below represent a single invocation of the query and will likely vary.

Query 1

Our first query retrieves electricity meters and associated information when the temperature is over 30C, ordered from highest to lowest. This query makes use of the min/max index on the temp column:

select vid, type, temp, city, region, lat, lng
from cos://us-geo/sql/metergen stored as parquet
where type = 'elec' and temp > 30
order by temp desc

The result set contains 28 rows, covering three cities—Vidauban, Roquebrune-sur-Argens,  and Le Muy—and reports the highest recorded temperature as 30.08. METER873676, located in Vidauban, is one of the three electricity meters with this record temperature.

Query 1

By clicking on Query details, one can see various data skipping statistics, such as the number and percentage of objects skipped.

Query details

We also tried running this query without using a data skipping index. We found that with data skipping, this query runs faster, even though the reduction in data scanned (and therefore cost) is modest (53.79 MB instead of 59.38 MB), since Parquet supports min/max data skipping. We are writing a follow-up performance blog which will provide explanations and apply data skipping to larger datasets.

Query 2

We now retrieve the minimum, maximum and average electricity usage in the city of Vidauban. This query makes use of the valuelist index on the city column. Since the index column contains meter readings, we take the difference between two readings to calculate electricity usage:

select min(usage), max(usage), avg(usage) from
(select vid as meter_id, (max(index)-min(index)) as usage
from cos://us-geo/sql/metergen stored as parquet
where city='Vidauban' and type='elec'
group by vid)

This query returns an average usage of 14.571 kwh, although there seems to be a wide range of usage in Vidauban.

Query 2

For this query, significantly less data is scanned when data skipping is used (46.86 MB instead of 6.3 GB), resulting in both cost reductions and performance improvements. In both cases, note that only relevant columns of Parquet objects need to be read, which also reduces the data scanned. This is not the case for CSV and JSON formats, where the entire dataset would be scanned for this query unless data skipping is used (see also the section on “Other Formats” below).

Other Formats

Query 3

We would now like to compare the electricity usage of METER873676 with its neighbors within a radius of 1km. This query benefits from the geospatial index on the lat and lng columns. The query uses functions from the IBM geospatial toolkitintegrated into SQL Query, where the ST_Distance function returns the distance between two geospatial points.

Important note: In addition to reducing the data scanned, the geospatial index also reduces the number of invocations of these functions, which further improves performance:

select vid as meter_id, (max(index)-min(index)) as usage
from cos://us-geo/sql/metergen stored as parquet
where
ST_Distance(ST_Point(lng,lat),ST_WKTToSQL('POINT(6.433881 43.422323)')) < 1000.0
and type='elec'
and lat is not null
and lng is not null
group by vid
order by usage
Query 3

The result set contains 28 meters and their usage. We see that meter METER873676 is ranked in fifth place compared with its neighbors—its usage is 2.904 kwh, which is much lower than the average for Vidauban as we saw in Query 2.

which is much lower than the average for Vidauban as we saw in Query 2.

Data skipping provides significant performance improvements and cost reductions for this query. We ran the query without data skipping and found that 368 MB are scanned instead of 8.03 GB. We also found that data skipping is over five times faster. We will follow up with a performance blog to discuss this in more detail.

We integrated IBM’s geospatial toolkit with data skipping to provide benefit for many of the library’s functions—ST_Distance is just one example. The list supported by data skipping also includes

  • ST_Intersects

  • ST_Contains

  • ST_Equals

  • ST_Crosses

  • ST_Touches

  • ST_Within

  • ST_Overlaps

  • ST_EnvelopesIntersect

  • ST_IntersectsInterior

More complex queries

Query predicates like the ones shown above can be combined using AND, OR, and NOT to create more complex queries. Data skipping can also provide benefit to queries with JOINs as long as query predicates can be pushed down by Spark SQL to the base tables.

Using other formats

Data skipping can be used with all of the formats supported by SQL Query. Note that CSV and JSON formats require the entire dataset to be scanned as a first step in order to infer the schema, prior to running any query. Data skipping does not provide a workaround for this. Note also that unlike Parquet and ORC, these formats do not have built-in data skipping capabilities and can potentially benefit more from data skipping when comparing data scanned.

Refreshing data skipping indexes

If data is added to a dataset after a data skipping index is created, the new data will not be skipped during queries. Once the amount of new data becomes significant, it is advisable to refresh the index incrementally as follows:

REFRESH METAINDEX
ON cos://us-geo/sql/metergen_parquet STORED AS parquet

Deleting data skipping indexes

A data skipping index can be deleted as follows:

DROP METAINDEX
ON cos://us-geo/sql/metergen_parquet STORED AS parquet

Sign up for data skipping (beta)

If you are interested in trying out data skipping with SQL Query, please contact Chris Glew at cglew@us.ibm.com.

References

Be the first to hear about news, product updates, and innovation from IBM Cloud