In this blog post, we demonstrate how using the SQL Query Catalog with data skipping indexes can accelerate performance by x14 and lower cost by 43%.
According to today’s best practices for cloud, both analytics and storage should be run as separate services with a serverless pay-as-you-go pricing model. Minimizing the I/O between these services is critical for both high performance and low cost. We showcase multiple techniques to achieve this for SQL analytics on data in object storage.
For this specific case study, we use IBM Cloud SQL Query—a serverless analytics service built using Apache Spark—with a focus on geospatial analytics using weather data from The Weather Company®, stored on IBM Cloud® Object Storage (COS). However, the techniques are valid for SQL analytics in general and are applicable to multiple domains.
The Weather Company dataset
The weather dataset contains a global 4K grid of hourly weather measurements, including 33 columns—such as latitude, longitude, temperature, and wind speed. The data was geospatially laid out using a KD-Tree based approach and stored in COS. We stored one month of weather data in 8192 Parquet objects using snappy compression with a total size of 191GB and used this to simulate weather data for a five-year period, resulting in over 10TB of data and 491,520 objects. Each month of data is stored as a Hive style partition. We also used the SQL Query Catalog, which is based on a Hive Metastore that stores metadata about partitions and can improve performance.
Within each partition, data is grouped into objects geospatially using latitude and longitude columns. For example, a simplified partial object listing for a single monthly partition is below, where each object covers a different geospatial area reflected by its Min/Max statistics for latitude and longitude. Min/Max statistics are stored within Parquet and ORC data formats and data skipping indexes store this and other types of metadata in a consolidated fashion that provides additional optimization.
An example query
We use SQL Query and its integrated geospatial functions to select weather data from the Raleigh Research Triangle for different time windows and store the results in COS. For example, the query below uses a four-month window:
Creating the data skipping index
The data skipping index was created on longitude and latitude columns using the SQL Query DDL statement below, resulting in four metadata objects with total size 13.5 MB. The indexing operation took 2,063.34s seconds and scanned 114 MB. Note that for other query types and applications, there are also value list, bloom filter, and geospatial indexes, where the latter allows indexing columns containing arbitrary geometries, such as polygons:
What we measured
We measured the time taken to run the queries on SQL Query and the amount of data read from COS (which indicates the query cost) and compared the results when using the Catalog and/or data skipping. We ran each query 10 times and took the median result.
Our results: Run time comparison
Overall, we achieved a maximal performance acceleration of x14 and an average acceleration of x9.6 by using both the Catalog and data skipping:
First, we must point out that the yellow curve (turning off the Catalog and data skipping support) is not a true baseline since it already has significant optimizations built in. In SQL Query, queries involving geospatial functions are automatically rewritten to add bounding box constraints on latitude and longitude columns, thereby benefiting from Parquet’s Min/Max statistics on those columns to skip data. This optimization is critical, since without it, queries would need to scan all data points in the given time window, reaching up to 10TB of data for the full five-year window, compared to only 7GB using the rewrite technique. Such an approach is clearly not feasible. Note that a rewriting technique is only effective for Spark-supported formats containing built in Min/Max statistics (i.e., Parquet and ORC only). For these reasons, such formats are a recommended best practice.
Even when using a rewriting technique for the yellow curve, the entire dataset (491,520 objects) needs to be listed irrespective of the time window so that Spark can infer its partitions. Once this is done, Spark can skip irrelevant partitions (months), but it still needs to read the Parquet footers for all objects in the relevant months. If we use the Catalog but keep data skipping turned off (the grey curve), the Catalog handles partition pruning and enables listing only relevant partitions. This explains why the grey curve has consistently lower run times than the yellow curve. However, we still need to read the Parquet footers for all objects in the relevant months. This is why both the yellow and grey curves take time proportional to the size of the time window.
Data skipping indexes contain metadata (e.g., Min/Max statistics) for all objects in a dataset in consolidated form. This enables reading the metadata using a small number of large reads, instead of a very large number of very small reads to access the Parquet footers. When using any object storage system, there is typically a minimum latency per read request and, therefore, the overall difference is significant. In addition, when using data skipping, objects are skipped before Spark splits the dataset into partitions, resulting in less partitions and better utilization of Spark resources.
Using data skipping together with the Catalog (the blue curve) benefits from the Catalog handling partition pruning (reduced listing work) and avoids reading Parquet footers. The curve is approximately flat because the reduced number of partitions required (as a result of data skipping) doesn’t fully utilize the underlying resources. Without the Catalog (the orange curve) listing the entire dataset is required (as for the yellow curve) which adds a significant amount of work irrespective of the size of the time window.
Creating the data skipping index took 2,063.34s, which is comparable to running a single query using maximal time window without the Catalog and data skipping optimizations. Therefore, indexing results in a sustained benefit in this example, after just your first query.
Our results: Cost comparison
SQL Query users are billed according to the amount of data read—the current cost is $5 per TB scanned. Data skipping results in a maximal cost reduction of 43% and results in only 4GB of data read instead of 7GB for the full five-year window. Note again here that the comparison baseline has already been significantly optimized by using Parquet together with our rewrite technique to exploit Parquet Min/Max metadata. This optimization itself provides a x1000 cost reduction since without it, one would need to scan all data points in the given time window, reaching up to 10TB of data for the full five-year window:
All approaches must read the data in relevant objects to compute the query results. The difference in approaches is how much metadata is read to achieve this. When using data skipping, the data skipping metadata is read (13.5 MB) as well as the footers of relevant Parquet objects only. Without data skipping, the footers of all objects in the requested time window need to be read. Note that Parquet footers include additional information beyond Min/Max statistics, such as an object’s schema. This results in a significant cost reduction.
Creating the data skipping index scanned 114MB, which is almost two orders of magnitude less data than running a single query using maximal time window without the Catalog and data skipping optimizations. This huge difference is because indexing scans the indexed columns only and reads metadata from Parquet footers when available, whereas our example query which retrieves all columns. This demonstrates that indexing can pay off significantly even when running only a single query.
We demonstrated that by using the SQL Query Catalog together with data skipping, you can achieve order of magnitude performance improvements while enjoying significant cost reductions at the same time.
Get started with IBM Cloud SQL Query today.
The research leading to these results has received funding from the European Community’s Horizon 2020 research and innovation program under grant agreement n° 779747.