Table of contents

Data skipping library

Use the data skipping library to boost the performance of SQL queries by skipping over irrelevant data based on summary metadata associated with each data object. Data skipping is a performance optimization feature which means that using data skipping does not affect the content of the query results.

All Spark native data formats are supported, including Parquet, ORC, CSV, JSON and Avro. When an SQL query is triggered, the summary metadata is scanned and all objects whose metadata does not overlap with the given predicate in the query are skipped.

You can use the data skipping library in Cloud Pak for Data to:

Using the library for Spark applications

To use data skipping, you need to create indexes on one or more columns of the data set. After this is done, Spark SQL queries can benefit from data skipping. In general, you should index the columns which are queried most often in the WHERE clause.

Three index types are supported:

Index type Description Applicable to predicates in WHERE clauses Column types
MinMax Stores minimum and maximum values for a column <,<=,=,>=,> All types except for complex types. See Supported Spark SQL data types.
ValueList Stores the list of unique values for the column =,IN,LIKE All types except for complex types. See Supported Spark SQL data types.
BloomFilter Uses a bloom filter to test for set membership =,IN Byte, String, Long, Integer, Short

You should use bloom filters for columns with very high cardinality. Index creation invokes a Spark job which writes metadata (indexes) to a user specified location, in Parquet format.

Note that the metadata is typically much smaller in volume than the data. If changes are made to some of the objects in the data set after the index was created, data skipping will still work correctly but will not skip data in the changed objects. To avoid this, you should refresh the indexes.

Geospatial data skipping

Data skipping is also supported in spatial queries on geospatial data sets with latitude and longitude columns. See Geospatio-temporal library for how to use spatial indexing functions.

The list of supported geospatial functions includes the following:

  • ST_Distance
  • ST_Intersects
  • ST_Contains
  • ST_Equals
  • ST_Crosses
  • ST_Touches
  • ST_Within
  • ST_Overlaps
  • ST_EnvelopesIntersect
  • ST_IntersectsInterior

Next step

Create an application that uses the data skipping library. This can include configuring parameters. See: