Data skipping library
Use the data skipping library to boost the performance of SQL queries by skipping over irrelevant data based on summary metadata associated with each data object. Data skipping is a performance optimization feature which means that using data skipping does not affect the content of the query results.
All Spark native data formats are supported, including Parquet, ORC, CSV, JSON and Avro. When an SQL query is triggered, the summary metadata is scanned and all objects whose metadata does not overlap with the given predicate in the query are skipped.
You can use the data skipping library in Cloud Pak for Data to:
- Run Spark jobs with Spark SQL on your Cloud Pak for Data cluster by using the Spark jobs REST APIs of Analytics Engine powered by Apache Spark. See Using the library for Spark applications.
- Run notebooks in Spark environments in Watson Studio. See Data skipping library for notebooks.
Using the library for Spark applications
To use data skipping, you need to create indexes on one or more columns of the data set. After this is done, Spark SQL queries can benefit from data skipping. In general, you should index the columns which are queried most often in the WHERE clause.
Three index types are supported:
You should use bloom filters for columns with very high cardinality. Index creation invokes a Spark job which writes metadata (indexes) to a user specified location, in Parquet format.
Note that the metadata is typically much smaller in volume than the data. If changes are made to some of the objects in the data set after the index was created, data skipping will still work correctly but will not skip data in the changed objects. To avoid this, you should refresh the indexes.
Geospatial data skipping
Data skipping is also supported in spatial queries on geospatial data sets with latitude and longitude columns. See Geospatio-temporal library for how to use spatial indexing functions.
The list of supported geospatial functions includes the following:
Create an application that uses the data skipping library. This can include configuring parameters. See: