Data skipping library for notebooks
Data skipping can significantly boost the performance of SQL queries by skipping over irrelevant data, based on summary metadata associated with each data object. Because data skipping is purely a performance optimization, using it never changes the content of the query results.
The data skipping library is pre-installed on all Spark environments in Watson Studio.
All Spark native data formats are supported, including Parquet, ORC, CSV, JSON, Avro, and TSV. When an SQL query is run, the summary metadata is scanned and every object whose metadata does not overlap with the query's predicate is skipped.
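The skipping decision can be illustrated with a small sketch. The object names, column name, and metadata values below are hypothetical, and the real library stores its metadata in Parquet rather than a Python dictionary; this only shows the overlap test for a MinMax index:

```python
# Per-object summary metadata: (min, max) of a hypothetical indexed
# column "temp" for each data object in the data set.
minmax_metadata = {
    "part-0001.parquet": (12.0, 29.5),
    "part-0002.parquet": (30.1, 44.0),
    "part-0003.parquet": (25.0, 33.2),
}

def objects_to_scan(metadata, lower_bound):
    """Keep only objects whose [min, max] range can satisfy
    the predicate `temp > lower_bound`; skip the rest."""
    return [name for name, (_mn, mx) in metadata.items() if mx > lower_bound]

# For the predicate `WHERE temp > 30`, only objects whose maximum
# exceeds 30 need to be read; part-0001.parquet is skipped.
print(objects_to_scan(minmax_metadata, 30))
# → ['part-0002.parquet', 'part-0003.parquet']
```

Only the metadata is consulted at planning time; the skipped object's data is never read.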
To use data skipping, you first need to create indexes on one or more columns of the data set. After this is done, Spark SQL queries can benefit from data skipping automatically. In general, index the columns that are queried most often in the WHERE clause.
Three index types are supported:
| Index type | Description | Applicable to predicates in WHERE clauses | Column types |
|------------|-------------|-------------------------------------------|--------------|
| MinMax | Stores the minimum and maximum values for a column | `<`, `<=`, `=`, `>=`, `>` | All types except for complex types. See Supported Spark SQL data types. |
| ValueList | Stores the list of unique values for the column | `=`, `IN`, `LIKE` | All types except for complex types. See Supported Spark SQL data types. |
| BloomFilter | Uses a bloom filter to test for set membership | `=`, `IN` | Byte, String, Long, Integer, Short |
Use bloom filters for columns with very high cardinality. Index creation runs a Spark job that writes the metadata (the indexes) to a user-specified location, in Parquet format.
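The reason bloom filters suit high-cardinality columns is that they answer "definitely absent" or "possibly present" in a fixed amount of space, instead of storing every distinct value as a ValueList does. The sketch below is a minimal, standalone bloom filter, not the library's implementation; the bit-array size and hash count are arbitrary choices for illustration:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter sketch (illustration only)."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent" -> the object can be skipped.
        # True means "possibly present" -> the object must be scanned.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))

# One filter per data object: record the values seen in that object.
bf = BloomFilter()
for value in ("a1", "b2", "c3"):
    bf.add(value)

print(bf.might_contain("a1"))  # True: the object must be scanned
```

A false positive only costs an unnecessary scan of one object; there are never false negatives, so no matching data is ever skipped incorrectly.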
Note that the metadata is typically much smaller in volume than the data itself. If some of the objects in the data set change after the indexes were created, data skipping still works correctly but will not skip data in the changed objects. To keep skipping effective, you should refresh the indexes.
Geospatial data skipping
Data skipping is also supported in spatial queries on geospatial data sets with latitude and longitude columns. To benefit from data skipping, you can collect min/max indexes on the latitude and longitude columns. See the Geospatio-temporal library for notebooks for how to use spatial indexing functions.
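For a spatial query, skipping works the same way as for any other MinMax index: an object can be skipped when its latitude or longitude range does not intersect the query's bounding box. The object names and coordinate ranges below are hypothetical, for illustration only:

```python
# Per-object min/max metadata on the latitude and longitude columns.
latlon_metadata = {
    "obj-a": {"lat": (40.0, 41.0), "lon": (-74.5, -73.5)},  # around New York
    "obj-b": {"lat": (48.5, 49.2), "lon": (2.0, 2.9)},      # around Paris
}

def overlaps(rng, low, high):
    """True if the [min, max] range intersects [low, high]."""
    mn, mx = rng
    return mx >= low and mn <= high

def objects_for_bbox(metadata, lat_min, lat_max, lon_min, lon_max):
    """Keep objects whose lat/lon ranges both intersect the query box."""
    return [name for name, md in metadata.items()
            if overlaps(md["lat"], lat_min, lat_max)
            and overlaps(md["lon"], lon_min, lon_max)]

# A query box around Paris: obj-a cannot contain matching rows
# and is skipped without being read.
print(objects_for_bbox(latlon_metadata, 48.0, 50.0, 1.0, 3.0))
# → ['obj-b']
```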
The list of supported geospatial functions includes the following:
- Check out the following sample notebooks to learn how to use the data skipping library:
- Create an application that uses the data skipping library and runs as a Spark job on your Cloud Pak for Data cluster by using the Spark jobs REST APIs of Analytics Engine powered by Apache Spark: