Data skipping for Spark SQL

Data skipping can significantly boost the performance of SQL queries by skipping over irrelevant data objects or files based on a summary metadata associated with each object.

Data skipping uses the open source Xskipper library for creating, managing and deploying data skipping indexes with Apache Spark. See Xskipper - An Extensible Data Skipping Framework.

For more details on how to work with Xskipper see:

In addition to the open source features in Xskipper, the following features are also available:

Geospatial data skipping

You can also use data skipping when querying geospatial data sets using geospatial functions from the spatio-temporal library.

The next sections show you to work with the geospatial plugin.

Setting up the geospatial plugin

To use a plugin, load the relevant implementations using the Registration module:

Index building

To build an index, you can use the addCustomIndex API:

Supported functions

The list of supported geospatial functions includes the following:

Encrypting indexes

If you use a Parquet metadata store, the metadata can optionally be encrypted using Parquet Modular Encryption (PME). This is achieved by storing the metadata itself as a Parquet data set, and thus PME can be used to encrypt it. This feature applies to all input formats, for example, a data set stored in CSV format can have its metadata encrypted using PME.

In the following section, unless specified otherwise, when referring to footers, columns, and so on, these are with respect to metadata objects, and not to objects in the indexed data set.

Index encryption is modular and granular in the following way:

Before using index encryption, you should check the documentation on PME and make sure you are familiar with the concepts.

When using index encryption, whenever a key is configured in any Xskipper API, it’s always the label NEVER the key itself.

To use index encryption:

  1. Follow all the steps to make sure PME is enabled. See PME.
  2. Perform all regular PME configurations, including Key Management configurations.
  3. Create encrypted metadata for a data set:

    1. Follow the regular flow for creating metadata.
    2. Configure a footer key. If you wish to set a plain text footer + object name column, set io.xskipper.parquet.encryption.plaintext.footer to true (See samples below).
    3. In IndexBuilder, for each index you want to encrypt, add the label of the key to use for that index.

    To use metadata during query time or to refresh existing metadata, no setup is necessary other than the regular PME setup required to make sure the keys are accessible (literally the same configuration needed to read an encrypted data set).

Samples

The following samples show metadata creation using a key named k1 as a footer + object name key, and a key named k2 as a key to encrypt a MinMax for temp, while also creating a ValueList for city, which is left in plain text.

If you want the footer + object name to be left in plain text mode (as mentioned above), you need to add the configuration parameter:

Data Skipping with joins (Spark 3 only)

With Spark 3, you can also use data skipping in join queries such as:

SELECT *
FROM orders, lineitem 
WHERE l_orderkey = o_orderkey and o_custkey = 800

In this example, you have a star schema based on the TPC-H benchmark schema, where lineitem is a fact table and contains many records while the orders table is a dimension table with a relatively small number of records compared to the fact tables.

The above query has a predicate on the orders tables with a small number of records which means that, by using min/max, you will not benefit much from data skipping.

Dynamic Data Skipping is a feature which enables queries, as in the previous example, to enjoy skipping by first extracting the relevant l_orderkey values based on the condition on the orders table and then using it to push down a predicate on l_orderkey that uses data skipping indexes to filter irrelevant objects.

To use Dynamic Data Skipping enable the following optimization rule:

Thereafter, you can use the Xskipper API as usual and your queries will benefit using Dynamic Data Skipping. For example, in the previous query, indexing l_orderkey by using min/max will enable skipping over the lineitem table and improve query performance.

Support for older metadata

Xskipper supports older metadata created by the MetaIndexManager seamlessly. Older metadata can be used for skipping as updates to the Xskipper metadata are carried out automatically by the next refresh operation.

If you see DEPRECATED_SUPPORTED in front of an index when listing indexes or running a describeIndex operation, the metadata version is deprecated but is still supported and skipping will work. The next refresh operation will update the metadata automatically.