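Data skipping for Spark SQL

Data skipping can significantly boost the performance of SQL queries by skipping over irrelevant data objects or files, based on summary metadata associated with each object.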
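The idea can be illustrated with a minimal, library-free sketch. It does not use Xskipper; the `ObjectSummary` layout, the object names and the `temp` values are all hypothetical and only show how per-object min/max summaries let a query rule out objects without reading them.

```python
from dataclasses import dataclass


@dataclass
class ObjectSummary:
    """Summary metadata kept for one data object (hypothetical layout)."""
    name: str
    min_temp: float
    max_temp: float


def objects_to_read(summaries, threshold):
    """Return the names of objects that may contain rows with temp > threshold."""
    # An object can be skipped only when its maximum rules out any match.
    return [s.name for s in summaries if s.max_temp > threshold]


summaries = [
    ObjectSummary("part-0.parquet", 10.0, 20.0),
    ObjectSummary("part-1.parquet", 18.0, 35.0),
    ObjectSummary("part-2.parquet", -5.0, 9.0),
]

# Query: SELECT * FROM t WHERE temp > 30
relevant = objects_to_read(summaries, 30.0)
print(relevant)
```

Only the objects returned by `objects_to_read` would have to be scanned; the other objects are skipped based on their summaries alone.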

Data skipping uses the open source Xskipper library for creating, managing and deploying data skipping indexes with Apache Spark. See Xskipper - An Extensible Data Skipping Framework.

For details on how to use Xskipper, see the Xskipper documentation.

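In addition to the open source features in Xskipper, the following features are also available:

Geospatial data skipping
Index encryption
Data skipping with joins (for Spark 3 only)
Samples showing these features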

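Geospatial data skipping

You can also use data skipping when querying geospatial data sets with geospatial functions from the spatio-temporal library.

To benefit from data skipping in data sets with latitude and longitude columns, you can collect min/max indexes on the latitude and longitude columns.
Data skipping can be used in data sets with a geometry column (a UDT column) by using a built-in Xskipper plugin.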
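As a rough illustration of why min/max indexes on latitude and longitude help, the following library-free sketch keeps one bounding box per object and retains only the objects whose box intersects the query area. The `BoundingBox` type, the object names and the coordinates are hypothetical, the sketch ignores boxes that wrap around the antimeridian, and it is not how the Xskipper plugin is implemented.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """Min/max latitude and longitude of all points in one object."""
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float


def boxes_intersect(a, b):
    """True when the two boxes overlap on both the latitude and longitude axes."""
    return (a.min_lat <= b.max_lat and b.min_lat <= a.max_lat
            and a.min_lon <= b.max_lon and b.min_lon <= a.max_lon)


# Hypothetical per-object summaries collected at indexing time.
object_boxes = {
    "nyc.parquet": BoundingBox(40.5, 40.9, -74.3, -73.7),
    "paris.parquet": BoundingBox(48.8, 48.9, 2.2, 2.5),
}

# Envelope of the area referenced by the geospatial predicate in the query.
query_area = BoundingBox(40.7, 40.8, -74.1, -73.9)

candidates = [name for name, box in object_boxes.items()
              if boxes_intersect(box, query_area)]
print(candidates)
```

An object whose box does not intersect the query area cannot contain a matching point, so it can be skipped safely; objects that do intersect still have to be read and filtered row by row.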

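The following sections show you how to work with the geospatial plugin.

Setting up the geospatial plugin

To use the plugin, load the relevant implementations using the Registration module. Note that you can only use Scala in IBM Analytics Engine powered by Apache Spark, not in Watson Studio.

For Scala: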

import com.ibm.xskipper.stmetaindex.filter.STMetaDataFilterFactory
import com.ibm.xskipper.stmetaindex.index.STIndexFactory
import com.ibm.xskipper.stmetaindex.translation.parquet.{STParquetMetaDataTranslator, STParquetMetadatastoreClauseTranslator}
import io.xskipper._

Registration.addIndexFactory(STIndexFactory)
Registration.addMetadataFilterFactory(STMetaDataFilterFactory)
Registration.addClauseTranslator(STParquetMetadatastoreClauseTranslator)
Registration.addMetaDataTranslator(STParquetMetaDataTranslator)

For Python:

from xskipper import Xskipper
from xskipper import Registration

Registration.addMetadataFilterFactory(spark, 'com.ibm.xskipper.stmetaindex.filter.STMetaDataFilterFactory')
Registration.addIndexFactory(spark, 'com.ibm.xskipper.stmetaindex.index.STIndexFactory')
Registration.addMetaDataTranslator(spark, 'com.ibm.xskipper.stmetaindex.translation.parquet.STParquetMetaDataTranslator')
Registration.addClauseTranslator(spark, 'com.ibm.xskipper.stmetaindex.translation.parquet.STParquetMetadatastoreClauseTranslator')

Index building

To build an index, you can use the addCustomIndex API. Note that you can only use Scala in IBM Analytics Engine powered by Apache Spark, not in Watson Studio.

For Scala:

import com.ibm.xskipper.stmetaindex.implicits._

// index the dataset (dataset_path and reader, a DataFrameReader for the dataset, are assumed to be defined)
val xskipper = new Xskipper(spark, dataset_path)

xskipper
  .indexBuilder()
  // using the implicit method defined in the plugin implicits
  .addSTBoundingBoxLocationIndex("location")
  // equivalent
  //.addCustomIndex(STBoundingBoxLocationIndex("location"))
  .build(reader).show(false)

For Python:

xskipper = Xskipper(spark, dataset_path)

# adding the index using the custom index API
xskipper.indexBuilder() \
        .addCustomIndex("com.ibm.xskipper.stmetaindex.index.STBoundingBoxLocationIndex", ['location'], dict()) \
        .build(reader) \
        .show(10, False)

Supported functions

The list of supported geospatial functions includes the following:

ST_Distance
ST_Intersects
ST_Contains
ST_Equals
ST_Crosses
ST_Touches
ST_Within
ST_Overlaps
ST_EnvelopesIntersect
ST_IntersectsInterior

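Index encryption

If you use a Parquet metadata store, the metadata can optionally be encrypted using Parquet Modular Encryption (PME). This is achieved by storing the metadata itself as a Parquet data set, so that PME can be used to encrypt it. This feature applies to all input formats; for example, a data set stored in CSV format can have its metadata encrypted using PME.

In the following section, unless stated otherwise, references to footers, columns, and so on are with respect to the metadata objects, not to the objects in the indexed data set.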

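Index encryption is modular and granular in the following way:

Each index can either be encrypted (with a per-index key granularity) or left in plain text.
Footer + object name column:
- The footer column of the metadata object, which is itself a Parquet file, contains, among other things:
  - The schema of the metadata object, which reveals the types, parameters and column names of all the indexes collected. For example, you can learn that a BloomFilter is defined on the column city with a false-positive probability of 0.1.
  - The full path to the original data set, or the table name in the case of a Hive metastore table.
- The object name column stores the names of all indexed objects.
The footer + object name column can either be:
- Both encrypted using the same key. This is the default. In this case, the Parquet objects comprising the metadata are in encrypted footer mode, and the object name column is encrypted using the selected key.
- Both in plain text. In this case, the Parquet objects comprising the metadata are in plain text footer mode, and the object name column is not encrypted.

  If at least one index is marked as encrypted, a footer key must be configured regardless of whether plain text footer mode is enabled. If plain text footer is set, the footer key is used only for tamper-proofing. Note that in that case the object name column is not tamper-proofed.

  If a footer key is configured, at least one index must be encrypted.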
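The two footer key rules above can be written down as a small consistency check. This is only a sketch of the rules as stated in this section: `validate_encryption_config` is a hypothetical helper, not part of the Xskipper API, and the index names are made up. Key values are key labels, never the keys themselves.

```python
def validate_encryption_config(footer_key, index_keys):
    """Check a planned metadata encryption setup against the footer key rules.

    footer_key: label of the footer key, or None if no footer key is configured.
    index_keys: maps an index name to its key label, or to None for plain text.
    """
    any_encrypted = any(label is not None for label in index_keys.values())
    # Rule 1: an encrypted index requires a footer key, even in plain text footer mode.
    if any_encrypted and footer_key is None:
        raise ValueError("a footer key is required when any index is encrypted")
    # Rule 2: a configured footer key requires at least one encrypted index.
    if footer_key is not None and not any_encrypted:
        raise ValueError("a footer key requires at least one encrypted index")
    return True


# Encrypted MinMax on temp (label k2), plain text ValueList on city, footer key label k1.
ok = validate_encryption_config("k1", {"temp_minmax": "k2", "city_valuelist": None})
print(ok)
```

A setup with no footer key and no encrypted index also passes the check; it corresponds to metadata that is not encrypted at all.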

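Before using index encryption, you should check the documentation on PME and make sure you are familiar with the concepts.

Important: When using index encryption, whenever a key is configured in any Xskipper API, it is always the key label, NEVER the key itself.

To use index encryption:

Follow all the steps to make sure PME is enabled. See PME.
Perform all the regular PME configurations, including the Key Management configurations.
Create encrypted metadata for a data set:
1. Follow the regular flow for creating metadata.
2. Configure a footer key. If you want to set a plain text footer + object name column, set io.xskipper.parquet.encryption.plaintext.footer to true (see the samples below).
3. In IndexBuilder, for each index that you want to encrypt, add the label of the key to use for that index.
To use the metadata at query time or to refresh existing metadata, no setup is necessary other than the regular PME setup required to make sure the keys are accessible (the same configuration that is needed to read an encrypted data set).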

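Samples

The following samples show metadata creation using a key named k1 as the footer + object name key and a key named k2 as the key to encrypt a MinMax index for temp, while also creating a ValueList index for city, which is left in plain text. Note that you can only use Scala in IBM Analytics Engine powered by Apache Spark, not in Watson Studio.

For Scala: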

// index the dataset
val xskipper = new Xskipper(spark, dataset_path)
// Configuring the JVM wide parameters
val jvmConf = Map(
  "io.xskipper.parquet.mdlocation" -> md_base_location,
  "io.xskipper.parquet.mdlocation.type" -> "EXPLICIT_BASE_PATH_LOCATION")
Xskipper.setConf(jvmConf)
// set the footer key
val conf = Map(
  "io.xskipper.parquet.encryption.footer.key" -> "k1")
xskipper.setConf(conf)
xskipper
  .indexBuilder()
  // Add an encrypted MinMax index for temp
  .addMinMaxIndex("temp", "k2")
  // Add a plaintext ValueList index for city
  .addValueListIndex("city")
  .build(reader).show(false)

For Python:

xskipper = Xskipper(spark, dataset_path)
# Add JVM Wide configuration
jvmConf = dict([
  ("io.xskipper.parquet.mdlocation", md_base_location),
  ("io.xskipper.parquet.mdlocation.type", "EXPLICIT_BASE_PATH_LOCATION")])
Xskipper.setConf(spark, jvmConf)
# configure footer key
conf = dict([("io.xskipper.parquet.encryption.footer.key", "k1")])
xskipper.setConf(conf)
# adding the indexes
xskipper.indexBuilder() \
        .addMinMaxIndex("temp", "k2") \
        .addValueListIndex("city") \
        .build(reader) \
        .show(10, False)

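If you want the footer + object name column to be left in plain text mode (as described above), you need to add the following configuration parameter:

For Scala: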

// index the dataset
val xskipper = new Xskipper(spark, dataset_path)
// Configuring the JVM wide parameters
val jvmConf = Map(
  "io.xskipper.parquet.mdlocation" -> md_base_location,
  "io.xskipper.parquet.mdlocation.type" -> "EXPLICIT_BASE_PATH_LOCATION")
Xskipper.setConf(jvmConf)
// set the footer key
val conf = Map(
  "io.xskipper.parquet.encryption.footer.key" -> "k1",
  "io.xskipper.parquet.encryption.plaintext.footer" -> "true")
xskipper.setConf(conf)
xskipper
  .indexBuilder()
  // Add an encrypted MinMax index for temp
  .addMinMaxIndex("temp", "k2")
  // Add a plaintext ValueList index for city
  .addValueListIndex("city")
  .build(reader).show(false)

For Python:

xskipper = Xskipper(spark, dataset_path)
# Add JVM Wide configuration
jvmConf = dict([
  ("io.xskipper.parquet.mdlocation", md_base_location),
  ("io.xskipper.parquet.mdlocation.type", "EXPLICIT_BASE_PATH_LOCATION")])
Xskipper.setConf(spark, jvmConf)
# configure footer key and plain text footer mode
conf = dict([("io.xskipper.parquet.encryption.footer.key", "k1"),
  ("io.xskipper.parquet.encryption.plaintext.footer", "true")])
xskipper.setConf(conf)
# adding the indexes
xskipper.indexBuilder() \
        .addMinMaxIndex("temp", "k2") \
        .addValueListIndex("city") \
        .build(reader) \
        .show(10, False)

Data skipping with joins (for Spark 3 only)

With Spark 3, you can use data skipping in join queries such as:

SELECT *
FROM orders, lineitem 
WHERE l_orderkey = o_orderkey and o_custkey = 800

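This example shows a star schema based on the TPC-H benchmark schema (see TPC-H), where lineitem is the fact table and contains many records, while the orders table is a dimension table with a relatively small number of records compared to the fact table.

The above query has a predicate on the orders table, which contains a small number of records, so a min/max index on that table alone does not yield much benefit from data skipping.

Dynamic data skipping is a feature that enables queries such as the one above to benefit from data skipping. It first extracts the relevant l_orderkey values based on the condition on the orders table, and then uses them to push down a predicate on l_orderkey that uses the data skipping indexes to filter out irrelevant objects.

To use this feature, enable the following optimization rule. Note that you can only use Scala in IBM Analytics Engine powered by Apache Spark, not in Watson Studio.

For Scala: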

  import com.ibm.spark.implicits._

  spark.enableDynamicDataSkipping()

For Python:

    from sparkextensions import SparkExtensions

    SparkExtensions.enableDynamicDataSkipping(spark)

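Then use the Xskipper API as usual, and your queries will benefit from data skipping.

For example, in the above query, indexing l_orderkey using min/max enables skipping over objects of the lineitem table and improves query performance.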
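The two-step idea behind dynamic data skipping can be sketched without Spark. The sketch below is only conceptual: the rows, the object names and the min/max ranges are hypothetical, and it does not reflect how the optimization rule is implemented.

```python
# Hypothetical rows of the small dimension table.
orders = [
    {"o_orderkey": 7, "o_custkey": 800},
    {"o_orderkey": 1042, "o_custkey": 800},
    {"o_orderkey": 55, "o_custkey": 12},
]

# Hypothetical min/max summaries of l_orderkey, one per lineitem object.
lineitem_minmax = {
    "lineitem-0.parquet": (1, 500),
    "lineitem-1.parquet": (501, 1000),
    "lineitem-2.parquet": (1001, 1500),
}

# Step 1: evaluate the predicate on orders to get the relevant join keys.
keys = sorted(o["o_orderkey"] for o in orders if o["o_custkey"] == 800)


def may_contain(bounds, keys):
    """True when at least one key falls inside the object's [min, max] range."""
    low, high = bounds
    return any(low <= k <= high for k in keys)


# Step 2: push the keys down as a predicate on l_orderkey and skip other objects.
to_scan = [name for name, bounds in lineitem_minmax.items()
           if may_contain(bounds, keys)]
print(keys, to_scan)
```

In this sketch the object `lineitem-1.parquet` is never read, because none of the order keys selected by `o_custkey = 800` falls inside its l_orderkey range.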

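Support for older metadata

Xskipper seamlessly supports older metadata created by the MetaIndexManager. Older metadata can be used for skipping, because updates to the Xskipper metadata are carried out automatically by the next refresh operation.

If you see DEPRECATED_SUPPORTED in front of an index when listing indexes or running a describeIndex operation, the metadata version is deprecated but still supported, and skipping will work. The next refresh operation updates the metadata automatically.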