Apache Gluten accelerated Spark engine

Apache Gluten is a native engine plugin that accelerates Spark SQL and DataFrame workloads. It integrates native engines with Spark SQL, combining the high scalability of the Spark SQL framework with the superior performance of native engines. It is fully compatible with the Apache Spark API, so no code changes are necessary.

The following are the key features of the Apache Gluten accelerated Spark engine.

  • Support for SQL and equivalent DataFrame operations with Delta and Parquet tables.
  • Accelerated queries that process data faster and include aggregations and joins.
  • Faster performance when data is accessed repeatedly from the disk cache.
  • Robust scan performance on tables with many columns and many small files.
  • Faster Delta and Parquet writing using UPDATE, DELETE, MERGE INTO, INSERT, and CREATE TABLE AS SELECT, including wide tables that contain thousands of columns.
  • Replacement of sort-merge joins with hash joins by default.

To provision an Apache Gluten accelerated Spark engine, see Adding a Spark engine.

Supported operators, expressions, and data types

The following are the operators, expressions, and data types that Apache Gluten covers.
Operators
  • Scan, Filter, Project
  • Hash Aggregate/Join/Shuffle
  • Nested-Loop Join
  • Null-Aware Anti Join
  • Union, Expand, ScalarSubquery
  • Delta/Parquet Write Sink
  • Sort
  • Window Function
For more information about operator support, see Operators and Functions Support.
Expressions
  • Comparison / Logic
  • Arithmetic / Math (most)
  • Conditional (IF, CASE, etc.)
  • String (common ones)
  • Casts
  • Aggregates (most common ones)
  • Date/Timestamp
For more information about function support, see Scalar Functions Support.
Data types
  • Byte/Short/Int/Long
  • Boolean
  • String/Binary
  • Decimal
  • Float/Double
  • Date/Timestamp
  • Struct
  • Array
  • Map

Supported Spark versions

IBM® watsonx.data supports the following Spark runtime versions to run Spark workloads.
Name | Status | Release date | End-of-support date | Supported languages
Apache Spark 3.4.4 | Supported | JAN 2025 | JUNE 2026 | Python 3.10, Scala 2.12
Apache Spark 3.5.4 | Supported | JUNE 2025 | FEB 2028 | Python 3.11, Scala 2.12

The following examples show sample payloads for submitting a Spark job in different languages.
  • Payload for submitting a Spark job with Python 3.10:

    {"application_details":{"application":"<your application_file_path>","arguments":["<your_application_arguments>"],"conf":{"spark.app.name":"MyJob","spark.eventLog.enabled":"true"},"env":{"RUNTIME_PYTHON_ENV":"python310"}}}
  • Payload for submitting a Spark Scala job:

    {"application_details":{"application":"/opt/ibm/spark/examples/jars/spark-examples*.jar","arguments":["1"],"class":"org.apache.spark.examples.SparkPi","conf":{"spark.app.name":"MyJob","spark.eventLog.enabled":"true","spark.driver.memory":"4G","spark.driver.cores":1,"spark.executor.memory":"4G","spark.executor.cores":1,"ae.spark.executor.count":1},"env":{"SAMPLE_ENV_KEY":"SAMPLE_VALUE"}}}
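The payloads above can also be assembled programmatically before submission. The following is a minimal Python sketch, not an official client: the helper name build_spark_payload and its parameters are hypothetical, the field names are copied from the sample payloads above, and the submission endpoint and authentication are deliberately left out.

```python
import json


def build_spark_payload(application, arguments, conf=None, env=None, main_class=None):
    """Assemble a job payload shaped like the samples above (hypothetical helper)."""
    details = {
        "application": application,
        "arguments": list(arguments),
        "conf": dict(conf or {}),
        "env": dict(env or {}),
    }
    # Only Scala/Java jobs carry a main class, as in the Scala sample above.
    if main_class is not None:
        details["class"] = main_class
    return {"application_details": details}


# Python job: placeholders are kept exactly as in the sample payload.
python_payload = build_spark_payload(
    application="<your application_file_path>",
    arguments=["<your_application_arguments>"],
    conf={"spark.app.name": "MyJob", "spark.eventLog.enabled": "true"},
    env={"RUNTIME_PYTHON_ENV": "python310"},
)

# Serialize to the JSON text that would be sent as the request body.
print(json.dumps(python_payload, indent=2))
```

Building the payload from a dictionary and serializing it with json.dumps avoids hand-written quoting mistakes in the request body.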

Best Practices

  • Apache Gluten requires a large amount of off-heap memory because it integrates with the native backend, Velox, which uses Apache Arrow's off-heap columnar format. This format is essential for large-scale data processing without exceeding JVM heap limits. Insufficient off-heap memory can lead to out-of-memory (OOM) errors (see Limitation). Therefore, by default, 75% of executor memory is allocated off heap.
  • You can adjust the percentage of memory allocated off heap by using the configuration ae.spark.offHeap.factor, which accepts values between 0 and 1 (exclusive). Example: 0.75
  • Set this to a lower value (below 0.5) if your workload frequently falls back to Java-based Spark (see Limitation), so that OOM errors do not occur during fallback execution. Set it to a higher value (0.75 or above) if your workload runs natively on Apache Gluten without fallback. The default value is 0.75.
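To make the trade-off concrete, the following Python sketch estimates how a given factor divides executor memory. It assumes that the factor is applied as a simple fraction of the configured executor memory; the actual allocation performed by the platform may differ, for example because of overhead reservations. The function name split_executor_memory is hypothetical, and passing ae.spark.offHeap.factor as a string inside the payload's conf object is also an assumption.

```python
def split_executor_memory(executor_memory_gb, off_heap_factor=0.75):
    """Return (off_heap_gb, on_heap_gb) for a factor strictly between 0 and 1."""
    if not 0 < off_heap_factor < 1:
        raise ValueError("ae.spark.offHeap.factor must be between 0 and 1 (exclusive)")
    # Assumption: the factor is a plain fraction of executor memory.
    off_heap_gb = executor_memory_gb * off_heap_factor
    on_heap_gb = executor_memory_gb - off_heap_gb
    return off_heap_gb, on_heap_gb


# Default factor: suited to workloads that run natively on Apache Gluten.
print(split_executor_memory(4))  # (3.0, 1.0)

# Lower factor: leaves more JVM heap for workloads that often fall back.
off_heap, on_heap = split_executor_memory(4, 0.4)
print(f"off-heap={off_heap:.1f}G on-heap={on_heap:.1f}G")

# Hypothetical conf fragment carrying the chosen factor (value format assumed).
conf = {"spark.executor.memory": "4G", "ae.spark.offHeap.factor": "0.4"}
```

With a 4G executor, lowering the factor from 0.75 to 0.4 moves most of the memory back to the JVM heap, which is the direction recommended above for workloads with frequent fallback.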

Limitation

  • Amazon S3 object stores support DAS for application submission, but other object stores, such as ADLS and GCS, require explicit credentials to be passed.
  • Smaller queries are not accelerated.
  • Catalog association is supported only for S3 object stores.
  • Fallback to JVM
    • ANSI: Apache Gluten currently does not support ANSI mode. If ANSI mode is enabled, execution of the Spark plan always falls back to vanilla Spark.
    • File source format: Currently, Apache Gluten supports only the Parquet file format and partially supports ORC. If another format is used, the scan operator falls back to vanilla Spark.
  • Incompatible behavior
    • JSON functions: Velox supports only strings surrounded by double quotes, not single quotes, in JSON data. If single quotes are used, Apache Gluten returns incorrect results.
    • Parquet read configuration: Apache Gluten accepts spark.files.ignoreCorruptFiles (default false), but setting it to true has no effect; the behavior is the same as when it is false. Apache Gluten ignores spark.sql.parquet.datetimeRebaseModeInRead and returns values exactly as they were written in the Parquet file. It does not account for the difference between the legacy hybrid (Julian + Gregorian) calendar and the Proleptic Gregorian calendar, so results may differ from vanilla Spark.
    • Parquet write configuration: Spark provides the spark.sql.parquet.datetimeRebaseModeInWrite config to decide whether the legacy hybrid (Julian + Gregorian) calendar or the Proleptic Gregorian calendar is used when writing dates and timestamps to Parquet. If the Parquet file being read was written by Spark with this config set to use the legacy hybrid calendar, Velox's TableScan outputs different results when reading it back.
  • Spill
    • An OutOfMemory exception can still occur in the current implementation of the spill-to-disk feature when the number of shuffle partitions is set to a large value. If this happens, reduce the number of partitions to avoid the OOM.
  • CSV Read
    • The header option must be true. Only DataSource V1 is currently supported, so set spark.sql.sources.useV1SourceList=csv. User-defined read options are not supported and cause CSV reads to fall back to vanilla Spark in most cases. CSV reads also fall back to vanilla Spark, and a warning is logged, when the user-specified schema differs from the file schema.
    • For more details about Apache Gluten limitations, see limitation.
  • Apache Gluten is not supported in FIPS-enabled environments.
  • You must provision a new Apache Gluten accelerated Spark engine when you upgrade your watsonx.data instance to 2.2, because the preview engine will no longer be functional.
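Several of the limitations above are tied to specific Spark settings, so a job configuration can be screened before submission. The following Python sketch is illustrative only: the function find_fallback_risks and the shuffle partition threshold are hypothetical, and the checks cover only the ANSI, CSV, and spill items described above. It assumes that the standard Spark property spark.sql.ansi.enabled controls ANSI mode and that settings left unset keep their runtime defaults.

```python
def find_fallback_risks(conf, max_shuffle_partitions=2000):
    """Return warnings for explicit settings tied to fallback or OOM in the list above.

    max_shuffle_partitions is an arbitrary illustrative threshold, not a documented limit.
    """
    warnings = []

    # ANSI mode makes plan execution fall back to vanilla Spark.
    if str(conf.get("spark.sql.ansi.enabled", "false")).lower() == "true":
        warnings.append("ANSI mode is enabled: execution falls back to vanilla Spark")

    # CSV reads need DataSource V1; only flag a list that is set explicitly without csv.
    if "spark.sql.sources.useV1SourceList" in conf:
        sources = [s.strip().lower() for s in str(conf["spark.sql.sources.useV1SourceList"]).split(",")]
        if "csv" not in sources:
            warnings.append("csv is missing from spark.sql.sources.useV1SourceList")

    # A very large shuffle partition count can still trigger OOM during spill.
    if "spark.sql.shuffle.partitions" in conf:
        if int(conf["spark.sql.shuffle.partitions"]) > max_shuffle_partitions:
            warnings.append("shuffle partition count is high: consider reducing it")

    return warnings


job_conf = {
    "spark.sql.ansi.enabled": "true",
    "spark.sql.sources.useV1SourceList": "parquet,orc",
    "spark.sql.shuffle.partitions": "5000",
}
for message in find_fallback_risks(job_conf):
    print(message)
```

A check like this does not detect every fallback (for example, unsupported file formats or schema mismatches are only visible at run time), so it complements rather than replaces inspecting the executed plan.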