Migrating to a new version of Apache Spark

IBM® Z Platform for Apache Spark (FMID HSPK130) is built on Apache Spark. Different service (PTF) levels of the product might provide different versions of Apache Spark. Perform the following steps if you are migrating from one version of Apache Spark to another.

Before you begin

If you previously installed IBM z/OS® Platform for Apache Spark, Version 1.1.0, or IBM Open Data Analytics for z/OS, Version 1.1.0, determine the Apache Spark version that is provided. You can find the Apache Spark version in the RELEASE file in the Spark installation directory. The following sample RELEASE file contents indicate that Apache Spark 3.2.4 is provided.
IBM Platform for Apache Spark - Spark, Version 3.2.4 build for Hadoop 3.3.5
Built with Java JRE 1.8.0 IBM ZOS build 8.0.8.25 - pmz6480sr8fp25-20240328_01(SR8 FP25)
Built from Git zos_Spark_3.2.4.5 (revision dd2218dd09818049c3ee64f0a67ecc6d6bba50b2)
Built via Jenkins job zAnalytics/IzODA/zSpark/zos_Spark_3.2.4.5, Build#2
Build flags: -Phive -Phive-thriftserver -Phadoop-3.2
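A minimal way to display these contents is to read the RELEASE file directly. This sketch assumes that SPARK_HOME points at your Spark installation directory; the fallback path below is hypothetical and should be replaced with your site's actual path.

```shell
# SPARK_HOME is assumed to point at the Spark installation directory;
# the default used here is a hypothetical placeholder.
release="${SPARK_HOME:-/usr/lpp/IBM/Spark}/RELEASE"
if [ -f "$release" ]; then
  # Print the version and build information shipped with this level.
  cat "$release"
else
  echo "No RELEASE file at $release; verify that SPARK_HOME is set."
fi
```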
Important: Mixing Spark components, such as the master and worker daemons, from different Apache Spark versions can yield undesirable and unpredictable results. For example, a Spark cluster might not function properly if the master daemon is started from Apache Spark 3.2.4 while the worker daemon is started from Apache Spark 3.5.0.
Important: IBM urges you to install and test the new version of Apache Spark on a test system before you install it on a production system. IBM also recommends that you back up any custom files, such as spark-defaults.conf and spark-env.sh, before installing the new version.
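The backup that IBM recommends can be as simple as copying the customized files to a dated directory before the upgrade. This is a sketch only; SPARK_CONF_DIR and the fallback path are assumptions, so substitute your site's configuration directory.

```shell
# SPARK_CONF_DIR is an assumed variable; the ./conf fallback is a placeholder.
CONF_DIR="${SPARK_CONF_DIR:-./conf}"
BACKUP_DIR="./spark-conf-backup-$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"
# Copy each customized configuration file, preserving its attributes.
for f in spark-defaults.conf spark-env.sh; do
  if [ -f "$CONF_DIR/$f" ]; then
    cp -p "$CONF_DIR/$f" "$BACKUP_DIR/"
  fi
done
```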

Before installing the new version of Apache Spark

Complete the following steps to understand the impact of migrating to a newer version of Apache Spark on your applications and to update your level of Java™.

  1. Review the new functionality in the new version of Apache Spark and the changes to the Spark APIs to determine any changes that you might need to make to your applications before migration. Use the information at the following links to learn about the changes. Be sure to consider the changes for each intermediate Apache Spark version. For example, if you are migrating from Apache Spark 3.2.0 to 3.5.0, you need to consider the changes for Apache Spark 3.3.0 and 3.4.0.
  2. Based on your findings from the information in step 1, update your applications as needed to work with the new Spark version.
  3. If you are using an older Java level than the one indicated in the RELEASE file, consider updating your Java level.
  4. Ensure that any other open source or third-party software in your environment that interacts with Spark supports the new version of Apache Spark. For example, some versions of Scala Workbench do not work with the new versions of Apache Spark.
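For step 3, one quick way to compare your current Java level against the level named in the RELEASE file is to query the java command itself, as in this sketch:

```shell
# Display the Java level currently on the PATH so it can be compared
# against the level listed in the RELEASE file.
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
else
  echo "java not found on PATH"
fi
```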

Installing the new version of Apache Spark

Install IBM Z Platform for Apache Spark, FMID HSPK130 and its service updates (PTFs).

For installation guidelines, see Program Directory for IBM Z Platform for Apache Spark (GI13-5806-01 or later).

After installing the new version of Apache Spark

  1. Recompile applications that use any of the changed Spark APIs.
  2. Examine any new Apache Spark configuration options and make necessary changes to your spark-defaults.conf and spark-env.sh configuration files.

    For the current list of configuration options, see http://spark.apache.org/docs/3.2.4/configuration.html or http://spark.apache.org/docs/3.5.0/configuration.html. A new Apache Spark version might introduce new configuration options as well as deprecate existing ones.

    Note: For the contents of the spark-defaults.conf and spark-env.sh configuration files, you can find IBM-supplied default values in spark-defaults.conf.template and spark-env.sh.template.
  3. If you use the spark-submit or spark-sql command line interface, you must either invoke them from a writable directory or change your configuration files. For more information, see Updating the Apache Spark configuration files.
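For step 2, one way to spot options that the new version introduced or changed is to diff each customized file against its IBM-supplied template. This is a sketch under the assumption that the templates sit beside the customized files in the configuration directory; SPARK_CONF_DIR and the fallback path are placeholders.

```shell
# CONF_DIR is an assumed location for the Spark configuration files.
CONF_DIR="${SPARK_CONF_DIR:-./conf}"
for f in spark-defaults.conf spark-env.sh; do
  if [ -f "$CONF_DIR/$f" ] && [ -f "$CONF_DIR/$f.template" ]; then
    # Unified diff of the shipped defaults against your customizations;
    # diff exits nonzero when files differ, so ignore that status.
    diff -u "$CONF_DIR/$f.template" "$CONF_DIR/$f" || true
  fi
done
```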