IBM packages for Apache Spark version 2: Latest news and information

News

Abstract

Latest news and information for the IBM® packages for Apache Spark™ version 2.

Content

IBM Development Package for Apache Spark

These packages combine the data analytics capabilities of Apache Spark, version 2, with IBM SDK, Java™ Technology Edition, Version 8. Several packages for Linux are available to support different platform architectures. See Downloads.

See Known issues for information about current defects.

You can post questions on dW Answers. Include the "ibmjdk" and "spark" tags to help us find your questions.

Package build levels

Supplementary information about each release is contained in this technote.

Each package contains a different release of IBM SDK, Java Technology Edition, Version 8 and Apache Spark, version 2.

Click on the IBM package links to read more about what's new in each release.

Release date	IBM package	Java level	Apache Spark level
August 2017	2.1.1.1	8.0.4.10	2.1.1
May 2017	2.1.1.0	8.0.4.5	2.1.1
February 2017	2.1.0.1	8.0.4.1	2.1.0
December 2016	2.1.0.0	8.0.3.22	2.1.0
November 2016	2.0.2.0	8.0.3.20	2.0.2
October 2016	2.0.1.0	8.0.3.10	2.0.1
June 2016	2.0.0.0	8.0.3.0	2.0.0

What's new in 2.1.1.1

The 2.1.1.1 release of the IBM packages for Apache Spark contains Apache Spark version 2.1.1 and IBM SDK, Java Technology Edition, version 8, service refresh 4, fix pack 10. Fix pack 10 contains security updates, so you should upgrade to this release.

Notable changes:

Fixed security issues:

Eclipse Jetty vulnerability (CVE-2017-9735)
Apache Spark is vulnerable to cross-site scripting (CVE-2017-7678). This fix is a backport by the IBM team: the fix is included in Apache Spark only in release 2.2.0 and later.

Fix for Spark issue 21176

This release contains the fix for Spark issue 21176, "Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs". This fix is a backport by the IBM team: the fix is included in Apache Spark only in releases 2.1.2, 2.2.0, and later.

What's new in 2.1.1.0

The 2.1.1.0 release of the IBM packages for Apache Spark contains Apache Spark version 2.1.1 and IBM SDK, Java Technology Edition, version 8, service refresh 4, fix pack 5.

What's new in 2.1.0.1

The 2.1.0.1 release of the IBM packages for Apache Spark contains Apache Spark version 2.1.0 and IBM SDK, Java Technology Edition, version 8, service refresh 4, fix pack 1.

Notable changes:

This release of the Software Developers Kit (SDK) and Java runtime environment contains the latest Oracle Critical Patch Update (CPU), plus the following enhancements and changes to default behavior:

Changes to the IBMJSSE2 security provider cipher support; the 3DES algorithm is now considered insecure and is added to the list of disabled algorithms.

What's new in 2.1.0.0

The 2.1.0.0 release of the IBM packages for Apache Spark contains Apache Spark version 2.1.0 and IBM SDK, Java Technology Edition, version 8, service refresh 3, fix pack 22.

For an additional summary of the important changes in Apache Spark version 2.1.0, consider reading this blog post by Reynold Xin.

Notable changes:

Major bugs that are fixed in Apache Spark 2.1.0

A substantial number of major bugs have been fixed in this release.

For more information, see the list of resolved issues in the Apache Spark JIRA.

What's new in 2.0.2.0

The 2.0.2.0 release of the IBM packages for Apache Spark contains Apache Spark version 2.0.2 and IBM SDK, Java Technology Edition, version 8, service refresh 3, fix pack 20.

Notable changes:

Bugs with 'Correctness' tag that are fixed in Apache Spark 2.0.2

A number of bugs are resolved in the 2.0.2 release of Apache Spark. IBM recommends updating to this version. For more information, see the list of resolved issues in the Apache Spark JIRA.

What's new in 2.0.1.0

The 2.0.1.0 release of the IBM packages for Apache Spark contains Apache Spark version 2.0.1 and IBM SDK, Java Technology Edition, version 8, service refresh 3, fix pack 10.

Apache Hadoop version 2.7.3 is packaged with this release. For more information about this update, see: Apache Hadoop 2.7.3.

Notable changes:

Fixed security issues:

Multiple vulnerabilities fixed in IBM SDK, Java Technology Edition

Improved Catalyst code generation performance
In the 2.0.1.0 release of the IBM packages for Apache Spark, some aggregation functions (such as those with the sum operator) are accelerated by up to 5x when compared to Apache Spark 2.0.1 and earlier releases. This increase in performance is due to improvements in the way that Catalyst generates code. For more information about these improvements, see pull request 11956.

The following examples illustrate processes that are now accelerated:

Accelerated access to DataFrame.cache
Example:
val df = sparkContext.parallelize(0 until 1024 * 1024 * 30, 1).map(i => i.toDouble).toDF.cache
df.count
df.agg(sum("value")).collect

For more information about this change, see the following pull requests: 11956, 12894, 14091.
You can disable accelerated access to DataFrame.cache by setting the spark.sql.inMemoryColumnarStorage.codegen property to false.
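As a minimal sketch of how that property might be applied, the following standalone program sets it when the session is created and then repeats a smaller version of the cache example. The property name comes from the text above; the object name, application name, and local master are illustrative assumptions, and the text does not say whether the property can also be changed on a running session.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object DisableCacheCodegen {
  def main(args: Array[String]): Unit = {
    // Set the property at session creation; runtime changes are not
    // documented in this technote, so none are attempted here.
    val spark = SparkSession.builder()
      .appName("disable-cache-codegen") // illustrative name
      .master("local[*]")               // assumption: local test run
      .config("spark.sql.inMemoryColumnarStorage.codegen", "false")
      .getOrCreate()
    import spark.implicits._

    // Smaller variant of the DataFrame.cache example above.
    val df = spark.sparkContext
      .parallelize(0 until 1024, 1)
      .map(i => i.toDouble)
      .toDF
      .cache
    df.count
    df.agg(sum("value")).collect.foreach(println)

    spark.stop()
  }
}
```

Running the same job with and without the setting is one way to compare the two code paths on your own data.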

Accelerated access to primitive double array
Example:
val ds = sparkContext.parallelize(0 until 16, 1).map(i => Array.tabulate(1024 * 1024)(i => i.toDouble)).toDS
ds.count
ds.map(a => Array.tabulate(32)(i => a(i) + a(i * 1024))).collect

For more information about this change, see the following pull requests: 13680, 13704, 13758, 13909, 13911.

What's new in 2.0.0.0

The 2.0.0.0 release of the IBM packages for Apache Spark contains Apache Spark version 2.0.0 and IBM SDK, Java Technology Edition, version 8, service refresh 3.

Notable changes:

Fixed security issues:

GPU acceleration for machine learning algorithm

Support is now added for transparent GPU acceleration of the Alternating Least Squares machine learning algorithm, on the Intel x86 and Little Endian IBM Power platforms.

Use the --conf spark.mllib.ALS.useGPU=$SPARK_HOME/lib/ibm/libGPUALS.so option when running an Alternating Least Squares Spark job to enable GPU acceleration.

You do not need to change your existing code or configuration. However, CUDA 7.5 is required for GPU acceleration. You can download CUDA 7.5 from the NVIDIA website. To check that your system meets the minimum requirements, run ldd ibm/gpu/libGPUALS.so and ensure that there are no occurrences of the phrase "not found".
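The ldd check described above can also be scripted. The following sketch runs ldd on a library path that is supplied as an argument and reports any lines that contain "not found". The object and method names are hypothetical, and the path is deliberately not hard-coded.

```scala
import scala.sys.process._

object GpuAlsDependencyCheck {
  /** Returns the lines of ldd output that report an unresolved library. */
  def missingLibraries(lddOutput: String): Seq[String] =
    lddOutput.split("\n").map(_.trim).filter(_.contains("not found")).toSeq

  def main(args: Array[String]): Unit = {
    // The caller supplies the path to libGPUALS.so; no default is assumed.
    val libraryPath = args.headOption.getOrElse(
      sys.error("usage: GpuAlsDependencyCheck <path to libGPUALS.so>"))

    // Run ldd and capture its standard output (throws on a non-zero exit).
    val output = Seq("ldd", libraryPath).!!

    val missing = missingLibraries(output)
    if (missing.isEmpty) println("All shared library dependencies are resolved.")
    else missing.foreach(line => println(s"Unresolved: $line"))
  }
}
```

An empty result corresponds to the "no occurrences of not found" condition in the text.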

You can provide feedback and raise defects for this feature on the IBMSparkGPU/CUDA-MLlib_GitHub repository.

SizeEstimator footprint reduced

The SizeEstimator class can be used to estimate the size of a task. In this release, the memory footprint of SizeEstimator for cached Resilient Distributed Datasets (RDDs) is reduced. The lower memory consumption results in better performance.
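For readers unfamiliar with the class, the following sketch shows one way to call the public org.apache.spark.util.SizeEstimator.estimate method from Apache Spark on two in-memory arrays. It illustrates the API only and does not measure the footprint reduction described above; the object name and array sizes are arbitrary choices.

```scala
import org.apache.spark.util.SizeEstimator

object SizeEstimatorSketch {
  def main(args: Array[String]): Unit = {
    // Two primitive double arrays of different lengths (arbitrary sizes).
    val small = Array.fill(1024)(0.0)
    val large = Array.fill(1024 * 1024)(0.0)

    // estimate returns the approximate number of bytes the object
    // occupies on the JVM heap.
    println(s"1,024 doubles:     ${SizeEstimator.estimate(small)} bytes (estimated)")
    println(s"1,048,576 doubles: ${SizeEstimator.estimate(large)} bytes (estimated)")
  }
}
```

The program needs the Spark core library on the classpath but does not start a SparkContext.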

Snappy version 0.4

Platform: Linux on IBM z Systems

org.iq80.snappy version 0.4 is packaged with this release. This version of Snappy fixes a bug that could cause data corruption in some situations.

Unaligned memory access is no longer required

Platform: Linux on IBM z Systems and IBM Power systems

In this release, unaligned memory access is no longer required to run jobs off-heap on these platforms.
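The following sketch assumes that "off-heap" here refers to the standard Apache Spark off-heap memory settings, spark.memory.offHeap.enabled and spark.memory.offHeap.size. The technote does not name these properties, so treat them as an assumption; the object name, method name, and 512 MB size are illustrative.

```scala
import org.apache.spark.SparkConf

object OffHeapConfSketch {
  /** Builds a SparkConf with Spark's standard off-heap settings enabled. */
  def offHeapConf(sizeInBytes: Long): SparkConf =
    new SparkConf(false) // false: do not load spark.* system properties
      .setAppName("off-heap-sketch") // illustrative name
      // Assumption: these standard properties are what the text means
      // by running jobs off-heap.
      .set("spark.memory.offHeap.enabled", "true")
      .set("spark.memory.offHeap.size", sizeInBytes.toString)

  def main(args: Array[String]): Unit = {
    val conf = offHeapConf(512L * 1024 * 1024) // 512 MB, arbitrary
    println(conf.toDebugString)
  }
}
```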

Known issues in IBM packages for Apache Spark version 2

Netty security issue (CVE-2016-4970)

IBM recommends that you do not configure and use netty-tcnative and the Netty OpenSslEngine with version 2.0.2.0 or earlier of the IBM Development Package for Apache Spark, because of CVE-2016-4970.

IBM recommends using version 2.1.0.0 or later of the package, which resolves this issue.

If you cannot update your installation, you should consider testing a configuration workaround that is published for this CVE. For more information, see the "Workarounds and Mitigations" section of the IBM Security Bulletin for CVE-2016-4970.

Tungsten issue in a mixed Endian cluster

Spark SQL functions that use Tungsten optimizations are not currently supported in a mixed Endian cluster. For more information about the issue, including any updates, see Spark-12778.
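To find out whether a cluster mixes byte orders, you can print the native byte order on each node. The following sketch uses only the standard java.nio.ByteOrder class; the object and method names are hypothetical, and running the check on every node is left to your own tooling.

```scala
import java.nio.ByteOrder

object EndianCheck {
  /** Name of the native byte order: "BIG_ENDIAN" or "LITTLE_ENDIAN". */
  def nativeOrderName: String = ByteOrder.nativeOrder().toString

  def main(args: Array[String]): Unit =
    println(s"Native byte order on this node: $nativeOrderName")
}
```

If the nodes report different values, the cluster is mixed Endian and the restriction above applies.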
