Extending analytics using Spark (Analytics Engine powered by Apache Spark)

You can use Analytics Engine powered by Apache Spark as a compute engine to run analytical and machine learning jobs.

The Analytics Engine powered by Apache Spark service is not available by default. An administrator must install the service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.

Each time you submit a job, a dedicated Spark cluster is created for the job. You can specify the size of the Spark driver, the size of the executor, and the number of executors for the job. This enables you to achieve predictable and consistent performance.
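As an illustration, this per-job sizing corresponds to standard Spark configuration properties. The sketch below sets them from inside a PySpark application; the property names are standard Spark settings, but the values are made up, and the exact fields the service expects in a job payload may differ.

```python
from pyspark.sql import SparkSession

# Sketch: standard Spark properties that control per-job sizing.
# The values are illustrative; choose them to match your workload.
spark = (
    SparkSession.builder
    .appName("sized-job")
    .config("spark.driver.memory", "4g")      # size of the Spark driver
    .config("spark.executor.memory", "8g")    # size of each executor
    .config("spark.executor.cores", "2")      # cores per executor
    .config("spark.executor.instances", "4")  # number of executors
    .getOrCreate()
)
```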

When a job completes, the cluster is automatically cleaned up so that the resources are available for other jobs. The service also includes interfaces that enable you to analyze the performance of your Spark applications and debug problems.

You can submit jobs to Spark clusters in two ways:

  • Specifying a Spark environment definition for a job in an analytics project
  • Running Spark job APIs (a request sketch follows this list)
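A job submission over the API is typically an HTTP POST with a JSON payload. In the sketch below, the endpoint path, payload fields, and placeholders in angle brackets are hypothetical, not the service's documented contract; consult the Spark job API reference for the actual URL, fields, and authentication scheme.

```python
import requests

# Hypothetical endpoint and payload shape -- see the Spark job API
# documentation for the real URL, field names, and auth scheme.
url = "https://<cpd-host>/v4/analytics_engines/<instance-id>/spark_applications"
headers = {
    "Authorization": "Bearer <access-token>",  # platform access token
    "Content-Type": "application/json",
}
payload = {
    "application": "/myapp/wordcount.py",  # path to the application file
    "conf": {                              # standard Spark properties
        "spark.driver.memory": "4g",
        "spark.executor.memory": "8g",
        "spark.executor.instances": "4",
    },
}
response = requests.post(url, headers=headers, json=payload)
print(response.status_code, response.json())
```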

Spark environments in analytics projects

If you have the Watson Studio service installed, the Analytics Engine powered by Apache Spark service automatically adds a set of default Spark environment definitions to analytics projects. You can also create custom Spark environment definitions in a project.

You can see Spark environment definitions on the Environments page in an analytics project.

See Spark environments.

Spark APIs

You can run the following types of workloads with the Spark job APIs (see the sketch after this list):

  • Spark applications that run Spark SQL
  • Data transformation jobs
  • Data science jobs
  • Machine learning jobs
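As a concrete example of the first workload type, here is a minimal self-contained PySpark application that runs Spark SQL. The data and view name are made up for illustration; a real job would read from a data source.

```python
from pyspark.sql import SparkSession

# Minimal Spark SQL application of the kind you can submit as a Spark job.
spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Illustrative in-memory data; a real job would read from a data source.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    ["name", "age"],
)
df.createOrReplaceTempView("people")

# Run a SQL query against the temporary view and show the result.
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```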

See Getting started with Spark applications.

Learn more