Extending analytics using Spark (Analytics Engine powered by Apache Spark)

You can use Analytics Engine powered by Apache Spark as a compute engine to run analytical and machine learning jobs.

IBM Analytics Engine powered by Apache Spark provides a managed service for consuming Apache Spark, with additional features such as auto-scaling, resource quotas, and queuing. You can run Spark applications interactively in Jupyter notebooks and scripts, in both Python and R. You can also run applications as jobs from a notebook, from a deployment space, or directly through the Spark service instance. IBM Analytics Engine powered by Apache Spark creates on-demand Spark clusters and runs workloads through offerings such as Spark applications, Spark kernels, and Spark labs.

The IBM Analytics Engine powered by Apache Spark service is not available by default. An administrator must install this service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.

Each time you submit a job, a dedicated Spark cluster is created for the job. You can specify the size of the Spark driver, the size of each executor, and the number of executors for the job. This enables you to achieve predictable and consistent performance.

When a job completes, the cluster is automatically cleaned up so that the resources are available for other jobs. The service also includes interfaces that enable you to analyze the performance of your Spark applications and debug problems.
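As an illustrative sketch only, the following Python snippet shows how a job submission with explicit driver and executor sizing might look. The endpoint path, payload layout, and file paths are placeholders and assumptions, not the exact Spark job API contract; only the spark.* properties under "conf" are standard Apache Spark settings. See the Spark job API reference for the exact request format.

```python
# Hypothetical sketch of submitting a Spark job with explicit sizing.
# The endpoint, token, and payload keys are placeholders; only the
# spark.* properties under "conf" are standard Apache Spark settings.
import requests

CPD_HOST = "https://<cpd-host>"                      # placeholder cluster host
JOBS_ENDPOINT = f"{CPD_HOST}/<spark-jobs-endpoint>"  # placeholder; see the API reference
TOKEN = "<platform-access-token>"                    # placeholder bearer token

payload = {
    "application_details": {                         # assumed payload shape
        "application": "/myapp/transform.py",        # hypothetical application file
        "conf": {
            "spark.driver.cores": "1",               # size of the Spark driver
            "spark.driver.memory": "4g",
            "spark.executor.cores": "1",             # size of each executor
            "spark.executor.memory": "4g",
            "spark.executor.instances": "2",         # number of executors
        },
    }
}

response = requests.post(
    JOBS_ENDPOINT,
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    verify=False,  # many clusters use self-signed certificates
)
print(response.status_code, response.text)
```

Because each submission gets its own dedicated cluster, the sizes you pass with the job determine the resources that are provisioned for that run and released when the job completes.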

In IBM Cloud Pak for Data, you can run Spark workloads in two ways:

  • In a notebook that runs in a Spark environment in a project in Watson Studio
  • Outside Watson Studio, in an IBM Analytics Engine powered by Apache Spark instance using Spark job APIs

Spark environments in projects

If you have the Watson Studio service installed, the IBM Analytics Engine powered by Apache Spark service automatically adds a set of default Spark environment templates to projects. You can also create custom Spark environment templates in a project.

You can see Spark environment templates under Templates on the Environments page on the Manage tab of your project.

For more details, see Spark environments.
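As a minimal sketch, a notebook cell that runs in a Python Spark environment can use PySpark directly. The example below assumes an environment where the pyspark package is available; in such environments a SparkSession is typically already initialized, and getOrCreate() reuses it.

```python
# Minimal notebook cell for a Python Spark environment (sketch).
from pyspark.sql import SparkSession

# Reuse the session the environment provides, or create one locally.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    ["name", "age"],
)
df.filter(df.age > 30).show()
```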

Spark APIs

You can run Spark workloads directly in IBM Analytics Engine powered by Apache Spark using Spark job APIs.

You can run these types of workloads with Spark job APIs:

  • Spark applications that run Spark SQL
  • Data transformation jobs
  • Data science jobs
  • Machine learning jobs

See Getting started with Spark applications.
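As an illustrative sketch, the application that you submit through the Spark job APIs is typically a self-contained script such as the hypothetical transform.py below, which runs a simple Spark SQL aggregation. The input and output paths are placeholders for whatever storage your instance is configured to use.

```python
# transform.py (hypothetical): a self-contained Spark SQL application
# of the kind that could be submitted as a Spark job.
from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder.appName("sample-transform").getOrCreate()

    # Placeholder input path; substitute a volume or object-storage
    # location that is accessible to the Spark instance.
    sales = spark.read.csv("/myapp/data/sales.csv", header=True, inferSchema=True)

    # Run a simple Spark SQL aggregation.
    sales.createOrReplaceTempView("sales")
    totals = spark.sql(
        "SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region"
    )

    # Placeholder output path.
    totals.write.mode("overwrite").parquet("/myapp/data/sales_totals")
    spark.stop()


if __name__ == "__main__":
    main()
```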

Learn more

  • Spark environments
  • Getting started with Spark applications

Parent topic: Analyzing data and building models