Using Spark in RStudio
Although the RStudio IDE cannot be started in a Spark with R environment runtime, you can use Spark in your R scripts and Shiny apps by accessing Spark kernels programmatically.
RStudio uses the sparklyr package to connect to Spark from R. The sparklyr package includes a dplyr interface to Spark data frames as well as an R interface to Spark's distributed machine learning pipelines.
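For example, once a Spark connection sc exists (created as shown later in this topic), Spark data frames can be queried with familiar dplyr verbs. This is a minimal sketch using standard sparklyr and dplyr functions; the table name mtcars_spark is arbitrary:

library(sparklyr)
library(dplyr)

# assumes an existing Spark connection `sc` (see the connection steps below)
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and run on the cluster
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()   # bring the small aggregated result back into R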
By default, RStudio is FIPS-tolerant, but if you need to use sparklyr to connect to a Spark cluster, you must load the digest package from a library that is not FIPS-compliant. When you do that, RStudio is no longer FIPS-tolerant.
For more information on FIPS, refer to Services that support FIPS.
You can connect to Spark from RStudio:
- By connecting to a Spark kernel that runs locally in the RStudio container in IBM Watson Studio
- By connecting to a remote Spark kernel that runs outside of IBM Watson Studio in an Analytics Engine powered by Apache Spark service instance
RStudio includes sample code snippets that show you how to connect to a Spark kernel in your applications for both methods.
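The sample scripts use the ibmwsrspark helper functions shown later in this topic. As a generic illustration only, plain sparklyr distinguishes the two methods by the master argument of spark_connect; the remote host name and port below are placeholders:

library(sparklyr)

# local kernel: Spark runs in the same container as RStudio
sc <- spark_connect(master = "local")

# remote kernel: point sparklyr at an external Spark master
# (host name and port are placeholders)
# sc <- spark_connect(master = "spark://spark-master.example.com:7077")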
To use Spark in RStudio after you have launched the IDE:
- Locate the ibm_sparkaas_demos directory under your home directory and open it. The directory contains the following R scripts:
  - A readme with details on the included R sample scripts
  - spark_kernel_basic_local.R includes sample code that shows how to connect to a local Spark kernel
  - spark_kernel_basic_remote.R includes sample code that shows how to connect to a remote Spark kernel
  - The files sparkaas_flights.R and sparkaas_mtcars.R are two examples of how to use Spark in a small sample application (a sketch in the same spirit follows these steps)
- Use the sample code snippets in your R scripts or applications to help you get started using Spark.
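The following sketch is not the actual contents of sparkaas_mtcars.R; it only illustrates the kind of small application those files demonstrate, using sparklyr's standard ml_* interface to Spark MLlib:

library(sparklyr)
library(dplyr)

# assumes a Spark connection `sc`, obtained as in spark_kernel_basic_local.R
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_model", overwrite = TRUE)

# fit a linear regression with Spark MLlib through sparklyr's ml_* interface
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)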
Connecting to Spark from RStudio
To connect to Spark from RStudio using the sparklyr R package, you need a Spark with R environment. You can either use the default Spark with R environment that is provided or create a custom Spark with R environment. To create a custom environment, see Creating environment templates.
Follow these steps after you launch RStudio in an RStudio environment:
If you want to connect to Spark from a FIPS-enabled cluster, run this code first:
library(digest, lib.loc='/opt/not-FIPS-compliant/R/library')
library(sparklyr)
Use the following sample code to get a listing of the Spark environment details and to connect to a Spark kernel from your RStudio session:
# load the Spark R packages
library(ibmwsrspark)
library(sparklyr)

# load the available Spark kernels
kernels <- load_spark_kernels()

# display the kernels
display_spark_kernels()

# get the configuration for the first Spark kernel
conf <- get_spark_config(kernels[1])

# adjust the Spark configuration, for example the driver's maximum result size
conf$spark.driver.maxResultSize <- "1G"

# connect to the Spark kernel
sc <- spark_connect(config = conf)
Then to disconnect from Spark, use:
# disconnect
spark_disconnect(sc)
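Putting the pieces together, a session follows the pattern connect, work, disconnect. This sketch reuses the ibmwsrspark helper calls shown above; the copy_to and count query in the middle is illustrative only:

library(ibmwsrspark)
library(sparklyr)
library(dplyr)

# connect using the kernel helpers shown above
kernels <- load_spark_kernels()
conf <- get_spark_config(kernels[1])
sc <- spark_connect(config = conf)

# all Spark work happens between connect and disconnect;
# here, upload a small data frame and count its rows
cars_tbl <- copy_to(sc, mtcars, "cars", overwrite = TRUE)
cars_tbl %>% count() %>% collect()

# release the kernel when finished
spark_disconnect(sc)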
Examples of these commands are provided in the readme under /home/wsuser/ibm_sparkaas_demos.
Parent topic: RStudio