IBM Support

Building and using an RStudio Docker image for Linux on POWER

Technical Blog Post


Abstract

Building and using an RStudio Docker image for Linux on POWER

Body

In IBM Spectrum Conductor with Spark, you can integrate third-party interactive tools such as notebooks into your Spark instance group. The resources to run these third-party tools are scheduled together with other applications in your cluster. IBM Spectrum Conductor with Spark ships with only the Jupyter and Zeppelin notebooks out of the box. It is, however, quite easy to add other notebooks (for example, RStudio) yourself.

Using containers is an easy way to package software with everything needed to run it (that is, code, runtime, system tools, system libraries, and settings). Containers are also useful for packaging interactive tools for integration with IBM Spectrum Conductor with Spark. You can find Docker images through a Docker registry (for example, the public Docker registry), but most of these images are built for the Linux x86 platform. If you are using Linux on POWER, you might need to build an image yourself.

This blog details the basic steps to build the Docker container image for RStudio on Linux on POWER (Ubuntu). You can use the image built in this blog to add the RStudio notebook to your IBM Spectrum Conductor with Spark cluster.

 

System requirements

In order to build and use RStudio as a Dockerized notebook in IBM Spectrum Conductor with Spark on Linux on POWER, ensure that you meet the following prerequisites:

  • The Linux on POWER host(s) must have:
    • The Docker engine installed
    • At least 10 GB of free space on the hard disk for the image and intermediate files
    • Internet access to pull reference images and download required software
  • An IBM Spectrum Conductor with Spark cluster installed and running
  • Optional: A Docker registry to distribute the container image
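The host prerequisites above can be checked with a short pre-flight script. This is only a sketch: the 10 GB threshold comes from the list above, while the choice of /var/lib/docker (Docker's default data directory) as the location to measure is an assumption that might not match your installation.

```shell
#!/bin/sh
# Hypothetical pre-flight check for the build host; adjust paths as needed.

REQUIRED_KB=$((10 * 1024 * 1024))   # 10 GB expressed in 1 KB blocks

# Linux on POWER (little endian) hosts report ppc64le.
arch=$(uname -m)
if [ "$arch" = "ppc64le" ]; then
    echo "OK: architecture is $arch"
else
    echo "WARN: architecture is $arch, expected ppc64le"
fi

# Only checks that the docker client is on PATH, not that the daemon runs.
if command -v docker >/dev/null 2>&1; then
    echo "OK: docker client found"
else
    echo "WARN: docker client not found on PATH"
fi

# Assumption: images are stored under /var/lib/docker; fall back to /.
check_dir=/var/lib/docker
[ -d "$check_dir" ] || check_dir=/

# POSIX df output: column 4 of the second line is the available space in KB.
avail_kb=$(df -Pk "$check_dir" | awk 'NR==2 {print $4}')
if [ "$avail_kb" -ge "$REQUIRED_KB" ]; then
    echo "OK: ${avail_kb} KB free on $check_dir"
else
    echo "WARN: only ${avail_kb} KB free on $check_dir, 10 GB recommended"
fi
```

The script only prints warnings and never exits with an error, so you can run it on each candidate host and review the output.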

 

Building the RStudio Docker image on Linux on POWER

There are many ways to build a Docker image. In this blog, we create a Dockerfile for the docker build command. You can find the Dockerfile here.

In this Dockerfile, we use ppc64le/ubuntu:latest as the base for the image, install the required system tools and dependency software, and set the locale.
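To make these first steps concrete, the following sketch writes a minimal Dockerfile fragment to a file. It is not the Dockerfile referenced in this blog: the file name Dockerfile.base.sketch, the package list, and the en_US.UTF-8 locale are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Write a hypothetical Dockerfile fragment. The package list and locale are
# assumptions and are not taken from the Dockerfile referenced in this blog.
cat > Dockerfile.base.sketch <<'EOF'
FROM ppc64le/ubuntu:latest

# Avoid interactive prompts from apt during the build.
ENV DEBIAN_FRONTEND=noninteractive

# System tools and dependency software (illustrative list only).
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential cmake git wget ca-certificates locales r-base && \
    rm -rf /var/lib/apt/lists/*

# Generate and select a UTF-8 locale.
RUN locale-gen en_US.UTF-8
ENV LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8
EOF

echo "Wrote Dockerfile.base.sketch"
```

The apt cache is removed in the same RUN instruction that installs the packages, so the cached package lists do not end up in an image layer.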

RStudio provides a pre-built rstudio-server package for various x86 platforms, but not for the Linux on POWER platform. Fortunately, RStudio also provides source code that you can build on other platforms. You can download the source from the RStudio website and follow the official instructions provided by RStudio to build it inside your container. In the Dockerfile provided, we install all the build tools and the rstudio-server package built from that source within the same container. The image that is generated by this approach might be large; if you want to reduce the size of the image, you can build the package in one container and install the binary into a separate container without the build tools and dependencies.
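One way to realize the two-container idea is a Docker multi-stage build, which requires Docker 17.05 or later. The sketch below is a skeleton only, not the approach used in the provided Dockerfile: the file name is hypothetical, the RStudio build steps are intentionally left out, and the install path /usr/local/lib/rstudio-server is an assumption that you should verify against your own build.

```shell
#!/bin/sh
# Write a hypothetical multi-stage Dockerfile skeleton. It will not build
# until the omitted build steps in stage 1 are filled in.
cat > Dockerfile.multistage.sketch <<'EOF'
# Stage 1: build environment with compilers and RStudio build dependencies.
FROM ppc64le/ubuntu:latest AS builder
# Install the build tools, download the RStudio source, and build it here by
# following the official RStudio instructions (steps intentionally omitted).
# Assumption: the build installs into /usr/local/lib/rstudio-server.

# Stage 2: runtime image without the build tools.
FROM ppc64le/ubuntu:latest
# Runtime libraries and R itself still need to be installed in this stage.
COPY --from=builder /usr/local/lib/rstudio-server /usr/local/lib/rstudio-server
EOF

echo "Wrote Dockerfile.multistage.sketch"
```

Only the final stage ends up in the tagged image, so the compilers and intermediate build files from the first stage do not add to its size.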

In addition to the basic R environment and RStudio binary, there are other packages that are required for RStudio to work with Spark. The sparklyr package provides the R interface for Apache Spark. After rstudio-server is built, install sparklyr and other required packages such as “formatR”, “ggplot2”, and “knitr”, along with any other packages that you need. You can also do this later in your RStudio instance.
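As a rough illustration of this step, the sketch below assembles an install.packages() call and wraps it in a function that runs it through Rscript. The CRAN package names (sparklyr, formatR, ggplot2, knitr) and the mirror URL are assumptions; the provided Dockerfile might install the packages differently.

```shell
#!/bin/sh
# Hypothetical helper for installing the R packages; names are assumptions.
PACKAGES="sparklyr formatR ggplot2 knitr"
CRAN_MIRROR="https://cloud.r-project.org"

# Turn the space-separated list into "a","b","c" for use inside c(...).
r_vector=$(printf '"%s",' $PACKAGES | sed 's/,$//')
R_EXPR="install.packages(c(${r_vector}), repos = \"${CRAN_MIRROR}\")"

# Show the R expression without running it.
echo "$R_EXPR"

# Run the installation; requires Rscript on PATH and network access.
install_r_packages() {
    Rscript -e "$R_EXPR"
}
```

Inside a Dockerfile, the same expression could be placed in a RUN instruction; in a running container, calling install_r_packages performs the installation.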

In order for your Spark application in RStudio to connect to the Spark master service, you must add our client package as well as our examples. You can find them here.

Now that you understand the Dockerfile used to build the image for Linux on POWER, you can build the image by running the following command in the directory that contains the Dockerfile:

   docker build . -f Dockerfile.ppc64le -t cwsrstudio:1.0
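Once the build finishes, it can be worth confirming that the image exists and that it targets the POWER architecture. The helper functions below are a sketch; their names and the DOCKER override variable are hypothetical, and only the image tag cwsrstudio:1.0 comes from the build command.

```shell
#!/bin/sh
# Hypothetical helpers for checking the freshly built image.
IMAGE="cwsrstudio:1.0"
DOCKER="${DOCKER:-docker}"   # override to use a different client binary

# Print the CPU architecture recorded in the image metadata.
image_arch() {
    "$DOCKER" image inspect --format '{{.Architecture}}' "$IMAGE"
}

# List the image together with its ID and size.
list_image() {
    "$DOCKER" images "$IMAGE"
}
```

For an image built on a Linux on POWER host, we would expect image_arch to report ppc64le, and list_image shows how large the result is.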

After the image is built, you must ensure that it can be distributed to all hosts in the cluster. There are public Docker registry services to which you can upload your image. If you do not want to make your image public, you can set up your own private Docker registry and put the image in that registry so that other hosts can pull from it. If you do not want to create a registry service, you can save the image as a file, manually copy the file to every host, and load it on each host.
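The distribution options described above might look like the following sketch. The function names, the registry address used in the usage note, and the /tmp staging directory are hypothetical; the docker tag, push, save, and load subcommands are standard Docker CLI commands.

```shell
#!/bin/sh
# Hypothetical helpers for distributing the image to cluster hosts.
IMAGE="cwsrstudio:1.0"
DOCKER="${DOCKER:-docker}"   # override to use a different client binary

# Option 1: tag the image for a registry and push it.
# Usage: push_to_registry <registry-host:port>
push_to_registry() {
    "$DOCKER" tag "$IMAGE" "$1/$IMAGE" &&
    "$DOCKER" push "$1/$IMAGE"
}

# Option 2a: save the image to a tar archive.
# Usage: save_image <archive-path>
save_image() {
    "$DOCKER" save -o "$1" "$IMAGE"
}

# Option 2b: copy the archive to a host and load it there.
# Assumes SSH access to the host and docker on its PATH.
# Usage: load_image_on_host <host> <archive-path>
load_image_on_host() {
    scp "$2" "$1:/tmp/" &&
    ssh "$1" "docker load -i /tmp/$(basename "$2")"
}
```

For example, push_to_registry registry.example.com:5000 would push to a private registry at that placeholder address, while save_image cwsrstudio-1.0.tar followed by load_image_on_host for each host covers the case without a registry.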

 

Using the RStudio Docker image as a notebook in IBM Spectrum Conductor with Spark

Our blog post Running RStudio in IBM Spectrum Conductor with Spark 2.2.0 using Sparklyr and Docker introduces how to use the RStudio notebook image with IBM Spectrum Conductor with Spark on the x86 platform. After you build your Docker image for Linux on POWER, you can follow the instructions in that blog to add the notebook to your IBM Spectrum Conductor with Spark cluster, create a Spark instance group and enable the notebook, and then create and use the notebook.

 

Now that you understand how to build an RStudio Docker image on Linux on POWER for IBM Spectrum Conductor with Spark, try it out! You can download the IBM Spectrum Conductor with Spark 2.2.1 evaluation version here.

If you have any questions, post them in our forum or join us on Slack!

