Using Jupyter Docker Stack To Run R Notebooks
JeanFrancoisPuget
I'm a Python fan but I am cognizant of R being more popular than Python among data scientists. The combination of Pyth
Can we do the same with R? I decided to test a combination of R, Jupyter notebooks, and Docker. To make an unbiased test, I further decided to use R code not written by me (which probably is a good thing given I am an R novice). I selected the code from this interesting comparison between R and Python: R vs
Using Docker spared me all the trial and error process to inst
Our target architecture is better described with a picture.
We will run a Docker container inside a Docker engine. That Docker engine is a Boot2Docker VM created when we inst
The benefit of this architecture is that we can run the exact same container in any Docker engine without any portability issue. In particular, many public clouds now offer Docker engines (see the list here). We can deploy our container to any of these seamlessly thanks to Docker. Before that, let's make sure we have a working container!
The first step is to install Docker. If you are using a Windows machine, then I published installation instructions in Installing Boot2Docker For Windows, and Installing Docker Toolbox On Windows. The latter is the new recommended way, but the former is still a valid option IMHO.
In the following, I assume you have a Docker installation up and running. I will use my local Docker Toolbox installation, but most steps can be replicated 'as is' with any other Docker engine.
Second step is to launch the Docker engine. Using my Docker Toolbox install, I click on the Docker Quickstart icon on the Windows desktop. This launches a bash terminal I will be using from now on.
The docker-machine tool can be used to manage Docker engines. When Quickstart starts, it launches a default engine. On my machine its IP address is 192.168.99.100, as printed. As explained in Installing Docker Toolbox On Windows, I am using another engine called dev, so I first need to stop the default engine, then start my dev engine before proceeding. If you are using the default engine, you can skip this step; just keep track of the default engine's IP address.
docker-machine stop default
Once dev is running, I print its IP address as I will need it to access my notebooks. This address is 192.168.99.101.
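The stop command above is only half of the switch. Here is a minimal sketch of the remaining two steps, assuming the second engine is named dev as on my machine. The ENGINE variable is mine, and the guard merely skips the calls where docker-machine is not installed.

```shell
# Name of the engine to switch to; "dev" is the name used in this post.
ENGINE=dev

if command -v docker-machine >/dev/null 2>&1; then
  # Boot the VM, then print the address we will need in the browser later.
  docker-machine start "$ENGINE" && docker-machine ip "$ENGINE"
else
  echo "docker-machine not found; run this from the Docker Quickstart terminal"
fi
```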
Third step is to download the image. Although Docker now provides a Windows client, I find it easier to use the Docker client inside the Docker engine. We therefore log into that engine using the following. If you're using the default engine:
docker-machine ssh default
In my case I use:
docker-machine ssh dev
This gets us inside the Boot2Docker machine. We then download the image with:
docker pull jupy
This can take quite a while depending on your network connection. If you experience errors during the download, like time out, or unexpected EOF, simply run the command again. It is my experience that docker pull is rather robust in those cases.
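If rerunning the command by hand gets tedious, the retry can be scripted. The helper below is a generic sketch of mine, not something Docker provides: retry and RETRY_DELAY are names I made up, and IMAGE in the example comment stands for the image name used in the pull command above.

```shell
# retry N CMD...: run CMD until it succeeds, at most N times.
# The pause between attempts defaults to 5 seconds (override with RETRY_DELAY).
retry() {
  max=$1
  shift
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying..." >&2
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-5}"
  done
}

# Example (IMAGE is a placeholder you set yourself):
# retry 5 docker pull "$IMAGE"
```

Since docker pull resumes with the layers it already has, each new attempt should have less to download than the previous one.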
Fourth step is to launch a container once the image download is complete. If we do this blindly, then our notebooks won't persist beyond the container's lifetime. One way to add persistence is to mount a directory of my Windows machine as a volume in the container using the -v option. The command to launch the container is:
docker run -d -p 8888:8888 -v /c/U
Let's look at options in detail.
-d runs the container in detached mode, i.e. in the background, so it keeps running until it is stopped
-p 8888:8888 maps ports. Notebooks listen to the 8888 port of their host. Their host is the container, hence notebooks listen to the 8888 internal port of the container. By default containers do not expose their internal ports outside. We can tell the container to expose its internal port via the -p option. In addition, we can select which external port is mapped to the internal port. For simplicity we use the same port, i.e. 8888.
Let me explain a bit more the volume definition with this picture.
The Docker engine (here a Boot2Docker VM) auto mounts the directory C:\Users\ of the host machine (here Windows) and attaches it to the /c/Users/ directory of the VM. Then the -v option above mounts the /c/U
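The two mappings are easier to read when the command is assembled from named parts. This is only a sketch: every value below is a hypothetical placeholder I made up for illustration, not the image or the paths used in this post.

```shell
# All values are hypothetical placeholders; replace them with your own.
IMAGE="your-image"                            # the image pulled earlier
HOST_PORT=8888                                # port exposed by the Docker engine
CONTAINER_PORT=8888                           # port the notebook server listens on
VM_DIR="/c/Users/me/notebooks"                # directory as seen from the Boot2Docker VM
CONTAINER_DIR="/home/jovyan/work/notebooks"   # mount point inside the container

# -p takes EXTERNAL:INTERNAL, -v takes SOURCE:TARGET.
RUN_CMD="docker run -d -p ${HOST_PORT}:${CONTAINER_PORT} -v ${VM_DIR}:${CONTAINER_DIR} ${IMAGE}"

# Print the command for review before pasting it into the engine's shell.
echo "$RUN_CMD"
```

In both options the left side belongs to the Docker engine and the right side to the container.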
We check that our container is running with:
docker ps
You should get a display similar to this one. We see that a fancy name has been given to the container. We can either use that name (elated_stallman), or use the beginning of the hex string (55e) to refer to that container if need be.
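If the full listing is too wide for the terminal, docker ps can be limited to the columns that matter here. This is a small sketch using its --format option. The PS_FORMAT variable is mine, and the guard only avoids an error where the docker client is not available.

```shell
# Columns to show: short container ID, generated name, port mapping.
PS_FORMAT='{{.ID}}  {{.Names}}  {{.Ports}}'

if command -v docker >/dev/null 2>&1; then
  docker ps --format "$PS_FORMAT" || echo "could not reach the Docker daemon"
else
  echo "docker client not found; run this inside the Docker engine"
fi
```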
Running the container launches a Jupyter server that listens to port 8888 of the Docker engine. This Docker engine is a Boot2Docker VM, not the Windows machine itself. Therefore we need to use the address of the engine. Fortunately we noted it earlier: it is 192.168.99.101 if we use the dev engine, and 192.168.99.100 if we use the default engine. Note that these values may be different on your machine.
In our case, the Jupyter server is accessible at http
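The address is simply the engine IP followed by the mapped port. Here is a small sketch, assuming plain http, no extra path, and the dev engine address printed earlier. The variable names are mine.

```shell
ENGINE_IP=192.168.99.101   # dev engine on my machine; yours may differ
PORT=8888                  # external port chosen with -p

# On your own setup the IP can be read with: docker-machine ip dev
NOTEBOOK_URL="http://${ENGINE_IP}:${PORT}"
echo "$NOTEBOOK_URL"
```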
The working directory is the home directory of the user inside the docker container. We can find what it is by opening a terminal on Jupyter. We do so via the drop down menu on the right in Jupyter home page:
This runs bash where the notebook server is, i.e. inside our container. A simple call to pwd tells us we are in /home/jovyan/work. This is why we mounted our Windows directory to a subdirectory of /home/jovyan/work. We can see our directory when listing the content of /home/jovyan/work.
We can now create our first notebook. We first click on the notebook directory icon in the Jupyter page. This shows the content of that directory. This is the content of the C:\U
We can now create a notebook with the drop down menu on the right. Select R as option.
This launches a new R notebook:
Let's rename it into R-test using the left drop down menu.
We can now enter code. As said above, I used the code from http
We then enter our text. This text uses markdown for formatting. Basic markdown syntax can be found here. In our case we entered this text:
This notebook contains the R code from [R vs Python: head to head data anal
After executing the cell (via Shift+Enter) we get this. The text is rendered as HTML, and a new cell is created below it.
This code requires a dataset that can be downloaded from here. I saved it in my
I then entered each code snippet in a cell and executed it via Shift+Enter, which runs the cell and moves to the next one. The only change from the original code is the location of the data set.
So far so good. The warnings are nothing to worry about: R simply tells us that some columns in our data set are not numerical columns, hence their mean cannot be computed.
Next code snippet triggers an error:
The issue is that we need to install the R package GGally. This can be done via a simple R call. We provide the name of the package and one of the CRAN repositories as arguments to install.packages:
We will run this in a cell we create before the offending one. We click on our cell 5, then use the Insert drop down menu to insert a cell before it.
Running it triggers another error.
The issue is that R tries to install the package inside the Spark installation. We would rather install it in the main R library. We can get its location by executing library() in a cell we create above our cell.
Running it yields a list of all installed packages, for each library. This is shown in a popup window.
We see that the main library is located at /opt
This time the installation runs fine!
This is nice, but we would reinstall the package each time we rerun the notebook. Let's therefore do some cleaning (we'll see another way to install packages without messing with the notebook below). We first delete the two cells that contain the calls to library() and install.packages() using the Edit drop down menu. We then re-execute the last cell. This time it executes fine, and we see that the expected display is there.
We can enlarge the notebook to see it all if we wish to. Anyway, let's proceed with additional code. Everything looks fine until this code snippet.
We must install the R package called randomForest. We could install it as we installed the previous package, via:
But, as before, we will need to remove that cell. Let's install the package directly from R without polluting our notebook instead. We will use the terminal we opened above. If you have closed that terminal, then you can create a new one. In it, we just type R on the command line.
Typing R starts it:
We then start the installation by entering the previous command. The installation completes gracefully.
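The same installation can also be done without entering the interactive R prompt, by handing the call to Rscript from the terminal. The sketch below makes a few assumptions of mine: Rscript is on the PATH next to R, the CRAN mirror URL is just one possible choice, and install_r_pkg is a helper name I made up.

```shell
CRAN_REPO="https://cloud.r-project.org"   # assumption: any CRAN mirror works

# install_r_pkg PKG [LIB]: install PKG, optionally into library directory LIB.
install_r_pkg() {
  if [ -n "${2:-}" ]; then
    Rscript -e "install.packages('$1', repos = '$CRAN_REPO', lib = '$2')"
  else
    Rscript -e "install.packages('$1', repos = '$CRAN_REPO')"
  fi
}

# Example, from the Jupyter terminal. Pass the main library path reported by
# library() as a second argument if the default library is write protected:
# install_r_pkg randomForest
```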
Once the installation is complete, we quit R via CTRL-D and go back to our notebook. We need to restart the kernel in order to benefit from the newly installed package. We do so by selecting the restart item in the Kernel drop down menu.
We then select Clear all outputs & restart in the confirmation window that pops up.
The R kernel is restarted, and our notebook is refreshed. We then execute it all by selecting Run All in the Cell drop down menu.
This executes cells one by one until the end or until an error happens.
This time everything goes fine, and we keep adding code. We will find yet another R package missing, namely rvest, that we install via
We could run this command via a cell in the notebook, or via a terminal. Let me show you a third way, this time by logging into the container from our Quickstart terminal. Remember, we are logged into the dev VM in that terminal. We then type:
docker exec -it 55e bash
Remember, 55e is the start of the hex string that identifies the container. We could have used the container name too:
docker exec -it elated_stallman bash
This brings us inside the container in a bash shell. Typing R starts it:
We then cut and paste the install.packages command above. Once the package is installed, we need to restart the kernel and execute all cells, as documented before. With that we are able to run all the code of R vs
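The interactive session can be skipped altogether by letting docker exec run the R call directly. This is a sketch under assumptions of mine: Rscript is on the container's PATH, the CRAN mirror URL is one choice among many, and install_in_container is a hypothetical helper name.

```shell
# install_in_container CONTAINER PKG: install an R package in a running
# container, identified by its name or by the start of its hex id.
install_in_container() {
  docker exec "$1" Rscript -e "install.packages('$2', repos = 'https://cloud.r-project.org')"
}

# Example, from the engine's shell:
# install_in_container elated_stallman rvest
```

The kernel still needs a restart afterwards, exactly as with the other two ways.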
Let's pause for a moment. What did we have to do to run R code in our notebook? We basically had to install some R packages, which is to be expected given the high number of available packages. The only minor issue was that R tried to install packages inside the Spark directory, which is write protected. That was easy to fix. We have seen 3 ways to install packages. There are more, for instance using conda, but I'll stop here for now. Whether you install packages directly in a notebook cell, in a Jupyter terminal, or by logging into the container is a matter of taste. Select the one that fits you best.
I found that running R in a notebook was not only doable, but pleasant. I'd like to know if RStudio specialists would agree with me. I noticed two other minor issues. First, my notebook directory becomes flooded with PDF files created by R when rendering plots. I had to remove them manually after running the notebook. Second, I haven't found how to cut and paste into the Jupyter terminal.
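The manual cleanup of plot files could be scripted from the Jupyter terminal. The helper below is a sketch of mine, with a made-up name, and it is deliberately blunt: it deletes every PDF sitting directly in the given directory, so do not point it at a directory that holds PDF files you want to keep.

```shell
# clean_plot_pdfs DIR: delete PDF files directly inside DIR (subdirectories
# are left alone) and print the name of each file removed.
clean_plot_pdfs() {
  find "$1" -maxdepth 1 -type f -name '*.pdf' -print -delete
}

# Example (replace the argument with your own notebook directory):
# clean_plot_pdfs /path/to/notebook/directory
```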
Updated on October 27. Documented how to use the Jupyter terminal instead of docker exec -it, as suggested in Peter Parente's comment.
Updated on October 25. A previous version of the jupy