How and why to run Ray — an open technology for fast and simple distributed computing — on IBM Cloud Code Engine.
What is IBM Cloud Code Engine?
IBM Cloud Code Engine is a fully managed, serverless platform that runs your containerized workloads, including web apps, microservices, event-driven functions and batch jobs with run-to-completion characteristics. Code Engine even builds container images for you from your source code. Because these workloads are all hosted within the same Code Engine platform, all of them can seamlessly work together. The Code Engine experience is designed so that you can focus on writing code and not on the infrastructure that is needed to host it.
Advanced users can benefit from the open Kubernetes API that Code Engine exposes in order to run technologies like Ray.
What is Ray?
Ray is an open technology for “fast and simple distributed computing.” It makes it easy for data scientists and application developers to run their code in a distributed fashion. It also provides a lean and easy interface for distributed programming with many different libraries, best suited to perform machine learning and other intensive compute tasks.
The following are a few features offered by Ray:
- Single machine code can be parallelized with little-to-zero code changes.
- Scale everywhere: Run the same code on laptops, multi-core machines, any cloud provider or on a Kubernetes cluster.
- Out-of-the-box scalable machine learning libraries. There is a large ecosystem of applications, libraries and tools on top of the core Ray to enable complex applications.
- It is open source.
As noted above, Ray also runs on Kubernetes. Since Code Engine exposes Kubernetes APIs and runs containers at high scale in a serverless fashion, we looked at bringing the two together in order to “boost your serverless compute.”
Why should I run Ray on Code Engine?
- You don’t want to manage clusters or infrastructure. Instead, you want to value your time and focus on your job and task.
- You don’t want to wait for your local computer to process a large data set, but want to simply outsource the heavy compute into the cloud.
- You don’t want to pay for infrastructure or containers when you have nothing to run. Instead, you want automatic scaling to zero, so that idle time does not incur any charges.
- You don’t want to use different programming models, but want to have the same developer experience locally and in the cloud.
The idea is to run the Ray nodes as containers in the namespace that belongs to a Code Engine project. The configuration of the containers, the scaling and the submission of tasks are handled entirely by Ray. Code Engine only provides the compute infrastructure to create and run the containers.
For further reading about the Code Engine architecture, check out “Learning about Code Engine architecture and workload isolation.”
For further reading about Ray and how autoscaling works, see the overview in “Ray Cluster Overview.”
How to run your Ray task on IBM Cloud Code Engine in a few steps
Before you get started, make sure you have the latest CLIs for Code Engine, Kubernetes and Ray installed.
Install the prerequisites for Code Engine and Ray
- Get ready with IBM Code Engine
- Set up the Kubernetes CLI
- Install Python3 libraries for Ray and Kubernetes:
- Follow the instructions here for your operating system and Python3 version. For example, in our macOS-based installation, we used Python 3.7 (`brew install python@3.7`) and installed Ray accordingly with `pip3 install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-macosx_10_13_intel.whl`.
- Install Kubernetes libraries: `pip3 install kubernetes`
- Verify that you can run `ray` in your terminal and that you can see the ray command line help.
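The checklist above can also be verified from Python. The following is a minimal sketch, not an official tool: it assumes the IBM Cloud CLI binary is named `ibmcloud`, and it only checks that the tools and modules can be found, not their versions and not whether the Code Engine plug-in is installed. The function name `check_prerequisites` is made up for this illustration.

```python
import importlib.util
import shutil

# Command line tools and Python modules named in the prerequisites list.
CLI_TOOLS = ("ibmcloud", "kubectl", "ray")
PY_MODULES = ("ray", "kubernetes")


def check_prerequisites():
    """Map each required CLI tool and Python module to True if it was found."""
    found = {}
    for name in CLI_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        found[f"cli:{name}"] = shutil.which(name) is not None
    for name in PY_MODULES:
        # find_spec returns None when a top-level module is not installed.
        found[f"module:{name}"] = importlib.util.find_spec(name) is not None
    return found


if __name__ == "__main__":
    for item, ok in check_prerequisites().items():
        print(f"{'ok' if ok else 'MISSING':8}{item}")
```

Anything reported as MISSING points back to the corresponding installation step above.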
Step 1: Define your Ray cluster
In order to start a Ray cluster, you need to define the cluster specification and where to run the Ray cluster. Since Code Engine exposes the Kubernetes API, we can use the Kubernetes provider from Ray.
To tell Ray which Kubernetes environment to use, run the following commands:
- If not already done, create a project in Code Engine:
- Select the Code Engine project and switch the `kubectl` context to the project:
- Extract the Kubernetes namespace assigned to your project. The namespace can be found in the NAME column in the output of the command:
- Export the namespace:
With the following command you can download a basic Ray cluster definition and customize it for your namespace:
The downloaded example describes a Ray cluster with the following characteristics:
- A cluster named `example-cluster` with up to 10 workers using the Kubernetes provider pointing to your Code Engine project
- A head node type with 1 vCPU and 2 GB memory using the Ray image rayproject/ray:1.5.1
- A worker node type with 0.5 vCPU and 1 GB memory using the Ray image rayproject/ray:1.5.1
- The default startup commands to start Ray within the container and listen to the proper ports
- The autoscaler upscale speed is set to 10 for quick upscaling in this short and simple demo
Please see the section “One final note for advanced Kubernetes users” below for more information.
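To make the characteristics listed above more concrete, here is a simplified sketch that rebuilds them as a Python dictionary and prints it as JSON. It is not the downloaded file: startup commands and port settings are omitted, and the names `head_node`, `ray-node`, `pod_template` and `build_cluster_config` are assumptions made for this illustration. The top-level keys follow the Ray cluster configuration format as we understand it, so compare against your downloaded `example-cluster.yaml` before relying on any of them.

```python
import json
import os


def pod_template(image, cpu, memory):
    """Build a minimal Pod definition with identical requests and limits."""
    resources = {"cpu": cpu, "memory": memory}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {
            "containers": [
                {
                    "name": "ray-node",  # hypothetical container name
                    "image": image,
                    # Requests and limits are kept identical on purpose,
                    # see the notes for advanced Kubernetes users.
                    "resources": {
                        "requests": dict(resources),
                        "limits": dict(resources),
                    },
                }
            ]
        },
    }


def build_cluster_config(namespace, image="rayproject/ray:1.5.1", max_workers=10):
    """Return a simplified cluster definition mirroring the list above."""
    return {
        "cluster_name": "example-cluster",
        "max_workers": max_workers,
        "upscaling_speed": 10,        # quick upscaling for the short demo
        "idle_timeout_minutes": 1,    # idle workers are removed after a minute
        "provider": {"type": "kubernetes", "namespace": namespace},
        "head_node_type": "head_node",
        "available_node_types": {
            "head_node": {"node_config": pod_template(image, "1", "2G")},
            "worker_node": {
                "min_workers": 0,
                "max_workers": max_workers,
                "node_config": pod_template(image, "500m", "1G"),
            },
        },
    }


if __name__ == "__main__":
    # NAMESPACE is the variable exported earlier; the fallback is a placeholder.
    namespace = os.environ.get("NAMESPACE", "<your-namespace>")
    print(json.dumps(build_cluster_config(namespace), indent=2))
```

Printing the structure this way makes it easy to see where the namespace, the worker limit and the per-node resources end up in the definition.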
Step 2: “Ray up” your cluster
Now you can start the Ray cluster by running `ray up example-cluster.yaml`. This command creates the Ray head node as a Kubernetes Pod in your Code Engine project. When you create the Ray cluster for the first time, it can take up to three minutes until the Ray image is downloaded from the Ray repository. Ray will emit “SSH still not available” error messages until the image has been downloaded and the pod has started.
Step 3: Submit a very simple but distributed Ray program
The following very simple Ray program illustrates how tasks can be executed in the remote Ray cluster. The function f simulates a workload and is called 200 times. The 200 invocations are distributed among the Ray cluster nodes:
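A minimal sketch of such a program is shown here. It is written for illustration and is not necessarily identical to the original listing: the one-second sleep, the helper `count_by_host` and the way the host address is looked up are assumptions. It uses `ray.init(address="auto")` to attach to the cluster that is already running on the node where the script is started.

```python
import socket
import time
from collections import Counter

NUM_TASKS = 200  # number of invocations of f described in the text


def count_by_host(hosts):
    """Return a dict mapping each host to the number of tasks it ran."""
    return dict(Counter(hosts))


def main():
    try:
        import ray  # imported lazily so count_by_host works without Ray
    except ImportError:
        print("Ray is not installed; nothing to run.")
        return
    try:
        # "auto" attaches to the Ray cluster already running on this node.
        ray.init(address="auto")
    except ConnectionError:
        print("No running Ray cluster found; start one with 'ray up' first.")
        return

    @ray.remote
    def f(index):
        # Simulated workload: report where the task runs, then sleep.
        host = socket.gethostbyname(socket.gethostname())
        print(f"inside f({index}/{NUM_TASKS}), host={host}")
        time.sleep(1)
        return host

    # Launch all tasks first, then block until every result is available.
    futures = [f.remote(i) for i in range(1, NUM_TASKS + 1)]
    hosts = ray.get(futures)

    print("summary:")
    for host, count in sorted(count_by_host(hosts).items()):
        print(f"  {count} tasks on {host}")


if __name__ == "__main__":
    main()
```

Launching all 200 tasks before calling `ray.get` is what allows the work to spread across nodes; calling `ray.get` inside the loop would serialize the tasks.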
After copying the above program to a file named example.py, you can run it by submitting it to the Ray cluster:
This will copy the Python program to the Ray head node and start it there. In parallel, the autoscaler will begin to scale up the Ray cluster with additional worker nodes. The Python program will be copied to the newly created worker nodes and run there in addition to the head node. The autoscaler scales up the Ray cluster in chunks until the maximum number of workers (10) is reached as defined in example-cluster.yaml with available_node_types.worker_node.max_workers. Finally, one minute (see idle_timeout_minutes in example-cluster.yaml) after the program has terminated, the autoscaler starts to downscale the Ray cluster until all worker nodes are deleted and only the head node is left.
Example output: Program output and autoscaler event messages
- 172.30.192.43 is the head node
- inside f(<index>/200), host=<ip> marks the n-th invocation of function f, with index = 1, 2, …, 200
- The output ends with a summary that shows the distribution of the 200 function calls among the 11 nodes of the Ray cluster (1 head node and 10 worker nodes). Because the Ray cluster started with no worker nodes and the autoscaler needed some time to create them all, the distribution among the nodes is not equal. If you resubmit the example within a minute, while all worker nodes are still running, the 200 tasks are distributed much closer to an ideal, equal distribution:
The output ends with a summary:
Step 4: Monitor your cluster
You can monitor the execution of the Ray tasks and the autoscaling of resources by running the following:
Alternatively, you can bring up the Ray dashboard and monitor the cluster with your browser:
Then open http://127.0.0.1:8265 in your browser to view the dashboard:
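If the page does not load, a quick check from Python can tell whether anything is answering on that address. This is a small sketch using only the standard library; it assumes the dashboard is being forwarded to the local address given above, and the function name `dashboard_is_up` is made up for this illustration.

```python
import http.client
import urllib.request

DASHBOARD_URL = "http://127.0.0.1:8265"  # local address given in the text


def dashboard_is_up(url=DASHBOARD_URL, timeout=2.0):
    """Return True if an HTTP GET on the URL answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts and HTTP error responses.
        return False


if __name__ == "__main__":
    state = "reachable" if dashboard_is_up() else "not reachable"
    print(f"Ray dashboard at {DASHBOARD_URL} is {state}")
```

A "not reachable" result usually means that the dashboard command is no longer running in its terminal or that the cluster has been torn down.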
Step 5: “Ray down” your cluster
When you’re done, you can tear down the cluster by running `ray down example-cluster.yaml`.
One final note for advanced Kubernetes users
Let’s explain a few specifics of IBM Cloud Code Engine that might be interesting for advanced Kubernetes users. When running your Kubernetes workload (Pods, Deployments) in Code Engine, the following aspects should be kept in mind:
- A default service account exists that has specific permissions and cannot be changed.
- Roles and role bindings cannot be managed, and the existing service accounts must be used.
- Resource requests and resource limits must be identical.
- Image pull secrets can be created and used as expected to pull images from a private registry.
- imagePullPolicy must be set to Always or left unset.
- Config maps and secrets can be managed and used as expected to configure the Pods.
- Services must be of type ClusterIP only.
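Some of the rules above can be checked mechanically before you submit a workload. The following sketch inspects Pod and Service specs given as plain Python dictionaries (for example, loaded from a manifest). The function names `check_pod_spec` and `check_service_spec` are made up for this illustration, only three of the listed rules are covered, and the sketch does not verify that resources are set at all.

```python
def check_pod_spec(pod_spec):
    """Return a list of problems found in a Pod spec dict (empty if none)."""
    problems = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        # Rule: resource requests and resource limits must be identical.
        if resources.get("requests") != resources.get("limits"):
            problems.append(f"{name}: resource requests and limits differ")
        # Rule: imagePullPolicy must be Always or unset.
        policy = container.get("imagePullPolicy")
        if policy not in (None, "Always"):
            problems.append(f"{name}: imagePullPolicy is {policy!r}")
    return problems


def check_service_spec(service_spec):
    """Return a list of problems found in a Service spec dict."""
    # Kubernetes treats a missing type as ClusterIP.
    service_type = service_spec.get("type", "ClusterIP")
    if service_type != "ClusterIP":
        return [f"service type is {service_type!r}, expected 'ClusterIP'"]
    return []


if __name__ == "__main__":
    example = {
        "containers": [
            {
                "name": "ray-node",  # hypothetical container name
                "imagePullPolicy": "IfNotPresent",
                "resources": {"requests": {"cpu": "1"}, "limits": {"cpu": "2"}},
            }
        ]
    }
    for problem in check_pod_spec(example):
        print(problem)
```

Running the example prints one line for the mismatched resources and one for the pull policy, which are the two rules the sample spec violates.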
The example above showed a lean way to run a Ray cluster in an IBM Cloud Code Engine project in a pure serverless fashion, without the need to deal with any infrastructure or Kubernetes cluster management.