A demonstration of how to leverage the Ray compute framework to perform typical daily operational tasks at cloud-scale on large volumes of data on object storage.
The cloud takes on an increasingly essential role in mainstream daily IT operations, including routine procedures for working with data. Most IT users are familiar with data stored on a local disk or a network-mounted drive, using shells with interactive command line tools like cp to copy data from one drive or folder to another. A benefit of using a cloud, however, is that you can store orders of magnitude larger volumes of data, and you have orders of magnitude more processing power than on just your local machine. But many users struggle to make effective use of this, because these capabilities are not offered as conveniently as a shell terminal and simple commands like cp.
This motivated us to craft a small tooling framework that abstracts cloud data storage and typical standard operations on it in a shell command experience that looks and feels similar to current single machine terminals. In the backend, we use the highly versatile scale-out programming and runtime framework Ray to transparently and radically scale out the data processing behind the scenes on IBM Cloud's infrastructure.
As the first use case, we have implemented copying large volumes of data from one "cloud volume" (a bucket in a cloud object storage service) to another "cloud volume." To the user, this is made available as a simple command line tool to call on their client machine.
Setup and usage are straightforward. The user just performs the following sequence of operations, each represented as a single command line invocation:
- Generate a Ray cluster configuration for IBM Cloud (only needs to be done once, 1 minute)
- Execute the Ray command:
  - Deploy and launch the Ray cluster in IBM Cloud (2 minutes)
  - Run the command (for instance, the copy command in our initial use case here)
  - Tear down the Ray cluster in IBM Cloud (10 seconds)
It is indeed as simple as calling four CLI commands in a row. Of course, you can run multiple Ray commands (step 2.2) with the same Ray cluster and keep it running (steps 2.1 and 2.3) across multiple command invocations. The following section provides an overview of the architecture. The section after that provides a detailed walkthrough of the sequence of commands.
Scale-out architecture
The central element is the IBM Cloud Virtual Private Cloud (VPC) infrastructure that provides the virtual machines (i.e., Virtual Server Instances (VSIs)) to run the Ray cluster with the scale-out command in an on-demand fashion:
On the client machine, you run the ray CLI tool to provision and start the VSIs with the Ray cluster, and to stop and decommission them when done. Before doing that, there is another client tool, lithopscloud, that needs to be run once. It generates the desired VPC and VSI resource configuration to be used by the ray CLI tool when provisioning the Ray cluster. To connect to the IBM Cloud Gen2 VPC infrastructure, the ray CLI uses the gen2-connector plugin.
The ray CLI launches a Ray cluster to execute a Ray command, which accesses and processes the data in your cloud storage in a scale-out manner. Our Ray command implementation employs a map-reduce pattern to first generate a series of parallelizable tasks (for instance, by listing all objects in a bucket and creating one task for each). After that, the Ray command executes these tasks in the Ray cluster in parallel, according to the amount of resources that you configured when you used the lithopscloud tool beforehand.
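The map-reduce pattern described above can be sketched roughly as follows. This is an illustrative stand-in using only the Python standard library, not the actual Ray Command implementation; the object names and sizes are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical source listing; in the real command this comes from
# listing all objects in the source object storage bucket.
objects = [f"data/part-{i:05d}" for i in range(20)]

def copy_object(key):
    # Stand-in for downloading and re-uploading one object;
    # returns the number of bytes copied (here just a dummy value).
    return len(key)

# "Map" phase: one parallelizable task per object, executed in parallel.
with ThreadPoolExecutor(max_workers=8) as pool:
    copied = list(pool.map(copy_object, objects))

# "Reduce" phase: aggregate the per-task results into a summary.
total_bytes = sum(copied)
print(f"copied {len(copied)} objects, {total_bytes} bytes")
```

In the real tool, each task is a Ray remote task scheduled across the cluster nodes rather than a local thread.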
Open a terminal and clone the Ray-Commands git repo:
Now enter the cloned directory:
Create a client environment and install all required tools by running the following:
This installs the previously mentioned ray and lithopscloud CLI tools along with the gen2-connector plugin to interact with the IBM Cloud Gen2 VPC infrastructure.
Generate a Ray cluster configuration for IBM Cloud
First, let's generate a Ray cluster configuration for VPC in IBM Cloud. You only need to do this once; you can reuse the configuration for many Ray cluster deployments. Launch lithopscloud in your terminal and select Ray Gen2 as the compute backend (it stands for Ray on IBM Cloud Gen2 VPC infrastructure). Paste an IBM API key (if needed, create a new one for your user here). Select the IBM Cloud region, select an existing VPC in your account or choose Create new VPC, and select the availability zone where you want to create the cluster nodes:
You can keep the defaults for resource group, security group, and new SSH keys unless you have more specific requirements. There are multiple options for a base image; we tested with ubuntu 20 minimal:
Now specify the minimum and maximum number of worker nodes and the node type according to the expected resource needs of your job. More cores in total means more parallelism.
In our example, we use a static setup of 4 nodes (min and max) and pick a virtual machine node type with 8 cores and 32 GB of memory. In total, our Ray cluster will then consist of five such nodes: four worker nodes plus a Ray head node of the same size. Ray also uses the head node to run Ray processing tasks:
lithopscloud now generates the configuration in a yaml file and prints out its location, as can be seen in the above screenshot.
Mark and copy the full path to your clipboard and run the following:
This stores a file named cluster.yaml in the current directory that includes all libraries and configuration that we need to run our Ray commands.
Deploy and launch the Ray cluster in IBM Cloud
You can now bring up the Ray cluster by running the ray up command with the generated cluster.yaml file:
Type "y" when asked whether you want to launch a new Ray cluster:
The command runs for about two minutes:
Take note of the command output about your new Ray cluster: it explains how you can submit workloads, attach to the cluster, or ssh into the head node. In the following, though, we will use the prepared Ray Command CLI script from this repository to submit workloads.
You may also want to check out the cloud console for the VPC instance that was created or selected when you ran lithopscloud earlier. There you can see the Ray cluster VMs that were launched by the ray up command:
Submit the ray_cp scale-out command to the Ray cluster
The ray_cp.sh script is the scale-out implementation of a copy command for large amounts of object storage data, using the Ray cluster that we just brought up. By default, it interactively prompts you for all required parameters about your source and target locations on object storage.
To streamline its usage across multiple invocations and to allow you to embed it into automated scripts, you can provide some or all of the required parameters as environment variables. In the example screenshot below, you can see that we set the HMAC credentials (access key and secret key) and HTTPS endpoints for the input bucket and for the output bucket as environment variables. We did not set the input and output bucket names and prefix paths as variables; therefore, the tool prompts us for them interactively:
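The environment-variable-with-interactive-fallback behavior can be sketched like this. The variable and function names here are illustrative, not the exact ones the script uses:

```python
import os

def get_param(env_name, prompt):
    """Use the environment variable if set; otherwise ask interactively."""
    value = os.environ.get(env_name)
    if value:
        return value
    return input(f"{prompt}: ")

# Example: credentials provided via the environment are picked up silently;
# anything missing would be prompted for interactively instead.
os.environ["SRC_ACCESS_KEY"] = "my-hmac-access-key"   # hypothetical value
print(get_param("SRC_ACCESS_KEY", "Source access key"))
```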
After the input is gathered, the Ray command first does a listing of all objects in the source object storage bucket and prepares a set of copy tasks for them. In our example in the screenshot below, it found 8,988 objects to copy. The copy command implementation does a simple batching of multiple objects to copy with a single task. The batch size is currently set to four, so we get 2,247 tasks. Now the command submits these Ray tasks to the Ray cluster as you can see in the screenshot:
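The batching step can be sketched as follows; with 8,988 listed objects and a batch size of four, it yields exactly the 2,247 tasks mentioned above (the helper name is illustrative):

```python
import math

BATCH_SIZE = 4

def make_batches(keys, batch_size=BATCH_SIZE):
    """Group the listed object keys into fixed-size copy tasks."""
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]

keys = [f"obj-{i}" for i in range(8988)]   # stand-in for the bucket listing
tasks = make_batches(keys)
print(len(tasks))   # 8988 objects / batch size 4 = 2247 tasks
assert len(tasks) == math.ceil(len(keys) / BATCH_SIZE)
```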
As soon as the first tasks complete, the tool starts to report progress: both the accumulated data volume copied so far and the number of the overall tasks that have finished:
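This kind of incremental progress reporting can be sketched with the standard library as follows; the task function and the fixed per-object size are hypothetical stand-ins for the actual Ray task results:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def copy_batch(batch):
    # Stand-in: pretend each object in the batch is 1 MB.
    return len(batch) * 1024 * 1024   # bytes copied by this task

batches = [["a", "b", "c", "d"]] * 10   # hypothetical copy tasks
copied_bytes, done = 0, 0

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(copy_batch, b) for b in batches]
    # Report progress as soon as each task finishes, in completion order.
    for fut in as_completed(futures):
        copied_bytes += fut.result()
        done += 1
        print(f"{done}/{len(batches)} tasks, {copied_bytes} bytes copied")
```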
When all tasks have finished, the copy command prints a short elapsed time and throughput summary:
In our example, we used a Ray cluster with four workers with eight vCPUs per worker. Since a simple object copy is primarily network bound, we can safely overcommit the vCPUs by a decent factor; the current copy command implementation overcommits by a factor of five. With the four workers plus one Ray head node, eight vCPUs per node, and the vCPU overcommitting, the Ray scheduler can run (4 + 1) * 8 * 5 = 200 copy tasks in parallel. With this scheduling setup, we could copy the 2 TB of data in 8,988 objects across cloud regions, from the IBM Cloud Dallas region in the US to the IBM Cloud Frankfurt region in the EU, in about 20 minutes.
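As a worked version of that arithmetic:

```python
# Parallel task slots available to the Ray scheduler in our setup
workers = 4          # worker nodes
head_nodes = 1       # the head node also runs copy tasks
vcpus_per_node = 8
overcommit = 5       # copy tasks are network bound, so vCPUs are overcommitted

parallel_tasks = (workers + head_nodes) * vcpus_per_node * overcommit
print(parallel_tasks)   # → 200
```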
A larger Ray cluster would have allowed even more parallelism and a shorter elapsed time for the 2 TB copy. In addition to the Ray cluster size, and depending on the size and number of the objects to copy, you can tune the batch size and the vCPU overcommit factor to achieve optimal performance.
The copy command also implements a simple retry logic for tasks, accounting for the fact that in large scale-out processing, the odds are that some of the many tasks fail for a transient reason.
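Such retry behavior might look roughly like this sketch; the helper name, attempt count, and backoff policy are illustrative, not the actual implementation:

```python
import time

def with_retries(task, attempts=3, backoff_seconds=0.1):
    """Retry a task a few times to tolerate transient failures."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise   # give up after the last attempt
            time.sleep(backoff_seconds * attempt)   # simple linear backoff

# Example: a flaky task that fails once before succeeding.
calls = {"n": 0}
def flaky_copy():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("transient network error")
    return "ok"

print(with_retries(flaky_copy))   # → ok
```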
Tear down the Ray cluster
When you're done with your Ray commands, you should deprovision the Ray cluster to avoid accumulating costs for idling resources in your account. This takes just a few seconds and can be done by running ray down cluster.yaml:
You can bring up a fresh Ray cluster quickly (about two minutes) at any time by running the ray up command again.
In this article, we have demonstrated a simple way to leverage the Ray compute framework to perform typical daily operational tasks at cloud scale on large volumes of data on object storage. The Ray-Commands repository used in this demonstration includes an implementation of a copy command. It is entirely open source, and it can and should be extended with further such commands by applying the pattern of the existing copy command implementation.
Learn more about IBM Cloud Virtual Private Cloud (VPC).