DAOS on IBM Cloud VPC
23 March 2023
How to start building your own Distributed Asynchronous Object Storage (DAOS) compute cluster on IBM Cloud.

This blog post introduces the addition of the Distributed Asynchronous Object Storage (DAOS) open-source object store to our growing portfolio of highly automated, quick-deploying, IBM Cloud VPC-based high-performance computing infrastructure. We start with an overview of the technology behind the system, then cover the automation features, how to use them and how the resulting cluster performs.

This newly available Terraform/Ansible-based automation system exemplifies the paradigm shift that is taking place in the world of high-performance computing (HPC). Capabilities that until very recently could only be realized with a dedicated data center and the requisite investments in hardware and personnel (along with months of planning, deploying and testing) can now, like this DAOS system, be built and deployed in 30 minutes.

 
Distributed Asynchronous Object Storage (DAOS)

According to the DAOS architecture overview, “DAOS is an open-source, software-defined, scale-out object store that provides high bandwidth and high IOPS storage containers to applications and enables next-generation data-centric workflows combining simulation, data analytics and machine learning.”

Unlike the traditional storage stacks that were primarily designed for rotating media, DAOS is architected from the ground up to exploit new technologies and is extremely lightweight since it operates end-to-end (E2E) in the user space with full OS bypass. DAOS offers a shift away from an I/O model designed for block-based and high-latency storage to one that inherently supports fine-grained data access and unlocks the performance of the next-generation storage technologies.

VPC infrastructure

The automation described in this blog will build a DAOS storage system on the proven IBM Cloud VPC infrastructure. A DAOS system consists of multiple components working together. At the heart of DAOS are the storage servers that are built on IBM Cloud VPC bare metal servers. For our part, we have tested with up to four 48-core servers, but larger configurations are certainly possible. Each individual DAOS server offers the following: 

  • A 48- or 96-core configuration
  • Up to 768 GB of memory
  • A 100 Gbps network interface
  • 8 or 16 NVMe devices with up to 50 TB of storage

The system can be built with as many compute nodes as desired, using any IBM Cloud VPC profile. A wide range of profiles is available to suit most computing tasks, offering the following:

  • 2 – 200 vCPUs
  • 4 – 5600 GB of memory
  • Up to 80 Gbps of network bandwidth

Each cluster will also include one small instance that combines bastion and DAOS administration functions.

The automation script repository

This DAOS automation is a true embodiment of the Infrastructure as Code ideal. Building your DAOS object store system on IBM Cloud begins with a public GitHub repository that contains complete instructions, configuration files, and the Terraform and Ansible scripts needed to build out a cluster to your specification.

Cluster creation

Cluster creation with the DAOS IBM Cloud automation scripts proceeds in two phases. The first phase runs a set of Terraform scripts to provision cloud resources. It begins with filling out the Terraform variables file(s) that specify the cluster attributes. Next, Terraform is set in motion with the apply command. From there, Terraform proceeds to do the following:

  • Set up SSH keys.
  • Provision cloud network resources.
  • Create security groups to control access to cluster resources.
  • Provision storage servers.
  • Provision compute clients.
  • Provision the admin/bastion node.
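
As a rough illustration of this first phase, the sketch below renders a variables file and lists the Terraform commands that would be run. The variable names are hypothetical placeholders (the repository's own variables file defines the real ones); `terraform.tfvars.json` is a file name that Terraform loads automatically, and `-parallelism` is shown at its default of 10.

```python
import json
import subprocess

# Hypothetical variable names, for illustration only. The variables file in
# the automation repository defines the names the scripts actually expect.
cluster_vars = {
    "storage_node_count": 4,
    "compute_node_count": 16,
    "allowed_ssh_cidr": "203.0.113.0/24",  # documentation-only address range
}

def render_tfvars(variables):
    """Render the variables as JSON suitable for terraform.tfvars.json."""
    return json.dumps(variables, indent=2) + "\n"

def terraform_commands(parallelism=10):
    """Return the Terraform CLI invocations for the provisioning phase."""
    return [
        ["terraform", "init"],
        ["terraform", "apply", "-auto-approve", f"-parallelism={parallelism}"],
    ]

def provision(directory="."):
    """Write the variables file, then run Terraform in the given directory."""
    with open(f"{directory}/terraform.tfvars.json", "w") as handle:
        handle.write(render_tfvars(cluster_vars))
    for command in terraform_commands():
        subprocess.run(command, cwd=directory, check=True)
```

Calling `provision()` from a checkout of the Terraform scripts would start the build. Nothing runs until that call is made, so the helper functions can be inspected safely first.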

The time required to complete the above steps depends on the desired cluster size and can be estimated from the Terraform resource provisioning times reported in the timing section below.

The final step in the Terraform provision process creates the admin/bastion node. As part of that node’s provision process, its cloud-init function will be employed via user_data to automatically kick off the second phase of cluster configuration. The following are the major steps of the second phase:

  • Install Ansible.
  • Retrieve Ansible playbooks from a git repository.
  • Run Ansible playbooks to do the following:
    • Install DAOS server packages on the DAOS servers.
    • Install DAOS client packages on the compute clients.
    • Install DAOS admin packages on the DAOS admin instance.
    • Configure all of the above and start DAOS.
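
The second phase can be pictured as a first-boot script handed to cloud-init through user_data. The sketch below only renders such a script as text. The repository URL, playbook name, inventory file, install path and the `dnf` package names are all assumptions made for illustration; the real automation may differ on each of them.

```python
# Hypothetical placeholders: substitute the repository URL and playbook name
# that the automation actually uses.
PLAYBOOK_REPO = "https://example.com/daos-playbooks.git"
PLAYBOOK = "site.yml"

def render_user_data(repo_url, playbook, inventory="inventory.ini"):
    """Render a shell script that cloud-init can run on first boot."""
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        "# 1. Install Ansible (package manager and names depend on the OS image).",
        "dnf install -y ansible-core git",
        "# 2. Retrieve the playbooks from the git repository.",
        f"git clone {repo_url} /opt/daos-playbooks",
        "# 3. Run the playbook that installs, configures and starts DAOS.",
        f"cd /opt/daos-playbooks && ansible-playbook -i {inventory} {playbook}",
    ]
    return "\n".join(lines) + "\n"

user_data = render_user_data(PLAYBOOK_REPO, PLAYBOOK)
```

Because the script starts with `#!`, cloud-init treats it as a user-data script and executes it once during the node's first boot, which is what lets the admin/bastion node configure the rest of the cluster without manual intervention.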

The above packages are retrieved from the official DAOS package repository. The time to complete the above steps can be estimated from the Ansible install-and-configure times reported in the timing section below.

Security

The automation scripts build a cluster that employs simple and effective security practices to get you started:

  • User-supplied SSH keys
  • A jump host (bastion)
  • Firewall with only the SSH port open and restricted to your specified CIDR
  • All nodes in the cluster can only be accessed from within the VPC
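
To make the CIDR restriction concrete, the sketch below checks whether a source address falls inside an allowed range, which is the decision the SSH firewall rule makes. It uses Python's standard `ipaddress` module; the addresses come from documentation-only ranges and the function name is ours, not part of the automation.

```python
import ipaddress

def ssh_allowed(source_ip, allowed_cidr):
    """Return True if source_ip lies inside the CIDR permitted for SSH."""
    return ipaddress.ip_address(source_ip) in ipaddress.ip_network(allowed_cidr)

# Example values from documentation-only address ranges.
inside = ssh_allowed("203.0.113.25", "203.0.113.0/24")   # True
outside = ssh_allowed("198.51.100.7", "203.0.113.0/24")  # False
```

A narrow CIDR, such as a single office or VPN range, keeps the bastion's one open port reachable only from networks you control.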

From there, it is expected that you will employ the rich set of tools supplied by the IBM Cloud and the DAOS storage system to tailor these default security measures to suit your security practices as you put your cluster into production.

Time required to create a DAOS cluster

The timings we will discuss in this section were measured on varying cluster configurations in real experiments and can be used as a guideline. As always, your results may vary to some degree.

Cluster creation times were tested for three different cluster sizes:

  • One storage node with four compute nodes (1×4)
  • Two storage nodes with eight compute nodes (2×8)
  • Four storage nodes with sixteen compute nodes (4×16)

The total time to create a cluster ranged from 27 minutes for the 1×4 to 31 minutes for the 4×16, with the 2×8 falling predictably in between at just under 30 minutes. The creation time is split between the time for Terraform to provision resources and the time for the Ansible scripts to configure the storage cluster. For the 4×16, the split is 18 minutes for Terraform and 13 minutes for Ansible. Destroying a cluster and returning its resources to the cloud took between four minutes for the 1×4 and six minutes for the 4×16.

The total time to create a DAOS cluster is modest given the capabilities and features of the completed cluster. The provision times are kept down because many of the resources are provisioned concurrently using, in this case, Terraform’s default parallelism, which allows up to 10 simultaneous operations. This parallelism also explains why larger clusters require only small increases in total time.
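
The effect of that parallelism shows up in simple arithmetic on the timings quoted above. The sketch below compares how much the node count grows between the smallest and largest test clusters with how much the creation time grows. Counting one admin/bastion node per cluster follows the description earlier in this post; the helper name is ours.

```python
# (storage nodes, compute nodes, total creation time in minutes), as reported above
clusters = {
    "1x4": (1, 4, 27),
    "4x16": (4, 16, 31),
}

def node_count(storage, compute):
    """Total nodes in a cluster, including the single admin/bastion node."""
    return storage + compute + 1

small = clusters["1x4"]
large = clusters["4x16"]

node_growth = node_count(large[0], large[1]) / node_count(small[0], small[1])
time_growth = large[2] / small[2]

print(f"nodes grow {node_growth:.1f}x, creation time grows {time_growth:.2f}x")
# prints "nodes grow 3.5x, creation time grows 1.15x"
```

A cluster with 3.5 times as many nodes takes only about 15 percent longer to create, which is consistent with most of the provisioning work overlapping rather than running one node at a time.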

DAOS storage performance preview

To give you a preview of the performance of DAOS on IBM Cloud, we tested an internal development release of DAOS that contains performance features planned for the upcoming 2.4 version, which is expected to be released in late spring 2023. The results should be considered a preview of things to come and a demonstration of what is possible.

Testing was done on a cluster of four storage nodes, each with 48 cores and 8 NVMe devices (bx2d-metal-96x384 profile), and 16 compute nodes using the cx2-16x32 profile (16 vCPUs, 32 GB of memory).

For testing, we used the well-known IO500 benchmark, following a DAOS-specific methodology, and obtained the following results:

Individual IO500 test results

  • ior-easy-write: 38.687956 GiB/s (time: 346.296 seconds)
  • mdtest-easy-write: 381.627554 kIOPS (time: 435.799 seconds)
  • ior-hard-write: 7.424557 GiB/s (time: 391.666 seconds)
  • mdtest-hard-write: 157.793821 kIOPS (time: 428.967 seconds)
  • find: 276.135728 kIOPS (time: 845.487 seconds)
  • ior-easy-read: 32.216002 GiB/s (time: 415.812 seconds)
  • mdtest-easy-stat: 236.416785 kIOPS (time: 702.827 seconds)
  • ior-hard-read: 6.685864 GiB/s (time: 434.946 seconds)
  • mdtest-hard-stat: 227.684156 kIOPS (time: 297.614 seconds)
  • mdtest-easy-delete: 151.984559 kIOPS (time: 1094.312 seconds)
  • mdtest-hard-read: 211.442229 kIOPS (time: 320.404 seconds)
  • mdtest-hard-delete: 133.868635 kIOPS (time: 526.874 seconds)

IO500 Score

  • Bandwidth: 15.771351 GiB/s
  • IOPS: 210.470718 kIOPS
  • TOTAL: 57.614299
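
For readers who want to check how the individual results combine into the score, the sketch below applies the IO500 scoring rule: the bandwidth score is the geometric mean of the four ior results, the IOPS score is the geometric mean of the eight metadata results, and the total is the geometric mean of those two. The mdtest-hard-delete entry is read as 133.868635 kIOPS, the value that reproduces the published IOPS score.

```python
import math

# ior results in GiB/s, copied from the individual test results above
bandwidth_gib_s = {
    "ior-easy-write": 38.687956,
    "ior-hard-write": 7.424557,
    "ior-easy-read": 32.216002,
    "ior-hard-read": 6.685864,
}

# mdtest and find results in kIOPS, copied from the individual test results above
iops_k = {
    "mdtest-easy-write": 381.627554,
    "mdtest-hard-write": 157.793821,
    "find": 276.135728,
    "mdtest-easy-stat": 236.416785,
    "mdtest-hard-stat": 227.684156,
    "mdtest-easy-delete": 151.984559,
    "mdtest-hard-read": 211.442229,
    "mdtest-hard-delete": 133.868635,
}

def geometric_mean(values):
    """Geometric mean, computed in log space for numerical stability."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

bandwidth_score = geometric_mean(bandwidth_gib_s.values())
iops_score = geometric_mean(iops_k.values())
total_score = math.sqrt(bandwidth_score * iops_score)

print(f"Bandwidth: {bandwidth_score:.2f} GiB/s")
print(f"IOPS: {iops_score:.2f} kIOPS")
print(f"TOTAL: {total_score:.2f}")
```

Because the score is a geometric mean, the slower "hard" phases pull the result down more than an arithmetic average would, which is why the bandwidth score sits well below the ior-easy numbers.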

From our point of view, these results are quite competitive. You can judge for yourself by viewing the IO500 results featured on the main page of the IO500 web site.

Conclusion

Not long ago, distributed storage and large compute clusters were the province of the data center and were known for being slow to deploy, expensive and inflexible, which presented large challenges to even well-staffed and well-funded data centers. As we have shown in this blog, technologies like DAOS, IBM Cloud and Terraform are rapidly putting that behind us. They point to a future of quickly deployable, flexible, economical and highly performant HPC. We hope you will consider making this journey with DAOS and IBM Cloud. Please visit the DAOS on IBM Cloud automation repository and see how easy it is to get started building your own DAOS compute cluster on IBM Cloud.

Author
Greg Mewhinney, Senior Engineer, IBM Cloud Performance Engineering
Paul Mazzurana, Senior Engineer, IBM Cloud Performance Engineering