In this blog post, we introduce the addition of the Distributed Asynchronous Object Storage (DAOS) open-source object store to our growing portfolio of highly automated, quick-deploying IBM Cloud VPC-based high-performance computing infrastructure. We’ll start with an overview of the technology behind this system, then move on to the automation features, how to use it and how it performs.
This newly available Terraform/Ansible-based automation system exemplifies the paradigm shift that is taking place in the world of high-performance computing (HPC). Capabilities that until very recently could only be realized with a dedicated data center and the requisite investments in hardware and personnel (along with months of planning, deploying and testing) can now, like this DAOS system, be built and deployed in 30 minutes.
According to the DAOS architecture overview, “DAOS is an open-source, software-defined, scale-out object store that provides high bandwidth and high IOPS storage containers to applications and enables next-generation data-centric workflows combining simulation, data analytics and machine learning.”
Unlike the traditional storage stacks that were primarily designed for rotating media, DAOS is architected from the ground up to exploit new technologies and is extremely lightweight since it operates end-to-end (E2E) in the user space with full OS bypass. DAOS offers a shift away from an I/O model designed for block-based and high-latency storage to one that inherently supports fine-grained data access and unlocks the performance of the next-generation storage technologies.
The automation described in this blog will build a DAOS storage system on the proven IBM Cloud VPC infrastructure. A DAOS system consists of multiple components working together. At the heart of DAOS are the storage servers that are built on IBM Cloud VPC bare metal servers. For our part, we have tested with up to four 48-core servers, but larger configurations are certainly possible. Each individual DAOS server offers the following:
The system can be built with as many compute nodes as desired using any IBM Cloud VPC profile. A wide range of profiles is available to accommodate any computing task, offering the following:
Each cluster will also include one small instance that combines bastion and DAOS administration functions.
This DAOS automation is a true embodiment of the Infrastructure as Code ideal. Building your DAOS object store system on IBM Cloud begins with a public GitHub repository containing complete instructions, configuration files and the Terraform and Ansible scripts to build out a cluster to your specification.
Cluster creation using the DAOS IBM Cloud automation scripts is best described in two phases. The first phase executes a set of Terraform scripts to provision cloud resources. Its first step is to fill out the Terraform variables file(s) that specify the cluster attributes. Next, Terraform is set in motion with the apply command. From there, Terraform will proceed to do the following:
The time required to complete the above steps depends on the desired cluster size and can be estimated based on the Terraform Cloud resource provision times listed below.
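As a rough illustration of the first phase, the sketch below assembles the Terraform commands a wrapper script might run. The `init` and `apply` subcommands and the `-var-file` option are standard Terraform CLI features, but the variables file name `terraform.tfvars` and the helper functions are our own assumptions for illustration, not names taken from the repository.

```python
import shlex
import subprocess
from typing import List


def terraform_commands(var_file: str) -> List[List[str]]:
    """Return the phase-one Terraform commands in execution order."""
    return [
        ["terraform", "init"],
        # var_file is a hypothetical name; use the file the repository documents.
        ["terraform", "apply", f"-var-file={var_file}"],
    ]


def run_phase_one(var_file: str, dry_run: bool = True) -> List[str]:
    """Print each command and execute it only when dry_run is False."""
    rendered = []
    for cmd in terraform_commands(var_file):
        line = shlex.join(cmd)
        rendered.append(line)
        print(line)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return rendered


if __name__ == "__main__":
    # Dry run: shows what would be executed without touching any cloud resources.
    run_phase_one("terraform.tfvars")
```

Running it in dry-run mode only prints the two commands, which makes it a safe way to check the invocation before provisioning anything.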
The final step in the Terraform provision process creates the admin/bastion node. As part of that node’s provision process, its cloud-init function will be employed via user_data to automatically kick off the second phase of cluster configuration. The following are the major steps of the second phase:
The above packages are retrieved from the official DAOS package repository. The time to complete the above steps can be estimated using the Ansible Playbook install and configure line.
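To picture the hand-off from Terraform to Ansible, here is a minimal sketch of how a cloud-init `user_data` payload could start a playbook on first boot. The `#cloud-config` header and the `runcmd` key are standard cloud-init features and `ansible-playbook -i` is the standard Ansible invocation, but the paths below are hypothetical, and we assume Ansible is already installed on the node. The actual payload used by the automation may look quite different.

```python
def render_user_data(inventory: str, playbook: str) -> str:
    """Build a cloud-config document whose runcmd launches an Ansible playbook."""
    lines = [
        "#cloud-config",
        "runcmd:",
        # Runs once on first boot of the admin/bastion node.
        f"  - ansible-playbook -i {inventory} {playbook}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical paths, for illustration only.
user_data = render_user_data(
    inventory="/opt/example/inventory.ini",
    playbook="/opt/example/daos.yml",
)
print(user_data)
```

A string like this is what would be passed to the instance through the `user_data` field at provision time.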
The automation scripts build a cluster that employs simple and effective security practices to get you started:
From there, it is expected that you will employ the rich set of tools supplied by the IBM Cloud and the DAOS storage system to tailor these default security measures to suit your security practices as you put your cluster into production.
The timings we will discuss in this section were measured on varying cluster configurations in real experiments and can be used as a guideline. As always, your results may vary to some degree.
Cluster creation times were tested for three different cluster sizes:
The total time to create a cluster ranged from 27 minutes for the 1×4 to 31 minutes for the 4×16, with the 2×8 falling predictably in between at just under 30 minutes. Creation time is split between the time Terraform takes to provision resources and the time the Ansible scripts take to configure the storage cluster. For the 4×16, the split is 18 minutes for Terraform and 13 minutes for Ansible. Destroying a cluster and returning the resources to the cloud took between four minutes for the 1×4 and six minutes for the 4×16.
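The arithmetic behind these figures is easy to check. The short sketch below uses only the timings quoted above to compute each phase's share of the 4×16 build and the extra time the largest cluster costs relative to the smallest.

```python
# Timings in minutes, taken from the measurements quoted in the text.
creation_minutes = {"1×4": 27, "4×16": 31}
split_4x16 = {"terraform": 18, "ansible": 13}

# The two phases should add up to the reported total for the 4×16 cluster.
total_4x16 = sum(split_4x16.values())
assert total_4x16 == creation_minutes["4×16"]

for phase, minutes in split_4x16.items():
    print(f"{phase}: {minutes} min ({minutes / total_4x16:.0%} of total)")

extra = creation_minutes["4×16"] - creation_minutes["1×4"]
print(f"4×16 vs 1×4: +{extra} min ({extra / creation_minutes['1×4']:.0%} longer)")
```

Terraform accounts for a little under three fifths of the 4×16 build, and the largest cluster takes only four minutes longer than the smallest.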
The total time to create a DAOS cluster is modest given the capabilities and features of the completed cluster. The provision times are kept down because many of the resources are provisioned concurrently using, in this case, Terraform’s default parallelism, which allows up to 10 simultaneous operations. This parallelism also explains why larger clusters require only small increases in total time.
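To see why concurrency flattens the growth in provision time, consider a deliberately simplified model in which identical, independent resources are created in waves of at most 10. Real Terraform runs follow a dependency graph rather than neat waves, and the per-resource duration below is a made-up number for illustration, not a measurement.

```python
import math


def estimated_minutes(resources: int, minutes_each: float, parallelism: int = 10) -> float:
    """Estimate wall-clock time when resources are created in waves of `parallelism`."""
    waves = math.ceil(resources / parallelism)
    return waves * minutes_each


# Hypothetical workload: every resource takes 3 minutes to provision.
for count in (5, 20):
    parallel = estimated_minutes(count, 3.0)
    serial = estimated_minutes(count, 3.0, parallelism=1)
    print(f"{count} resources: {parallel:.0f} min in parallel, {serial:.0f} min serially")
```

In this toy model, quadrupling the resource count only doubles the parallel estimate, while the serial estimate quadruples.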
To give you a preview of the performance of DAOS on the IBM Cloud, we tested an internal development release of DAOS that contains performance features that will be available in the upcoming 2.4 version, which is expected to release in late Spring 2023. It should be considered a preview of things to come and a demonstration of what is possible.
Testing was done on a cluster with four storage nodes, each with 48 cores and 8 NVMe devices (bx2d-metal-96x384 profile), and 16 compute nodes employing the cx2-16x32 profile (16 vCPU, 32 GB memory).
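For a sense of scale, the aggregate resources of this test cluster follow directly from the per-node figures. The sketch below simply multiplies them out; the variable names are ours.

```python
# Per-node figures as stated for the test cluster.
storage_nodes = 4
cores_per_storage_node = 48
nvme_per_storage_node = 8

compute_nodes = 16
vcpus_per_compute_node = 16
memory_gb_per_compute_node = 32

totals = {
    "storage cores": storage_nodes * cores_per_storage_node,
    "NVMe devices": storage_nodes * nvme_per_storage_node,
    "compute vCPUs": compute_nodes * vcpus_per_compute_node,
    "compute memory (GB)": compute_nodes * memory_gb_per_compute_node,
}

for name, value in totals.items():
    print(f"{name}: {value}")
```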
For testing, we used the well-known IO500 benchmark employing this DAOS-specific methodology and obtained the following results:
Individual IO500 test results
IO500 Score
From our point of view, these results are quite competitive. You can judge for yourself by viewing the IO500 results featured on the main page of the IO500 web site.
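When comparing against the published list, it helps to know how the single IO500 score is composed: it is the geometric mean of a bandwidth score (itself the geometric mean of the bandwidth phases, in GiB/s) and a metadata score (the geometric mean of the metadata phases, in kIOPS). The sketch below shows that calculation. The input values are placeholders chosen for easy arithmetic and are not our measured results.

```python
import math
from statistics import geometric_mean
from typing import Sequence


def io500_score(bandwidth_gib_s: Sequence[float], metadata_kiops: Sequence[float]) -> float:
    """Combine per-phase results into a single IO500-style score."""
    bandwidth_score = geometric_mean(bandwidth_gib_s)
    metadata_score = geometric_mean(metadata_kiops)
    return math.sqrt(bandwidth_score * metadata_score)


# Placeholder inputs for illustration only; NOT the results from our test cluster.
placeholder_bandwidth = [2.0, 2.0, 8.0, 8.0]
placeholder_metadata = [10.0, 1000.0]

score = io500_score(placeholder_bandwidth, placeholder_metadata)
print(f"placeholder score: {score:.2f}")
```

Because every level is a geometric mean, a single weak phase pulls the overall score down more than it would under a simple average, which is worth keeping in mind when reading any one headline number.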
Not long ago, distributed storage and large compute clusters were the province of the data center and were known for being slow to deploy, expensive and inflexible, which presented large challenges to even well-staffed and well-funded data centers. As we have shown in this blog, the advent of technologies like DAOS, IBM Cloud and Terraform is rapidly putting that behind us. They point to a future of quickly deployable, flexible, economical, and highly performant HPC. We hope you will consider making this journey with DAOS and IBM Cloud. Please visit the DAOS on IBM Cloud automation repository and see how easy it is to get started building your own DAOS compute cluster on IBM Cloud.