What if I told you that it’s possible to teleport memory from one server and attach it to another server, without entering the datacenter building? You would probably: a) think you’re dreaming of a Star Trek episode; b) or better yet, think this is really cool and you could use it to improve the efficiency of your datacenter.
This is actually possible today, but you’re not on the USS Enterprise. You’re just getting to know what we refer to as memory disaggregation: a group of servers connected over a dedicated network fabric, where each machine can dynamically borrow memory from any other machine to extend its own main system memory. It’s no different from removing a few memory DIMMs from one box and plugging them into another, except that nobody has to physically do it.
In a fully disaggregated datacenter, the classic infrastructure based on homogeneous servers is replaced by a set of machines that act as donors of specific hardware resources. In such a datacenter you would find servers equipped with only CPUs and a minimal amount of memory for bootstrapping, machines equipped only with lots of memory DIMMs and just a small CPU for control purposes, and lastly machines equipped only with other peripherals like disks, GPUs, FPGAs, etc. Basically, the datacenter is transformed into a huge pool of resources that can be orchestrated in software and connected together to form logical servers that have just the resources needed by a specific workload. We often call this a Composable Datacenter.
The benefits brought by this paradigm shift are immense, including:
1) reduced energy consumption by powering off unused components
2) increased flexibility in performing hardware component updates
3) increased flexibility to respond to dynamic changes in requests for computational resources
4) no need for specialized machine configurations, as a software-defined datacenter can compose a server that meets the requirements of any type of workload.
Borrowing memory from a neighbouring server is the first step towards making the Composable Datacenter a reality.
At the Hot Chips conference, the IBM Power microprocessor development team has just unveiled the POWER10 microprocessor, which fully integrates this capability, coining the term “Memory Inception.” A processor chip in one system can be directly cabled to a processor chip in another system, linking those systems together and enabling them to share each other’s physical memory. Under the covers, the team constructed the capability by utilizing OpenCAPI coherent attachment, specifically the memory home agent function and the DMA write/read function, coupled to some additional logic that “tricks” each system into believing it’s attached to a device instead of another POWER10 chip.
But while the development team was incorporating this capability into POWER10, our team at IBM Research Europe in Dublin, in collaboration with colleagues from IBM Systems, has used the OpenCAPI function built into the POWER9 processor to prototype the memory disaggregation capability for software and application development, paving the way for POWER10 deployment. While the latency and bandwidth characteristics of the research prototype cannot approach those of a fully integrated POWER10 solution, due to the extra FPGA circuitry in the path, the prototype is fully operational right now on existing POWER9 systems, whereas the POWER10 processor and systems are still under development.
The IBM Research prototype is called ThymesisFlow, a full-stack memory disaggregation prototype that leverages OpenCAPI coherent attachment to enable main system memory borrowing between POWER9 servers (e.g. IBM Power System AC922). In ThymesisFlow, a machine can act either as a memory recipient (compute node) or as a memory donor (memory node). The link between machines is created using AlphaData 9V3 OpenCAPI-enabled FPGAs.
On the software side we leverage Linux memory hot-plug for seamless attachment of disaggregated memory to a running machine. We have also started thinking about how to integrate this into a cloud infrastructure while building an orchestration layer that allows dynamic attachment/detachment of memory through an easy-to-use web interface.
How it Works
A ThymesisFlow compute node uses the OpenCAPI Lowest Point of Coherence (LPC) to present the FPGA as a window in the machine’s physical address space. This allows memory transactions towards a “remote” memory location to be transparently handled by our hardware design and forwarded to the destination memory node. The operating system and applications just see some more memory available in the system. Accesses to disaggregated memory do not involve any software component, enabling true byte-addressable disaggregated memory. This is possible only thanks to OpenCAPI coherent attachment. On the memory side, instead, we leverage the OpenCAPI accelerator mode (C1 mode), which allows any off-chip peripheral to act as a master and inject transactions directly into the POWER9 SoC system bus.
Basically, memory transactions initiated from a compute node are forwarded to the memory node via the ThymesisFlow hardware and through a dedicated circuit network fabric. Transactions are replayed on the memory node system bus and a response is sent back, if needed.
ThymesisFlow prototype top view.
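To make the forwarding path a little more concrete, here is a toy software model of it. In the prototype all of this happens in hardware, so the sketch is only an illustration: the class names, the addresses, and the simple "window offset maps to the same offset in the donated buffer" translation are assumptions made for this example, not details of the ThymesisFlow design.

```python
from dataclasses import dataclass


@dataclass
class RemoteWindow:
    """Hypothetical model of the address window exposed on the compute node."""
    window_base: int  # first physical address of the window (compute node)
    size: int         # window size in bytes
    donor_base: int   # start of the donated buffer (memory node address space)

    def translate(self, compute_addr: int) -> int:
        """Map a compute-node address to the matching memory-node address."""
        offset = compute_addr - self.window_base
        if not 0 <= offset < self.size:
            raise ValueError("address is outside the disaggregated window")
        return self.donor_base + offset


class MemoryNode:
    """Hypothetical donor: a byte buffer standing in for the donated memory."""

    def __init__(self, base: int, size: int) -> None:
        self.base = base
        self.buf = bytearray(size)

    def replay(self, op: str, addr: int, data: bytes = b"", length: int = 0):
        """Replay a forwarded transaction; only reads send data back."""
        start = addr - self.base
        if op == "write":
            self.buf[start:start + len(data)] = data
            return None
        if op == "read":
            return bytes(self.buf[start:start + length])
        raise ValueError(f"unknown operation: {op}")


def store(window: RemoteWindow, node: MemoryNode, addr: int, data: bytes) -> None:
    """Forward a write issued on the compute node to the memory node."""
    node.replay("write", window.translate(addr), data=data)


def load(window: RemoteWindow, node: MemoryNode, addr: int, length: int) -> bytes:
    """Forward a read and return the response from the memory node."""
    return node.replay("read", window.translate(addr), length=length)
```

The point of the model is that the issuer of `store` and `load` only ever uses compute-node addresses; the translation and the replay on the donor are hidden behind the window, which is what lets the operating system treat the region as ordinary memory.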
The Software Stack Explained
Our software stack takes care of both the low-level memory attachment details, as well as dynamic orchestration of disaggregated memory in a multi-node system.
At the lowest level there is a separation between the compute side and the memory side. The compute node only hot-plugs a physical memory region into the running Linux kernel. This is enough to give applications transparent access to a disaggregated memory segment. On the memory side, instead, we rely on a user process that “borrows” the memory from the donor node. This process allocates the requested amount of memory and makes it available to the OpenCAPI accelerator implemented in the FPGA.
At a higher level we have designed orchestration software that oversees the whole network of nodes for memory disaggregation. The orchestrator uses a graph to represent the network connecting all the nodes. Upon a new memory request, it dynamically identifies the best memory donor node according to its memory availability and network connectivity. Once the pair of nodes is identified, the orchestrator instructs the user process on the memory node to allocate a memory buffer, and triggers the hot-plug of the new segment of disaggregated memory on the compute node. All this can be controlled from the comfort of a web interface.
ThymesisFlow is now an open source hardware/software project in the OpenCAPI GitHub organization. Visit https://github.com/OpenCAPI/ThymesisFlow to get access to our full Verilog sources. The software stack will follow shortly.
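As a rough illustration of the compute-side step, the sketch below uses the standard Linux memory hot-plug sysfs interface to bring a physical address range online. It is not taken from the ThymesisFlow software stack: the helper names are made up for this example, and it assumes the kernel has already created the memory block devices for the region and that the region is aligned to the memory block size.

```python
from pathlib import Path

# Standard location of the Linux memory hot-plug sysfs interface.
SYSFS_MEMORY = Path("/sys/devices/system/memory")


def read_block_size(sysfs: Path = SYSFS_MEMORY) -> int:
    """Return the memory block size in bytes (sysfs reports it in hex)."""
    return int((sysfs / "block_size_bytes").read_text().strip(), 16)


def memory_block_ids(start_addr: int, size: int, block_size: int) -> list[int]:
    """List the memory block ids that cover [start_addr, start_addr + size)."""
    if start_addr % block_size or size % block_size:
        raise ValueError("region must be aligned to the memory block size")
    first = start_addr // block_size
    return list(range(first, first + size // block_size))


def state_path(block_id: int, sysfs: Path = SYSFS_MEMORY) -> Path:
    """Path of the 'state' file that controls one memory block."""
    return sysfs / f"memory{block_id}" / "state"


def online_region(start_addr: int, size: int, sysfs: Path = SYSFS_MEMORY) -> None:
    """Bring every memory block of the region online (needs root on a real system)."""
    block_size = read_block_size(sysfs)
    for block_id in memory_block_ids(start_addr, size, block_size):
        state_path(block_id, sysfs).write_text("online")
```

Once the blocks are online, the kernel's allocator can hand out pages from the new range like any other memory, which is why applications need no changes. The `sysfs` parameter exists only so the helpers can be tried against a scratch directory instead of a live system.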
If you haven’t had enough yet, we have a paper appearing at MICRO 2020 where we present a full system evaluation with real cloud workloads. Title: “ThymesisFlow: A Software-Defined, Hw/Sw co-Designed Interconnect Stack for Rack-Scale Memory Disaggregation”.