Cloud scaling, Part 2: Tour high-performance cloud system design advances

Learn how to leverage co-processing, nonvolatile memory, interconnection, and storage

Breakthrough device technology requires the system designer to re-think operating and application software design in order to realize the potential benefits of closing the access gap or pushing processing into the I/O path with coprocessors. Explore and consider how the latest memory, compute, and interconnection devices and subsystems can affect your scalable, data-centric, high-performance cloud computing system design. Breakthroughs in device technology can be leveraged for transition between compute-centric and the more balanced data-centric compute architectures.

The author examines storage-class memory and demonstrates how to fill the long-standing performance gap between RAM and spinning disk storage; details the use of I/O bus coprocessors (for processing closer to data); explains how to employ InfiniBand to build low-cost, high performance interconnection networks; and discusses scalable storage for unstructured data.

Sam B. Siewert, Assistant Professor, University of Alaska Anchorage

Sam Siewert photoDr. Sam Siewert is an assistant professor in the Computer Science and Engineering department at the University of Alaska Anchorage. He is also an adjunct assistant professor at the University of Colorado at Boulder and teaches several summer courses in the Electrical, Computer, and Energy Engineering department. As a computer system design engineer, Dr. Siewert has worked in the aerospace, telecommunications, and storage industries since 1988. Ongoing interests as a researcher and consultant include scalable systems, computer and machine vision, hybrid reconfigurable architecture, and operating systems. Related research interests include real-time theory, digital media, and fundamental computer architecture.



21 May 2013

Also available in Chinese Russian

Computing systems engineering has historically been dominated by scaling processors and dynamic RAM (DRAM) interfaces to working memory, leaving a huge gap between data-driven and computational algorithms (see Resources). Interest in data-centric computing is growing rapidly, along with novel system design software and hardware devices to support data transformation with large data sets.

The data focus in software is no surprise given applications of interest today, such as video analytics, sensor networks, social networking, computer vision and augmented reality, intelligent transportation, machine-to-machine systems, and big data initiatives like IBM's Smarter Planet and Smarter Cities.

The current wave of excitement is about collecting, processing, transforming, and mining the big data sets:

  • The data focus is leading toward new device-level breakthroughs in nonvolatile memory (storage-class memory, SCM) which brings big data closer to processing.
  • At the same time, input/output coprocessors are bringing processing closer to the data.
  • Finally, low-latency, high-bandwidth off-the-shelf interconnections like InfiniBand are allowing researchers to quickly build 3D torus and fat-tree clusters that used to be limited to the most exotic and expensive custom high-performance computing (HPC) designs.

Yet, the systems software and even system design often remain influenced by out-of-date bottlenecks and thinking. For example, consider threading and multiprogramming. The whole idea came about because of slow disk drive access; what else can a program do when waiting on data but run another one. Sure, we have redundant array of independent disks (RAID) scaling and NAND flash solid-state disks (SSDs), but as noted by IBM Almaden Research, the time scale differences of the access time gap are massive in human terms.

The access time gap between a CPU, RAM, and storage can be measured in terms of typical performance for each device, but perhaps the gap is more readily understood when put into human terms (as IBM Almaden has done for illustrative purposes).

If a typical CPU operation is similar to what a human can do in seconds, then RAM access at 100 times more latency is much like taking a few minutes to access information. However, by the same comparison, disk access at 100,000 times more latency compared to RAM is on the order of months (100 days). (See Figure 1.)

Figure 1. The data access gap
Image showing the data access gap

Many experienced computer engineers have not really thought hard about the 100 to 200 random I/O operations per second (IOPS) — it is the mechanical boundary for a disk drive. (Sure, sequential access is as high as hundreds of megabytes per second, but random access remains what it was more than 50 years ago, with the same 15K RPM seek and rotate access latency.)

Finally, as Almaden notes, tape is therefore glacially slow. So, why do we bother? For the capacity, of course. But how can we get processing to the data or data to the processing more efficiently?

Look again at Figure 1. Improvements to NAND flash memory for use in mobile devices and more recently SSD has helped to close the gap; however, it is widely believed that NAND flash device technology will be pushed to its limits fairly quickly, as noted by numerous system researchers (see Resources). The transistor floating gate technology used is already at scaling limits and pushing it farther is leading to lower reliability, so although it has been a stop-gap for data-centric computing, it is likely not the solution.

Instead, several new nonvolatile RAM (NVRAM) device technologies are likely solutions, including:

  • Phase change RAM (PCRAM): This memory uses a heating element to turn a class of materials known as chalcogenides into either a crystallized or amorphous glass state, thereby storing two states that can be programmed and read, with state retained even when no power is applied. PCRAM appears to show the most promise in the near term for M-type synchronous nonvolatile memory (NVM).
  • Resistive RAM (RRAM): Most often described as a circuit that is unlike a capacitor, inductor, or resistor, RRAM provides a unique relationship between current and voltage unlike other well-known devices that store charge or magnetic energy or provide linear resistance to current flow. Materials with properties called memristors have been tested for many decades but engineers usually avoid them because of their nonlinear properties and the lack of application for them. IEEE fellow Leon Chua describes them in "Memristor: The Missing Circuit Element." A memristor's behavior can be summarized as follows: Current flow in one direction causes electrical resistance to increase and in the opposite direction resistance decreases, but the memristor retains the last resistance it had when flow is re-started. As such, it can store a nonvolatile state, be programmed, and the state read. For details and even some controversy on what is and is not a memristor, see Resources.
  • Spin transfer torque RAM (STT-RAM): A current passed through a magnetic layer can produce a spin-polarized current that, when directed into a magnetic layer, can change its orientation via angular momentum. This behavior can be used to excite oscillations and flip the orientation of nanometer-scale magnetic devices. The main drawback is the high current needed to flip the orientation.

Consult the many excellent entries in Resources for more in-depth information on each device technology.

From a systems perspective, as these devices evolve, where they can be used and how well each might fill the access gap depends on the device's:

  • Cost
  • Scalability (device integration size must be smaller than a transistor to beat flash; less than 20 nanometers)
  • Latency to program and read
  • Device reliability
  • Perhaps most importantly, durability (how often it can be programmed and erased before it becomes unreliable).

Based on these device performance considerations, IBM has divided SCM into two main classes:

  • S-type: Asynchronous access via an I/O controller. Threading or multiprogramming is used to hide the I/O latency to the device.
  • M-type: Synchronous access via a memory controller. Think about this as wait-states for RAM access in which a CPU core stalls.

Further, NAND SSD would be considered fast storage, accessed via a block-oriented storage controller (much higher I/O rates but similar bandwidth to a spinning disk drive).

It may seem like the elimination of asynchronous I/O for data processing (except, of course, for archive access or cluster scaling) might be a cure-all for data-centric processing. In some sense it is, but systems designers and software developers will have to change habits. The need for I/O latency hiding will largely go away on each node in a system, but it won't go away completely. Clusters built from InfiniBand deal with node-to-node data-transfer latency with Message Passing Interface or MapReduce schemes and enjoy similar performance to this envisioned SCM node except when booting or when node data exceeds node working RAM size.

So, for scaling purposes, cluster interconnection and I/O latency hiding among nodes in the cluster is still required.

Moving processing closer to data with coprocessors

Faster access to big data is ideal and looks promising, but some applications will always benefit from the alternative approach of moving processing closer to data interfaces. Many examples exist, such as graphics (graphics processing units, GPUs), network processors, protocol-offload engines like the TCP/IP Offload Engine, RAID on chip, encryption coprocessors, and more recently, the idea of computer vision coprocessors. My research involves computer vision and graphics coprocessors, both at scale in clusters and embedded. I am working on what I call a computer vision processing unit, comparing several coprocessors that became more widely pursued with the 2012 announcement of OpenVX by Khronos (see Resources).

In the embedded world, such a method might be described as an intelligent sensor or smart camera, methods in which preliminary processing of raw data is provided by the sensor interface and an embedded logic device or microprocessor, perhaps even a multicore system on a chip (SoC).

In the scalable world, this most often involves use of a coprocessor bus or channel adapter (like PCI Express, PCIe, and Ethernet or InfiniBand); it provides data processing between the data source (network side) and the node I/O controller (host side).

Whether processing should be done or is more efficient when done in the I/O path or on a CPU core has always been a topic of hot debate, but based on an existence proof (GPUs and network processors), clearly they can be useful, waxing and waning in popularity based on coprocessor technology compared to processor. So, let's take a quick look at some of the methods:

Vector processing for single program, multiple data
Provided today by GPUs, general-purpose GPUs (GP-GPUs), and application processing units (APUs), the idea is that data can be transformed on its way to an output device like a display or sent to a GP-GPU/APU and transformed on a round trip from the host. "General purpose" implies more sophisticated features like double-precision arithmetic compared to single precision only for graphics-specific processing.
Many core
Traditional many-core coprocessor cards (see Resources) are available from various vendors. The idea is to lower cost and power consumption by using simpler, yet numerous cores on the I/O bus, with round-trip offloading of processing to the cards for a more capable but power-hungry and costly full-scale multicore host. Typically, the many-core coprocessor might have an order of magnitude more cores than the host and often includes gigabit or 10G Ethernet and other types of network interfaces.
I/O bus field-programmable gate arrays (FPGAs)
FPGA cards, most often used to prototype a new coprocessor in the early stages of development, can perhaps used as a solution for low-volume coprocessors as well.
Embedded SoCs
A multicore solution can be used in an I/O device to create an intelligent device like a stereo ranging or time-of-flight camera.
Interface FPGA/configurable programmable logic devices
A digital logic state machine can provide buffering and continuous transformation of I/O data, such as digital video encoding.

Let's look at an example based on offload and I/O path. Data transformation has obvious value for applications like the decoding of MPEG4 digital video, consisting of a GPU coprocessor in the path between the player and a display as shown in Figure 2 for the Linux® MPlayer video decoder and presentation acceleration unit (VDPAU) software interface to NVIDIA MPEG decoding on the GPU.

Figure 2. Simple video decode offload example
Image showing an example of a simple video decode offload

Likewise, any data processing or transformation that can be done in-bound or out-bound from a CPU host may have value, especially if the coprocessor can provide processing at a lower cost with great efficiency or with lower power consumption based on purpose-built processors compared to general-purpose CPUs.

To start to understand a GP-GPU compared to a multicore coprocessor approach, try downloading the two examples of a point spread function to sharpen the edges on an image (threaded transform example) compared with the GPU transform example. Both provide the same 320x240-pixel transformation, but in one case, the Compute Unified Device Architecture (CUDA) C code provided requires a GPU or GP-GPU coprocessor and, in the other case, either a multicore host or a many-core (for example, MICA) coprocessor.


So which is better?

Neither approach is clearly better, mostly because the NVRAM solutions have not yet been made widely available (except as expensive battery-backed DRAM or as S-type SCM from IBM Texas Memory Systems Division) and moving processing into the I/O data path has traditionally involved less friendly programming. Both are changing, though: Coprocessors are adopting higher-level languages like the Open Compute Language (OpenCL) in which code written for multicore hosts runs equally well on Intel MICA or Altera Startix IV/V architectures.

Likewise, all of the major computer systems companies are working feverishly to release SCM products, with PCRAM the most likely to be available first. My advice is to assume that both will be with us for some time and operating systems and applications must be able to deal with both. The memristor, or RRAM, includes a vision that resembles Isaac Asimov's fictional positronic brain in which memory and processing are fully integrated as they are in a human neural system but with metallic materials. The concept of fully integrated NVM and processing is generally referred to as processing in memory (PIM) or neuromorphic processing (see Resources). Scalable NVM integrated processing holds extreme promise for biologically inspired intelligent systems similar to the human visual cortex, for example. Pushing toward the goal of integrated NVM, with PIM from both sides, is probably a good approach, so I plan to keep up with and keep working on systems that employ both methods—coprocessors and NVM. Nature has clearly favored direct, low-level, full integration of PIM at scale for intelligent systems.


Scaling nodes with Infiniband interconnection

System designers always have to consider the trade-off between scaling up each node in a system and scaling out a solution that uses networking or more richly interconnected clustering to scale processing, I/O, and data storage. At some point, scaling the memory, processing, and storage a single node can integrate hits a practical limit in terms of cost, power efficiency, and size. It is also often more convenient from a reliability, availability, and servicing perspective to spread capability over multiple nodes so that if one needs repair or upgrade, others can continue to provide service with load sharing.

Figure 3 shows a typical InfiniBand 3D torus interconnection.

Figure 3. Example of InfiniBand 4x4x4 3D torus with 1152 nodes (SDSC Gordon)
Image showing an example of InfiniBand 4x4x4 3D torus with 1152 nodes (SDSC Gordon)

In Figure 3, the 4x4x4 shown is for the San Diego Supercomputing Center (SDSC) Gordon supercomputer, as documented by Mellanox, which uses a 36-port InfiniBand switch to connect nodes to each other and to storage I/O.

InfiniBand, Converged Enhanced Ethernet iSCSI (CEE), or Fibre Channel is the most often used scalable storage interface for access to big data. This storage area network (SAN) scaling for RAID arrays is used to host distributed, scalable file systems like Ceph, Lustre, Apache Hadoop, or the IBM General Parallel File System (GPFS). Use of CEE and InfiniBand for storage access using the Open Fabric Alliance SCSI Remote Direct Memory Access (RDMA) Protocol and iSCSI Extensions for RDMA is a natural fit for SAN storage integrated with an InfiniBand cluster. Storage is viewed more as a distributed archive of unstructured data that is searched or mined and loaded into node NVRAM for cluster processing. Higher-level data-centric cluster processing methods like Hadoop MapReduce can also be used to bring code (software) to the data at each node. These topics are big-data-related topics that I describe more in the last part of this four-part series.


The future of data-centric scaling

This articles makes an argument for systems design and architecture that move processors closer to data-generating and -consuming devices, as well as simplification of memory hierarchy to include fewer levels, leveraging lower-latency, scalable NVM devices. This defines a data-centric node design that can be further scaled with low-latency off-the-shelf interconnection networks like InfiniBand. The main challenge with data-centric computing is not instructions-per-second or floating-point-operations-per-second only, but rather IOPS and the overall power efficiency of data processing.

In Part 1 of this series, I uncovered methods and tools to build a compute node and small cluster application that can scale with on-demand HPC by leveraging the cloud. In this article I detailed such high-performance system design advances as co-processing, nonvolatile memory, interconnection, and storage.

In Part 3 in this series I provide more in-depth coverage of a specific data-centric computing application — video analytics. Video analytics includes applications such as facial recognition for security and computer forensics, use of cameras for intelligent transportation monitoring, retail and marketing that involves integration of video (for example, visualizing yourself in a suit you're considering from a web-based catalog), as well as a wide range of computer vision and augmented reality applications that are being invented daily. Although many of these applications involve embedded computer vision, most also require digital video analysis, transformation, and generation in cloud-based scalable servers. Algorithms like Sobel transformation can be run on typical servers, but algorithms like the generalized Hough transform, facial recognition, image registration, and stereo (point cloud) mapping, for example, require the NVM and coprocessor approaches this article discussed for scaling.

In the last part of the series, I deal with big data issues.


Downloads

DescriptionNameSize
GPU accelerated image transformsharpenCUDA.zip644KB
Grid threaded comparisonhpc_dm_cloud_grid.zip1.08MB
Simple image for transform benchmarkCactus-320x240-pixel.ppm.zip206KB

Resources

Learn

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Cloud computing on developerWorks


  • Bluemix Developers Community

    Get samples, articles, product docs, and community resources to help build, deploy, and manage your cloud apps.

  • Cloud digest

    Complete cloud software, infrastructure, and platform knowledge.

  • DevOps Services

    Software development in the cloud. Register today to create a project.

  • Try SoftLayer Cloud

    Deploy public cloud instances in as few as 5 minutes. Try the SoftLayer public cloud instance for one month.

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Cloud computing, Industries
ArticleID=930579
ArticleTitle=Cloud scaling, Part 2: Tour high-performance cloud system design advances
publish-date=05212013