Cloud scaling, Part 1: Build a compute node or small cluster application and scale with HPC

Leveraging warehouse-scale computing as needed

Discover methods and tools to build a compute node and small cluster application that can scale with on-demand high-performance computing (HPC) by leveraging the cloud. This series takes an in-depth look at how to address unique challenges while tapping the efficiency of warehouse-scale on-demand HPC. The approach allows the architect to build locally for the expected workload and to spill over into on-demand cloud HPC for peak loads. Part 1 focuses on what the system builder and HPC application developer can do to scale a system and application most efficiently.

Sam B. Siewert, Assistant Professor, University of Alaska Anchorage

Dr. Sam Siewert is an assistant professor in the Computer Science and Engineering department at the University of Alaska Anchorage. He is also an adjunct assistant professor at the University of Colorado at Boulder and teaches several summer courses in the Electrical, Computer, and Energy Engineering department. As a computer system design engineer, Dr. Siewert has worked in the aerospace, telecommunications, and storage industries since 1988. Ongoing interests as a researcher and consultant include scalable systems, computer and machine vision, hybrid reconfigurable architecture, and operating systems. Related research interests include real-time theory, digital media, and fundamental computer architecture.

23 April 2013

Also available in Chinese and Russian

Exotic HPC architectures with custom-scaled processor cores and shared memory interconnection networks are rapidly being replaced by on-demand clusters that leverage off-the-shelf general-purpose vector coprocessors, converged Ethernet at 40 Gbit/s or more per link, and multicore headless servers. These new on-demand cloud HPC resources resemble what has been called warehouse-scale computing, where each node is homogeneous and headless and the focus is on total cost of ownership and overall power use efficiency. However, HPC has unique requirements that go beyond social networks, web search, and other typical warehouse-scale computing solutions. This article focuses on what the system builder and HPC application developer can do to scale a system and application most efficiently.

Moving to high-performance computing

Since 1994, the TOP500 and Green500 supercomputers (see Resources) have more often been not custom designs but rather systems designed and integrated from off-the-shelf headless servers, converged Ethernet or InfiniBand clustering, and general-purpose graphics processing unit (GP-GPU) coprocessors used not for graphics but for single program, multiple data (SPMD) workloads. The trend in high-performance computing (HPC) away from exotic custom processor and memory interconnection designs toward off-the-shelf warehouse-scale computing is driven by the need to control total cost of ownership, increase power efficiency, and balance operational expenditure (OpEx) and capital expenditure (CapEx) for both start-up and established HPC operations. This means that you can build your own small cluster with similar methods and use warehouse-scale HPC resources on demand when you need them.

The famous 3D torus interconnection that Cray and others used may never fully go away (today, the TOP500 is one-third massively parallel processors [MPPs] and two-thirds cluster architectures among the top performers), but the focus on efficiency and new OpEx metrics like the Green500's floating point operations per second (FLOPS) per Watt is driving HPC and keeping architecture focused on clusters. Furthermore, many applications of interest today are data driven (for example, digital video analytics), so many systems need not only traditional sequential high-performance storage for HPC checkpoints (the saved state of a long-running job) but also more random access to structured (database) and unstructured (file) large data sets. Big data access is a common need of traditional warehouse-scale computing for cloud services as well as of current and emergent HPC workloads. So, warehouse-scale computing is not HPC, but HPC applications can leverage data center-inspired technology for cloud HPC on demand, if designed to do so from the start.

Power to computing

Power to computing can be measured in terms of a typical performance metric per Watt—for example, FLOPS/Watt for computation or I/O operations per second per Watt for I/O. Furthermore, any computing facility can be seen as a plant for converting Watts into computational results, and a gross measure of good plant design is power usage effectiveness (PUE), which is simply the ratio of total facility power to the power delivered to computing equipment. A good value today is 1.2 or less. Reasons for higher PUE values include inefficient cooling methods, administrative overhead, and the lack of purpose-built facilities compared to cloud data centers (see Resources for a link to more information).
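
As a quick worked example (the wattages below are illustrative figures, not measurements from any facility mentioned in this article):

\[
\mathrm{PUE} = \frac{P_{\text{total facility}}}{P_{\text{computing equipment}}}
\qquad\text{for example,}\qquad
\frac{1.32\ \text{MW}}{1.10\ \text{MW}} = 1.2
\]

A facility that spends an extra Watt on cooling and power distribution for every Watt of computing would have a PUE of 2.0, which is why purpose-built cloud data centers treat PUE as a first-order design metric.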

Changes in scalable computing architecture focus over time include:

  • Early focus on a fast single processor (uniprocessor) to push the stored-program arithmetic logic unit central processor to the highest clock rates and instruction throughput possible:
    • John von Neumann, Alan Turing, Robert Noyce (a founder of Intel), and Ted Hoff (Intel universal processor proponent), along with Gordon Moore, saw initial scaling as a challenge of scaling digital logic and clocking a processor as fast as possible.
    • Up to at least 1984 (and maybe longer), the general rule was "the processor makes the computer."
    • Cray designed vector processors (X-MP, Y-MP) and distributed-memory multiprocessors interconnected by a six-way 3D torus for custom MPP machines. But this approach remained unique to the supercomputing world.
    • IBM's early focus was on scalable mainframes and fast uniprocessors until the announcement of the IBM® Blue Gene® architecture in 1999, which uses a multicore IBM® POWER® architecture system-on-a-chip design and a 3D torus interconnection. The current TOP500 includes many Blue Gene systems, which have often occupied the LINPACK-measured number one spot.
  • More recently, since 1994, HPC has evolved toward a few custom MPP designs and mostly off-the-shelf clusters, using both custom interconnections (for example, Blue Gene and Cray) and off-the-shelf converged Ethernet (10G, 40G) and InfiniBand:
    • The TOP500 has become dominated by clusters, which comprise the majority of top-performing HPC solutions (two-thirds) today.
    • As shown in the TOP500 chart by architecture since 1994, clusters and MPP dominate today (compared to single instruction, multiple data [SIMD] vector; fast uniprocessors; symmetric multiprocessing [SMP] shared memory; and other, more obscure architectures).
    • John Gage at Sun Microsystems (now Oracle) stated that "the network is the computer," referring to distributed systems and the Internet, but low-latency networks in clusters have likewise become core to scaling.
    • Coprocessors interfaced to cluster nodes via memory-mapped I/O, including GP-GPU and even hybrid field-programmable gate array (FPGA) processors, are used to accelerate specific computing workloads on each cluster node.
  • Warehouse-scale computing and the cloud emerge with focus on MapReduce and what HPC would call embarrassingly parallel applications:
    • The TOP500 is measured with LINPACK in FLOPS and so is not focused on cost of operations (for example, FLOPS/Watt) or data access. Memory access is critical, but storage access is not so critical, except for job checkpoints (so a job can be restarted, if needed).
    • Many data-driven applications have emerged in the new millennium, including social networks, Internet search, global geographical information systems, and analytics associated with more than a billion Internet users. This is not HPC in the traditional sense but rather warehouse-scale computing operating at massive scale.
    • Luiz André Barroso states that "the data center is the computer," a second shift away from processor-focused design. The data center is highly focused on OpEx as well as CapEx, and so is a better fit for HPC, where FLOPS/Watt and data access matter. Google's data centers have a PUE of less than 1.2—a measure of total facility power consumed divided by power used for computation. (Most computing enterprises have had a PUE of 2.0 or higher, so 1.2 is very low indeed. See Resources for more information.)
    • Amazon launched Amazon Elastic Compute Cloud (Amazon EC2), which is best suited to web services but has some scalable and at least high-throughput computing features (see Resources).
  • On-demand cloud HPC services expand, with an emphasis on clusters, storage, coprocessors, and elastic scaling:
    • Many private and public HPC clusters occupy the TOP500, run Linux®, and use common open source tools, such that users can build and scale applications on small clusters and migrate to the cloud for on-demand handling of large jobs. Companies like Penguin Computing, which features Penguin On-Demand, leverage off-the-shelf clusters (InfiniBand and converged 10G/40G Ethernet), Intel or AMD multicore headless nodes, GP-GPU coprocessors, and scalable redundant array of independent disks (RAID) storage.
    • IBM Platform Computing provides IBM xSeries® and zSeries® computing on demand with workload management tools and features.
    • Numerous universities and start-up companies leverage HPC on demand with cloud services or off-the-shelf clusters to complement their own private services. Two that I know well are the University of Alaska Arctic Region Supercomputing Center (ARSC) Pacman (Penguin Computing) and the University of Colorado JANUS cluster supercomputer. A common Red Hat Enterprise Linux (RHEL) open source workload tool set and open architecture allow for migration of applications from private to public cloud HPC systems.

Figure 1 shows the TOP500 move to clusters and MPP since the mid-1990s.

Figure 1. TOP500 evolution to clusters and MPP since 1994

The cloud HPC on-demand approach requires well-defined off-the-shelf clustering, compute nodes, and tolerance for WAN latency to transfer workloads. As such, these systems are not likely to take the top spots in the TOP500, but they are likely to occupy the Green500 and to provide efficient scaling for many workloads; indeed, clusters of this kind now comprise the majority of TOP500 systems.

High-definition digital video computer vision: a scalable HPC case study

Most of us deal with compressed digital video, often in Moving Picture Experts Group (MPEG)-4 format, and don't think about the scale of even a high-definition (HD) web cam in terms of data rates and the processing needed to apply simple image analysis. Digital cinema workflow and post-production experts know the challenges well. They deal with individual 4K frames (roughly 4 megapixels each) or much higher resolutions. These frames might be compressed, but they are not compressed over time in groups of pictures the way MPEG video is, and the compression used is often lossless rather than lossy.
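
To get a feel for the raw numbers, a quick back-of-the-envelope calculation for a single uncompressed 1080p stream (assuming 3 bytes per pixel and 30 frames/s, typical for an HD web cam) gives:

\[
1920 \times 1080\ \text{pixels} \times 3\ \text{bytes/pixel} \approx 6.2\ \text{MB/frame}
\]
\[
6.2\ \text{MB/frame} \times 30\ \text{frames/s} \approx 187\ \text{MB/s} \approx 1.5\ \text{Gbit/s}
\]

That is before any analysis is applied, and digital cinema 4K frames multiply these rates several times over.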

To start to understand an HPC problem that involves FLOPS, uncompressed data, and tools that can be used for scale-up, let's look at a simple edge-finder transform. The example download for this article uses Open Computer Vision (OpenCV) algorithms to transform a real-time web cam stream into a Sobel or Canny edge view in real time. See Figure 2.

Figure 2. HD video Canny edge transform
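
The downloadable example (transform-example.zip) carries the complete capture-and-transform program; the minimal sketch below is my own illustration of the core loop, assuming the OpenCV 2.x C++ API and a web cam visible as device 0, and it produces a Canny edge view like the one in Figure 2:

```cpp
// Minimal sketch (my illustration, not the article's download): capture web cam
// frames with the OpenCV 2.x C++ API and display a Canny edge view until ESC.
#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap(0);                       // default camera (device 0)
    if (!cap.isOpened()) return 1;

    cv::Mat frame, gray, edges;
    for (;;) {
        cap >> frame;                              // grab one frame
        if (frame.empty()) break;
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        cv::GaussianBlur(gray, gray, cv::Size(5, 5), 1.5);
        cv::Canny(gray, edges, 50, 150);           // hysteresis thresholds
        cv::imshow("Canny edges", edges);
        if (cv::waitKey(1) == 27) break;           // ESC to quit
    }
    return 0;
}
```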

Leveraging cloud HPC for video analytics allows for deployment of more intelligent smart phone applications. Perhaps phone processors will someday be able to handle real-time HD digital video facial recognition, but in the meantime, cloud HPC can help. Likewise, data that originates in data centers, such as geographic information system (GIS) data, needs intensive processing for analytics to segment scenes, create point clouds of 3D data from stereo vision, and recognize targets of interest (such as well-known landmarks).

Augmented reality and video analytics

Video analytics involves the collection of structured (database) information from unstructured video (files) and video streams—for example, facial recognition. Much of the early focus has been on security and automation of surveillance, but applications are growing fast and now extend to more social uses, such as facial analysis, perhaps not to identify a person but to capture and record a shopper's facial expression and mood. This technology can be coupled with augmented reality, whereby the analytics are used to update a scene with helpful information (such as navigation data). Video data can be compressed and uplinked to warehouse-scale data centers for processing so that analytics can be collected and information not available on a user's smart phone can be provided in return. The image processing is compute intensive, involves big data storage, and is likely a scaling challenge (see Resources for a link to more information).

Sometimes, when digital video is collected in the field, the data must be brought to the computational resources; but where possible, digital video should be moved only when necessary, to avoid encoding to compress and decoding to decompress for viewing. Specialized coprocessors known as codecs (coder/decoders) are designed to decode without software, and coprocessors that render graphics (GPUs) exist, but to date, no CV coprocessors are widely available. In late 2012, Khronos announced an initiative to define hardware acceleration for OpenCV, but that work has only just begun (see Resources). So, to date, CV remains more of an HPC application that has had attention primarily from digital cinema, but this is changing rapidly based on interest in CV on mobile devices and in the cloud.

Although all of us imagine CV being used in mobile robotics, in heads-up displays for intelligent transportation, and on visors (like the Google Glass devices now becoming available) for personal use, it's not clear that all of the processing must be done on the embedded devices, or that it should be even if it could. The reason is data: Without access to correlated data center data, CV information has less value. For example, how much value is there in knowing where you are without more mapping and GIS data to help you with where you want to go next? Real-time CV and video analytics are making progress, but they face many challenges, including huge storage requirements, high network bit rates for transport, and significant processing demands for interpretation. Whether the processing is done by cloud HPC clusters or embedded systems, it's clear that concurrency and parallel processing will play a huge role. Try running a simple Hough linear transform on the 12-megapixel cactus photo I took (included in the downloads for this article), and you'll see why HPC might be needed just to segment a scene at 60 frames/s.
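
To try that experiment yourself, a minimal sketch like the following (my own illustration, assuming the OpenCV 2.x C++ API; the image file name is a placeholder for wherever you unzip the cactus photo) times one probabilistic Hough line transform so you can compare a single frame's cost against a 60 frames/s budget:

```cpp
// Minimal sketch (my illustration): time a probabilistic Hough line transform
// on one high-resolution frame with the OpenCV 2.x C++ API. The file name
// below is a placeholder for the unzipped cactus image from the downloads.
#include <opencv2/opencv.hpp>
#include <cstdio>
#include <vector>

int main() {
    cv::Mat img = cv::imread("Cactus-12mpixel.jpg", cv::IMREAD_GRAYSCALE);
    if (img.empty()) { std::fprintf(stderr, "cannot read image\n"); return 1; }

    cv::Mat edges;
    cv::Canny(img, edges, 50, 150);                // edge map feeds the Hough step

    std::vector<cv::Vec4i> lines;
    double t0 = static_cast<double>(cv::getTickCount());
    cv::HoughLinesP(edges, lines, 1, CV_PI / 180, 80, 30, 10);
    double secs = (static_cast<double>(cv::getTickCount()) - t0) / cv::getTickFrequency();

    std::printf("%zu line segments found in %.3f s for one frame\n", lines.size(), secs);
    std::printf("at 60 frames/s you would need roughly a %.0fx speed-up\n", secs * 60.0);
    return 0;
}
```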

The challenge of making algorithms parallel

HPC with both clusters and MPP requires coding methods that employ many threads of execution on each multicore node and that use message-passing interfaces (MPIs) and basic methods to map data and code to processing resources and collect results. For digital video, the mapping can be simple if done at the frame level. Mapping within a frame is more difficult but still not bad, apart from the steps of segmenting frames and stitching the results back together.

The power of MapReduce

The MapReduce concept is generally associated with Google and the open source Hadoop project (from the Apache Software Foundation), but any parallel computation must employ this concept to obtain speed-up, whether done at a node or cluster level with Java™ technology or at a thread level on nonuniform memory access (NUMA) shared memory. For applications like digital video analytics, the mapping is data intensive, so it makes sense to move the function to the data (in the mapping stage); but either way, the data to be processed must be mapped and processed and the results combined. A clever mapping avoids data dependencies and the need for synchronization as much as possible. In the case of image processing for CV, the mapping could be within a frame, at the frame level, or by groups of pictures (see Resources).
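
To make the map and reduce steps concrete at the thread level, here is a minimal sketch (my own illustration, not code from this article's downloads) that maps the rows of a frame buffer across POSIX threads, lets each thread compute a partial sum of pixel intensities, and then reduces the partial results in the main thread; the 2560x1920 frame size matches the image used by the example program discussed below:

```cpp
// Minimal map/reduce sketch with POSIX threads (my illustration): each thread
// sums the pixel intensities of its slice of rows (map), and main() joins the
// threads and combines the partial sums (reduce). Sizes are illustrative.
#include <pthread.h>
#include <cstdio>

enum { WIDTH = 2560, HEIGHT = 1920, NTHREADS = 4 };

static unsigned char frame[HEIGHT][WIDTH];          // stand-in for a captured frame

struct Slice { int row_start, row_end; unsigned long long partial; };

static void *sum_slice(void *arg) {
    Slice *s = static_cast<Slice *>(arg);
    s->partial = 0;
    for (int r = s->row_start; r < s->row_end; ++r)
        for (int c = 0; c < WIDTH; ++c)
            s->partial += frame[r][c];              // no shared writes, so no locks
    return 0;
}

int main() {
    pthread_t tid[NTHREADS];
    Slice slice[NTHREADS];
    const int rows_per_thread = HEIGHT / NTHREADS;

    for (int i = 0; i < NTHREADS; ++i) {            // map: one row range per thread
        slice[i].row_start = i * rows_per_thread;
        slice[i].row_end = (i == NTHREADS - 1) ? HEIGHT : (i + 1) * rows_per_thread;
        pthread_create(&tid[i], NULL, sum_slice, &slice[i]);
    }

    unsigned long long total = 0;
    for (int i = 0; i < NTHREADS; ++i) {            // reduce: join and combine
        pthread_join(tid[i], NULL);
        total += slice[i].partial;
    }
    std::printf("total intensity = %llu\n", total);
    return 0;
}
```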

Key tools for designing cluster scaling applications for cloud HPC on demand include the following:

  • Threading is the way in which a single application (or Linux process), which occupies one address space on one cluster node, can be designed to use all processor cores on that node. Most often, this is done with Portable Operating System Interface for UNIX® (POSIX) Pthreads or with a library like OpenMP, which abstracts the low-level details of POSIX threading. I find POSIX threading to be fairly simple and typically write Pthread code directly, as can be seen in the hpc_cloud_grid.tar.gz example, which maps threads over the number space for prime number searching.
  • MPI is a library that can be linked into a cluster parallel application to assist with mapping of processing to each node, synchronization, and reduction of results (see the MPI sketch after this list). Although you can use MPI to implement MapReduce, unlike Hadoop, it typically moves data (in messages) to program functions running on each node (rather than moving code to the data). In the final video analytics article in this series, I will provide a threaded and MPI-based cluster-scalable version of the capture-transform code. Here, I provide the simple code for a single thread and node to serve as a reference. Run it and Linux dstat at the same time to monitor CPU, I/O, and storage use. It is a resource-intensive program that computes Sobel and Canny transforms on a 2560x1920-pixel image. It should run on any Linux system with OpenCV and a web cam.
  • Vector SIMD and SPMD processing can be accomplished on Intel and AMD nodes with a compiler switch to enable vectorization during compilation or, with more work, by creating transform kernels in CUDA or OpenCL for off-load to a GPU or GP-GPU coprocessor.
  • OpenCV is highly useful for video analytics, as it includes not only convenient image capture, handling, and display functions but also most of the best image processing transforms used in CV.
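
To show the mapping and reduction roles that MPI plays at the cluster level, here is a minimal sketch (my own illustration, not the MPI code promised for a later article in this series) in which each rank claims an interleaved share of frame indices, does placeholder per-frame work, and then MPI_Reduce combines the per-rank counts on rank 0:

```cpp
// Minimal MPI map/reduce sketch (my illustration): each rank "processes" its
// interleaved share of frames (the work is a placeholder), and MPI_Reduce
// combines the per-rank counts on rank 0. Build with mpicxx, run with mpirun.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long total_frames = 6000;                // illustrative job size
    long my_frames = 0;

    // Map: interleave frame indices across ranks.
    for (long f = rank; f < total_frames; f += size) {
        // Placeholder for real per-frame work (for example, an edge transform).
        ++my_frames;
    }

    // Reduce: sum the per-rank frame counts on rank 0.
    long all_frames = 0;
    MPI_Reduce(&my_frames, &all_frames, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("%d ranks processed %ld frames\n", size, all_frames);

    MPI_Finalize();
    return 0;
}
```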

The future of on-demand cloud HPC

This article makes an argument for cloud HPC. The goal here is to acquaint you with the idea and some challenging yet compelling applications (like CV) as well as to introduce you to methods for programming applications that can scale on clusters and MPP machines. In future articles, I will take the CV example further and adapt it not only for threading but also for MPI so that we can examine how well it scales on cloud HPC (in my case, at ARSC on Pacman or on JANUS). My research involves comparing tightly coupled CV coprocessors (which I am building with an Altera Stratix IV FPGA and call a computer vision processing unit [CVPU]) with what I can achieve with CV on ARSC, for the purpose of understanding whether environmental sensing and GIS data are best processed like graphics, with a coprocessor, on a cluster, or perhaps with a combination of the two. The goals for this research are lofty. In the case of the CVPU, the CV/graphics Turing-like test I imagine is one in which the scene that the CVPU parses can then be sent to a GPU for rendering; ideally, the parsed and rendered image would be indistinguishable from the true digital video stream. When rendered scenes and the ability to analyze them reach a common level of fidelity, augmented reality, perceptual computing, and video analytics will have amazing power to transform our lives.


Downloads

  • Continuous HD digital camera transform example: transform-example.zip (123KB)
  • Grid threaded prime generator benchmark: hpc_cloud_grid.tar.gz (3KB)
  • High-resolution image for transform benchmark: Cactus-12mpixel.zip (12288KB)





