Exotic HPC architectures with custom scaled processor cores and shared memory interconnection networks are rapidly being replaced by on-demand clusters that leverage off-the-shelf general-purpose vector coprocessors, converged Ethernet at 40 Gbit/s per link or more, and multicore headless servers. These new HPC on-demand cloud resources resemble what has been called warehouse-scale computing, where each node is homogeneous and headless and the focus is on total cost of ownership and overall power use efficiency. However, HPC has unique requirements that go beyond those of social networks, web search, and other typical warehouse-scale computing solutions. This article focuses on what system builders and HPC application developers can do to scale their systems and applications most efficiently.
Moving to high-performance computing
Since 1994, the TOP500 and Green500 supercomputers (see Resources) have more often been not custom designs but systems designed and integrated from off-the-shelf headless servers, converged Ethernet or InfiniBand clustering, and general-purpose graphics processing unit (GP-GPU) coprocessors used not for graphics but for single program, multiple data (SPMD) workloads. The trend in high-performance computing (HPC) away from exotic custom processor and memory interconnection designs and toward off-the-shelf warehouse-scale computing is driven by the need to control total cost of ownership, increase power efficiency, and balance operational expenditure (OpEx) and capital expenditure (CapEx) for both start-up and established HPC operations. This means that you can build your own small cluster with similar methods and use HPC warehouse-scale resources on demand when you need them.
The famous 3D torus interconnection that Cray and others used may never fully go away (today, the TOP500 is one-third massively parallel processors [MPPs] and two-thirds cluster architecture for top performers), but the focus on efficiency and new OpEx metrics like Green500 floating-point operations (FLOPs)/Watt is driving HPC and keeping architecture focused on clusters. Furthermore, many applications of interest today are data driven (for example, digital video analytics), so many systems need not only traditional sequential high-performance storage for HPC checkpoints (saved state of a long-running job) but also more random access to structured (database) and unstructured (files) large data sets. Big data access is a common need of traditional warehouse-scale computing for cloud services as well as of current and emergent HPC workloads. So, warehouse-scale computing is not HPC, but HPC applications can leverage data center-inspired technology for cloud HPC on demand, if designed to do so from the start.
Changes in scalable computing architecture focus over time include:
- Early focus on a fast single processor (uniprocessor) to push the stored-program arithmetic logic unit central processor to the highest clock rates and instruction throughput possible:
- John von Neumann, Alan Turing, Robert Noyce (founder of Intel), and Ted Hoff (Intel universal processor proponent), along with Gordon Moore, saw initial scaling as a challenge of scaling digital logic and clocking a processor as fast as possible.
- Up to at least 1984 (and maybe longer), the general rule was "the processor makes the computer."
- Cray Computer designed vector processors (X-MP, Y-MP) and distributed-memory multiprocessors interconnected by a six-way 3D torus for custom MPP machines. But this was unique to the supercomputing world.
- IBM's focus early on was scalable mainframes and fast uniprocessors until the announcement of the IBM® Blue Gene® architecture in 1999 using a multicore IBM® POWER® architecture system-on-a-chip design and a 3D torus interconnection. The current TOP500 includes many Blue Gene systems, which have often occupied the LINPACK-measured TOP500 number one spot.
- More recently, since 1994, HPC has evolved toward a few custom MPP designs and mostly off-the-shelf clusters, using both custom interconnections (for example, Blue Gene and Cray) and off-the-shelf converged Ethernet (10G, 40G) and InfiniBand:
- The TOP500 has become dominated by clusters, which comprise the majority of top-performing HPC solutions (two-thirds) today.
- As shown in the TOP500 chart by architecture since 1994, clusters and MPP dominate today (compared to single instruction, multiple data [SIMD] vector; fast uniprocessors; symmetric multiprocessing [SMP] shared memory; and other, more obscure architectures).
- John Gage at Sun Microsystems (now Oracle) stated that "the network is the computer," referring to distributed systems and the Internet, but low-latency networks in clusters likewise become core to scaling.
- Coprocessors interfaced to cluster nodes via memory-mapped I/O, including GP-GPU and even hybrid field-programmable gate array (FPGA) processors, are used to accelerate specific computing workloads on each cluster node.
- Warehouse-scale computing and the cloud emerge with a focus on MapReduce and what HPC would call embarrassingly parallel applications:
- The TOP500 is measured with LINPACK and FLOPs and so is not focused on cost of operations (for example, FLOPs/Watt) or data access. Memory access is critical, but storage access is not so critical, except for job checkpoints (so a job can be restarted, if needed).
- Many data-driven applications have emerged in the new millennium, including social networks, Internet search, global geographical information systems, and analytics associated with more than a billion Internet users. This is not HPC in the traditional sense but warehouse-scale computing operating at massive scale.
- Luiz André Barroso states that "the data center is the computer," a second shift away from processor-focused design. The data center is highly focused on OpEx as well as CapEx and so is a better fit for HPC, where FLOPs/Watt and data access matter. Google's data centers, for example, have a PUE of less than 1.2, a measure of total facility power consumed divided by power used for computation. (Most computing enterprises have had a PUE of 2.0 or higher, so 1.2 is very low indeed. See Resources for more information.)
- Amazon launched Amazon Elastic Compute Cloud (Amazon EC2), which is best suited to web services but has some scalable and at least high-throughput computing features (see Resources).
- On-demand cloud HPC services expand, with an emphasis on clusters, storage, coprocessors, and elastic scaling:
- Many private and public HPC clusters occupy TOP500, running Linux® and using common open source tools, such that users can build and scale applications on small clusters but migrate to the cloud for on-demand large job handling. Companies like Penguin Computing, which features Penguin On-Demand, leverage off-the-shelf clusters (InfiniBand and converged 10G/40G Ethernet), Intel or AMD multicore headless nodes, GP-GPU coprocessors, and scalable redundant array of independent disks (RAID) storage.
- IBM Platform Computing provides IBM xSeries® and zSeries® computing on demand with workload management tools and features.
- Numerous universities and start-up companies leverage HPC on demand with cloud services or off-the-shelf clusters to complement their own private services. Two that I know well are the University of Alaska Arctic Region Supercomputing Center (ARSC) Pacman (Penguin Computing) and the University of Colorado JANUS cluster supercomputer. A common Red Hat Enterprise Linux (RHEL) open source workload tool set and open architecture allow for migration of applications from private to public cloud HPC systems.
Figure 1 shows the TOP500 move to clusters and MPP since the mid-1990s.
Figure 1. TOP500 evolution to clusters and MPP since 1994
The cloud HPC on-demand approach requires well-defined off-the-shelf clustering, compute nodes, and tolerance for WAN latency to transfer workload. As such, these systems are not likely to overtake top spots in the TOP500, but they are likely to occupy the Green500 and provide efficient scaling for many workloads; clusters of this kind now comprise the majority of the TOP500.
High-definition digital video computer vision: a scalable HPC case study
Most of us deal with compressed digital video, often in Motion Picture Experts Group (MPEG) 4 format, and don't think of the scale of even a high-definition (HD) web cam in terms of data rates and the processing needed to apply simple image analysis. Digital cinema workflow and post-production experts know the challenges well. They deal with individual frames at 4K (roughly 4-megapixel) or much higher resolution. These frames might be compressed, but they are not compressed over time in groups of pictures as MPEG does, and the compression is often lossless rather than lossy.
To start to understand an HPC problem that involves FLOPs, uncompressed data, and tools that can be used for scale-up, let's look at a simple edge-finder transform. The transform-example.zip includes Open Computer Vision (OpenCV) algorithms to transform a real-time web cam stream into a Sobel or Canny edge view in real time. See Figure 2.
Figure 2. HD video Canny edge transform
Leveraging cloud HPC for video analytics allows for deployment of more intelligent smart phone applications. Perhaps phone processors will someday be able to handle real-time HD digital video facial recognition, but in the meantime, cloud HPC can help. Likewise, data that originates in data centers, like geographic information systems (GIS) data, needs intensive processing for analytics to segment scenes, create point clouds of 3D data from stereo vision, and recognize targets of interest (such as well-known landmarks).
Sometimes, when digital video is collected in the field, the data must be brought to the computational resources; but where possible, digital video should only be moved when necessary, to avoid encoding to compress and decoding to decompress for viewing. Specialized codec (coder/decoder) hardware exists to decode video without software, and coprocessors to render graphics (GPUs) exist, but to date, no CV coprocessors are widely available. Khronos announced an initiative to define hardware acceleration for OpenCV in late 2012, but work has only just begun (see Resources). So, to date, CV remains more of an HPC application that has had attention primarily from digital cinema, but this is changing rapidly based on interest in CV on mobile devices and in the cloud.
Although all of us imagine CV being implemented on mobile robotics, in heads-up displays for intelligent transportation, and on visors (like Google Glass, now available) for personal use, it's not clear that all of the processing must be done on the embedded devices, or that it should be even if it could. The reason is data: Without access to correlated data center data, CV information has less value. For example, how much value is there in knowing where you are without more mapping and GIS data to help you decide where you want to go next? Real-time CV and video analytics are making progress, but they face many challenges, including huge storage requirements, high network bit rates for transport, and significant processing demands for interpretation. Whether the processing is done by cloud HPC clusters or embedded systems, it's clear that concurrency and parallel processing will play a huge role. Try running a simple Hough linear transform on the 12-megapixel cactus photo I took, and you'll see why HPC might be needed just to segment a scene at 60 frames/s.
The challenge of making algorithms parallel
HPC with both clusters and MPP requires coding methods that employ many threads of execution on each multicore node and that use message-passing interfaces (MPIs) and basic methods to map data and code to processing resources and collect results. For digital video, the mapping can be simple if done at the frame level. Mapping within a frame is more difficult but still tractable, aside from the extra steps of segmenting each frame and stitching the results back together.
Key tools for designing cluster scaling applications for cloud HPC on demand include the following:
- Threading is the way in which a single application (a Linux process with one address space on one cluster node) can be designed to use all of the processor cores on that node. Most often, this is done with Portable Operating System Interface for UNIX® (POSIX) Pthreads or with a library like OpenMP, which abstracts the low-level details of POSIX threading. I find POSIX threading to be fairly simple and typically write Pthread code directly, as can be seen in the hpc_cloud_grid.tar.gz example. This example maps threads over the number space for prime number searching.
- MPI is a library that can be linked into a cluster parallel application to assist with mapping of processing to each node, synchronization, and reduction of results. Although you can use MPI to implement MapReduce, unlike Hadoop, it typically moves data (in messages) to program functions running on each node rather than moving code to the data. In the final video analytics article in this series, I will provide a threaded and MPI cluster-scalable version of the capture-transform code.
Here, I provide the simple code for a single thread and node to serve as a reference. Run it and dstat at the same time to monitor CPU, I/O, and storage use. It is a resource-intensive program that computes Sobel and Canny transforms on a 2560x1920-pixel image. It should run on any Linux system with OpenCV and a web cam.
- Vector SIMD and SPMD processing can be accomplished on Intel and AMD nodes with a compiler switch enabled at build time or, with more work, by creating transform kernels in CUDA or OpenCL for off-load to a GPU or GP-GPU coprocessor.
- OpenCV is highly useful for video analytics, as it includes not only convenient image capture, handling, and display functions but also most of the best image processing transforms used in CV.
The future of on-demand cloud HPC
This article makes an argument for cloud HPC. The goal here is to acquaint you with the idea and some of the challenging yet compelling applications (like CV), as well as to introduce you to methods for programming applications that can scale on clusters and MPP machines. In future articles, I will take the CV example further and adapt it not only for threading but also for MPI so that we can examine how well it scales on cloud HPC (in my case, at ARSC on Pacman or JANUS). My research involves comparison of tightly coupled CV coprocessors (which I am building using an Altera Stratix IV FPGA and call a computer vision processing unit [CVPU]) with what I can achieve with CV on ARSC, for the purpose of understanding whether environmental sensing and GIS data are best processed like graphics, with a coprocessor, or on a cluster, or perhaps with a combination of the two. The goals for this research are lofty. In the case of the CVPU, the CV/graphics Turing-like test I imagine is one in which the scene that the CVPU parses can then be sent to a GPU for rendering. Ideally, the parsed and rendered image would be indistinguishable from the true digital video stream. When rendered scenes and the ability to analyze them reach a common level of fidelity, augmented reality, perceptual computing, and video analytics will have amazing power to transform our lives.
|Download|File name|Size|
|---|---|---|
|Continuous HD digital camera transform example|transform-example.zip|123KB|
|Grid threaded prime generator benchmark|hpc_cloud_grid.tar.gz|3KB|
|High-resolution image for transform benchmark|Cactus-12mpixel.zip|12288KB|
- According to the TOP500 list, based on the LINPACK benchmark that you can download from Intel (also available as LINPACK Java), clusters are the most prevalent HPC solution.
- For efficiency and OpEx/CapEx cost control, the Green500 list is perhaps more interesting than the TOP500.
- If you are new to HPC, you'll find the developerWorks article "High-performance Linux clustering, Part 1: Clustering fundamentals" (Aditya Narayan, September 2005) a great place to start; likewise, if you've never set up a Linux cluster (Beowulf or OSCAR), the developerWorks article "High-performance Linux clustering, Part 2: Build a working cluster" (Aditya Narayan, October 2005) should help you get going.
- If your cluster involves big data, then you'll need scalable storage and Ceph is a great option for setting up big data object storage. You'll also no doubt want to build on a RAID volume, but Linux and hardware RoC (RAID on Chip) make this easier today. For really scalable data, HPC on-demand may be the best way to go rather than trying to build this on your own.
- See the developerWorks article, "Cloud-based education, Part 2: High-performance computing for education" (Sam Siewert, January 2012).
- Universities that leverage HPC on demand with cloud services are the University of Alaska ARSC Pacman and the University of Colorado JANUS cluster supercomputer.
- I found the presentation "Scaling Challenges for Warehouse Scale Computers" quite useful for understanding what warehouse-scale computing really is and how it differs from and resembles scalable HPC cloud solutions.
- The latency of data access to memory for NUMA nodes in a cluster can reduce speed-up, as examined in "Optimizing Google's Warehouse Scale Computers: The NUMA Experience". So, although clustering has large-grained MapReduce value, each node must still have efficient fine-grained processing and memory access.
- Video analytics focused on automation of closed-circuit television surveillance early on, as described in Video analytics for physical security, but it has grown to include automation for interactive sales and marketing and use in augmented reality and perceptual computing systems support between the cloud and mobile devices.
- The book Computer Vision: Models, Learning, and Inference by Simon J.D. Prince (Cambridge UP, 2012) provides a thorough technical treatment of CV algorithms, augmented reality tracking, and new directions for CV based on inference.
- The book Computer Architecture: A Quantitative Approach, 5th Edition by John Hennessy and David Patterson (Elsevier, 2011) does a good job of summarizing and comparing warehouse-scale computing to traditional server, HPC, and desktop computing.
- My article "Using Intel Streaming SIMD Extensions and Intel Integrated Performance Primitives to Accelerate Algorithms" describes how SIMD instructions on Intel architectures can get you going with GNU gcc or the Intel compiler.
- For SPMD on GPUs, GP-GPUs, or multicore devices like the Intel® Many Integrated Core (MIC) architecture, I recommend OpenCL, but you may also want to look into NVIDIA CUDA. Altera is adding OpenCL support for the Stratix V FPGA coprocessor card as well.
- In the developerWorks cloud developer resources, discover and share knowledge and experience of application and services developers building their projects for cloud deployment.
- Check out the IBM Redbook, IBM Platform Computing Solutions by Dino Quintero, Scott Denham, et al. (2012).
- Follow developerWorks on Twitter.
- Watch developerWorks demos ranging from product installation and setup demos for beginners to advanced functionality for experienced developers.
- I have always found the Lawrence Livermore National Laboratory Pthreads documentation quite useful.
- For computer vision, OpenCV implements most of the best-known algorithms for image transform, segmentation, registration, and object/shape recognition. The Khronos Group has initiated a new standard to define hardware acceleration for OpenCV called OpenVX, which would imply that PCIe cards like GPUs will one day be available for CV or CVPUs that can perform the inverse of rendering scenes and instead parse them into scene descriptions.
- For MPI, Open MPI should meet all of your MP needs and works on typical RHEL cloud HPC systems and your local cluster, as well as on simple laptop or desktop Linux systems like Ubuntu LTS (what I run just to build and pretest before scaling). I found Installing Open MPI 1.6.3 by Matt Reid to be helpful for getting MPI going on my single-node Ubuntu LTS desktop and laptop systems. However, you will also need to set up a Linux cluster to truly benefit from MPI: The TechTinkering article Setting up a Beowulf Cluster Using Open MPI on Linux by Lawrence Woodman provides a great overview of how to set up a Linux Beowulf cluster. For more in-depth discussion and guidance on Linux cluster setup, I have found the book Building Clustered Linux Systems by Robert W. Lucke (Prentice Hall, 2004) useful. Porting solutions from Open MPI to RHEL HPC clusters was therefore much easier, given the ability to test locally.
- Some companies that provide either preintegrated cluster solutions or on-demand cloud HPC include: IBM Platform Computing, Penguin Computing, Amazon, PiCloud.com, cloudscaling.com, and sabalcore.com, to name just a few. Some are more oriented toward warehouse-scale computing for data centers and some more toward traditional HPC, but with the growing number of both on demand and system solution providers, numerous options are available.
- Take a look at the new low-cost Intel time-of-flight camera and point cloud processing technology.
Get products and technologies
- Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement service-oriented architecture efficiently.
- Join a developerWorks community cloud computing group.
- Read all the great cloud blogs on developerWorks.
- Join the developerWorks community, a professional network and unified set of community tools for connecting, sharing, and collaborating.