Cloud-based education, Part 2: High-performance computing for education

Economics, availability, architecture, and scale

Part 1 of this series focused on the tools, methods, and strategies for using the cloud. This article explores high-performance computing (HPC), from supercomputers to warehouse-scale computers, and how the cloud makes supercomputing more economical and available to educators and students, with a broader range of architectures and scaling elasticity. In the past, students were often only able to read about supercomputers; today, with the cloud, hands-on HPC experience is much more feasible. This article provides a starting point for a cloud-based HPC educational strategy.


Sam Siewert, Senior Instructional Faculty, University of Colorado

Dr. Sam Siewert is an embedded system design engineer who has worked in the aerospace, telecommunications, and storage industries since 1988. He presently teaches at the University of Colorado Boulder as Senior Instructional Faculty in the Embedded Systems Certification Program, which he co-founded, and serves as an advisor to the Electrical Engineering Capstone Design course. In addition, he recently founded Trellis-Logic LLC, an embedded systems consulting firm that specializes in digital media, robotics, and cyber-physical applications.

31 January 2012

HPC architecture and scaling

Frequently used acronyms

  • CUDA: Compute unified device architecture
  • GPGPU: General-purpose computing on graphics processing units
  • HPC: High-performance computing
  • MIMD: Multiple instruction, multiple data
  • NIST: National Institute of Standards and Technology
  • OpenCL: Open computing language
  • OpenMP: Open multi-processing
  • SIMD: Single instruction, multiple data
  • SPMD: Single process, multiple data

High-performance computing (HPC) is computing system design and architecture that can range from exotic shared-memory or cluster supercomputers to commonly available off-the-shelf scale-out distributed computing. Quite a range of architectural variants exist between the extremes of proprietary supercomputer architectures and what is known as warehouse-scale computing, which employs all off-the-shelf components for processing, I/O, and memory scaling, relying mostly upon software to scale rather than custom hardware.

So what does HPC have to do with cloud computing? While there are many definitions of cloud computing, the National Institute of Standards and Technology (NIST) notes that cloud computing must have on-demand self service, network access, pooling of resources, elasticity to scale up and down, and metered services. On-demand self service with elasticity to scale up resources for compute-intensive applications is why cloud-based HPC is catching on and removing barriers to entry for startup companies and researchers globally.

Historically, supercomputing has been the domain of scientists and engineers who apply scalable computing to simulation and the mathematical transformation of large gridded data sets. The growth of data, data centers, and what have become data warehouses, along with the need for analytics to make sense of huge data sets, has driven rapid growth in distributed computing that scales with commonly available hardware. Supercomputing has always offered time-shared remote access and job scheduling and has been viewed as a utility for research and development; however, because of the high cost of proprietary hardware design, access has often been limited to researchers and less available to instructors and students. By comparison, warehouse-scale computing has emerged and grown alongside cloud computing.

Services available for cloud HPC

Before going further, a quick review of cloud computing services relevant to HPC is useful.

Software as a service (SaaS) provides access to applications from a portal.
  • SaaS makes applications available with little or no install footprint on the portal device (tablet, netbook, laptop) accessing the application.
  • SaaS for HPC is likely to include availability of simulations or advanced analytics applications that aren't easily installed on personal computing equipment. These simulations and applications include bioinformatics, finite element analysis, discrete event simulation, Monte Carlo analysis, and many more scientific and engineering research and development tools that also need access to large-scale data.
Platform as a service (PaaS) provides access to elastic scale-up and down from warehouse computing resource pools.
  • PaaS can provide access to a cluster of Linux® nodes, for example, with message passing and threading tools such as the Message Passing Interface (MPI) and OpenMP, at scales that may not otherwise be readily available for instruction.
  • PaaS is a pay-as-you-go plan for scalable computing, including MapReduce clusters and general-purpose computing on graphics processing units (GPGPU).
  • PaaS often includes more than one scalable architecture, including clusters, graphics processing unit (GPU) vector computing, or distributed computing with MapReduce software.
Infrastructure as a service (IaaS) is essential to scaling for HPC.
  • HPC often conjures up images of large numbers of processor cores, but just as important, if not more so, are the interconnect used to scale the system, access to data sets, and the ability to monitor long-duration computing jobs.
  • Networking, storage, and I/O are critical to HPC (often collectively referred to as simply I/O). HPC frequently employs somewhat exotic and scalable high-speed I/O solutions such as InfiniBand for clustering, 10G/100G Ethernet for networks, nonvolatile memory (NVM) for storage, and scalable redundant arrays of independent disks (RAID) for storage. All are examples of I/O infrastructure needed as a service for an elastic platform service.
  • Storage as a service (STaaS) is often included as a specific infrastructure service that is more important to large data set HPC and gridded computing. Some examples are weather models, materials science, economics, and business analytics.
  • Infrastructure is costly to build, maintain, and manage.

Based on the NIST definition of cloud computing, the services (SaaS, PaaS, and IaaS) and characteristics of cloud computing enjoy broad consensus. The deployment models, however, range more widely than the current NIST definition. NIST defines private, public (all agree up to this point), hybrid, and community clouds; the author's experience has included private, public, virtual private, and personal cloud deployments. Either way, deployments are defined more in terms of who uses the cloud services and the limitations on access. NIST implies that a community cloud is one shared by more than one institution (perhaps a cloud of clouds) and that a hybrid is a combination of any of the other three (public, private, community).

NIST states that cloud computing must have:

  • On-demand self-service so that a user can provision computing resources without human assistance.
  • Broad network access from not only a web browser but also a wide range of mobile portals such as tablet, netbook, and smartphone devices.
  • Resource pooling so that processing, memory, I/O, networking, and storage are dynamically assigned according to customer demand.
  • Elasticity so that capability can be scaled up and down on demand and appear almost limitless.
  • Transparent metered services so consumers pay as they go and know what the cost will be.

To date, private cloud HPC has largely been community cloud computing, often focused on a specific application such as aerodynamics simulation, weather data analysis and models, or nanotechnology. Therefore, cloud HPC caters to that specific scientific or engineering community. Public cloud HPC is a relatively new development that initially focused on warehouse-scalable computing (cluster using off-the-shelf hardware) rather than traditional supercomputing, but this is starting to change.

One reason for the change to include warehouse-scale computing and supercomputing features in the public cloud is that multi-core and vector processing that was previously not readily available off-the-shelf is now widely available. Today, while the struggle for dominance continues in the supercomputing TOP500, more off-the-shelf contenders make the top 10. Currently, while a system with a proprietary interconnection holds the number one position, many others in the top 10 employ InfiniBand (and OpenFabrics Alliance), blades, off-the-shelf nodes, and GPGPU for vector processing offload. Most people won't use a top 10 supercomputer, but an off-the-shelf scalable public cloud might be the next best thing.

How to recast and reuse some basic algorithms for HPC scaling

Look at the scalable grid-threaded prime generator benchmark in the Downloads section. It's similar to the application from Part 1 of this series, but I added gridding of the natural number space so that primes can be validated or invalidated in parallel. This algorithm is interesting because the validation and invalidation steps can be made highly concurrent, but the linear search for marked primes is inherently sequential because it must follow the gridded invalidation. The following code snippet from the download shows the sequential version of the invalidation:

for (j=2*p; j<MAX+1; j+=p) { set_isprime(j,0); }

The key function is set_isprime, shown in Listing 1, which marks multiples of each prime (starting at 2p) as non-prime using the sieve of Eratosthenes code from the first article in this series. The isprime array that set_isprime maintains is a bitwise map of all valid and invalid prime number locations from 0 to 10,000,000,000 in the example download code. If you can call this function from a grid of non-conflicting threads (in terms of locking), then the sequential code in the simple loop can be made parallel. Note that you don't need the locks for the sequential version, but they neither harm nor noticeably slow down simple sequential looping.

Listing 1. Prime invalidation function
int set_isprime(unsigned long long int i, unsigned char val)
{
    unsigned long long int idx; unsigned int bitpos;
    idx = i/(CODE_LENGTH); bitpos = i % (CODE_LENGTH);

    if(val > 0) {
        sem_wait(&updateIsPrime[idx % NUM_LOCKS]);
        isprime[idx] = isprime[idx] | (1<<bitpos);
        sem_post(&updateIsPrime[idx % NUM_LOCKS]);
    }
    else {
        sem_wait(&updateIsPrime[idx % NUM_LOCKS]);
        isprime[idx] = isprime[idx] & (~(1<<bitpos));
        sem_post(&updateIsPrime[idx % NUM_LOCKS]);
    }
    return val;
}
The thread-gridded version includes fine-grained locking (because the Booleans for primes are bit-packed and require a test-and-set) and Portable Operating System Interface (POSIX) threads. The setup for this (found in the download) is somewhat involved, and you might simplify it by using OpenMP. However, the core thread grid is simple and consists of thread creation and thread join, as you can see in the revised code snippet for setting prime Booleans shown in Listing 2.

Listing 2. Prime invalidation thread grid
for (thread_idx=0; thread_idx < NUM_THREADS; thread_idx++)
{
    threadarg[thread_idx].p=p; threadarg[thread_idx].j=2*p;
    pthread_create(&threads[thread_idx], NULL, invalidate_thread, &threadarg[thread_idx]);
}
for (thread_idx=0; thread_idx < NUM_THREADS; thread_idx++)
{
    if(pthread_join(threads[thread_idx], (void **)&final_thread_j) < 0)
        { perror("pthread_join"); exit(-1); }
}

How do you know if you'll benefit from HPC scaling?

One big challenge with scaling algorithms is making sequential sections parallel. Numerous hardware options exist for processing data that has been gridded for transformation by threads, including multi-core processor architectures, GPGPUs (which likewise include many specialized vector processing units), and interconnected clusters of computing nodes. The key factors are the overhead required to ramp up the parallel section, the percentage of the core algorithm that can be made parallel, and the ramp-down time to collect and organize results. In the large prime code example, the locking on the bit-packed Boolean critical section (using semaphores) and the sequential prime search limit how parallel the code can be made. According to Amdahl's law, this limits your potential speedup regardless of what hardware you throw at the application. As shown in Figure 1, with some tweaking, you can make the example code run more parallel (after a ramp-up) and saturate multiple processors (in this case four: a dual-core, hyper-threaded i7 system).

Figure 1. Saturating multi-core shared memory computers
Multi-core saturation

Many of the largest shared-memory multi-core systems are the supercomputers found in the TOP500. Even Amazon Elastic Compute Cloud (EC2) is limited to about eight cores per node (a typical four-socket, dual-core server found off-the-shelf). So, to get more scaling, you either use a vector coprocessor such as a GPGPU, message passing on a cluster (with a high-speed interconnection such as InfiniBand), or MapReduce on a distributed (Ethernet-networked) warehouse-scale cluster. This means that the added latency and overhead of breaking up the grid, sending code to the data (as is done with MapReduce) or data to each compute node (as is done in clusters), and collecting back results may not be worth the trouble (especially for an algorithm such as finding primes). So, the simplest approach to scaling is shared-memory multi-core and offload vector processors with shared memory, perhaps extended at a much coarser grain to provide job-level parallelism in a cluster or distributed system.

Why multi-core and vector processing offload will drive cloud HPC

Multi-core and vector computing off-the-shelf has evolved rapidly in the past decade. You could debate whether this is a new direction in computer architecture or a new direction in the types of applications that users want to run and the problems they want to solve. As shown in Figure 2, Michael J. Flynn recognized four basic computer architectures back in 1966 based on the cross product of data and instruction parallelism.

Figure 2. Flynn's computer architecture taxonomy
Flynn Architecture

Why have Flynn's MIMD and SIMD architecture categories gone off-the-shelf?

Applications today are much more data-driven than ever, and problems that can be handled with MapReduce or by processing gridded data sets (often referred to in the past as embarrassingly parallel) are more prevalent. As such, interest in scaling multiple cores, both with shared memory and as clusters that share data through message passing, has grown quickly. Likewise, with the realization that the 4D (three physical dimensions plus shading/texturing) gridded data transformation used by GPUs for rendering can do wonders for scientific gridded data processing, GPGPU has entered off-the-shelf HPC. This new architecture resulted in the definition of a new language, OpenCL, which is supported on GPUs and many-core architectures (see Resources).

Large shared memory symmetric multiprocessor (SMP) uniform memory access (UMA) machines are costly to build. In such a machine, every core has equally high performance (low latency) access to one giant memory space, and you can allocate threads of execution equally to any core in the system with minimal overhead. As such, most off-the-shelf systems remained single instruction, single data (SISD) while multiple instruction, multiple data (MIMD) and single instruction, multiple data (SIMD) systems remained in the realm of exotic supercomputers through the end of the twentieth century. While multiple instruction, single data (MISD) still remains in the realm of exotic high availability and reliability embedded and enterprise systems that must have high fault tolerance, SIMD and MIMD have become available off-the-shelf and even on mobile smartphone and tablet portals.

Two things happened to make SIMD and MIMD widely available off-the-shelf in the new millennium. First, uniprocessor core clock speeds hit a plateau (high clock speeds either melt the silicon or risk non-deterministic logic operations). At the same time, multitasking in window-based operating systems became the standard, so dual-core and then multi-core processors emerged to sidestep the clock plateau and, at the same time, give users a better interactive, multitasking desktop. Second, desktop graphics, especially 3D rendering, drove a new industry in graphics processing offload, known as the GPU, which handles rendering without the processor cores. Users of GPUs noticed that they were low cost and highly specialized for frame buffer gridded data transformation in four dimensions. These GPUs include numerous streaming multiprocessors that use vector instructions (operations on more than one 32-bit word at a time), so a GPU can transform a high-resolution image and render it with high performance. Early in the new millennium, many programmers noticed the generic compute parallelism that GPUs provided for gridded data transformation, and single process, multiple data (SPMD) computing was born. Nvidia first formalized this new programming paradigm with the compute unified device architecture (CUDA) for SPMD programming. Today, SPMD is enabled on a wider variety of multi-core vector coprocessors, and you can program in OpenCL as well as CUDA. SPMD is a simple variation of MIMD and SIMD that allows a single program to make use of multiple cores providing vector processing.

MapReduce and warehouse-scale computing

MapReduce is an approach used by Apache Hadoop and other distributed cluster scaling software solutions to divide and conquer data processing problems, especially with large data sets. It is essentially job creation, tracking, and coordination of completed jobs to provide concurrency within an off-the-shelf cluster of computing nodes. In the case of Hadoop, the MapReduce job management is coupled with the Hadoop Distributed File System (HDFS). Hadoop is in wide use for analytics for web-based services, social networks, and a wide variety of big data processing applications. The data processing is large-grain parallelism compared to vector processing and message-passing cluster computing but works well by making use of low-cost data center equipment, including direct-attach storage, commonly available rack-based servers, Ethernet (Gigabit or 10G), and the Linux operating system.

One type of embarrassingly parallel problem that is easily adapted to MapReduce and benefits greatly is digital media high-resolution image processing, as shown in Listing 3.

Listing 3. Image processing thread grid
for(runs=0; runs < 1000; runs++) {
    for(thread_idx=0; thread_idx < NUM_ROW_THREADS*NUM_COL_THREADS; thread_idx++) {
        if(thread_idx == 0) {idx=1; jdx=1;}
        if(thread_idx == 1) {idx=1; jdx=(thread_idx*(IMG_W_SLICE-1));}
        if(thread_idx == 2) {idx=1; jdx=(thread_idx*(IMG_W_SLICE-1));}
        if(thread_idx == 3) {idx=1; jdx=(thread_idx*(IMG_W_SLICE-1));}
        if(thread_idx == 4) {idx=IMG_H_SLICE; jdx=(thread_idx*(IMG_W_SLICE-1));}
        if(thread_idx == 5) {idx=IMG_H_SLICE; jdx=(thread_idx*(IMG_W_SLICE-1));}
        if(thread_idx == 6) {idx=IMG_H_SLICE; jdx=(thread_idx*(IMG_W_SLICE-1));}
        if(thread_idx == 7) {idx=IMG_H_SLICE; jdx=(thread_idx*(IMG_W_SLICE-1));}
        if(thread_idx == 8) {idx=(2*(IMG_H_SLICE-1)); jdx=(thread_idx*(IMG_W_SLICE-1));}
        if(thread_idx == 9) {idx=(2*(IMG_H_SLICE-1)); jdx=(thread_idx*(IMG_W_SLICE-1));}
        if(thread_idx == 10) {idx=(2*(IMG_H_SLICE-1)); jdx=(thread_idx*(IMG_W_SLICE-1));}
        if(thread_idx == 11) {idx=(2*(IMG_H_SLICE-1)); jdx=(thread_idx*(IMG_W_SLICE-1));}
        /* create a sharpen_thread for the (idx, jdx) slice; full threadarg
           setup is in the download code */
        pthread_create(&threads[thread_idx], NULL, sharpen_thread, &threadarg[thread_idx]);
    }
    for(thread_idx=0; thread_idx < NUM_ROW_THREADS*NUM_COL_THREADS; thread_idx++) {
        if((pthread_join(threads[thread_idx],NULL)) < 0) perror("pthread_join");
    }
    printf("frame %d completed\n", runs);
}

While this example is simply threaded in a shared-memory implementation, if you run it, you can see that most of the time is spent loading the first frame (which is 12 million pixels) into memory. This demonstrates why it is better to migrate code to the data (as Hadoop does with Java™ jobs) rather than data to the code (as this POSIX C example does). Furthermore, after the frame loads, the thread grid does scale in parallel across multiple cores. However, you could scale this up with considerably more efficiency if you rewrote the thread grid for a GPGPU coprocessor. The grid is enumerated for 12 threads, so the grid concept is simple to understand.

The sharpen_thread function itself is amenable to vector processing (as you can see if you download and run the code). This code is ideal for MapReduce with Hadoop, where the frame transformation code runs on a distributed cluster and is migrated to each node that locally stores the digital media frame data, eliminating an I/O bottleneck and reducing the impact of image-loading ramp-up. While Hadoop is without question a valuable tool for data warehouse analytics, it may not offer the fine-grained parallelism required to provide significant speedup for traditional scientific computing. This all depends on the problem and how much MapReduce ramp-up and ramp-down time is needed to distribute the computation (normally migrated to the data) in the Java programming environment.

Figure 3 shows a simple plot of Amdahl's law for speedup.

Figure 3. Amdahl's law for speedup based on parallel portion
Amdahl's law

In the ideal case with minimal ramp-up and ramp-down time, you achieve linear speedup with parallelism. However, for many algorithms, the sequential overhead simply to start the parallel section of the kernel code that benefits from threading and vector processing can be substantial. Likewise, to gain significant speedup, the number of parallel cores providing vector processing may need to scale beyond a single node. Therefore, scaling may require more traditional HPC cluster methods, such as Open MPI, OpenMP, and vector processing using OpenCL or CUDA, in a more tightly integrated InfiniBand cluster with Peripheral Component Interconnect Express (PCIe) coprocessors. Scaling algorithms for speedup remains both a software and a hardware challenge, but programmers today have more tools to employ, and system designers have much more to choose from off-the-shelf.

Example multi-core scaling using Amazon EC2

To examine the value of cloud-based HPC, I took the high-resolution image sharpening benchmark (see Downloads), which is a simple, gridded image processing example, and ran it both on my dual-core i7 laptop (a mobile workstation) and on an Amazon EC2 Spot Instance with 64GB of working memory, eight cores, and 26 elastic compute units. Speedup on an embarrassingly parallel data transformation algorithm like this should scale well. What I found was that going from two cores (four if you count hyper-threading) to eight cores gave approximately 1.25x speedup, which, based upon Amdahl's law, is about what I would expect because the I/O ramp-up and ramp-down time is high. The output snippet in Listing 4 shows the results from my Core i7 laptop.

Listing 4. Results from my Core-i7 laptop
		frame 999 completed
		starting sink file Cactus-12mpixel-sharpen.ppm write
		sink file Cactus-12mpixel-sharpen.ppm written

		real	5m33.146s
		user	17m31.794s
		sys	0m50.391s

By comparison, on Amazon, I could see all eight processors in top showing 90% or greater loading during the processor-intensive phase of this test. I didn't optimize the I/O, but the same simple read/write file I/O was done on both systems. Overall, if I can get 25% speedup with little optimization work and pay less than US$1 per hour to turn an eight-hour job into a six-hour job, this could mean finishing my day's work and getting home on time for US$6 spent.

The output from the Amazon EC2 run in Listing 5 shows that my five minutes and 33 seconds of real-time waiting on results was reduced to four minutes and 12 seconds. It is interesting that more system time was used on Amazon (probably for EC2 management overhead), but overall this provided 25% real-time speedup with no code changes on my part.

Listing 5. Output from the Amazon EC2 run
		frame 999 completed
		starting sink file Cactus-12mpixel-sharpen.ppm write
		sink file Cactus-12mpixel-sharpen.ppm written

		real    4m12.228s
		user    13m55.266s
		sys     2m11.944s

With a Spot Instance price of US$0.755/hour for an m2.4xlarge running 64-bit Red Hat Enterprise Linux (RHEL) in my region during the evening hours, this seems like a bargain and well worth it if I have hours of post-production processing to do on a frame sequence going into a short feature film. Writing this article and developing and testing the included code took approximately two hours and cost me US$18.51 in HPC charges. I spent some time on my laptop too, but overall this was a bargain. With a bit more practice, use of Spot Instances with competitive bids, and less time spent with instances idling, I could lower my cost even further. Figure 4 shows my total cost.

Figure 4. The cost of a cloud-based HPC for education paper
Cost of Cloud HPC

Cloud-based HPC economics and availability

In Chapter 6 of the fifth edition of Hennessy and Patterson's Computer Architecture: A Quantitative Approach (see Resources), the authors provide an analysis showing that when both the capital expenditure (CAPEX) to build a warehouse-scale computer and the operational expenditure (OPEX) to keep it running are accounted for, you can hit an optimal point where the cost of running 1000 servers for one hour is equivalent to running one server for 1000 hours. This is an astonishing conclusion because it gives cloud-based HPC users a simple choice: own equipment and wait for jobs to complete, or pay as they go and accelerate their results without the burden of ownership. While this seems quite believable based on their analysis using Amazon EC2, it is not yet clear that the same holds for traditional supercomputers, which employ lower-latency, higher-bandwidth interconnections (for example, OpenFabrics Alliance protocols such as InfiniBand) or even proprietary interconnections like the Cray XT6 SeaStar2+, but the outlook is hopeful. The future for Converged Enhanced Ethernet (CEE), InfiniBand, and Data Center Ethernet (DCE) points toward a common fabric in which supercomputing capability will become more available as a cloud service.

The future of cloud-based HPC

This article makes an argument for cloud-based HPC based on the recognition that off-the-shelf availability of SIMD vector processing and MIMD elastic compute clusters with on-demand service, pay-as-you-go plans, broad network access, and resource pooling (all cloud features) will expose HPC to a broader set of users beyond research and development scientists and engineers—including students and their instructors. The alternative is, of course, for institutions to build their own supercomputers and clusters. Even then, private cloud access to those resources is beneficial because HPC resources, even off-the-shelf, are costly to fund and administer, and, therefore, are most beneficial if they are highly used.

Future articles in this series will provide more concrete examples of cloud-based HPC and how to make use of GPGPUs, Linux clusters, and MapReduce software such as Hadoop.


Downloads

  • Grid-threaded prime generator benchmark: hpc_cloud_grid.zip (3KB)
  • High resolution image grid parallel benchmark: hpc_digital_media_cloud_grid.zip (628KB)



Resources

  • Read Cloud computing fundamentals (developerWorks, December 2010) to learn more about cloud computing.
  • NIST provides a concise Definition of Cloud Computing. See also the Institute of Electrical and Electronics Engineers (IEEE) Spectrum article, A Cloud You Can Trust, in the December 2011 edition.
  • The classic Computer Architecture, A Quantitative Approach, 5th Edition by John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2012, Chapter 6 includes great detailed background information about HPC, warehouse-scale computing, and a detailed treatment of SIMD and MIMD architectures.
  • The Top500 supercomputing website provides a description of the 500 fastest computers on earth and an overview of benchmarks, such as LINPACK and BLAS, used to compare their computational capabilities. The Top500 comparative statistics provide interesting insight into who has supercomputing resources and how this impacts business and the economy. The Top500 Birds of a Feather slides from the Supercomputing 2011 conference provide a nice summary of the value and importance of supercomputing.
  • The OpenFabrics Alliance is helping to define standards for 10G (10 Gigabits/sec with 8B/10B encoding) and up to 100G (100 Gigabits/sec with 64B/66B encoding) to knit together InfiniBand, Converged Enhanced Ethernet and Data Center Bridging in the IEEE 802.1 working groups for HPC and warehouse-scale computing.
  • You can use Massachusetts Institute of Technology's (MIT's) StarCluster on Amazon EC2, enabling use of Spot Instance time to reduce cost and fully leverage cluster scaling. This video by Chris Dagdigan on YouTube is helpful to get you started.

Get products and technologies

  • Software needed for HPC is often open source. Key tools include the OpenMP application programming interface (API), OpenMPI, CUDA for Nvidia GPUs and GPGPUs such as Fermi, and OpenCL, as well as numerous Linux operating system and GNU development tools and specialized MapReduce tools such as Apache Hadoop.
  • PaaS HPC public cloud options today include: Amazon EC2 for high performance multi-core and GPGPU computing and HPC applications, Sabalcore Cloud HPC, Penguin Computing Linux Clusters, and Platform Computing PaaS HPC. The list is growing with options for startup companies and academic institutions that may not have affordable access to community HPC, helping to remove barriers to entry for research and development.
  • IBM provides Cloud-based HPC management tools for private and community HPC clouds, Data Warehouse solutions for warehouse-scale computing and big data analytics, and numerous supercomputing solutions on the IBM Deepcomputing website.
  • I mostly used top and basic Linux monitoring tools to measure performance. However, I also used SystemTap, a great tracing tool for the Linux kernel (and user space), Sysprof, and OProfile. You can easily install any of these on your computer or on a cloud HPC Linux instance with Yum or YaST, depending on the flavor of Linux being run on your instance or cluster. Almost all of these tools involve either building and booting your own Linux kernel or finding and installing the appropriate debuginfo packages to match your kernel (found with the uname -r command).

