Multicore CPUs and the concurrency changes they bring

Why thread-based application parallelism falls short in the multicore era

Multicore chip architectures have brought little improvement to individual core performance. This continuing trend shifts the burden of maximizing the use of hardware resources to the developers of operating systems, programming languages, and applications. Many in the application-development community rely on thread-based concurrent programming to implement application parallelism. This article explains why thread-based programming is not the best approach for application parallelism in the multicore era.

Vasudevan Thiyagarajan (vasu.thiyagarajan@live.com), Architect, Royal Caribbean Cruises Ltd.

Vasu Thiyagarajan works for Royal Caribbean Cruises Ltd. as an IT architect. Previously, he was part of IBM Software Labs, Bangalore. His primary areas of interest include building highly scalable systems and distributed computing.



31 July 2012


Moore's Law — Gordon Moore's 1965 prediction that the number of components per integrated circuit would double roughly every 18 to 24 months — has held true, and it is expected to remain true until 2015-2020. Until 2005, CPU clock rates also improved consistently, which by itself was sufficient to improve the performance of all applications executing on those CPUs. The application-development community enjoyed a free ride with respect to performance improvement while making little or no investment in algorithmic improvement.

Since 2005, however, clock-rate increases and transistor-count increases have diverged. Because of physical limits — chiefly power dissipation and heat — clock rates stopped increasing (and even dropped), and processor makers started packing more execution units (cores) into a single chip (socket). This trend — which seems likely to continue for the foreseeable future — has put pressure on the application-development and programming-language-development communities, in two broad senses:


  • Simply upgrading to a more powerful CPU no longer yields the pre-2005 rate of performance increase for a single-threaded application. A single thread executes on one core at a time, so its throughput stays roughly the same no matter how many cores the CPU has (assuming no breakthrough occurs in automatic parallelization techniques at the compiler, virtual-machine, or operating-system level).
  • Upgrading to a multicore CPU benefits only additional concurrent load on the system; it does not speed up the existing load.

The only way to exploit the available CPU cores efficiently is through parallelism. So far, parallelism is mainly being used by operating systems at the process level to provide a seamless multitasking, multiprocessing experience. On the application-development side, thread-based concurrent programming is the predominant mechanism for implementing parallelism.

Thread-based programming model

A thread is a lightweight process and the smallest unit of execution scheduled by an OS. All threads within a process share the same address space in memory, so they can access the same objects. The technical details of how threads work are beyond the scope of this article.
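To make this concrete, here is a minimal sketch (plain Java SE, nothing application-specific) of two threads mutating one shared object; both threads see the same Counter instance because they share the process's address space:

Listing 1. Two threads sharing one object
public class SharedAddressSpace {

    static class Counter {
        int value; // lives on the shared heap, visible to every thread
    }

    public static void main(String[] args) throws InterruptedException {
        final Counter shared = new Counter();
        Runnable work = new Runnable() {
            public void run() {
                for (int i = 0; i < 1000; i++) {
                    shared.value++; // deliberately unsynchronized; see "Shared objects"
                }
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Without synchronization the printed value is often less than 2000 —
        // a preview of the shared-object problems discussed later.
        System.out.println("value = " + shared.value);
    }
}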

Thread-based parallelism has these advantages:

  • It is a well-established programming model.
  • The application-development community has a solid understanding of how threads are created, scheduled, executed, and managed.
  • Developers are trained to think of algorithmic development in a sequential manner. The threading model simply extends the same approach for parallelism.

However, the problems with thread-based application parallelism outweigh its advantages. This article presents some reasons why explicit thread-based application parallelism might not be the best way to utilize CPU cores and why we need a different programming paradigm.

Call-stack depth

The call stack is an internal structure maintained by the OS or virtual machine to handle all method invocations. Every method call within a thread's execution pushes one stack frame onto the stack; the frame holds details about the current call, such as parameters, the return address, and local variables.

Figure 1 shows the internals of method invocation:

Figure 1. Call-stack internal structure and growth

No matter how you modularize an application into multiple logical layers (such as a controller layer, facade layer, component layer, and data access object [DAO] layer), a thread is the ultimate weaver at runtime, and it has only one stack. The call stack is an awesome invention for handling source-code modularization at runtime. But as an application's complexity grows and load on the system increases, the current call-stack model limits application scalability, and it has inherent problems relating to memory size and object reachability.
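Stack growth is easy to observe. The following sketch recurses until the thread's fixed-size stack is exhausted; the depth it reaches depends on the frame size and on the per-thread stack size, which is tunable with the JVM's standard -Xss option:

Listing 2. Observing call-stack depth
public class StackDepthDemo {

    static int depth = 0;

    static void recurse() {
        depth++;   // each call adds one frame to this thread's stack
        recurse();
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            // For example, running with -Xss512k shrinks the stack and
            // lowers the depth reported here.
            System.out.println("Stack overflowed after " + depth + " frames");
        }
    }
}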

Object reachability

Another problem with a deep call stack is that object references can be held in the call stack but never used again. In Figure 1, for example, it is unlikely that all the local variables and parameters of all the methods in the call stack are needed when the thread is executing the deepest method in the execution flow. (When a thread executes DAO-layer code, for instance, the application probably does not need the local variables and parameters pushed by the servlet-layer, controller-layer, facade-layer, and other layer method calls.) That memory nevertheless cannot be released or garbage collected, because the stack frames still hold live references to it.

The Java™ call-stack implementation is designed to release all of a frame's references automatically when the method call returns. This might be acceptable when the JVM is not under high load, but it can be a problem when the JVM is running a high number of active threads. For example, if each thread holds up to 5MB of unused live references in its call stack, and 100 threads are active, the JVM will be unable to garbage collect 500MB of heap space, because it is still referenced by call-stack variables and parameters. On a 32-bit machine, this could amount to at least 25 percent of all memory available to that JVM, which is a considerable size.
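The following sketch makes the problem visible with a hypothetical layered call chain (the layer names are illustrative, not from any framework): the controller's 5MB buffer stays strongly reachable from its stack frame for as long as the deeper layers are still executing:

Listing 3. Stack frames pinning unused objects
public class ReachabilityDemo {

    static void controller() {
        byte[] requestBuffer = new byte[5 * 1024 * 1024]; // 5MB local variable
        String query = parse(requestBuffer);
        facade(query);
        // Only when controller() returns does requestBuffer become
        // unreachable; while facade() and dao() run, the 5MB is pinned.
    }

    static String parse(byte[] buffer) {
        return "query"; // stand-in for real request parsing
    }

    static void facade(String query) {
        dao(query);
    }

    static void dao(String query) {
        // Deepest frame: the local variables of every caller above are
        // still live on this thread's stack, so the GC cannot reclaim them.
    }

    public static void main(String[] args) {
        controller();
    }
}

One workaround is to null out large local variables as soon as they are no longer needed, but doing that by hand throughout a layered application is error-prone.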


Shared objects

Another critical problem with thread-based parallelism is the synchronization required because objects shared by multiple threads are mutable, as shown in Figure 2:

Figure 2. Shared memory
Image showing how objects are shared among threads

Though synchronization is nothing new and has been widely adopted, it penalizes application performance: a thread that cannot acquire a lock must wait or sleep until the lock is released, which internally triggers a thread context switch. A context switch generally slows down thread execution, and it also flushes the core's instruction pipeline and caches. In a JVM with many parallel threads, lock contention can therefore cause frequent context switches.
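The cost is easy to provoke. In this minimal sketch, 100 threads funnel through one monitor, so most of them spend their time blocked rather than computing (the thread and iteration counts are arbitrary):

Listing 4. Lock contention on a shared object
public class ContentionDemo {

    private long counter = 0;

    // Every caller must acquire the same monitor; under load, most
    // threads block here, and each block/unblock is a context switch.
    synchronized void increment() {
        counter++;
    }

    public static void main(String[] args) throws InterruptedException {
        final ContentionDemo shared = new ContentionDemo();
        Thread[] threads = new Thread[100];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < 100000; j++) {
                        shared.increment();
                    }
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("counter = " + shared.counter); // always 10000000
    }
}

Classes such as java.util.concurrent.atomic.AtomicLong avoid the monitor by using compare-and-swap instructions, but they cover only simple cases; composite shared state still needs locks.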


Sequential programming

Sequential programming is not necessarily a problem with threads themselves, but it is related to the way an application uses them. The logical concept of the OS process was devised in the early days of computing for executing the instructions (in a user-submitted job) sequentially. But the sequential-programming mindset still prevails, even though the complexity of some processes has increased manyfold since then. As complexity has increased, various system layers (back end, middle tier, front end) have come into existence. But within a layer, application use-cases are still executed in a sequential manner with a single thread as the weaver of all logic across a variety of components.

You could compare this to manufacturing processes in the era before Henry Ford's assembly line was introduced. Then, a single worker or team of workers would create an entire product. An assembly line enables workers to concentrate on a specific subtask within the overall manufacturing process. It improves productivity manyfold by saving the time workers would otherwise spend moving through the stages of product manufacturing.

A modern-day analogy to the assembly line is customer-order processing by a fast-food restaurant. A predefined number of workers, each specialized in a set of subtasks, process the order, with each worker doing only a portion of the overall work. Once that person's part of the work is done, the semi-finished product is handed to the next worker in the chain, and so on until the final product is complete. In contrast, consider a system in which each worker handles one customer at a time from start to finish. Both are valid ways of executing orders, but the fast-food system is more productive. A single worker who processes an entire order spends too much time moving from place to place instead of actually making the product, and that movement creates other problems, such as space contention and time delays.

Now think of the way a modern Java EE application server executes a user request: it allots one dedicated thread to each user request. As illustrated in Figure 3, that thread executes all the instructions — logging, database interaction, web-service invocation, network interaction, logic computation, and so on:

Figure 3. Thread flow

No matter how well the source code is modularized in terms of controller, model, view, facade, and other layers, it is executed by a single thread. This type of execution creates a good deal of hardware-resource contention, such as frequent context switches.
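What might the assembly-line alternative look like in code? The following sketch illustrates staged execution in plain Java SE (the stage names and pool sizes are made up for illustration; this is not a real framework): each stage has its own small thread pool, and the request is handed from stage to stage instead of one thread walking it through every layer:

Listing 5. A staged, assembly-line-style pipeline
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class StagedPipeline {

    static final ExecutorService parseStage  = Executors.newFixedThreadPool(2);
    static final ExecutorService dbStage     = Executors.newFixedThreadPool(4);
    static final ExecutorService renderStage = Executors.newFixedThreadPool(2);

    static void handle(final String request) {
        parseStage.execute(new Runnable() {
            public void run() {
                final String parsed = request.trim(); // stand-in for parsing
                dbStage.execute(new Runnable() {
                    public void run() {
                        final String data = "rows-for-" + parsed; // stand-in for a DB call
                        renderStage.execute(new Runnable() {
                            public void run() {
                                System.out.println("response: " + data);
                            }
                        });
                    }
                });
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        handle("GET /orders/42");
        Thread.sleep(1000); // crude wait for the pipeline to drain in this sketch
        parseStage.shutdown();
        dbStage.shutdown();
        renderStage.shutdown();
    }
}

Each worker pool stays on its own subtask, much like a station on an assembly line, and no single thread has to carry a request through every layer.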


Conclusion

Multithreading is an excellent way of using underlying CPU resources as efficiently as possible. But as systems have evolved, the development and OS communities have extended multithreading to application-level parallelism as well, and the application-development community has used thread-based programming to execute all application logic sequentially. As multicore CPUs have gained ground, with the number of cores increasing steadily, this sequential, explicit thread-based style has become less and less efficient.

Scalable, high-performance applications running on multicore hardware require a parallelism methodology that breaks application logic into slices of multiple interdependent work units and chains them together transparently (as opposed to tying them together explicitly with a single thread), so that each individual work unit can execute efficiently.

Just as the assembly line revolutionized the manufacturing process and introduced efficiency at every stage, the right future programming model will change the way we design application software. One such abstraction, actor-based programming, divides the entire application into multiple slices, so that the underlying cores can be assigned to those slices and execute them in parallel in an efficient manner.
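To make the idea concrete, here is a deliberately tiny actor sketch in plain Java (real actor frameworks, such as Akka, offer far more; this is only the core idea): an actor owns its state privately and processes messages from a mailbox one at a time, so no locks are needed:

Listing 6. A minimal actor with a mailbox
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class MiniActor implements Runnable {

    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<String>();
    private int messagesSeen = 0; // private state; only the actor's thread touches it

    public void send(String message) {
        mailbox.offer(message); // the only way other threads interact with the actor
    }

    public void run() {
        try {
            while (true) {
                String message = mailbox.take(); // one message at a time
                messagesSeen++;
                System.out.println("handled '" + message + "' (#" + messagesSeen + ")");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // interruption ends the actor
        }
    }

    public static void main(String[] args) throws InterruptedException {
        MiniActor actor = new MiniActor();
        Thread actorThread = new Thread(actor);
        actorThread.start();
        actor.send("hello");
        actor.send("world");
        Thread.sleep(500);       // let the actor drain its mailbox
        actorThread.interrupt(); // shut the actor down
    }
}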

Disclaimer

All opinions and views in this article are solely mine and not necessarily those of my employer.

Acknowledgment

I would like to thank my colleagues Jesus Bello and Olga Raskin for their valuable suggestions.

