The new input/output (NIO) library, introduced with JDK 1.4, provides high-speed, non-blocking, asynchronous I/O capabilities to standard Java programs. Asynchronous I/O allows applications to read and write data without blocking. Normally, when an application makes a read() call, the code blocks until there is data to be read. Likewise, a write() call blocks until the data can be written.
Asynchronous I/O calls, on the other hand, do not block. Instead, an application registers interest in I/O events -- the arrival of readable data, a new socket connection, and so on -- and the system tells the application when such an event occurs.
One of the advantages of asynchronous I/O is that it allows an application to handle I/O operations from a great many inputs and outputs at the same time. It also leaves the application more CPU time for other processing while the I/O is taking place. This article first demonstrates the overhead incurred by traditional polling mechanisms and shows how the pollset interface improves performance and scalability. It then shows the performance improvement measured on a Java EE 5 application server.
When dealing with multiple file descriptors, an application typically sets each file descriptor as non-blocking (as shown in Listing 1) and issues a read on one file descriptor at a time.
Listing 1. How to configure an I/O channel as non-blocking
DatagramChannel channel1 = DatagramChannel.open();
channel1.configureBlocking(false);
If data is present, it is read and processed. If there is no data to read, the read call returns immediately, and you do the same thing for the next file descriptor. After waiting for some amount of time, you start over, repeatedly reading each file descriptor. This method is called busy-wait polling.
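The busy-wait loop just described can be sketched in NIO terms as follows. This is a minimal illustration under assumptions of mine (a single DatagramChannel, a bounded loop, and a 10 ms wait), not code from the original article:

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.util.List;

public class BusyWaitPoll {
    public static void main(String[] args) throws Exception {
        DatagramChannel channel1 = DatagramChannel.open()
                .bind(new InetSocketAddress(0));
        channel1.configureBlocking(false);
        List<DatagramChannel> channels = List.of(channel1);

        ByteBuffer buf = ByteBuffer.allocate(1024);
        for (int i = 0; i < 3; i++) {          // bounded loop for illustration
            for (DatagramChannel ch : channels) {
                buf.clear();
                // Non-blocking receive: returns immediately with null
                // when no datagram is available.
                if (ch.receive(buf) != null) {
                    buf.flip();
                    // ... process the datagram ...
                }
            }
            Thread.sleep(10);                  // wait before polling again
        }
        channel1.close();
        System.out.println("done polling");
    }
}
```

Note how every pass issues one receive() per channel whether or not data is present; this is exactly the wasted work the article criticizes next.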
The busy-wait polling method severely impacts efficiency because of two problems:
- It wastes many CPU cycles issuing read() system calls when there is nothing to read on the given file descriptor.
- It cannot respond to a file descriptor immediately when the data becomes ready.
Busy-wait polling should be especially avoided on a multitasking system.
To resolve these problems, the poll() API was introduced in UNIX® System V Release 3 (SVR3) and has since become part of the POSIX standard. Basically, an application provides the kernel with a list of file descriptors that it needs to monitor for read/write/error conditions, along with a timeout value. The kernel registers the process/thread with the associated device's select function and puts the process/thread to sleep. Once the associated device is ready or the timer has expired, the kernel wakes the registered process/thread. This method dramatically reduces I/O overhead; it eliminates a large number of system calls and data copies between kernel and user spaces. Furthermore, the application can respond to an I/O event immediately.
The Java NIO library introduces the Selector class to provide the same capability to Java applications. Any Java application can open a selector, obtaining its associated data structure, with an open() call, as shown in Listing 2.
Listing 2. How to obtain a selector object
Selector selector = Selector.open();
The application then registers channels (file descriptors) and the operations of interest with the selector through a register() call on the channel. For example, if the application wants to know when a particular channel becomes ready for reading, it registers the channel with the selector for the read operation, as shown in Listing 3.
Listing 3. Register channel into selector
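The body of Listing 3 is missing from this copy of the article. A minimal reconstruction, assuming the channel1 and selector objects from Listings 1 and 2, might look like this:

```java
import java.nio.channels.DatagramChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class RegisterChannel {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        DatagramChannel channel1 = DatagramChannel.open();
        channel1.configureBlocking(false);   // must be non-blocking to register

        // Register interest in read readiness on this channel.
        SelectionKey key = channel1.register(selector, SelectionKey.OP_READ);
        System.out.println(key.interestOps() == SelectionKey.OP_READ);
    }
}
```

The register() call returns a SelectionKey representing the channel's registration with the selector.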
Figure 1. Traditional poll() approach
Figure 1 shows that the selector internally keeps these details until the application calls the select() method on the selector, as shown in Listing 4. The selector then copies the channels and their operations of interest into the kernel space and lets the kernel do the actual polling for the application.
Listing 4. How to initiate polling
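The body of Listing 4 is also missing here. A minimal sketch of initiating the poll (the DatagramChannel setup and the 100 ms timeout are assumptions for illustration):

```java
import java.nio.channels.DatagramChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class InitiatePolling {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        DatagramChannel channel1 = DatagramChannel.open();
        channel1.configureBlocking(false);
        channel1.register(selector, SelectionKey.OP_READ);

        // Block for at most 100 ms waiting for registered events;
        // returns the number of channels that became ready.
        int ready = selector.select(100);
        for (SelectionKey key : selector.selectedKeys()) {
            // ... perform I/O only on the ready channels ...
        }
        System.out.println("ready channels: " + ready);
    }
}
```

A no-argument select() blocks until at least one registered event occurs; the timed variant shown here returns after the timeout even if nothing is ready.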
A select() call returns a list of file descriptors for which at least one registered event has occurred. The application can then perform an I/O operation only on those file descriptors. This method dramatically reduces the overhead due to a large number of system calls and data copies between kernel and user spaces.
The selector internally calls a native poll() function (shown in Listing 5), which provides a mechanism for multiplexing inputs and outputs over a set of file descriptors:
Listing 5. The signature of poll() API
int poll(struct pollfd fds[], nfds_t nfds, int timeout);
The traditional polling method, however, has a scalability issue; it does not scale well to a large number of file descriptors. The fundamental problem is that the amount of work done per poll operation scales linearly with the number of file descriptors. Many new APIs have been proposed to improve scalability, such as /dev/poll, real-time signals, I/O completion ports, /dev/epoll, and kernel queues. There has been considerable debate as to which API is the best long-term solution (see [POLLCMP]).
What aspects of poll() affect the scalability?
- Each poll() call provides a list of file descriptors to be polled. The list is copied into the kernel space for each call. Red-colored events in Figure 1 show the redundant copy.
- Polling an object involves first establishing a hold count on the file descriptor and then calling through the select fileop associated with the file descriptor.
- The primary path length difference between asynchronous and synchronous polls is the allocation and eventual clean up of control blocks.
- As a last step in a poll operation, all control blocks are cleaned up. Each control block must be removed from an object bound to the block. This requires the poll method to lock the object.
If poll() is called in a loop, these expensive system calls involved in polling can dramatically affect the overall performance when a large number of file descriptors are monitored.
To make poll() scalable to a large number of file descriptors, the AIX pollset interface provides two optimizations. The first is to reduce the amount of information transferred between kernel and user spaces on each poll operation, as shown in Figure 2. The pollset interface creates and maintains a file descriptor set and its events of interest in the native (kernel) pollset layer. An application then registers the file descriptors and events of interest directly in the native pollset layer. Unlike poll(), the pollset interface does not require the selector to copy the entire file descriptor set each time select() is called. Instead, it copies only the events that have been newly registered since the previous select() call.
Figure 2. Pollset() approach
The second optimization is to use a pollcache mechanism within the kernel. It maintains the file descriptor state on the requested file descriptor set across system calls. The state is tracked by polling busy file descriptors at the beginning of each poll operation. The state of idle file descriptors is known since the pollcache service is notified once when it changes.
Figure 3. Pollcache internal
Figure 3 shows the components of a pollcache and their relationships. The pollcache manages a potentially large set of file descriptors. Each file descriptor in the set is described by a pollcache control block (pccb). Each pccb can be located in the pollcache through a file descriptor hash. A pending list is maintained to identify pccbs that have had a recent state transition. Each subsystem that supports select/poll registers with the pollcache. When the state of a file descriptor changes, the subsystem notifies the pollcache, which triggers a state transition in the pollcache. To avoid the scaling problem of traditional poll()/select(), which must examine all the selected file descriptors, the pollcache uses state transitions to move only 'busy' pccbs to an event list. In this way, a poll operation does not need to visit every pccb in the pollset; only control blocks that have been added to an event list are serviced. The worst case occurs when the number of busy file descriptors approaches the size of the entire selected set and the number of file descriptors is fairly large. In that case, the pollset approach does not improve performance significantly over traditional select/poll.
The IBM® JDK supports the pollset interface starting with Version 6.0 Service Refresh 5. No change is required from the application perspective to enable the pollset interface. The java.nio.channels.spi.SelectorProvider class, by default, opens a pollset-based selector if it finds that the operating system supports the pollset interface. The NIO pollset selector uses the native pollset APIs shown in Listing 6 to improve application performance.
Listing 6. Native pollset interface set used by the NIO library
pollset_t ps = pollset_create(int maxfd);
int rc = pollset_destroy(pollset_t ps);
int rc = pollset_ctl(pollset_t ps, struct poll_ctl *pollctl_array, int array_length);
int nfound = pollset_poll(pollset_t ps, struct pollfd *polldata_array, int array_length, int timeout);
As mentioned earlier, a pollset selector creates a native pollset structure when an application opens the selector. The selector then registers file descriptors and the operations of interest in the native pollset structure when the application registers a channel. This means that every event registration requires the selector to make two mode switches. The first switch is from the Java API layer to the Java Native Interface (JNI) layer. The second is from the JNI layer to kernel space. These switches can hurt performance if an application registers a large number of channels.
Figure 4. Pollset() - Bulky update
To avoid such an excessive number of mode switches, the selector internally maintains a data structure (shown in Figure 4) that temporarily stores file descriptors to be registered until their count reaches a certain threshold value. Note that the selector registers all the pending file descriptors in the native pollset layer when the application calls select() on the selector, even if the registration count has not reached the threshold.
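The batching idea can be sketched as follows. This is a hypothetical illustration: the class name, the threshold of 8, and the integer stand-ins for file descriptors are assumptions of mine, not the IBM JDK implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedRegistrar {
    private static final int THRESHOLD = 8;   // assumed flush threshold
    private final List<Integer> pendingFds = new ArrayList<>();
    int flushCount = 0;                       // for illustration only

    // Buffer a file descriptor instead of crossing into the
    // JNI and kernel layers on every single registration.
    void register(int fd) {
        pendingFds.add(fd);
        if (pendingFds.size() >= THRESHOLD) {
            flush();
        }
    }

    // Called unconditionally before polling, so no registration
    // is ever lost even when the threshold was not reached.
    void select() {
        flush();
        // ... call pollset_poll() through JNI here ...
    }

    private void flush() {
        if (pendingFds.isEmpty()) return;
        // ... one JNI call handing the whole batch to pollset_ctl() ...
        pendingFds.clear();
        flushCount++;
    }

    public static void main(String[] args) {
        BatchedRegistrar r = new BatchedRegistrar();
        for (int fd = 0; fd < 10; fd++) r.register(fd);  // one flush at the 8th fd
        r.select();                                      // flushes the remaining two
        System.out.println("flushes: " + r.flushCount);
    }
}
```

Ten registrations cost two batched crossings instead of ten, which is the whole point of the buffering scheme.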
Our experiments focus on an Ajax scenario, which allows a client browser to communicate with the server asynchronously. This technique can provide responsive user interfaces by, for example, displaying and updating a small popup window inside a browser window much more quickly than traditional Web pages, where a full page is downloaded and updated at a time. Thus, while the main code of the application runs on the server, the Ajax technique gives users the experience of an application running in the client browser.
Because each Ajax request causes a very simple transaction on the server to look up a database record, the server spends a relatively large portion of its CPU cycles in the operating system (see [ISPASS]). This characteristic motivated us to use the pollset API to increase throughput by reducing system time. The throughput of Ajax requests is an important metric because it directly affects the user experience when the server is heavily loaded.
Figure 5. Petstore environment
Figure 5 illustrates our experimental environment, which consists of three tiers: emulated clients, an application server, and a back-end database server. For the emulated clients, we used eight Linux®-based blade servers, each executing a client emulator based on the open-source Grinder tool (see [GRINDER]). For our Ajax experiments, we emulated 1280 clients in total, each of which repeatedly executes a loop in which the client selects a pet item at random and sends an Ajax request to the application server to retrieve the information on that pet. For the application server, we used a GlassFish application server (see [GLASSFISH]) running on an IBM BladeCenter® JS22 server with 4-core POWER6™ processors running at 4 GHz. For the back-end database server, we used a MySQL database running on an IBM BladeCenter HS21 with 8-core Intel Xeon E5320 processors.
We evaluated the performance benefit of using the pollset API in a Java driver. In our experiments, we focused on the Ajax request described previously, since this scenario stresses a client-server interaction pattern commonly seen in emerging Web 2.0 applications. Figure 6 shows the throughput results we measured with two drivers: one using poll() and the other using pollset(). The Y-axis shows the number of client requests per second. The X-axis shows the variation of the threshold value for the number of buffered file descriptors in our driver prototype with the pollset API. Our results show that the driver using the pollset API improves throughput by up to 13.3% over the original driver using the poll API.
Figure 6. The throughput performance of two drivers, one with poll() and the other with pollset().
We further analyzed the system time by using the curt command, part of the AIX tracing tools (see [AIX TOOL]), to understand how much the pollset API can reduce system time. Figure 7 shows the number of pollset_ctl() and pollset_poll() system calls per millisecond as we change the threshold value for the number of buffered file descriptors. As we increase the threshold value, the number of calls to pollset_ctl() decreases because each pollset_ctl() call can process more sockets.
Figure 7. Number of system calls per msec
Figure 8 further shows the ratio of CPU time spent in the poll(), pollset_ctl(), and pollset_poll() APIs. The original driver (shown in the left-most bar in Figure 8) spends 5.3% of its CPU time calling poll().
Figure 8. Time spent on CPU.
This article demonstrated the performance advantages of using the pollset interface over poll() with a pet store application. We have also shown that the pollset interface effectively reduces the amount of data transferred between kernel and user spaces because it queries only the busy file descriptors. The pollset interface performs best when the file descriptor set is not frequently updated.
- [ieee1003]: poll() - input/output multiplexing, The Open Group Base Specifications Issue 6, IEEE Std 1003.1, 2004 Edition
- [usenix.org]: A USENIX paper by Gaurav Banga on a scalable event-delivery system call
- [POLLCMP]: A comparison of the performance of different poll implementations
- Learn about Merlin's (JDK 1.4) new I/O capabilities
- See the javadoc for more information on the java.nio library
- High-Performance I/O with Java NIO
- You'll find hundreds of articles about every aspect of Java programming in the developerWorks Java technology zone
- [PETSTORE]: The Java PetStore 2.0 Reference Application
- [GRINDER]: The Grinder Load Testing Framework
- [GLASSFISH]: The GlassFish open source application server
- [ISPASS]: Moriyoshi Ohara, Priya Nagpurkar, Yohei Ueda, and Kazuaki Ishizaki, "The Data-centricity of Web 2.0 Workloads and its Impact on Server Performance," The 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2009), pp. 133-142, April 26-28, 2009
- [AIX TOOL]: AIX 5L Practical Performance Tools and Tuning Guide
Liang Jiang joined IBM in 2000, starting in the AIX back-end technical support team. He later moved to the AIX base kernel development team, where he works on AIX bring-up on POWER systems as well as other low-level components within the AIX kernel.
Moriyoshi Ohara is a senior researcher at IBM Tokyo Research Laboratory. He received a Ph.D. in electrical engineering from Stanford University in 1996. His current research interests include microprocessor architectures and workload characterizations for commercial servers.
Sathiskumar Palaniappan is a software engineer at IBM India Labs, Bangalore. He joined the IBM Java Technology Center in 2007 and has been part of the net and nio library development. He has worked in WebSphere Real Time functional testing and enjoys working with run-time technologies.
Thomas Chen is a Senior Technical Staff Member in IBM STG Integrated Systems Development. His responsibility is system design and performance, including enhancing the PowerPC Architecture and design, as well as characterizing emerging workloads for the design of future processors and servers. He has been awarded more than 20 patents spanning various technical areas. Dr. Chen was named an IBM Austin Master Inventor in 2007. His technical interests include processor micro-architecture design, I/O and networking subsystem design, workload analysis and characterization, and performance modeling and analysis. He received a Ph.D. degree in computer engineering from the State University of New York at Buffalo in 1989.