High-performance network programming, Part 2: Speed up processing at both the client and server

This article provides more techniques for UNIX®-based programmers who want to enhance their network throughput. Learn how to speed up processing at both the client and server using mmap, scatter/gather I/O, and other methods.

Girish Venkatachalam (girish1729@gmail.com), Open Source Consultant and Evangelist

Girish Venkatachalam has over ten years of experience as a UNIX programmer. He developed IPsec for the Nucleus operating system for an embedded system. His interests include cryptography, multimedia, networking, and embedded systems. He also likes to swim, cycle, and do yoga, and is a fitness freak. You can reach him at girish1729@gmail.com.



16 October 2007


Introduction

Part 1 of this series (see Resources) addressed some tricks for maximizing network utilization with advanced programming techniques such as non-blocking I/O. This article expands on a few more tricks you can use.

Most file storage these days is on hard disks. Because they are mechanical devices, hard disks can never approach the speed of primary storage such as RAM or, in some cases, even the network. SCSI or SATA disks are recommended, since they give a marked improvement in throughput over good old IDE disks.

You should also ensure that your disk is using DMA for transfers.

As a programmer, you try to figure out ways to reduce disk latency. Closely related concerns are minimizing the context switch overhead of system calls and the overhead of memory copies (memcpy(3)). Using a sane value for your TCP send and receive buffers helps reduce system call overhead. A value of 8192 (0x2000) bytes is typical in most UNIX® code.
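One way to apply this is with setsockopt(2) on the socket buffers. Below is a minimal sketch, assuming sockfd is a socket you have already created elsewhere; the same 8192-byte figure is also a common size for the user-space buffer handed to read(2) and write(2).

	/* Requires <sys/socket.h> and <err.h>.
	 * sockfd is assumed to be an existing TCP socket,
	 * for example from socket(AF_INET, SOCK_STREAM, 0). */
	int bufsz = 8192;	/* 0x2000 bytes */

	/* Ask for 8 KB kernel send and receive buffers; the
	 * kernel is free to round or clamp these values. */
	if (setsockopt(sockfd, SOL_SOCKET, SO_SNDBUF,
			&bufsz, sizeof(bufsz)) == -1)
		err(1, "setsockopt SO_SNDBUF");

	if (setsockopt(sockfd, SOL_SOCKET, SO_RCVBUF,
			&bufsz, sizeof(bufsz)) == -1)
		err(1, "setsockopt SO_RCVBUF");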

Of course, you should also:

  • Reduce the load on the client or server CPU by moving expensive code out of loops.
  • Never use sleep(3) or synchronization primitives.
  • Never ever think of using threads for enhancing network performance.

The versatile mmap(2) method

You've heard of memory-mapped I/O and I/O-mapped I/O; these are usually employed in device drivers and when talking to peripheral devices. On UNIX systems, the mmap(2) system call lets you map primary memory so that it directly points to a file on secondary storage, such as a hard disk.

mmap(2) is quite a versatile system call, much like select(2), but here you are concerned mainly with enhancing the performance of disk I/O and reducing memory copy overhead. Both can be achieved by using mmap(2) to read file contents into main memory instead of the read(2) system call or fread(3).

mmap(2) is able to give a performance boost because redundant buffer copies are done away with. However, the usage semantics are not entirely obvious. Before writing to a file through an mmap(2) mapping, you have to extend the file with an ftruncate(2) call to let the kernel know how much space to set aside for the mapping. Listing 1 shows the details.

Listing 1. mmap(2) file write
/**************************************************************/
/**************************************************************/

	/******************************************
	 *        mmap(2) file write              *
	 *                                        *
	 *****************************************/
	char *mm = NULL;

	fd = open(filename, O_RDWR | O_TRUNC | O_CREAT, 0644);
	if (-1 == fd)
		errx(1, "File write");
		/* NOT REACHED */

	/* If you don't do this, mmap(2)ing will never
	 * work for writing to files.
	 * If you don't know the file size in advance, as is
	 * often the case with data streaming from the
	 * network, you can use a large value here. Once you
	 * have written out the whole file, you can shrink it
	 * to the correct size by calling ftruncate(2) again.
	 */
	ret = ftruncate(fd, filelen);

	mm = mmap(NULL, filelen, PROT_READ | PROT_WRITE,
			MAP_SHARED, fd, 0);
	if (MAP_FAILED == mm)
		errx(1, "mmap() problem");
		/* NOT REACHED */

	memcpy(mm + off, buf, len);
	off += len;

	/* Please don't forget to unmap the mmap(2)ed memory! */
	munmap(mm, filelen);
	close(fd);

/**************************************************************/
/**************************************************************/

	/******************************************
	 *          mmap(2) file read             *
	 *                                        *
	 *****************************************/
	fd = open(filename, O_RDONLY, 0);
	if (-1 == fd)
		errx(1, "File read error");
		/* NOT REACHED */

	fstat(fd, &statbf);
	filelen = statbf.st_size;

	mm = mmap(NULL, filelen, PROT_READ, MAP_SHARED, fd, 0);
	if (MAP_FAILED == mm)
		errx(1, "mmap() error");
		/* NOT REACHED */

	/* From here on, the mm pointer dishes out the file
	 * data directly; no read(2) calls are needed.
	 */
	bufptr = mm + off;

	/* You can copy mmap(2)ed memory straight into the
	 * network buffer for sending. */
	memcpy(pkt.buf + filenameoff, bufptr, bytes);

	/* Please don't forget to unmap the mmap(2)ed memory! */
	munmap(mm, filelen);
	close(fd);

You can use mmap(2) both for reading files to push their contents down the network and for dumping data arriving from the network into the file system on the other side.
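To make the sending side concrete, here is a minimal sketch that mmap(2)s a file and writes it to a connected socket; sockfd and filename are assumptions, and the error handling is pared down to the essentials.

	/* Requires <sys/mman.h>, <sys/stat.h>, <fcntl.h>,
	 * <unistd.h>, <errno.h>, and <err.h>.
	 * sockfd is assumed to be a connected, blocking socket. */
	int fd;
	char *mm;
	off_t sent = 0;
	ssize_t n;
	struct stat statbf;

	fd = open(filename, O_RDONLY, 0);
	if (-1 == fd)
		err(1, "open");
	if (fstat(fd, &statbf) == -1)
		err(1, "fstat");

	mm = mmap(NULL, statbf.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (MAP_FAILED == mm)
		err(1, "mmap");

	/* Push the whole mapping down the socket; no
	 * intermediate read(2) buffer and no extra copy. */
	while (sent < statbf.st_size) {
		n = write(sockfd, mm + sent, statbf.st_size - sent);
		if (n == -1) {
			if (errno == EINTR)
				continue;
			err(1, "write");
		}
		sent += n;
	}

	munmap(mm, statbf.st_size);
	close(fd);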

mmap(2) usually helps, but at times it can hurt; for example, mmap(2)ing a region of a file that lives on an NFS mount can cause problems. In most other situations, though, it is better to use mmap(2) wherever possible.

Scatter/gather I/O with readv or writev

In addition to mmap(2), you can use another technique, called uio or scatter/gather I/O, to speed up processing at the client and server. Instead of using a single array of bytes as a buffer, you directly manipulate an array of buffers, each of which can point to data from a different source or destination. This technique has limited applicability, but it can save you on a rainy day. For instance, you can keep protocol headers and payload in separate locations and combine them without copying by using a single writev(2) instead of several write(2) calls, or a single write(2) preceded by several memcpy(3) calls, as the sketch below shows.
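Here is a minimal sketch of that header-plus-payload case; hdr, hdrlen, payload, paylen, and sockfd are made-up names for buffers and a socket that exist elsewhere in your program.

	/* Requires <sys/uio.h> and <err.h>.
	 * One writev(2) call sends the header and the payload
	 * from two unrelated buffers with no memcpy(3). */
	struct iovec vec[2];
	ssize_t rv;

	vec[0].iov_base = hdr;
	vec[0].iov_len  = hdrlen;
	vec[1].iov_base = payload;
	vec[1].iov_len  = paylen;

	rv = writev(sockfd, vec, 2);
	if (rv == -1)
		err(1, "writev");
	/* rv can still be a partial write; Listing 2 below
	 * shows how to deal with that. */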

To complicate matters, using uio with non-blocking I/O is not trivial, as Listing 2 below shows.

Listing 2. using uio with non-blocking I/O is not trivial
int
writeiovall(int fd, struct iovec *vec, int nvec) {

	int i;
	ssize_t rv, bytes = 0;	/* bytes: running total written */

	i = 0;
	while (i < nvec) {
		do {
			rv = writev(fd, &vec[i], nvec - i);
		} while (rv == -1 &&
				(errno == EINTR || errno == EAGAIN));

		if (rv == -1) {
			if (errno != EINTR && errno != EAGAIN) {
				perror("writev");
			}
			return -1;
		}
		bytes += rv;
		/* Recalculate vec to deal with partial writes */
		while (rv > 0) {
			if ((size_t) rv < vec[i].iov_len) {
				vec[i].iov_base = (char *) vec[i].iov_base + rv;
				vec[i].iov_len -= rv;
				rv = 0;
			}
			else {
				rv -= vec[i].iov_len;
				++i;
			}
		}
	}

	/* We get here only after everything has been written out */

	return 0;

}

In this code, a single socket writev(2) can write part of one uio buffer, write a few uio buffers completely, or write a few buffers completely and then one of them only partially. Figure 1 helps to show the difficulty.

Figure 1. uio with non-blocking sockets
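A caller might use the helper from Listing 2 along these lines; this is only a sketch, and vec, nvec, and sockfd are assumed to have been set up as in the earlier writev(2) fragment.

	/* Requires <fcntl.h> and <err.h>. Put the socket into
	 * non-blocking mode first, then let writeiovall() worry
	 * about partial writes, EINTR, and EAGAIN. */
	int flags;

	flags = fcntl(sockfd, F_GETFL, 0);
	if (flags == -1 ||
	    fcntl(sockfd, F_SETFL, flags | O_NONBLOCK) == -1)
		err(1, "fcntl");

	if (writeiovall(sockfd, vec, nvec) == -1)
		errx(1, "writeiovall failed");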

Other advanced techniques

The BSD kernel code uses a linked list of buffers called mbufs. You can read about them on any BSD system in the mbuf(9) man page, or in the many papers written on the topic. A thorough discussion is beyond the scope of this article, but a brief mention is in order.

Basically, mbufs are a linked list of buffers, with each buffer holding anywhere between 128 and 1024 bytes; typically they hold around 128 bytes.

Since it is the kernel's responsibility to drain the hardware buffers on the network interface card as soon as data arrives, the mbuf subsystem has to be very efficient at processing network data. On a gigabit network, the kernel can get very loaded if this is not done properly.

Each mbuf contains three important pieces of information:

  • Where the data begins
  • Size of data
  • Type of data

An mbuf contains a lot more, but these are the key fields that help processing.

There are dozens of macros and functions that help you retrieve headers, strip headers after processing, and append and prepend data. It is only by using such a sophisticated framework that the kernel is able to do its job so efficiently.

If you are really interested in squeezing the last bit of performance out of your application, you could try implementing something similar for your user-level protocol handling. Just make sure that the overhead of this code does not outweigh the savings in your other processing.
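As a rough illustration of what such a user-level buffer chain might look like, here is a toy sketch; the ubuf name and its fields are invented for this article, and the real mbuf(9) framework is far more elaborate.

/* Requires <stdlib.h> and <stddef.h>.
 * Each node records where its data begins, how long it is,
 * and what type of data it carries, much like an mbuf. */
#define UBUF_SIZE 128

struct ubuf {
	struct ubuf	*next;		/* next buffer in the chain */
	char		*data;		/* where the data begins */
	size_t		 len;		/* size of the data */
	int		 type;		/* type of data */
	char		 store[UBUF_SIZE];
};

/* Prepend room for a header without copying the payload:
 * allocate a fresh node and link it ahead of the chain.
 * hdrlen must not exceed UBUF_SIZE. */
struct ubuf *
ubuf_prepend(struct ubuf *chain, size_t hdrlen, int type)
{
	struct ubuf *b = calloc(1, sizeof(*b));

	if (b == NULL)
		return NULL;
	b->data = b->store + UBUF_SIZE - hdrlen;
	b->len  = hdrlen;
	b->type = type;
	b->next = chain;
	return b;
}

Appending data, stripping headers after processing, and walking the chain follow the same pattern.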

Instead of using ordinary disks, you could use flash memory, write a customized disk driver, or go for RAID.

Enhancing performance is a never-ending story. It goes on and on and on...

Resources

Learn

  • Part 1 of this series covers tricks for maximizing network utilization with advanced programming techniques such as non-blocking I/O.
