Speaking UNIX

Peering into pipes

Track the progress of protracted operations with Pipe Viewer


One of the cleverest and most powerful innovations in UNIX is the shell. It's more efficient than a GUI, and you can write scripts to automate many tasks. Better yet, the pipe operator assembles ad hoc programs right at the command line. The pipe chains commands in sequence, where the output of an earlier command becomes the input of a subsequent command.

But the pipe has one major drawback: It's something of a black box. If you string commands together, the only evidence of progress is the output that the last command in the series generates. Yes, you can interject tee in the sequence, and you can watch an output file grow with tail, but those solutions are best used only once per pipeline, lest the standard output (stdout) and standard error (stderr) of multiple phases commingle. Further, both are crude indicators that likely mask how much computation each step requires.

Of course, you could deconstruct a complex sequence into multiple individual steps, each with its own interim output file. And indeed, if you want to verify results at each interval, decomposition is ideal. Write a script, produce one data file for each step, use a data file between each pair of steps as input, and collect the final file as the ultimate result. However, such a practice is not well suited to the impromptu nature of the command line.

What's needed is a progress meter that you can embed in the command line to measure throughput. Ideally, the meter could be repeated to benchmark each step—and because the sky's the limit, the tool would be open source and portable to multiple UNIX variants, such as Linux® and Mac OS X.

Well, wish no more: Pipe Viewer (pv), written by systems administrator Andrew Wood and enhanced by many other developers over the course of the past four years, provides a peek into command-line "plumbing." As stated on its project page, pv "can be inserted into [a] pipeline between two processes to give a visual indication of how quickly data is passing through, how much time has elapsed so far, and how near completion [it is]." Remarkably, you can also insert multiple instances of pv into the same command line to show relative throughput.

This article shows you how to build pv on a UNIX system and apply it to simple and complex command-line combinations. Let's start, though, with a review of how pipes connect processes.

UNIX pipes: Plumbing for processes

Figure 1 shows the steps for creating a pipe to connect two independent processes.

Figure 1. Creating a pipe to connect two processes
Steps used to create a pipe

At the outset, in Phase 1, the progenitor process reads from standard input (stdin), writes output to standard output (stdout), and emits errors to standard error (stderr). Each of stdin, stdout, and stderr is a file descriptor, or a handle to a file. Each operation on a file descriptor (open, read, write, rewind, truncate, and close, for example) affects the state of the file.

Next, in Phase 2, the progenitor creates a pipe. A pipe is composed of a queue and two file descriptors: one to enqueue data and the other to dequeue data. A pipe is a first-in, first-out (FIFO) data structure.

By itself, a pipe has little use; its purpose is to connect a producer to a consumer. Hence, the progenitor forks, or creates, a second process in Phase 3 to act as a counterpart.

In Phase 4 (and assuming that the new process is the consumer), the original process replaces its stdout with the producer end of the pipe and rewires the newly forked process to treat the consumer end of the pipe as its stdin. After these adjustments, each write by the original process (now the producer) is enqueued and subsequently read by the new process (now the consumer).

Phases 1 through 4 mirror the process your shell uses to connect one utility to another with the command-line pipe operator (|), although the shell spawns a new process for each utility and leaves itself untouched to perform job control.
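You can approximate this plumbing at the prompt with a named pipe. The sketch below (assuming a POSIX shell and the standard mkfifo utility) connects a background producer to a foreground consumer through a FIFO, mirroring Phases 2 through 4:

```shell
# Create a named pipe to stand in for the kernel pipe the shell builds
fifo=$(mktemp -u)
mkfifo "$fifo"

# Producer: write three lines into the pipe from a background process
printf 'one\ntwo\nthree\n' > "$fifo" &

# Consumer: read the enqueued data; lines arrive first-in, first-out
out=$(cat "$fifo")

wait            # reap the background producer
rm -f "$fifo"   # a named pipe persists until removed
echo "$out"
```

Unlike the anonymous pipe the shell creates for |, a named pipe persists in the file system until you delete it, but the queue semantics are identical.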

For example, Figure 2 shows how find, grep, and wc might be connected via pipes to find and count all files with names that begin with lowercase a. The shell remains independent; find is a producer, while grep acts as a consumer (for find) and as a producer (for wc). wc acts as a consumer and a producer, too: It consumes from grep and produces output to stdout. Typically, the shell connects stdout to a terminal, but redirection can reroute the output to a file.

Figure 2. Connecting commands using pipes
Connecting commands to find and count files with names that begin with a

If you want two UNIX processes to exchange data in both directions, create two pipes and rewire the file descriptors of each process so that each acts both as a producer and a consumer. Figure 3 shows an interprocess exchange that overrides both processes' stdin and stdout.

Figure 3. Looking into two UNIX processes
An interprocess exchange between two UNIX processes
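The double-pipe arrangement of Figure 3 can likewise be approximated with two named pipes, one per direction. In this sketch (again assuming a POSIX shell; the FIFO names are illustrative), a background process consumes a request from one FIFO and produces a reply on the other:

```shell
# Two named pipes emulate the two kernel pipes: one per direction
dir=$(mktemp -d)
mkfifo "$dir/a2b" "$dir/b2a"

# Process B: consume a request from a2b, produce a reply on b2a
( read -r req < "$dir/a2b"; echo "echo:$req" > "$dir/b2a" ) &

# Process A: produce a request, then consume the reply
echo "ping" > "$dir/a2b"
read -r reply < "$dir/b2a"

wait
rm -r "$dir"
echo "$reply"
```

Deadlock is a real hazard with bidirectional pipes: each process must open and use the two ends in a compatible order, which is why the request strictly precedes the reply here.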

Given that brief review, let's look at Pipe Viewer.

Pipe Viewer: Conspicuous conduit

Pipe Viewer is an open source application. You can download its source code and build the application from scratch or, if available, pull an existing binary from your UNIX distribution's repository.

To build from scratch, download the latest source tarball from the Pipe Viewer project page (see Related topics). As of mid-September 2009, the latest version of the code is 1.1.4. Unpack the tarball, change to the newly created directory, and type ./configure followed by make and sudo make install. By default, the build process installs the executable named pv into /usr/local/bin. (For a list of configuration options, type ./configure --help.) Listing 1 shows the installation code.

Listing 1. Pipe Viewer installation code
$ wget
$ tar xjf pv-1.1.4.tar.bz2
$ cd pv-1.1.4
$ ./configure
$ make
$ sudo make install
$ which pv

To pull the pv binary from a repository, use your distribution's package manager and search for either pv or pipe viewer. For example, a search using Ubuntu version 9's APT package manager yields this match:

$ apt-cache search pipe viewer
pv - Shell pipeline element to meter data passing through

To continue, use your package manager to download and install the package. For Ubuntu, the command is apt-get install:

$ sudo apt-get install pv

Once installed, give pv a try. The simplest use replaces the traditional cat utility with pv to feed bytes to another program and measure overall throughput. For instance, you can use pv to monitor a lengthy compress operation:

$ ls -lh listings.txt
-r--r--r--  1 supergiantrobot  staff   109M Sep  1 20:47 listings.txt
$ pv listings.txt | gzip > listings.gz
96.1MB 0:00:09 [11.3MB/s] [=====================>     ] 87% ETA 0:00:01

When the command launches, pv posts a progress bar and continually updates the gauge to show headway. From left to right, the typical pv display shows how much data has been processed so far, the time elapsed, throughput in megabytes/second, a visual and numeric representation of work complete, and an estimate of how much time remains. In the display above, 96.1MB of 109MB has been processed, leaving about 13 percent of the file to go after 9 seconds of work.

By default, pv renders all the status indicators for which it is able to calculate values. For instance, if the input to pv is not a file and no specific size is manually specified, the progress bar advances from left to right to show activity, but it cannot measure the percent complete without a baseline. Here's an example:

$ ssh faraway tar cf - projectx | pv --wait > projectx.tar
4.34MB 0:00:07 [ 611kB/s] [      <=>                  ]

This example runs tar on a remote machine and sends the output of the remote command to the local system to create projectx.tar. Because pv cannot calculate the total number of bytes to expect in the transfer, it shows throughput so far, time elapsed, and a special indicator that reflects activity. The little "car" (<=>) travels left to right as long as data is streaming through.

The --wait option delays the rendering of the progress meter(s) until the first byte is actually received. Here, --wait is useful, because the ssh command may prompt for a password.

You can enable individual indicators at your discretion with eponymous flags:

$ ssh faraway tar cf - projectx | \
  pv --wait --bytes > projectx.tar

The latter command enables the running byte count with --bytes. The other options are --progress, --timer, --eta, --rate, and --numeric. If you specify one or more display options, all remaining (unnamed) indicators are automatically disabled.

There is one other simple use of pv. The --rate-limit option can throttle throughput. The argument to this option is a number and a suffix, such as m to indicate megabytes/second:

$ ssh faraway tar cf - projectx | \
  pv --wait --quiet --rate-limit 1m > projectx.tar

The previous command hides all indicators (--quiet) and limits throughput to 1MB/s.

Advanced usage of Pipe Viewer

So far, the examples shown employ a single instance of Pipe Viewer as the producer or consumer in a pair of commands. However, more complex combinations are also possible. You can use pv multiple times in the same command line, with some provisos. Specifically, you must name each instance of pv using --name, and you must enable multiline mode with --cursor. Combined, the two options create a series of labeled indicators, one indicator per named instance.

For example, imagine you want to monitor the progress of a data transfer and its compression separately and simultaneously. You can assign one instance of pv to the former operation and another to the latter, like so:

$ ssh faraway tar cf - projectx | pv --wait --name ssh | \
  gzip | pv --wait --name gzip > projectx.tgz

After you type a password, the Pipe Viewer commands produce a two-line progress meter:

   ssh: 4.17MB 0:00:07 [ 648kB/s] [     <=>             ]
  gzip:  592kB 0:00:06 [62.1kB/s] [   <=>               ]

The first line is labeled ssh and shows the progress of the transfer; the second line, tagged gzip, shows the progress of the compression. Because neither command can determine the number of bytes in its respective operation, each line shows only the accumulated total and the activity bar.

If you know or are able to approximate or calculate the number of bytes in an operation, use the --size option. Adding this option provides some finer-grained detail in the progress bars.

For instance, if you want to monitor the progress of a significant archiving task, you can use other UNIX utilities to approximate the total size of the original files. The df utility can show statistics for an entire file system, while du can calculate the size of an arbitrarily deep hierarchy:

$ tar cf - work | pv --size `du -sh work | cut -f1` > work.tar

Here, the subshell command du -sh work | cut -f1 yields the total size of the work directory in a format compatible with pv. Namely, du -h produces a human-readable format such as 17M for 17 megabytes—perfect for use with pv. (The ls and df commands also support -h for human-readable format.) Because pv now expects a specific number of bytes to transit through the pipe, it can render a true progress bar:

700kB 0:00:07 [ 100kB/s] [>                    ]  4% ETA 0:02:47
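If you need a more precise total than du -h's rounded figure, GNU du can report exact bytes with -b. The sketch below (assuming GNU coreutils; the sample work directory is created just for illustration) derives both forms of a --size argument:

```shell
# Work in a scratch directory and build a small sample tree to measure
cd "$(mktemp -d)"
mkdir work
head -c 1048576 /dev/zero > work/blob   # exactly 1 MiB of file data

# Human-readable total (e.g. "1.0M"): pv accepts suffixed sizes like this
human=$(du -sh work | cut -f1)

# Exact byte total (GNU du only): makes pv's percentage precise
exact=$(du -sb work | cut -f1)

echo "human=$human exact=$exact"
```

Either value can be passed straight to pv, as in tar cf - work | pv --size "$exact" > work.tar.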

Finally, there is one additional technique you're sure to find useful. Besides counting bytes, Pipe Viewer can visualize progress by counting lines. If you specify the modifier --line-mode, pv advances the progress meter each time a newline is encountered. You can also provide --size, in which case the number is interpreted as the expected number of lines.

Here's an example. Oftentimes, find is helpful for locating a needle in a haystack, such as locating all the uses of a particular system call in a large body of application code. In such circumstances, you might run something like this:

$ find . -type f -name '*.c' -exec grep --files-with-matches fopen \{\} \; > results

This code finds all C source files and emits the file's name if the string fopen appears anywhere in the file. Output is collected in a file named results. To reflect activity, add pv to the mix:

$ find . -type f -name '*.c' -exec grep --files-with-matches fopen \{\} \; | \
  pv --line-mode > results

Line mode is phenomenal, because many UNIX commands, like find, operate on a file's metadata, not on the contents of the file. Line mode is ideal for systems administration scripts that copy or compress large collections of files.

In general, you can inject Pipe Viewer into command lines and scripts whenever rate is measurable. You may have to get creative, though. For example, to measure how quickly a directory is copied, switch from cp -pr to tar:

$ # an equivalent of cp -pr old/somedir new
$ (cd old; tar cf - somedir) | pv | (cd new; tar xf - )

You might also consider pv for use with networking utilities such as wget, curl, and scp. For instance, you can use pv to measure the progress of a sizable upload. And because many of the networking tools can take input from a file, you can use the length of such a file as the argument to --size.

A little gem

Pipe Viewer is one of those little-known gems that once you find it, you can't recall how you lived without it. You may find some applications of pv in your daily command-line use, but you are likely to find oodles of uses for it in your automation scripts. Rather than stare at a blinking cursor waiting patiently for some indication that all is well, you can now insert a probe to give you real-time feedback. Pipe Viewer adds a heartbeat to the soul of the machine.

Related topics
