Application programs commonly use at least
one form
of time measurement (for example, transaction time stamps). For that purpose,
POSIX libraries provide timing functions, such as gettimeofday, which
give an easy-to-use interface to application programmers. When the
application makes an intensive use of timing functions, a more efficient
implementation of the timing routines can improve the overall
performance of the program. This can be invaluable for lower-level tasks
such as code profiling or precise delays in device drivers and other
time-critical code.
On the Power Architecture platform, it is possible to further improve on the
performance delivered by gettimeofday. The following article presents an
efficient and precise time measurement technique since the UNIX® epoch
(January 1, 1970) for the 64-bit Linux® operating system on PowerPC and
Cell/B.E. PPE processors,
which can provide millisecond and nanosecond precision. Note, however, that some Linux versions provide a similar
implementation of gettimeofday as presented in this article, but you might notice it performs with
lesser precision.
This technique can also be implemented on the AIX operating system. However, the examples and system details described below correspond to the 64-bit Linux implementation.
The Power Architecture documentation (Book II: PowerPC Virtual Environment Architecture, see Resources) provides a counting register called Time Base (TB), which is used to keep track of system time. The TB register is incremented periodically with an implementation-dependent frequency, which may not be constant. It is the responsibility of the operating system (OS) to determine if the update frequency has changed and to make the necessary adjustments to its internal structures.
The PowerPC Architecture states that:
- The TB register is 64 bits long.
- Each update goes by increments of 1.
- The operating system must be capable of determining the update frequency.
- When the TB reaches its maximum value, it overflows and starts over from 0. There is no explicit indication of this and this is left to the OS to handle.
- The OS must initialize the TB register on power-on.
For further details see PowerPC Virtual Environment Architecture (find a link to this and other documents mentioned in this article in the Resources section at the bottom of this page). Note that the Cell/B.E. PPE also implements the TB register.
The TB register alone does not contain sufficient information for time calculation. The Power Architecture specification leaves much of the responsibility of handling the TB register to the operating system, which has to provide some additional data, such as the update frequency, Time Base at boot time, and so on. This whole calculation mechanism is fast and efficient.
In the case of the Linux operating system, the system libraries export
the following information in the systemcfg
structure:
Listing 1. Time Base-specific system information
struct systemcfg {
...
__u64 tb_orig_stamp; /* Time Base at boot */
__u64 tb_ticks_per_sec; /* Time Base ticks / sec */
__u64 tb_to_xs; /* Inverse of TB to 2^20 */
__u64 stamp_xsec;
__u64 tb_update_count; /* Time Base atomicity ctr */
...
};
|
where:
tb_orig_stamp: the value of the TB register at boot timetb_ticks_per_sec: TB register ticks per second (not used in the following calculations)tb_to_xs: used to convert from TB ticks to xsecondsstamp_xsec: contains the time at boot time in xsecondstb_update_count: is used as a count when tb is updated
This structure can be accessed by including the following header file:
#include <asm/systemcfg.h>
How to read the register in assembly language
The mftb mnemonic allows the value of the TB
register to be transferred to a general purpose register. Its syntax is:
mftb rx
The PowerPC Virtual Environment Architecture [1] provides more
technical details about this mnemonic and describes an algorithm to
calculate gettimeofday in assembly.
Note that some Linux distributions provide a library
function called get_tb(), which returns the
value of the TB register. It is implemented as a single mftb assembly instruction.
Using gettimeofday, the implementation of a nanosecond calculation is
as follows:
Listing 2. Obtaining current time of day with nanosecond precision
U64 gtod_nano()
{
struct timeval t;
gettimeofday(&t,NULL);
return t.tv_usec * 1000 + t.tv_sec * 1000000000;
}
|
Time calculation with nanosecond precision
Using the TB register and the values exposed in the systemcfg structure, elapsed time since the epoch with nanosecond precision can be calculated as:
Listing 3. Converting time value to absolute time
nanoseconds_since_epoch =
((( (<TB value> - tb_orig_stamp) * tb_to_xs)
+ stamp_xsec) * <nanoseconds per second>)
/ <xseconds per second>
|
where:
- nanoseconds per second is 1 000 000 000
- xseconds per second is 2^20
In more detail, the calculation can be broken down into:
- Calculate the Time Base ticks elapsed since boot time.
This is the difference between the current value of the TB register and
tb_orig_stamp.elapsed_ticks_since_boot = <TB value> - tb_orig_stamp
-
Calculate the number of elapsed xseconds since boot. This is the number
of elapsed ticks since boot time (1) converted to xseconds using
tb_to_xs.
elapsed_xseconds_since_boot = elapsed_ticks_since_boot * tb_to_xs
-
Calculate the number of xseconds since the epoch. This is the number of
elapsed xseconds since boot (2) added to the time (represented in
xseconds) at boot time stored in
stamp_xsec.xseconds_since_epoch = elapsed_xseconds_since_boot + stamp_xsec
-
Convert the value of xseconds since the epoch (3) to nanoseconds. This
can be done dividing by the number of xseconds in each second and
multiplying by the number of nanoseconds in a second.
nanoseconds_since_epoch = (xseconds_since_epoch * <nanoseconds per second>) / <xseconds per second>
Calculating time with millisecond precision follows the same idea. The difference is that the intermediate result must be converted using milliseconds per second.
Listing 4. Converting to millisecond precision
Milliseconds_since_epoch =
((( (<TB value> - tb_orig_stamp) * tb_to_xs)
+ stamp_xsec) * <milliseconds per second>)
/ <xseconds per second>
|
For the purpose of this article, only nanosecond calculations will be described in detail.
The full algorithm in C that computes the time with nanosecond precision is as follows:
Listing 5. Obtaining time of day with nanosecond precision
#define U64 __u64
U64 tb_nano()
{
#define NANOSECS_PER_SEC 1000000000
U64 tbr;
U64 elapsed_ticks_since_boot;
U64 elapsed_xseconds_since_boot_hi;
U64 elapsed_xseconds_since_boot_lo;
U64 xseconds_since_epoch;
U64 nanoseconds_per_second = NANOSECS_PER_SEC;
U64 nps_temp_hi;
U64 nps_temp_lo;
U64 nanoseconds_since_epoch_hi;
U64 nanoseconds_since_epoch_lo;
U64 nanoseconds_since_epoch;
U64 count_start;
U64 count_end;
do {
count_start = cfg->tb_update_count;
__asm__ __volatile__("mftb %[tbr]" : [tbr] "=r" (tbr):);
elapsed_ticks_since_boot = tbr - cfg->tb_orig_stamp;
multiply(elapsed_ticks_since_boot, cfg->tb_to_xs,
&elapsed_xseconds_since_boot_hi,
&elapsed_xseconds_since_boot_lo);
xseconds_since_epoch =
cfg->stamp_xsec + elapsed_xseconds_since_boot_hi;
count_end = cfg->tb_update_count;
} while (count_start != count_end && count_start & 0x1 != 0);
multiply(xseconds_since_epoch, nanoseconds_per_second,
&nps_temp_hi,
&nps_temp_lo);
nanoseconds_since_epoch_lo = nps_temp_lo >> 20;
nanoseconds_since_epoch_hi = nps_temp_hi << (64 - 20);
nanoseconds_since_epoch =
nanoseconds_since_epoch_hi | nanoseconds_since_epoch_lo;
return nanoseconds_since_epoch;
}
|
In further detail:
-
Read the value of the TB register into the 64-bit variable tbr. It is
necessary to use assembly since there is no mechanism provided in C to
access this register.
__asm__ __volatile__ ("mftb %[tbr]" : [tbr] "=r" (tbr):);
-
Calculate
elapsed_ticks_since_bootas the difference between the TB register andtb_orig_stamp.elapsed_ticks_since_boot = tbr - cfg->tb_orig_stamp;
-
Calculate
elapsed_xseconds_since_bootas the product ofelapsed_ticks_since_bootandtb_to_xsmultiply(elapsed_ticks_since_boot, cfg->tb_to_xs,
&elapsed_xseconds_since_boot_hi,
&elapsed_xseconds_since_boot_lo);Note that it is necessary to use a helper multiply function. Both operands,
elapsed_ticks_since_bootandcfg->tb_to_xsare 64 bits long, and their product results in a 128-bit number. Such a large value cannot be represented in a single C variable, thus the helper multiply function will split it in two parts, in this case calledelapsed_xseconds_since_boot_hiandelapsed_xseconds_since_boot_lo. The details of the multiply algorithm will be described in the section headed Multiply as a support function.
-
Calculate
xseconds_since_epochas the addition ofelapsed_xseconds_since_boot_hiandstamp_xsec.xseconds_since_epoch =
cfg->stamp_xsec + elapsed_xseconds_since_boot_hi;
-
Calculate
nanoseconds_since_epochasxseconds_since_epochconverted to seconds dividing byxseconds_per_second(20-bit right shift) and converting the result to nanoseconds multiplying bynanoseconds_per_second. Note that the algorithm performs a multiply first and a divide (shift) later. This gives extra precision.multiply(xseconds_since_epoch, nanoseconds_per_second, &nps_temp_hi, &nps_temp_lo); nanoseconds_since_epoch_lo = nps_temp_lo >> 20; nanoseconds_since_epoch_hi = nps_temp_hi << (64 - 20); nanoseconds_since_epoch = nanoseconds_since_epoch_hi | nanoseconds_since_epoch_lo;
Note that most of the calculation is guarded against value
changes to tb_update_count, which is stored
in variables count_start and count_end. This is used to guarantee that the
values read from tb_to_xs and stamp_xsec variables are consistent. The values for
such variables can change when the operating system detects that the
update frequency changed, or if the operating system initiated the
frequency change.
Most of the calculation is guarded against value changes to tb_update_count, which is stored in the variables
count_start and count_end. This is used to guarantee that the
values read from tb_to_xs and stamp_xsec variables are consistent. The values for
such variables can change when the operating system detects that the
update frequency has changed, or if the operating system initiated the
frequency change.
The Linux libc library explains how tb_update_count is updated and how its values must
be interpreted:
The code which updates these variables incrementstb_update_count, then updates the two variables and then incrementstb_update_countagain. This code readstb_update_count, reads the two variables and then readstb_update_countagain. It loops doing this until the two reads oftb_update_countyield the same value and that value is even. This ensures a consistent view of the two variables (see Resources for a citation).
Furthermore, tb_update_count is used to
detect when the TB register overflows. When that happens, the same
protocol is followed; tb_update_count is
updated while the necessary changes are reflected in the rest of the
variables. It is the operating system's responsibility to detect such a
scenario.
Finally, it is interesting to mention that a 64-bit nanosecond value can represent approximately 585 years of elapsed time.
Multiply as a support function
In the implementation of the Time Base calculation above, the algorithm uses a helper multiply function. Since the operands for the multiplication are 64-bit values, the result of their product is a 128-bit number. This result cannot be represented in a single variable in C, so the multiply function returns the result in two different output parameters.
The function is implemented as follows:
Listing 6. An optimized 64-bit multiply function
void multiply (U64 val1, U64 val2, U64 *res_hi, U64 *res_lo)
{
U64 half1, half2;
/* Even though we use 32 bits here, the compiler needs 64-bit
registers */
U64 half1_lo, half1_hi;
/* Higher bits of val 1 */
U64 val1_hi = (val1 >> 32) & 0xFFFFFFFF;
/* Lower bits of val 1 */
U64 val1_lo = val1 & 0xFFFFFFFF;
/* Higher bits of val 2*/
U64 val2_hi = (val2 >> 32) & 0xFFFFFFFF;
/* Lower bits of val 2*/
U64 val2_lo = val2 & 0xFFFFFFFF;
U64 mult1 = val1_lo * val2_lo;
U64 mult1_overflow = ((mult1 >> 32) & 0xFFFFFFFF);
U64 mult2 =
val1_hi * val2_lo + val2_hi * val1_lo + mult1_overflow;
U64 mult2_overflow = ((mult2 >> 32) & 0xFFFFFFFF);
half1_lo = (mult1 & 0xFFFFFFFF);
half1_hi = (mult2 & 0xFFFFFFFF);
half1 = (half1_hi << 32) | half1_lo;
half2 = val1_hi * val2_hi + mult2_overflow;
*res_hi = half2;
*res_lo = half1;
|
The function receives two 64-bit input values, called val1 and val2. These are
then separated in the hi and low halves, called: val1_hi, val1_lo, val2_hi and val2_lo. The
multiplication is performed as:
Table 1
X | val1_hi | val1_lo | |
| val2_hi | val2_lo | ||
|
val1_hi x val2_hi + mult2_overflow |
val1_hi x val2_lo + val2_hi x val1_lo + mult1_overflow | val1_lo x val2_lo | |
| half 2 | mult2 | mult1 | |
| half1 | |||
This follows the same idea as the decimal multiplication algorithm.
The lower part of the lower half of the product is val1_lo multiplied by val2_lo, which is called mult1. The high part of the lower half, called
mult2, is val1_hi
multiplied by val2_lo, plus val2_hi multiplied by val1_lo. Also, any overflow from the mult1
multiplication must be added to mult2.
The concatenation of the values of mult2 and
mult1 will form the lower 64-bit half of the
result, called half1.
The high half, called half2, is the result
of multiplying val1_h1 and val2_hi, to which any overflow from mult2 must be added.
Another alternative is to use the multiply
assembly instructions:
Listing 7. Assembly language version of multiply function
void multiply (U64 val1, U64 val2, U64 *high, U64 *low)
{
U64 h;
U64 l;
__asm__ __volatile__("mulhdu %[h],%[val1],%[val2]" :
[h] "=r" (h) : [val1] "g" (val1), [val2] "g" (val2));
__asm__ __volatile__("mulld %[l],%[val1],%[val2]" :
[l] "=r" (l) : [val1] "g" (val1), [val2] "g" (val2));
*high = h;
*low = l;
}
|
However, the performance delivered by this algorithm is not as good as its pure C equivalent. The IBM XL C compiler is a very good optimizer, and as a result, the C algorithm delivers superior results to a simple unoptimized assembly language implementation.
There are other variations that can be done in assembly (like using shifts and additions instead of multiply instructions), but they go beyond the scope of this article.
Advantages of the Time Base approach
There are two advantages to the implementation of nanosecond calculation using the Time Base register:
- Performance:
As you can see from the experimental results in the next section, the
methods described here are considerably faster than simply invoking
gettimeofday. - Precision:
gettimeofdayprecision is fixed to the microsecond. To get a result expressed in nanoseconds, it is necessary to multiply by 1000, thus losing precision. The Time Base implementation already delivers the result in nanoseconds. Both implementations can easily support millisecond precision with an appropriate division operation.
This section presents the performance results for three different algorithms. All tests were compiled with optimization level of -O3, using the IBM XL C compiler, and executed on RHEL Linux running on a p630 POWER4+™ microprocessor.
The algorithms compared below are:
- tb: The implementation of nanosecond calculation presented above.
- tb_libc: The libc implementation of
gettimeofdayusing the TB register, but recompiled in a standalone program (some Linux implementations use this algorithm). - gtod: Calls to
gettimeofday.
Figure 1 represents execution time (in microseconds) versus number of invocations.
Figure 1.

As the graph shows, the implementation presented in this article is the fastest, and it scales similarly to the recompiled libc implementation.
The tb algorithm is 20% faster than tb_libc, and 300% faster than gtod.
Using the gettimeofday interface causes a significant performance hit, and a
much lower scalability. One possibility for this degradation is that the
Linux libc libraries are not compiled using the appropriate optimization
level.
In summary, then, you have seen how the PowerPC core (including Cell/B.E. variants) offers a convenient hardware feature for measuring elapsed time with fine precision, quickly and scalably. Example applications that might benefit from this sort of code include high-load transaction servers that need to timestamp individual events, and cryptographic applications where the time of day is included in message data to provide time sensitivity of data.
Learn
-
The PowerPC
Architecture
Book II: PowerPC Virtual Environment Architecture, Chapter 4, describes the Time Base counting register.
-
See the
Cell
Broadband Engine
Registers specification in the IBM Semiconductor solutions library.
-
The Linux libc library explains how tb_update_count is updated and how its values must be interpreted.
-
The
IBM Semiconductor solutions technical library
Cell Broadband Engine
documentation section contains a wealth of downloadable manuals,
specifications, and much more.
-
Find all Cell/B.E.-related articles, discussion forums, downloads,
and more at the IBM
developerWorks Cell
Broadband Engine resource center: your definitive resource for all
things Cell/B.E..
-
Stay abreast of all things Cell/B.E.:
subscribe to IBM
microNews.
Get products and technologies
-
Get Cell: Contact
IBM about custom Cell-based or custom-processor based solutions.
-
Get the alphaWorksCell Broadband Engine downloads -- including the IBM Full System Simulator,
support libraries, toolchains, source code for libraries and samples.
Discuss
- Participate in the discussion forum.
-
Take part in the IBM developerWorks Power Architecture Cell
Broadband Engine discussion forum.



