Skip to main content

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

All information submitted is secure.

  • Close [x]

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

  • Close [x]

The Power Architecture Time Base register in 64-bit Linux

Bonus PowerPC hardware features let you measure time efficiently and precisely

Carlos Cavanna (ccavanna@ca.ibm.com), Software Developer, IBM Software Group, Application and Integration Middleware
Carlos Cavanna is a Software Developer in IBM Compilers Group, specializing in Java JIT (Java Just-In-Time) compilation for the PowerPC Architecture.

Summary:  Use the Power Architecture™ technology's Time Base register to measure time at the nanosecond level in Linux® on PowerPC® and Cell Broadband Engine™ (Cell/B.E.) microprocessors. Applications where this is useful include timestamping transactions (typically encrypted or digitally signed single-use messages), profiling code, and implementing small, precise software delays.

Date:  04 Apr 2007
Level:  Intermediate
Also available in:   Japanese

Activity:  18502 views
Comments:  

Application programs commonly use at least one form of time measurement (for example, transaction time stamps). For that purpose, POSIX libraries provide timing functions, such as gettimeofday, which give an easy-to-use interface to application programmers. When the application makes an intensive use of timing functions, a more efficient implementation of the timing routines can improve the overall performance of the program. This can be invaluable for lower-level tasks such as code profiling or precise delays in device drivers and other time-critical code.

On the Power Architecture platform, it is possible to further improve on the performance delivered by gettimeofday. The following article presents an efficient and precise time measurement technique since the UNIX® epoch (January 1, 1970) for the 64-bit Linux® operating system on PowerPC and Cell/B.E. PPE processors, which can provide millisecond and nanosecond precision. Note, however, that some Linux versions provide a similar implementation of gettimeofday as presented in this article, but you might notice it performs with lesser precision.

This technique can also be implemented on the AIX operating system. However, the examples and system details described below correspond to the 64-bit Linux implementation.

The Time Base register

The Power Architecture documentation (Book II: PowerPC Virtual Environment Architecture, see Resources) provides a counting register called Time Base (TB), which is used to keep track of system time. The TB register is incremented periodically with an implementation-dependent frequency, which may not be constant. It is the responsibility of the operating system (OS) to determine if the update frequency has changed and to make the necessary adjustments to its internal structures.

The PowerPC Architecture states that:

  • The TB register is 64 bits long.
  • Each update goes by increments of 1.
  • The operating system must be capable of determining the update frequency.
  • When the TB reaches its maximum value, it overflows and starts over from 0. There is no explicit indication of this and this is left to the OS to handle.
  • The OS must initialize the TB register on power-on.

For further details see PowerPC Virtual Environment Architecture (find a link to this and other documents mentioned in this article in the Resources section at the bottom of this page). Note that the Cell/B.E. PPE also implements the TB register.

The TB register alone does not contain sufficient information for time calculation. The Power Architecture specification leaves much of the responsibility of handling the TB register to the operating system, which has to provide some additional data, such as the update frequency, Time Base at boot time, and so on. This whole calculation mechanism is fast and efficient.

In the case of the Linux operating system, the system libraries export the following information in the systemcfg structure:


Listing 1. Time Base-specific system information
struct systemcfg {
...
__u64 tb_orig_stamp;            /* Time Base at boot       */
__u64 tb_ticks_per_sec;         /* Time Base ticks / sec    */
__u64 tb_to_xs;                 /* Inverse of TB to 2^20  */
__u64 stamp_xsec;
__u64 tb_update_count;          /* Time Base atomicity ctr */
...
};

where:

  • tb_orig_stamp: the value of the TB register at boot time
  • tb_ticks_per_sec: TB register ticks per second (not used in the following calculations)
  • tb_to_xs: used to convert from TB ticks to xseconds
  • stamp_xsec: contains the time at boot time in xseconds
  • tb_update_count: is used as a count when tb is updated

This structure can be accessed by including the following header file:

#include <asm/systemcfg.h>


How to read the register in assembly language

The mftb mnemonic allows the value of the TB register to be transferred to a general purpose register. Its syntax is:

mftb rx

The PowerPC Virtual Environment Architecture [1] provides more technical details about this mnemonic and describes an algorithm to calculate gettimeofday in assembly.

Note that some Linux distributions provide a library function called get_tb(), which returns the value of the TB register. It is implemented as a single mftb assembly instruction.


Using gettimeofday

Using gettimeofday, the implementation of a nanosecond calculation is as follows:


Listing 2. Obtaining current time of day with nanosecond precision
U64 gtod_nano()
{
     struct timeval t;
     gettimeofday(&t,NULL);
     return t.tv_usec * 1000 + t.tv_sec * 1000000000;
}

Time calculation with nanosecond precision

Using the TB register and the values exposed in the systemcfg structure, elapsed time since the epoch with nanosecond precision can be calculated as:


Listing 3. Converting time value to absolute time
nanoseconds_since_epoch =
                 ((( (<TB value> - tb_orig_stamp) * tb_to_xs)
                 + stamp_xsec) * <nanoseconds per second>)
                 / <xseconds per second>

where:

  • nanoseconds per second is 1 000 000 000
  • xseconds per second is 2^20

In more detail, the calculation can be broken down into:

  1. Calculate the Time Base ticks elapsed since boot time. This is the difference between the current value of the TB register and tb_orig_stamp.

    elapsed_ticks_since_boot = <TB value> - tb_orig_stamp

  1. Calculate the number of elapsed xseconds since boot. This is the number of elapsed ticks since boot time (1) converted to xseconds using tb_to_xs.

    elapsed_xseconds_since_boot = elapsed_ticks_since_boot * tb_to_xs

  1. Calculate the number of xseconds since the epoch. This is the number of elapsed xseconds since boot (2) added to the time (represented in xseconds) at boot time stored in stamp_xsec.

    xseconds_since_epoch = elapsed_xseconds_since_boot + stamp_xsec

  1. Convert the value of xseconds since the epoch (3) to nanoseconds. This can be done dividing by the number of xseconds in each second and multiplying by the number of nanoseconds in a second.

    nanoseconds_since_epoch = (xseconds_since_epoch * <nanoseconds per second>) / <xseconds per second>

Calculating time with millisecond precision follows the same idea. The difference is that the intermediate result must be converted using milliseconds per second.


Listing 4. Converting to millisecond precision
Milliseconds_since_epoch =
              ((( (<TB value> - tb_orig_stamp) * tb_to_xs)
                 + stamp_xsec) * <milliseconds per second>)
         / <xseconds per second>

For the purpose of this article, only nanosecond calculations will be described in detail.

The full algorithm

The full algorithm in C that computes the time with nanosecond precision is as follows:


Listing 5. Obtaining time of day with nanosecond precision
#define U64 __u64

U64 tb_nano()
{
     #define NANOSECS_PER_SEC 1000000000

     U64 tbr;
     U64 elapsed_ticks_since_boot;
     U64 elapsed_xseconds_since_boot_hi;
     U64 elapsed_xseconds_since_boot_lo;
     U64 xseconds_since_epoch;
     U64 nanoseconds_per_second = NANOSECS_PER_SEC;
     U64 nps_temp_hi;
     U64 nps_temp_lo;
     U64 nanoseconds_since_epoch_hi;
     U64 nanoseconds_since_epoch_lo;
     U64 nanoseconds_since_epoch;

     U64 count_start;
     U64 count_end;
     do {
        count_start = cfg->tb_update_count;

        __asm__ __volatile__("mftb %[tbr]" : [tbr] "=r" (tbr):);

        elapsed_ticks_since_boot = tbr - cfg->tb_orig_stamp;

        multiply(elapsed_ticks_since_boot, cfg->tb_to_xs,
                 &elapsed_xseconds_since_boot_hi,
                 &elapsed_xseconds_since_boot_lo);

        xseconds_since_epoch =
                 cfg->stamp_xsec + elapsed_xseconds_since_boot_hi;

        count_end = cfg->tb_update_count;

     } while (count_start != count_end && count_start & 0x1 != 0);

     multiply(xseconds_since_epoch, nanoseconds_per_second,
              &nps_temp_hi,
              &nps_temp_lo);

     nanoseconds_since_epoch_lo = nps_temp_lo >> 20;
     nanoseconds_since_epoch_hi = nps_temp_hi << (64 - 20);

     nanoseconds_since_epoch =
          nanoseconds_since_epoch_hi | nanoseconds_since_epoch_lo;

     return nanoseconds_since_epoch;
}

In further detail:

  1. Read the value of the TB register into the 64-bit variable tbr. It is necessary to use assembly since there is no mechanism provided in C to access this register.

    __asm__ __volatile__ ("mftb %[tbr]" : [tbr] "=r" (tbr):);

  1. Calculate elapsed_ticks_since_boot as the difference between the TB register and tb_orig_stamp.

    elapsed_ticks_since_boot = tbr - cfg->tb_orig_stamp;

  1. Calculate elapsed_xseconds_since_boot as the product of elapsed_ticks_since_boot and tb_to_xs

    multiply(elapsed_ticks_since_boot, cfg->tb_to_xs,
    &elapsed_xseconds_since_boot_hi,
    &elapsed_xseconds_since_boot_lo);

    Note that it is necessary to use a helper multiply function. Both operands, elapsed_ticks_since_boot and cfg->tb_to_xs are 64 bits long, and their product results in a 128-bit number. Such a large value cannot be represented in a single C variable, thus the helper multiply function will split it in two parts, in this case called elapsed_xseconds_since_boot_hi and elapsed_xseconds_since_boot_lo. The details of the multiply algorithm will be described in the section headed Multiply as a support function.

  1. Calculate xseconds_since_epoch as the addition of elapsed_xseconds_since_boot_hi and stamp_xsec.

    xseconds_since_epoch =
    cfg->stamp_xsec + elapsed_xseconds_since_boot_hi;

  1. Calculate nanoseconds_since_epoch as xseconds_since_epoch converted to seconds dividing by xseconds_per_second (20-bit right shift) and converting the result to nanoseconds multiplying by nanoseconds_per_second. Note that the algorithm performs a multiply first and a divide (shift) later. This gives extra precision.
    multiply(xseconds_since_epoch, nanoseconds_per_second,
                  &nps_temp_hi,
                  &nps_temp_lo);
    
    nanoseconds_since_epoch_lo = nps_temp_lo >> 20;
    nanoseconds_since_epoch_hi = nps_temp_hi << (64 - 20);
    
    nanoseconds_since_epoch =
              nanoseconds_since_epoch_hi | nanoseconds_since_epoch_lo;
    

Note that most of the calculation is guarded against value changes to tb_update_count, which is stored in variables count_start and count_end. This is used to guarantee that the values read from tb_to_xs and stamp_xsec variables are consistent. The values for such variables can change when the operating system detects that the update frequency changed, or if the operating system initiated the frequency change.

Most of the calculation is guarded against value changes to tb_update_count, which is stored in the variables count_start and count_end. This is used to guarantee that the values read from tb_to_xs and stamp_xsec variables are consistent. The values for such variables can change when the operating system detects that the update frequency has changed, or if the operating system initiated the frequency change.

The Linux libc library explains how tb_update_count is updated and how its values must be interpreted:

The code which updates these variables increments tb_update_count, then updates the two variables and then increments tb_update_count again. This code reads tb_update_count, reads the two variables and then reads tb_update_count again. It loops doing this until the two reads of tb_update_count yield the same value and that value is even. This ensures a consistent view of the two variables (see Resources for a citation).

Furthermore, tb_update_count is used to detect when the TB register overflows. When that happens, the same protocol is followed; tb_update_count is updated while the necessary changes are reflected in the rest of the variables. It is the operating system's responsibility to detect such a scenario.

Finally, it is interesting to mention that a 64-bit nanosecond value can represent approximately 585 years of elapsed time.


Multiply as a support function

In the implementation of the Time Base calculation above, the algorithm uses a helper multiply function. Since the operands for the multiplication are 64-bit values, the result of their product is a 128-bit number. This result cannot be represented in a single variable in C, so the multiply function returns the result in two different output parameters.

The function is implemented as follows:


Listing 6. An optimized 64-bit multiply function
void multiply (U64 val1, U64 val2, U64 *res_hi, U64 *res_lo)
{
    U64 half1, half2;

    /* Even though we use 32 bits here, the compiler needs 64-bit
       registers */
    U64 half1_lo, half1_hi;

    /* Higher bits of val 1 */
    U64 val1_hi = (val1 >> 32) & 0xFFFFFFFF;

    /* Lower bits of val 1 */
    U64 val1_lo = val1 & 0xFFFFFFFF;

    /* Higher bits of val 2*/
    U64 val2_hi = (val2 >> 32) & 0xFFFFFFFF;

    /* Lower bits of val 2*/
    U64 val2_lo = val2 & 0xFFFFFFFF;

    U64 mult1          = val1_lo * val2_lo;
    U64 mult1_overflow = ((mult1 >> 32) & 0xFFFFFFFF);
    U64 mult2          =
           val1_hi * val2_lo + val2_hi * val1_lo + mult1_overflow;
    U64 mult2_overflow = ((mult2 >> 32) & 0xFFFFFFFF);

    half1_lo = (mult1 & 0xFFFFFFFF);
    half1_hi = (mult2 & 0xFFFFFFFF);

    half1    = (half1_hi << 32) | half1_lo;
    half2    = val1_hi * val2_hi + mult2_overflow;

    *res_hi = half2;
    *res_lo = half1;

The function receives two 64-bit input values, called val1 and val2. These are then separated in the hi and low halves, called: val1_hi, val1_lo, val2_hi and val2_lo. The multiplication is performed as:


Table 1

X
val1_hi val1_lo
val2_hi val2_lo
val1_hi x val2_hi +
mult2_overflow
val1_hi x val2_lo +
val2_hi x val1_lo +
mult1_overflow
val1_lo x val2_lo
half 2 mult2 mult1
half1

This follows the same idea as the decimal multiplication algorithm.

The lower part of the lower half of the product is val1_lo multiplied by val2_lo, which is called mult1. The high part of the lower half, called mult2, is val1_hi multiplied by val2_lo, plus val2_hi multiplied by val1_lo. Also, any overflow from the mult1 multiplication must be added to mult2.

The concatenation of the values of mult2 and mult1 will form the lower 64-bit half of the result, called half1.

The high half, called half2, is the result of multiplying val1_h1 and val2_hi, to which any overflow from mult2 must be added.

Another alternative is to use the multiply assembly instructions:


Listing 7. Assembly language version of multiply function
void multiply (U64 val1, U64 val2, U64 *high, U64 *low)
{
     U64 h;
     U64 l;
     __asm__ __volatile__("mulhdu %[h],%[val1],%[val2]" :
             [h] "=r" (h) : [val1] "g" (val1), [val2] "g" (val2));
     __asm__ __volatile__("mulld %[l],%[val1],%[val2]" :
             [l] "=r" (l) : [val1] "g" (val1), [val2] "g" (val2));

     *high = h;
     *low = l;
}
 

However, the performance delivered by this algorithm is not as good as its pure C equivalent. The IBM XL C compiler is a very good optimizer, and as a result, the C algorithm delivers superior results to a simple unoptimized assembly language implementation.

There are other variations that can be done in assembly (like using shifts and additions instead of multiply instructions), but they go beyond the scope of this article.


Advantages of the Time Base approach

There are two advantages to the implementation of nanosecond calculation using the Time Base register:

  • Performance: As you can see from the experimental results in the next section, the methods described here are considerably faster than simply invoking gettimeofday.
  • Precision: gettimeofday precision is fixed to the microsecond. To get a result expressed in nanoseconds, it is necessary to multiply by 1000, thus losing precision. The Time Base implementation already delivers the result in nanoseconds. Both implementations can easily support millisecond precision with an appropriate division operation.

Experimental evaluation

This section presents the performance results for three different algorithms. All tests were compiled with optimization level of -O3, using the IBM XL C compiler, and executed on RHEL Linux running on a p630 POWER4+™ microprocessor.

The algorithms compared below are:

  • tb: The implementation of nanosecond calculation presented above.
  • tb_libc: The libc implementation of gettimeofday using the TB register, but recompiled in a standalone program (some Linux implementations use this algorithm).
  • gtod: Calls to gettimeofday.

Figure 1 represents execution time (in microseconds) versus number of invocations.


Figure 1.

As the graph shows, the implementation presented in this article is the fastest, and it scales similarly to the recompiled libc implementation.

The tb algorithm is 20% faster than tb_libc, and 300% faster than gtod.

Using the gettimeofday interface causes a significant performance hit, and a much lower scalability. One possibility for this degradation is that the Linux libc libraries are not compiled using the appropriate optimization level.


Conclusion

In summary, then, you have seen how the PowerPC core (including Cell/B.E. variants) offers a convenient hardware feature for measuring elapsed time with fine precision, quickly and scalably. Example applications that might benefit from this sort of code include high-load transaction servers that need to timestamp individual events, and cryptographic applications where the time of day is included in message data to provide time sensitivity of data.


Resources

Learn

Get products and technologies

Discuss

About the author

Carlos Cavanna is a Software Developer in IBM Compilers Group, specializing in Java JIT (Java Just-In-Time) compilation for the PowerPC Architecture.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Multicore acceleration, Linux
ArticleID=207051
ArticleTitle=The Power Architecture Time Base register in 64-bit Linux
publish-date=04042007
author1-email=ccavanna@ca.ibm.com
author1-email-cc=dwpower@us.ibm.com