From the stacks: PowerPC 750FX/GX design and debug tips, Part 2

Troubleshooting, debugging, and getting help

In the second and final installment of this series, IBM Senior Engineer Dale Elson offers comprehensive tips on troubleshooting and debugging your PowerPC 750FX/GX systems. Part 1 focused on design best practices. This article also covers system qualification and lists the information to communicate when you need debugging help.

Dale Elson, Senior Engineer, IBM PowerPC Applications Engineering

Dale Elson is an IBM senior engineer and the lead applications engineer for the PowerPC 750 family. He has been helping customers design and debug PowerPC systems since the 601 was designed and has written many of the PowerPC books, manuals, and documents that are available in the IBM Technical Library.



12 October 2004

Part 1 in this series offered tips on system design with the PowerPC® 750FX and 750GX processors. Part 2 assumes that you designed your system with those tips in mind. But even the best-designed systems can run into trouble. This article focuses on troubleshooting and debugging techniques that work well with PowerPC systems. It also offers hints on qualifying PowerPC systems and presents the standard "shopping list" of information that the IBM PowerPC Applications Engineering team uses to help customer and field applications engineer (FAE) teams debug their systems.

As discussed in Part 1, every team has its own way of designing, debugging, and troubleshooting boards or systems. This article is intended neither to teach basic techniques nor to dictate particular techniques to experienced teams. Also, PowerPC systems are diverse in nature and application, so some of this discussion might not apply to a particular system.

If the board doesn't boot

If the prototype board doesn't boot, then something foundational is probably badly wrong. The good news is that these things are usually relatively easy to identify.

Check part numbers

Often, the cause of a prototype not booting is that the wrong part(s) have been installed. So if the system does not boot, check the part numbers of the processor and all of the parts associated with the processor.

Try another one

In a significant number of development boards, some of the components have been damaged by mishandling or misadventure before debugging of the processor bus has begun. If only one board is being brought up, a good technique is to bring up another board, or replace the component that seems to be failing. This can eliminate a long hunt for system problems when the cause is actually a broken part.

Check power-supply voltages

Check that all of the power supply voltages (Vdd, OVdd, AVdd) are in spec according to the datasheet (or datasheet supplement) for the particular part and for the application conditions specified by the part number. Check that the noise and ripple don't cause the power supply to violate the minimum or maximum voltages for the part and the application conditions. Note that if the instantaneous value of Vdd drops below Vdd(min) for that frequency and junction temperature, the part might not operate correctly. Only a ±15 mV allowance for ripple and noise is built into the Vdd specification. See the datasheet for the part and the IBM application note "PowerPC 750FX and 750GX Power Dissipation" for more details.
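The instantaneous-voltage check described above can be sketched in a few lines of Python. This is only an illustration: the function name `vdd_violations`, the limit values, and the sample data are all hypothetical, and the real limits must come from the datasheet for your part number and application conditions.

```python
from typing import Sequence


def vdd_violations(samples_v: Sequence[float],
                   vdd_min: float,
                   vdd_max: float) -> list[tuple[int, float]]:
    """Return (sample index, voltage) for every sample outside [vdd_min, vdd_max].

    The check is on instantaneous values, not on the average, because a
    brief dip below Vdd(min) is enough to cause a malfunction.
    """
    return [(i, v) for i, v in enumerate(samples_v)
            if v < vdd_min or v > vdd_max]


# Hypothetical limits and scope capture, for illustration only.
VDD_MIN, VDD_MAX = 1.40, 1.50
capture = [1.45, 1.44, 1.41, 1.39, 1.43, 1.46]
print(vdd_violations(capture, VDD_MIN, VDD_MAX))  # [(3, 1.39)]
```

In practice the samples would be exported from a high-bandwidth scope capture taken at the processor's power pins, ideally around the point of failure.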

Because the maximum frequency of the processor is limited by the value of Vdd, verify that the correct Vdd has been chosen.

Verify that the correct components have been installed for the AVdd filter (see the datasheet). Verify that the correct Vdd and OVdd bypassing is installed. See the IBM application note "PowerPC 750FX Power Supply Layout and Bypassing" for more details (see Resources).

Verify that the power supplies and bus voltages remain in compliance with the power-supply envelopes specified by the notes to the "Absolute Maximum Ratings" table in the datasheet. Violation of these envelopes can cause functional fails and can also cause damage to the processor. In the initial stages, this damage is difficult to detect in the field and is usually only detectable on IBM testers.

Check resets

Check the resets (HRESET#, SRESET#, TRST#) during power up. Generally, SRESET# is pulled high during reset, and HRESET# and TRST# are driven low. Verify that HRESET# is asserted for a sufficient amount of time after SYSCLK and the power supplies stabilize. See the datasheet for more details.
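As a rough sketch of the HRESET# duration check, the snippet below takes a list of HRESET# transitions from a scope or logic analyzer export and reports how long the signal stayed asserted after SYSCLK and the supplies became stable. The function name, the data layout, and the numbers are assumptions made for illustration; the required minimum assertion time is in the datasheet and is deliberately not encoded here.

```python
from typing import Optional, Sequence, Tuple


def hreset_hold_after_stable(edges: Sequence[Tuple[int, int]],
                             t_stable: int) -> Optional[int]:
    """Time HRESET# stayed asserted (low) after t_stable.

    edges: time-sorted (time, level) transitions, level 0 = asserted (low).
    Returns None if HRESET# was not asserted at t_stable, or if no
    deassertion was captured afterwards.
    """
    level = None
    for t, new_level in edges:
        if t <= t_stable:
            level = new_level          # track the level in effect at t_stable
        elif level == 0 and new_level == 1:
            return t - t_stable        # first deassertion after stability
        else:
            break
    return None


# Hypothetical capture, times in microseconds.
edges = [(0, 0), (500, 1)]
print(hreset_hold_after_stable(edges, t_stable=200))  # 300
```

Compare the returned value against the minimum HRESET# assertion time that the datasheet specifies for your part.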

Check strapping options

Verify that the strapping pins are in the desired state preceding the deassertion of HRESET#. The required timing is shown in the datasheet, and the correct logic level is shown in the User's Manual for most pins (PLL_CFG is typically shown in the datasheet). The strapping pins typically vary somewhat from processor to processor, so verify that an incorrect legacy pin strapping did not creep into the current design.

Verify that the timing of the strapping pins is correct. After the deassertion of HRESET#, strapping pins change from the strapping-pin function to their normal function. These pins generally must assume the correct level for normal operation within a couple of SYSCLK cycles of the deassertion of HRESET#.

Check SYSCLK

Verify that SYSCLK is valid for the required amount of time and number of cycles before the deassertion of HRESET# (see "Check resets"). Verify that SYSCLK is running at the intended frequency. Check SYSCLK signal quality. Verify that the jitter and skew are within datasheet specifications.
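A minimal sketch of the frequency and jitter check follows, assuming that rising-edge timestamps for SYSCLK have been exported from a scope in picoseconds. The function name and the sample timestamps are hypothetical, and cycle-to-cycle jitter is computed here simply as the largest difference between adjacent periods; confirm against the datasheet how the jitter specification is actually defined.

```python
from typing import Sequence, Tuple


def sysclk_stats(rising_edges_ps: Sequence[int]) -> Tuple[float, int]:
    """Return (mean frequency in MHz, worst cycle-to-cycle jitter in ps)."""
    if len(rising_edges_ps) < 3:
        raise ValueError("need at least three rising edges")
    # Period of each cycle: difference between consecutive rising edges.
    periods = [b - a for a, b in zip(rising_edges_ps, rising_edges_ps[1:])]
    mean_period_ps = sum(periods) / len(periods)
    # Cycle-to-cycle jitter: change in period from one cycle to the next.
    worst_c2c_ps = max(abs(b - a) for a, b in zip(periods, periods[1:]))
    return 1e6 / mean_period_ps, worst_c2c_ps


# Hypothetical edge timestamps for a nominal 100 MHz SYSCLK.
edges_ps = [0, 10000, 20050, 30000, 40000]
freq_mhz, c2c_ps = sysclk_stats(edges_ps)
print(freq_mhz, c2c_ps)  # 100.0 100
```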

Examine bus activity

Is anything happening on the bus? Verify that the bus is idle at the deassertion of HRESET#, and that this state is followed by a valid address bus arbitration. If the processor does not request the bus, or is granted the bus (BG# asserted) but never asserts TS#, then concentrate on the processor. If TS# is issued, but there's no response (or an incorrect response) from the system, then concentrate on the system logic.
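The triage rule in this paragraph can be written down as a small decision function. The sketch below is only a restatement of the text: the function name and the boolean inputs (whether BR#, BG#, and TS# were observed asserted after HRESET# deassertion, and whether the system responded correctly) are hypothetical, and the case of a request that is never granted is left unclassified because the text does not address it.

```python
def bus_triage(br_seen: bool, bg_seen: bool, ts_seen: bool,
               response_ok: bool) -> str:
    """Suggest where to concentrate, based on observed bus activity."""
    if ts_seen:
        # A transfer was started; the question is how the system answered.
        return "no fault indicated" if response_ok else "system logic"
    if not br_seen:
        return "processor"      # processor never requests the bus
    if bg_seen:
        return "processor"      # bus granted, but TS# never asserted
    return "unclassified"       # requested but never granted: not covered above


print(bus_triage(br_seen=True, bg_seen=True, ts_seen=False, response_ok=False))
print(bus_triage(br_seen=True, bg_seen=True, ts_seen=True, response_ok=False))
```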

Give it a good whack

Occasionally, the appropriate application of a kinetic shock will restore connectivity to connectors and contacts that have opened because of contamination, corrosion, or other causes.


If the board fails during normal operation

If the prototype boots, but fails during operation, consider the following ideas in addition to the tips in the "If the board doesn't boot" section of this article.

Take measurements close to the point of failure

In some cases, the root cause of the problem is a condition that exists for a long time and causes a malfunction only with certain processors or under certain conditions (code, environment, internal state, and so on). In other cases, the condition will immediately trigger a fail. So in all cases, you should take measurements as close to the point of fail (POF) -- the point at which the failure occurs -- as possible.

Identify the POF

The POF could be a point in time, a point in the code, a point at which the machine is in a certain internal state, or a point that corresponds to an asynchronous event, such as an interrupt, an electrical transient, or a certain pattern in IO data. Whatever the case, it is usually important to identify the POF in whatever terms are appropriate, because once the POF is known, the failure can usually be isolated and identified.

Capture the data

Once the POF is known, you can investigate the conditions that cause the fail. Generally, this requires a huge amount of initial data collection. Usually you need to instrument the control and address busses, and occasionally the data bus. The activity on the bus often provides the key to understanding the fail. Look for the point at which the bus protocol breaks, or the bus hangs, or simply goes idle.

Check the application conditions at the point of fail

Since the performance of the processor is limited by the application conditions, verify Vdd, OVdd, Fcore, and Tj at the POF.

Power dissipation in 750FX/GX systems is a major challenge, and many false fails have resulted from omitting the heatsink during debug. It takes the 750GX about two seconds to overheat without a heatsink, so put the heatsink back on! See the datasheet for more information.

Verification of the junction temperature with a heatsink in place can be challenging. It is usually possible to drill a small hole in the heatsink close to the die surface, so that a small diameter thermocouple (or other sensor) can be placed very close to the surface of the die to measure Tj at the POF. Depending on the merit of the design, there might be a small temperature drop across the heatsink material between the die and the sensor, but this is usually not a big factor. If the measured temperature is anywhere close to Tj(max) for the part, then further investigation is required.
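One way to make "anywhere close to Tj(max)" concrete is to pick a margin and flag any reading inside it. The sketch below does that; the function name, the default 10-degree margin, the optional allowance for the drop between die and sensor, and the example numbers are all assumptions for illustration. Use the Tj(max) from the datasheet for your part.

```python
def tj_needs_investigation(sensor_c: float,
                           tj_max_c: float,
                           margin_c: float = 10.0,
                           sensor_drop_c: float = 0.0) -> bool:
    """True if the estimated junction temperature is within margin_c of Tj(max).

    sensor_drop_c is an estimate of the temperature drop between the die
    surface and the sensor; it is added back to the reading.
    """
    estimated_tj_c = sensor_c + sensor_drop_c
    return estimated_tj_c >= tj_max_c - margin_c


# Hypothetical readings against a placeholder Tj(max) of 105 C.
print(tj_needs_investigation(98.0, 105.0))   # True
print(tj_needs_investigation(70.0, 105.0))   # False
```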

Check for asynchronous events at the POF

Check for any correspondence of the fail with asynchronous events. This is especially appropriate if the fail happens at a different point in the code each time, or on a different iteration of the code, or only happens if the system is in a certain state or handling certain loads. Look for interrupts, other parts of the board that are doing unusual things, possible noise or transient sources, and anything else that might correspond to the fails.

Check the timing and signal quality

If all else fails, a signal quality and timing analysis occasionally reveals the root cause of a failure. It is best to perform this analysis at the POF, in order to catch a signal that is only occasionally out of spec, or a transient that only occasionally couples onto a net, causing a fail.

In general, prioritize the bus in terms of signals that are most likely to cause the fail. For a bus hang, for example, this would be the control lines first, and then the address bus. Verify that the signals arrive at the receiver with sufficient signal integrity and timing to satisfy the setup and hold times of the receiving bus agent. Check the signals against the bus clock at the receiver.

Check the bus clocks very carefully for cycle-to-cycle jitter, Fmax, and skew, all of which subtract from the timing budget. Be sure to verify the hold timing, since this is occasionally overlooked.
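The setup and hold checks above follow the usual synchronous timing-budget arithmetic, which can be sketched as follows. The field names and every number in the example are hypothetical placeholders, not datasheet values, and the model is the simple textbook one: jitter and skew are charged against setup, and skew against hold.

```python
from dataclasses import dataclass


@dataclass
class BusTiming:
    """All values in picoseconds, measured or taken from datasheets."""
    t_cycle: int          # bus clock period
    t_out_valid_max: int  # driver clock-to-output valid, maximum
    t_out_hold_min: int   # driver output hold, minimum
    t_flight_max: int     # longest board propagation delay
    t_flight_min: int     # shortest board propagation delay
    t_setup: int          # receiver input setup requirement
    t_hold: int           # receiver input hold requirement
    t_skew: int           # clock skew between driver and receiver
    t_jitter: int         # cycle-to-cycle clock jitter


def setup_margin(t: BusTiming) -> int:
    """Positive means the receiver's setup time is met."""
    return (t.t_cycle - t.t_jitter - t.t_skew
            - t.t_out_valid_max - t.t_flight_max - t.t_setup)


def hold_margin(t: BusTiming) -> int:
    """Positive means the receiver's hold time is met."""
    return t.t_out_hold_min + t.t_flight_min - t.t_skew - t.t_hold


# Hypothetical numbers for a 100 MHz bus.
timing = BusTiming(t_cycle=10000, t_out_valid_max=4500, t_out_hold_min=1000,
                   t_flight_max=1500, t_flight_min=500, t_setup=2500,
                   t_hold=600, t_skew=300, t_jitter=150)
print(setup_margin(timing), hold_margin(timing))  # 1050 600
```

Note that the hold margin does not depend on the clock period, which is why slowing the bus cannot fix a hold violation.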

Look for strange waveforms

Often, an experienced engineer or technician can spot a signal abnormality by examining the bus signals on a scope. Bus contentions and unexpected tristates can be identified in this way. Similarly, putting a pullup or pulldown resistor on a net can occasionally reveal abnormal activity.

Physically examine the board

Many circuit board errors have been found by inspection. Visually examining a board under bright light and magnification will often reveal solder shorts or opens, misalignments, and other physical defects.

CBGA packages can be examined using x-ray, tomographic, or other imaging techniques that reveal the structure of nonvisible connections.


If you must, go fishing

If a root cause of a failure is not forthcoming, you can develop information about the failure by varying a number of factors and observing the effect on the failure rate, in the following ways.

Make the failure happen more frequently

Things that make the failure happen more frequently give clues to the weakness in the system. For example, if raising the junction temperature makes the failure happen sooner, then the problem might be an input setup time or maximum frequency. (Raising Tj slows the logic.) Or if changing the code makes the fail happen more reliably, then an analysis of that part of the code could hold clues to the failure mechanism.

Make the failure happen less frequently

Similarly, changes that cause the failure to occur less frequently can often give direct clues to the nature of the failure. For example, if raising Vdd eliminates the failure, then Vdd should be examined for compliance to the specification.

Find changes that don't seem to affect the failure

Changes that do not seem to affect the failure also contain useful information. For example, if raising Vdd, lowering Tj, and reducing f(SYSCLK) have no effect on the failure, then it probably isn't related to the application conditions. It's probably a functional failure, and your team's main attention should be on analysis of the schematic, the bus activity, and the code.

Vary the processor core conditions

The maximum operating frequency of a given processor is affected by the inherent speed of the silicon, the temperature of the processor, and the core power supply voltage. Problems with marginal application conditions can often be investigated by raising and lowering Vdd and observing the effect on the failure.

Similarly, you can raise and lower the junction temperature and observe the effect on the fail. You can also raise and lower the core frequency, either by varying f(SYSCLK) or by changing the clock multiplier.

If lowering the junction temperature, raising Vdd, or decreasing Fcore eliminates the fail, then you should investigate a marginality in one of these factors. Likewise, if raising Tj, raising Fcore, or lowering Vdd causes the failure to occur more often, you should explore the same marginality.
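The reasoning in this section can be captured as a small classifier over experiment results. This is a sketch only: the experiment names (`lower_tj`, `raise_vdd`, `lower_fcore`), the effect labels, and the returned strings are hypothetical, and the function encodes just two rules from the text: any relief from these changes points to marginal application conditions, and no effect from any of them points to a functional failure.

```python
RELIEF_EXPERIMENTS = ("lower_tj", "raise_vdd", "lower_fcore")


def classify_core_experiments(results: dict) -> str:
    """results maps experiment name to 'fewer', 'more', or 'none'
    (the observed effect on the failure rate)."""
    observed = [results.get(name) for name in RELIEF_EXPERIMENTS]
    if any(effect == "fewer" for effect in observed):
        return "investigate marginal application conditions"
    if all(effect == "none" for effect in observed):
        return "probably functional: review schematic, bus activity, code"
    return "inconclusive: run the remaining experiments"


print(classify_core_experiments(
    {"lower_tj": "none", "raise_vdd": "fewer", "lower_fcore": "none"}))
print(classify_core_experiments(
    {"lower_tj": "none", "raise_vdd": "none", "lower_fcore": "none"}))
```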

Vary the processor bus conditions

If the fail seems to be affected by changing the bus frequency or OVdd, and not so much by changing Tj or Vdd, then you should investigate a signal quality or timing issue. For example, if halving f(SYSCLK) while doubling the core:SYSCLK multiplier (so that the core frequency is unchanged) eliminates the fails, then you should suspect the bus.

If system timing is suspect, then the bus clock to the various bus agents can be delayed to test the timing. For example, if the hold time from the processor (output hold) to the bridge (input hold) is suspect, insert a length of wire into the bus clock to the processor. This delays the processor signals in time, thus extending the hold time supplied to the bridge.

Of course, delaying the clock to one bus agent also increases that agent's output valid time (among other things), so this one experiment stresses a number of factors.
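The trade-off in the clock-delay experiment is simple arithmetic, sketched below for the processor-to-bridge direction only. The function name and the margin values are hypothetical, and the model ignores the other effects the text alludes to (for example, the change in the processor's own input timing).

```python
from typing import Tuple


def delay_processor_clock(bridge_setup_margin_ps: int,
                          bridge_hold_margin_ps: int,
                          delay_ps: int) -> Tuple[int, int]:
    """Margins at the bridge after delaying the processor's bus clock.

    Processor outputs launch delay_ps later, so the bridge sees more hold
    time but loses the same amount of setup time.
    """
    return (bridge_setup_margin_ps - delay_ps,
            bridge_hold_margin_ps + delay_ps)


# Hypothetical case: a 100 ps hold violation at the bridge, 300 ps of added delay.
print(delay_processor_clock(1050, -100, 300))  # (750, 200)
```

If the fail disappears with the added delay and the setup margin is still positive, the result is consistent with a hold-time problem on that path.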

Vary the bus logic conditions

Just as is the case with the processor, the core and IO power supplies to the other bus agents can also be varied, and the effects on the system similarly analyzed.


PowerPC system qualification

System qualification usually seeks to gain confidence in the operation of the system under normal conditions by testing the system over a range of conditions that exceed the expected operational envelope in every parameter. To avoid confusion, system qualification is best done after the software is debugged.

The major factors to vary in order to stress the design are:

  • Junction temperature: Usually this is best done by varying the temperature of the whole board at the same time, as would be the case in a production system that was experiencing thermal stress. Remember that exceeding the core temperature range might cause false fails. Also consider that raising Tj above 105°C can cause device damage, so modules that are stressed above 105°C should not be shipped.
  • Core frequency: Consider varying the core:SYSCLK ratio during testing, as well as varying the SYSCLK frequency. Depending on the requirements of the system, different design teams choose different frequency guardbands. One team might choose to vary the frequency 5% during qualification testing, and another group might vary the frequency 10%.
  • 60x bus frequency: Bus frequency is not usually a significant factor for 100 MHz or 133 MHz bus clock designs, but with the emergence of 166 MHz and 200 MHz designs, SYSCLK guardbanding has become increasingly important. In general, changing the core:bus ratio does not affect the IO timing or the operation of the IO logic, so it is valid to decrease the core:SYSCLK ratio, to be able to vary SYSCLK up and down without exceeding the core frequency envelope.
  • Supply voltages: Extensive testing should be done on the supply voltages. Vdd has an effect on the maximum core frequency, and also on the IO timing. OVdd affects the IO timing.
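The frequency guardbanding described in the list above can be sketched as a small calculation: given a nominal SYSCLK, a guardband, and the maximum core frequency, pick the largest core:SYSCLK ratio that keeps the core inside its envelope at the top of the SYSCLK sweep. The function name, the candidate ratios, and all numbers are hypothetical; the valid PLL settings and frequency limits (including any minimum frequencies, which this sketch does not check) come from the datasheet.

```python
from typing import Optional, Sequence


def max_ratio_for_guardband(sysclk_mhz: float,
                            guardband: float,
                            fcore_max_mhz: float,
                            ratios: Sequence[float]) -> Optional[float]:
    """Largest core:SYSCLK ratio for which the core stays at or below
    fcore_max_mhz when SYSCLK is raised by the guardband fraction."""
    sysclk_high_mhz = sysclk_mhz * (1 + guardband)
    usable = [r for r in ratios if sysclk_high_mhz * r <= fcore_max_mhz]
    return max(usable) if usable else None


# Hypothetical: 133 MHz SYSCLK, 10% guardband, 800 MHz core limit.
candidate_ratios = [4.0, 4.5, 5.0, 5.5, 6.0]
print(max_ratio_for_guardband(133.0, 0.10, 800.0, candidate_ratios))  # 5.0
```

With these placeholder numbers, a ratio of 5.5 would push the core to roughly 805 MHz at the top of the sweep, so the ratio is reduced to 5.0 for the SYSCLK test.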

Help tech support help you

The IBM PowerPC Applications Engineering team often helps customers with difficult problems encountered during system debug. To do so, we need information about the system, the environment, and the characteristics of the fail. Without the relevant information, we can only guess at the source of the problem... and that usually doesn't work very well. Here's a starter list of the information that helps IBM help customers solve their debug problems:

  • Softcopy block diagram of the system, schematics of the CPU bus, and anything else that touches the CPU; the logic levels of the strapping pins at the deassertion of HRESET#
  • Softcopy logic analyzer traces of the busses at the POF; please sample the bus on the rising edge of SYSCLK.
  • The code that was running at the POF
  • The content of the CPU registers at the POF
  • Environment at the POF: actual Tj, Fcore, f(SYSCLK), Vdd, OVdd; high-resolution scope traces of the power supplies
  • Verification that the system timing has been calculated correctly
  • Scope traces of HRESET#, TRST#, SYSCLK, Vdd, and OVdd during power up
  • The results of the signal quality and timing study on the system at the POF
  • Scope traces of relevant signals around the POF, to double-check the logic analyzer traces
  • Frequency and circumstances of the fail; try to give us as much information as possible.
  • The maturity of the project, and the history of the failing board(s):
    • Is this a prototype?
    • How many other boards like this exist? What are their histories?
    • How many boards fail production line testing?
    • How many fail box/system test?
    • How many fail in the field?
    • Describe the circumstances of the fails in each case.

Conclusion

Implementing the best practices outlined in this series in your PowerPC 750FX and 750GX projects should spare your team a world of time and hurt. And be sure to visit the PowerPC Technical Library, where you'll find an application note with this same information and much more that's specific to 750FX and 750GX processors.

Resources
