Part 1 of this series offered tips on system design with the PowerPC® 750FX and 750GX processors. Part 2 assumes that you designed your system with those tips in mind. But even the best-designed systems can run into trouble, so this article focuses on troubleshooting and debugging techniques that work well with PowerPC. It offers hints on performing system qualification of PowerPC systems, and presents the standard "shopping list" of information that the IBM PowerPC Applications Engineering team uses to help customer and field applications engineer (FAE) teams debug their systems.
As discussed in Part 1, every team has its own way of designing, debugging, and troubleshooting boards or systems. This article isn't intended either to teach basic techniques, or to dictate particular techniques to experienced teams. Also, PowerPC systems are diverse in nature and application, so some of this discussion might not apply to a particular system.
If the board doesn't boot
If the prototype board doesn't boot, then something foundational is probably badly wrong. The good news is that these things are usually relatively easy to identify.
Check part numbers
Often, the cause of a prototype not booting is that the wrong part(s) have been installed. So if the system does not boot, check the part numbers of the processor and all of the parts associated with the processor.
Try another one
In a significant number of development boards, some of the components have been damaged by mishandling or misadventure before debugging of the processor bus has begun. If only one board is being brought up, a good technique is to bring up another board, or replace the component that seems to be failing. This can eliminate a long hunt for system problems when the cause is actually a broken part.
Check power-supply voltages
Check that all of the power supply voltages (Vdd, OVdd, AVdd) are in spec according to the datasheet (or datasheet supplement) for the particular part and for the application conditions specified by the part number. Check that the noise and ripple don't cause the power supply to violate the minimum or maximum voltages for the part and the application conditions. Note that if the instantaneous value of Vdd drops below Vdd(min) for that frequency and junction temperature, the part might not operate correctly. Only a ±15 mV allowance for ripple and noise is built into the Vdd specification. See the datasheet for the part and the IBM application note PowerPC 750FX and 750GX Power Dissipation for more details.
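As a minimal sketch of the instantaneous-voltage check described above, the following Python compares scope samples of one rail against its limits. The function name, the limits, and the sample values are hypothetical and chosen only for illustration; take the real minimum and maximum from the datasheet for your part number and application conditions.

```python
def check_rail(samples_v, v_min, v_max):
    """Return (ok, lowest, highest) for instantaneous samples of one rail."""
    lowest, highest = min(samples_v), max(samples_v)
    return (v_min <= lowest and highest <= v_max), lowest, highest


# Hypothetical limits and scope samples, in volts (not datasheet values).
VDD_MIN, VDD_MAX = 1.40, 1.50
samples = [1.452, 1.447, 1.438, 1.395, 1.449]  # one brief dip below VDD_MIN

ok, lowest, highest = check_rail(samples, VDD_MIN, VDD_MAX)
mean = sum(samples) / len(samples)
print(f"mean={mean:.3f} V, lowest={lowest:.3f} V, in spec: {ok}")
```

The point of the example is that the average of these samples sits inside the limits while a single dip does not, which is why the check uses the minimum and maximum rather than the mean.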
Because the maximum frequency of the processor is limited by the value of Vdd, verify that the correct Vdd has been chosen.
Verify that the correct components have been installed for the AVdd filter (see the datasheet). Verify that the correct Vdd and OVdd bypassing is installed. See the IBM application note PowerPC 750FX Power Supply Layout and Bypassing for more details (see Resources).
Verify that the power supplies and bus voltages remain in compliance with the power-supply envelopes specified by the notes to the "Absolute Maximum Ratings" table in the datasheet. Violation of these envelopes can cause functional fails and can also cause damage to the processor. In the initial stages, this damage is difficult to detect in the field and is usually only detectable on IBM testers.
Check the resets
Check the state of the reset signals at power up. Generally, SRESET# is pulled high during reset, and HRESET# and TRST# are driven low. Verify that HRESET# is asserted for a sufficient amount of time after SYSCLK and the power supplies stabilize. See the datasheet for more details.
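The reset-duration check can be sketched from three timestamps read off a scope capture of power up. This is only an illustration: the function name and the example numbers are hypothetical, and the minimum assertion time must come from the datasheet.

```python
def hreset_assertion_ok(t_power_stable_us, t_sysclk_stable_us,
                        t_hreset_deassert_us, min_assert_us):
    """True if HRESET# stays asserted long enough after the later of
    power-supply and SYSCLK stabilization. All times use one time base."""
    t_stable_us = max(t_power_stable_us, t_sysclk_stable_us)
    return (t_hreset_deassert_us - t_stable_us) >= min_assert_us


# Hypothetical scope readings in microseconds; 255 us is a placeholder,
# not the datasheet requirement.
print(hreset_assertion_ok(1000, 1500, 1700, min_assert_us=255))
```

Measuring from the later of the two stabilization times matters because a clock that settles after the supplies shortens the effective reset window.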
Check strapping options
Verify that the strapping pins are in the desired state preceding the deassertion of HRESET#. The required timing is shown in the datasheet, and the correct logic level is shown in the User's Manual for most pins (PLL_CFG is typically shown in the datasheet). The strapping pins typically vary somewhat from processor to processor, so verify that an incorrect legacy pin strapping did not creep into the current design.
Verify that the timing of the strapping pins is correct. Strapping pins change function after the deassertion of HRESET#; they change from the strapping-pin function to the normal function. These pins generally must assume the correct level for normal operation within a couple of cycles of the deassertion of HRESET#.
Check SYSCLK
Verify that SYSCLK is valid for the required amount of time and cycles before the deassertion of HRESET# (see the section, Check the resets). Verify that SYSCLK is running at the intended frequency. Check the SYSCLK signal quality. Verify that the jitter and skew are within datasheet specifications.
Examine bus activity
Is anything happening on the bus? Verify that the bus is idle at the deassertion of HRESET#, and that this state is followed by a valid address bus arbitration. If the processor does not request the bus, or has BG# and never asserts TS#, then concentrate on the processor. If TS# is issued, but there's no response (or an incorrect response) from the system, then concentrate on the system.
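The triage above can be written down as a short decision function. This is a sketch only: the four boolean inputs are hypothetical observations taken from a logic analyzer capture after reset, and the "requested but never granted" branch is an inference added here for completeness rather than something the text states.

```python
def triage_boot_hang(bus_requested, bus_granted, ts_asserted, system_responded):
    """Suggest where to look first, based on the first bus transaction after reset."""
    if not bus_requested:
        return "processor: never requests the bus"
    if not bus_granted:
        # Added inference: the arbiter, not the processor, is withholding BG#.
        return "system: bus requested but BG# never granted"
    if not ts_asserted:
        return "processor: has BG# but never asserts TS#"
    if not system_responded:
        return "system: TS# issued but no valid response"
    return "first transaction looks healthy: look later in the trace"


print(triage_boot_hang(True, True, True, False))
```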
Give it a good whack
Occasionally, the appropriate application of a kinetic shock will restore connectivity to connectors and contacts that have opened because of contamination, corrosion, or other causes.
If the board fails during normal operation
If the prototype boots, but fails during operation, consider the following ideas in addition to the tips in the "If the board doesn't boot" section of this article.
Take measurements close to the point of failure
In some cases, the root cause of the problem is a condition that exists for a long time and causes a malfunction only with certain processors or under certain conditions (code, environment, internal state, and so on). In other cases, the condition will immediately trigger a fail. So in all cases, you should take measurements as close to the point of fail (POF) -- the point at which the failure occurs -- as possible.
Identify the POF
The POF could be a point in time, a point in the code, a point at which the machine is in a certain internal state, or a point that corresponds to an asynchronous event, such as an interrupt, an electrical transient, or a certain pattern in IO data. Whatever the case, it is usually important to identify the POF in whatever terms are appropriate, because once the POF is known, the failure can usually be isolated and identified.
Capture the data
Once the POF is known, you can investigate the conditions that cause the fail. Generally, this requires a huge amount of initial data collection. Usually you need to instrument the control and address busses, and occasionally the data bus. The activity on the bus often provides the key to understanding the fail. Look for the point at which the bus protocol breaks, or the bus hangs, or simply goes idle.
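One way to look for the point at which the bus hangs is to scan an exported trace for an address tenure that is never acknowledged. The sketch below assumes a hypothetical export format, a list of (cycle, ts, aack) tuples sampled once per bus clock with True meaning asserted, and it assumes that AACK# is the response of interest. It ignores address pipelining, so treat it as a starting point rather than a protocol checker.

```python
def find_unanswered_ts(trace, timeout_cycles):
    """Return cycle numbers where TS# was asserted and no AACK# followed
    within timeout_cycles bus clocks."""
    unanswered = []
    for i, (cycle, ts, _aack) in enumerate(trace):
        if not ts:
            continue
        answered = any(
            aack and 0 < later_cycle - cycle <= timeout_cycles
            for later_cycle, _ts, aack in trace[i + 1:]
        )
        if not answered:
            unanswered.append(cycle)
    return unanswered


# Hypothetical capture: (cycle, TS# asserted, AACK# asserted)
trace = [
    (0, False, False), (1, True, False), (2, False, False), (3, False, True),
    (4, False, False), (5, True, False), (6, False, False), (7, False, False),
    (8, False, False),
]
print(find_unanswered_ts(trace, timeout_cycles=3))
```

A TS# that falls within the timeout of the end of the capture is also reported, so capture enough cycles after the suspected POF to tell a real hang from a truncated trace.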
Check the application conditions at the point of fail
Since the performance of the processor is limited by the application conditions, verify Vdd, OVdd, Fcore, and Tj at the POF.
Power dissipation in 750FX/GX systems is a major challenge, and many false fails have resulted from omitting the heatsink during debug. It takes the 750GX about two seconds to overheat without a heatsink, so put the heatsink back on! See the datasheet for more information.
Verification of the junction temperature with a heatsink in place can be challenging. It is usually possible to drill a small hole in the heatsink close to the die surface, so that a small diameter thermocouple (or other sensor) can be placed very close to the surface of the die to measure Tj at the POF. Depending on the merit of the design, there might be a small temperature drop across the heatsink material between the die and the sensor, but this is usually not a big factor. If the measured temperature is anywhere close to Tj(max) for the part, then further investigation is required.
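The application-condition check at the POF amounts to a little arithmetic, sketched here. The function name, the 5°C warning margin, and every numeric limit in the example are assumptions made for illustration; the real Fcore and Tj limits depend on the part number and Vdd and must come from the datasheet.

```python
def app_condition_flags(sysclk_mhz, multiplier, tj_c,
                        fcore_max_mhz, tj_max_c, tj_margin_c=5.0):
    """Return (fcore_mhz, flags) for the conditions measured at the POF."""
    fcore_mhz = sysclk_mhz * multiplier
    flags = []
    if fcore_mhz > fcore_max_mhz:
        flags.append("Fcore exceeds the limit for this part and Vdd")
    if tj_c > tj_max_c - tj_margin_c:
        flags.append("Tj is close to or above Tj(max): investigate further")
    return fcore_mhz, flags


# Hypothetical measurements and limits (not datasheet values).
fcore, flags = app_condition_flags(sysclk_mhz=133, multiplier=6, tj_c=102.0,
                                   fcore_max_mhz=800, tj_max_c=105.0)
print(fcore, flags)
```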
Check for asynchronous events at the POF
Check for any correspondence of the fail with asynchronous events. This is especially appropriate if the fail happens at a different point in the code each time, or on a different iteration of the code, or only happens if the system is in a certain state or handling certain loads. Look for interrupts, other parts of the board that are doing unusual things, possible noise or transient sources, and anything else that might correspond to the fails.
Check the timing and signal quality
If all else fails, a signal quality and timing analysis occasionally reveals the root cause of a failure. It is best to perform this analysis at the POF, in order to catch a signal that is only occasionally out of spec, or a transient that only occasionally couples onto a net, causing a fail.
In general, prioritize the bus in terms of signals that are most likely to cause the fail. For a bus hang, for example, this would be the control lines first, and then the address bus. Verify that the signals arrive at the receiver with sufficient signal integrity and timing to satisfy the setup and hold times of the receiving bus agent. Check the signals against the bus clock at the receiver.
Check the bus clocks very carefully for cycle-to-cycle jitter, Fmax, and skew, all of which subtract from the timing budget. Be sure to verify the hold timing, since this is occasionally overlooked.
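A simple timing budget makes it clear how jitter and skew subtract from the margins, and why hold timing deserves its own check. The following sketch uses a common first-order model; the parameter names and all numbers are hypothetical, and the real output valid, output hold, setup, and hold figures come from the datasheets of the two bus agents. Times are integers in picoseconds.

```python
def setup_margin_ps(t_cycle, t_out_valid_max, t_flight_max, t_setup, skew, jitter):
    """Time left over before the receiver's setup window closes."""
    return t_cycle - t_out_valid_max - t_flight_max - t_setup - skew - jitter


def hold_margin_ps(t_out_hold_min, t_flight_min, t_hold, skew):
    """Time the data stays valid beyond the receiver's hold requirement."""
    return t_out_hold_min + t_flight_min - t_hold - skew


# Hypothetical figures for a bus clock of roughly 133 MHz (7500 ps period).
setup = setup_margin_ps(t_cycle=7500, t_out_valid_max=4000, t_flight_max=1000,
                        t_setup=1500, skew=300, jitter=150)
hold = hold_margin_ps(t_out_hold_min=600, t_flight_min=400, t_hold=500, skew=300)
print(setup, hold)
```

Note that the hold margin does not depend on the clock period, so slowing the bus does not cure a hold violation.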
Look for strange waveforms
Many times, an experienced engineer or technician can spot a signal abnormality by examining the bus signals on a scope. Bus contentions and unexpected tristates can be identified in this way. Similarly, putting pullup or pulldown resistors on a net can occasionally reveal abnormal activity.
Physically examine the board
Many circuit board errors have been found by inspection. Visually examining a board under bright light and magnification will often reveal solder shorts or opens, misalignments, and other physical defects.
CBGA packages can be examined using x-ray, tomographic, or other imaging techniques that reveal the structure of nonvisible connections.
If you must, go fishing
If a root cause of a failure is not forthcoming, you can develop information about the failure by varying a number of factors and observing the effect on the failure rate, in the following ways.
Make the failure happen more frequently
Things that make the failure happen more frequently give clues to the weakness in the system. For example, if raising the junction temperature makes the failure happen sooner, then the problem might be an input setup time or maximum frequency. (Raising Tj slows the logic.) Or if changing the code makes the fail happen more reliably, then an analysis of that part of the code could hold clues to the failure mechanism.
Make the failure happen less frequently
Similarly, changes that cause the failure to occur less frequently can often give direct clues to the nature of the failure. For example, if raising Vdd eliminates the failure, then Vdd should be examined for compliance to the specification.
Find changes that don't seem to affect the failure
Changes that do not seem to affect the failure also contain useful information. For example, if raising Vdd, lowering Tj, and reducing f(SYSCLK) have no effect on the failure, then it probably isn't related to the application conditions. It's probably a functional failure, and your team's main attention should be on analysis of the schematic, the bus activity, and the code.
Vary the processor core conditions
The maximum operating frequency of a given processor is affected by the inherent speed of the silicon, the temperature of the processor, and the core power supply voltage. Problems with marginal application conditions can often be investigated by raising and lowering Vdd and observing the effect on the failure.
Similarly, you can raise and lower the junction temperature and observe the effect on the fail. You can also raise and lower the core frequency, either by varying f(SYSCLK) or by changing the clock multiplier.
If lowering the junction temperature, raising Vdd, or decreasing Fcore eliminates the fail, then you should investigate a marginality in one of these factors. Likewise, if raising Tj, raising Fcore, or lowering Vdd causes the failure to occur more often, you should explore the same fault.
Vary the processor bus conditions
If the fail seems to be affected by changing the bus frequency or OVdd, and not so much by changing Tj or Vdd, then you should investigate a signal quality or timing issue. For example, if halving f(SYSCLK) and doubling the clock multiplier (which leaves Fcore unchanged) eliminates the fails, then you should suspect the bus.
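The arithmetic behind this experiment is worth spelling out, because the goal is to slow the bus while holding the core frequency constant. In the sketch below, the function name and the example frequencies are hypothetical, and it assumes the doubled multiplier is one that the PLL configuration of your part actually supports, which you should confirm in the datasheet.

```python
def halve_bus_keep_core(sysclk_mhz, multiplier):
    """Return (new_sysclk_mhz, new_multiplier) with the same core frequency."""
    return sysclk_mhz / 2, multiplier * 2


sysclk, mult = 100, 8  # hypothetical starting point
new_sysclk, new_mult = halve_bus_keep_core(sysclk, mult)

# The core frequency is unchanged, so any change in the failure rate
# points at the bus rather than at the core.
print(sysclk * mult, new_sysclk * new_mult)
```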
If system timing is suspect, then the bus clock to the various bus agents can be delayed to test the timing. For example, if the hold time from the processor (output hold) to the bridge (input hold) is suspect, insert a length of wire into the bus clock to the processor. This delays the processor signals in time, thus extending the hold time supplied to the bridge.
Of course, delaying the clock to one bus agent also increases that agent's output valid time (among other things), so this one experiment stresses a number of factors.
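To estimate how much delay a length of wire adds, and what it does to the margins, a rough calculation like the following can help. It is a sketch under loud assumptions: the 5 ps per millimeter figure is only a ballpark for signal propagation in insulated wire and is not from this article, the function names are hypothetical, and the actual delay should be measured on a scope.

```python
def clock_delay_ps(wire_length_mm, ps_per_mm=5):
    """Approximate added clock delay; ps_per_mm is an assumed ballpark value."""
    return wire_length_mm * ps_per_mm


def shift_margins(hold_margin_ps, setup_margin_ps, delay_ps):
    """Delaying the driving agent's clock moves its outputs later in time:
    the receiver gains hold margin and loses the same amount of setup margin."""
    return hold_margin_ps + delay_ps, setup_margin_ps - delay_ps


delay = clock_delay_ps(100)  # 100 mm of wire in the processor's clock
print(shift_margins(hold_margin_ps=-100, setup_margin_ps=900, delay_ps=delay))
```

If the fail disappears with the added delay, a hold problem is a reasonable suspect; if a new fail appears, the experiment may have consumed the setup margin instead.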
Vary the bus logic conditions
Just as is the case with the processor, the core and IO power supplies to the other bus agents can also be varied, and the effects on the system similarly analyzed.
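The rules of thumb in this section can be condensed into a small lookup. This is only a sketch of the reasoning: the two inputs are hypothetical summaries of your experiments, and the "mixed" result is added here to cover the case the text does not discuss.

```python
def suggest_focus(core_knobs_matter, bus_knobs_matter):
    """core_knobs_matter: the failure rate changes with Vdd, Tj, or Fcore.
    bus_knobs_matter: the failure rate changes with bus frequency or OVdd."""
    if core_knobs_matter and not bus_knobs_matter:
        return "core application conditions: check Vdd, Tj, and Fcore margins"
    if bus_knobs_matter and not core_knobs_matter:
        return "bus: check signal quality and timing"
    if not core_knobs_matter and not bus_knobs_matter:
        return "functional: analyze the schematic, bus activity, and code"
    # Not covered by the article: both domains show sensitivity.
    return "mixed: vary one factor at a time to separate the effects"


print(suggest_focus(core_knobs_matter=False, bus_knobs_matter=True))
```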
PowerPC system qualification
System qualification usually seeks to gain confidence in the operation of the system under normal conditions by testing the system over a range of conditions that exceed the expected operational envelope in every parameter. To avoid confusion, system qualification is best done after the software is debugged.
The major factors to vary in order to stress the design are:
- Junction temperature: Usually this is best done by varying the temperature of the whole board at the same time, as would be the case in a production system that was experiencing thermal stress. Remember that exceeding the core temperature range might cause false fails. Also consider that raising Tj above 105°C can cause device damage, so modules that are stressed above 105°C should not be shipped.
- Core frequency: Consider varying the core:SYSCLK ratio during testing, as well as varying the SYSCLK frequency. Depending on the requirements of the system, different design teams choose different frequency guardbands. One team might choose to vary the frequency 5% during qualification testing, and another group might vary the frequency 10%.
- 60x bus frequency: Bus frequency is not usually a significant factor for 100 MHz or 133 MHz bus clock designs, but with the emergence of 166 MHz and 200 MHz designs, SYSCLK guardbanding has become increasingly important. In general, changing the core:bus ratio does not affect the IO timing or the operation of the IO logic, so it is valid to decrease the core:SYSCLK ratio, to be able to vary SYSCLK up and down without exceeding the core frequency envelope.
- Supply voltages: Extensive testing should be done on the supply voltages. Vdd has an effect on the maximum core frequency, and also on the IO timing. OVdd affects the IO timing.
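The factors above combine into a matrix of test corners, which can be enumerated mechanically. The sketch below is one possible way to do it; the function name, the guardband, and every level in the example are hypothetical, and the real limits come from the datasheet and your own qualification plan. Corners whose core frequency would exceed the envelope are skipped, which is where a lower core:SYSCLK ratio becomes useful.

```python
from itertools import product


def qualification_corners(sysclk_nom_mhz, guardband_pct, ratios,
                          vdd_levels, ovdd_levels, tj_levels, fcore_max_mhz):
    """List every corner whose core frequency stays inside the envelope."""
    sysclks = [sysclk_nom_mhz * (100 + d) / 100
               for d in (-guardband_pct, 0, guardband_pct)]
    corners = []
    for sysclk, ratio, vdd, ovdd, tj in product(
            sysclks, ratios, vdd_levels, ovdd_levels, tj_levels):
        fcore = sysclk * ratio
        if fcore <= fcore_max_mhz:  # avoid false fails from overclocking the core
            corners.append({"sysclk_mhz": sysclk, "ratio": ratio,
                            "fcore_mhz": fcore, "vdd": vdd,
                            "ovdd": ovdd, "tj_c": tj})
    return corners


# Hypothetical plan: 200 MHz SYSCLK with a 5% guardband, two ratios,
# and low/high levels for each supply and for Tj.
plan = qualification_corners(200, 5, ratios=[4, 5],
                             vdd_levels=[1.40, 1.50], ovdd_levels=[2.4, 2.6],
                             tj_levels=[0, 105], fcore_max_mhz=1000)
print(len(plan), "corners")
```

With these example numbers, the 210 MHz corner is only reachable at the lower ratio, which mirrors the advice to decrease the ratio in order to guardband SYSCLK.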
Help tech support help you
The IBM PowerPC Applications Engineering team often helps customers with difficult problems encountered during system debug. To do so, we need information about the system, the environment, and the characteristics of the fail. Without the relevant information, we can only guess at the source of the problem... and that usually doesn't work very well. Here's a starter list of the information that helps IBM help customers solve their debug problems:
- Softcopy block diagram of the system, schematics of the CPU bus, and anything else that touches the CPU; the logic levels of the strapping pins at the deassertion of HRESET#
- Softcopy logic analyzer traces of the busses at the POF. Please sample the bus on the rising edge of SYSCLK.
- The code that was running at the POF
- The content of the CPU registers at the POF
- Environment at the POF: actual Tj, Fcore, f(SYSCLK), Vdd, OVdd; high-resolution scope traces of the power supplies
- Verification that the system timing has been calculated correctly
- Scope traces of SYSCLK, Vdd, and OVdd during power up
- The results of the signal quality and timing study on the system at the POF
- Scope traces of relevant signals around the point of fail, to double-check the logic analyzer (LA) traces
- Frequency and circumstances of the fail. Try to give us as much information as possible.
- The maturity of the project, and the history of the failing board(s):
  - Is this a prototype?
  - How many other boards like this exist? What are their histories?
  - How many boards fail production line testing?
  - How many fail box/system test?
  - How many fail in the field?
  - Describe the circumstances of the fails in each case.
Implementing the best practices outlined in this series in your PowerPC 750FX and 750GX projects should spare your team a world of time and hurt. And be sure to visit the PowerPC Technical Library, where you'll find an application note with this same information and much more that's specific to 750FX and 750GX processors.
Resources
- Read Part 1 of this series, "System design", for more information on designing, troubleshooting, and debugging PowerPC systems (developerWorks, October 2005).
- This article series is based on the 750FX-GX Design/Debug Tips application note from the IBM Technology Group Library.
- You'll find much of the documentation referenced in this article -- including the PowerPC 750FX Microprocessor User's Manual, the IBM PowerPC 750GX RISC Microprocessor User's Manual, datasheets, the PowerPC Architecture Book, specifications and reference designs, errata, FAQs, and more -- at the PowerPC Technical Library as well.
- The PowerPC 750FX Power Supply Layout and Bypassing application note describes design considerations for power and ground connectivity, and bypass capacitor selection and placement, for the IBM PowerPC 750FX and 750GX microprocessors.
- The PowerPC 750FX Evaluation Kit and PowerPC 750GX Evaluation Kit support benchmarking, reference design, prototyping, and software development. They include all the hardware, software, and documentation required to provide decision support for your design process.
- Learn more about factors affecting power dissipation from the 750FX and 750GX Power Dissipation Presentation.
- Find links to PowerPC training, documentation, pricing, and tech support at the PowerPC processors Technical support page.
- Have experience you'd be willing to share with Power Architecture zone readers? Article submissions on all aspects of Power Architecture technology from authors inside and outside IBM are welcomed. Check out the Power Architecture author FAQ to learn more.
- Have a question or comment on this story, or on Power Architecture technology in general? Post it in the Power Architecture technical forum or send in a letter to the editors.
- Get a subscription to the Power Architecture Community Newsletter when you Join the Power Architecture community.
- All things Power are chronicled in the developerWorks Power Architecture editors' blog, which is just one of many developerWorks blogs.
- Find more articles and resources on Power Architecture technology and all things related in the developerWorks Power Architecture technology content area.
- Download an IBM PowerPC 405 Evaluation Kit to demo a SoC in a simulated environment, or just to explore the fully licensed version of Power Architecture technology.