Level: Intermediate
Neil Leeder (neileedr@us.ibm.com), Software Engineer, IBM
21 Nov 2006
Explore the register-level details of tuning the CPC945's double data rate 2 (DDR2) memory controller for specific hardware implementations. Author Neil Leeder introduces the CPC945's nifty self-calibrating hardware features, helps you learn what needs to be tuned, and offers useful heuristics for how to tune it.
Introduction to the CPC945
The IBM CPC945 is a North Bridge for use with the family of PowerPC® 970 processors, such as 970FX and 970MP. Along with features such as an interrupt controller and HyperTransport and PCI-Express bridges, it contains a double data rate 2 (DDR2) memory controller.
Overview of the CPC945 memory controller
The memory controller supports up to eight DIMMs, running at a clock speed of up to 266 MHz, which produces 533 megatransfers per second, with optional error correcting code (ECC).
One of the innovative features of the memory controller is the inclusion of auto-calibration hardware. A memory controller needs to be calibrated for the unique characteristics of each system in which it is used. Details such as the distance from the controller to the memory slots on the board will affect the timing of memory operations. The process of aligning the memory transactions to the memory controller is known as "tuning." With many memory controllers, the tuning process is manual, involving trial and error. The CPC945, however, contains auto-calibration hardware which, when used with associated software, eases the tuning process. This allows the determination of optimal tuning values with more accuracy and reduces the bring-up time of new board designs.
DIMMs are commonly accessed in pairs. Excluding ECC lanes, each DIMM is 64 bits wide, so a 128-bit access spans a lower (bits 0-63) and upper (bits 64-127) DIMM. Each board may have up to four lower and four upper DIMMs on it. Due to the physical position of the DIMMs on the board, the distance from the CPC945 to the lower set of DIMMs might be significantly different from the distance to the upper set. To accommodate this difference, some tuning parameters are duplicated so that there is one value for the lower DIMMs and one for the upper set. For some of the finer tuning parameters, there are also multiple values which can be tuned on a bytelane basis, allowing for the differences between each end of the DIMM.
Tuning steps
Several signals need to be aligned with each other to ensure successful operation. Some of these you can easily view with a tool such as an oscilloscope or logic analyzer. Programming values into the appropriate CPC945 control registers alters the delays associated with the signals, which causes them to move, and you can see the changes on the instrument. Other delays occur internally to the CPC945 and are not viewable externally. Auto-calibration helps with tuning these internal delays.
Aligning the clock to chip select
The first stage in tuning the controller to the board is to ensure that the memory clock lines up with the chip select signal that is used when accessing a DIMM. You can use an oscilloscope or logic analyzer to display the alignment. The goal is for the rising edge of the clock to be close to the center of chip select. In practice, it's preferable if the rising edge is slightly to the right of center, allowing a longer set-up time, as Figure 1 shows. Two registers can be used to adjust the centering of the clock. CKDelayL controls the offset for the lower memory banks, while CKDelayU is for the upper banks. By increasing or decreasing the values in the CKDelta field, the relationship between the clock edge and the chip select can be altered until it is aligned as described above.
Figure 1
The problem of aligning read data
When the memory controller sends out a read request to memory, it can't determine precisely when the read data will arrive back. Although you can calculate the time needed for the DIMM to process the read, some other factors affect the time of the complete transaction. These include such items as memory chip and board wiring delays. For this reason, data is accompanied by a strobe signal. The edges of the strobe determine when to sample the data.
The data and strobe lines are bidirectional, being driven by either the controller or the memory. After the memory controller sends out its read request, it stops driving the data and strobe lines to the memory. The memory won't drive the lines until it is ready to send back the data, so for a period, neither side is controlling the lines, and during this time they are susceptible to noise. The controller needs to monitor the strobe lines to determine when data is present. Figure 2 shows the period preceding the strobes when the memory will drive the strobe lines low, called the read preamble. In this time, the controller will start looking for the strobes. If it starts looking too early, before the preamble, then it may inadvertently see noise on the undriven line and mistake it for a strobe. If it waits too long, until after the preamble, then it will miss seeing the first strobe. So the next step in tuning is to get the controller to start looking for strobes during the read preamble. This delay is known as the Reset Multiplexer Delay Select, commonly abbreviated ResMuxDel.
Figure 2
On some older memory controllers, the way to determine a valid ResMuxDel value is by trial and error. A value is calculated, then data is written to the DIMM and readback is attempted.
One problem with this approach is that you must be sure that the data is being written correctly to the DIMM in the first place, which means that the tuning of data writes must have been completed successfully. Of course, determining that a write is successful means reading the data back, and this presents a problem if reads have not yet been tuned.
A second problem is that it is optimal to pick a ResMuxDel value which is centered in the read preamble window. Picking a value that "just works" on one board might mean that it is close to the edge of the read preamble. Due to variance in individual components, it is possible that the value could fail on other boards, in different temperature conditions, or with slight voltage changes.
The CPC945 with its auto-calibration features addresses both of these problems.
Coarse read delay setting
The delay before starting to look for the read strobe is programmed in two parts. ResMuxDel is a global value that applies to all parts of all DIMMs, and is known as the coarse setting. There is also a fine setting which further refines the delay and has more granularity, allowing different settings across DIMMs and across byte lanes. The next section covers the fine setting.
To test all possible ranks on the board, all DIMM slots should be populated with dual rank memory for the following tests, with the DmCnfg (DIMM configuration) registers accurately describing the installed memory.
The auto-calibration hardware in the CPC945 memory controller can be used to determine the ResMuxDel values which work for each rank of each DIMM. The read preamble is two clocks long and the ResMuxDel delay is measured in clocks, so there are typically two values that will work for each rank. It is common to find that the same two values work for all DIMMs, although it is possible that different delays are determined for some DIMMs. For example, due to the varying distances between the DIMM slot and the CPC945, some DIMMs on the board may have possible ResMuxDel values of 8 or 9, while another DIMM may have values of 9 and 10. At least one common value will always work with all DIMMs; in this case it would be 9.
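The selection of a common value can be sketched as a set intersection. This is only an illustration: the function name, the dictionary layout, and the rank labels are hypothetical, and the passing values are the ones from the example above rather than measurements from real hardware.

```python
def common_resmuxdel(passing_by_rank):
    """Return the sorted ResMuxDel values that passed on every rank.

    passing_by_rank maps a rank label (hypothetical naming) to the
    list of ResMuxDel values that passed the test for that rank.
    """
    sets = [set(values) for values in passing_by_rank.values()]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


# Example values taken from the text: some DIMMs pass with 8 or 9,
# while another passes with 9 or 10.
passing = {
    ("dimm0", "rank0"): [8, 9],
    ("dimm0", "rank1"): [8, 9],
    ("dimm3", "rank0"): [9, 10],
}
common = common_resmuxdel(passing)
print(common)  # [9]
```

If the intersection ever came back empty, that would suggest a problem with the board or with the DmCnfg settings rather than something to be averaged away.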
The auto-calibration hardware allows selection of each possible ResMuxDel value for each rank. Each rank can be sequentially tested with every possible ResMuxDel value and can then report the values that passed the test. Special hardware inside the controller allows the data lane to be tested without having to read back a known data value. Instead, only the data strobe is used. The strobe signal is split, and one half is delayed by a half clock. The undelayed signal is then used as "data," while the delayed strobe is used as normal to sample the data, as Figure 3 shows. When the controller successfully reads four data bits with the pattern 1010, it can tell that it has found the genuine strobe signal and has passed the test. In this way, the memory contents returned on the read are ignored, so the problem of having to write known data into memory before reading it out again has been avoided.
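The sweep over ranks and ResMuxDel values might look like the following sketch. It is not the sample software from Resources: `strobe_test` stands in for whatever routine starts the auto-calibration test and reads back its pass/fail status, the candidate range of 0 to 15 is an assumption, and the simulated board exists only so that the example can run without hardware.

```python
def sweep_resmuxdel(ranks, candidates, strobe_test):
    """Try every candidate ResMuxDel on every rank.

    strobe_test(rank, value) is a hypothetical callback that returns
    True when the strobe-only read test passes for that setting.
    Returns a dict mapping each rank to its list of passing values.
    """
    results = {}
    for rank in ranks:
        results[rank] = [v for v in candidates if strobe_test(rank, v)]
    return results


# Simulated board (assumption): for each rank, the clock count at
# which the two-clock read preamble begins.
PREAMBLE_START = {0: 8, 1: 8, 2: 9}


def fake_strobe_test(rank, value):
    """Pass only when the controller starts looking inside the preamble."""
    start = PREAMBLE_START[rank]
    return start <= value < start + 2


coarse = sweep_resmuxdel(ranks=[0, 1, 2], candidates=range(16),
                         strobe_test=fake_strobe_test)
print(coarse)  # {0: [8, 9], 1: [8, 9], 2: [9, 10]}
```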
Figure 3
Find sample software to perform the coarse read delay calibration in Resources.
Fine read delay setting
After the coarse ResMuxDel values have been determined, the next task is to choose the value that is as close as possible to the center of the read preamble. The CPC945 memory controller has a set of vernier registers that divide the distance between ResMuxDel values into 256 smaller increments. There are a total of eight vernier settings, four for each of the lower and upper DIMMs. There are nine byte lanes in each DIMM (eight for data and one for ECC), so each vernier setting covers two or three adjacent byte lanes. Adjacent byte lanes will have similar distances from the DIMM to the CPC945, so they can easily share the same vernier value.
Although the goal is to determine the center of the read preamble, the vernier auto-calibration works by finding the edge of the preamble. When that is known, the center can be easily determined. First, the higher of the two possible ResMuxDel values found previously is programmed into the MemBusConfig[ResMuxDel] register field. Then, for one byte lane, the auto-calibration hardware repeatedly increments the vernier value, starting at zero, until it fails to read a successful data pattern. This indicates that the highest acceptable vernier value has been reached for this byte lane. The process is repeated for all byte lanes across all ranks. When all values are known, they are averaged across ranks, and across adjacent byte lanes, to arrive at a value which can be programmed into the vernier registers. The registers are RstLdEnVerniersC0-3, and each contains two fields. RstLdEnVerniersC0 contains one field for byte lanes 0 and 1, and a second field for lanes 2 and 3. The first field in RstLdEnVerniersC1 is for lanes 4, 5, and 16 (lane 16 is the ECC lane for the lower DIMM), while the second field is for lanes 6 and 7. Similar fields are used in RstLdEnVerniersC2 and RstLdEnVerniersC3 for the upper byte lanes 8 through 15, plus ECC lane 17.
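The edge search for a single byte lane can be sketched as a simple loop. Here `lane_passes` is a hypothetical callback standing in for one run of the auto-calibration test at a given vernier value, and the upper limit of 255 is an assumption based on the 256 increments mentioned earlier.

```python
def find_vernier_edge(lane_passes, max_vernier=255):
    """Return the highest vernier value that passes before the first failure.

    lane_passes(vernier) is a hypothetical callback that returns True
    when the test pattern is read successfully at that vernier value.
    Returns None if even a vernier value of zero fails.
    """
    last_pass = None
    for vernier in range(max_vernier + 1):
        if not lane_passes(vernier):
            break
        last_pass = vernier
    return last_pass


# Simulated lane (assumption): the pattern reads correctly up to 140.
edge = find_vernier_edge(lambda vernier: vernier <= 140)
print(edge)  # 140
```

A result of None would mean that the test fails even with the vernier at zero, which points to a problem with the coarse setting rather than the fine one.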
To calculate the value to be programmed for the field containing the offset for lanes 0 & 1, sum the values for lane 0 for all ranks, along with the values for all ranks for lane 1, and divide by the total number of values used. Program this number into the field in RstLdEnVerniersC0. Repeat for all other similar fields with the appropriate byte lane values. This can easily be done in software as part of the function which runs the tests above.
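As a rough sketch of this averaging, the lane groupings for the lower DIMM can be held in a table and the mean computed per field. The field labels "first" and "second", the input layout, and the use of truncating integer division are all assumptions made for illustration; the sample edge values are invented, and only the lower DIMM groupings stated in the text are shown.

```python
# Byte lanes sharing each vernier field for the lower DIMM, as
# described in the text. The "first"/"second" labels are hypothetical.
LOWER_FIELDS = {
    ("RstLdEnVerniersC0", "first"): (0, 1),
    ("RstLdEnVerniersC0", "second"): (2, 3),
    ("RstLdEnVerniersC1", "first"): (4, 5, 16),   # lane 16 is lower ECC
    ("RstLdEnVerniersC1", "second"): (6, 7),
}


def average_verniers(edges, fields=LOWER_FIELDS):
    """Average the measured vernier edges over all ranks and shared lanes.

    edges[rank][lane] is the highest passing vernier value found for
    that lane on that rank. Truncating division is an assumption.
    """
    averaged = {}
    for field, lanes in fields.items():
        samples = [edges[rank][lane] for rank in edges for lane in lanes]
        averaged[field] = sum(samples) // len(samples)
    return averaged


# Invented measurements for two ranks of the lower DIMM.
edges = {
    0: {0: 100, 1: 104, 2: 110, 3: 112, 4: 120, 5: 122, 16: 124, 6: 130, 7: 134},
    1: {0: 102, 1: 106, 2: 108, 3: 114, 4: 118, 5: 124, 16: 126, 6: 132, 7: 136},
}
fields = average_verniers(edges)
print(fields[("RstLdEnVerniersC0", "first")])  # 103
```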
At this point, you have found the edge of the read preamble window. Recall that the window is two clocks long, so to find its center you need to subtract one clock from this edge. When this test started, you set the MemBusConfig[ResMuxDel] field to the higher of the two passing ResMuxDel values. To move back one clock, simply decrement the ResMuxDel value by 1, while leaving all the vernier values that were just programmed untouched.
This addresses the second problem that was identified earlier of ensuring that the controller starts looking for strobes in the center of the read preamble.
Find sample software to perform the fine read delay calibration in Resources.
Write tuning
Tuning the controller so that data is written out to the DIMMs correctly is very similar to the method used for the CPC925, which was described in a previous article. Tuning requires the use of an instrument such as a logic analyzer that can display the relationships of data, strobes, and clock as they arrive at the DIMM. Adjustments can then be made on a byte lane basis to the 18 ByteWrClkDelayCxBxx registers to align the data with the clock, and the strobe to the center of the data.
One difference from the CPC925 is support of x4 ("by four") DIMMs. With these DIMMs, there is one strobe for every four data bits, instead of one per byte as is used with x8 devices. The ByteWrClkDelayCxBxx registers each contain two fields for adjusting the strobes in a byte lane, WrClkOffsetDeltaL (lower) and WrClkOffsetDeltaU (upper). For x4 memory devices, the lower value controls the strobe for the lower nibble while the upper value controls the other strobe used for the upper nibble. In x8 mode, with only one strobe for the whole byte, the lower field is used to control the strobe, with the upper field unused.
Read data strobe delay tuning
As explained earlier, when the read data is sent from the DIMM, it is accompanied by a strobe. Although the strobe eventually needs to be centered in the data, the DIMM sends it aligned with the start of the data. It is up to the controller to delay the strobe by one half clock cycle so that it is in the center of the data. Due to board wiring layout distances, capacitive effects, and so on, the optimal amount of the delay may be more or less than exactly one half clock, and may vary across byte lanes. The 18 ReadStrobeDelayCxBxx registers are used to add or subtract delays from the strobes. As with the delays for writes, there are separate offsets for the lower and upper strobes used in x4 mode. In addition, there are separate delays for the rising and falling edges of the strobes, allowing them to be adjusted independently. In x8 mode, with only one strobe per byte lane, the configuration registers for the lower and upper strobes should be set to identical values.
Software is available to try every combination of rising- and falling-edge delays to determine the valid ranges. This relies on having data being reliably written to memory so that it can be read back and tested. Fortunately, the window is large enough that the default value of 0 in all of the ReadStrobeDelayCxBxx registers will allow at least small amounts of data to be read back to confirm that the write tuning is correct, even if the strobe is not precisely centered in the data. The tuning process is needed so that centering can be determined for reliable use across a range of environmental conditions, such as temperature and variations in voltage and chip speeds.
An implementation of this software can be found in the PowerPC 970MP Evaluation Kit, in the 405 (service processor) PIBS mem_tune command. The mem_tune command also provides implementations of both the coarse and fine read delay tuning methods described above.
The algorithm simply selects all values of the rising edge delay and combines each of them with all possible values of the falling edge delay. It then reads a known block of data. If the data is validated, then the particular combination of edge delays is considered a success. By noting where the edges of the successful values lie, the center of the range can be determined and programmed into the ReadStrobeDelayCxBxx registers.
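The exhaustive sweep and the centering step could be sketched as below. This is not the mem_tune implementation: `read_ok` is a hypothetical callback that programs a pair of delays and validates a read, the signed delay range of -8 to 7 is invented, and taking the midpoint of the passing extremes assumes that the passing region is one contiguous block.

```python
def sweep_edge_delays(values, read_ok):
    """Try every rising/falling delay pair and return those that pass.

    read_ok(rising, falling) is a hypothetical callback that programs
    both delays, reads a known block, and returns True if it validates.
    """
    return [(r, f) for r in values for f in values if read_ok(r, f)]


def center_of_passing(passing):
    """Return the midpoint of the passing range for each edge delay."""
    if not passing:
        raise ValueError("no delay combination passed")
    rising = [r for r, _ in passing]
    falling = [f for _, f in passing]
    return ((min(rising) + max(rising)) // 2,
            (min(falling) + max(falling)) // 2)


# Simulated byte lane (assumption): reads succeed inside this window.
def fake_read_ok(rising, falling):
    return -3 <= rising <= 5 and -4 <= falling <= 2


passing = sweep_edge_delays(range(-8, 8), fake_read_ok)
center = center_of_passing(passing)
print(center)  # (1, -1)
```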
When starting the test at the edges of the ranges, it's only necessary to test a small amount of data, such as a single word, to see whether it will fail. This is a fast test and will rapidly move through a series of failing values. When a single word can be read successfully, a larger block of around 1KB should then be read as a stronger test that a significant amount of data can be read reliably. To do this efficiently, it is useful to have a fast data path between memory and the service processor that is performing the tests. Accesses to the CPC945 bridge in the previous tests have been made through the programmer's interface, which is accessed over a slow I2C link. If a higher speed link to memory is available, it should be used for this test. For example, the PowerPC 405 service processor has a PCI bus available, which may be connected through a bridge to the HyperTransport bridge on the CPC945 to provide a high speed data path.
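The two-stage check can be sketched as a quick single-word test that gates the slower block test. The function and callback names are hypothetical, and the 1KB pattern is invented; on real hardware the callbacks would read from memory over whichever data path is available.

```python
def two_stage_read_test(read_word, read_block, expected_word, expected_block):
    """Run a quick single-word check, then a larger block check.

    read_word and read_block are hypothetical callbacks that return the
    bytes read from memory. The block read is skipped entirely when the
    single-word check fails, so failing settings are rejected quickly.
    """
    if read_word() != expected_word:
        return False
    return read_block() == expected_block


EXPECTED_BLOCK = bytes(range(256)) * 4   # 1KB test pattern (invented)
EXPECTED_WORD = EXPECTED_BLOCK[:4]
block_reads = []                         # records each block read


def good_word():
    return EXPECTED_WORD


def bad_word():
    return b"\x00\x00\x00\x00"


def good_block():
    block_reads.append(1)
    return EXPECTED_BLOCK


# A failing word check never triggers the slower block read.
failed_fast = two_stage_read_test(bad_word, good_block,
                                  EXPECTED_WORD, EXPECTED_BLOCK)
reads_after_failure = len(block_reads)

# A passing word check goes on to validate the full block.
passed = two_stage_read_test(good_word, good_block,
                             EXPECTED_WORD, EXPECTED_BLOCK)
```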
After this tuning has been completed, the memory controller is able to successfully read and write data to memory. As always, it is wise to execute a memory test to all memory locations to confirm that all memory can be accessed reliably.
Summary
Memory controllers are sensitive to the physical layout of the designs in which they are used. The auto-calibration hardware included in the IBM CPC945 memory controller reduces the time needed to determine optimal tuning values on a new board design. The use of fine-tuning adjustments increases the accuracy of timing signals and ensures more reliable operation across changing environmental conditions.
Downloads
| Name | Size | Download method |
|---|---|---|
| pa-cpc1code-coarse.txt | 3KB | HTTP |
| pa-cpc1code-fine.txt | 3KB | HTTP |
Resources
About the author
Neil Leeder works in the PowerPC enablement group within IBM Microelectronics. He has worked in PowerPC embedded software development since 1993, and was part of the team which created OS Open, the first real-time operating system for embedded PowerPC processors. He played a key role in the development of IBM's software evaluation kits for the 405GP and 440GP processors. Neil has a degree in Computing and Information Systems from the University of Manchester in England. You can contact Neil at neileedr@us.ibm.com.