SoC drawer: Detecting and correcting I/O and memory errors

Basic theory and practice

SoCs (systems-on-chips) are often deployed in communications, storage, network processing, and mission-critical embedded data processing systems. A reliable SoC-based system must mitigate and control environmentally induced errors in stored or transported data. It is impossible to fully prevent data loss, but engineering due diligence is required to ensure that systems are as safe as practically possible given current data coding methods for error detection and correction. This article examines methods to minimize potential data corruption and to maximize system safety when uncorrectable errors do occur.

Sam Siewert (Sam.Siewert@Colorado.edu), Adjunct Professor, University of Colorado

Dr. Sam Siewert is an embedded system design and firmware engineer who has worked in the aerospace, telecommunications, and storage industries. He also teaches at the University of Colorado at Boulder part-time in the Embedded Systems Certification Program, which he co-founded. His research interests include autonomic computing, firmware/hardware co-design, microprocessor/SoC architecture, and embedded real-time systems.



21 March 2006

This article is the sixth in the SoC drawer series. The series aims to provide the system architect with a starting point and some tips to make system-on-a-chip (SoC) design easier. This article focuses on I/O and memory reliability and methods for detecting and correcting data errors. Data corruption can lead to customer loss, damages, litigation, and catastrophic consequences. To ensure SoC performance, the methods you use to detect and correct errors should have minimal resource requirements and impact on software.

To efficiently and reliably implement safe hardware and software systems, hardware and firmware engineers need to coordinate their design work on error detection and correction functionality. Most often, hardware provides error encoding checks and detection, with software coordination required to initialize ECC (error correction circuitry) features and to recover from uncorrectable errors or to provide system safing.

This article reviews fundamental theory on the upper bound for data transport in an environment with noise. Emergent methods such as turbo codes have allowed this theoretical limit, known as the Shannon limit, to be approached in real-world settings. Likewise, for data at rest, subject to corruption from EMI (electromagnetic interference) or particle radiation (solar wind and cosmic background), we will examine encoding schemes that provide single-bit error detection and correction with multi-bit error detection and safing. This article only offers an introduction to these schemes and technologies; for more in-depth information, see the links in Resources below.

A quick review of Shannon's channel coding theorem

The data-carrying capacity of a channel increases with bandwidth. This fact has pushed network links and interconnection networks to today's high signaling rates, which run to many gigahertz. However, capacity is limited by the signal-to-noise ratio on a given link: capacity is proportional to the binary log of the sum of signal power and noise divided by the noise power (Equation 1). A signal with very low power relative to noise approaches zero capacity because of its high bit error rate. A signal equal in power to noise has capacity equal to bandwidth.

Emergence of turbo codes

Turbo codes come close to performance at the Shannon limit, the theoretical maximum information transfer rate over a noisy channel. First introduced in a paper by Berrou, Glavieux, and Thitimajshima (see Resources), turbo codes have continued to evolve. Conceptually, turbo codes use maximum-likelihood decoding with two decoders that each form a hypothesis for the transmitted data and parity bits, compare candidate decodings, and iterate toward a common decoded bit stream with the greatest confidence.

Equation 1. Shannon channel capacity limit
C = BW × log2(1 + S/N)

In Equation 1:

  • C is the channel capacity in bits/second after error correction is applied.
  • BW is the channel bandwidth in cycles/second (hertz).
  • S is the total signal power in watts, and N is the noise power in watts.
  • S/N is the signal-to-noise ratio comparing the communication data signal to white noise. Note that Equation 1 uses the linear power ratio; an SNR quoted in decibels must be converted to a linear ratio first, as in the sketch that follows this list.
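
As a quick numeric check of Equation 1, the following minimal C sketch (an illustration added here, not part of the original article) converts an SNR given in decibels to a linear power ratio and computes the capacity bound for a hypothetical 1-GHz link at 20 dB SNR.

Listing 1. Computing the Shannon capacity bound (illustrative sketch)

#include <math.h>
#include <stdio.h>

/* Shannon capacity bound: C = BW * log2(1 + S/N).
 * snr_db is converted from decibels to a linear power ratio first.
 */
static double shannon_capacity(double bw_hz, double snr_db)
{
    double snr_linear = pow(10.0, snr_db / 10.0);
    return bw_hz * log2(1.0 + snr_linear);
}

int main(void)
{
    /* Hypothetical link: 1 GHz of bandwidth at 20 dB SNR. */
    double c = shannon_capacity(1.0e9, 20.0);
    printf("Capacity bound: %.2f Gbits/s\n", c / 1.0e9);   /* about 6.66 Gbits/s */
    return 0;
}

At 0 dB, where signal power equals noise power, the bound reduces to C = BW, matching the statement above.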

The noisy channel coding theorem does not provide techniques to realize capacity at or near the Shannon limit. Until recently, FEC (forward error correction) methods for transmission channels fell short of the theoretical upper bound defined by the Shannon limit. Turbo codes (see the sidebar for more information) have been shown to provide a channel capacity very close to the limit.

FEC methods provide redundancy of data in a transmitted message, allowing signal data to be recovered in spite of noise. Likewise, many FEC methods, such as Reed-Solomon encoding, use the mathematical properties of an encoding to recover lost data (erasures) and to correct corrupted symbols. Human readers provide the same type of intelligence by filling in message gaps and errors, estimating the likelihood of missing symbols and symbols in error, aided by some symbol redundancy. For example, with a little concentration most people can read the following sentence:

TXehh  qqicukk  XborwXn  ffoXx  jjpmuedd  oXevr  XtteXh  llzayX  ddoXg.

This sentence contains numerous erasures and symbol corruptions. Each word's first and last letters are doubled, two spaces separate the words, and the intermediate symbols have been reordered; in all, the sentence includes nine erasures (indicated by the special symbol "X" in the example) and fifteen symbol corruptions. Still, most people would not find the sentence that difficult to read. In principle, FEC uses methods similar to the ones you use to interpret that sentence: it builds redundancy into a transmitted message and uses the expected properties of the encoded message to recover its meaning in spite of erasures and symbol corruptions. The major methods of error correction (listed below, following the sidebar) include:

ECC requires software cooperation

Block ECC methods for memory protection require special handling. First, software should always fully initialize ECC memory before use to ensure that the check bits are properly encoded prior to memory reads; failing to do so would most likely lead to a single-bit error (SBE) or multi-bit error (MBE) exception on access. Second, when the ECC corrects an SBE as data is read from memory into a working register, the corrected data must be written back to memory to minimize the possibility of SBEs evolving into MBEs over time. Software can do this in response to an SBE interrupt, where the handler writes the corrected data back; or hardware read-modify-write logic can handle it so that software does not need to be involved.

Finally, software background tasks that traverse memory, reading words in the process, can be used to scrub memory to correct SBEs before they become MBEs. When an MBE occurs, systems should not continue code execution, because data or code executed could be faulty. If the MBE is in a benign location, the MBE handler might decide to execute recovery code from an uncorrupted memory location; but, in some extreme cases where an MBE has affected operating code, the system must simply halt and reboot to recover.
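
The following C sketch illustrates such a background scrubbing pass. It is a minimal sketch only: the ECC_MEM_START and ECC_MEM_END bounds are hypothetical placeholders for a board-specific ECC-protected region, and real hardware reports SBEs and MBEs through device-specific status registers and exceptions rather than through this loop.

Listing 2. Background memory-scrubbing pass (illustrative sketch)

#include <stdint.h>

/* Hypothetical bounds of an ECC-protected region; on real hardware these
 * would come from the board support package or the linker script.
 */
#define ECC_MEM_START ((volatile uint32_t *)0x80000000u)
#define ECC_MEM_END   ((volatile uint32_t *)0x80100000u)

/* One scrubbing pass: read every word so that the ECC hardware checks (and,
 * for SBEs, corrects) it on the way into the register, then write the
 * corrected value back so the stored copy is repaired before a second upset
 * can turn the SBE into an uncorrectable MBE.
 */
void scrub_ecc_memory(void)
{
    for (volatile uint32_t *addr = ECC_MEM_START; addr < ECC_MEM_END; addr++) {
        uint32_t corrected = *addr;   /* read: ECC checks and corrects SBEs */
        *addr = corrected;            /* write back the corrected value     */
    }
}

In a real system, this pass would run as a low-priority background task, would avoid regions being written by DMA, and would pace itself to bound the memory bandwidth it consumes.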

  • Block coding methods evolved from simple duplication of transmitted data into Hamming codes, which use parity bits over blocks of data to detect single-bit errors and encode their locations, and to detect multi-bit errors. Reed-Solomon block codes extend the performance of Hamming codes by constructing a polynomial from the transmit data symbols and sending an oversampled set of points on that polynomial instead of the original symbols. A receiver can then reconstruct the original polynomial, and with it the data symbols, despite transmission errors within a given block of data, as long as the number of errors stays below a threshold. Block encoding methods also include BCH (Bose, Ray-Chaudhuri, Hocquenghem) and Golay codes.

  • Convolutional codes, typically decoded with the Viterbi algorithm, outperform block codes (including Reed-Solomon) on many channels; as a result, much transmission error detection and correction coding has become convolutional. The Viterbi algorithm, introduced in 1967, has allowed modems to optimize the use of analog phone lines for the transmission of digital information; it constructs the most likely sequence of hidden states for the state machine used to encode the data. This basic concept has been extended and applied to CDMA (Code Division Multiple Access) and forms the basis for more advanced hybrid methods such as turbo codes. Likewise, block codes like Reed-Solomon have been extended to combine two or more simple block codes, with data layouts interleaved in space and time to increase detection and correction performance.

  • High-speed networks, such as gigabit Ethernet, Fibre Channel, and Serial Attached SCSI, benefit from DC-balancing codes, which control the running bit disparity on differential transmit/receive pairs. By using a symbol-encoding look-up table to encode 8-bit symbols in 10 bits, the number of 1s and 0s on a transmission line can be balanced so that, at any given time, the number of 1s outnumbering 0s (or vice versa) is no more than two (the sketch following this list illustrates the idea). Not only does this improve DC balance and the signal-to-noise ratio electrically, it also allows receivers to detect errors when the 10-bit symbols do not decode correctly. DC-balancing codes include 8B/10B (used for gigabit Ethernet) and 64B/66B (used for 10-gigabit Ethernet).

  • Other methods use combinations of convolutional codes interleaved (for example, turbo codes) or interleaved block codes (for example, cross interleaved Reed-Solomon coding, or CIRC).
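
To make the DC-balance property concrete, the following small C sketch tracks the running disparity (1s minus 0s) of a stream of 10-bit symbols and flags violations of the bound described above. It is only an illustration of that property, not a real 8B/10B decoder, which would use the standard 5B/6B and 3B/4B lookup tables and running-disparity selection rules.

Listing 3. Checking running disparity on a 10-bit symbol stream (illustrative sketch)

#include <stdint.h>
#include <stdio.h>

/* Count the 1 bits in the low 10 bits of a symbol. */
static int ones10(uint16_t sym)
{
    int count = 0;
    for (int i = 0; i < 10; i++)
        count += (sym >> i) & 1;
    return count;
}

/* In a DC-balanced code such as 8B/10B, each 10-bit symbol contains either
 * five 1s (disparity 0) or six/four 1s (disparity +2/-2), chosen so that the
 * running imbalance of 1s over 0s never exceeds two. A violation of either
 * rule is reported as a likely transmission error.
 */
int check_stream(const uint16_t *symbols, int n)
{
    int running = 0;                            /* cumulative 1s minus 0s    */
    for (int i = 0; i < n; i++) {
        int d = 2 * ones10(symbols[i]) - 10;    /* this symbol's disparity   */
        if (d != 0 && d != 2 && d != -2) {
            printf("symbol %d: invalid disparity %d\n", i, d);
            return -1;
        }
        running += d;
        if (running > 2 || running < -2) {
            printf("symbol %d: running disparity %d out of bounds\n", i, running);
            return -1;
        }
    }
    return 0;
}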

Choice of encodings or combinations of encodings should take into consideration the following:

  • Bit error characteristics. These can be measured with a BERT (bit error rate tester), a simple device that injects a known bit stream at a high rate into a network and verifies the bit stream received in return. Testing is one of the best ways to characterize a system's bit error rates and to determine whether bit errors are bursty or tend to be random, single errors. The physical transport and the write/read methods also suggest whether bursty or single-bit errors are more likely: for radio transmission, EMI makes bursty errors likely; in SRAM (static random access memory), bit errors are more likely to be isolated flips of individual memory cells caused by single event upsets. Even so, measurement remains one of the best ways to characterize bit error rates.

  • Overhead. For error detection and correction, overhead is simply the portion of the bits in an encoded data block that are part of the error detection and correction scheme, rather than raw data. The Hamming encoding examples provided in the next section of this article require 13 bits for an 8-bit symbol: that makes for a 38.5% overhead. An 8B/10B encoder has 20% overhead. The performance, or the ability to detect and correct bit errors, relative to the overhead for the encoding is the basic trade-off that must be made for a given SoC error-correcting device.

  • Cost and complexity. Most often, encoding and decoding are supported by hardware ECC (error correction circuitry), but hardware and software must be developed to support the scheme chosen. Cost and complexity trade-offs might depend heavily on time-to-market, the availability of ECC or FEC hardware cells, the data rates that must be supported, and the division between hardware and software handling for detected errors. Nearly all high-data-rate SoC I/O or storage devices require hardware-based ECC or FEC; software is involved only for exceptional error-handling scenarios, not for in-band error detection and correction.


Error detection and correction: An example

Stored data can become corrupted by physical phenomena that cause SEUs (single event upsets), which flip bits and can cause SBEs (single-bit errors) or MBEs (multi-bit errors). One of the oldest and simplest methods used to detect random single bit errors in data streams or data at rest is the addition of a check bit, or parity bit.

The parity bit follows a convention of adding one bit to each data word so that the overall word has an even (or odd) number of 1 bits. This allows the decoder to detect a single-bit flip in the parity bit or the data word; however, multiple-bit flips can go undetected. Furthermore, a parity bit alone provides no information to allow the decoder to correct a single-bit flip: incorrect parity indicates that a bit has flipped, but not the position of the flipped bit. Richard Hamming, working at Bell Labs, devised a block encoding scheme that detects and corrects single-bit errors, and detects multi-bit errors, by encoding the position of an SBE in multiple check bits.

Hamming encoding with parity

The Hamming encoding is best described with an example. The Hamming code with one parity bit added provides for correction of SBEs that occur anywhere in the data bits or check bits. The check bits indicate the existence of SBEs and MBEs as well as the location of the bit in error for SBEs.

Figure 1 shows pW, the overall encoded-word parity bit, which encodes parity so that it is even. Four more parity bits are interleaved with the data bits to indicate bits in error and, for a single-bit error, the position of the errant bit in the overall word. Figure 1 shows the original data word in orange cells; the bit positions that p01 covers to encode even parity are highlighted as grey cells in the p01 row. Likewise, for the p02, p03, and p04 rows (all of which encode even parity), the data bit positions covered by each parity bit are indicated with grey highlighted cells.

The parity bits and data bits are interleaved to form ED, the encoded data. The SYN (syndrome) is a word computed from check bits, which are 0 for expected even parity and 1 for odd parity, using the same bit positions shown with the same grey highlighted cells. If all of the syndrome check bits and pW are 0, then the originally encoded data has no errors and can simply be pulled from columns d01 through d08.

Figure 1. Hamming code with no errors
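
The layout in Figure 1 can be expressed in a few lines of code. The following C sketch is an illustration consistent with the scheme described here, not code from the article: it places the eight data bits in positions 3, 5, 6, 7, 9, 10, 11, and 12, computes the even-parity check bits p01 through p04 at positions 1, 2, 4, and 8, and computes the overall parity bit pW.

Listing 4. Encoding a byte with Hamming check bits plus overall parity (illustrative sketch)

#include <stdint.h>

/* Data bits d01..d08 occupy the non-power-of-two positions 3,5,6,7,9,10,11,12
 * of the encoded word; check bits p01..p04 occupy positions 1, 2, 4, and 8.
 * Bit position 0 is unused so that position numbers match the syndrome value.
 */
static const int DATA_POS[8] = { 3, 5, 6, 7, 9, 10, 11, 12 };

static int parity(uint32_t v)        /* 1 if v contains an odd number of 1s */
{
    int p = 0;
    while (v) { p ^= (int)(v & 1u); v >>= 1; }
    return p;
}

/* Encode one byte; returns the encoded word (bits 1..12) and sets *pW. */
uint16_t hamming_encode(uint8_t data, int *pW)
{
    uint16_t word = 0;
    for (int i = 0; i < 8; i++)                 /* place the data bits       */
        if (data & (1u << i))
            word |= (uint16_t)(1u << DATA_POS[i]);

    for (int k = 0; k < 4; k++) {               /* check bits p01..p04       */
        int pbit = 1 << k;                      /* positions 1, 2, 4, 8      */
        int group = 0;
        for (int pos = 1; pos <= 12; pos++)     /* XOR the covered positions */
            if ((pos & pbit) && (word & (1u << pos)))
                group ^= 1;
        if (group)                              /* force even group parity   */
            word |= (uint16_t)(1u << pbit);
    }
    *pW = parity(word);                         /* even overall parity       */
    return word;
}

Together, the eight data bits, the four check bits, and pW form the 13-bit encoded word referred to in the overhead discussion above.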

A scenario with data errors makes the Hamming code easier to understand. Figure 2 shows such a scenario. We encode the data byte d01-d08 as before in Figure 1. However, in this case, d02 has flipped while being transmitted or read, as indicated by the red highlighted cell. Note now that when you compute the syndrome check bits and pW as before, you have odd parity in check bits c01 and c03 as well as pW. The original parity bits have been interleaved with the data so that the check bits locate the SBE. The value of 5 for c01:c04 indicates that bit 5 of the original encoded word is in error. Furthermore, if only one bit flipped in bit positions 1 through 12, then pW would be expected to indicate overall odd parity. Since it does in Figure 2, you can simply flip bit 5 back to correct this SBE.

Figure 2. Hamming code with a single-bit error

An MBE scenario will indicate a bit in error with non-zero check bits; however, because more than one bit has flipped in Figure 3, the pW is still even. An even parity with non-zero check bits indicates an uncorrectable multi-bit error. Furthermore, the check bits c01:c04 indicate that bit 14, which is out of range, is in error. The combination of pW even along with non-zero check bits or pW odd with check bits out of range indicates an MBE that can't be corrected, but which has at least been detected.

Figure 3. Hamming code with a multi-bit error

Finally, Figure 4 illustrates a scenario in which one of the check bits itself is flipped. pW detects this as an SBE: pW indicates odd overall parity, and the check bits c01:c04 are in range, encoding bit position 8. Because position 8 holds a check bit rather than a data bit, the data is intact and the flipped check bit can simply be corrected.

Figure 4. Hamming code with check bit single-bit error
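
The scenarios in Figures 1 through 4 can be summarized in a single decode routine. The following C sketch extends Listing 4 (it belongs in the same file and reuses its parity() helper and DATA_POS table) and is, again, an illustration of the scheme described here rather than code from the article. It recomputes the check bits to form the syndrome, compares the recomputed overall parity with the stored pW, corrects an SBE when the overall parity is odd and the syndrome is in range, and reports an uncorrectable MBE otherwise.

Listing 5. Decoding: SBE correction and MBE detection (illustrative sketch)

/* Decode an encoded word plus its stored pW bit (extends Listing 4).
 * Returns 0 for no error, 1 for a corrected SBE, or -1 for a detected MBE.
 */
int hamming_decode(uint16_t word, int pW_stored, uint8_t *data_out)
{
    int syndrome = 0;
    for (int k = 0; k < 4; k++) {               /* recompute check bits       */
        int pbit = 1 << k;
        int group = 0;
        for (int pos = 1; pos <= 12; pos++)
            if ((pos & pbit) && (word & (1u << pos)))
                group ^= 1;
        if (group)                              /* odd parity in this group   */
            syndrome |= pbit;
    }
    int odd_overall = parity(word) ^ pW_stored; /* 1 if overall parity is odd */

    int status;
    if (syndrome == 0 && !odd_overall) {
        status = 0;                             /* Figure 1: no errors        */
    } else if (odd_overall && syndrome >= 1 && syndrome <= 12) {
        word ^= (uint16_t)(1u << syndrome);     /* Figures 2 and 4: flip the  */
        status = 1;                             /* errant bit back            */
    } else if (odd_overall && syndrome == 0) {
        status = 1;                             /* pW itself flipped; data ok */
    } else {
        return -1;                              /* Figure 3: even overall     */
    }                                           /* parity with a non-zero or  */
                                                /* out-of-range syndrome      */
    uint8_t d = 0;                              /* pull d01..d08 back out     */
    for (int i = 0; i < 8; i++)
        if (word & (1u << DATA_POS[i]))
            d |= (uint8_t)(1u << i);
    *data_out = d;
    return status;
}

For example, encoding a byte, flipping bit 5 of the result (the Figure 2 scenario), and calling hamming_decode() returns 1 and restores the original byte; flipping two bits of the encoded word instead causes the routine to return -1, matching Figure 3.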

Conclusion

Error detection and correction are necessary components of reliable and safe SoCs. For consumer electronics, ECC has greatly improved the quality and reliability of media such as tape, compact discs, and DVDs. Likewise, the bit error rates of protocol link layers have been greatly reduced by encodings such as 8B/10B and 64B/66B. The importance of ECC for financial data systems and critical real-time systems is obvious, but ECC employed in consumer products has also vastly increased the quality and appeal of those products.

Resources
