This article is the sixth in the SoC drawer series. The series aims to provide the system architect with a starting point and some tips to make system-on-a-chip (SoC) design easier. This article focuses on I/O and memory reliability and methods for detecting and correcting data errors. Data corruption can lead to customer loss, damages, litigation, and catastrophic consequences. To ensure SoC performance, the methods you use to detect and correct errors should have minimal resource requirements and impact on software.
To efficiently and reliably implement safe hardware and software systems, hardware and firmware engineers need to coordinate their design work on error detection and correction functionality. Most often, hardware provides error encoding checks and detection, with software coordination required to initialize ECC (error correction circuitry) features and to recover from uncorrectable errors or to provide system safing.
This article reviews fundamental theory on the upper bound for data transport in an environment with noise. Emergent methods such as turbo codes have allowed this theoretical limit, known as the Shannon limit, to be approached in real-world settings. Likewise, for data at rest, subject to corruption from EMI (electromagnetic interference) or particle radiation (solar wind and cosmic background), we will examine encoding schemes that provide single-bit error detection and correction with multi-bit error detection and safing. This article only offers an introduction to these schemes and technologies; for more in-depth information, see the links in Resources below.
A quick review of Shannon's channel coding theorem
The carrying capacity of channel data increases with bandwidth. This fact has pushed network link interconnection networks to today's high frequency rates, which run to many gigahertz. However, capacity is limited by the signal-to-noise ratio on a given link as a function of the binary log of the sum of the signal power plus noise divided by the noise (Equation 1). A signal with very low power relative to noise approaches zero capacity due to high-bit error rate. A signal equal in power to noise has capacity equal to bandwidth.
Equation 1. Shannon channel capacity limit
In Equation 1:
- C is the channel capacity in bits/second after error correction is applied.
- BW is the channel bandwidth in cycles/second (hertz).
- S is the total signal power in watts.
- S/N is the signal to noise ratio comparing the communication data signal to white noise (in decibels).
The noisy channel coding theorem does not provide techniques to realize capacity at or near the Shannon limit. Until recently, FEC (forward error correction) methods for transmission channels fell short of the theoretical upper bound defined by the Shannon limit. Turbo codes (see the sidebar for more information) have been shown to provide a channel capacity very close to the limit.
FEC methods provide redundancy of data in a transmitted message, allowing signal data to be recovered in spite of noise. Likewise, many FEC methods, such as Reed-Solomon encoding, use the mathematical properties of an encoding to recover lost data (erasures) and to correct corrupted symbols. Human message interpreters provide the same type of intelligence by filling in message gaps and errors estimating the likelihood of missing symbols and symbols in error, aided by some symbol redundancy. For example, with a little concentration most people can read the following sentence:
TXehh qqicukk XborwXn ffoXx jjpmuedd oXevr XtteXh llzayX ddoXg.
This sentence contains numerous erasures and symbol corruptions. Each first and last letter is repeated twice, with two spaces between words, and the intermediate symbols have been reordered; in all, the sentence includes nine erasures (indicated by the special symbol "X" in the example) and fifteen symbol corruptions. Still, most people would not find the sentence that difficult to read. In principle, FEC uses methods similar to the ones you use to interpret that sentence to build redundancy into a transmitted message. FEC employs the expected properties of the encoded message to recover its meaning in spite of erasures and symbol corruptions. The major methods of error correction include:
Block coding methods evolved from simple methods of duplicating transmission data and evolved into Hamming codes, which use parity bits over blocks of data to encode single-bit errors and their locations along with multi-bit errors. The Reed-Solomon block codes extend the performance of Hamming codes by constructing a polynomial from transmit data symbols to send an oversampled plot of the polynomial instead of the original symbols. A receiver can then reconstruct the original polynomial and corresponding data symbols despite transmission errors within a given block of data below a threshold number of errors. Block encoding methods also include BCH (Bose, Ray-Chaudhuri, Hocquenghem) and Golay methods.
Convolutional codes, such as Viterbi codes, outperform block codes (including Reed-Solomon); as a result, most transmission error detection and correction coding has become convolutional. Viterbi encoding, introduced in 1967, has allowed modems to optimize the use of analog phone lines for transmission of digital information. The Viterbi algorithm constructs the most likely sequence of hidden states for a given state machine used to encode data. This basic concept has been extended and applied to CDMA (Code Division Multiple Access) and has formed the basis for more advanced hybrid methods like turbo codes. Likewise, block codes like Reed-Solomon have been extended to use two or more simple block code methods with data layouts in space and time that are interleaved to increase detection and correction performance.
High speed networks, such as gigabit Ethernet, fibre channel, and serial-attached SCSI, benefit from DC balancing codes, which control the running bit disparity on differential transmit/receive pairs. By using a symbol encoding look-up table to encode 8-bit symbols in 10 bits, the number of 1s and 0s on a transmission line can be balanced so that, at any given time, the number of 1s out-numbering 0s (or vice versa) is no more than two. Not only does this electrically improve DC balance and signal-to-noise ratio, it allows for receivers to detect errors when the 10-bit symbols do not decode correctly. DC balancing codes include 8B/10B (used for gigabit Ethernet) and 64B/66B (used for 10G Ethernet).
Other methods use combinations of convolutional codes interleaved (for example, turbo codes) or interleaved block codes (for example, cross interleaved Reed-Solomon coding, or CIRC).
Choice of encodings or combinations of encodings should take into consideration the following:
Bit error characteristics. These can be measured with a BERT (bit error rate tester). A BERT is a simple device that injects a known bit stream at a high rate into a network and verifies the bit stream received in return to characterize bit errors. Testing is one of the best ways to characterize a system in terms of rates, and to determine if bit errors are bursty or tend to be random, single errors. The physical transport and write/read methods will also suggest whether bursty or single-bit errors are more likely. For example, for radio transmission, due to EMI, one would expect bursty errors; in SRAM (static random access memory), bit errors would be more likely due to single magnetic upset events to a memory cell. However, measurement is always one of the best ways to characterize bit error rates.
Overhead. For error detection and correction, overhead is simply the portion of the bits in an encoded data block that are part of the error detection and correction scheme, rather than raw data. The Hamming encoding examples provided in the next section of this article require 13 bits for an 8-bit symbol: that makes for a 38.5% overhead. An 8B/10B encoder has 20% overhead. The performance, or the ability to detect and correct bit errors, relative to the overhead for the encoding is the basic trade-off that must be made for a given SoC error-correcting device.
Cost and complexity: Most often encoding and decoding are supported by hardware ECC (error correction circuitry), but hardware and software must be developed to support the scheme chosen. Cost and complexity trade-offs might depend heavily on time-to-market, availability of ECC or FEC hardware cells, data rates that must be supported, and the division between hardware and software handling for errors detected. Nearly all high data rate SoC I/O or storage devices will require hardware-based ECC or FEC; software is involved only for exceptional error handling scenarios, not for in-band error detection and correction.
Error detection and correction: An example
Stored data can become corrupted by physical phenomena that cause SEUs (single event upsets), which flip bits and can cause SBEs (single-bit errors) or MBEs (multi-bit errors). One of the oldest and simplest methods used to detect random single bit errors in data streams or data at rest is the addition of a check bit, or parity bit.
The parity bit follows a convention of adding one bit to each data word to encode the overall word with an even or odd number of bits. This allows the decoder to detect a single-bit flip in parity or the data word; however, multiple-bit flips can go undetected. Furthermore, the use of a parity bit provides no information to allow the decoder to correct a single-bit flip. An incorrect parity indicates that a bit has flipped, but does not indicate the position of the flipped bit. Richard Hamming, working at Bell Labs, devised a block encoding scheme to detect and correct single-bit errors as well as detect multi-bit errors by encoding the position of an SBE in multiple-check bits.
Hamming encoding with parity
The Hamming encoding is best described with an example. The Hamming code with one parity bit added provides for correction of SBEs that occur anywhere in the data bits or check bits. The check bits indicate the existence of SBEs and MBEs as well as the location of the bit in error for SBEs.
Figure 1 shows pW, the overall encoded word parity bit, which encodes parity so that it is even. Four more parity bits are interleaved with data bits to provide indication of bits in error and, for a single-bit error, the position of the errant bit in the overall word. Figure 1 shows the original data word in orange cells, and the bit positions for p01 used to encode it for even parity are highlighted as grey cells in the p01 row. Likewise for the p02, p03, and p04 rows (all of which encode even), parity for the data bits are indicated with the grey highlighted cells.
The parity bits and data bits are interleaved to form ED, the encoded data. The SYN (syndrome) is a word computed from check bits, which are 0 for expected even parity and 1 for odd parity, using the same bit positions shown with the same grey highlighted cells. If all of the syndrome check bits and pW are 0, then the originally encoded data has no errors and can simply be pulled from columns d01 through d08.
Figure 1. Hamming code with no errors
A scenario with data errors makes the Hamming code easier to understand. Figure 2 shows such a scenario. We encode the data byte d01-d08 as before in Figure 1. However, in this case, d02 has flipped while being transmitted or read, as indicated by the red highlighted cell. Note now that when you compute the syndrome check bits and pW as before, you have odd parity in check bits c01 and c03 as well as pW. The original parity bits have been interleaved with the data so that the check bits locate the SBE. The value of 5 for c01:c04 indicates that bit 5 of the original encoded word is in error. Furthermore, if only one bit flipped in bit positions 1 through 12, then pW would be expected to indicate overall odd parity. Since it does in Figure 2, you can simply flip bit 5 back to correct this SBE.
Figure 2. Hamming code with a single-bit error
An MBE scenario will indicate a bit in error with non-zero check bits; however, because more than one bit has flipped in Figure 3, the pW is still even. An even parity with non-zero check bits indicates an uncorrectable multi-bit error. Furthermore, the check bits c01:c04 indicate that bit 14, which is out of range, is in error. The combination of pW even along with non-zero check bits or pW odd with check bits out of range indicates an MBE that can't be corrected, but which has at least been detected.
Figure 3. Hamming code with a multi-bit error
Finally, Figure 4 illustrates a scenario where one of the check bits is flipped for a check bit SBE. pW will detect this: when pW is one and the check bits are in range, then this indicates that a parity bit has flipped and the check bits c01:c04 encode the bit position 8, which had an SBE.
Figure 4. Hamming code with check bit single-bit error
Error detection and correction are necessary components of reliable and safe SoCs. For consumer electronics, ECC has greatly improved the quality and reliability of media such as tape, compact disks, and DVDs. Likewise, the bit error rates for protocol link layers have been greatly reduced using encodings such as 8B/10B and 64B/66B. The importance of ECC for financial data systems or critical real-time systems is obvious, but ECC employed in consumer products has also vastly increased the quality and appeal of these items as well.
- Although Wikipedia is often criticized for containing much that is apocryphal and/or wildly inaccurate, it is even more often "a good starting place" for learning more on a given subject, particularly for technical material. You can find overviews of most of the ECC and FEC codes discussed in this article on Wikipedia, including Hamming code, 8B/10B encoding, Reed-Solomon error correction, FEC, ECC, the Shannon-Hartley theorem, turbo codes, Cross Interleaved Reed-Solomon coding, and Galois theory, the mathematical basis for Reed-Solomon code.
- The book Real-Time Embedded Components and Systems, Sam Siewert (Thomson-Delmar Learning, June 2006), discusses basic concepts of ECC, error detection, handling, and recovery, a key feature of safety critical real-time embedded systems and SoC design.
- The University of Notre Dame Coding Research Group includes numerous links, papers, and resources for turbo codes.
- Historically, Reed-Solomon and FEC codes based upon it have been used for NASA deep space projects, but turbo codes are catching on for deep space communications, and you can find some great resources at the JPL turbo code Web page.
- "A DC-balanced, partitioned-block, 8B/10B transmission code," A. X. Widmer and P. A. Franaszek (IBM Research Journal, 1983) describes the hardware design and properties of this encoding scheme which controls the running bit disparity for many gigabit transports, including gigabit Ethernet, Serial Attached SCSI and ATA, Fibre Channel, Infiniband, Firewire, and PCI-Express.
- "Near Shannon limit error-correcting coding and decoding: Turbo-codes," (Proceedings of the IEEE International Conference on Communications, 1993) introduced the concept and detailed design for turbo codes. (Link paper is a PDF.)
Get products and technologies
- Explore Galois theory with a C++ library that includes a Reed-Solomon encoder/decoder example at Galois Field Arithmetic Library.
- Companies like Digital Fountain are applying turbo code FEC for streaming and data distribution.
- Download open source Reed-Solomon and Viterbi FEC codes from Phil Karn's Web page; and find an even wider selection from the Error Correcting Codes (ECC) page.
Dig deeper into developerWorks
Get samples, articles, product docs, and community resources to help build, deploy, and manage your cloud apps.
Experiment with new directions in software development.
Software development in the cloud. Register today to create a project.
Evaluate IBM software and solutions, and transform challenges into opportunities.