ECC RAM

 

What is error correction?
In digital electronic systems, information is represented in binary format (1's and 0's). When binary information is passed from one point to another, there is always some chance that a mistake can be made; a 1 interpreted as a 0 or a 0 taken to be a 1. This can be caused by media defects, electronic noise, component failures, poor connections, deterioration due to age, and other factors. When a bit is mistakenly interpreted, a bit error has occurred. Error correction is the process of detecting bit errors and correcting them and can be done in software or hardware. For high data rates, error correction must be done in special-purpose hardware because software is too slow.

 

Explain the basic concept of error correcting systems. A group of bits in a computer has conventionally been referred to as a "word". Each bit can be thought of as one of two "letters", 0 or 1. Error correcting systems add extra or "redundant" letters to computer words. The extra letters (bits) add a certain structure to each word. If that structure is altered by errors, the changes can be detected and corrected.

English words provide an analogy. Not all possible combinations of the English alphabet are legitimate words. The only legitimate words are those listed in a dictionary of the English language. Errors that occur when transmitting or storing English words can be detected by determining if the received word is in the dictionary. If it is not, errors can be corrected by determining which legitimate English word is closest to the received word. Error correcting systems work in a similar fashion.

 

Brief history about error correction.

Around 1947-1948, the subject of information theory was created by Claude Shannon. The main result of Shannon's "Mathematical Theory of Communication" is that the only way to get the most storage capacity in a storage device or the fastest transmission through a communications channel is through the use of very powerful error correcting systems. During the same time period, Richard Hamming discovered and implemented a single-bit error correcting code.

In 1960, researchers, including Irving Reed and Gustave Solomon, discovered how to construct error correcting codes that could correct for an arbitrary number of bits or an arbitrary number of "bytes" where "byte" means a group of "w" bits. Even though the codes were discovered at this time, there still was no way known to decode the codes.

The first textbook on error correcting codes was written in 1961 by W. Wesley Peterson.

In 1968, Elwyn Berlekamp and James Massey discovered algorithms needed to build decoders for multiple error correcting codes. They came to be known as the Berlekamp-Massey algorithm for solving the key decoding equation.

In the last 30 years, researchers have discovered that the Berlekamp-Massey algorithm is a variation of an ancient algorithm discovered in Egypt around 300 BC by Euclid and known as Euclid's extended algorithm for finding the greatest common divisor of two polynomials. Today, numerous variations of the Berlekamp-Massey and Euclid algorithms exist to solve the key decoding equation.

 

What mathematics is involved in the theory of error correction? Error correcting encoders and decoders operate on bits or bytes. From a mathematical point of view, each bit or byte is an element in a finite field, which is an abstract algebraic system with a finite number of elements and two operations and their inverses defined. The operations are usually called multiplication and addition and their inverses called division and subtraction.

Finite fields are also called Galois fields (pronounced Gal Wah) in honor of the French mathematician Evariste Galois (1811-1832) who first discovered them. Finite field theory is a fairly large branch of mathematics.

It is relatively straightforward to implement finite field operations in digital logic circuitry or in computer programs.

 

Why is error correction needed? Error correction is needed to ensure the accuracy and integrity of data and, in some cases, to create fault-tolerance (where components can fail with no loss of performance or data).

No electronic data transmission or storage system is perfect. Each system makes errors at a certain rate. As data transfer rates and storage densities increase, the raw error rate also increases. See also How frequently do bit errors occur?

 

How frequently do bit errors occur? If you look at bit error rate in terms of media, some electronic systems experience more errors than others. For instance, optical disk has a higher error rate than magnetic disk. Magnetic tape has a higher error rate than magnetic disk. Fiber optic communications cable and semiconductor memory have a low error rate.

One way to measure bit error rate is in terms of the number of bit errors divided by the total bits transferred. Using magnetic disk as an example, if you were to look at bits coming right off a magnetic disk you would see, on the average, about 1 bit error for every billion bits transferred. If you were to look at bits coming right off an optical disk, you would see a much higher rate of errors, about 1 bit error in every 100,000 bits transferred. Without some form of error detection and correction, most storage devices would be too unreliable to be useful.

Another way to measure bit error rate is in terms of how many bit errors occur in a unit of time. Again, using magnetic disk as an example, if you transfer a million bits per second, on the average, you'd have a bit error every thousand seconds, or every 16.6 minutes. If you transfer a billion bits per second, on the average, you'd have a bit error every second. Currently, some disk drives transfer at the rate of 40 million bits per second, so that a bit error occurs every 25 seconds.

As a general rule, in all electronic devices, transfer rates are increasing as time goes on. Also, errors will occur more frequently in any kind of storage device because manufacturers are squeezing more bits into smaller and smaller spaces. As speed and density increase, error correction becomes a necessity.

 

Is error correction done in hardware or software? Error correction can either be done in hardware or software depending upon how fast it has to be done. In most cases, such as magnetic disk drives or semiconductor memory, errors must be detected and corrected "on-the-fly", or, at the same rate as data is being read from the disk. The "on-the-fly" performance requirement usually means the error correction implementation must be done in digital logic (hardware).

In some cases, such as some telecommunications systems where the data rates are relatively low, software implementations of error correction may be fast enough.

 

How can error correction provide fault-tolerance? If encoded data is written to a storage device in such a way that each byte of the encoded data is recorded on a separate component, then fault-tolerance has been created. If one or more components fail, they can only corrupt the data bytes that are written on them.

As long as the number of component failures is less than or equal to the correction capability of the error correcting code, the failures will not result in any loss of data or performance.

RAID (Redundant Arrays of Inexpensive Devices) is an example of using an error correcting system to create fault-tolerance, even though most RAID systems use a simple single bit erasure correcting system.

 

In what ways can error correction be used? Error correction is currently being used to make high capacity storage devices such as magnetic and optical disk and tape reliable. See also How frequently do bit errors occur? Error correction enables higher density recordings. In magnetic disk drives, for example, inexpensive and reliable 2 to 20 GigaByte capacity drives would not be possible without error correction.

Error correction can be used to decrease the price and increase the reliability of DRAM memory modules (SIMMS) because lower-grade DRAM components can be used.

Error correction can be used to extend the battery life of portable computers and other battery-powered electronic devices by allowing a significant decrease in the DRAM refresh rate.

In communications networks, such as ATM, error correction can be used to reconstruct lost or missing packets without the need for retransmission and with no loss of performance. For long-haul lines, frequent retransmission causes severe performance degradation.

In RAID (Redundant Arrays of Inexpensive Devices), error correction can be used to create an arbitrary level of fault-tolerance. Multiple devices can fail with no loss of data or performance.

Error correction can also be used to provide fault-tolerant electronic cabling systems, for example, fly-by-wire aircraft.

 

What are the different types of error correcting codes? The two main types of error correcting codes are convolutional (tree) codes and block codes. In each case, strings of bits are divided into blocks of data and each block of data is encoded into a longer block of data.

Convolutional codes tend to operate on smaller blocks of data than block codes and, unlike block codes, the encoding of one block of data depends on the state of the encoder as well as on the data to be encoded. Decoding of convolutional codes is usually done by executing some type of decoding algorithm in a processor.

With block codes, each block of data can be long and is encoded and decoded independently from every other block of data. Block codes can be classified in many different ways. For example, you could put all codes that operate on bits into one category and call them "binary" codes. All other codes would then be "non-binary" codes.

There are other properties of codes that can be used for classification, such as whether a code is cyclic, shortened, algebraic, systematic, perfect, etc.

 

What's special about Reed-Solomon codes? Reed-Solomon codes are special and widely implemented because they are "almost perfect" in the sense that the extra (redundant) letters added on by the encoder is at a minimum for any level of error correction, so that no bits are wasted.

Reed-Solomon codes are easier to decode than most other non-binary codes.

Reed-Solomon codes allow the correction of "erasures". An erasure is like taking a pencil eraser and erasing a letter in a word. The letter that should be in that position is unknown, but the position of the erasure is known. Reed-Solomon codes can correct twice as many erasures as errors (where the decoder has no information about the error). With Reed-Solomon error correction, you get more correction power per dollar by being able to correct multiple randomly positioned bytes in error.