__ECC RAM__

**What is error correction?**

In digital electronic systems, information is represented in binary form (1s
and 0s). Whenever binary information is passed from one point to another, there
is always some chance that a mistake will be made: a 1 interpreted as a 0, or a
0 taken to be a 1. This can be caused by media defects, electronic noise,
component failures, poor connections, deterioration due to age, and other
factors. When a bit is misinterpreted, a bit error has occurred. Error
correction is the process of detecting bit errors and correcting them; it can
be done in software or in hardware. At high data rates, error correction must
be done in special-purpose hardware because software is too slow.

**Explain the basic concept of error correcting systems.** A group of bits in a computer has
conventionally been referred to as a "word". Each bit can be thought
of as one of two "letters", 0 or 1. Error correcting systems add
extra or "redundant" letters to computer words. The extra letters
(bits) add a certain structure to each word. If that structure is altered by
errors, the changes can be detected and corrected.

English words provide an analogy. Not all possible combinations of the English
alphabet are legitimate words. The only legitimate words are those listed in a
dictionary of the English language. Errors that occur when transmitting or
storing English words can be detected by determining if the received word is in
the dictionary. If it is not, errors can be corrected by determining which
legitimate English word is closest to the received word. Error correcting
systems work in a similar fashion.
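The "dictionary" idea can be made concrete with the classic Hamming(7,4) code: four data bits gain three redundant parity bits, and any single flipped bit disturbs the parity structure in a way that pinpoints exactly which bit to flip back. This is a minimal sketch for illustration; the bit layout and function names are choices made here, not part of the original text.

```python
# Minimal Hamming(7,4) sketch: 4 data bits plus 3 redundant parity bits.
# A single bit error changes the parity pattern (the "syndrome"), which
# directly names the position of the flipped bit.

def encode(d):
    """d: list of 4 data bits -> 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # parity over codeword positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over codeword positions 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # parity over codeword positions 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def correct(c):
    """Recompute the parities; the syndrome is the 1-based index of the
    flipped bit, or 0 if the received word is error-free."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3
    if pos:
        c[pos - 1] ^= 1        # flip the offending bit back
    return c

word = encode([1, 0, 1, 1])
received = list(word)
received[4] ^= 1               # simulate a single bit error in transit
assert correct(received) == word
```

Like the dictionary analogy, the decoder finds the "legitimate word" closest to what was received; with only one letter changed, that nearest codeword is unambiguous.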

__A brief history of error correction.__

Around 1947-1948, the field of information theory was created by Claude
Shannon. The main result of Shannon's "A Mathematical Theory of
Communication" is that the only way to approach the maximum capacity of a
storage device, or the fastest reliable transmission through a communications
channel, is through the use of very powerful error correcting systems. During
the same period, Richard Hamming discovered and implemented a single-bit error
correcting code.

In 1960, researchers, including Irving Reed and Gustave Solomon, discovered how
to construct error correcting codes that could correct an arbitrary number of
bits, or an arbitrary number of "bytes", where a "byte" means a group of w
bits. Even though the codes had been discovered, no practical way to decode
them was yet known.

The first textbook on error correcting codes was written in 1961 by W. Wesley
Peterson.

In 1968-69, Elwyn Berlekamp and James Massey discovered the algorithms needed
to build decoders for multiple-error correcting codes. Their method came to be
known as the Berlekamp-Massey algorithm for solving the key decoding equation.

In the last 30 years, researchers have discovered that the Berlekamp-Massey
algorithm is a variation of an ancient algorithm described by Euclid in
Alexandria, Egypt, around 300 BC, known in its modern form as the extended
Euclidean algorithm for finding the greatest common divisor of two
polynomials. Today, numerous variations of the Berlekamp-Massey and Euclid
algorithms exist to solve the key decoding equation.

**What mathematics is involved in the theory of error correction?** Error
correcting encoders and decoders operate on bits or bytes. From a mathematical
point of view, each bit or byte is an element in a finite field, which is an
abstract algebraic system with a finite number of elements and two operations
and their inverses defined. The operations are usually called multiplication
and addition and their inverses called division and subtraction.

Finite fields are also called Galois fields (pronounced "gal-wah") in honor of
the French mathematician Évariste Galois (1811-1832), who first discovered them.
Finite field theory is a fairly large branch of mathematics.

It is relatively straightforward to implement finite field operations in
digital logic circuitry or in computer programs.
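As a hedged sketch of how straightforward these operations are, here is arithmetic in GF(2^8), the finite field of bytes used by many error correcting codes. Addition is plain XOR; multiplication is carry-less polynomial multiplication reduced modulo an irreducible polynomial. The polynomial 0x11d is a common Reed-Solomon choice, assumed here for illustration rather than taken from the original text.

```python
# Arithmetic in the finite field GF(2^8): each byte is a polynomial
# over GF(2) of degree < 8.

def gf_add(a, b):
    # Addition and subtraction coincide in GF(2^8): bitwise XOR.
    return a ^ b

def gf_mul(a, b, poly=0x11d):
    """Russian-peasant multiplication, reducing modulo poly whenever
    the intermediate degree reaches 8."""
    result = 0
    while b:
        if b & 1:
            result ^= a        # "add" the current shifted copy of a
        b >>= 1
        a <<= 1
        if a & 0x100:          # degree-8 overflow: subtract (XOR) poly
            a ^= poly
    return result & 0xff

# x * x^7 = x^8, which reduces to x^4 + x^3 + x^2 + 1 modulo 0x11d:
assert gf_mul(0x02, 0x80) == 0x1d
```

The shift-and-XOR structure is why these operations map so directly onto digital logic: a hardware multiplier is just a small mesh of XOR gates.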

**Why is error correction needed?** Error correction is needed to ensure the accuracy and
integrity of data and, in some cases, to create fault-tolerance (where
components can fail with no loss of performance or data).

No electronic data transmission or storage system is perfect. Each system makes
errors at a certain rate. As data transfer rates and storage densities
increase, the raw error rate also increases. (See also "How frequently do bit
errors occur?")

**How frequently do bit errors occur?** If you look at bit error rate in terms of media,
some electronic systems experience more errors than others. For instance,
optical disk has a higher error rate than magnetic disk. Magnetic tape has a
higher error rate than magnetic disk. Fiber optic communications cable and
semiconductor memory have a low error rate.

One way to measure bit error rate is in terms of the number of bit errors
divided by the total bits transferred. Using magnetic disk as an example, if
you were to look at bits coming right off a magnetic disk you would see, on the
average, about 1 bit error for every billion bits transferred. If you were to
look at bits coming right off an optical disk, you would see a much higher rate
of errors, about 1 bit error in every 100,000 bits transferred. Without some
form of error detection and correction, most storage devices would be too
unreliable to be useful.

Another way to measure bit error rate is in terms of how many bit errors occur
in a unit of time. Again, using magnetic disk as an example, if you transfer a
million bits per second, on the average, you'd have a bit error every thousand
seconds, or every 16.6 minutes. If you transfer a billion bits per second, on
the average, you'd have a bit error every second. Currently, some disk drives
transfer at the rate of 40 million bits per second, so that a bit error occurs
every 25 seconds.
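The time-based figures above follow from simple division, sketched here assuming the raw magnetic-disk rate of one bit error per billion bits quoted earlier (the function name is an illustrative choice).

```python
# Mean time between raw bit errors: how many bits fit in one error
# interval, divided by how many bits are transferred per second.

def seconds_between_errors(bits_per_second, bits_per_error=1e9):
    """Default rate: ~1 bit error per 10^9 bits (magnetic disk)."""
    return bits_per_error / bits_per_second

assert seconds_between_errors(1e6) == 1000.0    # ~16.6 minutes
assert seconds_between_errors(1e9) == 1.0       # one error per second
assert seconds_between_errors(40e6) == 25.0     # the 40 Mbit/s example
```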

As a general rule, in all electronic devices, transfer rates are increasing as
time goes on. Also, errors will occur more frequently in any kind of storage
device because manufacturers are squeezing more bits into smaller and smaller
spaces. As speed and density increase, error correction becomes a necessity.

**Is error correction done in hardware or software?** Error correction can be
done either in hardware or in software, depending upon how fast it has to be
done. In most cases, such as
magnetic disk drives or semiconductor memory, errors must be detected and
corrected "on-the-fly", or, at the same rate as data is being read
from the disk. The "on-the-fly" performance requirement usually means
the error correction implementation must be done in digital logic (hardware).

In some cases, such as some telecommunications systems where the data rates are
relatively low, software implementations of error correction may be fast
enough.

**How can error correction provide fault-tolerance?** If encoded data is written to a
storage device in such a way that each byte of the encoded data is recorded on
a separate component, then fault-tolerance has been created. If one or more
components fail, they can only corrupt the data bytes that are written on them.

As long as the number of component failures is less than or equal to the
correction capability of the error correcting code, the failures will not
result in any loss of data or performance.

RAID (Redundant Arrays of Inexpensive Disks) is an example of using an error
correcting system to create fault-tolerance, even though most RAID systems use
a simple single-erasure correcting system based on XOR parity.
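That single-erasure scheme can be sketched in a few lines: one extra parity byte per stripe (the XOR of the data bytes) lets any one failed component's byte be rebuilt from the survivors. The stripe layout below is an illustrative assumption, not a description of any particular RAID product.

```python
# RAID-style single-erasure correction: each stripe of data bytes is
# stored with one parity byte, each byte on a separate component.

def parity(stripe):
    """XOR of all bytes in the stripe."""
    p = 0
    for byte in stripe:
        p ^= byte
    return p

def rebuild(stripe_with_parity, lost_index):
    """Recover the byte at a known failed position: since the XOR of a
    full stripe (data + parity) is zero, the missing byte is the XOR of
    everything that survived."""
    p = 0
    for i, byte in enumerate(stripe_with_parity):
        if i != lost_index:
            p ^= byte
    return p

data = [0x41, 0x9c, 0x07]
stored = data + [parity(data)]     # written across four components
assert rebuild(stored, 1) == 0x9c  # component 1 fails; its byte is rebuilt
```

Note that this works only because the failed component identifies itself: the position of the loss is known, making it an erasure rather than an error.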

**In what ways can error correction be used?** Error correction is currently
being used to make high-capacity storage devices, such as magnetic and optical
disk and tape, reliable. (See also "How frequently do bit errors occur?")
Error correction enables higher density recordings. In magnetic disk drives,
for example, inexpensive and reliable 2 to 20 gigabyte capacity drives would
not be possible without error correction.

Error correction can be used to decrease the price and increase the reliability
of DRAM memory modules (SIMMs), because lower-grade DRAM components can be used.

Error correction can be used to extend the battery life of portable computers
and other battery-powered electronic devices by allowing a significant decrease
in the DRAM refresh rate.

In communications networks, such as ATM, error correction can be used to
reconstruct lost or missing packets without the need for retransmission and
with no loss of performance. For long-haul lines, frequent retransmission
causes severe performance degradation.

In RAID (Redundant Arrays of Inexpensive Disks), error correction can be used
to create an arbitrary level of fault-tolerance: multiple devices can fail with
no loss of data or performance.

Error correction can also be used to provide fault-tolerant electronic cabling
systems, for example, in fly-by-wire aircraft.

**What are the different types of error correcting codes?** The two main types of error
correcting codes are convolutional (tree) codes and block codes. In each case,
strings of bits are divided into blocks of data and each block of data is
encoded into a longer block of data.

Convolutional codes tend to operate on smaller blocks of data than block codes
and, unlike block codes, the encoding of one block of data depends on the state
of the encoder as well as on the data to be encoded. Decoding of convolutional
codes is usually done by executing some type of decoding algorithm in a
processor.

With block codes, each block of data can be long and is encoded and decoded
independently from every other block of data. Block codes can be classified in
many different ways. For example, you could put all codes that operate on bits into
one category and call them "binary" codes. All other codes would then
be "non-binary" codes.

There are other properties of codes that can be used for classification, such
as whether a code is cyclic, shortened, algebraic, systematic, perfect, etc.

**What's special about Reed-Solomon codes?** Reed-Solomon codes are special
and widely implemented because they are "almost perfect" in the sense that the
extra (redundant) letters added by the encoder are at a minimum for any level
of error correction, so that no bits are wasted.

Reed-Solomon codes are easier to decode than most other non-binary codes.

Reed-Solomon codes allow the correction of "erasures". An erasure is
like taking a pencil eraser and erasing a letter in a word. The letter that
should be in that position is unknown, but the position of the erasure is
known. Reed-Solomon codes can correct twice as many erasures as errors (where
the decoder has no information about the error). With Reed-Solomon error
correction, you get more correction power per dollar by being able to correct
multiple randomly positioned bytes in error.
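A hedged sketch of the erasure idea: with two redundant bytes computed over GF(2^8), in the spirit of Reed-Solomon P+Q parity (not a full Reed-Solomon codec), two erased bytes at known positions can be rebuilt exactly. The field polynomial 0x11d and generator 2 are common conventional choices, assumed here for illustration.

```python
# Two-erasure correction over GF(2^8): parity P is the plain sum of the
# data bytes, parity Q weights each byte by a power of the generator 2.
# Knowing *where* the two losses are turns decoding into solving two
# linear equations in the field.

def gf_mul(a, b, poly=0x11d):
    """Carry-less multiply in GF(2^8), reduced modulo poly."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return r & 0xff

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    return gf_pow(a, 254)      # a^255 = 1 for nonzero a, so a^254 = 1/a

def encode(data):
    """Append two parity bytes: P = sum d_i, Q = sum 2^i * d_i."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(gf_pow(2, i), d)
    return data + [p, q]

def recover(codeword, x, y):
    """Rebuild erased data bytes at known positions x and y."""
    data, (p, q) = codeword[:-2], codeword[-2:]
    for i, d in enumerate(data):
        if i not in (x, y):
            p ^= d                          # p becomes d_x ^ d_y
            q ^= gf_mul(gf_pow(2, i), d)    # q becomes 2^x*d_x ^ 2^y*d_y
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    dx = gf_mul(q ^ gf_mul(gy, p), gf_inv(gx ^ gy))
    return dx, p ^ dx

word = encode([10, 20, 30, 40])
assert recover(word, 1, 3) == (20, 40)      # two erased bytes rebuilt
```

This illustrates the 2:1 ratio quoted above: the same two redundant bytes that recover two erasures would be spent locating and fixing only one error of unknown position.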