CS 255 Notes 2


This time we will start with representing symbols and binary arithmetic and then go into how information is recorded on different kinds of storage.

Symbol Representation

Since the title of the class includes information representation, here we discuss how to represent two simple kinds of data, characters and numbers.


In the early days, most computer manufacturers picked their own character set. This was a mapping between the characters on the keyboard and the binary numbers the machine knew how to store. For example, UNIVAC and Control Data each chose a 6 bit number to represent each character. This allowed only upper case characters. Later came a 7 bit standard known as ASCII (American Standard Code for Information Interchange). IBM had a similar system known as EBCDIC (Extended Binary Coded Decimal Interchange Code). Using 7 bits allowed the computer to represent 128 characters. This included upper and lower case letters, numbers, punctuation and control characters. Control characters got their name from the fact that on old Teletype systems, these bit patterns would actually change the behavior of the terminal. They still do to some extent. The number 7, seen as a character, causes the terminal to make a sound. However, as time has gone on, we have found that it would be better if we could represent more characters. So ASCII was extended to 8 bits, giving us 256 characters. This is not nearly enough to support some Asian writing systems, so a new standard is being developed called Unicode, which uses 16 bits. There is also a rival standardization effort that uses 32 bits.


Simply storing numbers in binary notation works but doesn't allow negative numbers. Storing the digits as characters uses up a lot of space. We would use 2 bytes to store a 2 digit number, but 2 bytes (16 bits) can hold 2^16 = 65,536 different patterns, while the largest 2 digit number is 99. Not very efficient. We and the book will discuss two techniques, excess notation and two's complement.

Excess Notation

Now, I believe in moderation in all things, including notation. The term here refers to a technique for mapping decimal numbers to binary bit patterns. First, all integers that we are going to work with are represented as fixed length bit patterns. For example, if we are using a 4 bit pattern, the pattern 1000 is the middle of all possible 4 bit patterns. So it becomes the code for zero. The patterns are used like this:
0000 => -8
0001 => -7
...
0111 => -1
1000 => 0
...
1111 => +7
All negative numbers start with zero. See page 39 for the complete table. This 4 bit example is called excess 8 because the difference between the bit pattern's plain binary interpretation and the number it represents is 8. To convert from decimal to excess 8, add 8 to the number, convert to binary, and pad with zeros on the left to get 4 bits.
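The add-8-and-convert recipe can be sketched in Python (a minimal sketch; the function names are mine):

```python
def to_excess8(n):
    """Encode an integer in the range -8..7 as a 4-bit excess-8 pattern."""
    if not -8 <= n <= 7:
        raise ValueError("out of range for 4 bits")
    return format(n + 8, '04b')   # add 8, convert to binary, pad to 4 bits

def from_excess8(bits):
    """Decode a 4-bit excess-8 pattern back to an integer: subtract 8."""
    return int(bits, 2) - 8

print(to_excess8(0))         # 1000
print(to_excess8(-7))        # 0001
print(from_excess8('0111'))  # -1
```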

Two's Complement

This technique also uses a fixed width bit string to represent integers. Positive values start at zero and count up in binary; in a 4 bit system they run from 0000 up to 0111. Negative numbers start at 1111 (which is -1) and count down to 1000 (the most negative value, -8). All negative numbers start with a one. This is called the sign bit.


Positive numbers are coded as a sign bit of zero and the rest of the bits are used to store the number in binary. So +6 is stored as 0110 in a 4 bit system. A -6 is stored by first complementing the positive representation and then adding 1. The complement of a number is the same number of bits with each 1 changed to zero and each zero changed to 1. In this case, 0110 becomes 1001. Then we add 1 to it so -6 is stored as 1010. The book has a different algorithm on page 41.
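The complement-and-add-1 negation described above can be sketched like this for a 4 bit system (a sketch; helper names are mine):

```python
def complement(bits):
    """Flip every bit in the string: each 1 becomes 0 and each 0 becomes 1."""
    return ''.join('1' if b == '0' else '0' for b in bits)

def negate(bits):
    """Two's complement negation: complement, then add 1, staying at 4 bits."""
    n = int(complement(bits), 2) + 1
    return format(n & 0b1111, '04b')   # discard any carry past 4 bits

print(negate('0110'))  # 1010, the pattern for -6
print(negate('1010'))  # 0110, back to +6
```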


Now for some arithmetic with two's complement numbers. This is very similar to doing binary arithmetic except that the result has to be the same length as the operands. So any carries off the left end are discarded.

2 + 2 = 4
0010 + 0010 = 0100 = 4

2 + (-3) = -1
0010 + 1101 = 1111 = -1

4 + (-3) = 1
0100 + 1101 = 10001 = 0001  = 1

-2+(-2) = -4
1110 + 1110 = 11100 = 1100 = -4
One thing to notice is that we can do subtraction by a combination of negation and addition, since a - b = a + (-b). We can do multiplication and division this way as well, since these are simply repeated additions and subtractions. This is probably not the most efficient technique.
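The fixed-width addition above, with the carry off the left end discarded, can be sketched by masking to 4 bits (a sketch; the names are mine):

```python
MASK = 0b1111  # keep only 4 bits

def add4(a, b):
    """Add two 4-bit patterns; any carry out of the 4th bit is discarded."""
    return (a + b) & MASK

def to_signed(x):
    """Interpret a 4-bit pattern as a two's complement integer."""
    return x - 16 if x & 0b1000 else x   # sign bit set means negative

print(to_signed(add4(0b0010, 0b1101)))  # 2 + (-3) = -1
print(to_signed(add4(0b0100, 0b1101)))  # 4 + (-3) = 1
print(to_signed(add4(0b1110, 0b1110)))  # -2 + (-2) = -4
```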


Let's try another arithmetic problem.
5 + 4 = 9
0101 + 0100 = 1001 = -7
7 + 7 = 14
0111 + 0111 = 1110 = -2
This is a case of overflow. With only 4 bits, we can only represent values from -8 to +7. But notice that if we interpret the results above as plain binary numbers:
0101 + 0100 = 1001 = 9
0111 + 0111 = 1110 = 14

The overflow problem is not really solvable using a fixed number of bits. But modern computers use 32 bits for integers, which range up to +2,147,483,647, which is fine for most things. You can switch to double precision, which gets you 64 bits. Another technique is to change the scale. That is, instead of measuring in grams, use kilograms.
Another approach is to use arbitrary precision arithmetic instead of fixed-size integer arithmetic. One example of this is binary coded decimal (BCD).
This records each digit of a number as a separate bit pattern in memory. So for example, instead of recording 12 as 1100, we would record the 1 as 0001 and the 2 as 0010 in two separate memory locations. Actually, we could store 2 digits per byte. Then we do arithmetic digit by digit. Now the only limit is the amount of memory.
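Digit-by-digit arithmetic on separately stored digits can be sketched like this (a sketch, not a real BCD implementation; it assumes both numbers have the same digit count):

```python
def bcd_add(a_digits, b_digits):
    """Add two numbers stored as lists of decimal digits, most significant first."""
    result, carry = [], 0
    for da, db in zip(reversed(a_digits), reversed(b_digits)):
        s = da + db + carry          # add one digit position at a time
        result.append(s % 10)        # keep the low digit
        carry = s // 10              # carry into the next position
    if carry:
        result.append(carry)         # the number grows by a digit if needed
    return list(reversed(result))

print(bcd_add([1, 2], [3, 9]))   # [5, 1], i.e. 12 + 39 = 51
print(bcd_add([9, 9], [0, 1]))   # [1, 0, 0], i.e. 99 + 1 = 100
```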

Floating Point Numbers

Floating point numbers are real numbers. Not like those fake integers. These are numbers like 3.14159265358. The technique for storing this in memory is based on scientific notation. This is the method that turns 314,000 into 3.14 x 10^5. The 3.14 part is called the mantissa and the 5 part is called the exponent. Since we aren't using base 10 (if someone invented a 10 state logic device, this would be easier), we use powers of 2 and binary numbers. Also, we don't have a decimal point, we have the more general radix point. For these examples, we are working with 8 bit floating point numbers. The layout of the 8 bits is a 1 bit sign, a 3 bit exponent, stored in 3 bit excess notation, and a 4 bit mantissa, stored in binary.
So, to convert an ordinary number, like 3 1/4 (3.25), to floating point, we first convert it to binary. This works out as 11.01.
Then we pretend that the radix point is at the left of the number. This looks like .1101 x 2^2.
Now we copy the bits to the mantissa, making it look like

Sign  Exponent  Mantissa
 _     _ _ _    1 1 0 1

Now, we convert the exponent to excess notation. Since we moved the radix point 2 digits to the left, the exponent is 2. In 3 bit excess notation (excess 4), 2 is 110. So we put this in the exponent part.

Sign  Exponent  Mantissa
 _     1 1 0    1 1 0 1
Since we started with a positive number, the sign bit is 0. Finally,
Sign  Exponent  Mantissa
 0     1 1 0    1 1 0 1

Looked at another way, .1101 x 2^2 is (1/2 + 1/4 + 0/8 + 1/16) x 4 = 13/16 x 4 = 52/16 = 3.25.
Let's go the other direction. Given the bit pattern 01101011, what number is it?
The sign bit is 0, so it is positive. The exponent is 110, which is 2. The mantissa is .1011. So the whole number is .1011 x 2^2, or 10.11.
This is 2 + 0 + 1/2 + 1/4 = 2 3/4 = 2.75.
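The whole decoding procedure can be sketched in Python, using exact fractions so no precision is lost (a sketch; the function name is mine):

```python
from fractions import Fraction

def decode(bits):
    """Decode an 8-bit float: 1 sign bit, 3-bit excess-4 exponent,
    4-bit mantissa with the radix point at its left."""
    sign = -1 if bits[0] == '1' else 1
    exponent = int(bits[1:4], 2) - 4            # undo the excess-4 encoding
    mantissa = Fraction(int(bits[4:], 2), 16)   # .xxxx means sixteenths
    return sign * mantissa * Fraction(2) ** exponent

print(decode('01101011'))  # 11/4, i.e. 2.75
print(decode('01101101'))  # 13/4, i.e. 3.25
```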

Why is the exponent stored in excess notation? The claim is that we can determine the relative size of two numbers faster. Take two floating point numbers, A=00101010 and B=00011001. Which is bigger? The process is to scan the digits from the left; the first number to have a 1 where the other has a 0 is larger. So in this case, it would say that A is larger. Both of these are positive numbers. The exponent of A is 010, or -2, and the exponent of B is 001, or -3. That alone makes B smaller. This works because the bit patterns in excess notation have larger binary values for bigger numbers. See page 39. If the exponents were stored in sign-magnitude binary instead, -2 would be 110 and -3 would be 111, so this trick wouldn't work.

Round off and other errors

Code 3 5/8 in 8 bit floating point. First, in binary, 3 5/8 is 11.101, which is .11101 x 2^2. Copying this to the 4 bit mantissa, we get
Sign  Exponent  Mantissa
 _     _ _ _    1 1 1 0

We have lost the last 1. This means we are off by 1/8 (the lost bit was worth 1/32, times 2^2). Continuing, we get

Sign  Exponent  Mantissa
 0     1 1 0    1 1 1 0

If we convert this back, we get 3.5. This is a kind of overflow problem, since we copied a 5 bit string into a 4 bit field. The obvious fix is to increase the size of the floating point variable. Most current machines use 32 bits. You can get 64 using doubles.
Another kind of arithmetic error is imprecision. 1/3 in decimal is a non-terminating expansion. Similarly, 1/10 doesn't terminate in binary.
Something like the BCD technique from the integer world can be done here too. It also matters what order operations are done in; see page 47 for an example. A lot of work has been done in the study of numerical methods to figure out what can be done to minimize these kinds of errors.
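The 1/10 problem is easy to demonstrate, and Python's decimal module is a real example of doing arithmetic digit by digit in base 10, much as BCD does:

```python
# 1/10 has no finite binary expansion, so repeatedly adding 0.1
# in ordinary binary floating point accumulates error.
total = 0.0
for _ in range(10):
    total += 0.1
print(total == 1.0)   # False
print(total)          # slightly less than 1.0

# Decimal arithmetic keeps each base-10 digit exactly.
from decimal import Decimal
print(sum([Decimal('0.1')] * 10) == 1)  # True
```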

Communications Errors

It doesn't take much to flip a bit: cosmic rays, radioactive materials in the chip packaging, or a thunderstorm near the phone lines. One simple technique to catch errors is called parity. This involves adding one extra bit to each cell. If we are using odd parity, then this bit
is set to give the cell an odd number of ones. If a bit gets flipped, then there will be an even number of ones and we will know about the error. We won't be able to tell what the error was. Even parity is similar. This method is used heavily in main memory. On the memory card I showed, there are 9 chips to give us 256K 9 bit cells: 8 bit bytes and a parity bit.
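Odd parity can be sketched in a few lines (a sketch; the function names are mine):

```python
def odd_parity_bit(byte):
    """Choose the extra bit that gives the 9-bit cell an odd number of ones."""
    ones = bin(byte).count('1')
    return 0 if ones % 2 == 1 else 1

def check_odd_parity(byte, parity):
    """True if the stored cell still has an odd number of ones."""
    return (bin(byte).count('1') + parity) % 2 == 1

b = 0b01000001            # ASCII 'A', which has two ones
p = odd_parity_bit(b)     # so the parity bit must be 1
print(check_odd_parity(b, p))           # True: cell is intact
print(check_odd_parity(b ^ 0b100, p))   # False: a flipped bit is detected
```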

For longer bit strings, we might use a parity byte, where each bit is used as a parity bit for a part of the string. Or we could use a checksum, which does some arithmetic on the bit string. If there is a change in the bits, the sum we sent won't match the sum calculated at the other end.

If we add even more bits, we can not only tell if there was an error, but correct it as well. This method involves mapping characters to bit strings, as in ASCII. But instead of being sequential, the bit strings are chosen so that each pair of strings differs in at least 3 bits.

A 000000
B 001111
C 010011
Then, if one bit changes, the result will be an illegal pattern. This is detection. We can correct it because it will be 1 bit different from the pattern it was supposed to be but at least 2 bits different from all the others. So we find the one it is closest to and use that. See page 50. These are called Hamming codes, after their inventor. This particular code can detect 2 bit failures and correct 1 bit ones. Adding more bits allows more detection and correction.
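Nearest-pattern correction for the three-symbol code above can be sketched like this (a sketch; the names are mine):

```python
CODE = {'A': '000000', 'B': '001111', 'C': '010011'}

def distance(x, y):
    """Hamming distance: the number of bit positions where x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def correct(received):
    """Correct a received pattern by picking the closest legal one."""
    return min(CODE, key=lambda sym: distance(CODE[sym], received))

# 001011 is 1 bit from B's pattern but 2 or more from A's and C's.
print(correct('001011'))  # B
print(correct('000001'))  # A
```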

Mass Storage

We have seen how internal memory is laid out. Each cell in the main memory is given a unique address. While you can randomly access each memory cell, and this access (both read and write) is quite fast, if power is lost, so is the information. It is also relatively expensive. The most common form of secondary storage is a hard disk.

Disk Storage

Information is stored in collections called files. Mass storage systems usually involve moving mechanical parts. Access times for electronic main memory are usually measured in nanoseconds. Disk access times are measured in milliseconds. So there is a 3 order of magnitude difference. But disk storage can be hundreds of times cheaper and retains the information without power. Each disk drive consists of multiple platters, which are metal disks coated with magnetic material. They may be single or double sided, although most today are double sided. Each surface has a read/write head that floats just above the surface. The head can move in and out along the radius of the platter. The platters spin on a spindle at a high rate of speed. The surface is divided into concentric circles called tracks. The same track on each platter is referred to as a cylinder.
Disk performance is measured by capacity and speed. The capacity is measured in megabytes or gigabytes. The speed is measured by seek time, which is the time it takes the heads to move from track to track. The latency is the average time it takes for a sector to rotate under the head. This is half the rotation time. The access time is the sum of these two. The transfer rate is how fast data can be moved from the disk to memory. Some example values are from the Seagate drive I showed in class. There are 10 platters on a spindle that rotates at 5400 RPM. There are 17 (why?) heads and an average of 75 sectors per track. The latency is 5.56 msec. The seek time for one track is 1.7 msec and the full range time is 22.5 msec. The average access time is 11.7 msec. The transfer rate is 10M/sec. The capacity is 1,600,930,800 bytes unformatted. The formatted size is 1,370,545,152 bytes. This means that about 230M of the disk is used to hold information used to find other information on the disk. This is about 14%. And this thing weighs about 8 pounds. This is a fairly low capacity, somewhat slow and very heavy device.
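The latency figure above follows directly from the rotation speed; a quick check, assuming (as stated) that average access time is average seek plus latency:

```python
rpm = 5400
rotation_ms = 60_000 / rpm       # one full rotation, in milliseconds
latency_ms = rotation_ms / 2     # average wait is half a rotation
print(round(latency_ms, 2))      # 5.56

avg_seek_ms = 11.7 - latency_ms  # back out the average seek time
print(round(avg_seek_ms, 2))     # 6.14
```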

Floppy and floppy-like devices

Floppy disks started out being 8 inches in diameter and have shrunk to the current 3.5 inches. Current floppies have about 1.44 MB capacity. While they are slow and small, they are very portable and are largely used for file transfer and some backup. LS120 and Zip disks are the same size but can hold 100 or 120 MB.


A CD-ROM is similar to a disk in that it is a spinning device. But it is generally read only and uses lasers to read the information rather than a floating magnetic sensor. The capacity is about 600 MB. The information is laid out in one continuous track, although it is divided logically into tracks and sectors to make it look to the system like a disk drive. Some newer versions of the CD-ROM are writable. In either case,
this device provides reasonable access times with data permanence. It is also easily produced and is portable.


A magnetic tape is a long strip of plastic tape coated with a magnetic medium. It works much like a cassette tape. It is a good choice for backup media due to its large capacity (20 GB in some cases) and low cost. However, a tape is not randomly accessible. To get a piece of data at the end of the tape, you must process the entire tape.