CS 255 Notes 2

Introduction

This time we will start with representing symbols and binary arithmetic and then go into how information is recorded on different kinds of storage.

Symbol Representation

Since the title of the class includes information representation, here we discuss how to represent two simple kinds of data, characters and numbers.

Characters

In the early days, most computer manufacturers picked their own character set. This was a mapping between the characters on the keyboard and the binary numbers the machine knew how to store. For example, UNIVAC and Control Data each chose a 6 bit number to represent each character. This allowed only upper case characters. Later came a 7 bit standard known as ASCII (American Standard Code for Information Interchange). IBM had a similar system known as EBCDIC (Extended Binary Coded Decimal Interchange Code). Using 7 bits allowed the computer to represent 128 characters. This included upper and lower case, numbers, punctuation and control characters. Control characters got their names from the fact that on old Teletype systems, these bit patterns would actually change the behavior of the terminal. They still do to some extent. The number 7, seen as a character, causes the terminal to make a sound. However, as time has gone on, we have found that it would be better if we could represent more characters. So ASCII was extended to 8 bits, giving us 256 characters. This is not nearly enough to support some Asian writing systems, so a new standard is being developed called Unicode, which uses 16 bits. There is also a rival standardization effort that uses 32 bits.
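If you want to see these codes for yourself, Python's built-in ord and chr functions map characters to their code numbers and back:

    print(ord('A'))       # 65 -- the ASCII code for 'A'
    print(chr(65))        # 'A'
    print(repr(chr(7)))   # '\x07' -- the "bell" control character mentioned above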

Numbers

Simply storing numbers in binary notation works but doesn't allow negative numbers. Storing the digits as characters uses up a lot of space. We would use 2 bytes to store a 2 digit number, and 2 bytes (16 bits) can hold 2^16 = 65,536 different patterns, while the largest 2 digit number is only 99. Not very efficient. We and the book will discuss two techniques, excess notation and two's complement.

Excess Notation

Now, I believe in moderation in all things, including notation. The term here refers to a technique for mapping decimal numbers to binary bit patterns. First, all integers that we are going to work with are represented as fixed length bit patterns. For example, if we are using a 4 bit pattern, the pattern 1000 is the middle of all possible 4 bit permutations. So it becomes the code for zero. The patterns are used like this.
 
 
1000 =>  0        0000 => -8
1001 =>  1        0001 => -7
1010 =>  2        0111 => -1
All negative numbers start with zero. See page 39 for the complete table. This 4 bit example is called excess 8 because each bit pattern, read as plain binary, is 8 more than the number it represents. To convert from decimal to excess 8, add 8 to the number, convert to binary, and pad with zeros on the left to get 4 bits.
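Here is a small Python sketch of that rule (the function names are just made up for illustration):

    def to_excess8(n):
        # add 8, convert to binary, pad on the left to 4 bits
        return format(n + 8, '04b')

    def from_excess8(bits):
        return int(bits, 2) - 8

    print(to_excess8(0), to_excess8(-8), to_excess8(2))   # 1000 0000 1010
    print(from_excess8('0111'))                           # -1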

Twos Complement

This technique also uses a fixed width bit string to represent integers. Positive values start at zero and count up in binary to 0111...1, the largest positive value. Negative numbers start at -1, which is all ones (1111...1), and count down to the most negative value, 1000...0. All negative numbers start with a one. This leftmost bit is called the sign bit.

Coding/Decoding

Positive numbers are coded as a sign bit of zero and the rest of the bits are used to store the number in binary. So +6 is stored as 0110 in a 4 bit system. A -6 is stored by first complementing the positive representation and then adding 1. The complement of a number is the same number of bits with each 1 changed to zero and each zero changed to 1. In this case, 0110 becomes 1001. Then we add 1 to it so -6 is stored as 1010. The book has a different algorithm on page 41.
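A quick Python sketch of the complement-and-add-one recipe for a 4 bit system (the function name is mine, not the book's):

    def twos_complement(n, width=4):
        # positive: just plain binary, padded to the fixed width
        if n >= 0:
            return format(n, f'0{width}b')
        # negative: complement the positive pattern, then add 1
        positive = format(-n, f'0{width}b')
        flipped = ''.join('1' if b == '0' else '0' for b in positive)
        return format(int(flipped, 2) + 1, f'0{width}b')

    print(twos_complement(6))    # 0110
    print(twos_complement(-6))   # 1010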

Addition

Now for some arithmetic with 2's complement numbers. This is very similar to doing binary arithmetic except that the result has to be the same length as the operands, so any carry out of the leftmost bit is discarded.

2+2=4
0010 + 0010 = 0100 = 4

2 + (-3) = -1
0010 + 1101 = 1111 = -1

4 + (-3) = 1
0100 + 1101 = 10001 = 0001  = 1

-2+(-2) = -4
1110 + 1110 = 11100 = 1100 = -4
One thing to notice is that we can do subtraction by a combination of negation and addition. This is because a-b = a + (-b). We can do multiplication and division this way as well since these are simply repeated additions and subtractions. This is probably not the most efficient technique.
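Here is a rough Python sketch of 4 bit addition with the carry thrown away, checked against the examples above (the helper names are mine):

    def add4(a, b):
        # add as plain binary, then throw away any carry out of the top bit
        return (a + b) & 0b1111

    def decode4(bits):
        # a set sign bit means the pattern stands for a negative number
        return bits - 16 if bits & 0b1000 else bits

    print(decode4(add4(0b0010, 0b0010)))   #  2 + 2    =  4
    print(decode4(add4(0b0010, 0b1101)))   #  2 + (-3) = -1
    print(decode4(add4(0b0100, 0b1101)))   #  4 + (-3) =  1  (carry dropped)
    print(decode4(add4(0b1110, 0b1110)))   # -2 + (-2) = -4  (carry dropped)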

Overflow

Let's try another arithmetic problem.
5 + 4 = 9
0101 + 0100 = 1001 = -7
7 + 7 = 14
0111 + 0111 = 1110 = -2
This is a case of overflow. With only 4 bits, we can only represent values from -8 to +7. But notice that if we interpret the results above as plain binary numbers:
0101 + 0100 = 1001 = 9
0111 + 0111 = 1110 = 14
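The hardware can at least notice when this happens. A minimal sketch of the usual sign-based check (not from the book, just an illustration):

    def add4_checked(a, b):
        result = (a + b) & 0b1111
        sa, sb, sr = a & 0b1000, b & 0b1000, result & 0b1000
        # same-sign operands with an opposite-sign result means the answer didn't fit
        if sa == sb and sr != sa:
            print("overflow!")
        return result

    add4_checked(0b0101, 0b0100)   # 5 + 4 -> overflow! (pattern 1001 reads as -7)
    add4_checked(0b0010, 0b0010)   # 2 + 2 -> no complaint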

The overflow problem is not really solvable using a fixed number of bits. But modern computers use 32 bits for integers, which ranges up to +2,147,483,647, fine for most things. You can switch to double precision integers, which gets you 64 bits. Another technique is to change the scale. That is, instead of measuring in grams, use kilograms.
Another approach is to use infinite precision arithmetic instead of integer arithmetic. One example of this is binary coded decimal (BCD).
This records each digit of a number as a separate bit pattern in memory. So, for example, instead of recording 12 as 1100, we would record the 1 as 0001 and the 2 as 0010 in two separate memory locations. (Actually, we could pack 2 digits per byte.) Then we do arithmetic digit by digit. Now the only limit is the amount of memory.
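A rough Python sketch of the digit-by-digit idea (this keeps one digit per list entry rather than packing two per byte, and the function names are just illustrative):

    def to_digits(n):
        # 12 -> [1, 2]; each digit would get its own 4 bit pattern in memory
        return [int(d) for d in str(n)]

    def bcd_add(a, b):
        a, b = to_digits(a)[::-1], to_digits(b)[::-1]   # least significant digit first
        result, carry = [], 0
        for i in range(max(len(a), len(b))):
            s = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry
            result.append(s % 10)
            carry = s // 10
        if carry:
            result.append(carry)
        return int(''.join(str(d) for d in reversed(result)))

    print(bcd_add(12, 39))   # 51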

Floating Point Numbers

Floating point numbers are real numbers. Not like those fake integers. These are numbers like 3.14159265358. The technique for storing these in memory is based on scientific notation. This is the method that turns 314,000 into 3.14 x 10^5. The 3.14 part is called the mantissa and the 5 part is called the exponent. Since we aren't using base 10 (if someone invented a 10 state logic device, this would be easier), we use powers of 2 and binary numbers. Also, we don't have a decimal point, we have the more general radix point. For these examples, we are working with 8 bit floating point numbers. The layout of the 8 bits is a 1 bit sign, a 3 bit exponent stored in 3 bit excess notation, and a 4 bit mantissa stored in binary.
So, to convert an ordinary number like 3 1/4 (3.25) to floating point, we first convert it to binary. This works out as 11.01.
Then we pretend that the radix point is at the left of the number. This looks like .1101 x 2^2.
Now we copy the bits to the mantissa, making it look like this:
 
Sign   Exponent   Mantissa
 _      _ _ _      1 1 0 1

Now, we convert the exponent to excess notation. Since we moved the radix point 2 digits to the left, the exponent is 2. In the 3 bit excess notation, 2 is 110. So we put this in the exponent part.
 

Sign   Exponent   Mantissa
 _      1 1 0      1 1 0 1
Since we started with a positive number, the sign bit is 0. Finally,
 
Sign   Exponent   Mantissa
 0      1 1 0      1 1 0 1

Looked at another way, .1101 is 1/2 + 1/4 + 0/8 + 1/16 = 13/16, and multiplying by 2^2 gives 13/16 x 4 = 52/16 = 3.25.
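Here is a rough Python sketch of the whole packing procedure for this toy 8 bit format (the function name and the use of Fraction are my own choices for illustration):

    from fractions import Fraction

    def encode(value):
        sign = '1' if value < 0 else '0'
        frac = Fraction(abs(value))
        # shift the radix point to the left of the first significant bit
        exponent = 0
        while frac >= 1:
            frac /= 2
            exponent += 1
        while frac and frac < Fraction(1, 2):
            frac *= 2
            exponent -= 1
        mantissa = ''
        for _ in range(4):                 # peel off 4 mantissa bits
            frac *= 2
            bit, frac = divmod(frac, 1)
            mantissa += str(int(bit))
        return sign + format(exponent + 4, '03b') + mantissa

    print(encode(Fraction(13, 4)))   # 3.25 -> '01101101', matching the table above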
Let's go the other direction. Given a bit pattern 01101011, what number is it?
The sign bit is 0, so it is positive. The exponent is 110, which is 2. The mantissa is .1011. So the whole number is .1011 x 2^2, or 10.11.
This is 2 + 0 + 1/2 + 1/4 = 2 3/4 = 2.75.
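And a matching sketch for unpacking a pattern (again, just illustrative):

    from fractions import Fraction

    def decode(bits):
        sign = -1 if bits[0] == '1' else 1
        exponent = int(bits[1:4], 2) - 4                # undo the 3 bit excess notation
        mantissa = sum(Fraction(int(b), 2 ** (i + 1))
                       for i, b in enumerate(bits[4:]))  # read .xxxx as a fraction
        return sign * mantissa * Fraction(2) ** exponent

    print(decode('01101011'))   # 11/4, that is 2.75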

Why is the exponent stored in excess notation? The claim is that we can determine the relative size of two numbers faster. Take two floating point numbers, A=00101010 and B=00011001. Which is bigger? The process is to scan the digits from the left; the first number to have a 1 digit is larger. So in this case, it would say that A is larger. Both of these are positive numbers. The exponent of A is 010, or -2, and the exponent of B is 001, or -3. That alone makes B smaller. This works because bit patterns in excess notation, read as plain binary, get larger as the values they represent get larger. See page 39. If the exponents were stored in a sign and magnitude style of binary notation, A's exponent would be 110 and B's would be 111, so this trick wouldn't work.
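A tiny check of the claim on the two patterns above (this shortcut only applies when both numbers are positive and normalized):

    A, B = '00101010', '00011001'
    print(A > B)                   # True -- plain string comparison says A is larger
    print(int(A, 2) > int(B, 2))   # True -- same answer treating the patterns as unsigned binary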

Round off and other errors

Let's code 3 5/8 in 8 bit floating point. First in binary: 11.101, which is .11101 x 2^2 after moving the radix point. Copying this to the 4 bit mantissa, we get
 
Sign   Exponent   Mantissa
 _      _ _ _      1 1 1 0

We have lost the last 1. This means we are off by 1/8. Continuing we get
 

Sign   Exponent   Mantissa
 0      1 1 0      1 1 1 0

If we convert this back, we get 3.5. This is a kind of overflow problem, since we copied a 5 bit string into a 4 bit field. The obvious fix is to increase the size of the floating point variable. Most current machines use 32 bits. You can get 64 by using doubles.
Another kind of arithmetic error is imprecision. 1/3 in decimal is a non-terminating number; similarly, 1/10 doesn't terminate in binary.
Something like BCD in the integer world can be done. It also matters what order operations are done in; see page 47 for an example. A lot of work has been done in the field of numerical methods to figure out what can be done to minimize these kinds of errors.
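A quick illustration of the 1/10 problem in ordinary Python floats:

    total = 0.0
    for _ in range(10):
        total += 0.1          # 0.1 has no exact binary representation
    print(total)              # 0.9999999999999999, not 1.0
    print(total == 1.0)       # False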

Communications Errors

It doesn't take much to flip a bit. Cosmic rays, radioactive materials in the chip packaging, or a thunderstorm near the phone lines will do it. One simple technique to catch errors is called parity. This involves adding one extra bit to each cell. If we are using odd parity, then this bit
is set to give the cell an odd number of ones. If a bit gets flipped, there will be an even number of ones and we will know about the error. We won't be able to tell which bit it was. Even parity is similar. This method is used heavily in main memory. On the memory card I showed, there are 9 chips to give us 256K 9 bit cells: 8 bit bytes plus a parity bit.
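A rough sketch of odd parity over one byte (the helper names are made up):

    def odd_parity_bit(byte):
        # choose the ninth bit so the total number of ones is odd
        return 0 if bin(byte).count('1') % 2 == 1 else 1

    def check_odd(byte, parity):
        return (bin(byte).count('1') + parity) % 2 == 1

    b = 0b01000001                        # the 7 bit ASCII code for 'A'
    p = odd_parity_bit(b)
    print(p, check_odd(b, p))             # 1 True
    print(check_odd(b ^ 0b00000100, p))   # False -- a single flipped bit is detected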

For longer bit strings, we might use a parity byte. Each bit is used as a parity bit for a part of the string. Or we could use a checksum. This does some arithmetic on the bit string; if any of the bits change, the sum we sent won't match the sum calculated at the other end.
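A minimal checksum sketch, just summing the bytes modulo 256 (real protocols use fancier sums, but this is the idea):

    def checksum(data):
        # add up the bytes and keep the result in one byte
        return sum(data) % 256

    message = bytes([10, 20, 30, 40])
    sent = checksum(message)
    garbled = bytes([10, 20, 31, 40])    # one byte changed in transit
    print(sent == checksum(garbled))     # False -- the mismatch reveals the error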

If we add even more bits, we can not only tell that there was an error, but correct it as well. This method involves mapping characters to bit strings, as in ASCII. But instead of being sequential, the bit strings are chosen so that each pair of strings differs in at least 3 bits.
 

A 000000
B 001111
C 010011
Then, if one bit changes, it will be an illegal pattern. This is detection. We can correct it because the result will be 1 bit different from the pattern it was supposed to be, but at least 2 bits different from all the others. So we find the codeword it is closest to and use that. See page 50. These are called Hamming codes, after their inventor. This particular code can detect 2 bit failures and correct 1 bit ones. Adding more bits allows more detection and correction.
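A small sketch of the correction step, using the three codewords above and picking the closest legal pattern (the names are illustrative):

    CODE = {'A': '000000', 'B': '001111', 'C': '010011'}

    def distance(x, y):
        # count the bit positions where the two strings differ
        return sum(a != b for a, b in zip(x, y))

    def correct(received):
        # pick the legal codeword the received pattern is closest to
        return min(CODE, key=lambda ch: distance(CODE[ch], received))

    print(correct('001011'))   # 'B' -- one bit from 001111, at least two bits from the others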

Mass Storage

  We have seen how internal memory is laid out. Each cell in the main memory is given a unique address. While you can randomly access each memory cell and this access (both read and write) is quite fast, if power is lost, so is the information. It is also relatively expensive. The most common form of secondary storage is a hard disk.

Disk Storage

Information is stored in collections called files. Mass storage systems usually involve moving mechanical parts. Access times for electronic main memory are usually measured in nanoseconds; disk access times are measured in milliseconds, so there is a 3 order of magnitude difference. But disk storage can be hundreds of times cheaper and retains the information without power. Each disk drive consists of multiple platters, which are metal disks coated with magnetic material. They may be single or double sided, although most today are double sided. Each surface has a read/write head that floats just above the surface. The head can move in and out along the radius of the platter. The platters spin on a spindle at a high rate of speed. The surface is divided into concentric circles called tracks, and each track is divided into pieces called sectors. The set of tracks at the same position on every platter is referred to as a cylinder.
Disk performance is measured by capacity and speed. The capacity is measured in megabytes or gigabytes. The speed is measured by the seek time, which is the time it takes the heads to move from track to track. The latency is the average time it takes for a sector to rotate under the head; this is half the rotation time. The access time is the sum of these two. The transfer rate is how fast data can be moved from the disk to memory. Here are some example values for the Seagate drive I showed in class. There are 10 platters on a spindle that rotates at 5400 RPM. There are 17 (why?) heads and an average of 75 sectors per track. The latency is 5.56 msecs. The seek time for one track is 1.7 msec and the full range time is 22.5 msec. The average access time is 11.7 msecs. The transfer rate is 10M/sec. The capacity is 1,600,930,800 bytes unformatted. The formatted size is 1,370,545,152. This means that about 230M of the disk is used to hold information used to find other information on the disk. This is about 14%. And this thing weighs about 8 pounds. This is a fairly low capacity, somewhat slow and very heavy device.
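The latency figure follows directly from the rotation speed; a quick check in Python:

    rpm = 5400
    rotation_ms = 60_000 / rpm      # one full revolution takes about 11.1 ms
    latency_ms = rotation_ms / 2    # on average a sector is half a turn away
    print(round(latency_ms, 2))     # 5.56, matching the quoted latency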

Floppy and floppy-like devices

Floppy disks started out being 8 inches in diameter and have shrunk to the current 3.5 inches. Current floppies have about 1.44 MB capacity. While they are slow and small, they are very portable and are largely used for file transfer and some backup. LS120 and Zip disks are the same size but can hold 100 or 120 MB.

CD ROM

A CD-ROM is similar to a disk in that it is a spinning device. But it is generally read only and uses lasers to read the information rather than a floating magnetic sensor. The capacity is about 600 MB. The information is laid out in one continuous track, although it is divided logically into tracks and sectors to make it look to the system like a disk drive. Some newer versions of the CD-ROM are writable. In either case,
this device provides reasonable access times with data permanence. It is also easily produced and is portable.

Tape

A magnetic tape is a long strip of plastic coated with a magnetic material. It works much like a cassette tape player. It is a good choice for backup media due to its large capacity (20 GB in some cases) and low cost. However, a tape is not randomly accessible. To get a piece of data at the end of the tape, you must wind through the entire tape.