Hashing

Introduction

The purpose of building an index to a set of data is to provide a means of locating elements in the set without performing a linear search. One method of doing this is hashing.

Hashing is the process of transforming the key for a data record into an address for it. In the ideal case, each key is a direct address for the record. In our case, this would be true if the key were the record number. However, this would require one slot in the index for each possible record, even if not all possible records were in the database.

Generally, the hashing process involves transforming the key value into a number. This number is an index into a table that contains the address for the record. The transformation is one-way: the hash value cannot be deciphered into a unique key. It is also many-to-one: the goal is to compress the space of possible keys into the space of hash values. For example, if the transformation, or hash function, involves taking the key value modulo 13, then the range of key values is compressed into the integers 0 through 12.

Line	Key	Hashes to
1	1	1
2	10	10
3	13	0
4	14	1
5	998	10

The hash value is used as an index into the hash table.

When two keys, such as those in lines 1 and 4, have the same hash value, they are said to collide.
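The table above can be reproduced with a few lines of code. This is only a sketch: the names hash_key and TABLE_SIZE are made up for illustration, and the keys are the five from the table.

```python
# Hash each key from the table by taking it modulo 13.
TABLE_SIZE = 13  # the modulo base used in the example


def hash_key(key: int) -> int:
    """Compress an integer key into the range 0..TABLE_SIZE-1."""
    return key % TABLE_SIZE


keys = [1, 10, 13, 14, 998]

# Group the keys by the slot they hash to.
slots = {}
for key in keys:
    slots.setdefault(hash_key(key), []).append(key)

# Any slot that received more than one key is a collision.
collisions = {slot: ks for slot, ks in slots.items() if len(ks) > 1}
print(collisions)  # {1: [1, 14], 10: [10, 998]}
```

Besides the keys in lines 1 and 4, the keys in lines 2 and 5 also collide, since both hash to 10.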

The goal of a hashing algorithm is to minimize collisions. Since collisions cannot be eliminated entirely, the algorithm must also provide a means of handling them.

Hash Tables

The hash table contains the information needed to resolve a hashed key into an address. The hash value is the index in the hash table where this information is kept. In our case, this will probably be both the key value and the disk address (or record number) of the record that contains the key. The entries in the table are initialized to a value that is outside the key space. This allows the algorithm to detect an empty slot.
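One possible layout for such a table is sketched here. It assumes integer keys, a table of 13 slots, and None as the value outside the key space; the record number 7 is invented for the example.

```python
from typing import List, Optional, Tuple

TABLE_SIZE = 13   # assumed size, matching the modulo-13 example
EMPTY = None      # a value outside the key space, used to mark an unused slot

# Each slot holds either EMPTY or a (key, record_number) pair.
Slot = Optional[Tuple[int, int]]


def make_table(size: int = TABLE_SIZE) -> List[Slot]:
    """Create a table in which every slot starts out empty."""
    return [EMPTY] * size


def is_empty(table: List[Slot], index: int) -> bool:
    """Report whether the slot at this index has never been filled."""
    return table[index] is EMPTY


table = make_table()
# Store key 14 together with the (made-up) record number 7.
table[14 % TABLE_SIZE] = (14, 7)
```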

Hashing Functions

A good hashing function produces a large compression of the space of key values, is easy to implement, and is fast to run. It should also produce a fairly even distribution of hash values, so that collisions are spread across the table rather than concentrated in a few slots. For our purposes, we will convert the key string to an integer and then use the modulo operator to reduce the space. A good rule of thumb is to use a prime number as the modulo base.
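The notes do not say how the key string is converted to an integer. The sketch below assumes one simple choice: treat the bytes of the string as the digits of a base-256 number. The table size of 101 is likewise just an example of a prime.

```python
TABLE_SIZE = 101  # an example prime; the real size depends on the data


def string_to_int(key: str) -> int:
    """Interpret the UTF-8 bytes of the key as one large base-256 integer."""
    value = 0
    for byte in key.encode("utf-8"):
        value = value * 256 + byte
    return value


def hash_string(key: str, size: int = TABLE_SIZE) -> int:
    """Convert the key to an integer, then reduce it modulo the table size."""
    return string_to_int(key) % size


print(string_to_int("AB"))   # 65 * 256 + 66 = 16706
print(hash_string("AB"))     # 16706 % 101 = 41
```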

Collision Techniques

There are a large number of algorithms to deal with collisions. We will look at two general classes of them.

Open Address

Open address techniques search the table systematically to find an open slot to put the new key in. There are a number of search methods. In linear probing, after a collision is detected, the algorithm looks linearly through the table until it finds an open slot. When a lookup is done, the key is hashed and the table is searched in the same way to find the entry. One problem with this mechanism arises when a number of keys hash to the same value. The collision chain from this value can then take up slots that other keys hash to.

Quadratic probing is an attempt to fix this. Instead of looking in the adjacent slot for a place to put a value after a collision, the algorithm looks farther away, at offsets that grow with each attempt. This leaves gaps in the table, so other keys are less likely to find their own slot already occupied by a key that hashed elsewhere.
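Linear probing might be sketched as follows, assuming integer keys, a modulo hash, and slots that hold (key, address) pairs or None when empty. The function names are illustrative only.

```python
EMPTY = None


def insert(table, key, address):
    """Store (key, address) in the first open slot at or after the home slot."""
    size = len(table)
    start = key % size
    for step in range(size):
        index = (start + step) % size          # wrap around at the end
        if table[index] is EMPTY or table[index][0] == key:
            table[index] = (key, address)
            return index
    raise RuntimeError("hash table is full")


def lookup(table, key):
    """Follow the same probe path; an empty slot means the key is absent."""
    size = len(table)
    start = key % size
    for step in range(size):
        index = (start + step) % size
        if table[index] is EMPTY:
            return None
        if table[index][0] == key:
            return table[index][1]
    return None


table = [EMPTY] * 13
for record_number, key in enumerate([1, 14, 27, 2]):
    insert(table, key, record_number)
```

Keys 1, 14 and 27 all hash to slot 1, so they fill slots 1, 2 and 3. Key 2 hashes to slot 2 but ends up in slot 4, which shows how one collision chain can take up slots that other keys hash to.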
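A minimal sketch of the quadratic idea follows, assuming the i-th probe lands i squared slots past the home slot. This is one common choice of offsets, not the only one, and the function names are made up. Note that a quadratic sequence is not guaranteed to visit every slot, so the insert below simply gives up after one pass.

```python
EMPTY = None


def probe_sequence(key, size, max_probes):
    """Slots examined for a key: home slot, then offsets of 1, 4, 9, ..."""
    start = key % size
    return [(start + i * i) % size for i in range(max_probes)]


def insert(table, key, address):
    """Place (key, address) in the first open slot along the quadratic path."""
    for index in probe_sequence(key, len(table), len(table)):
        if table[index] is EMPTY or table[index][0] == key:
            table[index] = (key, address)
            return index
    raise RuntimeError("no open slot found on the probe path")


print(probe_sequence(1, 13, 5))  # [1, 2, 5, 10, 4]

table = [EMPTY] * 13
for record_number, key in enumerate([1, 14, 27, 3]):
    insert(table, key, record_number)
```

Here keys 1, 14 and 27 all hash to slot 1 and land in slots 1, 2 and 5. Slot 3 stays free, so key 3 goes straight into its own slot. With linear probing, key 27 would have taken slot 3 instead.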

Problems

The biggest problem here is deciding how much space to allocate for the hash table up front. If there is not enough, eventually some keys will be hashed and not be able to find open slots. If there is too much, the space is wasted. One solution is to change the size of the hash table on the fly.
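Changing the size on the fly means building a new table and hashing every stored key again, because the hash value depends on the table size. The sketch below assumes linear probing and a modulo hash; the sizes 5 and 11 are arbitrary, and the decision of when to grow is left out.

```python
EMPTY = None


def insert(table, key, address):
    """Linear-probing insert, as assumed for this example."""
    size = len(table)
    for step in range(size):
        index = (key % size + step) % size
        if table[index] is EMPTY or table[index][0] == key:
            table[index] = (key, address)
            return index
    raise RuntimeError("hash table is full")


def grow(table, new_size):
    """Build a larger table and re-insert every entry from the old one."""
    new_table = [EMPTY] * new_size
    for slot in table:
        if slot is not EMPTY:
            key, address = slot
            insert(new_table, key, address)   # re-hash with the new size
    return new_table


small = [EMPTY] * 5
for record_number, key in enumerate([1, 6, 11]):
    insert(small, key, record_number)        # all three collide at slot 1

large = grow(small, 11)                      # 1, 6 and 11 now hash to 1, 6, 0
```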

Deletion is a little tricky. The method we use to determine if a key is not already in the table is to follow the collision chain until we hit an empty slot. But if we delete a key by marking its slot empty, we could end a chain prematurely.
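The problem can be demonstrated with a short sketch. It assumes linear probing with a modulo hash, and naive_delete is a deliberately flawed routine written only to show the failure.

```python
EMPTY = None


def insert(table, key, address):
    """Linear-probing insert into the first open slot."""
    size = len(table)
    for step in range(size):
        index = (key % size + step) % size
        if table[index] is EMPTY:
            table[index] = (key, address)
            return index
    raise RuntimeError("hash table is full")


def lookup(table, key):
    """Follow the collision chain until the key or an empty slot is found."""
    size = len(table)
    for step in range(size):
        index = (key % size + step) % size
        if table[index] is EMPTY:
            return None
        if table[index][0] == key:
            return table[index][1]
    return None


def naive_delete(table, key):
    """Flawed on purpose: marks the slot empty, which can cut a chain short."""
    size = len(table)
    for step in range(size):
        index = (key % size + step) % size
        if table[index] is EMPTY:
            return False
        if table[index][0] == key:
            table[index] = EMPTY
            return True
    return False


table = [EMPTY] * 13
insert(table, 1, 100)               # lands in slot 1
insert(table, 14, 101)              # collides with 1, lands in slot 2
found_before = lookup(table, 14)    # 101
naive_delete(table, 1)              # slot 1 becomes empty
found_after = lookup(table, 14)     # None, although 14 is still in slot 2
```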

Chained Address

Chained addressing deals with several of these problems. The hash function is the same as before, but instead of storing the key at the hashed location, there is a pointer stored there to a linked list of all the values that hashed to the same place.

These pointers are all initialized to null. Colliding keys are handled by adding them to the list. Typically, they are added to the front of the list, as this is fastest.

The table only needs to be as large as the hash value space. Deleting an entry means removing it from the list it is on. This has no effect on the other entries.
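A chained table along these lines might look like the following sketch. It assumes integer keys and a modulo hash; the class and method names are invented for the example, and insert does not check whether the key is already present.

```python
class Node:
    """One entry in a collision list."""

    def __init__(self, key, address, next_node=None):
        self.key = key
        self.address = address
        self.next = next_node


class ChainedTable:
    def __init__(self, size=13):
        self.heads = [None] * size           # every list pointer starts as null

    def insert(self, key, address):
        """Add the entry at the front of its list (no duplicate check)."""
        index = key % len(self.heads)
        self.heads[index] = Node(key, address, self.heads[index])

    def lookup(self, key):
        """Walk the list for this hash value; return the address or None."""
        node = self.heads[key % len(self.heads)]
        while node is not None:
            if node.key == key:
                return node.address
            node = node.next
        return None

    def delete(self, key):
        """Unlink the entry from its list; other entries are untouched."""
        index = key % len(self.heads)
        previous, node = None, self.heads[index]
        while node is not None:
            if node.key == key:
                if previous is None:
                    self.heads[index] = node.next
                else:
                    previous.next = node.next
                return True
            previous, node = node, node.next
        return False


table = ChainedTable()
for record_number, key in enumerate([1, 14, 27]):
    table.insert(key, record_number)         # all three share the list at slot 1
table.delete(14)                             # 1 and 27 remain reachable
```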

Problems

The lists could get long, and searching a list is somewhat slower than searching an array. The links also take up additional space.

Conclusions

The chained hashing technique is probably the simplest and most flexible. Hashing trades space for time. The search times can be almost independent of the number of items.