Hashing
Introduction
The purpose of building an index to a set of data is
to provide a means of locating elements in the set without
performing a linear search.
One method of doing this is hashing.
Hashing is the process of transforming the key for
a data record into an address for it.
In the ideal case, each key is a direct address for the record.
In our case, if the key were the record number, this would be true.
This would require one slot in the index for each possible record,
even if not all possible records were in the database.
Generally, the hashing process involves transforming the key value
into a number.
This number is an index into a table that contains the address for the record.
The transformation is one way.
The hash value cannot be deciphered into a unique key.
The transformation is many to one.
The goal is to compress the space of possible keys
into the space of hash values.
For example, if the transformation, or hash function, involves
taking the key value modulo 13, then the range of key values
is compressed into the integers 0 through 12.
| Line | Key | Hashes to |
| 1 | 1 | 1 |
| 2 | 10 | 10 |
| 3 | 13 | 0 |
| 4 | 14 | 1 |
| 5 | 998 | 10 |
The hash value is used as an index into the hash table.
When two keys, such as those in lines 1 and 4, have the same
hash value, they are said to collide.
The goal of a hashing algorithm is to minimize collisions.
Since the transformation is many to one, collisions cannot be eliminated entirely,
so the algorithm must also provide a means of handling them.
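The modulo 13 example above can be sketched in a few lines of Python.
This is only an illustration; the names hash_key and find_collisions are made up for this sketch.

```python
from collections import defaultdict

TABLE_SIZE = 13  # the modulo base used in the example above


def hash_key(key):
    """Compress an integer key into the range 0..TABLE_SIZE-1."""
    return key % TABLE_SIZE


def find_collisions(keys):
    """Group keys by hash value and keep only the groups that collide."""
    groups = defaultdict(list)
    for key in keys:
        groups[hash_key(key)].append(key)
    return {h: ks for h, ks in groups.items() if len(ks) > 1}


keys = [1, 10, 13, 14, 998]
print([hash_key(k) for k in keys])  # [1, 10, 0, 1, 10]
print(find_collisions(keys))        # {1: [1, 14], 10: [10, 998]}
```

Besides keys 1 and 14, keys 10 and 998 also collide, since both hash to 10.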
Hash Tables
The hash table contains the information needed to resolve
a hashed key into an address.
The hash value is the index in the hash table where this
information is kept.
In our case, this will probably be both the key value and
the disk address (or record number) of the record
that contains the key.
The entries in the table are initialized to a value
that is outside the key space.
This allows the algorithm to detect an empty slot.
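A minimal sketch of such a table, assuming each slot holds a (key, record number) pair.
Here None serves as the value outside the key space; the helper names and the sample entry are hypothetical.

```python
EMPTY = None  # assumed sentinel: no real key is ever None


def make_table(size):
    """Create a table in which every slot starts out empty."""
    return [EMPTY] * size


def is_empty(table, index):
    """Report whether the slot at index has never been filled."""
    return table[index] is EMPTY


table = make_table(13)
table[1] = (1, 407)  # key 1 stored with a made-up record number
print(is_empty(table, 0), is_empty(table, 1))  # True False
```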
Hashing Functions
A good hashing function produces a large compression of the space
of key values, is easy to implement, and is fast to run.
It should also produce a fairly even distribution of hash values, so that collisions are spread across the table.
For our purposes, we will convert the key string to an integer
and then use the modulo operator to reduce the space.
A good rule of thumb is to use a prime number as the modulo base.
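One way this could look in Python is sketched below.
The notes do not say how the string is converted to an integer, so reading the key's bytes as one large base-256 number is an assumption of this sketch, as is the table size of 101.

```python
TABLE_SIZE = 101  # a prime, following the rule of thumb above (value assumed)


def key_to_int(key):
    """Read the key's UTF-8 bytes as one big-endian integer (assumed scheme)."""
    return int.from_bytes(key.encode("utf-8"), "big")


def hash_key(key, table_size=TABLE_SIZE):
    """Reduce the integer form of the key to an index in 0..table_size-1."""
    return key_to_int(key) % table_size


print(key_to_int("ab"))  # 24930, which is 97 * 256 + 98
print(hash_key("ab"))    # 84
```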
Collision Techniques
There are a large number of algorithms to deal with collisions.
We will look at two general classes of them.
Open Address
Open address techniques search the table systematically to find
an open slot to put the new key in.
There are a number of search methods.
In linear probing, after a collision is detected, the algorithm
looks linearly through the table until it finds an open slot.
When a lookup is done, the key is hashed and the table searched
in a similar way to find the entry.
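Linear probing can be sketched as follows.
This assumes integer keys, a fixed table of 13 slots, and a search that wraps around at the end of the table; the class and method names are illustrative only.

```python
EMPTY = None  # assumed empty-slot marker


class LinearProbingTable:
    def __init__(self, size=13):
        self.size = size
        self.slots = [EMPTY] * size

    def _hash(self, key):
        return key % self.size

    def insert(self, key, record_number):
        """Store the pair in the first open slot at or after the hashed slot."""
        start = self._hash(key)
        for step in range(self.size):
            index = (start + step) % self.size  # wrap around the table end
            slot = self.slots[index]
            if slot is EMPTY or slot[0] == key:
                self.slots[index] = (key, record_number)
                return index
        raise RuntimeError("hash table is full")

    def lookup(self, key):
        """Follow the same probe path; an empty slot means the key is absent."""
        start = self._hash(key)
        for step in range(self.size):
            index = (start + step) % self.size
            slot = self.slots[index]
            if slot is EMPTY:
                return None
            if slot[0] == key:
                return slot[1]
        return None


table = LinearProbingTable()
print(table.insert(1, 100))   # 1
print(table.insert(14, 101))  # 2, because slot 1 is already taken
print(table.lookup(14))       # 101
```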
One problem with this mechanism arises when a number of keys hash to the
same value.
Then, the collision chain from this value could take up
slots that other keys hash to.
Quadratic probing (sometimes called quadratic hashing) is an attempt to fix this.
Instead of looking in the adjacent slot for a place
to put a value after a collision, the algorithm
looks farther away.
This leaves gaps in the table so other keys are less likely to
find some other value in their slot.
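The notes do not give the exact probe sequence.
The sketch below assumes the common choice of looking at offsets 0, 1, 4, 9, ... from the hashed slot; the function names are made up.
With this sequence an insert can fail before the table is completely full, because not every slot is visited.

```python
EMPTY = None  # assumed empty-slot marker


def probe_sequence(key, size):
    """Yield slots at offsets 0, 1, 4, 9, ... from the hashed slot."""
    start = key % size
    for i in range(size):
        yield (start + i * i) % size


def insert(slots, key, record_number):
    for index in probe_sequence(key, len(slots)):
        if slots[index] is EMPTY or slots[index][0] == key:
            slots[index] = (key, record_number)
            return index
    raise RuntimeError("no open slot found on the probe sequence")


def lookup(slots, key):
    for index in probe_sequence(key, len(slots)):
        if slots[index] is EMPTY:
            return None
        if slots[index][0] == key:
            return slots[index][1]
    return None


slots = [EMPTY] * 13
for record_number, key in enumerate([1, 14, 27]):  # all three hash to 1
    insert(slots, key, record_number)

print([i for i, s in enumerate(slots) if s is not EMPTY])  # [1, 2, 5]
```

The three colliding keys land in slots 1, 2 and 5, leaving slots 3 and 4 free for keys that hash there.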
Problems
The biggest problem here is deciding how much space to allocate
for the hash table up front.
If there is not enough, eventually some keys will
be hashed and not be able to find open slots.
Too much, and the space is wasted.
One solution is to change the size of the hash table on the fly.
Deletion is a little tricky.
The method we use to determine if a key is not already in the table
is to follow the collision chain until we
hit an empty slot.
But if we delete a key by marking its slot empty, we
could end a chain prematurely.
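The sketch below reproduces the problem with linear probing.
It then shows one common remedy that is not described in these notes: marking the slot with a special "deleted" value (often called a tombstone) instead of marking it empty.
All names are illustrative.

```python
EMPTY = None
DELETED = object()  # tombstone marker (a technique assumed for this sketch)
SIZE = 13


def probe(key):
    start = key % SIZE
    for step in range(SIZE):
        yield (start + step) % SIZE


def insert(slots, key):
    first_free = None
    for index in probe(key):
        slot = slots[index]
        if slot is EMPTY:
            if first_free is None:
                first_free = index
            break  # end of the chain: the key is not in the table
        if slot is DELETED:
            if first_free is None:
                first_free = index  # reusable, but keep checking for the key
        elif slot == key:
            return index  # already present
    if first_free is None:
        raise RuntimeError("hash table is full")
    slots[first_free] = key
    return first_free


def contains(slots, key):
    for index in probe(key):
        slot = slots[index]
        if slot is EMPTY:
            return False  # an empty slot ends the chain
        if slot is not DELETED and slot == key:
            return True
    return False


def delete(slots, key, marker):
    """Replace the key's slot with marker (either EMPTY or DELETED)."""
    for index in probe(key):
        slot = slots[index]
        if slot is EMPTY:
            return False
        if slot is not DELETED and slot == key:
            slots[index] = marker
            return True
    return False


naive, safe = [EMPTY] * SIZE, [EMPTY] * SIZE
for table in (naive, safe):
    insert(table, 1)   # lands in slot 1
    insert(table, 14)  # collides with 1, lands in slot 2

delete(naive, 1, EMPTY)    # chain for hash value 1 is cut at slot 1
delete(safe, 1, DELETED)   # chain stays intact
print(contains(naive, 14))  # False, although 14 is still stored in slot 2
print(contains(safe, 14))   # True
```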
Chained Address
Chained addressing deals with several of these problems.
The hash function is the same as before, but instead
of storing the key at the hashed location, there is
a pointer stored there to a linked list of all the
values that hashed to the same place.
These pointers are all initialized to null.
Collisions are handled by adding the colliding key to the list.
Typically, they are added to the front of the list
as this is fastest.
The table only needs to be as large as the hash value space.
Deleting an entry means removing it from the list
it is on.
This has no effect on the other entries.
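A minimal sketch of chained addressing, assuming integer keys that are inserted at most once and a table of 13 list heads.
The class names are made up for illustration.

```python
class Node:
    def __init__(self, key, record_number, next_node=None):
        self.key = key
        self.record_number = record_number
        self.next = next_node


class ChainedTable:
    def __init__(self, size=13):
        self.heads = [None] * size  # every list pointer starts out null

    def _hash(self, key):
        return key % len(self.heads)

    def insert(self, key, record_number):
        """Add the new entry at the front of its list (no duplicate check)."""
        index = self._hash(key)
        self.heads[index] = Node(key, record_number, self.heads[index])

    def lookup(self, key):
        node = self.heads[self._hash(key)]
        while node is not None:
            if node.key == key:
                return node.record_number
            node = node.next
        return None

    def delete(self, key):
        """Unlink the entry from its list; other entries are untouched."""
        index = self._hash(key)
        prev, node = None, self.heads[index]
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.heads[index] = node.next
                else:
                    prev.next = node.next
                return True
            prev, node = node, node.next
        return False


table = ChainedTable()
for record_number, key in [(100, 1), (101, 14), (102, 27)]:  # all hash to 1
    table.insert(key, record_number)

table.delete(14)
print(table.lookup(1), table.lookup(14), table.lookup(27))  # 100 None 102
```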
Problems
The lists could get long and searching a list is somewhat
slower than searching an array.
There is additional space used by the links.
Conclusions
The chained hashing technique is probably the simplest
and most flexible.
Hashing trades space for time.
The search times can be almost independent of the number of items.