Sorting and Order Notes

Much of a computer's time is spent sorting and searching through data, and an enormous amount of research effort has gone into making these operations more efficient. In this lecture, we will discuss some sorting and searching algorithms and how to compare them.

Order

The order of an algorithm is a measure of how its running time grows as the amount of data grows. It is not an equation that can be used to predict the running time of a program that implements the algorithm. As with limits in calculus, only the highest-order term counts.

A single statement is of order 1. This is written as O(1). This is known as 'big O' notation. This means that the time to execute this statement is constant. It may be a different constant from statement to statement and the actual time to run it will be different. But the time to run it is bounded by a constant and doesn't depend on the amount of data being processed.

A for loop is a different matter. For example,

for(int i=0; i<N; i++)
	x += i;   // this statement is O(1)
The loop runs N times and the body of it is O(1). So the order of the loop is O(N). This means that the time to execute the loop is proportional to the number of items. This tells us little about the actual number of instructions run or the time it takes. It tells us that the time increases linearly as N increases.

Now let's look at a nested loop.

for(int i=0; i<N; i++)
	for(int j=0; j<N; j++)
		x = j*i;   // this statement is O(1)
The inner loop runs N times for each execution of the outer loop. So its body runs N*N times. Therefore the order of a nested loop is O(N^2).

Now let's look at part of a sort routine.

for(int i=N-1; i>0; i--)   // start at N-1 so that a[j+1] stays inside the array
	for(int j=0; j<i; j++)
		if(a[j] > a[j+1]) swap(a[j],a[j+1]);  // This is O(1)
The inner loop doesn't run to N, it runs to i. So it executes N-1, N-2, N-3, ..., 3, 2, 1 times. The sum of these is the total number of times the inner loop body runs. This is the sum of an arithmetic series, so the sum is N(N-1)/2.

So the algorithm is of order O((N^2-N)/2). But like limits, we can throw out the -N term because it is small compared to the N^2. And dividing by 2 doesn't matter either. So the algorithm is of order O(N^2). The actual running time is about half that of the earlier nested loop, but constant factors do not change the order, and we are interested in upper bounds.

Searching

Here is a very simple searching algorithm called a linear search.

for(int i=0; i < N; i++)
	if(a[i] == target) {
		cout << "we found it\n";
		break;
	}

This looks through the array until it finds the target. Then it prints a message and exits the loop. In this case, we are going to look at three order calculations: the best case, the worst case and the average case. The best case is that the target is in a[0]; this is O(1). The worst case is that the target is in a[N-1] (or is not in the array at all); this is O(N). On average, with the target equally likely to be in any position, the search makes (N+1)/2 comparisons. Strictly speaking that is still O(N), but we leave the count unsimplified here so that the three cases can be compared.

Now let's look at a binary search. This search technique cuts the space it has to search in half each time. It is completely dependent on the assumption that the array is sorted before the search.

void
bsearch(int A[], int first, int size, int target, bool *found, int *loc)
{
	int middle;
	if(size == 0) {  // it isn't here
		*found = false;
		return;
	}
	middle = first + size/2;  // index of the middle of the range being searched
	if(target == A[middle]) {
		*loc = middle;
		*found = true;
		return;
	}
	if(target < A[middle])  // it is in the left half
		bsearch(A, first, size/2, target, found, loc);
	else // it is in the right half
		bsearch(A, middle+1, (size-1)/2, target, found, loc);
} // bsearch

Every time this is called, it searches half the remaining space. So the size of the space left to search looks like N, N/2, N/4, ..., 4, 2, 1. If we take the log base 2 of these numbers, we get an arithmetic sequence: log2(N), log2(N)-1, log2(N)-2, ..., 2, 1, 0.

Counting the terms of that sequence, we get at most log2(N) + 1 calls, so the order is O(log2(N) + 1). This simplifies to O(log2(N)). So this is pretty fast in comparison to the linear search. However, there is a price. You must either sort the data first, or keep it sorted as it is used. This must be added to calculate the total cost of the algorithm. If the data can be kept sorted, the total cost is still less than a linear search of unordered data. If there are 64 items, the linear search takes about 32 comparisons on average, while the binary search takes about 6.