Analysis of Algorithms: Introduction, Operation Counts, Step Counts, and Counting Cache Misses
Introduction
The topic “Analysis of Algorithms” is concerned primarily with determining the memory (space) and time requirements (complexity) of an algorithm. Since the techniques used to determine memory requirements are a subset of those used to determine time requirements, in this chapter, we focus on the methods used to determine the time complexity of an algorithm.
The time complexity (or simply, complexity) of an algorithm is measured as a function of the problem size. Some examples are given below.
1. The complexity of an algorithm to sort n elements may be given as a function of n.
2. The complexity of an algorithm to multiply an m × n matrix and an n × p matrix may be given as a function of m, n, and p.
3. The complexity of an algorithm to determine whether x is a prime number may be given as a function of the number, n, of bits in x. Note that n = ⌈log₂(x + 1)⌉.
We partition our discussion of algorithm analysis into the following sections.
1. Operation counts.
2. Step counts.
3. Counting cache misses.
4. Asymptotic complexity.
5. Recurrence equations.
6. Amortized complexity.
7. Practical complexities.
See [1, 3–5] for additional material on algorithm analysis.
Operation Counts
One way to estimate the time complexity of a program or method is to select one or more operations, such as add, multiply, and compare, and to determine how many of each is done. The success of this method depends on our ability to identify the operations that contribute most to the time complexity.
Example 1.1
[Max Element] Figure 1.1 gives an algorithm that returns the position of the largest element in the array a[0:n-1]. When n > 0, the time complexity of this algorithm can be estimated by determining the number of comparisons made between elements of the array a. When n ≤ 1, the for loop is not entered. So no comparisons between elements of a are made.
When n > 1, each iteration of the for loop makes one comparison between two elements of a, and the total number of element comparisons is n-1. Therefore, the number of element comparisons is max{n-1, 0}. The method max performs other comparisons (for example, each iteration of the for loop is preceded by a comparison between i and n) that are not included in the estimate. Other operations, such as initializing positionOfCurrentMax and incrementing the for loop index i, are also not included in the estimate.
int max(int [] a, int n)
{
if (n < 1) return -1; // no max
int positionOfCurrentMax = 0;
for (int i = 1; i < n; i++)
if (a[positionOfCurrentMax] < a[i]) positionOfCurrentMax = i;
return positionOfCurrentMax;
}
FIGURE 1.1: Finding the position of the largest element in a[0:n-1].
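To make the operation count concrete, one can add a counter that is incremented once per comparison between elements of a. The sketch below is illustrative; the counter comparisonCount and the method name maxCounted are additions, not part of Figure 1.1.

static int comparisonCount; // counts comparisons between elements of a

static int maxCounted(int [] a, int n)
{
   if (n < 1) return -1; // no max
   int positionOfCurrentMax = 0;
   for (int i = 1; i < n; i++)
   {
      comparisonCount++; // one comparison between two elements of a
      if (a[positionOfCurrentMax] < a[i]) positionOfCurrentMax = i;
   }
   return positionOfCurrentMax;
}

A call with n > 1 increases comparisonCount by exactly n − 1; a call with n ≤ 1 leaves it unchanged.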
The algorithm of Figure 1.1 has the nice property that the operation count is precisely determined by the problem size. For many other problems, however, this is not so. Figure 1.2 gives an algorithm that performs one pass of a bubble sort. In this pass, the largest element in a[0:n-1] relocates to position a[n-1]. The number of swaps performed by this algorithm depends not only on the problem size n but also on the particular values of the elements in the array a. The number of swaps varies from a low of 0 to a high of n − 1.
void bubble(int [] a, int n)
{
for (int i = 0; i < n - 1; i++)
if (a[i] > a[i+1]) swap(a, i, i+1);
}
FIGURE 1.2: A bubbling pass.
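The pass above calls a swap method that is not shown in this excerpt. A minimal Java version is sketched below; the name swap and the array-plus-indices signature are assumptions (a two-argument swap of int values cannot work in Java, which passes primitives by value).

static void swap(int [] a, int i, int j)
{
   // exchange a[i] and a[j]
   int t = a[i];
   a[i] = a[j];
   a[j] = t;
}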
Since the operation count isn’t always uniquely determined by the problem size, we ask for the best, worst, and average counts.
Example 1.2
[Sequential Search] Figure 1.3 gives an algorithm that searches a[0:n-1] for the first occurrence of x. The number of comparisons between x and the elements of a isn’t uniquely determined by the problem size n. For example, if n = 100 and x = a[0], then only 1 comparison is made. However, if x isn’t equal to any of the a[i]s, then 100 comparisons are made.
A search is successful when x is one of the a[i]s. All other searches are unsuccessful.
Whenever we have an unsuccessful search, the number of comparisons is n. For successful searches the best comparison count is 1, and the worst is n. For the average count, assume that all array elements are distinct and that each is searched for with equal frequency. The average count for a successful search is

(1 + 2 + · · · + n)/n = (n + 1)/2
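Figure 1.3 is not reproduced in this excerpt. The sketch below is a Java sequential search consistent with the comparison counts discussed here; the exact code of Figure 1.3 may differ.

int sequentialSearch(int [] a, int n, int x)
{
   // return the position of the first occurrence of x in a[0:n-1]; return -1 if x is absent
   int i;
   for (i = 0; i < n && x != a[i]; i++);
   if (i == n) return -1; // unsuccessful search: n comparisons with elements of a
   else return i;         // successful search with x = a[i]: i + 1 comparisons
}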
Example 1.3
[Insertion into a Sorted Array] Figure 1.4 gives an algorithm to insert an element x into a sorted array a[0:n-1].
We wish to determine the number of comparisons made between x and the elements of a. For the problem size, we use the number n of elements initially in a. Assume that n ≥ 1.
The best or minimum number of comparisons is 1, which happens when the new element x is to be inserted at the right end.
void insert(int [] a, int n, int x)
{
// find proper place for x
int i;
for (i = n - 1; i >= 0 && x < a[i]; i--)
a[i+1] = a[i];
a[i+1] = x; // insert x
}
FIGURE 1.4: Inserting into a sorted array.
The maximum number of comparisons is n, which happens when x is to be inserted at the left end. For the average, assume that x has an equal chance of being inserted into any of the n + 1 possible positions. If x is eventually inserted into position i+1 of a, i ≥ 0, then the number of comparisons is n − i. If x is inserted into a[0], the number of comparisons is n. So the average count is

((1 + 2 + · · · + n) + n)/(n + 1) = (n(n + 1)/2 + n)/(n + 1) = n/2 + n/(n + 1)
This average count is almost 1 more than half the worst-case count.
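As a check on these counts, one can instrument the insertion with a comparison counter and try every possible insert position. The sketch below is illustrative; the counter, the method name, and the assumption that a has room for one more element are additions to Figure 1.4.

static int comparisons; // comparisons between x and elements of a in the last call

static void insertCounted(int [] a, int n, int x)
{
   comparisons = 0;
   int i;
   for (i = n - 1; i >= 0; i--)
   {
      comparisons++;           // one comparison between x and a[i]
      if (!(x < a[i])) break;  // same test as in Figure 1.4
      a[i+1] = a[i];
   }
   a[i+1] = x;
}

For a sorted array with n = 4, the counts over the five possible insert positions are 4, 4, 3, 2, and 1; their average, 14/5 = 2.8, equals n/2 + n/(n + 1).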
Step Counts
The operation-count method of estimating time complexity omits accounting for the time spent on all but the chosen operations. In the step-count method, we attempt to account for the time spent in all parts of the algorithm. As was the case for operation counts, the step count is a function of the problem size.
A step is any computation unit that is independent of the problem size. Thus 10 additions can be one step; 100 multiplications can also be one step; but n additions, where n is the problem size, cannot be one step. The amount of computing represented by one step may be different from that represented by another. For example, the entire statement
return a+b+b*c+(a+b-c)/(a+b)+4;
can be regarded as a single step if its execution time is independent of the problem size.
We may also count a statement such as
x = y;
as a single step.
To determine the step count of an algorithm, we first determine the number of steps per execution (s/e) of each statement and the total number of times (i.e., frequency) each statement is executed. Combining these two quantities gives us the total contribution of each statement to the total step count. We then add the contributions of all statements to obtain the step count for the entire algorithm.
Example 1.4
[Sequential Search] Tables 1.1 and 1.2 show the best- and worst-case step-count analyses for sequentialSearch (Figure 1.3).
For the average step-count analysis for a successful search, we assume that the n values in a are distinct and that in a successful search, x has an equal probability of being any one of these values. Under these assumptions the average step count for a successful search is the sum of the step counts for the n possible successful searches divided by n. To obtain this average, we first obtain the step count for the case x = a[j], where j is in the range [0, n − 1] (see Table 1.3).
Now we obtain the average step count for a successful search. The step count when x = a[j] is j + 4 (Table 1.3), so the average is

((0 + 4) + (1 + 4) + · · · + ((n − 1) + 4))/n = (n + 7)/2
This value is a little more than half the step count for an unsuccessful search.
Now suppose that successful searches occur only 80 percent of the time and that each a[i] still has the same probability of being searched for. The average step count for sequentialSearch is
.8 ∗ (average count for successful searches) + .2 ∗ (count for an unsuccessful search)
= .8(n + 7)/2 + .2(n + 3)
= .6n + 3.4
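One way to verify step counts like these mechanically is to add a global counter to the code and increment it once for each step executed. The sketch below instruments a sequential search in this way; the counter and the method are illustrative and assume the sequentialSearch sketch given earlier.

static int stepCount; // global step counter

static int sequentialSearchSteps(int [] a, int n, int x)
{
   int i;
   stepCount++;        // one step for the declaration int i;
   for (i = 0; ; i++)
   {
      stepCount++;     // one step per evaluation of the for loop condition
      if (!(i < n && x != a[i])) break;
   }
   stepCount++;        // one step for the if statement
   if (i == n) return -1;
   stepCount++;        // one step for the else return
   return i;
}

An unsuccessful search executes 1 + (n + 1) + 1 = n + 3 steps, and a successful search with x = a[j] executes 1 + (j + 1) + 1 + 1 = j + 4 steps, which is consistent with the average (n + 7)/2 and the unsuccessful-search count n + 3 used above.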
Counting Cache Misses
Traditionally, the focus of algorithm analysis has been on counting operations and steps. Such a focus was justified when computers took more time to perform an operation than they took to fetch the data needed for that operation. Today, however, the cost of performing an operation is significantly lower than the cost of fetching data from memory. Consequently, the run time of many algorithms is dominated by the number of memory references (equivalently, the number of cache misses) rather than by the number of operations. Hence, algorithm designers focus not only on reducing the number of operations but also on reducing the number of memory accesses and on designing algorithms that hide memory latency.
Consider a simple computer model in which the computer’s memory consists of an L1 (level 1) cache, an L2 cache, and main memory. Arithmetic and logical operations are performed by the arithmetic and logic unit (ALU) on data resident in registers (R). Figure 1.5 gives a block diagram for our simple computer model.
Typically, the size of main memory is tens or hundreds of megabytes; L2 cache sizes are typically a fraction of a megabyte; L1 cache is usually in the tens of kilobytes; and the number of registers is between 8 and 32. When you start your program, all your data are in main memory.
To perform an arithmetic operation such as an add, in our computer model, the data to be added are first loaded from memory into registers, the data in the registers are added, and the result is written to memory.
Let one cycle be the length of time it takes to add data that are already in registers. The time needed to load data from L1 cache to a register is two cycles in our model. If the required data are not in L1 cache but are in L2 cache, we get an L1 cache miss and the required data are copied from L2 cache to L1 cache and the register in 10 cycles. When the required data are not in L2 cache either, we have an L2 cache miss and the required data are copied from main memory into L2 cache, L1 cache, and the register in 100 cycles. The write operation is counted as one cycle even when the data are written to main memory because we do not wait for the write to complete before proceeding to the next operation. For more details on cache organization, see [2].
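The access costs of this model can be collected into a small helper for later back-of-the-envelope calculations; the method below is illustrative, not part of the text.

// cycles to bring one operand into a register under the simple model
static int loadCycles(boolean inL1, boolean inL2)
{
   if (inL1) return 2;   // data already in L1 cache
   if (inL2) return 10;  // L1 miss: copy from L2 to L1 and the register
   return 100;           // L2 miss: copy from main memory to L2, L1, and the register
}
// an add of register-resident data costs 1 cycle; a store is also counted as 1 cycle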
Effect of Cache Misses on Run Time
For our simple model, the statement a = b + c is compiled into the computer instructions
load b; load c; add; store a;
where the load operations load data into registers and the store operation writes the result of the add to memory. The add and the store together take two cycles. The two loads may take anywhere from 4 cycles to 200 cycles depending on whether we get no cache miss, L1 misses, or L2 misses. So the total time for the statement a = b + c varies from 6 cycles to 202 cycles. In practice, the variation in time is not as extreme because we can overlap the time spent on successive cache misses.
Suppose that we have two algorithms that perform the same task. The first algorithm does 2000 adds, which require 4000 load, 2000 add, and 2000 store operations; the second algorithm does 1000 adds, which require 2000 load, 1000 add, and 1000 store operations. The data access pattern for the first algorithm is such that 25 percent of the loads result in an L1 miss and another 25 percent result in an L2 miss. For our simplistic computer model, the time required by the first algorithm is 2000 ∗ 2 (for the 50 percent of loads that cause no cache miss) + 1000 ∗ 10 (for the 25 percent that cause an L1 miss) + 1000 ∗ 100 (for the 25 percent that cause an L2 miss) + 2000 ∗ 1 (for the adds) + 2000 ∗ 1 (for the stores) = 118,000 cycles. If the second algorithm has 100 percent L2 misses, it takes 2000 ∗ 100 (for the loads) + 1000 ∗ 1 (for the adds) + 1000 ∗ 1 (for the stores) = 202,000 cycles. So the second algorithm, which does half the work of the first, actually takes about 71 percent more time than the first algorithm.
Computers use a number of strategies to reduce the number of cache misses and hence the run time of a program. For example, they prefetch data that will be needed in the near future into cache, and when a cache miss occurs, they load not only the needed data but also the contents of some number of adjacent bytes. These strategies are most effective when successive computer operations use adjacent bytes of main memory.
Although our discussion has focused on how cache is used for data, computers also use cache to reduce the time needed to access instructions.
Matrix Multiplication
The algorithm of Figure 1.6 multiplies two square matrices that are represented as two-dimensional arrays. It performs the computation

c[i][j] = a[i][0] * b[0][j] + a[i][1] * b[1][j] + · · · + a[i][n-1] * b[n-1][j], for all 0 ≤ i, j < n.
Figure 1.7 is an alternative algorithm that produces the same two-dimensional array c as is produced by Figure 1.6.
We observe that Figure 1.7 has two nested for loops that are not present in Figure 1.6 and does more work than is done by Figure 1.6 with respect to indexing into the array c. The remainder of the work is the same.
You will notice that if you permute the order of the three nested for loops in Figure 1.7, you do not affect the result array c. We refer to the loop order in Figure 1.7 as ijk order. When we swap the second and third for loops, we get ikj order. In all, there are 3! = 6 ways in which we can order the three nested for loops. All six orderings result in methods that perform exactly the same number of operations of each type. So you might think all six take the same time. Not so. By changing the order of the loops, we change the data access pattern and so change the number of cache misses. This in turn affects the run time.
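Figure 1.7 is not reproduced in this excerpt. The sketch below is an ijk-order multiplication consistent with the description above; the method name and the element type double are assumptions.

static void multiplyIjk(double [][] a, double [][] b, double [][] c, int n)
{
   for (int i = 0; i < n; i++)     // the two extra nested for loops:
      for (int j = 0; j < n; j++)  // initialize c to zero
         c[i][j] = 0;
   for (int i = 0; i < n; i++)     // ijk order
      for (int j = 0; j < n; j++)
         for (int k = 0; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];
}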
In ijk order, we access the elements of a and c by rows; the elements of b are accessed by column. Since elements in the same row are in adjacent memory and elements in the same column are far apart in memory, the accesses of b are likely to result in many L2 cache misses when the matrix size is too large for the three arrays to fit into L2 cache. In ikj order, the elements of a, b, and c are accessed by rows. Therefore, ikj order is likely to result in fewer L2 cache misses and so has the potential to take much less time than taken by ijk order.
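Swapping the second and third for loops of the accumulation phase gives ikj order. A sketch under the same assumptions:

static void multiplyIkj(double [][] a, double [][] b, double [][] c, int n)
{
   for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
         c[i][j] = 0;
   for (int i = 0; i < n; i++)     // ikj order: a, b, and c are all
      for (int k = 0; k < n; k++)  // accessed by rows
         for (int j = 0; j < n; j++)
            c[i][j] += a[i][k] * b[k][j];
}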
For a crude analysis of the number of cache misses, assume that we are interested only in L2 misses, that an L2 cache line can hold w matrix elements, that a block of w matrix elements is brought into an L2 cache line when an L2 cache miss occurs, and that the L2 cache is small compared to the size of a matrix. Under these assumptions, the accesses to the elements of a, b, and c in ijk order result in n³/w, n³, and n²/w L2 misses, respectively. Therefore, the total number of L2 misses in ijk order is n³(1 + w + 1/n)/w. In ikj order, the number of L2 misses for our three matrices is n²/w, n³/w, and n³/w, respectively. So, in ikj order, the total number of L2 misses is n³(2 + 1/n)/w. When n is large, the ratio of ijk misses to ikj misses is approximately (1 + w)/2, which is 2.5 when w = 4 (for example, when we have a 32-byte cache line and double-precision data) and 4.5 when w = 8 (for example, when we have a 64-byte cache line and double-precision data). For a 64-byte cache line and single-precision (i.e., 4-byte) data, w = 16 and the ratio is approximately 8.5.
Figure 1.8 shows the normalized run times of a Java version of our matrix multiplication algorithms.
In this figure, mult refers to the multiplication algorithm of Figure 1.6.
The normalized run time of a method is the time taken by the method divided by the time taken by ikj order.
Matrix multiplication using ikj order takes 10 percent less time than ijk order when the matrix size is n = 500 and 16 percent less time when the matrix size is n = 2000. Equally surprising is that ikj order runs faster than the algorithm of Figure 1.6 (by about 5 percent when n = 2000), even though ikj order does more work than the algorithm of Figure 1.6.