Efficient Algorithms

Efficient Algorithms. It consists of showing how algorithms compare in terms of run-time efficiency. The search in an array is taken as an example, comparing the efficiency of the sequential search with the binary search .

Table of Contents

Summary

[ hide ]

1 Sequential search in arrays
2 Motivation
- 1 Binary search in arrays
- 2 Sequential search analysis
- 3 Conclusion
- 4 Observation
- 5 Recursive binary search
- 6 Sources

Sequential search in arrays

Motivation

There is a post.dat file with the data of the students who gave the PAA. Each line of this file contains the identity card (ci) and the name of an applicant. There is a second mat.dat file with the students’ scores on the specific math test. Each line contains a student’s ci followed by his score. For example, the content of both files could be:

post.dat mat.dat ——– ——- 11456789-6: Juan Gonzalez: … 12245894-7: 715 12245894-7: Pedro Fernandez: … 11478387-1: 801 … …

From the file it is concluded that Pedro Fernandez obtained 715 points. Also note that not all students take the math test so the mat.dat file contains fewer lines than post.dat. The university applicants are 150,000 students and those who take the math test about 100,000. The problem is to produce a post-mat.dat file that contains a line for each applicant who takes the math test with their name and their score. For example:

12245894-7: Pedro Fernandez: 715 11478387-1: Alberto Fuenzalida: 612 …

Reading a file with delimiters So far the files that we have used in the course are structured so that each data field has a fixed length. Therefore, it is easy to extract the entire line field by means of the substring function. However, this mechanism is too fragile as it is easy to mistake the exact position of the data in the file.

For this reason, a more robust mechanism for reading files with data fields was included in the course library. The idea is that files should be structured so that the data fields are separated by a delimiter (typically “:”). This delimiter is specified when creating the file reader or writer. When reading a file, the TextReader class transparently takes care of locating the delimiter and extracting the data field. Similarly, when writing to the file, the TextWriter class transparently adds the delimiter character to separate the fields.

As an example, the following code reads the post.dat file and leaves the content in two arrays of cis names and names:

TextReader lect = new TextReader (“post.dat”, “:”); String [] cis = new String [1000]; String [] names = new String [1000]; int i = 0; while (! lect.eofReached ()) { cis [i] = lect.readString (); names [i] = lect.readString (); lect.skip (); i ++; } int nalum = i;

In general, the reading pattern that we will use from here on is:

TextReader lect = new TextReader (… name …, “:”); … while (! lect.eofReached ()) {// terminate if file finished … = lect.read … (); // read a field … = lect.read … (); // read another field … // the rest of the fields lect.skip (); // change line … // process the line }

Likewise, to write to a separate file with delimiters the pattern will be used:

TextWriter write = new TextWriter (… name …, “:”); … while (…) { … write.print (…); // write a field write.print (…); // write another field … write.println (…); // write the last field and change the line }

First solution (inefficient) For each line in the mat.dat file:

The ci of the applicant is obtained. It looks for that ci in the file post.dat. The ci, the name and the score are written in post-mat. Problem: With modern processors and disks, a file with 150,000 lines can be read in approximately one second. But since it has to be read 100,000 times, the program will take 100,000 seconds. That is 28 hours! This time is unacceptable.

Array Sequential Search Reading a 150,000-line file from disk is inefficient. Time can be improved by reading all 150,000 lines and placing them in 2 arrays in memory:

With this, the program does not change in essence, only now each ci is searched in memory and not on disk. As accessing data in memory is more efficient than accessing it on disk, the time can be reduced by a factor of 10. In other words, the time can be reduced to just under 3 hours (which is still unacceptable). To solve the problem, it is decided to write a function that looks for a string x in an array a and returns what position it is in:

int findSec (String x, String [] a, int n) { int i = 0; while (i <n) { if (compare (x, a [i]) == 0) return i; i = i + 1; } return -1; // Not found in array }

This type of search is called a sequential search in arrays, because the elements are visited in sequence: element 0, 1, 2, …, and so on until the element that contains the searched value is found. How many iterations does the loop take?

Worst case: n iterations. Best case: 1 Average case: n / 2 approx. To specify the solution to the problem, the program that writes the post-mat.txt file is:

lect = new TextReader (“mat.dat”, “:”); TextWriter write = new TextWriter (“post-mat.dat”, “:”); while (! lect.eofReached ()) { String card = lect.readString (); int ptje = lect.readInt (); lect.skip (); // We obtain the card position in the cis array int k = searchSec (card, cis, nalum); if (k == – 1) println (“card not found” + card); else { write.print (card); write.print (names [k]); write.println (ptje); } }

Binary search in arrays

Binary search: If the cis array is ordered from least to greatest (in lexicographic order), a much more efficient search function can be written. To find x in an array a:

Compare x to the element in the middle. If x is equal, then that is the index. If x is less, look in the first half. If x is greater, look in the second half. If at some point the search range becomes empty, then x is not in the array. The function is:

int searchBin (String x, String [] a, int n) { int imin = 0; int imax = n-1; while (imin <= imax) { int center = (imin + imax) / 2; int comp = compare (x, a [center]); if (comp == 0) return center; if (comp <0) imax = center-1; else imin = center + 1; } return -1; // Not found in array }

This type of search is called a binary search, because the size of the search interval is divided by 2 at each iteration. Note that the function has more lines than sequential search, and that each iteration requires more work to be done. The question is then, which is better?

Sequential search analysis

The analysis of an algorithm consists of estimating the time it takes that algorithm based on the size of the data that it must manipulate. It is usually preferred to express these times in a unit independent of processor speed. For example, for sequential search, the time it takes to execute one iteration of the loop can be used as a unit. In this case, that time is a constant value (although it varies depending on the processor speed, but not significantly).

Typically, the criteria used to compare algorithms are execution times for large n values, where n is the size of the data. O (f (n)) notation To indicate the complexity of an algorithm, the notation O (f (n)) is used. It is said that an algorithm takes time O (f (n)) if there is a k and c such that for all n> = k it is always verified: execution time for size n <= c * f (n) This means that the time of execution grows at most at the rate of f (n) except for a constant factor.

Therefore, the sequential search takes, in the worst case, O (n) time, that is, time proportional to the number of elements in the array. In the average case it takes O (n / 2) time but this is the same as O (n).

So if you need to search for an item, the sequential search takes a reasonable time, but don’t forget that 100,000 such searches are needed to solve our initial problem, with n = 150,000.

Analysis: Best case: 1 iteration. Worst case: log2 n iterations. Average case: a little better than log2 n. Therefore the execution time takes time O (log n). For the case where n is 150000, a binary search will need only 18 iterations of the loop (worst case). That is, 4000 times fewer iterations than in sequential search. But each iteration is now a bit more complex. Let’s say 2 times more complex and therefore takes twice as long. So let’s compare the times when performing 100,000 sequential searches and 100,000 binary searches:

Type of Iterations in Iterations Cost by Time Search function of n when n = 150,000 iteration Total Sequential O (n) 75000 (average) 1 2 hours Binary O (log2 n) 18 2 3.6 seconds Problem of the binary search: it is necessary to order previously the arrangement by identity card. We will soon see that there are very efficient sorting algorithms with execution times O (n log n).

conclusion

When comparing algorithms to solve the same problem, the essential thing is to compare the growth rate of their execution times and not how many lines each program has or how complex each iteration is. The growth rate of the execution time is expressed by means of the notation O (f (n)). With this, it is not saying that time is exactly f (n), but rather that at most it grows at the speed of f (n). If two algorithms F and G take time O (f (n)) and O (g (n)) respectively. F is said to be better than G if:

limit of F (n) / G (n) -> 0 when n -> infinity

Therefore the binary search is better than the sequential search because log (n) / n tends to 0. In practice this means that it does not matter that an algorithm F is slower than G for small values of n, which what is really relevant for users is which algorithm behaves best when they must process large volumes of data.

Observation

The problem of determining the speed of growth of the execution time of an algorithm is one of the complex problems of computation. It is usually even more difficult than programming the algorithm. In this course, we will limit ourselves to estimating this speed for simple algorithms (and it will not be asked during the checks).

Recursive binary search

There is a variant of searching in an ordered array that can be implemented naturally using recursion. The searchBinRec function searches for a string x in an array of string a, in a range of indices [imin, imax]. This function also returns the index where x is located. The recursive algorithm is:

Base case:

If imin> imax, the interval is empty, so x is not in the array in that interval and -1 is returned. If not, we compute centro = (imin + imax) / 2 and compare x to [center]. If they are equal, x is in a [center] so center is returned.

Recursive case:

If x <a [center] the interval [imin, center-1] is searched recursively. If x> a [centre] it is recursively searched in the interval [centre + 1, imax]. The function is:

int searchBinRec (String x, String [] a, int imin, int imax) { if (imin> imax) return -1; int center = (imin + imax) / 2; int comp = compare (x, a [center]); if (comp == 0) return center; if (comp <0) return findBinRec (x, a, imin, center-1); else return findBinRec (x, a, centro + 1, imax); }

Analysis: this algorithm is identical to the non-recursive binary search, therefore the execution times are equivalent, except for a constant factor. In effect, the recursive solution is obtained by replacing the iterations of the non-recursive solution by a call to a function. Function calls are slightly more expensive than iterations in a loop.

Therefore, if the concern is run-time efficiency, non-recursive solutions should be preferred. However, there are problems that are very difficult to bring to non-recursive solutions (think of the towers of Hanoi). When the concern is the clarity of a solution, it is common to opt instead for recursion.