David Stanovský    // firstname.secondname@gmail.com   

How to do and present experiments (written for my Algorithms & Data Structures course @KBTU)

An important part of a programmer's job are experiments. It often happens that one problem has several solutions (algorithms) and it is not clear which approach is the best. Or, a more complex question, which solution is the best for which type of data. Then your boss comes and asks you to determine that. And the boss requires an honest, simple and clear presentation of the outcome of your research. Here is the guidlines you shall follow in your report.

1) Formulate the problem.
For instance, state that you investigate Algorithm A and Algorithm B on data of types X, Y and Z.
In our homework, you have one algorithm YourFavouriteSort, and three kinds of data: random, sorted and reverse sorted.

2) Formulate a hypothesis.
For instance, state that you expect that Algorithm B performs better on data of type X, and on types Y,Z, both algorithms will be roughly equal. Another example: state an expectation that the running time of Algorithm A on data of type X is quadratic.
In our homework, for example, you can formulate a hypothesis that MergeSort will perform the same way on all kinds of data, and the running time shall be O(n log n).

3) Run experiments.
Implement the algorithms, generate data and record run times, for instance into an excel table. Choose big enough data so that your algorithms run for at least 0.2 seconds. And small enough so that you don't wait a week for the outcome. For drawing a graph, about 10 points shall be sufficient to see the development.
In our homework, first determine N such that your program runs in about a second, and then run it on all types of data of sizes N, 2N, 3N, ..., 10N.

4) Present your data.
Usually plotting a graph (or more graphs) is the best way to present your data (your boss hates numbers and loves pictures!).

5) Make a conclusion.
Confirm or refute your hypothesis. If you see any other interesting features of the data, state them.

How do you recognize that your algorithm is linear, quadratic etc.? Let alone rigorous statistical methods such as regression (this is not part of our course), you can use the following very simple "qualified guess".

  • For example, consider complexity function T(n)=cn. Then T(2n)=2cn=2T(n), i.e., if you double the size of the input, the running time also doubles. Check if this applies to your data.
  • Now consider complexity function T(n)=cn^2. Then T(2n)=4cn^2=4T(n), i.e., if you double the size of the input, the running time quadruples. Check if this applies to your data.
  • Recognizing complexity function T(n)=cnlog(n) is trickier: T(2n)=2cnlog(2n)=2cn(log(n)+1)=2T(n)+2cn, i.e., if you double the size of the input, the running time a bit more than doubles, and the relative difference between T(2n) and 2T(n) tends to decrease as n grows.

Here is a a sample excel table with made up experimental data and graphs (feel free to reuse this file in your homework). Following the guidlines, you can confirm a hypothesis that Algorithm A runs on data type X in linear time, but on data type Y in quadratic time. Data type Z shows an nlog(n) behaviour: this is nearly impossible to recognize from the graph, but the numbers suggest so: while T(1000) is nearly 2T(500), the value T(400) is somewhat larger than 2T(200). (A warning again: this is just a "qualified guess", take a statistics course for a formal apporach into data analysis).