Correctness checking

From HPC Wiki

This is a short overview of the basic concepts of correctness checking (debugging).

General

Trial and error is rarely efficient with large programs. It is easy to lose track of what has already been tested and adjusted, and small changes are easily forgotten when trying to recover an earlier state. It is therefore advisable to tackle errors and warnings systematically. A good first step is to keep a logbook (a simple text file is enough) and write down exactly which error message came up, whether and how it could be reproduced, and which actions were taken to understand what happened (e.g. inspecting a core file). Bug tracking systems (e.g. Bugzilla, Trac) can also be helpful.
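
As an illustration of such a logbook, the following minimal Python sketch appends timestamped entries to a plain text file. The function names, the field layout and the file name `logbook.txt` are assumptions made for this example, not an established convention.

```python
from datetime import datetime, timezone
from pathlib import Path


def format_entry(error_message, reproducible, steps, actions):
    """Build one plain-text logbook entry (the layout is a hypothetical choice)."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    lines = [
        f"[{timestamp}]",
        f"error:        {error_message}",
        f"reproducible: {'yes' if reproducible else 'no'}",
        f"how:          {steps}",
        f"actions:      {actions}",
    ]
    # A trailing blank line separates consecutive entries.
    return "\n".join(lines) + "\n\n"


def append_entry(logbook, error_message, reproducible, steps, actions):
    """Append one entry to the logbook file and return the text written."""
    entry = format_entry(error_message, reproducible, steps, actions)
    with Path(logbook).open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return entry


if __name__ == "__main__":
    append_entry(
        "logbook.txt",  # hypothetical file name
        "Segmentation fault",
        True,
        "occurs with the small input on 4 processes",
        "inspected the core file",
    )
```

Opening the file in append mode keeps earlier entries intact, so the logbook grows into a chronological record of the debugging session.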

While debugging, it is also recommended to simplify the program, e.g. by reducing the input size and the number of processes, and to reduce the number of compiler warnings. The resulting small test cases can be reused when making changes later on, so it is worth keeping them. Note that for parallel programs the number of processes or threads must not be reduced too far: issues such as data races need at least two concurrent processes or threads to occur, and may only show up at larger counts.
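
The following Python sketch illustrates why a parallel bug can vanish when the test case is reduced too far. It increments a shared counter without a lock and deliberately forces the problematic interleaving with a barrier, so that the lost updates are reproducible; in a real program the interleaving depends on timing. The function name `unsafe_count` and its parameters are assumptions made for this example.

```python
import threading


def unsafe_count(num_threads, rounds=100):
    """Increment a shared counter without a lock and return its final value.

    A correct program would return num_threads * rounds.
    """
    counter = {"value": 0}
    barrier = threading.Barrier(num_threads)

    def worker():
        for _ in range(rounds):
            barrier.wait()               # all writes of the previous round are done
            seen = counter["value"]      # read the shared value
            barrier.wait()               # forced: every thread reads before any writes
            counter["value"] = seen + 1  # write back a value that may be stale

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter["value"]


if __name__ == "__main__":
    for n in (1, 4):
        print(f"{n} thread(s): got {unsafe_count(n)}, expected {n * 100}")
```

With a single thread the result is always correct, so the bug is invisible. With several threads, all of them read the same value in each round and all but one update is lost.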

It is also advisable to use a version control system (e.g. git, svn) to be able to recover an earlier state of the program.

Common issues

Small mistakes can produce surprising error messages, which often point to a location unrelated to where the mistake was actually made. Things to look out for:

  • Are all variables initialized?
  • Are there unused variables (written but never read)?
  • Is there code that is never reached (e.g. because of a broken if-statement)?
  • Beware of pointers (e.g. null, dangling or out-of-bounds pointers)
  • What are the defaults on the system you are using (e.g. a stack size that is too small)?
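
For the last item, system defaults such as the stack size limit can be queried from within a program. The following minimal Python sketch uses the standard `resource` module, which is only available on Unix-like systems; the helper names `stack_limits` and `describe_limit` are assumptions made for this example.

```python
import resource  # standard library, Unix-like systems only


def describe_limit(value):
    """Render a limit given in bytes as text; RLIM_INFINITY means no limit."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


def stack_limits():
    """Return the (soft, hard) stack size limits of this process in bytes."""
    return resource.getrlimit(resource.RLIMIT_STACK)


if __name__ == "__main__":
    soft, hard = stack_limits()
    print("stack soft limit:", describe_limit(soft))
    print("stack hard limit:", describe_limit(hard))
```

The soft limit is the one that is actually enforced. Printing it at the start of a job script can help to spot differences between a login node and a compute node.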

Debugging tools

A cluster usually offers several debugging tools, which can mostly be divided into two types: interpretive and direct execution. Interpretive tools work on the source code and machine code level and simulate parts of the program, while direct-execution tools attach to the running program and monitor its internal state at runtime. The most common strategies are line-by-line execution and the use of breakpoints, which let you skip over long and irrelevant parts of the program.
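
The idea of monitoring a program's internal state line by line, or only at breakpoints, can be sketched in a few lines of Python with the standard `sys.settrace` hook. This is an illustration of the concept only, not how the debuggers installed on a cluster are implemented; `trace_lines`, `break_at` and `running_sum` are hypothetical names chosen for this example.

```python
import sys


def trace_lines(func, *args, break_at=None):
    """Run func(*args) and record (line offset, local variables) per line.

    Offsets are counted from the 'def' line of func. If break_at is a set of
    offsets, only those lines are recorded, similar to breakpoints.
    """
    records = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace functions called by func
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            if break_at is None or offset in break_at:
                # Snapshot of the state just before the line is executed.
                records.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, records


def running_sum(values):
    total = 0
    for v in values:
        total += v
    return total


if __name__ == "__main__":
    # "Breakpoint" on the line 'total += v' (offset 3 from the def line).
    result, records = trace_lines(running_sum, [1, 2, 3], break_at={3})
    for offset, local_vars in records:
        print(f"line +{offset}: total={local_vars['total']} v={local_vars['v']}")
    print("result:", result)
```

Recording every line corresponds to single-stepping through the function, while passing `break_at` skips everything except the lines of interest.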