This is a short overview of the basic concepts of correctness checking (debugging).
The common approach of trial and error is usually not very efficient for large programs. It is easy to lose track of what has been tested and adjusted, and small changes are easily forgotten when trying to recover an earlier state. It is therefore advisable to tackle errors and warnings systematically. A good first step is to keep a logbook (a simple text file is enough) and write down exactly which error message came up, whether and how it could be reproduced, and which actions were taken to understand what happened (e.g. inspecting a core file). Bug tracking systems (e.g. Bugzilla, Trac) can also be helpful.
While debugging, it is also recommended to simplify the program, i.e. to reduce the input size, the number of processes, etc., and also to reduce the number of compiler warnings. The resulting reduced test cases can be reused when making changes later on, so you might want to keep them. Please note that for parallel programs it is important to keep a sufficiently large number of processors, as issues like data races may not show up otherwise.
Different testing methods have been established for developing and altering a program, two of which are unit testing and regression testing. In unit testing, the program is broken down into its smallest individually testable parts, which are then checked separately. Regression testing is important when altering fully functional programs, as it is often not sufficient to check only the parts directly affected by the changes. After locating and fixing a bug, it is advisable to keep the tests involved in the process so that you can check for similar bugs again after later changes.
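The two testing methods can be sketched as follows. This is only a minimal illustration using Python's standard `unittest` module; the function `mean` and the bug described in the regression test are hypothetical and not taken from any particular program.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        # Explicit check added as the (hypothetical) bug fix.
        raise ValueError("mean() of an empty sequence")
    return sum(values) / len(values)


class MeanUnitTest(unittest.TestCase):
    """Unit test: checks one small part of the program in isolation."""

    def test_simple_values(self):
        self.assertAlmostEqual(mean([1.0, 2.0, 3.0]), 2.0)


class MeanRegressionTest(unittest.TestCase):
    """Regression test: kept after a bug fix to catch the same bug again."""

    def test_empty_input_raises_value_error(self):
        # Assumed history: empty input once crashed with ZeroDivisionError.
        with self.assertRaises(ValueError):
            mean([])
```

If the block is saved in a file, the tests can be run with `python -m unittest <file>`. Keeping the regression test in the suite means that every later change is automatically checked against the old bug.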
It is also advisable to use a version control system (e.g. Git, SVN) to be able to recover an earlier state of the program.
Small mistakes can produce surprising error messages that often seem completely unrelated to the point where the mistake was actually made. A list of things you may want to look out for:
- Are all variables initialized?
- Are there unused variables? (written but never read)
- Is there a part in the code that is never reached? (e.g. broken if-statement)
- Beware of pointers (e.g. uninitialized, dangling, or pointing outside the allocated memory)
- What are the defaults on the system you are using? (e.g. stack size too small)
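For the last point in the list above, the stack size limit of the current process can be queried programmatically. The following is a minimal sketch, assuming a Unix-like system where Python's standard `resource` module is available; the helper names `format_limit` and `stack_limits` are made up for this illustration.

```python
import resource


def format_limit(value):
    """Render a resource limit given in bytes as a readable string."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


def stack_limits():
    """Return the (soft, hard) stack size limits of this process as strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = stack_limits()
    print(f"stack size limit: soft={soft}, hard={hard}")
```

In many shells, `ulimit -s` reports the same soft limit. Batch systems on a cluster may set different defaults than an interactive login, so it can be worth checking inside a job as well.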
There are usually various correctness checking and debugging tools available on a cluster. Most of them fall into one of two types: interpretive and direct execution. The former works more or less on the source code and machine code level and simulates parts of the program, while the latter is attached to the program and monitors its internal state at run time. The most common strategies are line-by-line execution and the use of breakpoints to skip the monitoring of longer, irrelevant parts.
Tools that use direct execution / dynamic analysis let you stop the execution, set breakpoints, and inspect the source code that is being executed, including the values of the variables at that time:
- TotalView, a comfortable GUI-based debugger
- GDB, a command-line debugger
- ARM (ex-Allinea) DDT, another GUI-based debugger
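To make the idea of line-by-line execution with variable inspection more concrete, here is a minimal sketch in Python based on the standard `sys.settrace` hook. It is not how the tools listed above are implemented; it only mimics the principle of observing a running program. The names `trace_lines` and `running_sum` are hypothetical.

```python
import sys


def trace_lines(func, *args):
    """Run func(*args) and record (relative line, locals) for each line executed."""
    events = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore all frames other than the traced function
        if event == "line":
            # Snapshot of the local variables just before the line executes.
            relative_line = frame.f_lineno - code.co_firstlineno
            events.append((relative_line, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous trace hook
    return result, events


def running_sum(values):
    total = 0
    for v in values:
        total += v
    return total
```

Calling `trace_lines(running_sum, [1, 2, 3])` returns the result together with a list of snapshots, one per executed line, which is roughly the information a debugger shows when you step through a function.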
Correctness Checking Tools
General debugging tools are in principle able to detect all kinds of errors, provided you know where to look. However, there are some types of errors and issues that a debugger cannot detect in a comfortable way. Special tools have therefore been developed to address these kinds of issues:
- MUST for programs parallelized with MPI
- Compiler Sanitizers
- Intel Inspector, which detects data races in OpenMP-parallelized applications
- Oracle Thread Analyzer (formerly Sun; no updates for a long time but still available and useful), included in Oracle Studio
- Debuggers often incorporate some kind of correctness checking, typically for heap memory (look out for corrupt memory)
- Compilers often have command-line options for additional checks (at both compile time and run time), e.g. for detecting out-of-bounds array accesses in Fortran