This is a short overview of the basic concepts of correctness checking (debugging).
The common approach of trial and error is usually not very efficient with large programs. It is easy to lose track of what has been tested and adjusted, and small changes are easily forgotten when trying to recover an earlier state. It is therefore advisable to tackle errors and warnings systematically. A good first step is to keep a logbook (which can be a simple text file) and write down exactly what error message came up, whether and how it was reproducible, and which actions were taken to understand what happened (e.g. inspecting a core file). Bug tracking systems (e.g. Bugzilla, Trac) can also be helpful.
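A logbook does not need any special tooling. As a minimal sketch, assuming Python is available and that a plain text file is used, entries could be appended as follows; the function name `log_entry` and the field layout are hypothetical and only illustrate what is worth recording.

```python
from datetime import datetime, timezone
from pathlib import Path


def log_entry(logbook: Path, error: str, reproducible: str, actions: str) -> str:
    """Append one timestamped entry to a plain-text logbook and return it.

    The three fields mirror the advice above: the exact error message,
    whether and how it could be reproduced, and the actions taken.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    entry = (
        f"[{stamp}]\n"
        f"  error:        {error}\n"
        f"  reproducible: {reproducible}\n"
        f"  actions:      {actions}\n"
    )
    # Open in append mode so earlier entries are never overwritten.
    with logbook.open("a", encoding="utf-8") as fh:
        fh.write(entry + "\n")
    return entry
```

A call such as `log_entry(Path("logbook.txt"), "segmentation fault in solver", "yes, with 4 processes", "inspected core file")` adds one entry; the file name and the messages are made up for illustration.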
While debugging, it is also recommended to simplify the program, i.e. to reduce the input size, the number of processes etc., and to reduce the number of compiler warnings. The resulting test cases can be reused when making changes later on, so you might want to keep them. Please note that when debugging parallel programs it is important to keep a sufficiently large number of processes or threads, as issues like data races may not show up otherwise.
Different testing methods have been established for developing and altering a program, two of which are unit testing and regression testing. In unit testing, the program is broken down into its smallest individually testable parts, each of which is checked on its own. Regression testing is important when altering fully functional programs, as it is often not sufficient to check only the parts directly affected by the changes. After locating and fixing a bug, it is advisable to keep the tests involved in the process and rerun them after later changes to check for similar bugs.
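The two testing methods can be illustrated with a small sketch using Python's standard `unittest` module. The function `mean` and the bug described in the comments are hypothetical; they only stand in for one small testable part of a larger program.

```python
import unittest


def mean(values):
    """Hypothetical unit under test: arithmetic mean of a non-empty sequence."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class MeanTest(unittest.TestCase):
    def test_typical_input(self):
        # Unit test: checks the smallest testable part in isolation.
        self.assertEqual(mean([1, 2, 3]), 2)

    def test_empty_input_regression(self):
        # Regression test: kept after fixing an assumed earlier bug in which
        # an empty list caused a ZeroDivisionError. Rerunning it after later
        # changes shows whether the bug has been reintroduced.
        with self.assertRaises(ValueError):
            mean([])
```

If the code is saved in a file, the tests can be run with `python -m unittest <file>`; rerunning the whole suite after every change is what turns individual unit tests into a regression test.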
It is also advisable to use a version control system (e.g. Git, SVN) to be able to recover an earlier state of the program.
Small mistakes can produce surprising error messages that often seem completely unrelated to the point where the mistake was actually made. A list of things you may want to look out for:
- Are all variables initialized?
- Are there unused variables? (written but never read)
- Is there a part in the code that is never reached? (e.g. broken if-statement)
- Beware of pointers (e.g. null or dangling pointers, out-of-bounds access)
- What are the defaults on the system you are using? (e.g. stack size too small)
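The last item of the list can be checked from within a program. The following is a minimal sketch, assuming a Unix-like system (the standard `resource` module is not available on Windows); the helper name `stack_limits` is hypothetical.

```python
import resource
import sys


def stack_limits():
    """Return the soft and hard stack size limits of this process as strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)

    def fmt(value):
        # RLIM_INFINITY means that no limit is set.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return f"{value // 1024} KiB"

    return fmt(soft), fmt(hard)


if __name__ == "__main__":
    soft, hard = stack_limits()
    print(f"stack size: soft={soft}, hard={hard}")
    # Python additionally limits the recursion depth on its own.
    print(f"Python recursion limit: {sys.getrecursionlimit()}")
```

The values reported on a login node may differ from those inside a batch job, so it can be worth running such a check in the same environment in which the error occurred.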
There are usually various debugging tools available on a cluster. Most of them fall into one of two types: interpretive and direct execution. Interpretive tools work on the source code or machine code level and simulate parts of the program, while direct-execution tools are attached to the running program and monitor its internal state at runtime. The most common debugging strategies are line-by-line execution and the use of breakpoints, which allow you to skip the monitoring of long and irrelevant parts.
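To illustrate what monitoring the internal state line by line means, here is a minimal sketch in Python based on the standard `sys.settrace` hook. The names `trace_lines` and `running_sum` are hypothetical, and a real debugger offers far more than this; the sketch only records the local variables before each executed line of one function.

```python
import sys


def trace_lines(func, *args):
    """Run func(*args) and record (line offset, locals) before each executed line."""
    code = func.__code__
    steps = []

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace any other function
        if event == "line":
            # Line number relative to the "def" line, plus a copy of the locals.
            offset = frame.f_lineno - code.co_firstlineno
            steps.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous trace function
    return result, steps


def running_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total
```

Calling `trace_lines(running_sum, 3)` returns the result of the function together with the recorded steps, which show how `total` and `i` change from line to line. For interactive work, Python's built-in `breakpoint()` call stops the program at a chosen line instead of recording every step.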
Examples of tools that use direct execution (dynamic analysis) are MUST, for programs parallelized with MPI, and the GNU command-line debugger GDB, in which you can set breakpoints and look into the source code that is being executed.