Difference between revisions of "Runtime profiling"
Revision as of 14:07, 2 April 2019
Introduction
The first task in any performance analysis is to find out in which parts of the code the runtime is spent. Optimisation should focus on those regions, because only there can it yield an overall speedup of the code. The tool that provides this overview is called a runtime profiler. There are two flavours: instrumentation-based and sampling-based profilers.
Instrumentation-based profilers insert function calls that measure the time at specific points in the program. These calls may also perform additional tasks, e.g. determining the function call stack. While it is possible to insert instrumentation calls at the binary level, the common way is to let the compiler add them. The standard tool in this area is gprof, and almost every compiler can instrument code for it.
Sampling-based profilers, on the other hand, probe the program's call stack at regular intervals, triggered by operating system interrupts. A widespread tool for sampling-based profiling is perf, which builds on the profiling infrastructure built into recent Linux kernels.
Both approaches have advantages and disadvantages: instrumentation produces more accurate results but introduces more overhead, whereas sampling has less overhead but produces less accurate results. Special care is necessary when profiling parallel (OpenMP or MPI) applications.
How to use gprof
The first step is to compile and link the program with profiling enabled. For most compilers this is achieved by setting the ```-pg``` flag. With Intel compilers it is important to set the optimization flag explicitly after ```-pg```, because enabling profiling sets the default optimization level to ```-O0```.
Example build options for the Intel tool chain:
icc -pg -O3 -c myfile1.c
icc -pg -O3 -c myfile2.c
icc -o a.out -pg myfile1.o myfile2.o
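The same compile and link steps can be wrapped in a small shell function. This is a minimal sketch and not part of the original instructions: the function name ```build_profiled```, the ```RUN``` dry-run switch and the default compiler ```gcc``` are assumptions made for illustration. Set ```CC=icc``` to reproduce the Intel example.

```sh
#!/bin/sh
# Sketch of a profiling build. CC, RUN and all file names are placeholders.
CC=${CC:-gcc}              # e.g. CC=icc for the Intel tool chain
PROFILE_FLAGS="-pg -O3"    # optimisation level is given after -pg
RUN=${RUN:-}               # set RUN=echo to print the commands instead of running them

# build_profiled OUTPUT SOURCE...
build_profiled() {
    target=$1
    shift
    objects=""
    for src in "$@"; do
        obj="${src%.c}.o"
        # compile every source file with instrumentation enabled
        $RUN $CC $PROFILE_FLAGS -c "$src" -o "$obj" || return 1
        objects="$objects $obj"
    done
    # -pg has to be present on the link line as well
    $RUN $CC $PROFILE_FLAGS -o "$target" $objects
}
```

Calling ```build_profiled a.out myfile1.c myfile2.c``` runs the two compile steps and the link step. With ```RUN=echo``` the commands are only printed, which is handy for checking the flag order.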
Running the generated application produces a ```gmon.out``` file containing the profiling data. To analyse the profile, call the ```gprof``` tool with the executable as argument:
gprof ./miniMD
Refer to the gprof man page for additional command line options. The result is printed to stdout, so it is recommended to either redirect the output to a file or pipe it into the less pager:
gprof ./miniMD | less
The default output consists of three parts: A flat profile, the call graph, and an alphabetical index of routines. For most purposes the flat profile is what you are looking for.
Example output of the flat profile for the Mantevo miniMD proxy app:
Each sample counts as 0.01 seconds.
  %    cumulative     self              self    total
  time    seconds  seconds    calls   s/call   s/call  name
 66.86      26.14    26.14      502     0.05     0.05  ForceLJ::compute(Atom&, Neighbor&, Comm&, int)
 30.77      38.17    12.03       26     0.46     0.46  Neighbor::build(Atom&)
  1.43      38.73     0.56        1     0.56    38.46  Integrate::run(Atom&, Force*, Neighbor&, Comm&, Thermo&, Timer&)
  0.36      38.87     0.14     2850     0.00     0.00  Atom::pack_comm(int, int*, double*, int*)
  0.15      38.93     0.06     2850     0.00     0.00  Atom::unpack_comm(int, int, double*)
  0.13      38.98     0.05       26     0.00     0.00  Atom::pbc()
  0.10      39.02     0.04                             __intel_ssse3_rep_memcpy
  0.08      39.05     0.03       25     0.00     0.00  Atom::sort(Neighbor&)
  0.08      39.08     0.03        1     0.03     0.03  create_atoms(Atom&, int, int, int, double)
  0.05      39.10     0.02       26     0.00     0.00  Comm::borders(Atom&)
  0.00      39.10     0.00  1221559     0.00     0.00  Atom::pack_border(int, double*, int*)
  0.00      39.10     0.00  1221559     0.00     0.00  Atom::unpack_border(int, double*)
  0.00      39.10     0.00   131072     0.00     0.00  Atom::addatom(double, double, double, double, double, double)
  0.00      39.10     0.00     1025     0.00     0.00  Timer::stamp(int)
  0.00      39.10     0.00      502     0.00     0.00  Thermo::compute(int, Atom&, Neighbor&, Force*, Timer&, Comm&)
  0.00      39.10     0.00      500     0.00     0.00  Timer::stamp()
  0.00      39.10     0.00      475     0.00     0.00  Comm::communicate(Atom&)
  0.00      39.10     0.00       26     0.00     0.00  Comm::exchange(Atom&)
  0.00      39.10     0.00       25     0.00     0.00  Timer::stamp_extra_stop(int)
  0.00      39.10     0.00       25     0.00     0.00  Timer::stamp_extra_start()
  0.00      39.10     0.00       25     0.00     0.00  Neighbor::binatoms(Atom&, int)
  0.00      39.10     0.00        7     0.00     0.00  Timer::barrier_stop(int)
  0.00      39.10     0.00        1     0.00     0.00  create_box(Atom&, int, int, int, double)
  0.00      39.10     0.00        1     0.00     0.00  create_velocity(double, Atom&, Thermo&)
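The flat profile is sorted by self seconds in decreasing order, i.e. by the time spent in each routine itself and not by the total time including its callees (Integrate::run has a total s/call of 38.46 but appears only third). The interesting columns are self seconds (the time spent in the routine itself), calls (how often it was called) and self s/call (how much time was spent per call).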
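For a quick look at the hot spots, the flat profile can be reduced to its first few routines. The following is a minimal sketch under stated assumptions: the helper name ```top_routines``` is hypothetical, and the awk filter assumes the column layout of the flat profile shown above (rows without a calls column, such as ```__intel_ssse3_rep_memcpy```, are handled separately).

```sh
#!/bin/sh
# top_routines [N]: read a gprof flat profile on stdin and print
# "%time self-seconds name" for the first N routines (default 5).
top_routines() {
    awk -v n="${1:-5}" '
        # data rows start with three decimal numbers: %time, cumulative, self
        count < n + 0 && $1 ~ /^[0-9]+\.[0-9]+$/ && $2 ~ /^[0-9]+\.[0-9]+$/ && $3 ~ /^[0-9]+\.[0-9]+$/ {
            # with a calls column the name starts at field 7, otherwise at field 4
            first = ($4 ~ /^[0-9]+$/) ? 7 : 4
            name = $first
            for (i = first + 1; i <= NF; i++) name = name " " $i
            print $1, $3, name
            count++
        }'
}

# Demo on a few rows copied from the flat profile above
top_routines 2 <<'EOF'
Each sample counts as 0.01 seconds.
66.86 26.14 26.14 502 0.05 0.05 ForceLJ::compute(Atom&, Neighbor&, Comm&, int)
30.77 38.17 12.03 26 0.46 0.46 Neighbor::build(Atom&)
1.43 38.73 0.56 1 0.56 38.46 Integrate::run(Atom&, Force*, Neighbor&, Comm&, Thermo&, Timer&)
EOF
```

On a real profile the helper would be used as ```gprof -p -b ./miniMD | top_routines 3```, where ```-p``` restricts the output to the flat profile and ```-b``` suppresses the explanatory text.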
How to use perf
Runtime profiling with perf is very simple: the executable does not need to be rebuilt, you just run it wrapped in the perf call:
perf record ./miniMD
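By default the samples are written to a file named ```perf.data``` in the current directory, so a second run replaces the data of the first. A minimal sketch of a wrapper that keeps runs apart is given here; the function name ```profile_run``` and the data file names are assumptions for illustration, while ```-o``` and the ```--``` separator are regular ```perf record``` options.

```sh
#!/bin/sh
# profile_run OUTPUT COMMAND [ARGS...]
# Record a sampling profile of COMMAND into OUTPUT instead of perf.data.
profile_run() {
    if ! command -v perf >/dev/null 2>&1; then
        echo "perf not found in PATH" >&2
        return 127
    fi
    data_file=$1
    shift
    # -o selects the output file, -- separates perf options from the profiled command
    perf record -o "$data_file" -- "$@"
}
```

A run recorded with ```profile_run miniMD.run1.data ./miniMD``` can later be inspected with ```perf report -i miniMD.run1.data```.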
After the application has finished, the results can be analysed with
perf report
which opens an interactive text-based presentation of the results:
Samples: 30K of event 'cycles:uppp', Event count (approx.): 20629160088
Overhead  Command          Shared Object     Symbol
  64.19%  miniMD-ICC       miniMD-ICC        [.] ForceLJ::compute
  31.54%  miniMD-ICC       miniMD-ICC        [.] Neighbor::build
   1.47%  miniMD-ICC       miniMD-ICC        [.] Integrate::run
   0.67%  miniMD-ICC       [kernel]          [k] irq_return
   0.40%  miniMD-ICC       miniMD-ICC        [.] Atom::pack_comm
   0.35%  mpiexec          [kernel]          [k] sysret_check
   0.21%  miniMD-ICC       miniMD-ICC        [.] create_atoms
   0.18%  miniMD-ICC       miniMD-ICC        [.] Atom::unpack_comm
   0.15%  miniMD-ICC       [kernel]          [k] sysret_check
   0.15%  miniMD-ICC       miniMD-ICC        [.] Comm::borders
   0.10%  miniMD-ICC       miniMD-ICC        [.] __intel_ssse3_rep_memcpy
   0.09%  miniMD-ICC       miniMD-ICC        [.] Atom::sort
   0.07%  miniMD-ICC       miniMD-ICC        [.] Neighbor::binatoms
   0.05%  mpiexec          [kernel]          [k] irq_return
   0.04%  miniMD-ICC       miniMD-ICC        [.] Atom::pbc
   0.03%  miniMD-ICC       miniMD-ICC        [.] Atom::unpack_border
   0.03%  miniMD-ICC       miniMD-ICC        [.] Atom::addatom
   0.02%  miniMD-ICC       miniMD-ICC        [.] Atom::pack_border
   0.02%  hydra_pmi_proxy  [kernel]          [k] sysret_check
   0.01%  miniMD-ICC       miniMD-ICC        [.] create_velocity
   0.01%  mpiexec          libc-2.17.so      [.] vfprintf
   0.01%  miniMD-ICC       ld-2.17.so        [.] _dl_lookup_symbol_x
   0.01%  miniMD-ICC       ld-2.17.so        [.] do_lookup_x
   0.01%  hydra_bstrap_pr  [kernel]          [k] irq_return
   0.01%  hydra_pmi_proxy  [kernel]          [k] irq_return
   0.01%  hydra_bstrap_pr  [kernel]          [k] sysret_check
   0.01%  miniMD-ICC       libmpi.so.12.0.0  [.] MPIR_T_CVAR_REGISTER_impl
   0.01%  miniMD-ICC       libc-2.17.so      [.] getenv
   0.01%  miniMD-ICC       libmpi.so.12.0.0  [.] MPL_wtime
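The report above also contains samples from the MPI launcher processes (mpiexec, hydra_pmi_proxy), so it can be useful to sum the overhead per command. The following is a minimal sketch: the helper name ```overhead_by_command``` is hypothetical, and the awk filter assumes the column layout shown above and command names without spaces.

```sh
#!/bin/sh
# overhead_by_command: read perf report text on stdin and print the
# summed overhead per command, largest first.
overhead_by_command() {
    awk '
        # data rows start with a percentage such as 64.19%
        $1 ~ /^[0-9]+\.[0-9]+%$/ {
            pct = $1
            sub(/%$/, "", pct)      # strip the percent sign
            total[$2] += pct        # field 2 is the command name
        }
        END {
            for (cmd in total) printf "%s %.2f\n", cmd, total[cmd]
        }
    ' | LC_ALL=C sort -k2,2nr
}

# Demo on a few rows copied from the report above
overhead_by_command <<'EOF'
64.19% miniMD-ICC miniMD-ICC [.] ForceLJ::compute
31.54% miniMD-ICC miniMD-ICC [.] Neighbor::build
0.35% mpiexec [kernel] [k] sysret_check
0.02% hydra_pmi_proxy [kernel] [k] sysret_check
EOF
```

On real data the helper would read the non-interactive output, e.g. ```perf report --stdio | overhead_by_command```. perf can also group the samples itself with ```perf report --stdio --sort comm```.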