The "Load Imbalance" pattern describes a common problem in parallel applications: work is not distributed equally over all processing units, so some units do more work than others. As a result, the units with less work finish early and wait at the next synchronization point until the units with more work have finished their tasks.
- Saturating/sub-linear speedup
The detection mechanisms depend on how 'work' is defined for the application. If floating-point operations are the smallest unit of work, you can count them per processing unit with hardware performance monitoring tools:
- LIKWID with performance groups FLOPS_DP and FLOPS_SP
- PAPI with papi_mflops() or PAPI_SP_OPS and PAPI_DP_OPS events
- perf offers fp_arith_inst_retired.* events
If other operations are the smallest unit of work and no hardware performance events are available to count them, measure as close to the processing units as possible and use the data transfers, the inputs for your work, as a proxy:
- LIKWID with performance groups DATA and L1
- PAPI and perf also provide events for load/store counting at each CPU core and data transfers between core and L1 cache
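Once per-unit counts are available from any of these tools, the imbalance can be quantified. The following is a minimal sketch, not part of any of the tools above. It assumes that run time is proportional to the counted work, and the function name `imbalance_metrics` and the sample counts are hypothetical. It reports the ratio of maximum to average work, the fraction of total unit-time spent waiting at the synchronization point, and the resulting upper bound on speedup.

```python
from statistics import mean


def imbalance_metrics(work_per_unit):
    """Return (imbalance_factor, idle_fraction, speedup_bound).

    work_per_unit: one work count per processing unit, e.g. FLOPs or
    loads/stores measured per core (assumption: time ~ counted work).
    """
    if not work_per_unit or max(work_per_unit) <= 0:
        raise ValueError("need at least one unit with positive work")
    w_max = max(work_per_unit)
    w_avg = mean(work_per_unit)
    imbalance_factor = w_max / w_avg      # 1.0 means perfectly balanced
    idle_fraction = 1.0 - w_avg / w_max   # share of unit-time spent waiting
    speedup_bound = sum(work_per_unit) / w_max  # the slowest unit sets the run time
    return imbalance_factor, idle_fraction, speedup_bound


# Hypothetical per-core counts for four cores.
counts = [400, 100, 100, 200]
print(imbalance_metrics(counts))  # (2.0, 0.5, 2.0)
```

In this made-up example the four cores cannot exceed a speedup of 2, which is the saturating speedup the pattern is named for.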
Possible optimizations and/or fixes
- Balance the work over all processing units as evenly as possible.
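As an illustration of what balancing can look like, the sketch below compares a static split into contiguous chunks with a greedy longest-task-first assignment. It is only one possible strategy and assumes that the cost of each task is known in advance, which is often not the case; the function names and the task costs are hypothetical.

```python
import heapq


def static_blocks(costs, n_units):
    """Split tasks into contiguous chunks of equal count; return load per unit."""
    chunk = -(-len(costs) // n_units)  # ceiling division
    return [sum(costs[i * chunk:(i + 1) * chunk]) for i in range(n_units)]


def greedy_lpt(costs, n_units):
    """Assign tasks, largest first, to the currently least-loaded unit."""
    heap = [(0, u) for u in range(n_units)]  # (current load, unit id)
    loads = [0] * n_units
    for cost in sorted(costs, reverse=True):
        load, unit = heapq.heappop(heap)     # least-loaded unit so far
        loads[unit] = load + cost
        heapq.heappush(heap, (loads[unit], unit))
    return loads


# Hypothetical task costs, e.g. iterations whose work shrinks with the index.
costs = [8, 7, 6, 5, 4, 3, 2, 1]
print(static_blocks(costs, 4))  # [15, 11, 7, 3]
print(greedy_lpt(costs, 4))     # [9, 9, 9, 9]
```

With the static split, the most loaded unit carries 15 of the 36 work units while the least loaded carries 3. The greedy assignment gives every unit 9. When task costs are not known beforehand, handing out work at run time (for example with OpenMP's `schedule(dynamic)` clause) aims for a similar effect at the price of some scheduling overhead.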