Load Balancing
This is a short overview over the basic concepts of Load Balancing. Load Balancing should be taken into account whenever trying to improve a program's performance by parallelization (Parallel_Programming).
Theory
Parallelization is used to achieve a speedup of the runtime of a program while using the available processors as efficiently as possible. The most common first approach is to parallelize loops in the program, as they usually have a high impact on the runtime. Tools like IntelVTune can be used to find so-called hotspots in a program.
When splitting the workload across multiple processors, the speedup of the program is the runtime of the serial run divided by the runtime of the parallelized run and is therefore limited by the number of processors (N). As usually only parts of the code can be parallelized (p), it is further limited by the strictly serial part of the program (s). (Amdahl's Law).
The parallel efficiency is defined as the speedup divided by the number of processors and is therefore limited by 1.
However, the above calculations assume perfect load balancing, meaning that the average and maximum time a processor takes to finish (tcomp) are equal to another and there is no waiting/idling time. This is not usually possible for complex programs and mostly must be adjusted manually in the code.
Example
for(int i=0; i<1024; i++){
dosomething();
}
We assume that there are no data dependencies within the loop. Let the runtime of this loop with 1 processor be 60s. If we split the loop into 2 halves (e.g. i=0-511; i=512-1023) over 2 processors we can expect the runtime to be halved (30s), which would be a speedup of 2 and a perfect efficiency of 1. As the number of instructions is the same for every i it does not matter how we split the loop as long as we split it into equal sizes (ignoring data access optimization, which is not covered here). However, not every loop (or other code snippet) can be distributed this easily.
Let's consider a more complex loop:
for(int i=0; i<1024; i++){
for(int j=0; j<i+1; j++){
dosomething();
}
}
With increasing i the size of the inner loop increases. Simply splitting it in the middle would result in one processor receiving much less work and therefore finishing much earlier than the other processor. While this would still result in a small speedup, it would have a very low efficiency as we are basically wasting resources by letting one processor idle. More complex loops therefore require different strategies in terms of load balancing, for example handing every processor only small chunks at a time.