Load Balancing

From HPC Wiki

This is a short overview of the basic concepts of load balancing. Load balancing should be taken into account whenever you try to improve a program's performance through parallelization (see Parallel Programming).

General

Parallelization is used to reduce the runtime of a program while using the available processors as efficiently as possible. The most common first approach is to parallelize loops in the program, as they usually have a high impact on the runtime. Tools like Intel VTune can be used to find such hotspots in a program.

When splitting the workload across multiple processors, the speedup describes the achieved reduction of the runtime compared to the serial version: it is the serial runtime divided by the parallel runtime. The higher the speedup, the better. However, the benefit of adding more processors usually diminishes at a certain point, when they can no longer be utilized optimally. This is captured by the parallel efficiency, which is defined as the achieved speedup divided by the number of processors used. It is usually not possible to achieve a speedup equal to the number of processors used, as most applications have strictly serial parts (Amdahl's Law).
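The two definitions above can be written down directly. The following is a minimal sketch in C; the function names `speedup` and `efficiency` are chosen for illustration and do not come from any library.

```c
/* Speedup: serial runtime divided by parallel runtime. */
double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}

/* Parallel efficiency: achieved speedup divided by the number of
 * processors used. A value of 1.0 means perfect efficiency. */
double efficiency(double t_serial, double t_parallel, int n_procs) {
    return speedup(t_serial, t_parallel) / n_procs;
}
```

For instance, with hypothetical timings of 60 s serial and 20 s on 4 processors, `speedup(60.0, 20.0)` is 3 and `efficiency(60.0, 20.0, 4)` is 0.75.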

Load balancing is of great importance when trying to utilize multiple processors as efficiently as possible. Adding more processors creates a noticeable amount of synchronisation overhead. Therefore, it is only beneficial if there is enough work present to keep all processors busy at the same time. This means splitting the workload into equal parts over all processors and minimising the waiting times at synchronisation points (e.g. barriers).

Example

for(int i=0; i<1024; i++){ 
    dosomething();
}

We assume that there are no data dependencies within the loop. Let the runtime of this loop with 1 processor be 60s. If we split the loop into 2 halves (e.g. i=0-511; i=512-1023) over 2 processors we can expect the runtime to be halved (30s), which would be a speedup of 2 and a perfect efficiency of 1. As the number of instructions is the same for every i it does not matter how we split the loop as long as we split it into equal sizes (ignoring data access optimization, which is not covered here). However, not every loop (or other code snippet) can be distributed this easily.
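Splitting a loop into equally sized contiguous parts could be sketched as follows. This is only an illustration: the helper `block_range` is a hypothetical name, and in practice a parallel programming model such as OpenMP would normally compute these ranges for you.

```c
/* Compute the half-open range [*begin, *end) of loop iterations assigned
 * to processor `rank` when `n` iterations are split into `n_procs`
 * contiguous blocks. The first (n % n_procs) blocks receive one extra
 * iteration, so block sizes differ by at most one. */
void block_range(int n, int n_procs, int rank, int *begin, int *end) {
    int base = n / n_procs;   /* minimum iterations per processor */
    int extra = n % n_procs;  /* leftover iterations to distribute */
    *begin = rank * base + (rank < extra ? rank : extra);
    *end = *begin + base + (rank < extra ? 1 : 0);
}
```

For the loop above with 1024 iterations and 2 processors, this yields the ranges 0-511 and 512-1023 used in the text.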

Let's consider a more complex loop:

for(int i=0; i<1024; i++){ 
    for(int j=0; j<i+1; j++){
        dosomething();
    }
}

With increasing i the size of the inner loop increases. Simply splitting the outer loop in the middle would give the first processor only about a quarter of the total work, so it would finish much earlier than the other processor. Assuming every call to dosomething() takes the same time, the runtime is then determined by the second processor, which performs about three quarters of the work. This still yields a small speedup (about 1.33), but at a low efficiency (about 0.67), as we are wasting resources by letting one processor idle. More complex loops therefore require different load balancing strategies, for example handing every processor only small chunks at a time.
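To make the imbalance concrete, the sketch below counts the inner-loop iterations each processor would execute under two distributions of the outer loop. The function names are hypothetical, and the round-robin distribution of small chunks is a simplified, static stand-in for handing out chunks on demand.

```c
/* Inner-loop iterations executed for outer index i in the triangular loop. */
static long work_of(int i) { return (long)i + 1; }

/* Total work of processor `rank` if the n outer iterations are split into
 * n_procs contiguous blocks (assumes n is divisible by n_procs). */
long work_block(int n, int n_procs, int rank) {
    int size = n / n_procs;
    long sum = 0;
    for (int i = rank * size; i < (rank + 1) * size; i++)
        sum += work_of(i);
    return sum;
}

/* Total work of processor `rank` if chunks of `chunk` consecutive outer
 * iterations are handed out to the processors round-robin. */
long work_chunked(int n, int n_procs, int rank, int chunk) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        if ((i / chunk) % n_procs == rank)
            sum += work_of(i);
    return sum;
}
```

For n = 1024 and 2 processors, the block split assigns 131328 and 393472 inner iterations, whereas round-robin chunks of size 1 assign 262144 and 262656, which is almost perfectly balanced.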

Amdahl's Law

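\[ \mathrm{Speedup}(N) = \frac{1}{s + \frac{p}{N}} \]

s: serial part of the application (as a fraction of the runtime)

p: parallel part of the application (as a fraction of the runtime), with s + p = 1

N: number of processors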
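Amdahl's Law is usually stated as speedup = 1 / (s + p/N), where s and p are fractions of the serial runtime and s + p = 1. A minimal sketch evaluating this form is shown below; the function name `amdahl_speedup` is chosen for illustration.

```c
/* Amdahl's Law: predicted speedup on n_procs processors for a program
 * whose serial fraction is s (0 <= s <= 1). The parallel fraction is
 * p = 1 - s, and perfect load balance is assumed. */
double amdahl_speedup(double s, int n_procs) {
    double p = 1.0 - s;
    return 1.0 / (s + p / n_procs);
}
```

For example, with a serial fraction of s = 0.1 the speedup on 10 processors is roughly 5.26, and it can never exceed 1/s = 10 regardless of how many processors are added.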


Amdahl's Law assumes perfect load balance. In general, load balance is defined as follows, with t_comp being the time a processor spends doing actual work.