Load Balancing

From HPC Wiki
Revision as of 13:52, 7 January 2019

This is a short overview of the basic concepts of load balancing. Load balancing should be taken into account whenever trying to improve a program's performance by parallelization (see Parallel Programming).

General

Parallelization is used to achieve a speedup of the runtime of a program while using the available processors as efficiently as possible. The most common first approach is to parallelize loops in the program, as they usually have a high impact on the runtime. Tools like Intel VTune can be used to find so-called hotspots in a program.

When splitting the workload across multiple processors, the speedup is the runtime of the serial version divided by the runtime of the parallel version. The higher the speedup, the better. However, the benefit of adding processors usually diminishes beyond a certain point, when they can no longer be fully utilized. This is captured by the parallel efficiency, defined as the achieved speedup divided by the number of processors used. It is usually not possible to achieve a speedup equal to the number of processors used, as most applications have strictly serial parts (Amdahl's Law).
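The two definitions above can be written down directly. The following is a minimal sketch, assuming the serial and parallel runtimes have already been measured in seconds; the function names are illustrative and not part of any library.

```c
#include <assert.h>

/* Speedup: serial runtime divided by parallel runtime (both in seconds). */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency: achieved speedup divided by the number of processors. */
double efficiency(double t_serial, double t_parallel, int n_procs)
{
    assert(n_procs > 0);
    return speedup(t_serial, t_parallel) / n_procs;
}
```

For a hypothetical run that takes 60s serially and 20s on 4 processors, speedup(60.0, 20.0) gives 3 and efficiency(60.0, 20.0, 4) gives 0.75.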

Load balancing is of great importance when trying to utilize multiple processors as efficiently as possible. Adding more processors creates a noticeable amount of synchronisation overhead. Therefore, it is only beneficial if there is enough work to keep all processors busy at the same time. This means splitting the workload into equal parts over all processors and minimising the waiting times at synchronisation points (e.g. barriers).

Example

// Every iteration performs the same amount of work.
for (int i = 0; i < 1024; i++) {
    dosomething();
}

We assume that there are no data dependencies within the loop. Let the runtime of this loop on 1 processor be 60s. If we split the loop into 2 halves (i = 0 to 511 and i = 512 to 1023) across 2 processors, we can expect the runtime to be halved (30s), which corresponds to a speedup of 2 and a perfect efficiency of 1. As the number of instructions is the same for every i, it does not matter how we split the loop as long as all parts have equal size (ignoring data access optimization, which is not covered here). However, not every loop (or other code snippet) can be distributed this easily.
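Splitting a loop into equally sized contiguous parts can be sketched as follows. This is only an illustration under the assumption that each processor is identified by a 0-based rank; struct range and block_range are hypothetical names, not taken from any library.

```c
#include <assert.h>

/* Half-open iteration range [begin, end) assigned to one processor. */
struct range {
    int begin;
    int end;
};

/* Split n iterations into n_procs contiguous blocks whose sizes differ
 * by at most one iteration. rank is the 0-based processor index. */
struct range block_range(int n, int n_procs, int rank)
{
    assert(n >= 0 && n_procs > 0 && rank >= 0 && rank < n_procs);
    int base = n / n_procs;
    int rest = n % n_procs; /* the first `rest` blocks get one extra iteration */
    struct range r;
    r.begin = rank * base + (rank < rest ? rank : rest);
    r.end = r.begin + base + (rank < rest ? 1 : 0);
    return r;
}
```

For the loop above, block_range(1024, 2, 0) covers i = 0 to 511 and block_range(1024, 2, 1) covers i = 512 to 1023. Each processor would then run the loop body only for the indices in its own range.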

Let's consider a more complex loop:

for (int i = 0; i < 1024; i++) {
    // The inner loop runs i+1 times, so later iterations of i do more work.
    for (int j = 0; j < i + 1; j++) {
        dosomething();
    }
}

With increasing i, the size of the inner loop increases. Simply splitting the outer loop in the middle would result in one processor receiving much less work and therefore finishing much earlier than the other: assuming every call to dosomething() takes the same time, the first half (i = 0 to 511) contains only about a quarter of all calls, the second half about three quarters. While this would still result in a small speedup (about 1.33), the efficiency would be only about 0.67, as we are wasting resources by letting one processor idle. More complex loops therefore require different load balancing strategies, for example handing every processor only small chunks at a time.
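The effect of the two strategies can be estimated by counting how many calls to dosomething() each processor receives. The following is a rough sketch, assuming that every call takes the same time and that chunks are dealt out to the processors in round-robin order; all function names and the chunk size are illustrative choices.

```c
#include <assert.h>

#define N_ITER 1024

/* Number of dosomething() calls in outer iteration i (inner loop runs i+1 times). */
static long work_of(int i)
{
    return (long)i + 1;
}

/* Work of processor `rank` when the outer loop is cut into n_procs
 * contiguous blocks of equal length. */
long work_block(int n_procs, int rank)
{
    assert(n_procs > 0 && N_ITER % n_procs == 0);
    assert(rank >= 0 && rank < n_procs);
    int size = N_ITER / n_procs;
    long w = 0;
    for (int i = rank * size; i < (rank + 1) * size; i++)
        w += work_of(i);
    return w;
}

/* Work of processor `rank` when chunks of `chunk` consecutive iterations
 * are dealt out round-robin (chunk 0 to rank 0, chunk 1 to rank 1, ...). */
long work_chunked(int n_procs, int rank, int chunk)
{
    assert(n_procs > 0 && chunk > 0 && rank >= 0 && rank < n_procs);
    long w = 0;
    for (int i = 0; i < N_ITER; i++)
        if ((i / chunk) % n_procs == rank)
            w += work_of(i);
    return w;
}

/* Estimated speedup: total work divided by the work of the busiest
 * processor, since the slowest processor determines the runtime. */
double estimated_speedup(const long *work, int n_procs)
{
    long sum = 0, max = 0;
    for (int p = 0; p < n_procs; p++) {
        sum += work[p];
        if (work[p] > max)
            max = work[p];
    }
    assert(max > 0);
    return (double)sum / (double)max;
}
```

With 2 processors, the split in the middle assigns 131328 and 393472 calls, giving an estimated speedup of about 1.33. Dealing out chunks of 16 iterations assigns 258304 and 266496 calls, giving an estimated speedup of about 1.97. The estimate ignores synchronisation and scheduling overhead, which grows as chunks get smaller.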

Amdahl's Law

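s: serial fraction of the application's runtime

p: parallel fraction of the application's runtime (p = 1 - s)

N: number of processors

: <math>T(N) = \left(s + \frac p N\right) \cdot T(1)</math>

: <math>\text{Speedup } S(N) = \frac {T(1)} {T(N)} = \frac 1 {s + \frac {1-s} N}</math>

: <math>\text{Efficiency } E(N) = \frac {S(N)} N = \frac 1 {s(N-1)+1}</math>

Amdahl's Law assumes perfect load balance. In general, load balance is defined as follows, where t_comp is the time a processor spends on actual work:

: <math>\text{Load Balance } LB = \frac {\operatorname{avg}(t_{comp})} {\max(t_{comp})}</math>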
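As a minimal sketch, Amdahl's Law in the form S(N) = 1 / (s + (1 - s) / N) and the load balance as the average divided by the maximum of the per-processor compute times can be evaluated as follows. The function names are illustrative, and the compute times are assumed to have been measured per processor beforehand.

```c
#include <assert.h>

/* Amdahl speedup S(N) = 1 / (s + (1 - s) / N) for serial fraction s. */
double amdahl_speedup(double s, int n)
{
    assert(s >= 0.0 && s <= 1.0 && n > 0);
    return 1.0 / (s + (1.0 - s) / n);
}

/* Efficiency E(N) = S(N) / N, which equals 1 / (s * (N - 1) + 1). */
double amdahl_efficiency(double s, int n)
{
    return amdahl_speedup(s, n) / n;
}

/* Load balance: average compute time divided by maximum compute time.
 * A value of 1.0 means that all processors were busy equally long. */
double load_balance(const double *t_comp, int n_procs)
{
    assert(n_procs > 0);
    double sum = 0.0, max = 0.0;
    for (int p = 0; p < n_procs; p++) {
        sum += t_comp[p];
        if (t_comp[p] > max)
            max = t_comp[p];
    }
    assert(max > 0.0);
    return (sum / n_procs) / max;
}
```

For a hypothetical application with s = 0.1 on 16 processors, amdahl_speedup gives 6.4 and amdahl_efficiency gives 0.4. Two processors with compute times of 10s and 30s yield a load balance of about 0.67.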