Difference between revisions of "Scaling"
Revision as of 09:57, 18 September 2020
In the most general sense, scalability is defined as the ability to handle more work as the size of the computer or application grows. Scalability, or scaling, is widely used to indicate the ability of hardware and software to deliver greater computational power when the amount of resources is increased. For HPC clusters, it is important that they are scalable; in other words, the capacity of the whole system can be proportionally increased by adding more hardware. For software, scalability is sometimes referred to as parallelization efficiency: the ratio between the actual speedup and the ideal speedup obtained when using a certain number of processors. For this tutorial, we focus on software scalability and discuss two common types of scaling. The speedup in parallel computing can be straightforwardly defined as
$$\text{speedup} = \frac{t_1}{t_N}$$
where t1 is the computational time for running the software using one processor, and tN is the computational time for running the same software with N processors. Ideally, we would like software to have a linear speedup that is equal to the number of processors (speedup = N), as that would mean that every processor contributes 100% of its computational power. Unfortunately, this is a very challenging goal for real-world applications to attain.
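Combining this definition with the parallelization efficiency described above (the actual speedup divided by the ideal speedup N) gives the relations below. The timings in the numerical example are hypothetical values chosen for illustration, not measurements.

```latex
\[
\text{speedup}(N) = \frac{t_1}{t_N}, \qquad
\text{efficiency}(N) = \frac{\text{speedup}(N)}{N} = \frac{t_1}{N\, t_N}
\]
% Hypothetical example: t_1 = 100 s on one processor, t_8 = 16 s on eight
\[
\text{speedup}(8) = \frac{100}{16} = 6.25, \qquad
\text{efficiency}(8) = \frac{6.25}{8} \approx 0.78
\]
```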
Scaling tests
As we have already indicated, the primary challenge of parallel computing is deciding how best to break up a problem into individual pieces that can each be computed separately. Large applications are usually not developed and tested using the full problem size and/or number of processors right from the start, as this comes with long waits and a high usage of resources. It is therefore advisable to scale these factors down at first, which also makes it possible to estimate the resources required for the full run more accurately in terms of resource planning. Scalability testing measures the ability of an application to perform well or better with varying problem sizes and numbers of processors. It does not test the application's general functionality or correctness.
Strong or Weak Scaling
Applications can generally be divided into strong scaling and weak scaling applications. Please note that the terms strong and weak themselves do not give any information whatsoever on how well an application actually scales. We restate the definitions of strong and weak scaling mentioned in Scaling tests and give more detail on calculating the efficiency and speedup for each below.
Strong Scaling
In case of strong scaling, the number of processors is increased while the problem size remains constant. This also results in a reduced workload per processor. Strong scaling is mostly used for long-running CPU-bound applications to find a setup which results in a reasonable runtime with moderate resource costs. The individual workload must be kept high enough to keep all processors fully occupied. The speedup achieved by increasing the number of processes usually decreases more or less continuously.
In an ideal world a problem would scale in a linear fashion, that is, the program would speed up by a factor of N when it runs on a machine having N nodes. (Of course, as N → ∞ the proportionality cannot hold because communication time must then dominate.) Clearly then, the goal when solving a problem that scales strongly is to decrease the amount of time it takes to solve the problem by using a more powerful computer. These are typically CPU-bound problems, and they are the hardest ones for which to obtain something close to a linear speedup.
Amdahl’s law and strong scaling
In 1967, Amdahl pointed out that the speedup is limited by the fraction of the serial part of the software that is not amenable to parallelization. Amdahl’s law can be formulated as follows
$$\text{speedup} = \frac{1}{s + \frac{p}{N}}$$
where s is the proportion of execution time spent on the serial part, p is the proportion of execution time spent on the part that can be parallelized, and N is the number of processors. Amdahl’s law states that, for a fixed problem, the upper limit of speedup is determined by the serial fraction of the code. This is called strong scaling. In this case the problem size stays fixed but the number of processing elements is increased. This is used as justification for programs that take a long time to run (something that is CPU-bound). The goal in this case is to find a "sweet spot" that allows the computation to complete in a reasonable amount of time, yet does not waste too many cycles due to parallel overhead. In strong scaling, a program is considered to scale linearly if the speedup (in terms of work units completed per unit time) is equal to the number of processing elements used (N). In general, it is harder to achieve good strong scaling at larger process counts, since the communication overhead for many or most algorithms increases in proportion to the number of processes used.
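To make the upper limit concrete, the following minimal Python sketch evaluates Amdahl's law for a few processor counts. It assumes that s and p together account for the whole runtime (p = 1 - s). The function names and the serial fraction of 5% are illustrative choices and do not come from any measured code.

```python
def amdahl_speedup(serial_fraction: float, n_procs: int) -> float:
    """Speedup predicted by Amdahl's law: 1 / (s + p / N), with p = 1 - s."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)


def amdahl_limit(serial_fraction: float) -> float:
    """Upper bound on the speedup as N grows without bound: 1 / s."""
    if serial_fraction <= 0.0:
        return float("inf")  # a fully parallel code has no Amdahl limit
    return 1.0 / serial_fraction


if __name__ == "__main__":
    s = 0.05  # assumed serial fraction, chosen only for illustration
    for n in (1, 8, 64, 512, 4096):
        print(f"N={n:5d}  speedup={amdahl_speedup(s, n):6.2f}")
    print(f"upper limit for s={s}: {amdahl_limit(s):.1f}")
```

With a 5% serial part the predicted speedup approaches, but never exceeds, 20, however many processors are added.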
Calculating Strong Scaling Speedup
If the amount of time needed to complete a serial task is t1, and the amount of time to complete the same unit of work with N processing elements (parallel task) is tN, then the speedup is given as:
$$\text{speedup} = \frac{t_1}{t_N}$$
Weak Scaling
In case of weak scaling, both the number of processors and the problem size are increased. This also results in a constant workload per processor. Weak scaling is mostly used for large memory-bound applications where the required memory cannot be satisfied by a single node. They usually scale well to higher core counts as memory access strategies often focus on the nearest neighboring nodes while ignoring those further away and therefore scale well themselves. The upscaling is usually restricted only by the available resources or the maximum problem size. For an application that scales perfectly weakly, the work done by each node remains the same as the scale of the machine increases, which means that we are solving progressively larger problems in the same time as it takes to solve smaller ones on a smaller machine.
Gustafson’s law and weak scaling
Amdahl’s law gives the upper limit of speedup for a problem of fixed size. This seems to be a bottleneck for parallel computing; if one would like to gain a 500 times speedup on 1000 processors, Amdahl’s law requires that the proportion of serial part cannot exceed 0.1%. However, as Gustafson pointed out, in practice the sizes of problems scale with the amount of available resources. If a problem only requires a small amount of resources, it is not beneficial to use a large amount of resources to carry out the computation. A more reasonable choice is to use small amounts of resources for small problems and larger quantities of resources for big problems. Gustafson’s law was proposed in 1988, and is based on the approximations that the parallel part scales linearly with the amount of resources, and that the serial part does not increase with respect to the size of the problem. It provides the formula for scaled speedup
$$\text{scaled speedup} = s + p \times N$$
where s, p and N have the same meaning as in Amdahl’s law. With Gustafson’s law the scaled speedup increases linearly with respect to the number of processors (with a slope smaller than one), and there is no upper limit for the scaled speedup. This is called weak scaling, where the scaled speedup is calculated based on the amount of work done for a scaled problem size (in contrast to Amdahl’s law, which focuses on a fixed problem size). In this case the problem size (workload) assigned to each processing element stays constant and additional elements are used to solve a larger total problem (one that wouldn't fit in RAM on a single node, for example). This type of measurement is therefore the justification for programs that take a lot of memory or other system resources (something that is memory-bound). In the case of weak scaling, linear scaling is achieved if the run time stays constant while the workload is increased in direct proportion to the number of processors. Most programs running in this mode should scale well to larger core counts, as they typically employ nearest-neighbour communication patterns where the communication overhead is relatively constant regardless of the number of processes used. Exceptions include algorithms that make heavy use of global communication patterns, e.g. FFTs and transposes.
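The contrast between the two laws can be checked numerically. The sketch below, which assumes p = 1 - s, first solves Amdahl's law for the largest serial fraction that still allows a 500 times speedup on 1000 processors, and then evaluates Gustafson's scaled speedup for the same serial fraction. The function names are illustrative.

```python
def amdahl_max_serial_fraction(target_speedup: float, n_procs: int) -> float:
    """Largest serial fraction s for which Amdahl's law still reaches the target.

    Solving 1 / (s + (1 - s) / N) = S for s gives s = (N / S - 1) / (N - 1).
    """
    if n_procs < 2 or not 1.0 <= target_speedup <= n_procs:
        raise ValueError("need n_procs >= 2 and 1 <= target_speedup <= n_procs")
    return (n_procs / target_speedup - 1.0) / (n_procs - 1.0)


def gustafson_scaled_speedup(serial_fraction: float, n_procs: int) -> float:
    """Scaled speedup from Gustafson's law: s + p * N, with p = 1 - s."""
    parallel_fraction = 1.0 - serial_fraction
    return serial_fraction + parallel_fraction * n_procs


if __name__ == "__main__":
    s = amdahl_max_serial_fraction(500.0, 1000)
    print(f"Amdahl: serial fraction must not exceed {s:.4%}")
    print(f"Gustafson: scaled speedup at that s = {gustafson_scaled_speedup(s, 1000):.1f}")
```

The first value comes out at about 0.1%, in line with the figure quoted above. For the same serial fraction, the scaled speedup on the correspondingly larger problem is close to the processor count.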
Calculating Weak Scaling Efficiency
If the amount of time to complete a work unit with 1 processing element is t1, and the amount of time to complete N of the same work units with N processing elements is tN, the weak scaling efficiency is given as:
$$\text{efficiency} = \frac{t_1}{t_N}$$
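As a minimal sketch of this calculation, the Python snippet below computes the weak scaling efficiency for a series of runs in which the workload grows in proportion to the number of processing elements. The timings are hypothetical values invented for illustration, not measurements.

```python
def weak_scaling_efficiency(t1: float, tn: float) -> float:
    """Weak scaling efficiency t1 / tN.

    t1: time for one work unit on one processing element.
    tn: time for N work units on N processing elements.
    """
    if t1 <= 0.0 or tn <= 0.0:
        raise ValueError("timings must be positive")
    return t1 / tn


if __name__ == "__main__":
    # Hypothetical wallclock times in seconds; the workload per process is constant.
    timings = {1: 10.0, 8: 10.5, 64: 11.8, 512: 14.0}
    t1 = timings[1]
    for n_procs, tn in sorted(timings.items()):
        eff = weak_scaling_efficiency(t1, tn)
        print(f"N={n_procs:4d}  time={tn:5.1f} s  efficiency={eff:.2f}")
```

An efficiency of 1 means the run time stayed constant while the problem grew, which is the ideal case described above.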
The concepts of weak and strong scaling are ideals that tend not to be achieved in practice, with real-world applications having some of each present. Furthermore, it is the combination of application and computer architecture that determines the type of scaling that occurs. For example, shared memory systems and distributed memory, message passing systems scale differently. In addition, a data parallel application (one in which each node can work on its own separate data set) will by its very nature scale weakly. Before we go on and set you working on some examples of scaling, we should introduce a note of caution. Realistic applications tend to have various levels of complexity, so it may not be obvious just how to measure the increase in “size” of a problem. For instance, it is known that the solution of a set of N linear equations via Gaussian elimination requires O(N^3) floating-point operations (flops). This means that doubling the number of equations does not make the “problem” twice as large, but rather eight times as large! Likewise, if we are solving partial differential equations on a three-dimensional spatial grid and a 1-D time grid, then the problem size would scale like N^4.
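The effect of this on a weak scaling test can be sketched as follows, under the assumption that the work grows like the problem dimension raised to a fixed exponent (3 for Gaussian elimination, 4 for the space-time grid). The function names and the base size of 1000 are illustrative, and memory limits are deliberately ignored.

```python
def work_ratio(size_ratio: float, exponent: float) -> float:
    """Factor by which the work grows when the problem dimension grows by size_ratio."""
    return size_ratio ** exponent


def weak_scaling_size(base_size: float, n_procs: int, exponent: float) -> float:
    """Problem dimension that keeps the work per processor constant.

    Assumes work ~ size**exponent, so size must grow like n_procs**(1 / exponent).
    """
    return base_size * n_procs ** (1.0 / exponent)


if __name__ == "__main__":
    print("doubling N, O(N^3) work:", work_ratio(2, 3))  # eight times the work
    print("doubling N, O(N^4) work:", work_ratio(2, 4))  # sixteen times the work
    for p in (1, 8, 64, 512):
        n = weak_scaling_size(1000, p, 3)
        print(f"{p:4d} processors -> N of about {n:.0f} for constant work per processor")
```

For an O(N^3) method, eight times as many processors therefore justify only a doubling of N if the work per processor is to stay the same.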
Measuring parallel scaling performance
When using HPC clusters, it is almost always worthwhile to measure the parallel scaling of your jobs. The measurement of strong scaling is done by testing how the overall computational time of the job scales with the number of processing elements (being either threads or MPI processes), while the test for weak scaling is done by increasing both the job size and the number of processing elements. The results from the parallel scaling tests will provide a good indication of the amount of resources to request for the size of the particular job.
Scaling Measurement Guidelines
Further to basic code performance and optimization concerns (i.e. the single-thread performance), you should consider the following when timing your application:
1. Use wallclock time units or equivalent.
   o e.g. timesteps completed per second, etc.
2. Measure using job sizes that span:
   o from 1 to the number of processing elements per node for threaded jobs.
   o from 1 to the total number of processes requested for MPI.
   o job size increments should be in powers of 2 or equivalent (cube powers for weak-scaling 3D simulations, for example).
   o NOTE: it is inappropriate to refer to scaling numbers with more than 1 CPU as the baseline.
     - in scenarios where the memory requirements exceed what is available on a single node, one should provide scaling performance for smaller data sets (lower resolution) so that scaling performance can be compared throughout the entire range from 1 to the number of processes one wishes to use, or as close to this as possible, in addition to any results at the desired problem size.
3. Measure multiple independent runs per job size.
   o average results and remove outliers as appropriate.
4. Use a problem state or configuration that best matches your intended production runs.
   o scaling should be measured based on the overall performance of the application.
   o no simplified models or preferential configurations.
5. Various factors must be taken into account when more than one node is used:
   a) Interconnect speed and latency
   b) Max memory per node
   c) Processors per node
   d) Max processors (nodes)
   e) System variables and restrictions (e.g. stack size)
   NOTE: For applications using MPI, optimizing the MPI settings can also dramatically improve the application performance. MPI applications also require a certain amount of memory for each MPI process, which obviously increases with the number of processors and MPI processes used.
6. Additionally, though not necessarily, measure on different systems if possible, most importantly ones that have significantly different processor/network balances (i.e. CPU speed vs. interconnect speed).
NOTE: Point 5 as mentioned is not necessary, but it can be used if code optimization is to be done.
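Guidelines 1 to 3 can be sketched as a small timing harness. In the Python example below, `run` is a hypothetical placeholder for whatever launches the real job with a given number of processes, and `fake_job` is only a stand-in so that the script runs anywhere. The median is used here as one simple way of limiting the influence of outliers; averaging after removing outliers would serve equally well.

```python
import statistics
import time
from typing import Callable, Dict, List


def process_counts(max_procs: int) -> List[int]:
    """Job sizes in powers of two, starting from the single-process baseline."""
    counts = []
    n = 1
    while n <= max_procs:
        counts.append(n)
        n *= 2
    return counts


def time_runs(run: Callable[[int], None], n_procs: int, repeats: int) -> List[float]:
    """Wallclock time of several independent runs at one job size."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run(n_procs)
        timings.append(time.perf_counter() - start)
    return timings


def measure(run: Callable[[int], None], max_procs: int, repeats: int = 5) -> Dict[int, float]:
    """Median wallclock time per job size, from 1 up to max_procs."""
    return {
        n: statistics.median(time_runs(run, n, repeats))
        for n in process_counts(max_procs)
    }


if __name__ == "__main__":
    def fake_job(n_procs: int) -> None:
        # Stand-in for launching the real application; it only sleeps.
        time.sleep(0.02 / n_procs)

    for n, t in measure(fake_job, max_procs=8, repeats=3).items():
        print(f"{n:3d} processes: median wallclock {t:.4f} s")
```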
Once you have timed your application, you should convert the results to scaling efficiencies as explained below. To demonstrate both weak and strong scaling, a simple conjugate gradient code (https://github.com/yuhlearn/conjugate_gradient) from GitHub is used. The code is parallelized using MPI, and the user defines the problem size N as the first argument when executing the code in the terminal. From N, an N x N matrix is generated and filled with values using the random number generator function rand() in C. Since this example is used only to demonstrate scaling, the output of the code is not analyzed for correctness, but care is taken that the code runs successfully. For strong scaling, a problem size of N = 40000 is chosen and kept constant while the number of processors is increased. The code is run on the compute nodes of Noctua at PC2. The details of the nodes are:
Noctua, PC2, Paderborn | |
---|---|
CPUs per node | 2 (20 cores per CPU) |
CPU type | Intel Xeon Gold 6148 |
main memory per node | 192 GiB |
interconnect | Intel Omni-Path 100 Gb/s |
accelerators used (such as GPUs) | no |
number of MPI processes per node | 40 |
number of threads per MPI process (e.g. OpenMP threads) | 1 |
The table below gives an overview of the strong scaling speedup achieved for the conjugate gradient code.
Strong Scaling
To recall, in the case of strong scaling the number of processors is increased while the problem size remains constant. This also results in a reduced workload per processor.
#problem size (N x N) = 1600000000, where N is Matrix size and N = 40000 for all different processor numbers | |||
---|---|---|---|
#Processors | #time in seconds | (Amdahl’s law) #speedup = (Ts/Tp) | #efficiency = (Ts x Ns) / (Tp x Np) |
1 (1 node) | 64.424242 | 1 | 1 |
2 (1 node) | 33.901724 | 1.90 | 0.95 |
4 (1 node) | 17.449995 | 3.69 | 0.92 |
8 (1 node) | 8.734972 | 7.38 | 0.92 |
16 (1 node) | 4.789075 | 13.45 | 0.84 |
32 (1 node) | 2.749116 | 23.43 | 0.73 |
64 (2 nodes) | 1.627157 | 39.59 | 0.62 |
128 (4 nodes) | 1.017307 | 63.33 | 0.49 |
256 (7 nodes) | 1.436728 | 44.84 | 0.18 |
512 (13 nodes) | 3.689217 | 17.46 | 0.03 |
1024 (26 nodes) | 4.709213 | 13.68 | 0.01 |
2048 (52 nodes) | 21.462228 | 3.00 | 0.001 |
Note: Because of the poor speedup, the row with 2048 processors has not been plotted; it is given for your reference to indicate the trend of decreasing speedup.
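The speedup and efficiency columns can be recomputed from the measured times with a few lines of Python. The timings below are copied from the table, and the node count is derived from the 40 MPI processes per node listed for Noctua. The variable and function names are illustrative.

```python
import math

PROCS_PER_NODE = 40  # MPI processes per node on Noctua, from the node details

# Wallclock times in seconds for N = 40000, copied from the table above.
TIMINGS = {
    1: 64.424242, 2: 33.901724, 4: 17.449995, 8: 8.734972,
    16: 4.789075, 32: 2.749116, 64: 1.627157, 128: 1.017307,
    256: 1.436728, 512: 3.689217, 1024: 4.709213, 2048: 21.462228,
}


def scaling_table(timings):
    """Return rows of (processes, nodes, speedup, efficiency) relative to one process."""
    t1 = timings[1]
    rows = []
    for n in sorted(timings):
        speedup = t1 / timings[n]
        nodes = math.ceil(n / PROCS_PER_NODE)
        rows.append((n, nodes, speedup, speedup / n))
    return rows


if __name__ == "__main__":
    rows = scaling_table(TIMINGS)
    for n, nodes, speedup, efficiency in rows:
        print(f"{n:5d} procs ({nodes:2d} nodes)  speedup={speedup:6.2f}  efficiency={efficiency:.3f}")
    best = max(rows, key=lambda row: row[2])
    print(f"largest speedup at {best[0]} processes")
```

For this problem size the speedup peaks at 128 processes and then falls. This is the "sweet spot" behaviour described for strong scaling: beyond that point, adding processes makes the run slower rather than faster.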