BandwidthSaturation

From HPC Wiki

Description

The pattern "Bandwidth Saturation" describes the performance limitation caused by fully utilizing a shared data path. Which data path saturates depends on the system level: within CPU packages and whole compute nodes, the main resource for saturation is the memory subsystem, but the last-level cache is also a candidate. At cluster level, the shared data path is the interconnect network shared by all compute nodes.

Bandwidth Saturation means that the data source cannot provide more data per time interval. Since the source is shared by multiple consumers (cores), the consumers compete for new data (i.e., wait longer until their data is transferred) and therefore get less work done.

Symptoms

  • Saturating speedup across cores sharing a data path

Detection

Node-level

In order to detect saturated bandwidth, you need a reference value. For memory bandwidth at node level, a common benchmark is STREAM.

$ ./stream_c.exe
[...]
Number of Threads requested = 72
Number of Threads counted = 72
[...]
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           89404.0     0.015000     0.014317     0.019208
Scale:          89804.8     0.014478     0.014253     0.016137
Add:           111841.9     0.017226     0.017167     0.017482
Triad:         112802.2     0.017378     0.017021     0.022843
 

Results for an Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz.

When running your code, track the memory bandwidth with a suitable tool (likwid-perfctr with the L2, L3, and MEM groups; PAPI with L1*, L2*, L3* metrics; or perf with L1*, LLC*, and uncore_imc* events). As an example, we use a 2D 5-point Jacobi stencil with LIKWID MarkerAPI instrumentation:

$ likwid-perfctr -C N:0-71 -g MEM -m ./jacobi N=10000 M=10000 iters=400
[...]
+----------------------------------------+-------------+-----------+------------+-----------+
|                 Metric                 |     Sum     |    Min    |     Max    |    Avg    |
+----------------------------------------+-------------+-----------+------------+-----------+
|        Runtime (RDTSC) [s] STAT        |    633.3824 |    7.9773 |     8.8253 |    8.7970 |
|        Runtime unhalted [s] STAT       |    572.7102 |    7.1261 |     8.0533 |    7.9543 |
|            Clock [MHz] STAT            | 165407.9898 | 2297.3075 |  2297.3480 | 2297.3332 |
|                CPI STAT                |    190.4477 |    2.3898 |     2.6929 |    2.6451 |
|  Memory read bandwidth [MBytes/s] STAT |  77683.1009 |         0 | 38961.5229 | 1078.9320 |
|  Memory read data volume [GBytes] STAT |    619.8433 |         0 |   310.9481 |    8.6089 |
| Memory write bandwidth [MBytes/s] STAT |  40310.8116 |         0 | 20292.1240 |  559.8724 |
| Memory write data volume [GBytes] STAT |    321.6454 |         0 |   161.9495 |    4.4673 |
|    Memory bandwidth [MBytes/s] STAT    | 117993.9126 |         0 | 59253.6469 | 1638.8043 |
|    Memory data volume [GBytes] STAT    |    941.4886 |         0 |   472.8975 |   13.0762 |
+----------------------------------------+-------------+-----------+------------+-----------+
 

Compare the reported bandwidth with the reference from your benchmark run. If the two are comparable, your code saturates the bandwidth.

Cluster-level

At cluster level, you need tools specific to the network technology in use if you want to track saturation of the interconnect.

If you want to measure whether your MPI application cannot run faster because the memory controllers of the compute nodes are saturated, you can use LIKWID again. The node-level STREAM reference bandwidth is still valid; you just need to scale it up to the number of compute nodes.

Running the MPI-enabled 2D 5-point Jacobi (-nperdomain S:10 requests 10 MPI processes per socket; see the likwid-mpirun documentation):

$ cat host.file
node1
node2
$ likwid-mpirun -nperdomain S:10 -g MEM ./jacobi.exe  < input
[...]
+----------------------------------------+------------+-----------+-----------+-----------+
|                 Metric                 |     Sum    |    Min    |    Max    |    Avg    |
+----------------------------------------+------------+-----------+-----------+-----------+
|        Runtime (RDTSC) [s] STAT        |   69.8666  |   1.5139  |   2.2647  |   1.7467  |
|        Runtime unhalted [s] STAT       |   9.9376   |   0.2135  |   0.3213  |   0.2484  |
|            Clock [MHz] STAT            | 86581.8766 | 2139.3240 | 2180.5965 | 2164.5469 |
|                CPI STAT                |   67.0104  |   1.1761  |   2.6045  |   1.6753  |
|  Memory read bandwidth [MBytes/s] STAT | 12255.5233 |     0     | 3556.7119 |  306.3881 |
|  Memory read data volume [GBytes] STAT |   21.4013  |     0     |   5.4006  |   0.5350  |
| Memory write bandwidth [MBytes/s] STAT |  6080.5563 |     0     | 1737.9849 |  152.0139 |
| Memory write data volume [GBytes] STAT |   10.6099  |     0     |   2.6845  |   0.2652  |
|    Memory bandwidth [MBytes/s] STAT    | 18336.0795 |     0     | 5294.6968 |  458.4020 |
|    Memory data volume [GBytes] STAT    |   32.0113  |     0     |   8.0395  |   0.8003  |
+----------------------------------------+------------+-----------+-----------+-----------+
 

Based on these numbers, there is no bandwidth saturation: assuming the two compute nodes are comparable to the node benchmarked above, the scaled STREAM reference is roughly 2 × 113 GB/s ≈ 226 GB/s, while this run only sustains about 18 GB/s in total (Memory bandwidth Sum). There is therefore still room for optimization.


Possible optimizations and/or fixes

In general, saturating the bandwidth is desirable, as your code then runs at the highest performance the shared data path allows. You may still be able to do better, however, by moving less data in the first place.

If you are saturating memory bandwidth, try to increase the reuse of data already loaded into the caches (spatial blocking, temporal blocking, ...).

If you have multiple loops traversing the same data, fuse the loops.

Applicable applications or algorithms or kernels

Example applications which commonly saturate bandwidth:

  • STREAM benchmark
  • stencil codes like the Jacobi example above
  • sparse matrix-vector multiplication