Difference between revisions of "BandwidthSaturation"
Revision as of 18:43, 12 March 2019
Description
The pattern "Bandwidth Saturation" describes the performance limitation caused by fully utilizing a shared data path. It depends on the system level which data path can be saturated. Inside CPU packages and whole compute nodes, the main resource for saturation is the memory subsystem but also the last level cache is a candidate. At cluster level, the shared data path is the interconnect network shared by all compute nodes.
Symptoms
- Saturating speedup across cores sharing a data path
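The symptom can be illustrated with a deliberately simple model, assuming a fully bandwidth-bound kernel: speedup grows linearly with the core count until the shared data path is saturated and stays flat afterwards. The function name `modeled_speedup` and its parameters are hypothetical and only serve this illustration; real codes usually show a softer transition.

```c
#include <assert.h>

/* Simplified saturation model (an assumption, not a measurement):
 *   b_single: bandwidth a single core can draw from the shared path
 *   b_sat:    saturated bandwidth of the shared path (same unit)
 * The speedup on n cores is linear until n * b_single reaches b_sat. */
double modeled_speedup(int n, double b_single, double b_sat)
{
    assert(n >= 1 && b_single > 0.0 && b_sat >= b_single);
    double cap = b_sat / b_single;   /* core count at which the path saturates */
    return (double)n < cap ? (double)n : cap;
}
```

If the measured speedup of your code follows this shape across the cores that share a memory interface, bandwidth saturation is a likely cause.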
Detection
Node-level
At node level, use a hardware-counter tool such as:
- LIKWID with performance groups MEM and L3
- Same information can be provided by perf or PAPI
    $ ./stream_c.exe
    [...]
    Number of Threads requested = 72
    Number of Threads counted = 72
    [...]
    Function    Best Rate MB/s  Avg time     Min time     Max time
    Copy:           89404.0     0.015000     0.014317     0.019208
    Scale:          89804.8     0.014478     0.014253     0.016137
    Add:           111841.9     0.017226     0.017167     0.017482
    Triad:         112802.2     0.017378     0.017021     0.022843
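This STREAM run on an Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz system gives the attainable memory bandwidth of the node and serves as the reference value.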
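When running your code, track the memory bandwidth with a suitable tool: likwid-perfctr (https://github.com/RRZE-HPC/likwid/wiki/likwid-perfctr) with the L2, L3, and MEM settings, PAPI (https://icl.utk.edu/papi/index.html) with L1*, L2*, L3* metrics, or perf (https://perf.wiki.kernel.org/index.php/Main_Page) with L1*, LLC* and uncore_imc* events. As an example, we use a 2D 5-point Jacobi stencil with LIKWID MarkerAPI instrumentation: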
    $ likwid-perfctr -C N:0-71 -g MEM -m ./jacobi N=10000 M=10000 iters=400
    +----------------------------------------+-------------+-----------+------------+-----------+
    | Metric                                 | Sum         | Min       | Max        | Avg       |
    +----------------------------------------+-------------+-----------+------------+-----------+
    | Runtime (RDTSC) [s] STAT               | 633.3824    | 7.9773    | 8.8253     | 8.7970    |
    | Runtime unhalted [s] STAT              | 572.7102    | 7.1261    | 8.0533     | 7.9543    |
    | Clock [MHz] STAT                       | 165407.9898 | 2297.3075 | 2297.3480  | 2297.3332 |
    | CPI STAT                               | 190.4477    | 2.3898    | 2.6929     | 2.6451    |
    | Memory read bandwidth [MBytes/s] STAT  | 77683.1009  | 0         | 38961.5229 | 1078.9320 |
    | Memory read data volume [GBytes] STAT  | 619.8433    | 0         | 310.9481   | 8.6089    |
    | Memory write bandwidth [MBytes/s] STAT | 40310.8116  | 0         | 20292.1240 | 559.8724  |
    | Memory write data volume [GBytes] STAT | 321.6454    | 0         | 161.9495   | 4.4673    |
    | Memory bandwidth [MBytes/s] STAT       | 117993.9126 | 0         | 59253.6469 | 1638.8043 |
    | Memory data volume [GBytes] STAT       | 941.4886    | 0         | 472.8975   | 13.0762   |
    +----------------------------------------+-------------+-----------+------------+-----------+
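Compare the reported bandwidth (the Sum column of "Memory bandwidth [MBytes/s]") with the one from your benchmarking run. If the two are comparable, your code saturates the memory bandwidth. In this example the Jacobi run reaches 117993.9 MBytes/s, which is on par with the STREAM Triad rate of 112802.2 MB/s, so the stencil saturates the memory bandwidth of the node.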
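The source of the instrumented Jacobi code is not shown on this page. The following is a minimal serial sketch of how a 5-point Jacobi kernel could be wrapped in a LIKWID MarkerAPI region. The function names, the region tag "jacobi", and the buffer layout are assumptions made for illustration. The header name `likwid-marker.h` assumes LIKWID 5 or newer; older releases provide the macros through `likwid.h`. A threaded version would need additional per-thread handling that is left out here.

```c
#include <assert.h>
#include <stddef.h>

#ifdef LIKWID_PERFMON
#include <likwid-marker.h>  /* assumption: LIKWID >= 5; older releases use likwid.h */
#else
/* No-op fallbacks so the code also builds without LIKWID. */
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_START(tag)
#define LIKWID_MARKER_STOP(tag)
#define LIKWID_MARKER_CLOSE
#endif

/* One sweep of the 5-point stencil on an n x m row-major grid:
 * every interior point of dst becomes the average of its four
 * neighbours in src. Boundary points of dst are left untouched. */
void jacobi_sweep(size_t n, size_t m, const double *src, double *dst)
{
    for (size_t i = 1; i + 1 < n; ++i)
        for (size_t j = 1; j + 1 < m; ++j)
            dst[i * m + j] = 0.25 * (src[(i - 1) * m + j] + src[(i + 1) * m + j]
                                   + src[i * m + j - 1] + src[i * m + j + 1]);
}

/* Runs iters sweeps inside one marker region, swapping the two buffers
 * after each sweep. Both buffers must carry the same boundary values.
 * Returns the buffer that holds the final result. */
double *jacobi_run(size_t n, size_t m, double *a, double *b, int iters)
{
    assert(n >= 3 && m >= 3 && iters >= 0);
    LIKWID_MARKER_INIT;
    LIKWID_MARKER_START("jacobi");
    for (int it = 0; it < iters; ++it) {
        jacobi_sweep(n, m, a, b);
        double *tmp = a;
        a = b;
        b = tmp;
    }
    LIKWID_MARKER_STOP("jacobi");
    LIKWID_MARKER_CLOSE;
    return a;
}
```

To activate the markers, compile with `-DLIKWID_PERFMON`, link against the LIKWID library, and run the binary under `likwid-perfctr` with the `-m` switch as in the command above.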
Cluster-level
At cluster level, you need special tools for the specific network technology.
Possible optimizations and/or fixes
In general, saturating the bandwidth is a good sign: your code runs at the highest performance possible for the amount of data it transfers. You may still do better by reducing that amount of data.
If you are saturating memory bandwidth, you can try to increase the reuse of data already loaded into the caches (spatial blocking, temporal blocking, ...).
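As an illustration of spatial blocking, the sketch below splits the inner dimension of a 5-point Jacobi sweep into column blocks, so that the rows touched by consecutive outer iterations are short enough to stay in cache. The function names and the block-size parameter `bj` are hypothetical; a suitable block size depends on the cache sizes of the target machine and has to be determined by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Reference sweep without blocking on an n x m row-major grid. */
void sweep_plain(size_t n, size_t m, const double *src, double *dst)
{
    for (size_t i = 1; i + 1 < n; ++i)
        for (size_t j = 1; j + 1 < m; ++j)
            dst[i * m + j] = 0.25 * (src[(i - 1) * m + j] + src[(i + 1) * m + j]
                                   + src[i * m + j - 1] + src[i * m + j + 1]);
}

/* Same sweep with the inner (j) dimension split into blocks of bj columns.
 * Within one block, the three rows of src needed for row i are only
 * bj + 2 elements wide, so rows i and i + 1 can still be in cache when
 * the outer loop moves on to row i + 1. */
void sweep_blocked(size_t n, size_t m, size_t bj, const double *src, double *dst)
{
    assert(n >= 3 && m >= 3 && bj >= 1);
    for (size_t jb = 1; jb + 1 < m; jb += bj) {
        /* End of the current block, clipped to the last interior column. */
        size_t jend = (jb + bj < m - 1) ? jb + bj : m - 1;
        for (size_t i = 1; i + 1 < n; ++i)
            for (size_t j = jb; j < jend; ++j)
                dst[i * m + j] = 0.25 * (src[(i - 1) * m + j] + src[(i + 1) * m + j]
                                       + src[i * m + j - 1] + src[i * m + j + 1]);
    }
}
```

Both functions compute the same result; only the traversal order differs. Blocking pays off when whole rows no longer fit into cache between consecutive outer iterations.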
If you have multiple loops traversing the same data, fuse the loops.
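A small sketch of loop fusion, using a made-up pair of loops (scaling an array and summing it) that both read the same input. The function names are hypothetical. In the unfused variant the array `a` is streamed from memory twice once it exceeds the cache size; in the fused variant every element is loaded once and used for both operations.

```c
#include <assert.h>
#include <stddef.h>

/* Unfused: two separate passes over a[]. */
void scale_and_sum_unfused(size_t n, const double *a, double s,
                           double *b, double *sum)
{
    assert(a && b && sum);
    for (size_t i = 0; i < n; ++i)
        b[i] = s * a[i];
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += a[i];
    *sum = acc;
}

/* Fused: one pass over a[], each element is loaded once. */
void scale_and_sum_fused(size_t n, const double *a, double s,
                         double *b, double *sum)
{
    assert(a && b && sum);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double x = a[i];
        b[i] = s * x;
        acc += x;
    }
    *sum = acc;
}
```

Both variants produce the same output. Fusion is only valid when the second loop does not depend on results the first loop produces in later iterations.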
Applicable applications or algorithms or kernels
Example applications that commonly saturate bandwidth:
- STREAM benchmark
- others