FalseSharing
Latest revision as of 15:19, 3 September 2019
Description
The pattern 'False sharing' describes a performance limitation caused by cache coherency protocols. If a single core loads and modifies a cache line, the line is loaded once into the L1 cache and modified there; the cache replacement policy or explicit stores eventually evict the data from L1 towards memory. If, however, multiple cores try to modify the same cache line, the line is loaded by one core into its L1 cache, modified there, and its coherency state changes to 'modified'. This state is visible in all cache levels where the line resides (e.g., with an inclusive L2 cache, every line in L1 is also present in L2). If another core then wants to load the line for modification, the lookup finds the line in the 'modified' state, so the first core's modification has to be evicted until it reaches a cache level shared by both accessing cores; only then can the second core load and modify the line. This causes frequent evictions and HITM loads (hits in another core's cache in the 'modified' state). Colloquially, the cache line bounces back and forth between the caches. Note that the cores need not touch the same data item: it is enough that the items they write lie in the same cache line, hence the sharing is 'false'.
This pattern is limited to parallel execution.
Symptoms
- Large discrepancy from the performance model in the parallel case
- Poor scalability
Detection
Detection requires identifying frequent cache-line (CL) evictions triggered by remote caches.
Many Intel architectures provide hardware events for detecting HITM accesses. However, these events are not very accurate, so they should be treated as a qualitative indicator of false sharing rather than a quantitative measurement.
- MEM_LOAD_UOPS_L3_HIT_RETIRED_XSNP_HITM (HITM on cache lines in another core's L1/L2 cache on the same CPU socket)
- MEM_LOAD_UOPS_L3_MISS_RETIRED_REMOTE_HITM (HITM on cache lines in a remote CPU socket)
For some architectures, LIKWID provides a FALSE_SHARE group.
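Assuming LIKWID is installed and the FALSE_SHARE group is shipped for the CPU in question (group availability varies by architecture), a measurement could look like the following; the core list and binary name are placeholders:

```shell
# Run ./a.out pinned to cores 0-3 while counting the FALSE_SHARE event group
likwid-perfctr -g FALSE_SHARE -C 0-3 ./a.out
```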
Possible optimizations and/or fixes
- Avoid creating data structures that are written by multiple threads
- Use globally shared variables only in a read-only fashion
Applicable applications or algorithms or kernels
- Naive implementations of histogram codes, where all threads update a shared array of bins