FalseSharing

From HPC Wiki

Description

The pattern 'False sharing' describes a performance limitation caused by cache coherence protocols. If a single core loads and modifies a cache line, the line is loaded once into the L1 cache and modified there; the cache replacement policy or explicit stores eventually evict the data from L1 towards memory. But if multiple cores modify the same cache line, the line is first loaded by one core into its L1 cache and modified there, which changes the state of the line to 'modified'. This state is visible in all cache levels where the line resides (e.g., for inclusive L2 caches, all lines in L1 are also present in the L2 cache). If another core then wants to load the line for modification, the lookup finds the line in the 'modified' state, so the first core's modification has to be written back until it reaches a cache level shared by both accessing cores; only then can the line be loaded and modified again. This causes frequent evictions and HITM loads (hits in another core's cache in the 'modified' state). Note that the sharing is 'false' because the cores access logically distinct data; the conflict arises only because those data happen to reside in the same cache line. In colloquial speech, the cache line bounces back and forth between the caches.

This pattern is limited to parallel execution.

Symptoms

  • Large discrepancy from the performance model in the parallel case.
  • Bad scalability.


Detection

Detection requires identifying frequent cache-line evictions caused by accesses from remote cores' caches.

Many Intel architectures provide hardware events to detect HITM accesses. The problem is that these events are not very accurate; therefore, the listed events should be seen as a qualitative indicator of false sharing, not a quantitative measurement.

  • MEM_LOAD_UOPS_L3_HIT_RETIRED_XSNP_HITM (HITM loads from cache lines in another core's L1/L2 cache on the same CPU socket)
  • MEM_LOAD_UOPS_L3_MISS_RETIRED_REMOTE_HITM (HITM loads from cache lines on a remote CPU socket)

For some architectures, LIKWID provides a FALSE_SHARE group.

Possible optimizations and/or fixes

  • Avoid creating data structures that are written to by multiple threads
  • Use global shared variables in a read-only fashion


Applicable applications or algorithms or kernels

  • Naive implementations of histogram codes, where all threads update bins of one shared array