MicroArchitecturalAnomalies
Latest revision as of 07:25, 4 September 2019
Description
The pattern 'Microarchitectural anomalies' describes performance-limiting factors that are caused by the hardware design. For example, if your system has a single load unit that can issue one load per cycle, code that performs many loads may be limited by this design decision. Other cases are penalties incurred in specific situations, for example a mispredicted branch, which typically forces the pipeline to be drained and therefore costs additional cycles.
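The load-unit example above can be turned into a small back-of-the-envelope calculation. The following is a minimal sketch, not a measurement tool: the function name `min_cycles_per_iteration` and its parameters are hypothetical, and it assumes that load throughput is the only limit being considered.

```python
def min_cycles_per_iteration(loads_per_iter, load_units=1, loads_per_unit_per_cycle=1):
    """Lower bound on cycles per loop iteration imposed by load throughput alone.

    All parameters are illustrative assumptions; real values depend on the
    microarchitecture and have to be taken from vendor documentation.
    """
    if loads_per_iter < 0 or load_units <= 0 or loads_per_unit_per_cycle <= 0:
        raise ValueError("loads must be non-negative and unit counts positive")
    # Total loads the core can issue per cycle across all load units.
    loads_per_cycle = load_units * loads_per_unit_per_cycle
    return loads_per_iter / loads_per_cycle


# A loop body with 3 loads on the machine described in the text
# (one load unit, one load per cycle) needs at least 3 cycles per iteration.
print(min_cycles_per_iteration(3))                # 3.0
# The same loop on a hypothetical machine with two load units.
print(min_cycles_per_iteration(3, load_units=2))  # 1.5
```

If the measured cycles per iteration sit close to this bound, the load unit is a plausible limiter; if they are far above it, another part of the core is likely responsible.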
Symptoms
The symptoms vary, but the most general one is a large discrepancy between the measured performance and the prediction of a performance model based on load/store and arithmetic throughput.
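As a rough illustration of this symptom, the sketch below compares a measured cycle count against a simple throughput model. It assumes that loads, stores and arithmetic overlap perfectly, so the prediction is the maximum of the three individual bounds. The function names, the example numbers and the threshold of 1.5 are hypothetical choices for illustration only.

```python
def predicted_cycles(loads, stores, arith_ops,
                     loads_per_cycle, stores_per_cycle, arith_per_cycle):
    """Throughput-model prediction, assuming perfect overlap of all three
    instruction classes (an optimistic simplification)."""
    return max(loads / loads_per_cycle,
               stores / stores_per_cycle,
               arith_ops / arith_per_cycle)


def is_suspicious(measured_cycles, model_cycles, threshold=1.5):
    """Flag a run whose measured cycles exceed the model by more than
    `threshold`; the threshold value is an arbitrary assumption."""
    return measured_cycles / model_cycles > threshold


# Hypothetical loop body: 2 loads, 1 store, 2 arithmetic operations on a core
# that sustains 1 load, 1 store and 2 arithmetic operations per cycle.
model = predicted_cycles(2, 1, 2, loads_per_cycle=1, stores_per_cycle=1, arith_per_cycle=2)
print(model)                      # 2.0
print(is_suspicious(5.0, model))  # True: 2.5x slower than the model predicts
print(is_suspicious(2.2, model))  # False: close to the model
```

A flagged run only says that the simple model does not explain the measurement; which anomaly is responsible still has to be narrowed down with hardware counters.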
Detection
Since these anomalies can occur anywhere on the chip, there is no common way to detect them. Useful starting points are:
- Mispredicted branches: LIKWID group BRANCH
- Stalls: all events which match *STALL* like RESOURCE_STALLS_RS (Stalls at reservation station), RESOURCE_STALLS_SB (Stalls due to store buffer), ...
- Penalties: all events which match *CYCLES* like UOPS_ISSUED_STALL_CYCLES, UOPS_EXECUTED_STALL_CYCLES and UOPS_RETIRED_STALL_CYCLES
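The wildcard patterns in the list above can be applied to a list of event names with a few lines of code. This is a minimal sketch: the helper `select_events` is hypothetical, and the event list is hard-coded from the names mentioned above plus INSTR_RETIRED_ANY, which was added here only as a non-matching example. In practice the list would presumably come from the event listing of your LIKWID installation (for example `likwid-perfctr -e`), and the available events differ between architectures.

```python
from fnmatch import fnmatchcase

# Event names taken from the text; INSTR_RETIRED_ANY is an added example
# that should match neither pattern.
EVENTS = [
    "RESOURCE_STALLS_RS",
    "RESOURCE_STALLS_SB",
    "UOPS_ISSUED_STALL_CYCLES",
    "UOPS_EXECUTED_STALL_CYCLES",
    "UOPS_RETIRED_STALL_CYCLES",
    "INSTR_RETIRED_ANY",
]


def select_events(events, patterns):
    """Return the events that match at least one shell-style pattern,
    keeping the original order (case-sensitive match)."""
    return [e for e in events if any(fnmatchcase(e, p) for p in patterns)]


stall_events = select_events(EVENTS, ["*STALL*"])
penalty_events = select_events(EVENTS, ["*CYCLES*"])

print(stall_events)    # the two RESOURCE_STALLS_* and the three UOPS_*_STALL_CYCLES events
print(penalty_events)  # only the three UOPS_*_STALL_CYCLES events
```

Note that the two patterns overlap: the UOPS_*_STALL_CYCLES events match both *STALL* and *CYCLES*.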
Possible optimizations and/or fixes
There is no general fix for this pattern. If you can find a workaround for the specific anomaly that does not reduce the performance of your code, try it.