Performance Engineering
Latest revision as of 12:28, 19 July 2024
Introduction
HPC is about meeting high application performance requirements. There are many options for improving application performance, including algorithmic optimizations. In this article it is assumed that a given algorithm is executed on a given HPC system.
The following factors influence performance:
- Hardware design (size and layout of caches, pipelining, vector length, nowadays also up/downclocking rules)
- Implementation (programming language, layout of data structures, cost of different program constructs, programming and optimization techniques, choice of libraries)
- Code generation (compiler vendor and version, and compiler options)
- System configuration (Operating System choice, including OS settings, system libraries used)
- Execution environment (process/thread affinity and resource allocation, OS jitter)
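How strongly the interplay of hardware design and implementation choices can matter is easy to see with a small model. The following sketch (with assumed, typical numbers: 64-byte cache lines, 8-byte doubles) counts how many cache lines must be loaded to read the same amount of data contiguously versus with a large stride:

```python
# Illustrative sketch (assumed numbers): how data layout interacts with
# hardware design. We count how many 64-byte cache lines are touched when
# reading 1024 double-precision values (8 bytes each) at different strides.

CACHE_LINE = 64   # bytes per cache line (typical value, assumed)
ELEM = 8          # bytes per double

def cache_lines_touched(n_elements, stride):
    """Number of distinct cache lines touched when reading n_elements
    doubles spaced `stride` elements apart in memory."""
    lines = {(i * stride * ELEM) // CACHE_LINE for i in range(n_elements)}
    return len(lines)

contiguous = cache_lines_touched(1024, stride=1)  # unit stride
strided    = cache_lines_touched(1024, stride=8)  # one element per line

print(contiguous, strided)  # 128 vs 1024: 8x more memory traffic
```

The same amount of useful data causes eight times the memory traffic in the strided case, which is why the layout of data structures appears in the list above.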
Basic principles of application performance
Strategies to speedup execution
There are three elementary ways to speed up code execution:
- Increase the execution rate (i.e. raise the clock frequency; physical limits were reached shortly after the millennium at around 3 GHz)
- Introduce parallel (concurrent) execution, possibly on nested levels
- Specialize execution resources (reduces cost and enables more parallelism)
In modern computers all three strategies are used, with parallel execution being the most important one.
Illustrating example: Consider the high-level task of drilling 20 holes into a wall. At the beginning a single worker drills one hole after the other. The first optimization would be to hire a worker who drills every hole quicker; that increases the execution rate. Next, one could hire several workers, for example 4, who drill holes concurrently. Ideally the work should then be finished in a quarter of the time. Finally, you could buy a special-purpose drilling machine which can drill five holes in one step. Now one worker can accomplish five times as much work in the same time. Of course you can combine all of this: hire multiple fast workers and equip them with the special-purpose drilling machine.
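The drilling example can be written down as a small calculation. This is a sketch using the numbers from the text; the assumption that the faster worker is twice as fast is ours:

```python
# Drilling-example sketch: 20 holes, baseline worker drills 1 hole
# per time unit. The 2x rate for the faster worker is an assumption.

HOLES = 20

def time_to_finish(rate=1.0, workers=1, holes_per_step=1):
    """Time to drill all holes, assuming perfect load balancing.
    rate: holes per time unit per worker per step."""
    effective_rate = rate * workers * holes_per_step
    return HOLES / effective_rate

baseline = time_to_finish()                          # 20.0 time units
faster   = time_to_finish(rate=2.0)                  # 10.0 (execution rate)
parallel = time_to_finish(workers=4)                 #  5.0 (parallelism)
special  = time_to_finish(holes_per_step=5)          #  4.0 (specialization)
combined = time_to_finish(rate=2.0, workers=4, holes_per_step=5)  # 0.5
```

Combining all three strategies yields a 40x speedup over the baseline in this idealized model; real workloads rarely balance this perfectly.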
Mapping work on resources
In order to understand the following parts, it is important to introduce the basic relation between work, execution resources and time to solution. Work must be executed to accomplish an application's high-level task. The work unit is application specific: it may be requests answered, voxels processed or lattice points updated. To accomplish this work, the application utilizes computer resources. The computer's only notion of work are instructions and, triggered by instructions, data transfers from entities that can provide data to the registers. This fact complicates performance optimization, as execution efficiency is always determined on the instruction level. Application work has to be mapped onto instruction work, and this mapping may introduce overhead in the form of additional instructions. The resulting instruction and data transfer work is finally mapped onto compute resources. This mapping results in the time to solution.
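The chain from application work to instruction work to time can be sketched numerically. All numbers below are invented for illustration (a real analysis would measure them):

```python
# Sketch (all numbers assumed): application work is mapped onto
# instruction work, and instruction work onto time.

lattice_updates  = 10**8   # application work: lattice site updates
instr_per_update = 40      # instructions emitted per update,
                           # including mapping overhead (assumed)
ipc              = 2.0     # instructions retired per cycle (assumed)
clock_hz         = 3.0e9   # 3 GHz core clock (assumed)

instructions = lattice_updates * instr_per_update
cycles       = instructions / ipc
seconds      = cycles / clock_hz

updates_per_second = lattice_updates / seconds
print(f"{seconds:.3f} s, {updates_per_second:.2e} LUP/s")
```

Note how the application-level metric (lattice updates per second) and the instruction-level metrics (IPC, clock) are connected only through the mapping factor `instr_per_update`; reducing that overhead is one of the main levers of optimization.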
Runtime contributions and critical path
Everything happening during program execution takes time. The time some activity or task takes is called a runtime contribution. In the simplest case, runtime contributions form the total time to solution by simply being summed up. On modern computers, however, many things happen concurrently, and different runtime contributions can overlap with each other. The critical path is the series of runtime contributions that cannot overlap with each other and that together form the total runtime. A runtime contribution that adds to the total execution time is said to be on the critical path. Anything that takes time is only relevant for optimisation if it appears on the critical path.
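The critical path can be computed for a small task graph. The task names and durations below are made up for illustration:

```python
# Sketch: the critical path of a small task graph. A task can start only
# when all its dependencies have finished; overlapping work does not add
# to the total runtime.

from functools import lru_cache

durations = {"load": 2.0, "compute_a": 5.0, "compute_b": 3.0, "store": 1.0}
deps = {"load": [], "compute_a": ["load"], "compute_b": ["load"],
        "store": ["compute_a", "compute_b"]}

@lru_cache(maxsize=None)
def finish_time(task):
    """Earliest finish time: latest finish among dependencies plus own duration."""
    start = max((finish_time(d) for d in deps[task]), default=0.0)
    return start + durations[task]

total = finish_time("store")
print(total)  # 8.0: load -> compute_a -> store is the critical path;
              # compute_b overlaps with compute_a and does not add time
```

Shortening `compute_b` here would not reduce the total runtime at all, because it is not on the critical path; this is exactly why only critical-path contributions matter for optimisation.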
Parallel execution
Parallelism happens on many levels in computers. For performance engineering one has to separate what happens within one instruction stream (executed on one processor core) from effects introduced by solving a problem using multiple such instruction streams.
Parallel execution implies additional constraints and overheads. Work must consist of tasks that can be executed concurrently. In simple cases, tasks are embarrassingly parallel, but in most relevant applications there are dependencies between tasks, which require synchronisation or communication of data. Often, additional work has to be introduced to implement parallelism efficiently.
The following factors limit the achievable speedup by parallel execution:
- Load balancing: Can work be perfectly distributed on parallel execution resources?
- Dependencies: Are there dependencies that add to the critical path?
- Communication overhead: Is there any data transfer time that adds to the critical path?
- Instruction overhead: Additional work that is required to implement an efficient parallel execution.
From the standpoint of a single instruction stream all those influences but the last one (which adds to the processor work) are experienced as additional waiting times which may or may not add to the critical path during execution.
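The limiting factors above can be combined into a simple speedup model. The numbers are illustrative assumptions, and the model is deliberately minimal (it ignores, for example, instruction overhead):

```python
# Sketch of a simple parallel speedup model: the serial work is split over
# workers, while load imbalance and communication add waiting time on the
# critical path.

def speedup(t_serial, workers, imbalance=0.0, t_comm=0.0):
    """imbalance: fraction of extra time the slowest worker needs beyond a
    perfect split; t_comm: communication time on the critical path."""
    t_perfect  = t_serial / workers
    t_parallel = t_perfect * (1.0 + imbalance) + t_comm
    return t_serial / t_parallel

ideal = speedup(100.0, 10)                             # 10.0
real  = speedup(100.0, 10, imbalance=0.2, t_comm=3.0)  # 100/15 = 6.67
```

With 20 % load imbalance and 3 time units of communication, only two thirds of the ideal 10x speedup survive; both effects appear to the individual instruction stream as waiting time.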
Basic iterative procedure for performance engineering
The following steps are required for a minimum performance engineering process:
- Define a relevant test case which reflects production behaviour
- Start from scratch: don't try to get the best result in the first step.
  - Avoid blindly copying and pasting scripts; this keeps you from avoidable issues such as using only 32 of the 48 cores available on a node and thus losing a third of the resources.
  - Premature optimization is the root of all evil. Avoid advanced optimisation at the very beginning, e.g. pinning (an optimisation applied in the wrong way can cost far more than the improvement achievable over the vanilla state).
- Acquire a runtime profile to determine in which parts of the code the runtime is spent
- For all code parts (hot spots) of the runtime profile perform:
  - Static code analysis
  - Application benchmarking
  - Parallel case: Create and analyse runtime traces
  - Hardware performance counter profiling
- Based on the data acquired by the above activities, narrow down performance issues
- Optional: Formulate an analytic performance model to develop a profound performance expectation
- Improve performance by changing the runtime configuration or the implementation
Those steps need to be repeated multiple times for every hot spot until a required or good enough performance is reached. After applying an optimisation, it must be ensured that the optimised variants are used and take effect in regular production.
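The runtime-profile step above can be sketched with Python's built-in cProfile on a toy program (the hot spot is deliberately obvious; for compiled HPC codes, tools such as gprof or perf serve the same purpose):

```python
# Sketch: acquire a runtime profile to find where time is spent.

import cProfile
import io
import pstats

def hot_spot():
    return sum(i * i for i in range(200_000))

def cold_part():
    return sum(range(1_000))

def main():
    for _ in range(20):
        hot_spot()
    cold_part()

profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Sort by cumulative time: the hot spot rises to the top of the report.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print("hot_spot" in report)
```

The point of this step is only to rank code parts by runtime; the later activities (static analysis, benchmarking, counter profiling) then explain *why* the top entries are expensive.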
To carry out the above procedure, multiple special skills and knowledge areas are required:
- Perform application benchmarking
- Perform micro benchmarking
- Create a runtime profile
- Create a performance profile
- Formulate a performance model
- Plan and apply performance optimisations
Moreover, a performance engineer has to know about application algorithms, processor architecture, operating system internals, programming models and their implementations, as well as compiler behavior and options. To complicate things even further, many of these knowledge areas are moving targets: issues depend on the version of the OS or compiler, not to mention features introduced with new processor architectures. A thorough knowledge of computer architecture for software developers is essential to be successful.
Strategies for performance analysis
After definition of a benchmark case, application benchmarking and performance profiling, the interpretation and analysis of the results is the first difficult task in any performance engineering effort. While there is no silver bullet for performance analysis, multiple strategies provide guidelines for different levels of expertise. It must be noted that in complicated cases the software developer carrying out the process must possess a certain level of experience to succeed. Therefore it is recommended to consult an experienced HPC consultant in the local HPC center if no progress is achieved using the simpler approaches.
We introduce two approaches:
- ProPE PE Process: Threshold-based performance analysis based on the proven approach of the EU CoE POP project, for a rough initial performance analysis, also suited for beginners
- Performance Patterns: Performance-pattern-based process for more complicated cases targeted at experienced software developers
A detailed Performance Pattern List gives more insight into various patterns.
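As a flavour of the threshold-based approach, the POP methodology derives efficiency metrics from per-process timings. The sketch below computes the load-balance efficiency (average over maximum useful compute time); the per-process times and the 0.85 threshold are illustrative assumptions:

```python
# Sketch of a threshold-based first analysis in the spirit of the POP
# efficiency metrics (numbers invented for illustration).

compute_times = [9.0, 10.0, 8.0, 9.0]  # useful compute seconds per process

# Load balance efficiency: average useful time divided by maximum useful
# time; 1.0 means perfectly balanced work.
load_balance = sum(compute_times) / (len(compute_times) * max(compute_times))
print(f"load balance efficiency: {load_balance:.2f}")

# A rule-of-thumb threshold (assumed here) flags values below ~0.85:
if load_balance < 0.85:
    print("investigate work distribution")
```

Such metrics give a beginner a concrete, comparable number before any deeper pattern-based analysis is attempted.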
Performance Logbook
Documentation of activities, settings and results is crucial in any performance engineering effort. It helps to keep track of changes and results, make progress in a deterministic way, and to decide what next steps are required. We created a performance logbook template that already has placeholders for all important activities during a performance engineering effort. The logbook consists of a central text document in markdown format together with folders for referenced figures, dumped settings and results and everything else that is relevant to be saved during the process.
The logbook is targeted at computing centers carrying out customer projects, but can be useful to anyone doing a performance project. The logbook consists of a preamble with meta information about the project, the software and the people involved, multiple analyst sessions and a summary. Every analyst session itself consists of one or more benchmarking blocks.
There are five preconfigured activities within a benchmarking block:
- Runtime Profile - Create, analyse and discuss a runtime profile
- Performance Profile - Create, analyse and discuss a performance profile of any kind.
- Result - Application benchmarking runs.
- Analysis - Putting together analysis, observations and further planning based on previous activities.
- Optimisation - Any changes that are attempted to improve performance. This can be for example changing runtime settings or modifying the code.
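A benchmarking block in the logbook's markdown document might look like the following sketch. All names, file paths and findings are hypothetical; the actual template on GitHub is authoritative:

```markdown
## Session 2024-07-19, analyst: J. Doe

### Benchmarking block: OpenMP scaling of solver kernel

#### Runtime Profile
- gprof output in `figures/profile-run17.png`; 72 % of time in `sweep()`

#### Result
- 10 runs, 1-48 cores, fixed problem size; raw data in `results/run17/`

#### Analysis
- Scaling flattens beyond 24 cores; suspect memory bandwidth saturation

#### Optimisation
- Next step: apply thread pinning, one thread per physical core
```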
You can find the performance logbook template, including examples, here.
Related pages
- Computer architecture for software developers
- Application benchmarking
- Micro benchmarking
- Runtime profiling
- Performance profiling
- Performance model
- Performance optimisations
- ProPE PE Process
- Performance Patterns based PE process
- Performance Pattern List
Links and further information
- Node-level Performance Engineering: tutorial material of the HPC group at RRZE, covering all relevant topics in much detail. Slides are available for download.