Difference between revisions of "Performance Engineering"

From HPC Wiki
Jump to navigation Jump to search
Line 1: Line 1:
 
== Introduction ==
 
== Introduction ==
  
HPC is about high application performance requirements. There exist many
+
HPC is about high application performance requirements. There are many
 
options to improve the performance of an application code. In the following it
 
options to improve the performance of an application code. In the following it
 
is assumed that a given algorithm is executed on a given HPC system.
 
is assumed that a given algorithm is executed on a given HPC system.
Line 25: Line 25:
 
most important one.
 
most important one.
  
Illustrating example: Consider the high level task of drilling 20 holes in
+
Illustrating example: Consider the high level task of drilling 20 holes into
 
a wall. At the beginning a single worker drills one hole after the other. The
 
a wall. At the beginning a single worker drills one hole after the other. The
 
first optimization would be to hire a worker who drills every hole quicker,
 
first optimization would be to hire a worker who drills every hole quicker,
that's increasing the execution rate. Next one could hire several workers, for
+
that's increasing the execution rate. Next, one could hire several workers, for
example 4, which drill holes concurrently. Ideally the work should than be
+
example 4, which drill holes concurrently. Ideally the work should then be
 
finished in a quarter of the time one worker requires. Finally you could buy
 
finished in a quarter of the time one worker requires. Finally you could buy
 
a special purpose drilling machine which can drill five holes in one step. Now
 
a special purpose drilling machine which can drill five holes in one step. Now
Line 38: Line 38:
 
=== Mapping of work on computer resources ===
 
=== Mapping of work on computer resources ===
  
In order to understand the following parts it is important to understand the relation between work, execution resources
+
In order to understand the following parts, it is important to understand the relation between work, execution resources
 
and time to solution. The application generates work to accomplish a high level task.
 
and time to solution. The application generates work to accomplish a high level task.
 
What this work actually is is specific to the application. It may be requests answered, voxels processed or lattice points updated.
 
What this work actually is is specific to the application. It may be requests answered, voxels processed or lattice points updated.
To accomplish this work the application uses the resources offered by the computer.
+
To accomplish this work, the application uses the resources offered by the computer.
A computers only notion of work are instructions and, triggered by
+
A computer's only notion of work are instructions and, triggered by
 
instructions, data transfers from entities that can store data to registers.
 
instructions, data transfers from entities that can store data to registers.
 
This fact complicates performance optimization as execution efficiency always
 
This fact complicates performance optimization as execution efficiency always
Line 51: Line 51:
 
=== Runtime contributions and the critical path ===
 
=== Runtime contributions and the critical path ===
  
Everything happing during program execution takes time. The time some activity or task takes is called a runtime contribution.
+
Everything happening during program execution takes time. The time some activity or task takes is called a runtime contribution.
In the simplest case runtime contributions form the total time to solution by simply being summed up. Still on modern computers many things happen concurrently and different runtime contributions might overlap with each other.
+
In the simplest case, runtime contributions form the total time to solution by simply being summed up. Still on modern computers many things happen concurrently and different runtime contributions might overlap each other.
  
The critical path is the series of runtime contributions which can not be
+
The critical path is the series of runtime contributions which can not
overlapped with each other and forms the total runtime. A runtime contribution
+
overlap each other and forms the total runtime. A runtime contribution
 
that adds to the total execution time is said to be on the critical path.
 
that adds to the total execution time is said to be on the critical path.
 
Anything that takes time is only relevant for optimization if it appears on the
 
Anything that takes time is only relevant for optimization if it appears on the
Line 65: Line 65:
  
 
Parallel execution implies additional constraints and overheads. Work must
 
Parallel execution implies additional constraints and overheads. Work must
consist of tasks that can be executed concurrently. In simple cases tasks are
+
consist of tasks that can be executed concurrently. In simple cases, tasks are
embarrassingly parallel but in most relevant applications there exist
+
embarrassingly parallel but in most relevant applications there are
 
dependencies between tasks which require synchronization or communication of
 
dependencies between tasks which require synchronization or communication of
 
data. Often in order to exploit parallelism additional work has to be
 
data. Often in order to exploit parallelism additional work has to be
Line 85: Line 85:
  
 
* Define a relevant test case which reflects production behavior
 
* Define a relevant test case which reflects production behavior
* Aquire runtime profile to determine on which parts of the code the processing time is spent
+
* Acquire runtime profile to determine on which parts of the code the processing time is spent
 
* For all code parts (hot spots) of the runtime profile perform:
 
* For all code parts (hot spots) of the runtime profile perform:
 
** Static code analysis
 
** Static code analysis
 
** Application benchmarking
 
** Application benchmarking
 
** Parallel case: Create and analyse runtime trace
 
** Parallel case: Create and analyse runtime trace
** Instrumentation based hardware performance counter profiling
+
** Instrumentation-based hardware performance counter profiling
* Based on the data aquired by above activities narrow down performance issues
+
* Based on the data acquired by above activities narrow down performance issues
 
* Improve performance by changing runtime setup or implementation
 
* Improve performance by changing runtime setup or implementation
  
Line 98: Line 98:
 
optimised variants are used and taking effect in regular production.
 
optimised variants are used and taking effect in regular production.
  
To carry out above procedure multiple special skills beyond standard software
+
To carry out above procedure, multiple special skills beyond standard software
 
engineering are required:
 
engineering are required:
  
Line 111: Line 111:
  
 
After definition of a benchmark case, application benchmarking and performance
 
After definition of a benchmark case, application benchmarking and performance
profiling the interpretation and analysis of the results is the first difficult
+
profiling, the interpretation and analysis of the results is the first difficult
 
task in any performance engineering effort. While there is no silver bullet for
 
task in any performance engineering effort. While there is no silver bullet for
performance analysis multiple strategies provide guidelines for different
+
performance analysis, multiple strategies provide guidelines for different
 
levels of expertise. It must be noted that in complicated cases the software
 
levels of expertise. It must be noted that in complicated cases the software
 
developer carrying out the process must possess a certain level of experience
 
developer carrying out the process must possess a certain level of experience
Line 122: Line 122:
 
Three approaches are described in more detail:
 
Three approaches are described in more detail:
  
* [[ProPE PE Process]]: Threshold based performance analysis process based on the proven EU COE [[https://pop-coe.eu/further-information/learning-material POP project]] approach for a rough initial performance analysis suited also for beginners
+
* [[ProPE PE Process]]: Threshold-based performance analysis process based on the proven EU COE [[https://pop-coe.eu/further-information/learning-material POP project]] approach for a rough initial performance analysis suited also for beginners
* [[Performance Patterns]]: Performance pattern based process for more complicated cases targeted at experienced software developers
+
* [[Performance Patterns]]: Performance-pattern-based process for more complicated cases targeted at experienced software developers
* [[Instruction Counting]]: An instruction count based approach applicable for the special case of instruction throughput limited codes on SIMD architecture
+
* [[Instruction Counting]]: An instruction-count-based approach applicable for the special case of instruction throughput limited codes on SIMD architecture

Revision as of 15:05, 24 January 2019

Introduction

HPC is about high application performance requirements. There are many options to improve the performance of an application code. In the following it is assumed that a given algorithm is executed on a given HPC system.

The following factors influence the performance:

  • Implementation of the algorithm (programming language, optimisation techniques)
  • Compiler used and compiler options
  • Machine and operating system configuration
  • Runtime setup (pinning and resource allocation)

Basic principles of application performance

Strategies to speedup execution

There are three elementary ways to speed up execution on a computer:

  • Increase the execution rate (on computers increase clock)
  • Introduce parallel (concurrent) execution
  • Specialization of execution resources

In computers all three strategies are used with parallel execution being the most important one.

Illustrating example: Consider the high level task of drilling 20 holes into a wall. At the beginning a single worker drills one hole after the other. The first optimization would be to hire a worker who drills every hole quicker, that's increasing the execution rate. Next, one could hire several workers, for example 4, which drill holes concurrently. Ideally the work should then be finished in a quarter of the time one worker requires. Finally you could buy a special purpose drilling machine which can drill five holes in one step. Now one worker can accomplish five times as much work in the same time. Of course you can combine all of this: Hire multiple fast workers and equip them with the special purpose drilling machine.

Mapping of work on computer resources

In order to understand the following parts, it is important to understand the relation between work, execution resources and time to solution. The application generates work to accomplish a high level task. What this work actually is is specific to the application. It may be requests answered, voxels processed or lattice points updated. To accomplish this work, the application uses the resources offered by the computer. A computer's only notion of work are instructions and, triggered by instructions, data transfers from entities that can store data to registers. This fact complicates performance optimization as execution efficiency always happens on the instruction level. The application work has to be mapped on instruction work and this mapping might add overhead not related to the application work. The resulting instruction and resulting data transfer work can then be mapped on computer resources. The mapping finally gives the time to solution.

Runtime contributions and the critical path

Everything happening during program execution takes time. The time some activity or task takes is called a runtime contribution. In the simplest case, runtime contributions form the total time to solution by simply being summed up. Still on modern computers many things happen concurrently and different runtime contributions might overlap each other.

The critical path is the series of runtime contributions which can not overlap each other and forms the total runtime. A runtime contribution that adds to the total execution time is said to be on the critical path. Anything that takes time is only relevant for optimization if it appears on the critical path.

Parallel execution

Parallelism happens on many levels on computers. For performance engineering one has to separate what happens within one instruction stream (executed on one processor core) and effects introduced by solving a problem using multiple of such instruction streams.

Parallel execution implies additional constraints and overheads. Work must consist of tasks that can be executed concurrently. In simple cases, tasks are embarrassingly parallel but in most relevant applications there are dependencies between tasks which require synchronization or communication of data. Often in order to exploit parallelism additional work has to be introduced to efficiently implement it.

The following factors limit the achievable speedup by parallel execution:

  • Load balancing: Can work be perfectly distributed on parallel execution resources?
  • Dependencies: Are there dependencies that add to the critical path?
  • Communication overhead: Is there any data transfer time that adds to the critical path?
  • Instruction overhead: Additional work that is required to implement an efficient parallel execution.

From the standpoint of a single instructions stream all those influences but the last one (which adds to the processor work) are experienced as additional waiting times which may or may not add to the critical path during execution.

Generic iterative procedure for performance engineering

The following steps are required for a minimum performance engineering process:

  • Define a relevant test case which reflects production behavior
  • Acquire runtime profile to determine on which parts of the code the processing time is spent
  • For all code parts (hot spots) of the runtime profile perform:
    • Static code analysis
    • Application benchmarking
    • Parallel case: Create and analyse runtime trace
    • Instrumentation-based hardware performance counter profiling
  • Based on the data acquired by above activities narrow down performance issues
  • Improve performance by changing runtime setup or implementation

Those steps need to be repeated multiple times until a required or good enough performance is reached. After applying an optimisation it must be ensured that the optimised variants are used and taking effect in regular production.

To carry out above procedure, multiple special skills beyond standard software engineering are required:

Those skills are documented in separate articles and will be expected in the following.

Strategies for performance analysis

After definition of a benchmark case, application benchmarking and performance profiling, the interpretation and analysis of the results is the first difficult task in any performance engineering effort. While there is no silver bullet for performance analysis, multiple strategies provide guidelines for different levels of expertise. It must be noted that in complicated cases the software developer carrying out the process must possess a certain level of experience to succeed. Therefore it is recommended to consult an experienced HPC consultant in the local HPC center if no progress is achieved using the simpler approaches.

Three approaches are described in more detail:

  • ProPE PE Process: Threshold-based performance analysis process based on the proven EU COE [POP project] approach for a rough initial performance analysis suited also for beginners
  • Performance Patterns: Performance-pattern-based process for more complicated cases targeted at experienced software developers
  • Instruction Counting: An instruction-count-based approach applicable for the special case of instruction throughput limited codes on SIMD architecture