Performance model

From HPC Wiki
Jump to navigation Jump to search

There is a wide range of completely different things coined under the term performance model. Here we adopt “the physics way” of using models in computer science, specifically for the interaction between hardware and software. A model in this sense is a “mathematical description” based on a simplified machine model that ignores most of the details of what is going on under the hood; it makes certain assumptions, which must be clearly specified so that the range of applicability of the model is entirely clear.

Introduction

For performance engineering the focus is on resource-based analytic loop performance models. To formulate a performance model generates knowledge about software-hardware interaction. Its main purpose is to come up with a quantitative estimate for an expected performance. Without an expected performance estimate it is impossible to decide on performance optimizations as there is no clear knowledge what aspect of software/hardware interaction limits the performance and what could be the optimal performance. An example process employing performance modelling is the performance Performance Patterns based performance engineering process. You formulate a model to estimate expected performance and compare this to application benchmarking. Additionally performance profiling may be used to validate model predictions. In case the validation fails either the profiling or performance measurement are wrong, the model assumptions are not met or the model inputs are wrong. During the process to bring the model estimate and the measured in-line the knowledge about software-hardware interaction is increased and therefore the trust in the performance analysis. It is clear that you will not set up a performance model for every loop kernel. And of course you can do a meaningful performance analysis without a model, e.g. based on rough upper limit estimates. Still for deciding on major code optimizations a model can give you the required confidence that you make a deliberate decision.

One premise that is often valid in scientific computing is the steady state assumption: Most programs comprise loops that are much longer than typical pipeline lengths or other hardware latencies. The execution in each of those loops can be seen as continuous streams of data (input and/or output) being manipulated by a continuous, periodic stream of instructions. The two resources offered by stored-program computers are: executing instructions and transferring data. The complexity in finding out the light-speed performance of a loop kernel on a specific processor is caused by the fact that the execution as well as data transfer rate is specific for the the code executed. The most popular model with this respect is the roofline model, whose basic concepts had been in use since the late 1980s (coined under the term balance metric). It was popularized and refined by Sam Williams in 2009.

Roofline model

The naïve Roofline is obtained by applying simple bound and bottleneck analysis. In this formulation of the Roofline model, there are only two parameters, the peak performance and the peak bandwidth of the specific architecture, and one variable, the code balance.

where is the attainable performance, is the peak performance, is the peak bandwidth and is the code balance, where is the number of flops in the loop (i.e., the amount of “work”) and is the amount of data transferred through the memory interface. The code balance, i.e., a single number, is the part of the roofline model describing the code (the application model), while and characterize the machine (the machine model). Despite these bold simplifications, the strictly light-speed predictions of the model are not only absolute upper bounds but also accurate to a useful degree in many cases. In order to determine its area of applicability we have to state clearly the assumptions that go into it:

  • Steady state assumption. This applies to practically all performance models: Start-up and wind-down effects can be neglected, and a continuous stream of instructions and data is processed by the CPU.
  • Overlap assumption. Data transfers and arithmetic code execution overlap perfectly. As a consequence, the “slowest bottleneck” wins, no matter by which margin, and there is no interaction between bottlenecks.
  • No-latency assumption. Data paths work at their highest achievable bandwidth, and latency effects are ignored.
  • Saturation assumption. It must be possible, on the hardware at hand, to actually saturate the memory bandwidth.

The most crucial of these in view of the roofline model is certainly the overlap assumption. In fact, if we accept above formulation as a predictive model for runtime, we have to presume full overlap as well. The saturation assumption leads to an important consequence for practical applications: Since a single core can often not saturate the memory bandwidth of a chip, the model is best applied to the full chip, i.e, P peak is the full-chip (all cores) peak performance, and all bandwidth numbers pertain to the full chip as well.

This relation can be visualized in a graphical representation:

The machine parameters in the roofline formula determine a function that is specific for a given machine, and from which the performance limit can be read off easily for any value of . For historical reasons we use intensity, on the x axis the reciprocal of the memory code balance, , the computational intensity so that the performance function reads Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle P = min(P_\text{peak} , I · b} . The Figure shows clearly how this representation came to serve as a name giver for the model. The upper limit (drawn as a solid line) has two branches, or ceilings: At low intensity (i.e., high code balance), where P = I ·b, performance is linear in with a slope of b. Increasing the intensity (i.e., reducing the code balance) leads to a proportional gain in performance. This is the memory-bound ceiling, where the code does little work for every byte of data and is thus limited by memory bandwidth. At high intensity (i.e., low code balance), we get because the code is busy doing arithmetic with little “hunger” for data, and the memory interface can keep up. This is the compute-bound ceiling. Note that the roofline diagram is usually drawn with logarithmic scales. This has the interesting consequence that a change in memory bandwidth, which would lead to a different slope in a linear graph, causes a parallel shift of the bandwidth ceiling.

ECM model

Links and further information