Performance model
The term performance model covers a wide range of quite different things. Here we adopt “the physics way” of using models in computer science, specifically for the interaction between hardware and software. A model in this sense is a “mathematical description” based on a simplified machine model that ignores most of the details of what is going on under the hood; it makes certain assumptions, which must be clearly specified so that the range of applicability of the model is entirely clear.
Introduction
For performance engineering the focus is on resource-based analytic loop performance models. Formulating a performance model generates knowledge about software-hardware interaction. Its main purpose is to produce a quantitative estimate of the expected performance. Without such an estimate it is impossible to decide on performance optimizations, because it is unclear which aspect of the software-hardware interaction limits performance and what the optimal performance would be.
One process that employs performance modelling is the Performance Patterns based performance engineering process. You formulate a model to estimate the expected performance and compare the estimate with application benchmark results. Additionally, performance profiling may be used to validate model predictions. If the validation fails, either the profiling or the performance measurement is wrong, the model assumptions are not met, or the model inputs are wrong. Bringing the model estimate and the measurement in line increases the knowledge about software-hardware interaction, and with it the trust in the performance analysis.
You will not set up a performance model for every loop kernel, and a meaningful performance analysis is possible without one, e.g. based on rough upper-limit estimates. Still, when deciding on major code optimizations, a model can give you the confidence that you are making a deliberate decision.
One premise that is often valid in scientific computing is the steady state assumption: most programs comprise loops that run much longer than typical pipeline lengths or other hardware latencies. The execution of each of those loops can be seen as continuous streams of data (input and/or output) being manipulated by a continuous, periodic stream of instructions. Stored-program computers offer two resources: executing instructions and transferring data. Determining the light-speed performance of a loop kernel on a specific processor is difficult because both the instruction execution rate and the data transfer rate depend on the code being executed. The most popular model in this respect is the Roofline model, whose basic concepts have been in use since the late 1980s (under the term balance metric). It was popularized and refined by Sam Williams in 2009.
Roofline model
The naïve Roofline is obtained by applying simple bound and bottleneck analysis. In this formulation of the Roofline model there are only two parameters, the peak performance and the peak bandwidth of the specific architecture, and one variable, the code balance:
$$P = \min\left(P_\mathrm{peak}, \frac{b_S}{B_c}\right)$$
where $P$ is the attainable performance, $P_\mathrm{peak}$ is the peak performance, $b_S$ is the peak bandwidth and $B_c$ is the code balance, i.e., the data traffic per unit of work (e.g., in bytes/flop). The code balance, a single number, is the part of the Roofline model describing the code (the application model), while $P_\mathrm{peak}$ and $b_S$ characterize the machine (the machine model). Despite these bold simplifications, the strictly light-speed predictions of the model are not only absolute upper bounds but also accurate to a useful degree in many cases. In order to determine its area of applicability we have to state clearly the assumptions that go into it:
- Steady state assumption. This applies to practically all performance models: Start-up and wind-down effects can be neglected, and a continuous stream of instructions and data is processed by the CPU.
- Overlap assumption. Data transfers and arithmetic code execution overlap perfectly. As a consequence, the “slowest bottleneck” wins, no matter by which margin, and there is no interaction between bottlenecks.
- No-latency assumption. Data paths work at their highest achievable bandwidth, and latency effects are ignored.
- Saturation assumption. It must be possible, on the hardware at hand, to actually saturate the memory bandwidth.
The most crucial of these for the Roofline model is certainly the overlap assumption. In fact, if we accept the above formulation as a predictive model for runtime, we have to presume full overlap as well. The saturation assumption has an important consequence for practical applications: since a single core often cannot saturate the memory bandwidth of a chip, the model is best applied to the full chip, i.e., $P_\mathrm{peak}$ is the full-chip (all cores) peak performance, and all bandwidth numbers pertain to the full chip as well.
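As an illustration, the naïve Roofline estimate (the minimum of the peak performance and the peak bandwidth divided by the code balance) can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function names, the machine numbers (1000 GFlop/s, 100 GB/s) and the example kernel are hypothetical and chosen only to make the arithmetic easy to follow. The byte count for the kernel also assumes that every store triggers a write-allocate transfer, which may not hold on a given machine.

```python
def code_balance(bytes_per_iter: float, flops_per_iter: float) -> float:
    """Code balance B_c in bytes/flop: data traffic per unit of work."""
    if flops_per_iter <= 0:
        raise ValueError("flops_per_iter must be positive")
    return bytes_per_iter / flops_per_iter


def roofline(p_peak: float, b_s: float, b_c: float) -> float:
    """Naive Roofline: P = min(P_peak, b_S / B_c).

    p_peak in GFlop/s, b_s in GB/s, b_c in bytes/flop; result in GFlop/s.
    """
    if b_c <= 0:
        raise ValueError("b_c must be positive")
    return min(p_peak, b_s / b_c)


# Hypothetical full-chip machine numbers (assumptions, not measurements).
P_PEAK = 1000.0  # peak performance in GFlop/s
B_S = 100.0      # peak memory bandwidth in GB/s

# Hypothetical kernel: a[i] = b[i] + s * c[i] in double precision.
# Per iteration: 2 flops; 2 loads + 1 store + 1 write-allocate = 32 bytes
# (the write-allocate transfer is an assumption about the machine).
triad_bc = code_balance(bytes_per_iter=32.0, flops_per_iter=2.0)
triad_perf = roofline(P_PEAK, B_S, triad_bc)

# Machine balance: the code balance at which both limits coincide.
machine_balance = B_S / P_PEAK

print(f"B_c = {triad_bc:.1f} bytes/flop")
print(f"P   = {triad_perf:.2f} GFlop/s")
print("memory bound" if triad_bc > machine_balance else "compute bound")
```

With these assumed numbers the kernel has a code balance of 16 bytes/flop, far above the machine balance of 0.1 bytes/flop, so the bandwidth term limits the estimate to 6.25 GFlop/s. In line with the saturation assumption, both machine parameters are meant as full-chip values.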
ECM model
Links and further information
- Wikipedia article on the Roofline model
- Introduction to the ECM model