Microbenchmarking
Microbenchmarking measures the runtime or performance of small to very small building blocks of real programs. Such a building block can be a common data access pattern, a sequence of operations, or even a single instruction.
Introduction
Microbenchmarking is an indispensable tool in performance engineering and serves many purposes. Among other things, it:
- provides upper limits for sustained performance
- creates knowledge about performance behavior
- helps find performance bugs in architectures
- reveals undocumented processor performance properties
- quantifies the cost of programming model constructs or runtime environments
- provides input for performance models
- helps to learn how software interacts with the hardware
One important feature of microbenchmarking is that it is not a black box but a tool to create knowledge and deeper understanding.
Recommended Tools
The difficulty in microbenchmarking is to measure exactly what you are interested in. Because the things you want to measure are usually very small, correct timing is a problem. Separating the different influences on a result may also be difficult. When implementing a microbenchmark in a high-level programming language, for example, one must ensure that the language does not add overhead that influences the results. It is therefore usually recommended to use existing benchmarks or tools, which make it easier to produce meaningful results.
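The timing problem described above is commonly addressed by repeating the kernel until the measured interval is much longer than the timer resolution, and by taking the minimum over several runs. The following is a minimal sketch of that pattern, not a replacement for the tools listed below. The names `time_kernel`, `repeats` and `min_seconds` are hypothetical, and Python is used only to illustrate the measurement logic: its interpreter overhead is exactly the kind of language influence the paragraph above warns about.

```python
import time

def time_kernel(kernel, repeats=5, min_seconds=0.01):
    """Return the best observed time per call of `kernel`, in seconds."""
    # Calibrate: double the inner iteration count until one timed batch
    # runs long enough to dwarf timer resolution and timer call overhead.
    iterations = 1
    while True:
        start = time.perf_counter()
        for _ in range(iterations):
            kernel()
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            break
        iterations *= 2

    # Repeat the batch and keep the minimum: interference from the OS or
    # other processes can make a run slower, but not faster.
    best = elapsed
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(iterations):
            kernel()
        best = min(best, time.perf_counter() - start)
    return best / iterations
```

A call such as `time_kernel(lambda: sum(range(100)))` returns the per-call time of the lambda, including the cost of the Python function call itself.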
STREAM benchmark
The STREAM benchmark is the industry standard for measuring node-level sustained memory bandwidth. It is a very simple single-file implementation of basic streaming loop kernels and is designed to come close to the achievable main memory bandwidth on most architectures. Threading is implemented using OpenMP. Meaningful results require thread affinity control. Measuring main memory bandwidth is the sole purpose of this benchmark.
likwid-bench
likwid-bench is a benchmarking application and a framework that enables rapid prototyping of multi-threaded assembly kernels. Adding a new benchmark amounts to creating a simple text file and recompiling. The framework takes care of threaded execution and pinning, data allocation and placement, time measurement, and result presentation. likwid-bench comes with a large collection of architecture-specific optimized kernels for various SIMD instruction set extensions. At the moment it is only available for x86 processors on Linux (Arm and POWER9 support is in beta).
One main advantage of likwid-bench is that kernels are implemented directly in assembly language, which rules out any influence of higher abstraction layers. This makes it possible to measure processor performance properties accurately. likwid-bench can be used for all kinds of bandwidth and instruction throughput measurements. The fine-grained control over thread and data placement also makes it possible to measure on-board interconnect bandwidth.
The Bandwidth Benchmark
The Bandwidth Benchmark is a new project whose main focus is to provide a teaching benchmark application that can also serve as a basis for your own developments. It is heavily inspired by John McCalpin's STREAM benchmark. In contrast to STREAM, its code is a blueprint for a minimal benchmark application with a generic Makefile and modules for aligned array allocation, accurate timing, and affinity settings. Those components can be used standalone in other benchmark projects. Like STREAM, the benchmark targets sustained memory bandwidth, but it comes with more streaming loop kernels and provides many basic data access patterns, from load only to a full triad, including variants with and without write-allocate data transfer.
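The difference between the variants with and without write-allocate transfers shows up in how a measured runtime is converted into a bandwidth. The sketch below does this bookkeeping for a triad kernel `a[i] = b[i] + s * c[i]` on double-precision arrays. The function name and the `write_allocate` flag are hypothetical, and GB/s is taken to mean 10^9 bytes per second. STREAM itself counts three data streams (24 bytes) per triad iteration; the fourth stream is an assumption about hardware that reads the store target into cache before writing it.

```python
BYTES_PER_DOUBLE = 8

def triad_bandwidth_gbs(n, seconds, write_allocate=False):
    """Bandwidth in GB/s for n iterations of a[i] = b[i] + s * c[i]."""
    # Two loads (b, c) and one store (a) per iteration.
    streams = 3
    if write_allocate:
        # The cache line of a[i] is read from memory before it is
        # overwritten, which adds one more stream of real traffic.
        streams += 1
    return streams * BYTES_PER_DOUBLE * n / seconds / 1e9
```

For the same runtime, the write-allocate accounting reports 4/3 of the bandwidth of the three-stream accounting, which is why a benchmark result should always state which convention it uses.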
EPCC OpenMP micro-benchmark suite
The EPCC OpenMP micro-benchmark suite is intended to measure the overheads of synchronisation, loop scheduling, and array operations in the OpenMP runtime library.
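A common way to obtain such an overhead is to time a small piece of work with and without the construct of interest and to attribute the difference to the construct. The sketch below illustrates only that subtraction. It is not OpenMP and not code from the EPCC suite: a Python `threading.Lock` stands in for the construct, and `delay`, `best_time` and `construct_overhead` are hypothetical names.

```python
import threading
import time

def delay(length=100):
    # Small fixed amount of work, run with and without the construct.
    x = 0.0
    for i in range(length):
        x += i
    return x

def best_time(body, inner_reps, outer_reps):
    # Minimum wall-clock time over outer_reps batches of inner_reps calls.
    best = float("inf")
    for _ in range(outer_reps):
        start = time.perf_counter()
        for _ in range(inner_reps):
            body()
        best = min(best, time.perf_counter() - start)
    return best

def construct_overhead(inner_reps=10000, outer_reps=5):
    """Estimated cost in seconds of one uncontended lock acquire/release."""
    lock = threading.Lock()

    def reference():
        delay()

    def with_construct():
        with lock:  # the construct under test
            delay()

    t_ref = best_time(reference, inner_reps, outer_reps)
    t_test = best_time(with_construct, inner_reps, outer_reps)
    # Both bodies share the same call structure, so the difference is
    # attributed to the lock alone.
    return (t_test - t_ref) / inner_reps
```

Because the result is a difference of two noisy measurements, it can be slightly negative when the construct is cheap relative to the timing noise.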
Intel MPI Benchmarks
The Intel MPI Benchmarks measure the performance of point-to-point and global communication operations for a range of message sizes. The generated benchmark data characterizes the performance of a cluster system, including node performance, network latency, and throughput efficiency of the MPI implementation used.
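Why a range of message sizes matters can be seen from a simple latency and bandwidth model, in which sending a message costs a fixed latency plus the message size divided by the asymptotic bandwidth. This model is an assumption added here for illustration and is not part of the Intel MPI Benchmarks; the function name and the numbers in the usage note are hypothetical.

```python
def effective_bandwidth(message_bytes, latency_s, peak_bytes_per_s):
    """Effective bandwidth in bytes/s under T(m) = latency + m / peak."""
    transfer_time = latency_s + message_bytes / peak_bytes_per_s
    return message_bytes / transfer_time
```

With an assumed latency of 1 microsecond and an assumed peak of 10 GB/s, an 8-byte message achieves less than 1% of the peak, while a 100 MB message achieves more than 99%. Small messages probe latency and large messages probe bandwidth.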
DGEMM (Linpack) benchmark
A reference Linpack implementation is available. Because Linpack is the benchmark used for the TOP500 HPC listing, every vendor provides an optimized implementation for its processors. At its core, Linpack performs large dense matrix-matrix multiplications (DGEMM). It measures the sustained floating-point instruction throughput for multiply-add operations, but it also puts some pressure on the memory hierarchy and on network communication.
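The floating-point rate of a matrix-matrix multiplication follows from its operation count: multiplying an m x k matrix by a k x n matrix takes 2 * m * n * k floating-point operations, one multiply and one add per inner-loop step. The sketch below counts these operations in a naive triple loop and converts a runtime into GFLOP/s. It is an illustration of the counting only, with hypothetical function names, and bears no resemblance to an optimized DGEMM.

```python
def dgemm_flops(m, n, k):
    # One multiply and one add for each of the m * n * k inner steps.
    return 2 * m * n * k

def naive_matmul(a, b):
    """Return (a @ b, number of floating-point operations performed)."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    flops = 0
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one multiply, one add
                flops += 2
            c[i][j] = acc
    return c, flops

def gflops(flops, seconds):
    # GFLOP/s with 1 GFLOP = 10^9 floating-point operations.
    return flops / seconds / 1e9
```

Dividing the measured rate by the theoretical peak of the processor gives the fraction of peak that the implementation sustains.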