Microbenchmarking

Microbenchmarking measures the runtime or performance of small to very small building blocks of real programs. Such a building block can be a common data access pattern, a sequence of operations, or even a single instruction.

Introduction

Microbenchmarking is an indispensable tool in performance engineering that serves many purposes. Among other things, it:

provides upper limits for sustained performance
creates knowledge about performance behavior
helps to find performance bugs in architectures
reveals undocumented processor performance properties
quantifies the cost of programming model constructs or runtime environments
provides input for performance models
helps to learn how software interacts with the hardware

One important feature of microbenchmarking is that it is not a black box but a tool to create knowledge and deeper understanding.

Recommended Tools

The difficulty in microbenchmarking is to measure exactly what you are interested in. Because the things you want to measure are usually very small, accurate timing is a problem. Separating the different influences on a result may also be difficult. For example, when implementing a microbenchmark in a programming language, one must ensure that the language does not add overhead that influences the results. Therefore it is usually recommended to use existing benchmarks or tools, which make it easier to produce meaningful results.
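The timing problem mentioned above is commonly handled by repeating the kernel until the measured interval is much longer than the timer resolution. The following is a minimal sketch of that idea, not a replacement for the tools below; `time_kernel`, `tiny_kernel` and the 0.2 s threshold are illustrative assumptions. Note that in an interpreted language the loop and call overhead is part of the result, which is exactly the language overhead the paragraph warns about.

```python
import time


def time_kernel(kernel, min_seconds=0.2):
    """Return (average seconds per call, repetitions used).

    The repetition count is doubled until one timed batch runs for at
    least `min_seconds`, so that the timer resolution and the cost of
    reading the timer become small compared to the measured interval.
    The overhead of the Python loop and call is still included.
    """
    reps = 1
    while True:
        start = time.perf_counter()
        for _ in range(reps):
            kernel()
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            return elapsed / reps, reps
        reps *= 2


def tiny_kernel():
    # Hypothetical stand-in for the small piece of code under test.
    return sum(range(100))


if __name__ == "__main__":
    per_call, reps = time_kernel(tiny_kernel)
    print(f"{per_call * 1e9:.1f} ns per call ({reps} repetitions)")
```

Doubling the repetition count keeps the calibration phase short while still ending with a batch that is long enough to time reliably.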

STREAM benchmark

The STREAM benchmark is the industry standard for measuring node-level sustained memory bandwidth. It is a very simple single-file implementation of basic streaming loop kernels and should come close to the maximum attainable memory bandwidth on most architectures. Threading is implemented using OpenMP. For meaningful results one has to employ thread affinity control. Measuring main memory bandwidth is the sole purpose of this benchmark.
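To make the reported numbers easier to interpret, the sketch below shows how a bandwidth figure follows from the array length and the measured time, assuming the usual STREAM convention of counting every explicit load and store of an 8-byte word. The function name `stream_bandwidth_mbs` and the example numbers are illustrative assumptions, not part of STREAM itself.

```python
# Words moved per loop iteration, counting explicit loads and stores:
#   copy:  a[i] = b[i]              -> 1 load + 1 store
#   scale: a[i] = s * b[i]          -> 1 load + 1 store
#   add:   a[i] = b[i] + c[i]       -> 2 loads + 1 store
#   triad: a[i] = b[i] + s * c[i]   -> 2 loads + 1 store
STREAM_WORDS = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def stream_bandwidth_mbs(kernel, n, seconds, word_bytes=8):
    """Bandwidth in MB/s (10**6 bytes per second) for one kernel run
    over arrays of n elements that took `seconds` to execute."""
    bytes_moved = STREAM_WORDS[kernel] * word_bytes * n
    return bytes_moved / seconds / 1.0e6


if __name__ == "__main__":
    # Hypothetical measurement: 10 million elements in 0.5 s.
    print(stream_bandwidth_mbs("triad", 10_000_000, 0.5), "MB/s")
```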

likwid-bench

likwid-bench is a benchmarking application and a framework to enable rapid prototyping of multi-threaded assembly kernels. Adding a new benchmark amounts to creating a simple text file and recompiling. The framework takes care of threaded execution and pinning, data allocation and placement, time measurement, and result presentation. likwid-bench comes with a large collection of architecture-specific optimized kernels for various SIMD instruction set extensions. At the moment it is only available for x86 processors on Linux (Arm and POWER9 support is in beta).

One main advantage of likwid-bench is that kernels are implemented directly in assembly language, which rules out any influence of upper abstraction layers. This makes it possible to measure processor performance properties accurately. likwid-bench can be used for all kinds of bandwidth and instruction throughput measurements. The fine-grained control over thread and data placement also makes it possible to measure on-board interconnect bandwidth.

The Bandwidth Benchmark

The Bandwidth Benchmark is a new project whose main focus is to provide a teaching benchmark application that can also serve as the basis for your own developments. It is heavily inspired by John McCalpin's STREAM benchmark. In contrast to STREAM, it has the added benefit that the code is a blueprint for a minimal benchmark application with a generic Makefile and modules for aligned array allocation, accurate timing, and affinity settings. Those components can be used standalone in other benchmark projects. Like STREAM, the benchmark targets sustained memory bandwidth, but it comes with more streaming loop kernels and covers many basic data access patterns, from load-only to a full triad, including variants with and without write-allocate data transfers.
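The difference between kernels with and without write-allocate transfers can be illustrated with a small traffic model. It is a sketch under the assumption that every store goes to a stream that is not also loaded in the same loop, so that with write-allocate each store causes one additional read of the cache line being overwritten. The function name and the kernel list are illustrative and are not taken from the benchmark's source.

```python
def traffic_bytes(loads, stores, n, write_allocate=True, word_bytes=8):
    """Bytes moved between main memory and cache for n loop iterations.

    Assumes each store stream is distinct from all load streams. With
    write-allocate, the cache line of a store target is read first,
    which adds one extra load stream per store stream.
    """
    streams = loads + stores
    if write_allocate:
        streams += stores
    return streams * word_bytes * n


if __name__ == "__main__":
    n = 1_000_000
    # (loads, stores) per iteration for three basic access patterns.
    kernels = {"load only": (1, 0), "copy": (1, 1), "triad": (2, 1)}
    for name, (loads, stores) in kernels.items():
        with_wa = traffic_bytes(loads, stores, n, write_allocate=True)
        without_wa = traffic_bytes(loads, stores, n, write_allocate=False)
        print(f"{name}: {without_wa} bytes counted, {with_wa} bytes with write-allocate")
```

In this model a triad that reports a given bandwidth from its counted loads and stores actually moves 4/3 as much data when write-allocate transfers occur.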

EPCC OpenMP micro-benchmark suite

The EPCC OpenMP micro-benchmark suite is intended to measure the overheads of synchronisation, loop scheduling, and array operations in the OpenMP runtime library.
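A common way to obtain such an overhead is to time a fixed payload once on its own (the reference) and once wrapped in the construct of interest, and to report the difference per repetition. The sketch below illustrates only this subtraction scheme. Since Python has no OpenMP, a `threading.Lock` serves as a hypothetical stand-in for a synchronisation construct, and all names (`delay`, `time_per_rep`, `overhead`) are assumptions made for this example.

```python
import threading
import time


def delay(length):
    # Fixed payload whose runtime is independent of the construct.
    a = 0.0
    for i in range(length):
        a += i
    return a


def time_per_rep(body, inner_reps):
    """Average seconds per execution of `body` over `inner_reps` runs."""
    start = time.perf_counter()
    for _ in range(inner_reps):
        body()
    return (time.perf_counter() - start) / inner_reps


def overhead(test_time, reference_time):
    """Overhead of the construct: time with it minus time without it."""
    return test_time - reference_time


if __name__ == "__main__":
    lock = threading.Lock()  # stand-in for a synchronisation construct

    def reference():
        delay(100)

    def with_construct():
        with lock:
            delay(100)

    t_ref = time_per_rep(reference, 10_000)
    t_test = time_per_rep(with_construct, 10_000)
    print(f"overhead: {overhead(t_test, t_ref) * 1e9:.1f} ns per repetition")
```

Because the overhead is a small difference of two noisy measurements, it can fluctuate or even come out negative in a single run, so in practice the measurement is repeated and statistics are reported.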

Intel MPI Benchmarks

The Intel MPI Benchmarks measure the performance of point-to-point and global communication operations for a range of message sizes. The generated benchmark data characterizes the performance of a cluster system, including node performance, network latency, and throughput efficiency of the MPI implementation used.

Other MPI microbenchmarks worth looking at are the OSU Micro-Benchmarks and the Sandia MPI Micro-Benchmark Suite (SMB).
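Point-to-point results from such suites are often derived from a ping-pong pattern: a message travels to a partner process and back, and half of the round-trip time is taken as the one-way time. The sketch below shows this arithmetic only, without any MPI calls. The function name and the sample measurements are hypothetical and do not come from any of the suites above.

```python
def pingpong_metrics(msg_bytes, total_seconds, repetitions):
    """Return (one-way time in seconds, throughput in MB/s).

    `total_seconds` is the time for `repetitions` complete round trips
    of a message with `msg_bytes` bytes. Throughput uses 10**6 bytes
    per MB and is reported as 0.0 for empty messages.
    """
    one_way = total_seconds / (2 * repetitions)
    mbytes_per_s = msg_bytes / one_way / 1.0e6 if msg_bytes > 0 else 0.0
    return one_way, mbytes_per_s


if __name__ == "__main__":
    # Hypothetical measurements: (message size, total time, repetitions).
    samples = [(0, 0.004, 1000), (1024, 0.005, 1000), (1_000_000, 0.2, 1000)]
    for size, seconds, reps in samples:
        latency, bandwidth = pingpong_metrics(size, seconds, reps)
        print(f"{size:>8} bytes: {latency * 1e6:8.2f} us, {bandwidth:10.2f} MB/s")
```

The time for very small messages approximates the network latency, while the throughput for large messages approaches the achievable bandwidth.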

DGEMM (Linpack) benchmark

A reference Linpack implementation is available. Because this is the benchmark used for the TOP500 HPC list, every vendor provides an optimized implementation for its processors. At its core, Linpack solves a dense linear system, and most of the runtime is spent in large dense matrix-matrix multiplications (DGEMM). Linpack measures the sustained floating-point throughput for multiply-add operations, but also puts some pressure on the memory hierarchy and on network communication.
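The floating-point rate of a matrix-matrix multiplication follows from its operation count: multiplying an m x k matrix by a k x n matrix takes one multiply and one add per inner-loop iteration, that is 2*m*n*k operations. The sketch below applies this to a naive pure-Python multiplication. It is meant to show the accounting only; the achieved rate will be orders of magnitude below what an optimized DGEMM reaches, and the helper names are assumptions for this example.

```python
import time


def dgemm_flops(m, n, k):
    # One multiply and one add for each of the m*n*k inner iterations.
    return 2 * m * n * k


def naive_matmul(a, b):
    """Multiply two matrices given as lists of rows (C = A * B)."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        row_c = c[i]
        for p in range(k):
            a_ip = a[i][p]
            row_b = b[p]
            for j in range(n):
                row_c[j] += a_ip * row_b[j]
    return c


if __name__ == "__main__":
    size = 64
    a = [[1.0] * size for _ in range(size)]
    b = [[2.0] * size for _ in range(size)]
    start = time.perf_counter()
    naive_matmul(a, b)
    elapsed = time.perf_counter() - start
    gflops = dgemm_flops(size, size, size) / elapsed / 1.0e9
    print(f"{gflops:.4f} GFLOP/s for {size}x{size} matrices")
```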

IOR Parallel filesystem I/O benchmark

The IOR I/O benchmark measures parallel file system I/O performance at both the POSIX and MPI-IO level. It writes to and reads from files under several sets of conditions and reports the resulting throughput rates. mdtest is an additional tool that evaluates the metadata performance of a file system and has been designed to test parallel file systems.

Links and further information

Slide set on Microbenchmarking as part of the RRZE Node-level tutorial.