This page collects benchmarking practices for HPC, with a focus on procurement.

Current Common Practices for Benchmark Selection in NHR Datacenter Procurement

This section collects common practices for selecting benchmarks and criteria when designing Requests for Proposals (RFPs) to procure compute hardware for HPC systems within NHR.

These common practices emerged from within the “NHR Future Project” Benchmarks and TCO for NHR Procurements. They are mainly based on:

A series of interviews conducted between December 2023 and April 2024 among the project partners (NHR@RWTH, NHR@TUDa, NHR@Göttingen, PC2, NHR@KIT, NHR@TUD),
a questionnaire on monitoring of running jobs, and
a questionnaire on TCO modeling.

We name principles as current common practice if they are currently practiced among a majority of the examined NHR centers. They are not necessarily superior to other practices. If no majority emerges in some aspect, several practiced approaches are listed under “Approaches include”.

Center-specific components like FPGAs, specific interconnects etc. were not common enough across the examined centers to be considered here.

Current Common Practice

Utilize stable procurement teams and iterate on previous procurements
- Establish a stable team for all procurements.
- Reuse documents, in particular a list of criteria from previous procurements, and iterate on them.
- Iterations are still allowed to make far-reaching changes.
Select a variety of benchmarks
- Include both synthetic and application benchmarks.
- Allocate at least one benchmark each for pure compute performance (e.g. HPL) and pure memory performance (e.g. STREAM).
Use simple benchmarks. Used benchmarks should…
- …be well understood by both supplier and procuring center.
- …be easy to run.
- …be portable.
- …yield predictable and reproducible results.
- …represent the jobs running in a center.
- …not overlap in purpose, i.e. use only one benchmark for raw memory throughput, one for peak FLOPS, etc.
Test Scalability
- There is no general approach to ensure scalability.
- In general, demand scalability for a low number of nodes (at most a dozen).
- Approaches include:
  - Demand the highest performance across a fixed number of nodes.
  - Demand a fixed minimum performance across a supplier-selectable number of nodes.
- Test scalability across all levels: core, socket, node, cluster
Derive scores from individual benchmarks
- Use runtime or throughput for scoring.
- Scale the score awarded for one benchmark using a known reference system or the maximum across all offers.
  - Do not map the minimum across all offers to 0 points, as a single bad offer would then shift the scoring for all other offers.
Join individual scores into a combined score
- Apply weights to individual benchmarks to derive final score.
- Use the sum across all individual (weighted) scores as combined score.
- More complicated methods (weighted arithmetic mean, geometric mean) are not common.
Weigh benchmarks against each other
- Assign weight through maximum achievable score per benchmark. More important benchmarks have a higher maximum.
- There is no algorithm to derive weights.
- Approaches include:
  - Distribute maximum achievable scores using a hierarchical schema: e.g. 30% application benchmarks and 70% synthetic benchmarks; the latter are divided into 40% SPEC and 30% pure kernel benchmarks, which in turn are divided into 10% compute-bound and 20% memory-bound kernels, etc.
  - Translate maximum achievable points to fraction of procurement volume to grasp the value of a benchmark. A result might be “We are willing to pay xxx € for a high HPL performance.”
  - Scale weights of application benchmarks proportional to their usage in the cluster. (Note: This is not applicable to synthetic benchmarks.)
  - Avoid favoring one architecture over another through benchmark weights.
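
The scoring steps above (scale each benchmark against a reference, cap it at its maximum achievable points, sum the results) can be sketched as follows. This is a minimal illustration and not the method of any particular center. The maximum points follow the hierarchical example above. The offer names, the measured values, the function names and the choice to cap each score at its maximum are assumptions made for this sketch.

```python
from typing import Dict

# Maximum achievable points per benchmark group, following the hierarchical
# example in the text: 30 application + 70 synthetic (40 SPEC + 30 kernels,
# the kernels split into 10 compute bound + 20 memory bound).
MAX_POINTS: Dict[str, float] = {
    "application": 30.0,
    "spec": 40.0,
    "kernel_compute": 10.0,
    "kernel_memory": 20.0,
}

# True if a larger measured value is better (throughput),
# False if a smaller value is better (runtime).
HIGHER_IS_BETTER: Dict[str, bool] = {
    "application": False,
    "spec": True,
    "kernel_compute": True,
    "kernel_memory": True,
}


def benchmark_points(value: float, reference: float,
                     max_points: float, higher_is_better: bool) -> float:
    """Scale one result against a reference value.

    Assumption of this sketch: the score is capped at max_points, so an
    offer that beats a reference system earns no extra points.
    """
    if value <= 0 or reference <= 0:
        raise ValueError("measured values and references must be positive")
    ratio = value / reference if higher_is_better else reference / value
    return max_points * min(ratio, 1.0)


def best_across_offers(offers: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Use the best result across all offers as the reference per benchmark."""
    refs = {}
    for name in MAX_POINTS:
        values = [offer[name] for offer in offers.values()]
        refs[name] = max(values) if HIGHER_IS_BETTER[name] else min(values)
    return refs


def combined_score(offer: Dict[str, float], refs: Dict[str, float]) -> float:
    """Combined score: plain sum of the individual (weighted) scores."""
    return sum(
        benchmark_points(offer[name], refs[name],
                         MAX_POINTS[name], HIGHER_IS_BETTER[name])
        for name in MAX_POINTS
    )


# Hypothetical offers: application runtime in seconds, the rest as throughput.
OFFERS = {
    "offer_a": {"application": 100.0, "spec": 800.0,
                "kernel_compute": 50.0, "kernel_memory": 400.0},
    "offer_b": {"application": 125.0, "spec": 1000.0,
                "kernel_compute": 40.0, "kernel_memory": 500.0},
}

if __name__ == "__main__":
    refs = best_across_offers(OFFERS)
    for name, offer in OFFERS.items():
        print(name, round(combined_score(offer, refs), 2))
    # prints: offer_a 88.0
    #         offer_b 92.0
```

Because each score is scaled against the best result (or, alternatively, a fixed reference system passed in place of `best_across_offers`), adding a much weaker offer leaves the references and therefore the points of all other offers unchanged. That is the property the text asks for when it advises against mapping the minimum to 0 points.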

Emerging Trends

This section covers aspects that are not practiced by a majority of participating partners, but that (1) are still practiced by several partners and (2) are, in the eyes of the authors, candidates for good practice.

Carefully guide user interaction
- Do not let users directly influence procurement criteria.
- Raw user jobs are typically not fit for usage as benchmarks, as they are not easy to run, their performance is not well understood, etc.
- However, rely on users as domain experts (mostly for application benchmarks), e.g. when considering flags like -ffast-math or how to verify results.
Benchmarks are typically not on a full-system scale.
The home filesystem is typically already present (and not part of a procurement).
There is no common way to procure storage.
- Option A: Parallel filesystem is procured together with compute resources.
  - rationale: Keep it simple.
- Option B: Parallel filesystem is procured in a fully separate procurement.
  - rationale: Storage is easily neglected in favor of compute in combined procurements. Pure storage procurements also invite pure storage vendors to participate, and thus strengthen competition.
Heavily rely on SPEC suites instead of various smaller benchmarks.
- rationale: Procure a flexible system suited for many workloads.
Emphasize competition: Design benchmark weights in such a way that no vendor is at an explicit advantage.
Make the life of suppliers easy:
- Create a structure for the list of criteria.
- Note the approx. time for each benchmark in the heading.
- The total effort for all criteria should be 1-2 days at most.
  - Long-running, non-interactive benchmarks not included
- Suppliers are interested in motivation behind criteria
  - rationale: Propose alternative architectures
Avoid (non-synthetic) applications for scoring
- They require a lot of effort from both center and supplier.
- Possibly use applications for validation (“A-Kriterium”).
- Instead, benchmark performance-relevant components individually.

Adjustment to Center-Specific Running Job-Mixes

Every NHR center has a different set of jobs that are executed on their clusters. Here, we refer to that as a running job mix. Initially, the research question was: How do NHR centers design RFPs towards their running job mixes?

The interviews showed that (1) there is no common process to incorporate users’ needs within NHR, and (2) among all partners, there is no complete algorithm to create a full RFP from running job mixes. Nonetheless, all partners are very familiar with their respective running job mixes. However, deriving criteria/benchmarks for an RFP based on the running job mixes is difficult: The current common practice outlined above remains a rather rigid framework, into which requirements have to be integrated.

No consensus emerged on whether to use user jobs/applications directly as criteria for an RFP. Some partners do this and include multiple applications as benchmarks in RFPs. They base the weights on, e.g., the (accumulated) runtime of the specific applications in the cluster: More runtime on the cluster (during production) results in greater weights in an RFP. Still, they limit themselves to a small number of (application) benchmarks to keep the RFP concise and the effort low. Other partners rejected this approach, as they deem their running job mix too heterogeneous. They claim that selecting a limited set of jobs for an RFP would severely misrepresent the entirety of the running job mix.

Designing and weighting the remaining criteria is similarly non-systematic. While some partners apply algorithmically derived/metric-based weighting for application benchmarks, all other criteria (e.g. synthetic benchmarks) must be weighted through other means. In a few cases partners can rely on metrics from their current cluster in production, e.g. for the required memory per core. For most criteria however, no metric offers suitable insights: The remaining criteria have to be weighted manually. Also, even if weights have been derived algorithmically, defining their share of the total RFP must be done by hand. The final balancing of (groups of) criteria against each other remains a manual task for all partners.
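
The metric-based weighting of application benchmarks described above can be sketched as follows: the share of the total RFP reserved for applications is set by hand, and only the split within that share follows the accumulated runtime. This is a hedged illustration, not a procedure reported by any partner. The function name, the core-hour figures and the 30-point application share are assumptions made for this sketch.

```python
from typing import Dict


def application_weights(accumulated_runtime: Dict[str, float],
                        application_share: float) -> Dict[str, float]:
    """Split a manually chosen share of the total score across applications,
    proportional to their accumulated runtime in production."""
    total = sum(accumulated_runtime.values())
    if total <= 0:
        raise ValueError("accumulated runtime must be positive")
    return {
        app: application_share * runtime / total
        for app, runtime in accumulated_runtime.items()
    }


# Hypothetical accumulated core-hours per application on the current cluster.
CORE_HOURS = {
    "GROMACS": 6_000_000.0,
    "OpenFOAM": 3_000_000.0,
    "ICON": 1_000_000.0,
}

# The application share (here 30 of 100 points) is a manual decision.
WEIGHTS = application_weights(CORE_HOURS, application_share=30.0)

if __name__ == "__main__":
    for app, weight in WEIGHTS.items():
        print(app, weight)
    # prints: GROMACS 18.0
    #         OpenFOAM 9.0
    #         ICON 3.0
```

The sketch makes the manual step explicit: `application_share` cannot be derived from the runtime data, and the weights of synthetic benchmarks are not covered at all.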

Benchmark Usage in HPC Procurements

Overview:

Benchmarks that HPC centers are actively using in procurements (“active use” means that there is some experience with the benchmark; it need not have been used in the latest procurement)
Results come from 6 NHR centers.
Last updated: April 2024

Legend:

yes: benchmark is actively in use and results in a score (“B-Kriterium”)
threshold only: benchmark is actively in use, but only minimum performance demanded (“A-Kriterium”)
no: benchmark is not in active use (holds if neither yes nor threshold only applies)

Remarks:

If benchmarks cannot be assigned to “only” CPU-centric or “only” GPU-centric nodes (e.g., IO benchmarks), the best fit was selected. (This could also mean: both).
Storage-only procurements are not covered here.

CPU-centric Nodes: Number of NHR centers with active usage

Benchmark	yes	threshold-only	Comments
HPL	5	1
HPCG	4	0
ior	4	0
SPEC CPU	3	0
IO500 mdtest	3	0
STREAM	2	1
GROMACS	2	0
OSU (latency and bandwidth)	2	0
IO500 (except mdtest)	1	1
OpenFOAM	1	0
ICON	1	0
SPEChpc (MPI)	1	0
m-AIA	1	0	Link
VASP CuC-VdW	1	0
CP2K H2O-512	1	0
Quantum ESPRESSO GRIR443	1	0
DGEMM	1	0
IMB (MPI latency)	0	1

GPU-centric Nodes: Number of NHR centers with active usage

Benchmark	yes	threshold-only	Comments
HPL	5	0
HPCG	3	0
ior	3	1
OpenFabrics perftest (ib_*)	3	0
OSU (latency and bandwidth)	2	0
StreamGPU	1	1	e.g. BabelStream
GROMACS	1	0
IO500	1	0
PyTorch BERT-large	1	0
DeepSpeed DNN (BERT)	1	0	Link
DeepSpeed allreduce	1	0
MLPerf Image Classification	1	0
MLPerf Speech Recognition	1	0
SPEChpc (TGT or ACC)	1	0
SPECaccel	1	0
MPT-30B	1	0
LAMMPS ReaxFF	1	0
DGEMM	1	0
