[[Category:HPC-Admin]]
[[Category:HPC-User]]
[[Category:Benchmarking]]
{{DISPLAYTITLE:Benchmarking}}

Benchmarking in HPC is the practice of measuring and comparing the performance of computer systems.

Benchmarking of HPC systems and applications is relevant in different contexts:

* HPC users: choosing the most suitable HPC system and application/settings for a given scientific workload
** [[Benchmarking & Scaling Tutorial]]
** [[Scaling]]
** [[Job efficiency]]
** and other articles in the category [[:Category:Benchmarking|Benchmarking]]
* HPC developers: designing and evaluating algorithms and implementations
** [[Performance profiling]]
** [[Performance Engineering]]
** [[Benchmarking & Scaling Tutorial]]
** and other articles in the category [[:Category:Benchmarking|Benchmarking]]
* Procurement of HPC systems: choosing the most suitable offer for an HPC system
** [[Admin Guide Benchmarking for Procurements]]
** and other articles in the category [[:Category:Benchmarking|Benchmarking]]

The remainder of this page collects benchmarking practices for HPC, with a focus on procurement.

<span id="current-common-practices-for-benchmark-selection-in-nhr-datacenter-procurement"></span>
= Current Common Practices for Benchmark Selection in NHR Datacenter Procurement =

This section collects the common practices for selecting '''benchmarks''' or '''criteria''' when designing ''Requests for Proposals'' (RFPs) to procure compute hardware for HPC systems within '''NHR'''.

These common practices emerged within the “NHR Future Project” '''Benchmarks and TCO for NHR Procurements'''. They are mainly based on:

* a '''series of interviews''' conducted between December 2023 and April 2024 among the project partners (NHR@RWTH, NHR@TUDa, NHR@Göttingen, PC2, NHR@KIT, NHR@TUD),
* a '''questionnaire''' on monitoring of running jobs, and
* a '''questionnaire''' on TCO modeling.

We name principles as '''current common practice''' if they are currently practiced by a '''majority''' of the examined NHR centers. They are '''not necessarily superior''' to other practices. If no majority emerges for some aspect, several practiced approaches are listed after ''approaches include''.

Center-specific components such as FPGAs or specific interconnects were not common enough across the examined centers to be considered here.

<span id="current-common-practice"></span>
== Current Common Practice ==
 
 
 
* '''Utilize stable procurement teams''' and '''iterate on previous procurements'''
 
** Establish a '''stable team''' for all procurements.
 
** Reuse documents, in particular a list of criteria from previous procurements, and '''iterate''' on them.
 
** Iterations are still allowed to make far-reaching changes.
 
* '''Select a variety of benchmarks'''
 
** Include both '''synthetic''' and '''application benchmarks'''.
 
** Allocate at least one benchmark each for pure '''compute''' performance (e.g. HPL) and pure '''memory''' performance (e.g. STREAM).
 
* '''Use simple benchmarks'''. The benchmarks used should…
 
** …be well understood by both '''supplier''' and '''procuring center'''.
 
** …be '''easy''' to run.
 
** …be '''portable'''.
 
** …yield '''predictable''' and '''reproducible''' results.
 
** …'''represent''' the jobs running in a center.
 
** …'''not overlap''' in their purpose, i.e. use only one benchmark for raw memory throughput, one for peak FLOPS, etc.
 
* '''Test Scalability'''
 
** There is '''no general approach''' to ensure scalability.
 
** In general, demand scalability for a '''low number of nodes''' (at most a dozen).
 
** Approaches include:
 
*** Demand the highest performance across a '''fixed number of nodes'''.
 
*** Demand a fixed '''minimum performance''' across a supplier-selectable number of nodes.
 
** Test scalability across all levels: core, socket, node, cluster
 
* '''Derive scores from individual benchmarks'''
 
** Use '''runtime''' or '''throughput''' for scoring.
 
** Scale the score awarded for one benchmark using a '''known reference system''' or '''the maximum across all offers''' (a minimal scoring sketch follows after this list).
 
*** Do '''not''' map the minimum across all offers to 0 points, as a single bad offer could then shift the scoring for all others.
 
* '''Join individual scores into a combined score'''
 
** Apply weights to individual benchmarks to derive the final score.
 
** Use the '''sum across all individual (weighted) scores''' as combined score.
 
** More complicated methods (weighted arithmetic mean, geometric mean) are not common.
 
* '''Weigh benchmarks against each other'''
 
** Assign weights via the maximum achievable score per benchmark: more important benchmarks have a higher maximum.
 
** There is '''no algorithm''' to derive weights.
 
** Approaches include:
 
*** Distribute maximum achievable scores using a '''hierarchical schema''': e.g. 30% application benchmarks and 70% synthetic benchmarks, where the synthetic share is further divided into 40% SPEC and 30% pure kernel benchmarks, and the kernel share into 10% compute-bound and 20% memory-bound, etc.
 
*** Translate maximum achievable points into a '''fraction of the procurement volume''' to grasp the value of a benchmark. A result might be “We are willing to pay xxx € for high HPL performance.”
 
*** Scale the weights of '''application benchmarks''' proportionally to their '''usage''' on the cluster. (Note: this is not applicable to synthetic benchmarks.)
 
*** Avoid favoring one architecture over another through benchmark weights.
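
To make the scoring, weighting, and combination steps above concrete, here is a minimal sketch in Python. All benchmark names, reference values, point budgets, and offer results are hypothetical and only illustrate the mechanics: each result is normalized against a known reference system, multiplied by its maximum achievable points (distributed via a hierarchical schema), and the weighted scores are summed into the combined score.

<syntaxhighlight lang="python">
# Minimal scoring sketch (hypothetical benchmark names and numbers).
# Each offer's result is normalized against a known reference system,
# weighted by its maximum achievable points, and summed into a combined score.

# Maximum achievable points per benchmark, distributed via a hierarchical
# schema: 30 % application benchmarks, 70 % synthetic benchmarks; the
# synthetic share is split into 40 % SPEC and 30 % pure kernels, the kernel
# share into 10 % compute-bound and 20 % memory-bound.
MAX_POINTS = {
    "app_gromacs": 30.0,  # application benchmark
    "spec_cpu": 40.0,     # synthetic: SPEC
    "hpl": 10.0,          # synthetic: compute-bound kernel
    "stream": 20.0,       # synthetic: memory-bound kernel
}

# Throughput-style results of a known reference system (higher is better).
REFERENCE = {
    "app_gromacs": 120.0,  # e.g. ns/day
    "spec_cpu": 450.0,     # e.g. SPECrate score
    "hpl": 2.5,            # e.g. PFlop/s
    "stream": 1.8,         # e.g. aggregate TB/s
}


def combined_score(results: dict) -> float:
    """Sum of weighted scores, each normalized against the reference system.

    Normalizing against a fixed reference (or the maximum across all offers)
    avoids mapping the worst offer to 0 points, where a single bad offer
    would shift the scoring for everyone.
    """
    return sum(
        points * results[bench] / REFERENCE[bench]
        for bench, points in MAX_POINTS.items()
    )


if __name__ == "__main__":
    offers = {
        "Offer A": {"app_gromacs": 150.0, "spec_cpu": 500.0, "hpl": 3.0, "stream": 2.0},
        "Offer B": {"app_gromacs": 130.0, "spec_cpu": 520.0, "hpl": 2.8, "stream": 2.4},
    }
    for name, results in offers.items():
        print(f"{name}: {combined_score(results):.1f} points")
</syntaxhighlight>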
 
 
 
<span id="emerging-trends"></span>
 
== Emerging Trends ==
 
 
 
This section covers aspects that are '''not''' practiced by a '''majority''' of participating partners, but that (1) are still practiced by several partners and (2), in the eyes of the authors, are candidates for good practice.
 
 
 
* Carefully guide user interaction
 
** Do '''not''' let users directly influence procurement criteria.
 
** Raw user jobs are typically not fit for use as benchmarks: they are not easy to run, their performance is not well understood, etc.
 
** However, do rely on users as domain experts (mostly for application benchmarks) when considering compiler flags like <code>-ffast-math</code>, how to verify results, etc.
 
* Benchmarks are typically not on a full-system scale.
 
* The home filesystem is typically already present (and not part of a procurement).
 
* There is no common way to procure storage.
 
** Option A: Parallel filesystem is procured together with compute resources.
 
*** rationale: Keep it simple.
 
** Option B: Parallel filesystem is procured in a fully separate procurement.
 
*** rationale: Storage is easily neglected in favor of compute in combined procurements. Pure storage procurements also invite pure storage vendors to participate, and thus strengthen competition.
 
* Heavily rely on SPEC suites instead of various smaller benchmarks.
 
** rationale: Procure a flexible system suited for many workloads.
 
* Emphasize competition: design benchmark weights in such a way that no vendor has an explicit advantage.
 
* Make the life of suppliers easy:
 
** Create a structure for the list of criteria.
 
** Note the approximate runtime for each benchmark in its heading.
 
** The total effort for all criteria should not exceed 1–2 days.
 
*** Long-running, non-interactive benchmarks are not counted towards this limit.
 
** Suppliers are interested in the motivation behind criteria.
 
*** rationale: knowing the motivation allows suppliers to propose alternative architectures.
 
* Avoid (non-synthetic) applications for scoring
 
** These require a lot of effort from both the center and the supplier.
 
** possibly use applications only for validation, i.e. as a minimum requirement without scoring (“A-Kriterium”)
 
** rather: Benchmark performance-relevant components individually
 
 
 
<span id="adjustment-to-center-specific-running-job-mixes"></span>
 
== Adjustment to Center-Specific Running Job-Mixes ==
 
 
 
Every NHR center has a different set of jobs that are executed on its clusters; here, we refer to this as the ''running job mix''. Initially, the research question was: '''How do NHR centers design RFPs towards their running job mixes?'''
 
 
 
The interviews showed (1) that '''there is no common process''' to incorporate users’ needs within NHR, and (2) that among all partners, there is '''no full algorithm to create a full RFP''' from running job mixes. Nonetheless, all partners are '''very familiar''' with their respective running job mixes. However, deriving criteria/benchmarks for an RFP from the running job mix is difficult: the ''current common practice'' outlined above remains a rather '''rigid framework''' into which requirements have to be integrated.
 
 
 
'''No consensus''' emerged on using user jobs/applications directly as criteria for an RFP. Some partners do this and include multiple applications as benchmarks in RFPs. They base the weights on, e.g., the accumulated runtime of the specific applications on the cluster: more runtime during production results in a greater weight in the RFP (a minimal sketch of this weighting follows below). Still, they limit themselves to a '''small number of (application) benchmarks''' to keep the RFP concise and the effort low. Other partners rejected this approach, as they deem their running job mix '''too heterogeneous''': they argue that selecting a limited set of jobs for an RFP '''would severely misrepresent''' the entirety of the running job mix.
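
As an illustration of the runtime-based weighting described above, the following minimal sketch distributes a fixed point budget for application benchmarks proportionally to their accumulated runtime on the current cluster. The application names, core-hour figures, and the budget are purely hypothetical.

<syntaxhighlight lang="python">
# Hypothetical sketch: distribute the point budget reserved for application
# benchmarks proportionally to accumulated runtime on the current cluster.
# Application names, core-hours, and the budget are invented for illustration.

accumulated_core_hours = {
    "GROMACS": 4_000_000,
    "OpenFOAM": 2_500_000,
    "VASP": 1_500_000,
}

APPLICATION_POINT_BUDGET = 30.0  # points reserved for application benchmarks

total_hours = sum(accumulated_core_hours.values())
weights = {
    app: APPLICATION_POINT_BUDGET * hours / total_hours
    for app, hours in accumulated_core_hours.items()
}

for app, weight in weights.items():
    print(f"{app}: {weight:.1f} points")
# Output: GROMACS: 15.0 points, OpenFOAM: 9.4 points, VASP: 5.6 points
</syntaxhighlight>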
 
 
 
Designing and weighting the remaining criteria is '''similarly non-systematic'''. While some partners apply algorithmically derived, metric-based weighting for application benchmarks, all other criteria (e.g. synthetic benchmarks) must be weighted by other means. In a few cases, partners can rely on metrics from their current production cluster, e.g. for the required memory per core. For most criteria, however, no metric offers suitable insights: the '''remaining criteria''' have to be '''weighted manually'''. Moreover, even if weights have been derived algorithmically, defining their share of the total RFP must be done by hand. The '''final balancing''' of (groups of) criteria against each other remains a '''manual task''' for all partners.
 
 
 
<span id="benchmark-usage-in-hpc-procurements"></span>
 
= Benchmark Usage in HPC Procurements =
 
 
 
Overview:
 
 
 
* Benchmarks that HPC centers are actively using in procurements (“active use” means that the center has some experience with the benchmark; it need not have been used in the latest procurement)
 
* Results come from 6 NHR centers.
 
* Last updated: April 2024
 
 
 
Legend:
 
 
 
* ''yes'': benchmark is actively in use and results in a score (“B-Kriterium”)
 
* ''threshold-only'': benchmark is actively in use, but only a minimum performance is demanded (“A-Kriterium”)
 
* ''no'': benchmark is not in active use (applies if neither ''yes'' nor ''threshold-only'' holds)
 
 
 
Remarks:
 
 
 
* If a benchmark cannot be assigned exclusively to CPU-centric or GPU-centric nodes (e.g., IO benchmarks), the best fit was selected (which can also mean both).
 
* Storage-only procurements are not covered here.
 
 
 
<span id="cpu-centric-nodes-number-of-nhr-centers-with-active-usage"></span>
 
== CPU-centric Nodes: Number of NHR centers with active usage ==
 
 
 
{| class="wikitable"
 
|-
 
! Benchmark
 
! style="text-align: right;"| yes
 
! style="text-align: right;"| threshold-only
 
! Comments
 
|-
 
| HPL
 
| style="text-align: right;"| 5
 
| style="text-align: right;"| 1
 
|
 
|-
 
| HPCG
 
| style="text-align: right;"| 4
 
| style="text-align: right;"| 0
 
|
 
|-
 
| IOR
 
| style="text-align: right;"| 4
 
| style="text-align: right;"| 0
 
|
 
|-
 
| SPEC CPU
 
| style="text-align: right;"| 3
 
| style="text-align: right;"| 0
 
|
 
|-
 
| IO500 mdtest
 
| style="text-align: right;"| 3
 
| style="text-align: right;"| 0
 
|
 
|-
 
| STREAM
 
| style="text-align: right;"| 2
 
| style="text-align: right;"| 1
 
|
 
|-
 
| GROMACS
 
| style="text-align: right;"| 2
 
| style="text-align: right;"| 0
 
|
 
|-
 
| OSU (latency and bandwidth)
 
| style="text-align: right;"| 2
 
| style="text-align: right;"| 0
 
|
 
|-
 
| IO500 (except mdtest)
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 1
 
|
 
|-
 
| OpenFOAM
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
|
 
|-
 
| ICON
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
|
 
|-
 
| SPEChpc (MPI)
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
|
 
|-
 
| m-AIA
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
| [https://www.coe-raise.eu/reference-codes Link]
 
|-
 
| VASP CuC-VdW
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
|
 
|-
 
| CP2K H2O-512
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
|
 
|-
 
| Quantum ESPRESSO GRIR443
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
|
 
|-
 
| DGEMM
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
|
 
|-
 
| IMB (MPI latency)
 
| style="text-align: right;"| 0
 
| style="text-align: right;"| 1
 
|
 
|}
 
 
 
<span id="gpu-centric-nodes-number-of-nhr-centers-with-active-usage"></span>
 
== GPU-centric Nodes: Number of NHR centers with active usage ==
 
 
 
{| class="wikitable"
 
|-
 
! Benchmark
 
! style="text-align: right;"| yes
 
! style="text-align: right;"| threshold-only
 
! Comments
 
|-
 
| HPL
 
| style="text-align: right;"| 5
 
| style="text-align: right;"| 0
 
|
 
|-
 
| HPCG
 
| style="text-align: right;"| 3
 
| style="text-align: right;"| 0
 
|
 
|-
 
| IOR
 
| style="text-align: right;"| 3
 
| style="text-align: right;"| 1
 
|
 
|-
 
| OpenFabrics perftest (ib_*)
 
| style="text-align: right;"| 3
 
| style="text-align: right;"| 0
 
|
 
|-
 
| OSU (latency and bandwidth)
 
| style="text-align: right;"| 2
 
| style="text-align: right;"| 0
 
|
 
|-
 
| StreamGPU
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 1
 
| e.g. [https://github.com/UoB-HPC/BabelStream BabelStream]
 
|-
 
| GROMACS
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
|
 
|-
 
| IO500
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
|
 
|-
 
| PyTorch BERT-large
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
|
 
|-
 
| DeepSpeed DNN (BERT)
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
| [https://github.com/tud-zih-ki/DeepSpeedExamples/tree/gpu_bench/training/bing_bert Link]
 
|-
 
| DeepSpeed allreduce
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
|
 
|-
 
| MLPerf Image Classification
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
|
 
|-
 
| MLPerf Speech Recognition
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
|
 
|-
 
| SPEChpc (TGT or ACC)
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
|
 
|-
 
| SPECaccel
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
|
 
|-
 
| MPT-30B
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
|
 
|-
 
| LAMMPS ReaxFF
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
|
 
|-
 
| DGEMM
 
| style="text-align: right;"| 1
 
| style="text-align: right;"| 0
 
|
 
|}
 
