Difference between revisions of "Oracle Sampling Collector and Performance Analyzer"

From HPC Wiki
(Created page with "The Oracle Sampling Collector and the Performance analyzer are a pair of tools that can be used to collect and analyze performance data for serial or parallel applications. Th...")
 
 
(8 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 +
[[Category:HPC-Developer]]
 
The Oracle Sampling Collector and the Performance analyzer are a pair of tools that can be used to collect and analyze performance data for serial or parallel applications.
 
The Oracle Sampling Collector and the Performance analyzer are a pair of tools that can be used to collect and analyze performance data for serial or parallel applications.
 
Therefor the Sampling Collector gathers performance data by sampling at regular time intervals and by tracing function calls.  
 
Therefor the Sampling Collector gathers performance data by sampling at regular time intervals and by tracing function calls.  
Line 19: Line 20:
 
$ collect -h
 
$ collect -h
 
</syntaxhighlight>
 
</syntaxhighlight>
Most of the times it is hardly possible to use more than 4 counters in the same measurement since some counters use the same resources and thus conflict with each other.
+
The available counters depends on the CPU (hardware) type of your node, with the same things being referenced by different names on different CPUs, sometimes. Number of counters which can be measured in a single application run is also dependent on the CPU type. Most of the times it is hardly possible to use more than '''4''' counters in the same measurement since some counters use the same resources and thus conflict with each other.
 
The most important collect options are listed in the table below:
 
The most important collect options are listed in the table below:
 
{| class="wikitable"
 
{| class="wikitable"
Line 53: Line 54:
 
! Counter  || Description
 
! Counter  || Description
 
|-  
 
|-  
| <code>cycles,on</code> <br /> <code>insts,on</code> || cycle count <br /> instruction count
+
| <code>cycles,on,insts,on</code> || Cycle count and instruction count. The quotient is the CPI rate (clocks per instruction). <br /> The MHz rate of the CPU multiplied with the instruction count divided by the cycle count gives the MIPS rate.
 
|-
 
|-
 
| <code>l3h,on,l3m,on</code> <br /> <code>l2h,on,l2m,on</code> <br /> <code>dch,on,dcm,on</code> || L3 cache hits and misses <br /> L2 cache hits and misses <br /> L1 data-cache hits and misses  
 
| <code>l3h,on,l3m,on</code> <br /> <code>l2h,on,l2m,on</code> <br /> <code>dch,on,dcm,on</code> || L3 cache hits and misses <br /> L2 cache hits and misses <br /> L1 data-cache hits and misses  
 
|-
 
|-
| <code>fp_arith_inst_retired.128b_packed_single</code> <br /> <code>fp_arith_inst_retired.128b_packed_double</code> <br /> <code>fp_arith_inst_retired.256b_packed_single</code> <br /> <code>fp_arith_inst_retired.256b_packed_double</code> || number of SSE/AVX computational 128-bit packed single precision floating-point instructions retired, each count represents 4 computations <br /> number of SSE/AVX computational 128-bit packed double precision floating-point instructions retired, each count represents 2 computations <br /> number of SSE/AVX computational 256-bit packed single precision floating-point instructions retired, each count represents 8 computations <br /> number of SSE/AVX computational 256-bit packed double precision floating-point instructions retired, each count represents 4 computations
+
| <code> cycles,on,dtlbm,on</code> || A high rate of DTLB misses indicates an unpleasant memory access pattern of the program. Large pages might help.
 +
|-
 +
| <code>fp_arith_inst_retired.128b_packed_single</code> <br /> <code>fp_arith_inst_retired.128b_packed_double</code> <br /> <code>fp_arith_inst_retired.256b_packed_single</code> <br /> <code>fp_arith_inst_retired.256b_packed_double</code> || no. of SSE/AVX computational 128-bit packed single precision floating-point instructions retired (count = 4 computations) <br /> no. of SSE/AVX computational 128-bit packed double precision floating-point instructions retired (count = 2 computations) <br /> no. of SSE/AVX computational 256-bit packed single precision floating-point instructions retired (count = 8 computations) <br /> no. of SSE/AVX computational 256-bit packed double precision floating-point instructions retired (count = 4 computations)
 
|}
 
|}
  
Line 86: Line 89:
 
Hence, it is recommended to start the program with as few processes as possible.
 
Hence, it is recommended to start the program with as few processes as possible.
  
== The Oracle performance Analyzer ==
+
== The Oracle Performance Analyzer ==
  
 
After experiment data (e.g. '''test.1.er''') has been obtained by the Sampling Collector it can be evaluated using the Oracle Performance Analyzer as follows:
 
After experiment data (e.g. '''test.1.er''') has been obtained by the Sampling Collector it can be evaluated using the Oracle Performance Analyzer as follows:
Line 95: Line 98:
 
[[File:Oracle-performance-analyzer-call-tree.PNG|1000px]] <br />
 
[[File:Oracle-performance-analyzer-call-tree.PNG|1000px]] <br />
 
This example program just calculates <math>\pi</math> by numerical integration of the function '''f''' with <math>f(x) = \frac{4.0}{1.0 + x^2}</math> from <math>x = 0</math> to <math>x = 1</math>.
 
This example program just calculates <math>\pi</math> by numerical integration of the function '''f''' with <math>f(x) = \frac{4.0}{1.0 + x^2}</math> from <math>x = 0</math> to <math>x = 1</math>.
 +
 +
== Site-specific notes==
 +
=== {{RWTH}} ===
 +
On the RWTH Cluster the Oracle Performance Analyzer must be loaded using the module system.
 +
 +
The Sampling Collector and the Performance Analyzer are part of the Oracle Developer Studio. To get an overview about the installed versions type:
 +
<syntaxhighlight lang="sh">
 +
$ module apropos studio
 +
$ module avail studio
 +
</syntaxhighlight>
 +
Finally, you can load the Oracle Developer Studio by:
 +
<syntaxhighlight lang="sh">
 +
$ module load studio[/<version>]
 +
</syntaxhighlight>
 +
 +
== References ==
 +
[https://docs.oracle.com/cd/E77782_01/html/E77798/index.html Oracle Performance Analyzer Documentation]

Latest revision as of 09:12, 24 June 2020

The Oracle Sampling Collector and the Performance Analyzer are a pair of tools that can be used to collect and analyze performance data for serial or parallel applications. The Sampling Collector gathers performance data by sampling at regular time intervals and by tracing function calls. This information is stored in so-called experiment files, which can then be displayed by the Performance Analyzer.

The Oracle Sampling Collector

In order to collect performance data, it is recommended to first compile the program with debug information using the -g flag. This ensures source line attribution and full functionality of the analyzer. Performance data can then be gathered by linking the program as usual and running it under the control of the Sampling Collector with the command:

$ collect a.out

Profile data is gathered every 10 milliseconds and written to the experiment file test.1.er by default. The number in the filename is automatically incremented on subsequent experiments. The experiment file is in fact an entire directory containing a lot of information. To manipulate these directories, it is recommended to use the provided utility commands er_mv, er_rm and er_cp to move, remove or copy them. This ensures, for example, that time stamps are preserved.

Options

Many different kinds of performance data can be gathered by specifying the right options. To get a list of all available hardware counters just invoke the following command:

$ collect -h

The available counters depend on the CPU (hardware) type of your node, and the same event is sometimes referenced by different names on different CPUs. The number of counters that can be measured in a single application run also depends on the CPU type. In most cases it is hardly possible to use more than 4 counters in the same measurement, since some counters use the same resources and thus conflict with each other. The most important collect options are listed in the table below:

Option                   Description
-p on | off | hi | lo    Clock profiling ('hi' needs to be supported on the system)
-H on | off              Heap tracing
-m on | off              MPI tracing
-h counter0,on,...       Hardware counters
-j on | off              Java profiling
-S on | off | seconds    Periodic sampling (default interval: 1 sec)
-o experimentfile        Output file
-d directory             Output directory
-g experimentgroup       Output file group
-L size                  Output file size limit
-F on | off              Follows descendant processes
-C comment               Puts comments in the notes file for the experiment
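
The format of the -h argument and the limit of about 4 counters per run can be illustrated with a small sketch. The function build_hwc_arg is a hypothetical helper, not part of Oracle Developer Studio, and the experiment name test.2.er is only an example. The script prints the resulting command line instead of running it:

```sh
#!/bin/sh
# Hypothetical helper (not part of Oracle Developer Studio): turns a list of
# counter names into the "name,on,name,on,..." form used by "collect -h".
build_hwc_arg() {
    # Warn when more counters are requested than usually fit into one run.
    if [ "$#" -gt 4 ]; then
        echo "warning: more than 4 counters requested, conflicts are likely" >&2
    fi
    arg=""
    for counter in "$@"; do
        # Append "<counter>,on"; a comma separates consecutive entries.
        arg="${arg:+$arg,}$counter,on"
    done
    printf '%s\n' "$arg"
}

# Dry run: only print the command line that would be used.
echo "collect -p on -h $(build_hwc_arg cycles insts) -o test.2.er ./a.out"
```

Whether a given combination of counters actually works still depends on the CPU, so the list printed by collect -h remains the authoritative source.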

Some hardware counters available on a Skylake CPU that might be useful are listed in the table below:

Counter                                     Description
cycles,on,insts,on                          Cycle count and instruction count. The quotient is the CPI rate (clocks per instruction). The MHz rate of the CPU multiplied by the instruction count and divided by the cycle count gives the MIPS rate.
l3h,on,l3m,on                               L3 cache hits and misses
l2h,on,l2m,on                               L2 cache hits and misses
dch,on,dcm,on                               L1 data-cache hits and misses
cycles,on,dtlbm,on                          A high rate of DTLB misses indicates an unfavorable memory access pattern of the program. Large pages might help.
fp_arith_inst_retired.128b_packed_single    Number of SSE/AVX computational 128-bit packed single precision floating-point instructions retired (each count represents 4 computations)
fp_arith_inst_retired.128b_packed_double    Number of SSE/AVX computational 128-bit packed double precision floating-point instructions retired (each count represents 2 computations)
fp_arith_inst_retired.256b_packed_single    Number of SSE/AVX computational 256-bit packed single precision floating-point instructions retired (each count represents 8 computations)
fp_arith_inst_retired.256b_packed_double    Number of SSE/AVX computational 256-bit packed double precision floating-point instructions retired (each count represents 4 computations)
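
As a small worked example of the CPI and MIPS definitions from the first table row, the following sketch does the arithmetic with awk. The function name cpi_mips and the numbers (3 billion cycles, 1.5 billion instructions, a 2100 MHz clock) are made up for illustration and are not measured values:

```sh
#!/bin/sh
# cpi_mips CYCLES INSTS MHZ
# Illustrative helper: derives CPI and MIPS from the two counter values
# and the CPU clock rate in MHz.
cpi_mips() {
    awk -v cycles="$1" -v insts="$2" -v mhz="$3" 'BEGIN {
        cpi  = cycles / insts          # clocks per instruction
        mips = mhz * insts / cycles    # million instructions per second
        printf "CPI=%.2f MIPS=%.1f\n", cpi, mips
    }'
}

# Example values only (not taken from a real experiment).
cpi_mips 3000000000 1500000000 2100
```

With these inputs every instruction takes two cycles on average, so the MIPS rate is half the clock rate in MHz.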

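The retired floating-point instructions can be used to calculate the FLOPS rate as follows:
Let <math>r_1</math>, <math>r_2</math>, <math>r_3</math> and <math>r_4</math> denote the counts of fp_arith_inst_retired.128b_packed_single, fp_arith_inst_retired.128b_packed_double, fp_arith_inst_retired.256b_packed_single and fp_arith_inst_retired.256b_packed_double, respectively, and let <math>t</math> be the runtime in seconds.
Then the number of floating-point operations per time is given by
<math>\mathrm{FLOPS} = \frac{4 r_1 + 2 r_2 + 8 r_3 + 4 r_4}{t}</math>.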
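
The calculation can be sketched in a few lines of shell, weighting each packed counter by the number of computations per count given in the counter table above (4, 2, 8 and 4). The function name flops_rate and the input values are assumptions chosen for illustration. Only the four packed counters listed in the table are considered:

```sh
#!/bin/sh
# flops_rate N128S N128D N256S N256D SECONDS
# Illustrative helper: weights each packed counter by its computations per
# count and divides the total by the runtime in seconds.
flops_rate() {
    awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" -v t="$5" 'BEGIN {
        ops = 4 * a + 2 * b + 8 * c + 4 * d   # floating-point operations
        printf "%.0f\n", ops / t              # operations per second
    }'
}

# Example values only: four counter readings and a runtime of 2 seconds.
flops_rate 1000 2000 3000 4000 2
```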

Sampling of MPI programs

MPI programs can be sampled in two different ways:

  • by wrapping the MPI binary
  • by wrapping mpiexec

In the first case an example command to start the sampling process could look like this:

$ mpiexec <opt> collect <opt> a.out <opt>

Each MPI process writes its data into its own experiment directory test.*.er.
In the second case the sampling process can be started as follows:

$ collect <opt> -M <MPI> mpiexec <opt> a.out <opt>

All sampling data will be stored in a single "founder" experiment with "subexperiments" for each MPI process.

Note that running collect with a large number of MPI processes may result in an overwhelming amount of experiment data. Hence, it is recommended to start the program with as few processes as possible.
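
To keep an eye on how much experiment data a run has produced, the experiment directories can be counted with standard tools. The following is a minimal sketch. The function name count_experiments is hypothetical, and it only assumes that experiments are directories ending in .er, as described above:

```sh
#!/bin/sh
# count_experiments DIR
# Illustrative helper: prints the number of experiment directories (*.er)
# located directly below DIR.
count_experiments() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -name '*.er' | wc -l | tr -d ' '
}

# Count the experiments in the current working directory.
count_experiments .
```

In the first sampling mode this number grows with the number of MPI processes, since each process writes its own experiment directory.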

The Oracle Performance Analyzer

After experiment data (e.g. test.1.er) has been obtained by the Sampling Collector, it can be evaluated using the Oracle Performance Analyzer as follows:

$ analyzer test.1.er

The GUI offers different views. One of them is the call tree which shows in which order function calls happened during the execution of the program. An example call tree is shown in the image below:
Oracle-performance-analyzer-call-tree.PNG
This example program just calculates <math>\pi</math> by numerical integration of the function f with <math>f(x) = \frac{4.0}{1.0 + x^2}</math> from <math>x = 0</math> to <math>x = 1</math>.
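
The source of the example program is not shown here. As a rough idea of the computation, the following sketch approximates the same integral of 4/(1 + x^2) over the interval from 0 to 1 with the midpoint rule, using awk from the shell. The function name integrate_pi, the choice of the midpoint rule and the number of subintervals are assumptions and need not match the profiled program:

```sh
#!/bin/sh
# integrate_pi N
# Illustrative sketch: midpoint rule with N subintervals for
# f(x) = 4 / (1 + x^2) on the interval [0, 1].
integrate_pi() {
    awk -v n="$1" 'BEGIN {
        h = 1.0 / n                      # width of one subinterval
        sum = 0.0
        for (i = 0; i < n; i++) {
            x = (i + 0.5) * h            # midpoint of subinterval i
            sum += 4.0 / (1.0 + x * x)   # f(x) evaluated at the midpoint
        }
        printf "%.6f\n", sum * h
    }'
}

integrate_pi 1000
```

A loop of this kind spends almost all of its time evaluating f, which is what a call tree of such a program would be expected to reflect.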

Site-specific notes

RWTH Aachen University

On the RWTH Cluster the Oracle Performance Analyzer must be loaded using the module system.

The Sampling Collector and the Performance Analyzer are part of the Oracle Developer Studio. To get an overview of the installed versions, type:

$ module apropos studio
$ module avail studio

Finally, you can load the Oracle Developer Studio with:

$ module load studio[/<version>]

References

Oracle Performance Analyzer Documentation: https://docs.oracle.com/cd/E77782_01/html/E77798/index.html