Intel VTune
Intel VTune™ Amplifier is a performance profiler for serial and parallel programs, including OpenMP and MPI applications. It can be used through a command line interface (CLI) or a graphical user interface (GUI).
Using the Graphical User Interface (GUI)
The graphical interface of the Intel VTune Amplifier XE can usually be started with the command amplxe-gui. The following analysis categories are available:
Hotspots
Information on what parts of the code take up most of the runtime and how they may be optimized
- Hotspots
- Memory Consumption
Microarchitecture
Information on how efficiently the code utilizes the underlying hardware
- Microarchitecture Exploration
- Memory Access
Parallelism
Information on how efficient the parallelization of the code is
- Threading
- HPC Performance Characterization
The following steps lead to a basic optimization of the code. Note that this is only a general approach and that a custom analysis may be more helpful depending on the application.
Step 1: Find the Hotspots
Click on Configure Analysis and select the Hotspots analysis (usually selected by default). Configure the analysis by specifying a script or executable and setting the necessary parameters and environment variables. Click the start button to start the analysis. To obtain accurate and reproducible results, keep the amount of other software running on the same machine to a minimum. Once the analysis has finished, the Summary window should open automatically and the results can be interpreted.
The Elapsed Time shows the total runtime of the application including idle times while CPU Time is the sum of the CPU times of all threads. The Paused time is the total time the application was paused using the GUI, CLI commands or user API. The Top Hotspots section shows the most time-consuming functions sorted by CPU time. Below these one can find a histogram displaying the effective CPU Utilization of the application.
In the Bottom-up window one can find detailed information grouped by function and again sorted by CPU time. Functions with high CPU time and large sections of poor CPU utilization should be targeted first. When selecting a function the full stack data will be displayed in the Call Stack section with the following format:
<module>!<function>-<file>:<line>
(the line being the one calling the next function)
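To make the stack format concrete, the following minimal sketch splits one frame into its four parts using POSIX shell parameter expansion. The frame string, module name, function name and file name are made up for illustration. The sketch assumes that neither the function name nor the file name contains a '-' character, which does not hold in general (for example for some C++ operators).

```shell
#!/bin/sh
# Hypothetical stack frame in the <module>!<function>-<file>:<line> format.
frame='my_app!compute_forces-forces.c:128'

module=${frame%%!*}          # text before the first '!'
rest=${frame#*!}             # everything after the first '!'
line=${rest##*:}             # text after the last ':'
func_and_file=${rest%:*}     # drop the trailing ':<line>'
func=${func_and_file%-*}     # text before the last '-' (assumed separator)
file=${func_and_file##*-}    # text after the last '-'

printf 'module=%s function=%s file=%s line=%s\n' \
    "$module" "$func" "$file" "$line"
```

Read this way, the hypothetical frame says that compute_forces in module my_app called the next function on the stack from line 128 of forces.c.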
To analyse the performance per thread the grouping level can be switched to Thread/Function/Call Stack. Furthermore, the timeline below displays the behaviour of each thread during execution. Hovering over the Timeline itself will show the elapsed time until that point. The Threads section shows the CPU utilization per thread, while the CPU Utilization section shows the overall CPU utilization of the application.
Clicking on a function in the Bottom-up window (grouped by Thread/Function/Call Stack) will open the source editor at the respective code line. The Assembly window can be opened instead of or in addition to the Source window by ticking the respective boxes in the toolbar. The columns can be reorganized by drag-and-drop; the changes are saved and reused in other projects.
Step 2: Find Hardware usage bottlenecks
Step 3: Optimize memory access performance
Step 4: Final check
Using the Command Line Interface (CLI)
The Intel VTune command line interface (CLI) can be started with the command amplxe-cl and the following parameters:
parameter | description
<-action> | usually collect or report
[-action-option] | modify an action's behavior
[-global-option] | modify the global settings
<target> | set the target application
[target-options] | additional parameters or input required by the target application
action: collect
Syntax: amplxe-cl -collect <analysis_type>
(-collect may be shortened to -c)
The following analysis types are a selection that is of most interest for HPC applications. The full list can be found in the official Intel VTune Amplifier User Guide.
analysis type | description
hotspots | identify hotspots and collect stacks and call tree information
advanced-hotspots | identify hotspots by using hardware counters and ignore stack and call tree
general-exploration | find low-level hardware issues
memory-access | find memory access related issues and estimate bandwidth
Additionally, a large number of global modifiers is available, a small selection of which is listed below.
global modifier | description
-[no]auto-finalize | [do not] finalize the result after collection
-data-limit | limit the amount of data that may be collected (default 1GB)
-quiet | display less information
-search-dir | set the path where the binary and symbol files are stored
-result-dir | set the path where the result should be stored
Example:
Running amplxe-cl -collect hotspots <target> with no further options prints a summary to the terminal once the collection has finished. As in the GUI, the summary reports the overall elapsed and idle times as well as the CPU times of the individual functions in descending order (the list of hotspots). The CPU utilization is also analyzed and rated.
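A collection can be combined with several of the modifiers listed above. The following is a minimal sketch, not a recommended configuration: the target ./my_app, the result directory names r_<analysis_type> and the data limit of 500 (assumed to be interpreted in MB) are placeholders chosen for illustration. The wrapper prints each command line and only executes it if amplxe-cl is found in the PATH.

```shell
#!/bin/sh
# collect <analysis_type> <target>
# Hypothetical helper: builds one amplxe-cl collection command, prints it,
# and runs it only when amplxe-cl is available on this machine.
collect() {
    set -- amplxe-cl -collect "$1" \
        -result-dir "r_$1" \
        -data-limit=500 \
        -quiet \
        -- "$2"
    echo "$*"
    if command -v amplxe-cl >/dev/null 2>&1; then
        "$@"
    fi
}

# One result directory per analysis type, so the results do not collide.
for analysis in hotspots memory-access; do
    collect "$analysis" ./my_app
done
```

Writing each analysis into its own result directory keeps the runs separate, so each can be opened later with the report action or in the GUI.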
action: report
Syntax: amplxe-cl -report <report_type> [-report-option]
(-report may be shortened to -R; the lowercase -r is the short form of -result-dir)
The report types and options which are of most interest for HPC applications are:
report type | description
summary | overview of the overall performance of the target
hotspots | CPU time consumed by each function
hw-events | hardware event counts collected during the analysis
report option | description
-column | include or exclude columns
-filter | include or exclude data
-group-by | specify grouping
-time-filter | specify a time range
-source-search-dir | set the path where the source code is stored
-result-dir | set the path of the result to read
Example:
First, run a hotspot collection and store the results in a directory (amplxe-cl -collect hotspots -r result <target>).
Next, run amplxe-cl -report hotspots -r result to receive a report which contains only the data of interest (set by the report-options), e.g. only specific columns like CPU time.
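As a sketch of how a report option changes the output, the wrapper below requests a hotspots report grouped by module instead of by function. The helper name report_by_module and the grouping value module are assumptions chosen for illustration; the groupings actually available depend on the VTune version and the collected result. The command is printed and only executed if amplxe-cl is found in the PATH.

```shell
#!/bin/sh
# report_by_module <result_dir>
# Hypothetical helper: builds a hotspots report command for an existing
# result directory, prints it, and runs it only when amplxe-cl is available.
report_by_module() {
    set -- amplxe-cl -report hotspots \
        -result-dir "$1" \
        -group-by=module
    echo "$*"
    if command -v amplxe-cl >/dev/null 2>&1; then
        "$@"
    fi
}

# "result" is the directory written by the collection step above.
report_by_module result
```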
References
Tutorials by Intel [1]
Intel VTune™ Amplifier Performance Analysis Cookbook [2]