Intel Advisor is a vectorization and threading optimization tool suite consisting of the following three tools:
- Vectorization Advisor
- Threading Advisor
- Flow Graph Analyzer
Intel Advisor can be used through either a graphical user interface (GUI) or a command line interface (CLI). The GUI can be started with
advixe-gui
and is recommended for most analyses. The CLI can be used, for example, to visualize results on macOS machines or to automate Intel Advisor workflows. Its syntax is
advixe-cl <-action> [-project-dir PATH] [-action-options] [-global-options] [[--] target [target options]]
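For automation, the command line can be assembled programmatically. The following is a minimal sketch in Python that mirrors the syntax above. The helper name `build_advisor_command`, the project directory `./advi_results` and the application `./my_app` with its arguments are hypothetical placeholders. The sketch only builds and prints the command; it does not run Intel Advisor.

```python
import shlex


def build_advisor_command(action, project_dir, target, target_args=(), action_options=()):
    """Assemble an advixe-cl argument list in the order given by the syntax line."""
    cmd = ["advixe-cl", *action, "-project-dir", project_dir]
    cmd.extend(action_options)  # options that belong to the chosen action
    cmd.append("--")            # separates Advisor options from the target
    cmd.append(target)
    cmd.extend(target_args)     # passed through to the target unchanged
    return cmd


# Hypothetical project directory, application and application arguments.
survey_cmd = build_advisor_command(
    ["-collect", "survey"], "./advi_results", "./my_app", ["--size", "1024"]
)
print(shlex.join(survey_cmd))
```

On a machine where Intel Advisor is installed, the resulting list could be passed to `subprocess.run(survey_cmd, check=True)`.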
To achieve reliable and reproducible results, the application to be optimized should be built in release mode with a high optimization level, plus additional settings that depend on the tool and the type of analysis. These settings are listed in Intel's official documentation.
Vectorization Advisor
The Vectorization Advisor offers a number of analyses, which are briefly summarized below.
Survey Hotspots Analysis
The Survey Hotspots Analysis is usually the first analysis to run. It shows where (improved) vectorization would be most beneficial, especially in loops, and includes compiler report data on whether auto-vectorization was possible and why or why not. Its main advantage is a low runtime overhead. Its downside is that the collected data may not be sufficient to reveal every potential problem or improvement. A successful run produces a Survey Report, in which various configurations and filters are available to limit the amount of displayed data.
Trip Counts Analysis
The Trip Counts Analysis counts how many times loops are executed. It also highlights trip counts that are either too small or not a multiple of the vector length. The collected data is added to the Survey Report created by the Survey Hotspots Analysis, which may increase the report generation time. However, several techniques are available to minimize data collection, result size and execution time.
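Why a trip count that is not a multiple of the vector length matters can be illustrated with a little arithmetic. The sketch below is a simplified model, not what Intel Advisor computes: it assumes that the vector length in elements is the register width divided by the element width, and it ignores peel loops, unrolling and alignment. The function name and the example numbers are chosen for illustration only.

```python
def split_trip_count(trip_count, vector_bits, element_bits):
    """Split a loop's trip count into full vector iterations and leftover scalar iterations."""
    lanes = vector_bits // element_bits  # elements processed per vector iteration
    full_vector_iterations, remainder = divmod(trip_count, lanes)
    return lanes, full_vector_iterations, remainder


# Example: 1003 iterations over 32-bit floats with 256-bit vector registers.
lanes, full_iterations, remainder = split_trip_count(1003, 256, 32)
print(f"{lanes} lanes, {full_iterations} vector iterations, {remainder} remainder iterations")
```

A non-zero remainder means that some iterations have to be handled outside the main vectorized loop body. With very small trip counts, this leftover work can make up a large share of the loop.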
FLOP Analysis
The FLOP Analysis measures floating-point and integer operations as well as memory traffic. The collected data is likewise added to the Survey Report. It provides additional information on the application's memory usage and performance, which helps in deciding on a vectorization strategy.
Roofline Analysis
The Roofline Analysis visualizes the actual performance of the application in relation to the maximum performance achievable under hardware limitations such as compute capacity or memory bandwidth. The Cache-Aware Roofline Model (CARM) also offers a self data capability, which considers only the data, floating-point operations and runtime of a single loop or function. Self data specifically excludes inner loops and called functions, whose data is included only in the total data. This makes it possible to see whether and how a loop or function behaves differently depending on where it is called from, which can reveal design inefficiencies that may be causing other problems as well.
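The idea behind a roofline can be sketched with the basic textbook formula: attainable performance is the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth. The example below assumes a hypothetical machine with a peak of 100 GFLOP/s and a bandwidth of 50 GB/s. These numbers and the function name are illustrative and not taken from Intel Advisor.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gb_s):
    """Basic roofline bound: limited either by memory traffic or by compute capacity."""
    memory_bound = arithmetic_intensity * bandwidth_gb_s  # FLOP/byte * GB/s = GFLOP/s
    return min(peak_gflops, memory_bound)


PEAK_GFLOPS = 100.0    # hypothetical compute roof
BANDWIDTH_GB_S = 50.0  # hypothetical memory roof

for intensity in (0.5, 2.0, 4.0):
    bound = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GB_S)
    limit = "memory" if bound < PEAK_GFLOPS else "compute"
    print(f"AI = {intensity} FLOP/byte: at most {bound} GFLOP/s ({limit} roof)")
```

A loop whose measured performance lies far below its bound has room for optimization. A loop close to the memory roof benefits more from reducing memory traffic than from further vectorization.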
Refinement Analyses
Intel Advisor offers two Refinement Analyses: the Dependencies Analysis and the Memory Access Patterns (MAP) Analysis. The former finds potential data dependencies, while the latter checks for various memory access issues. Both kinds of problem may prevent the compiler from auto-vectorizing parts of the code.
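Why a data dependency can block vectorization is easiest to see with a loop in which each iteration reads the result of the previous one. The sketch below uses plain Python lists as a stand-in for compiled code; the function names are hypothetical. It compares the correct sequential loop with a naive version in which all iterations read only the original values, as if they were executed at once.

```python
def running_sum_sequential(b):
    """a[i] = a[i-1] + b[i]: every iteration needs the value written by the previous one."""
    a = list(b)
    for i in range(1, len(a)):
        a[i] = a[i - 1] + b[i]
    return a


def running_sum_all_at_once(b):
    """Mimics naive vectorization: every iteration reads the ORIGINAL a[i-1]."""
    old = list(b)
    return [old[0]] + [old[i - 1] + b[i] for i in range(1, len(b))]


values = [1, 2, 3, 4]
print(running_sum_sequential(values))   # correct result
print(running_sum_all_at_once(values))  # differs, so the iterations are not independent
```

Because the two results differ, the iterations cannot simply be executed in parallel lanes. A loop without such a dependency, for example one that doubles each element, yields the same result in either execution order.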
Threading Advisor
The Threading Advisor mainly consists of the Suitability Analysis and the Suitability Report. It measures the application's actual and potential maximum parallel performance, giving an estimate of how likely it is that improved parallelism yields performance gains. The estimate takes into account the parallel overhead, the number of iterations of a loop and the duration of one iteration. The main goal of this analysis is to identify the parts of the code that can be noticeably improved by parallelization, and to avoid wasting time on parallelizing regions that have little to no impact on overall performance. The downside of this analysis is a rather large runtime overhead during data collection as well as large result sizes.
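How parallel overhead, iteration count and iteration duration interact can be illustrated with a toy model. This is not the model Intel Advisor uses. It assumes that iterations are distributed evenly over the threads and that each iteration pays a fixed overhead when run in parallel. The function name and all numbers are hypothetical.

```python
import math


def estimated_loop_speedup(iterations, iteration_seconds, overhead_seconds, threads):
    """Toy estimate: serial time divided by parallel time including per-iteration overhead."""
    serial_time = iterations * iteration_seconds
    rounds = math.ceil(iterations / threads)  # iterations each thread has to execute
    parallel_time = rounds * (iteration_seconds + overhead_seconds)
    return serial_time / parallel_time


# Long iterations: the overhead is tolerable and the loop scales.
coarse = estimated_loop_speedup(1000, 1e-3, 1e-3, threads=8)
# Very short iterations: the overhead dominates and parallelization slows the loop down.
fine = estimated_loop_speedup(1000, 1e-6, 1e-5, threads=8)
# Fewer iterations than threads: some threads stay idle.
few = estimated_loop_speedup(4, 1e-3, 1e-3, threads=8)

print(f"coarse-grained: {coarse:.2f}x, fine-grained: {fine:.2f}x, few iterations: {few:.2f}x")
```

Even this crude estimate shows why a loop with many short iterations, or one with very few iterations, may not be worth parallelizing.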
Flow Graph Analyzer
Graph parallelism is enabled by using the Intel Threading Building Blocks (Intel TBB) library. A Dependency Flow Graph can be used to display hold-ups and to find potential deadlocks or single points of failure. A Message Flow Graph displays the messages travelling between nodes during execution. Excessive copying of values can be avoided by passing pointers instead of the values themselves.