

# Faster Code.... Faster

## Intel<sup>®</sup> Parallel Studio XE 2017

Dr.-Ing. Michael Klemm Software and Services Group michael.klemm@intel.com

Unleash the Beast...

## Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804



## Intel<sup>®</sup> Parallel Studio XE

- Intel<sup>®</sup> C/C++ & Fortran Compilers
- Intel<sup>®</sup> Distribution for Python: Performance Scripting (coming soon, Q3'16)

**Performance Libraries**

- Intel<sup>®</sup> Data Analytics Acceleration Library: Optimized for Data Analytics & Machine Learning
- Intel<sup>®</sup> Math Kernel Library: Optimized Routines for Science, Engineering & Financial
- Intel<sup>®</sup> Integrated Performance Primitives: Image, Signal & Compression Routines
- Intel<sup>®</sup> Threading Building Blocks: Task Based Parallel C++ Template Library

**Profiling, Analysis & Architecture**

- Intel<sup>®</sup> VTune<sup>™</sup> Amplifier: Performance Profiler
- Intel<sup>®</sup> Inspector: Memory & Threading Checking
- Intel<sup>®</sup> Advisor: Vectorization Optimization & Thread Prototyping

**Cluster Tools**

- Intel<sup>®</sup> MPI Library
- Intel<sup>®</sup> Trace Analyzer & Collector: MPI Profiler
- Intel<sup>®</sup> Cluster Checker: Cluster Diagnostic Expert System




# INTEL<sup>®</sup> COMPILERS

## Intel<sup>®</sup> Compilers for Parallel Studio XE 2017

What's new in Intel<sup>®</sup> C++ 17.0 and Intel<sup>®</sup> Fortran 17.0

Productive language-level vectorization & parallelism models for advanced developers driving application performance

#### Common updates

- Enhanced support for the newest AVX2 and AVX512 instruction sets for the latest Intel<sup>®</sup> processors (including Intel<sup>®</sup> Xeon Phi)
- Enhanced optimization/vectorization reports and register allocation
- Tight integration with Intel<sup>®</sup> Advisor
- Initial support for OpenMP\* 4.5, offering improved vectorization control, new SIMD instructions, and much more

#### Intel<sup>®</sup> C++ Compiler

- SIMD Data Layout Template to facilitate vectorization for your C++ code
- Virtual function vectorization capability
- Improved compiler loop and function alignment
- Full support for the latest C11 and C++14 standards

#### Intel<sup>®</sup> Fortran Compiler

- Substantial coarray performance improvement
  - up to **twice as fast** as previous versions on non-trivial coarray Fortran programs
- Almost complete Fortran 2008 support
- Further interoperability with C (part of draft Fortran 2015)




## **Impressive Performance Improvement**

Intel<sup>®</sup> Compiler OpenMP\* Explicit Vectorization

- Three added lines (the OpenMP pragmas) take full advantage of both SSE and AVX
- The pragmas are ignored by other compilers, so the code stays portable

```
#pragma omp declare simd linear(z:40) uniform(L, N, Nmat) linear(k)
float path_calc(float *z, float L[][VLEN], int k, int N, int Nmat)

#pragma omp declare simd uniform(L, N, Nopt, Nmat) linear(k)
float portfolio(float L[][VLEN], int k, int N, int Nopt, int Nmat)
.....
for (path=0; path<NPATH; path+=VLEN) {
  /* Initialise forward rates */
  z = z0 + path * Nmat;
  #pragma omp simd linear(z:Nmat)
  for(int k=0; k < VLEN; k++) {
    for(i=0;i<N;i++) {
      L[i][k] = L0[i];
    }

    /* LIBOR path calculation */
    float temp = path_calc(z, L, k, N, Nmat);
    v[k+path] = portfolio(L, k, N, Nopt, Nmat);

    /* move pointer to start of next block */
    z += Nmat;
  }
}
```
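The slide's excerpt elides the function bodies, so it cannot be compiled on its own. As a minimal, self-contained sketch of the same two directives (`declare simd` on a function, `omp simd` on the calling loop), the following may help. The names `damped`, `apply_damped` and `run_simd_demo` are hypothetical and are not part of the LIBOR benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-element kernel. 'declare simd' asks the compiler to
   also generate a vector variant; 'uniform(scale)' states that 'scale'
   has the same value in every vector lane. */
#pragma omp declare simd uniform(scale)
float damped(float x, float scale)
{
    return scale * x / (1.0f + x * x);
}

/* Applies the kernel to n elements; 'omp simd' requests vectorization
   of the loop, including the call to damped(). */
void apply_damped(const float *in, float *out, size_t n, float scale)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        out[i] = damped(in[i], scale);
    }
}

/* Self-check against a plain scalar computation.
   Returns the number of elements that were verified. */
int run_simd_demo(void)
{
    enum { N = 1000 };
    float in[N], out[N];

    for (int i = 0; i < N; i++) {
        in[i] = 0.01f * (float)i;
    }
    apply_damped(in, out, N, 2.0f);

    for (int i = 0; i < N; i++) {
        float expected = 2.0f * in[i] / (1.0f + in[i] * in[i]);
        float diff = out[i] - expected;
        if (diff < 0.0f) {
            diff = -diff;
        }
        /* Small tolerance: vector code may round slightly differently. */
        assert(diff <= 1e-5f);
    }
    return N;
}
```

A compiler without OpenMP SIMD support simply skips the pragmas and builds the scalar loop, which is the portability point made on the slide.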

#### Libor calculation speedup

SIMD speedup on Intel<sup>®</sup> Xeon<sup>®</sup> processor, normalized performance data (higher is better). Chart series: Serial, SSE4.2, Core-AVX2.

Configuration: Intel<sup>®</sup> Xeon<sup>®</sup> CPU E3-1270 @ 3.50 GHz Haswell system (4 cores with Hyper-Threading On), running at 3.50GHz, with 32.0GB RAM, L1 Cache 256KB, L2 Cache 1.0MB, L3 Cache 8.0MB, 64-bit Windows\* Server 2012 R2 Datacenter. Compiler options: SSE4.2: -O3 -Qopenmp-simd -QxSSE4.2; AVX2: -O3 -Qopenmp-simd -QxCORE-AVX2. For more information go to http://www.intel.com/performance

\* Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation



# **INTEL SOFTWARE ANALYSIS TOOLS**

Intel<sup>®</sup> VTune<sup>™</sup> Amplifier XE Performance Profiler

Intel® Advisor XE Vectorization Optimization and Thread Prototyping

# **INTEL® VTUNE™ AMPLIFIER XE** Performance profiler




## Intel<sup>®</sup> VTune<sup>™</sup> Amplifier

### Faster, Scalable Code, Faster

### Get the Data You Need

- Hotspot (Statistical call tree), Call counts (Statistical)
- Thread Profiling: Concurrency and Locks & Waits Analysis
- Cache miss, Bandwidth analysis...<sup>1</sup>
- GPU Offload and OpenCL<sup>™</sup> Kernel Tracing

### **Find Answers Fast**

- View Results on the Source / Assembly
- OpenMP Scalability Analysis, Graphical Frame Analysis
- Filter Out Extraneous Data, Organize Data with Viewpoints
- Visualize Thread & Task Activity on the Timeline

### Easy to Use

- No Special Compiles: C, C++, C#, Fortran, Java, ASM
- Visual Studio\* Integration or Stand Alone
- Graphical Interface & Command Line
- Local & Remote Data Collection
- Analyze Windows\* & Linux\* data on OS X\*<sup>2</sup>

<sup>1</sup> Events vary by processor. <sup>2</sup> No data collection on OS X\*


#### **Quickly Find Tuning Opportunities**

[Screenshot: VTune Amplifier function / call stack view with CPU time split into effective time by utilization (Idle, Poor, Ok, Ideal, Over), spin time and overhead time. The top hotspot is `FireObject::checkCollision` with 4.507s; `CBaseDevice::Present` shows 2.335s plus 0.671s.]

#### See Results On The Source Code

| Source Line | Source | CPU Time: Total by Utilization |
|-------------|--------|--------------------------------|
| 81 | `for (int i = 0; i < mem_array_i_max; i++)` | 0.300s |
| 82 | `{` | |
| 83 | `for (int j = 0; j < mem_array_j_max; j++)` | 4.936s |
| 84 | `{` | |
| 85 | `mem_array[j*mem_array_j_max+i] = *fill_val` | 7.207s |
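In the source view above, the inner loop runs over `j` while the index is `j*mem_array_j_max+i`, so consecutive iterations write addresses that are `mem_array_j_max` elements apart. The slide does not show a fix. One common option, assuming the iteration order does not matter, is to interchange the loops so the inner loop walks consecutive addresses. The sketch below uses hypothetical names (`fill_strided`, `fill_contiguous`, `fills_match`) and a plain `rows` x `cols` array rather than the profiled application's data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Strided order: the inner loop jumps 'cols' elements per iteration,
   the same access pattern as line 85 in the source view. */
void fill_strided(int *a, size_t rows, size_t cols, int value)
{
    for (size_t c = 0; c < cols; c++) {
        for (size_t r = 0; r < rows; r++) {
            a[r * cols + c] = value;
        }
    }
}

/* Interchanged order: the inner loop writes consecutive addresses. */
void fill_contiguous(int *a, size_t rows, size_t cols, int value)
{
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            a[r * cols + c] = value;
        }
    }
}

/* Returns 1 if both loop orders produce identical arrays, else 0. */
int fills_match(size_t rows, size_t cols, int value)
{
    size_t n = rows * cols;
    int *x = malloc(n * sizeof *x);
    int *y = malloc(n * sizeof *y);
    assert(x != NULL && y != NULL);

    fill_strided(x, rows, cols, value);
    fill_contiguous(y, rows, cols, value);
    int same = (memcmp(x, y, n * sizeof *x) == 0);

    free(x);
    free(y);
    return same;
}
```

Both versions compute the same result. Whether the interchange pays off on a given machine is something a profiler run would have to confirm.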

#### **Tune OpenMP Scalability**



#### Visualize & Filter Data






## **Profile Python & Go!** And Mixed Python / C++ / Fortran

### Low Overhead Sampling

- Accurate performance data without high overhead instrumentation
- Launch application or attach to a running process

### Precise Line Level Details

- No guessing, see source line level detail

### Mixed Python / native C, C++, Fortran...

- Optimize native code driven by Python

[Screenshot: Intel VTune Amplifier XE 2017, Basic Hotspots, Hotspots by CPU Usage viewpoint, source view of `main.py`. The selected stack (100.0%, 3.388s of 3.388s) runs from `main.py` through `python27.dll` into `core.pyd`.]

| Source Line | Source | CPU Time: Total |
|-------------|--------|-----------------|
| 10 | `def doLog():` | |
| 11 | `template, objects = makeParams()` | |
| 12 | `for _ in xrange(1000):` | |
| 13 | `logging.info(template.format(*objects))` | 86.7% |




## Three Keys to HPC Performance:

## Threading, Memory Access, Vectorization – Intel VTune™ Amplifier

### Threading: CPU Utilization

- Serial vs. Parallel time
- Top OpenMP regions by potential gain
- Tip: Use hotspot OpenMP region analysis for more detail

### Memory Access Efficiency

- Stalls by memory hierarchy
- Bandwidth utilization
- Tip: Use Memory Access analysis

### Vectorization: FPU Utilization

- FLOPS<sup>†</sup> estimates from sampling
- Tip: Use Intel Advisor for precise metrics and vectorization optimization



<sup>†</sup> For 3rd, 5th, 6th Generation Intel<sup>®</sup> Core<sup>™</sup> processors and second generation Intel<sup>®</sup> Xeon Phi<sup>™</sup> processor code named Knights Landing.




## **Optimize Memory Access**

Memory Access Analysis - Intel<sup>®</sup> VTune<sup>™</sup> Amplifier 2017

Tune data structures for performance

- Attribute cache misses to data structures (not just the code causing the miss)
- Support for custom memory allocators

Optimize NUMA latency & scalability

- True & false sharing optimization
- Auto detect max system bandwidth
- Easier tuning of inter-socket bandwidth
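False sharing occurs when threads update different variables that happen to sit on the same cache line. A hedged sketch of the usual remedy, padding each per-worker slot to its own line, is shown below. The 64-byte line size, the slot count and the names `padded_counter` and `count_padded` are assumptions for illustration and are not taken from VTune Amplifier.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */
#define NSLOTS 4        /* hypothetical number of workers */

/* Each counter is padded to a full cache line so that workers updating
   neighboring slots do not share (and invalidate) the same line. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

_Static_assert(sizeof(struct padded_counter) == CACHE_LINE,
               "each counter must occupy exactly one cache line");

/* Every worker increments only its own slot; returns the sum of all slots.
   Without OpenMP the pragma is skipped and the slots are filled serially. */
long count_padded(long iters_per_slot)
{
    _Alignas(CACHE_LINE) struct padded_counter slots[NSLOTS] = {{0}};

    #pragma omp parallel for
    for (int s = 0; s < NSLOTS; s++) {
        for (long i = 0; i < iters_per_slot; i++) {
            slots[s].value++;
        }
    }

    long total = 0;
    for (int s = 0; s < NSLOTS; s++) {
        total += slots[s].value;
    }
    return total;
}
```

Dropping the `pad` member would pack all counters into one line. The result stays the same, but on a multi-socket system the line may bounce between cores, which is the kind of cost a memory access analysis is meant to expose.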

Easier install, latest processors

- No special drivers required on Linux\*
- Intel<sup>®</sup> Xeon Phi<sup>™</sup> processor MCDRAM (high bandwidth memory) analysis






## Storage Device Analysis (HDD, SATA or NVMe SSD)

Intel<sup>®</sup> VTune<sup>™</sup> Amplifier

### Are You I/O Bound or CPU Bound?

- Explore imbalance between I/O operations (async & sync) and compute
- Storage accesses mapped to the source code
- See when CPU is waiting for I/O
- Measure bus bandwidth to storage

### Latency analysis

- Tune storage accesses with latency histogram
- Distribution of I/O over multiple devices

#### **Disk Input and Output Histogram**




## Intel<sup>®</sup> Performance Snapshots

Three Fast Ways to Discover Untapped Performance

Is your application making good use of modern computer hardware?

- Run a test case during your coffee break.
- High level summary shows which apps can benefit most from code modernization and faster storage.

### Pick a Performance Snapshot:

- Application: for non-MPI apps
- MPI: for MPI apps
- Storage: for systems, that is, servers and workstations with directly attached storage

#### Free download: <u>http://www.intel.com/performance-snapshot</u> Also included with Intel® Parallel Studio and Intel® VTune™ Amplifier products.










### **MPI Performance Snapshot**



[Screenshot: MPI Performance Snapshot summary page, with panels including "TOP 5 MPI functions" and "GFLOPS".]

#### Free download: <u>http://www.intel.com/performance-snapshot</u>. Also included with Intel® Parallel Studio Cluster Edition.


# **INTEL® ADVISOR XE** Vectorization optimization and thread prototyping For software architects




## Get Faster Code Faster! Intel® Advisor

### Vectorization Optimization

### Have you:

- Recompiled for AVX2 with little gain?
- Wondered where to vectorize?
- Recoded intrinsics for new arch.?
- Struggled with compiler reports?

## Data Driven Vectorization:

- What vectorization will pay off most?
- What's blocking vectorization? Why?
- Are my loops vector friendly?
- Will reorganizing data increase performance?
- Is it safe to just use pragma simd?
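To make the data reorganization question concrete: one common reorganization is switching from an array of structures (AoS) to a structure of arrays (SoA), so that a loop over one field touches contiguous memory. The sketch below is only an illustration of that idea. The types `point_aos` and `points_soa`, the size `NPOINTS` and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NPOINTS 256  /* hypothetical problem size */

/* Array of structures: x, y, z of one point are adjacent, so a loop
   over only x strides through memory three floats at a time. */
struct point_aos { float x, y, z; };

/* Structure of arrays: all x values are contiguous (unit stride). */
struct points_soa {
    float x[NPOINTS];
    float y[NPOINTS];
    float z[NPOINTS];
};

/* Copies n <= NPOINTS points from the AoS layout into the SoA layout. */
void aos_to_soa(const struct point_aos *in, struct points_soa *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}

/* Scales only the x field: stride-3 access in the AoS layout. */
void scale_x_aos(struct point_aos *p, size_t n, float f)
{
    for (size_t i = 0; i < n; i++) {
        p[i].x *= f;
    }
}

/* Same operation on the SoA layout: unit-stride, vector friendly. */
void scale_x_soa(struct points_soa *p, size_t n, float f)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        p->x[i] *= f;
    }
}

/* Runs both layouts on the same data; returns how many points agree. */
int layout_demo(void)
{
    struct point_aos aos[NPOINTS];
    struct points_soa soa;

    for (size_t i = 0; i < NPOINTS; i++) {
        aos[i].x = (float)i;
        aos[i].y = 2.0f * (float)i;
        aos[i].z = 3.0f * (float)i;
    }
    aos_to_soa(aos, &soa, NPOINTS);
    scale_x_aos(aos, NPOINTS, 0.5f);
    scale_x_soa(&soa, NPOINTS, 0.5f);

    int matches = 0;
    for (size_t i = 0; i < NPOINTS; i++) {
        if (aos[i].x == soa.x[i] && aos[i].y == soa.y[i]) {
            matches++;
        }
    }
    return matches;
}
```

Whether the SoA form is actually faster depends on which fields the hot loops use together, which is the sort of question a memory access patterns analysis is intended to answer with data.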

[Screenshot: Intel Advisor XE 2016, "Where should I add vectorization and/or threading parallelism?", Survey Report tab. Elapsed time: 54.44s.]

| Function Call Sites and Loops | Vector Issues | Self Time | Total Time | Trip Counts | Loop Type | Why No Vectorization? | Vectorized Loops: Vecto... | Vectorized Loops: Efficiency |
|---|---|---|---|---|---|---|---|---|
| [loop at stl_algo.h:4740 i... | | 0.170s | 0.170s | | Scalar | non-vectorizable l... | | |
| [loop at loopstl.cpp:2449... | | 0.170s | 0.170s | 12; 4 | | | AVX | ~100% |
| [loop at loopstl.cpp:2... | | 0.150s | 0.150s | 12 | Vectorized (B... | | AVX | |
| [loop at loopstl.cpp:2... | | 0.020s | 0.020s | 4 | Remainder | | | |
| [loop at loopstl.cpp:7900,... | | 0.170s | 0.170s | 500 | Scalar | vectorization possi... | | |
| [loop at loopstl.cpp:35... | 1 High vector regi... | 0.160s | 0.160s | 12 | | | AVX | ~69% |

"Intel<sup>®</sup> Advisor's Vectorization Advisor permitted me to focus my work where it really mattered. When you have only a limited amount of time to spend on optimization, it is invaluable."

Gilles Civario Senior Software Architect Irish Centre for High-End Computing




## Faster Code Faster with Data Driven Design

Intel® Advisor - Vectorization Optimization and Thread Prototyping

### Faster Vectorization Optimization:

- Vectorize where it will pay off most
- Quickly ID what is blocking vectorization
- Tips for effective vectorization
- Safely force compiler vectorization
- Optimize memory stride
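Forcing vectorization is only safe when the loop's dependences allow it. As a hedged illustration (not Advisor output), the sketch below contrasts a reduction, where the `reduction` clause tells the compiler how to handle the carried accumulator, with a prefix sum, whose true loop-carried dependence means a plain `simd` pragma should not be forced onto it. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product: 'sum' is carried across iterations, so a bare 'omp simd'
   would not be safe here. The reduction clause declares the dependence,
   letting each vector lane keep a private partial sum. */
double dot(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Prefix sum: out[i] needs out[i-1], a true loop-carried dependence.
   The loop is deliberately left scalar. */
void prefix_sum(const double *in, double *out, size_t n)
{
    if (n == 0) {
        return;
    }
    out[0] = in[0];
    for (size_t i = 1; i < n; i++) {
        out[i] = out[i - 1] + in[i];
    }
}

/* Checks both routines on small integer-valued data; returns 1 on success. */
int dependence_demo(void)
{
    double a[8], b[8], p[8];
    for (int i = 0; i < 8; i++) {
        a[i] = (double)(i + 1);  /* 1, 2, ..., 8 */
        b[i] = 2.0;
    }
    prefix_sum(a, p, 8);
    /* 2 * (1 + ... + 8) = 72, and the last prefix sum is 36. */
    return dot(a, b, 8) == 72.0 && p[7] == 36.0 && p[0] == 1.0;
}
```

The inputs are small integers so the comparison is exact even if the vectorized reduction adds the terms in a different order.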

### Breakthrough for Threading Design:

- Quickly prototype multiple options
- Project scaling on larger systems
- Find synchronization errors before implementing threading
- Design without disrupting development

#### Less Effort, Less Risk and More Impact




Part of Intel® Parallel Studio for Windows\* and

#### http://intel.ly/advisor-xe




## Next Gen Intel® Xeon Phi™ Support

### Vectorization Advisor runs on and optimizes for Intel® Xeon Phi

| 1                    |                                        |                            |                                                                          |                                                                        |                                                          | Vectorized           | Loops                     |                                    |       | Instruction Set Analysis     |            | AVX-512 ERI – specific to                      |
|----------------------|----------------------------------------|----------------------------|--------------------------------------------------------------------------|------------------------------------------------------------------------|----------------------------------------------------------|----------------------|---------------------------|------------------------------------|-------|------------------------------|------------|------------------------------------------------|
| Loops                |                                        | ۵                          | Vector Issues                                                            | Self Time▼                                                             | Loop Туре                                                | Vector ISA           | Efficiency                | Gain Esti                          | 1     |                              | Data Type: |                                                |
| <b>=0</b> [loop      | P                                      |                            | 8 <u>3</u> Possible i                                                    | . 35.226s 5.4 <b>%</b>                                                 | Vectorized+Threaded (Body; Peeled; Re                    | . AVX512             | <mark>~28</mark> 5        | 2.21x                              | 8     | Divisions; FMA; Gathers      | Float32; . |                                                |
| <b>⊠</b> [lo         | oc                                     |                            | 8 2 Possible in                                                          | . 26.025s 4.0%                                                         | Vectorized (Body)+Threaded (OpenMP)                      | AVX512               |                           |                                    | 8     | Divisions; Gathers; FMA      | Float32;   | 256/512 AV                                     |
| ⊇ <mark>⊍</mark> [lo | 00                                     |                            | ♀ <u>1</u> High vecto                                                    | . 5.876s 🔜                                                             | Vectorized (Peeled)+Threaded (OpenMP)                    | AVX512               |                           |                                    | 8     | Divisions; Gathers; FMA      | Float32;   | . 256/512 AVX2: AVX512ER_512: AVX512 Masked Lc |
| ⊇ <mark>⊍</mark> [lo | oc                                     |                            | ♀ <u>1</u> High vecto                                                    | . 3.324s                                                               | Vectorized (Remainder)+Threaded (Open                    | AVX512               |                           |                                    | 8     | Divisions; Gathers; FMA      | Float32;   | . 256/512 AVX2; AVX512ER_512; AVX512 Masked Lc |
| <b>⊞</b> [loop       | 0                                      |                            |                                                                          | 34.599s 5.3%                                                           | Vectorized (Body; Remainder)                             | AVX512               | ~70%                      | 5.64x                              | 8     | Divisions; FMA; Square Roots | Float32;   | 256/51 AVX2; AVX512ER_512; AVX512 Masked Lc    |
| <b>⊞</b> [loop       | 5                                      |                            | 🛿 <u>1</u> Possible in                                                   | . 33.849s 5.2%                                                         | Vectorized (Body; Peeled; Remainder)                     | AVX512               | <mark>~28%</mark>         | 2.24x                              | 8     | Divisions; FMA; Gathers      | Float32;   | 256/512 AVX; AVX2; AVX512ER_512; AV Masked Lc  |
| <b>⊞</b> ⊡[loop      | 0                                      |                            |                                                                          | 19.839s 3.1%                                                           | Vectorized (Body; Remainder)                             | AVX512               | 72%                       | 11.48x                             | 16; 8 |                              | Float32;   | 256/51 AVX2; AVX512F_512 Masked Lc             |

Intel Advisor shows the performance optimization problem and advice for the selected loop (efficiency 72%, speed-up 11.5x, vector length 16):

- Issue: Inefficient memory access patterns present. To confirm: run a Memory Access Patterns analysis.
- Issue: Ineffective peeled/remainder loop(s) present. All or some source loop iterations are not executing in the loop body. Improve performance by moving source loop iterations from peeled/remainder loops to the loop body.
  - Recommendation: Collect trip counts data, which might generate more precise recommendations. To fix: run a Trip Counts analysis.
  - Recommendation: Add data padding. The trip count is not a multiple of vector length. One fix: increase the size of objects and add iterations so the trip count is a multiple of vector length.

Program metrics: Elapsed Time: 14…; Vector Instruction Set: …, AVX512, SSE, SSE2; Number of CPU Threads: 4

#### Loop metrics

| Total CPU time                     | 454.08s | 100.0% |
|------------------------------------|---------|--------|
| Time in <b>88</b> vectorized loops | 41.86s  | 9.2%   |
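The efficiency, speed-up, and vector length figures in the survey rows above look related: 11.48x at vector length 16 gives about 72%, and 2.24x at vector length 8 gives 28%. The sketch below assumes efficiency is simply speed-up divided by vector length, capped at 100%. This is a reading of the screenshot values, not a documented formula. The function names are hypothetical.

```c
#include <assert.h>

/* Assumed relation: efficiency (%) = 100 * speed-up / vector length,
   capped at 100. Inferred from the survey rows, not from documentation. */
double vector_efficiency_pct(double gain, int vector_length)
{
    assert(vector_length > 0);
    double pct = 100.0 * gain / vector_length;
    return pct > 100.0 ? 100.0 : pct;
}

/* Share of the total CPU time spent in one part, in percent. */
double time_share_pct(double part_seconds, double total_seconds)
{
    assert(total_seconds > 0.0);
    return 100.0 * part_seconds / total_seconds;
}
```

With the loop metrics above, `time_share_pct(41.86, 454.08)` comes out at roughly 9.2%, matching the share reported for the 88 vectorized loops.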

- Windows\* OS: /Qopt-assume-safe-padding
- Linux\* OS: -qopt-assume-safe-padding
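The data padding recommendation (make the trip count a multiple of the vector length) can be sketched in plain C. This is a minimal illustration under stated assumptions: the helper names `padded_length` and `scale_padded` are hypothetical, and the caller is assumed to have allocated and initialized the padded elements itself.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of vl (vector length in elements). */
size_t padded_length(size_t n, size_t vl)
{
    assert(vl > 0);
    return ((n + vl - 1) / vl) * vl;
}

/* Scale an array in place, iterating over the padded length so the
   trip count is a multiple of vl. Assumption: the caller allocated
   padded_length(n, vl) elements and initialized the padding (for
   example to 0.0f), so the extra iterations are harmless. */
void scale_padded(float *a, size_t n, size_t vl, float factor)
{
    size_t n_padded = padded_length(n, vl);
    for (size_t i = 0; i < n_padded; ++i)
        a[i] *= factor;
}
```

Whether the extra iterations actually remove the remainder loop depends on the compiler and on the trip count being known to it. Rerunning the Advisor survey is the way to check.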



## Start Tuning for AVX-512 without AVX-512 hardware

Intel<sup>®</sup> Advisor - Vectorization Advisor

Use the -axCOMMON-AVX512 -xAVX compiler flags to generate both code paths:

- AVX(2) code path (executed on Haswell and earlier processors)
- AVX-512 code path for newer hardware
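As a small sketch, a kernel like the one below could be built with both code paths. The function name `saxpy`, the file name, and the exact command line in the comment are assumptions for illustration. Only the two flags themselves come from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: y[i] += a * x[i].
   Assumed build line (file name and extra options are illustrative):
     icc -std=c99 -O2 -axCOMMON-AVX512 -xAVX -c saxpy.c
   The restrict qualifiers tell the compiler that x and y do not
   overlap, which usually helps it vectorize the loop. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

On a machine without AVX-512, only the baseline path executes. The AVX-512 variants should then show up in the survey as not executed, as in the report on this slide.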

Compare AVX and AVX-512 code with Intel Advisor

| Loops | Self Time | Loop Type | Vector ISA | Efficiency | Gain | VL | Compiler Est. Gain | Traits | Data Types | Vector Widths | Instruction Sets |
|-------|-----------|-----------|------------|------------|------|----|--------------------|--------|------------|---------------|------------------|
| [loop in s352_ at loopstl.cpp:5939] | 0,641s | Vectorized (Body) | AVX2 | ~54% | 2,15x | 4 | 2,15x | FMA; Inserts | Float32 | 128 | AVX; FMA |
| [loop in s352_ at loopstl.cpp:5939] | n/a | Remainder [Not Executed] | | | | 4 | | FMA | | | |
| [loop in s352_ at loopstl.cpp:5939] | 0,641s | Vectorized (Body) | AVX2 | | | 4 | 2,15x | Inserts; FMA | | | |
| [loop in s352_ at loopstl.cpp:5939] | n/a | Vectorized (Body) [Not Executed] | AVX512 | | | 16 | 3,20x | Gathers; FMA | | | |
| [loop in s352_ at loopstl.cpp:5939] | n/a | Vectorized (Remainder) [Not Executed] | AVX512 | | | 16 | 2,70x | Gathers; FMA | | | |
| [loop in s125A\$omp\$parallel_for@…] | 0,496s | Vectorized Versions | AVX2 | ~100% | 13,54x | 8 | <13,54x | FMA; NT-stores | | | |
| [loop in s125A\$omp\$parallel_for…] | n/a | Peeled [Not Executed] | | | | 8 | | FMA | | | |
| [loop in s125A\$omp\$parallel_for…] | n/a | Remainder [Not Executed] | | | | 8 | | FMA | | | |
| [loop in s125A\$omp\$parallel_for…] | 0,465s | Vectorized (Body) | AVX2 | | | 8 | 13,54x | | | | |
| [loop in s125Z\$omp\$parallel_for…] | n/a | Vectorized (Peeled) [Not Executed] | AVX512 | | | 16 | 6,77x | FMA | | | |
| [loop in s125Z\$omp\$parallel_for…] | n/a | Vectorized (Body) [Not Executed] | AVX512 | | | 32 | 30,61x | NT | | | |
| [loop in s125Z\$omp\$parallel_for…] | n/a | Vectorized (Remainder) [Not Executed] | AVX512 | | | 16 | 9,78x | FMA | | | |

Callouts on the screenshot:

- Inserts (AVX2) vs. Gathers (AVX-512)
- Speed-up estimate: 13.5x (AVX2) vs. 30.6x (AVX-512)




## **Precise Repeatable FLOPS Metrics**

Intel<sup>®</sup> Advisor – Vectorization Optimization

- FLOPS by loop and function
- All recent Intel processors (not co-processors)

- Instrumentation (count FLOP) plus sampling (time with low overhead)
- Adjusted for masking with AVX-512 processors

Screenshot: Intel Advisor 2017, FLOPS columns

| Function Call Sites and Loops       | GFLOPS | AI     | L1 GB/s | GFLOP  | FLOP Per Iteration | L1 GB   | L1 Bytes Per Iteration |
|-------------------------------------|--------|--------|---------|--------|--------------------|---------|------------------------|
| [loop in matvec at Multiply.c:69]   | 0.8260 | 0.1633 | 5.0586  | 3.0720 | 32                 | 18.8160 | 196                    |
| [loop in matvec at Multiply.c:60]   | 0.912  | 0.1633 | 5.5853  | 3.0720 | 32                 | 18.8160 | 196                    |
| [loop in matvec at Multiply.c:69]   | 1.248  | 0.2500 | 4.9920  | 1.3440 | 4                  | 5.3760  | 16                     |
| [loop in matvec at Multiply.c:60]   | 1.592  | 0.2500 | 6.3699  | 1.3440 | 4                  | 5.3760  | 16                     |
|                                     | 3.055  | 0.2500 | 12.2205 | 0.0960 | 16                 | 0.3840  | 64                     |
| [loop in matvec at Multiply.c:60]   | 6.282  | 0.2500 | 25.1279 | 0.0960 | 16                 | 0.3840  | 64                     |
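The AI column in the table appears to be the FLOP count divided by the L1 bytes moved: 32 FLOP over 196 bytes per iteration gives about 0.1633, and 4 over 16 gives 0.25. A minimal sketch of that arithmetic follows. The function names are hypothetical, and the definition of AI as FLOP per L1 byte is inferred from the table values.

```c
#include <assert.h>

/* Arithmetic intensity: floating-point operations per byte moved.
   Assumption: bytes are counted at the L1 level, as in the table. */
double arithmetic_intensity(double flop, double bytes)
{
    assert(bytes > 0.0);
    return flop / bytes;
}

/* Throughput in GFLOPS from a FLOP count and an elapsed time. */
double gflops(double flop, double seconds)
{
    assert(seconds > 0.0);
    return flop / seconds / 1e9;
}
```

The same ratio holds between the rate columns: GFLOPS divided by L1 GB/s reproduces the AI value in each row, for example 1.248 / 4.9920 = 0.25.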




## Enhanced Memory Access Analysis

Are you bandwidth or compute limited?

- Measure footprint
  - Compare to cache size: does it fit in cache?
- Variable references
  - Map data to variable names for easier analysis
- Gather/scatter
  - Detect unneeded gather/scatters that reduce performance

| Site Location                        | Loop-Carried Dependencies | Strides Distribution | Access Pattern | Max. Site Footprint |
|--------------------------------------|---------------------------|----------------------|----------------|---------------------|
| [loop in s4117_ at loopstl.cpp:76…   | No information available  | 50% / 50% / 0%       | Mixed strides  | 192B                |
| [loop in s442_ at loopstl.cpp:6815]  | No information available  | 56% / 0% / 44%       | Mixed strides  | 256B                |
| [loop in s272_ at loopstl.cpp:3447]  | No information available  | 60% / 0% / 40%       | Mixed strides  | 320B                |

Detail row from the access pattern view:

| ID | Stride | Type          | Source           | Nested Function | Variable references | Access Footprint |
|----|--------|---------------|------------------|-----------------|---------------------|------------------|
| P2 |        | Gather stride | loopstl.cpp:3450 |                 | a, c, d             | 20B              |

Source excerpt shown with the row: `if (e_[i_] >= *t) { a_[i_] += c_[i_] …`
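To illustrate what an unneeded gather can look like, the sketch below contrasts an indexed access with a unit-stride one. It is a generic example, not the code from the screenshot, and all function names are hypothetical. When the index array turns out to be the identity mapping, the indirection adds nothing and the contiguous version can be used instead.

```c
#include <assert.h>
#include <stddef.h>

/* Indirect access: x[idx[i]] typically needs gather instructions
   when the loop is vectorized. */
float sum_indexed(const float *x, const int *idx, size_t n)
{
    assert(x != NULL && idx != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += x[idx[i]];
    return sum;
}

/* Unit-stride access: plain contiguous vector loads suffice. */
float sum_contiguous(const float *x, size_t n)
{
    assert(x != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}

/* Returns 1 if idx is the identity mapping 0..n-1, in which case
   the gather in sum_indexed is unneeded. */
int index_is_identity(const int *idx, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        if (idx[i] != (int)i)
            return 0;
    return 1;
}
```

A Memory Access Patterns analysis reports the strides it actually observed at run time, which is how such a case would be spotted without reading the code.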







# WHICH TOOL SHOULD I BE USING?

## **Optimizing Performance On Parallel Hardware**

It's an iterative process...





## Performance Analysis Tools for Diagnosis

Intel<sup>®</sup> Parallel Studio XE






# Tools for High Performance Implementation




