

# Xeon and Xeon Phi



# Xeon (Broadwell)

### Intel "Tick-Tock" Roadmap – Part I

| Intel<sup>®</sup> Core<sup>™</sup><br>Microarchitecture |                                   | Microarchitecture<br>Codename "Nehalem" |                                          | 2<sup>nd</sup> Generation<br>Intel<sup>®</sup> Core<sup>™</sup><br>Microarchitecture |                                          |
|---------------------------------------------------------|-----------------------------------|-----------------------------------------|------------------------------------------|--------------------------------------------------------------------------------------|------------------------------------------|
| Merom                                                   | Penryn                            | Nehalem                                 | Westmere                                 | Sandy Bridge                                                                         | Ivy Bridge                               |
| NEW<br>Microarchitecture<br>65nm                        | NEW<br>Process Technology<br>45nm | NEW<br>Microarchitecture<br><b>45nm</b> | NEW<br>Process Technology<br><b>32nm</b> | NEW<br>Microarchitecture<br><b>32nm</b>                                              | NEW<br>Process Technology<br><b>22nm</b> |
| TOCK                                                    | TICK                              | TOCK                                    | TICK                                     | TOCK                                                                                 | TICK                                     |
| 2006<br>SSSE-3                                          | 2007<br>SSE4.1                    | 2008<br>SSE4.2                          | 2009<br>AES                              | 2011<br>AVX                                                                          | 2012<br>RDRAND<br>etc                    |



### Intel "Tick-Tock" Roadmap – Part II

Future release dates and features are subject to change without notice!



AVX-2

## Current Intel<sup>®</sup> Xeon Platform - Broadwell

### <u>Xeon</u>

Latest released – Broadwell (14nm process)

- Intel's Foundation of HPC Performance
- Up to 22 cores, Hyperthreading
- ~66 GB/s stream memory BW (4 ch. DDR4 2400)
- AVX2 256-bit (4 DP, 8 SP flops) -> >0.7 TFLOPS
- 40 PCIe lanes






### Intel<sup>®</sup> Xeon<sup>®</sup> Processors

| Feature                                 | Xeon E5-2600 v3<br>(Haswell-EP, 22nm)                             | Xeon E5-2600 v4<br>(Broadwell-EP, 14nm) |
|-----------------------------------------|-------------------------------------------------------------------|-----------------------------------------|
| Cores Per Socket                        | Up to 18                                                          | Up to 22                                |
| Threads Per Socket                      | Up to 36 threads                                                  | Up to 44 threads                        |
| Last-level Cache (LLC)                  | Up to 45 MB                                                       | Up to 55 MB                             |
| QPI Speed (GT/s)                        | 2x QPI 1.1 channels 6.4, 8.0, 9.6 GT/s                            | Same as v3                              |
| PCIe* Lanes/<br>Controllers/Speed(GT/s) | 40 / 10 / PCIe* 3.0 (2.5, 5, 8 GT/s)                              | Same as v3                              |
| Memory Population                       | 4 channels of up to 3<br>RDIMMs or 3 LRDIMMs                      | + 3DS LRDIMM <sup>&amp;</sup>           |
| Max Memory Speed                        | Up to 2133                                                        | Up to 2400                              |
| TDP (W)                                 | 160 (Workstation only), 145, 135, 120, 105, 90, 85, 65, 55        | Same as v3                              |

#### Core Single Thread IPC Performance





### **On-Chip Interconnect Architecture**




### Broadwell/Haswell Core Pipeline





### Haswell/Broadwell Buffer Sizes

### Extract more parallelism in every generation

|                       | Nehalem   | Sandy Bridge | Haswell |
|-----------------------|-----------|--------------|---------|
| Out-of-order Window   | 128       | 168          | 192     |
| In-flight Loads       | 48        | 64           | 72      |
| In-flight Stores      | 32        | 36           | 42      |
| Scheduler Entries     | 36        | 54           | 60      |
| Integer Register File | N/A       | 160          | 168     |
| FP Register File      | N/A       | 144          | 168     |
| Allocation Queue      | 28/thread | 28/thread    | 56      |



Intel<sup>®</sup> Microarchitecture (Haswell); Intel<sup>®</sup> Microarchitecture (Nehalem); Intel<sup>®</sup> Microarchitecture (Sandy Bridge)

### **Haswell and Broadwell Core Microarchitecture**



### Intel<sup>®</sup> Xeon<sup>®</sup> Processor E5 v4 Family: Core Improvements

#### Extract more parallelism in scheduling uops

- Reduced instruction latencies (ADC, CMOV, PCLMULQDQ)
- Larger out-of-order scheduler (60->64 entries)
- New instructions (ADCX/ADOX)

#### Improved performance on large data sets

- Larger L2 TLB (1K->1.5K entries)
- New L2 TLB for 1GB pages (16 entries)
- 2nd TLB page miss handler for parallel page walks

#### Improved address prediction for branches and returns

- Increased Branch Prediction Unit Target Array from 8 ways to 10

#### Floating Point Instruction performance improvements

- Faster vector floating point multiplier (5 to 3 cycles)
- 1024 Radix divider for reduced latency, increased throughput
- Split Scalar divides for increased parallelism/bandwidth
- Faster vector Gather

### Broadwell: What's new



All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel may make changes to specifications and product descriptions at any time, without notice

### Core Cache Size/Latency/Bandwidth

| Metric               | Nehalem                                            | Sandy Bridge                                      | Haswell                                           |
|----------------------|----------------------------------------------------|---------------------------------------------------|---------------------------------------------------|
| L1 Instruction Cache | 32K, 4-way                                         | 32K, 8-way                                        | 32K, 8-way                                        |
| L1 Data Cache        | 32K, 8-way                                         | 32K, 8-way                                        | 32K, 8-way                                        |
| Fastest Load-to-use  | 4 cycles                                           | 4 cycles                                          | 4 cycles                                          |
| Load bandwidth       | 16 Bytes/cycle                                     | 32 Bytes/cycle<br>(banked)                        | 64 Bytes/cycle                                    |
| Store bandwidth      | 16 Bytes/cycle                                     | 16 Bytes/cycle                                    | 32 Bytes/cycle                                    |
| L2 Unified Cache     | 256K, 8-way                                        | 256K, 8-way                                       | 256K, 8-way                                       |
| Fastest load-to-use  | 10 cycles                                          | 11 cycles                                         | 11 cycles                                         |
| Bandwidth to L1      | 32 Bytes/cycle                                     | 32 Bytes/cycle                                    | 64 Bytes/cycle                                    |
| L1 Instruction TLB   | 4K: 128, 4-way<br>2M/4M: 7/thread                  | 4K: 128, 4-way<br>2M/4M: 8/thread                 | 4K: 128, 4-way<br>2M/4M: 8/thread                 |
| L1 Data TLB          | 4K: 64, 4-way<br>2M/4M: 32, 4-way<br>1G: fractured | 4K: 64, 4-way<br>2M/4M: 32, 4-way<br>1G: 4, 4-way | 4K: 64, 4-way<br>2M/4M: 32, 4-way<br>1G: 4, 4-way |
| L2 Unified TLB       | 4K: 512, 4-way                                     | 4K: 512, 4-way                                    | 4K+2M shared: 1024,<br>8-way                      |



## New Instructions in Haswell/Broadwell

| Group                                | Sub-group                                                                                      | Description                                                                                                                                                          | Count * |
|--------------------------------------|------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|
| AVX-2                                | SIMD Integer Instructions promoted to 256 bits<br>Gather<br>Shuffling / Data Rearrangement     | Adding vector integer operations to 256-bit<br>Load elements using a vector of indices, vectorization enabler<br>Blend, element shift and permute instructions       | 170 / 124 |
| FMA                                  |                                                                                                | Fused Multiply-Add operation forms (FMA-3)                                                                                                                           | 96 / 60 |
| Bit Manipulation and<br>Cryptography |                                                                                                | Improving performance of bit stream manipulation and decode, large integer arithmetic and hashes                                                                     | 15 / 15 |
| TSX=RTM+HLE                          |                                                                                                | Transactional Memory                                                                                                                                                 | 4 / 4   |
| Others                               |                                                                                                | MOVBE: Load and Store of Big Endian forms<br>INVPCID: Invalidate processor context ID                                                                                |         |





# Xeon Phi (Knights Landing)

# **HIGH-LEVEL ARCHITECTURE & INSTRUCTION SET**

### Current Xeon Phi<sup>™</sup> Platform – Knights Landing



### <u>Xeon Phi</u>

Knights Landing (14nm process)

- Optimized for highly parallelized compute intensive workloads
- Common programming model & S/W tools with Xeon processors, enabling efficient app readiness and performance tuning
- Up to 72 cores, 490 GB/s stream BW, on-die 2D mesh
- AVX-512 512-bit (8 DP, 16 SP flops) -> >3 TFLOPS
- 36 PCIe lanes


### A Paradigm Shift




## Knights Landing (Host or PCIe)





**Groveport Platform** 

#### **Knights Landing Processors**

- Host Processor for Groveport Platform
- Solution for future clusters with both Xeon and Xeon Phi

#### **Knights Landing PCIe Coprocessors**

- Ingredient of Grantley & Purley Platforms
- Solution for general purpose servers and workstations



### PCIe Coprocessor vs. Host Processor





<sup>1</sup> Results based on internal Intel analysis using estimated power consumption and projected component pricing in the 2015 timeframe. This analysis is provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.


### **KNL Instruction Set**




### Intel<sup>®</sup> Software Development Emulator

- Freely available instruction emulator
  - http://www.intel.com/software/sde
- Emulates existing ISA as well as ISAs for upcoming processors
- Intercepts instructions with Pin; allows functional emulation of existing and upcoming ISAs (including AVX-512).
  - Execution times may be slow, but the result will be correct.
- Record dynamic instruction mix; useful for tuning/assessing vectorization content
- First step: compile for Knights Landing:
  - \$ icpc -xMIC-AVX512 <compiler args>



## Running SDE

- SDE invocation is very simple:
  - \$ sde <sde-opts> -- <binary> <command args>
- By default, SDE will execute the code with the CPUID of the host.
  - The code may run more slowly, but will be functionally equivalent to the target architecture.
  - For Knights Landing, you can specify the -knl option.
  - For Haswell, you can specify the -hsw option.




# **KNL MICROARCHITECTURE**

## **KNL Architecture Overview**

#### ISA

Intel® Xeon® Processor Binary-Compatible (w/Broadwell)

#### **On-package memory**

Up to 16GB, ~500 GB/s STREAM at launch

#### **Platform Memory**

Up to 384GB (6ch DDR4-2400 MHz)

#### **Fixed Bottlenecks**

- ✓ 2D Mesh Architecture
- ✓ Out-of-Order Cores
- ✓ 3x single-thread vs. KNC



- x4 DMI2 to PCH
- 36 Lanes PCIe\* Gen3 (x16, x16, x4)





### **KNL Mesh Interconnect**



#### **Mesh of Rings**

- Every row and column is a (half) ring
- YX routing: Go in Y $\rightarrow$ Turn $\rightarrow$ Go in X
- Messages arbitrate at injection and on turn

#### **Cache Coherent Interconnect**

- MESIF protocol (F = Forward)
- Distributed directory to filter snoops

#### **Three Cluster Modes**

(1) All-to-All (2) Quadrant (3) Sub-NUMA Clustering



### Cluster Mode: All-to-All



Address uniformly hashed across all distributed directories

No affinity between Tile, Directory and Memory

Lower performance mode, compared to other modes. Mainly for fall-back

#### Typical Read L2 miss

1. L2 miss encountered
2. Send request to the distributed directory
3. Miss in the directory. Forward to memory
4. Memory sends the data to the requestor

### **Cluster Mode: Quadrant**



Chip divided into four virtual Quadrants

Address hashed to a Directory in the same quadrant as the Memory

Affinity between the Directory and Memory

Lower latency and higher BW than all-to-all. SW Transparent.

1) L2 miss, 2) Directory access, 3) Memory access, 4) Data return


### Cluster Mode: Sub-NUMA Clustering (SNC)



1) L2 miss, 2) Directory access, 3) Memory access, 4) Data return

Each Quadrant (Cluster) exposed as a separate NUMA domain to OS.

Looks analogous to 4-Socket Xeon

Affinity between Tile, Directory and Memory

Local communication. Lowest latency of all modes.

SW needs to NUMA optimize to get benefit.

## KNL Core and VPU

- Out-of-order core w/ 4 SMT threads
- VPU tightly integrated with core pipeline
- 2-wide decode/rename/retire
- 2x 64B load & 1 64B store port for D\$
- L1 prefetcher and L2 prefetcher
- Fast unaligned and cache-line split support
- Fast gather/scatter support



## **KNL Hardware Threading**

4 threads per core SMT

**Resources dynamically partitioned**

- Re-order Buffer
- Rename buffers
- Reservation station

**Resources shared** 

- Caches
- TLB





### **KNL Memory Modes**

- Mode selected at boot
- MCDRAM-Cache covers all DDR



### Cache Model



Hybrid Model




### MCDRAM: Cache vs Flat Mode





# **GETTING PERFORMANCE ON KNIGHTS LANDING**

### **Efficiency on Knights Landing**

- First Knights Landing systems appearing by end of year
- How do we prepare for this new processor without it at hand?
- Let's review the main performance-enabling features:
  - Up to 72 cores
  - 2x VPU / core, AVX-512
  - High-bandwidth MCDRAM
- Plenty of parallelism needed for best performance.




### MPI needs help

- Many codes are already parallel (MPI)
  - May scale well, but...
  - What is single-node efficiency?
  - MPI isn't vectorising your code...
  - It has trouble scaling on large shared-memory chips.
    - Process overheads
    - Handling of IPC
    - Lack of aggregation off-die
- Threads are most effective for many cores on a chip
- Adopt a hybrid thread-MPI model for clusters of many-core




### OpenMP 4.x

- OpenMP helps express thread- and vector-level parallelism via directives
  - (like #pragma omp parallel, #pragma omp simd)
- Portable, and powerful
- Don't let simplicity fool you!
  - It doesn't make parallel programming easy
  - There is no silver bullet
- Developer still must expose parallelism & test performance



### Lessons from Previous Architectures

- Vectorization:
  - Avoid cache-line splits; align data structures to 64 bytes.
  - Avoid gathers/scatters; replace with shuffles/permutes for known sequences.
  - Avoid mixing SSE, AVX and AVX512 instructions.
- Threading:
  - Ensure that thread affinities are set.
  - Understand affinity and how it affects your application (i.e. which threads share data?).
  - Understand how threads share core resources.




### Data Locality: Nested Parallelism

- Recall that KNL cores are grouped into tiles, with two cores sharing an L2.
- Effective capacity depends on locality:
  - 2 cores sharing no data => 2 x 512 KB
  - 2 cores sharing all data => 1 x 1 MB
- Ensuring good locality (e.g. through blocking or nested parallelism) is likely to improve performance.

| 2 VPU | HUB    | 2 VPU |
|-------|--------|-------|
| Core  | 1MB L2 | Core  |

```
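/* Outer level: one thread per tile. */
#pragma omp parallel for num_threads(ntiles)
for (int i = 0; i < N; ++i)
{
    /* Inner level: 8 threads per tile (2 cores x 4 hardware threads).
       Nested parallelism must be enabled, e.g. OMP_NESTED=true. */
    #pragma omp parallel for num_threads(8)
    for (int j = 0; j < M; ++j)
    {
        ...
    }
}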
```



### Flat MCDRAM: SW Architecture

### MCDRAM exposed as a separate NUMA node





- Memory allocated in DDR by default
  - Keeps low bandwidth data out of MCDRAM.
- Apps explicitly allocate important data in MCDRAM
  - "Fast Malloc" functions: Built using NUMA allocation functions
  - "Fast Memory" Compiler Annotation: For use in Fortran.

### Flat MCDRAM using existing NUMA support in Legacy OS



### **Memory Allocation Code Snippets**

#### Allocate 1000 floats from DDR

float \*fv;

fv = (float \*)malloc(sizeof(float) \* 1000);

#### Allocate 1000 floats from MCDRAM

float \*fv;

fv = (float \*)hbw\_malloc(sizeof(float) \* 1000);

#### Allocate arrays from MCDRAM & DDR in Intel FORTRAN

```
C     Declare arrays to be dynamic
      REAL, ALLOCATABLE :: A(:), B(:), C(:)
!DIR$ ATTRIBUTES FASTMEM :: A
      NSIZE=1024
C
C     allocate array 'A' from MCDRAM
C
      ALLOCATE (A(1:NSIZE))
C
C     Allocate arrays that will come from DDR
C
      ALLOCATE (B(NSIZE), C(NSIZE))
```



### hbwmalloc – "Hello World!" Example

```
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <hbwmalloc.h>

int main(int argc, char **argv)
{
    const size_t size = 512;
    char *default_str = NULL;
    char *hbw_str = NULL;

    /* Allocate from standard (DDR) memory. */
    default_str = (char *)malloc(size);
    if (default_str == NULL) {
        perror("malloc()");
        fprintf(stderr, "Unable to allocate default string\n");
        return errno ? -errno : 1;
    }

    /* Allocate from high-bandwidth memory (MCDRAM). */
    hbw_str = (char *)hbw_malloc(size);
    if (hbw_str == NULL) {
        perror("hbw_malloc()");
        fprintf(stderr, "Unable to allocate hbw string\n");
        return errno ? -errno : 1;
    }

    sprintf(default_str, "Hello world from standard memory\n");
    sprintf(hbw_str, "Hello world from high bandwidth memory\n");

    fprintf(stdout, "%s", default_str);
    fprintf(stdout, "%s", hbw_str);

    /* Each pointer must be released by the matching free function. */
    hbw_free(hbw_str);
    free(default_str);

    return 0;
}
```

Fallback policy is controlled with hbw\_set\_policy:

- HBW\_POLICY\_BIND
- HBW\_POLICY\_PREFERRED
- HBW\_POLICY\_INTERLEAVE

Page sizes can be passed to hbw\_posix\_memalign\_psize:

- HBW\_PAGESIZE\_4KB
- HBW\_PAGESIZE\_2MB
- HBW\_PAGESIZE\_1GB


## Memory Modes

#### MCDRAM as Cache

- Upside:
  - No software modifications required.
  - Bandwidth benefit.
- Downside:
  - Latency hit to DDR.
  - Limited sustained bandwidth.
  - All memory is transferred DDR -> MCDRAM -> L2.
  - Less addressable memory.

#### Flat Mode

- Upside:
  - Maximum bandwidth and latency performance.
  - Maximum addressable memory.
  - Isolate MCDRAM for HPC application use only.
- Downside:
  - Software modifications required to use DDR and MCDRAM in the same application.
  - Which data structures should go where?
  - MCDRAM is a limited resource and tracking it adds complexity.

