

#### Goals

#### Review

- asymptotic optimality
- pebble game
- mired nous

# Alignment

Alignment describes the process of matching the base address of:

- Single word: double, float
- SIMD vector
- Larger structure

To machine granularities:

- Word size
- SIMD vector size
- Cache line
- Memory page
- DRAM access
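As a minimal sketch of what "matching the base address" means in practice, the C fragment below tests whether a pointer is a multiple of a given granularity and allocates a buffer on a cache-line boundary. The function names `is_aligned` and `alloc_cache_aligned` are hypothetical, and the 64-byte cache line is an assumption that holds on many, but not all, machines.

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if ptr is a multiple of `alignment` bytes.
 * `alignment` must be a power of two. */
static int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Allocate n doubles starting at a 64-byte boundary.
 * 64 bytes is an ASSUMED cache line size, not a universal constant. */
static double *alloc_cache_aligned(size_t n)
{
    size_t bytes = n * sizeof(double);
    /* C11 aligned_alloc expects the size to be a multiple of the alignment,
     * so round the request up to the next multiple of 64. */
    size_t padded = (bytes + 63) / 64 * 64;
    return aligned_alloc(64, padded);
}
```

A buffer obtained from `alloc_cache_aligned` is released with `free`. Offsetting its base by one double (8 bytes) yields an address that is still word-aligned but no longer cache-line-aligned.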

Q: What is the performance impact of misalignment?

#### Performance Impact of Misalignment



#### SIMD: Basic Idea

What's the basic idea behind SIMD?



Typically characterized by width of data path:

- SSE: 128 bit (4 floats, 2 doubles)
- AVX-2: 256 bit (8 floats, 4 doubles)
- AVX-512: 512 bit (16 floats, 8 doubles)
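The lane counts in the list above follow from dividing the register width by the element size. The sketch below makes that arithmetic explicit and models a vector add in scalar C, where one "vector operation" handles a whole group of lanes at once. Both function names are hypothetical, and the model only illustrates the counting; it does not emit actual SIMD instructions.

```c
#include <stddef.h>

/* Number of elements of elem_bytes bytes that fit into a vector
 * register that is width_bits wide. */
static size_t simd_lanes(size_t width_bits, size_t elem_bytes)
{
    return width_bits / (8 * elem_bytes);
}

/* Scalar MODEL of c = a + b on a machine with `lanes` lanes:
 * each outer iteration stands for one vector instruction that applies
 * the same operation to up to `lanes` elements.
 * Returns the number of vector operations; requires lanes > 0. */
static size_t vector_add_model(const float *a, const float *b, float *c,
                               size_t n, size_t lanes)
{
    size_t ops = 0;
    for (size_t i = 0; i < n; i += lanes) {
        size_t end = (i + lanes < n) ? i + lanes : n;
        for (size_t j = i; j < end; ++j)
            c[j] = a[j] + b[j];
        ++ops;
    }
    return ops;
}
```

With `sizeof(float) == 4`, `simd_lanes(256, sizeof(float))` gives 8, matching the AVX-2 entry. Adding 20 floats with 8 lanes takes 3 modeled vector operations, the last of which is only partially filled.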

### SIMD: Architectural Issues




Realization of inter-lane comm. in SIMD? Find instructions.

Name tricky/inefficient aspects in terms of expressing SIMD:



x86 SIMD suffixes: What does the "ps" suffix mean? "sd"?



### SIMD: Transposes

Why are transposes important? Where do they occur?



Example implementation aspects:

- HPTT: [Springer et al. '17]
- GitHub: springer13/hptt, 8x8 transpose microkernel
- Q: Why 8x8?
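To make the role of a transpose microkernel concrete, here is a plain-C sketch of a blocked out-of-place transpose that delegates each 8x8 tile to a small kernel. This is not HPTT's code: the names `transpose_tile` and `transpose_blocked` are hypothetical, the kernel is scalar rather than SIMD, and the sketch assumes both matrix dimensions are multiples of the tile size.

```c
#include <stddef.h>

enum { TILE = 8 };  /* tile edge, chosen to mirror the 8x8 microkernel */

/* Transpose one TILE x TILE block.
 * src has row stride lds, dst has row stride ldd (both in elements). */
static void transpose_tile(const float *src, size_t lds,
                           float *dst, size_t ldd)
{
    for (size_t i = 0; i < TILE; ++i)
        for (size_t j = 0; j < TILE; ++j)
            dst[j * ldd + i] = src[i * lds + j];
}

/* Out-of-place transpose of the row-major rows x cols matrix a
 * into the row-major cols x rows matrix b.
 * ASSUMES rows and cols are multiples of TILE. */
static void transpose_blocked(const float *a, float *b,
                              size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i += TILE)
        for (size_t j = 0; j < cols; j += TILE)
            transpose_tile(&a[i * cols + j], cols,
                           &b[j * rows + i], rows);
}
```

In an optimized implementation, the body of `transpose_tile` is where SIMD loads, inter-lane shuffles, and stores would replace the scalar double loop.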

## Outline

#### Introduction

- Notes
- Notes (unfilled, with empty boxes)
- Notes (source code on Github)
- About This Class
- Why Bother with Parallel Computers?
- Lowest Accessible Abstraction: Assembly
- Architecture of an Execution Pipeline
- Architecture of a Memory System
- Shared-Memory Multiprocessors

Machine Abstractions

Performance: Expectation, Experiment, Observation

Performance-Oriented Languages and Abstractions

## Multiple Cores vs Bandwidth

Assume:

- memory latency of 100 ns
- peak DRAM bandwidth of 100 GB/s (per socket)

How many cache lines should be/are in flight at one time?

[McCalpin '18]
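The "should be" half of the question is an application of Little's law: data in flight equals latency times bandwidth. The sketch below works through the arithmetic for the numbers assumed above. The function names are hypothetical, and the 64-byte cache line used in the usage note is an assumption not stated on the slide.

```c
/* Bytes that must be in flight to sustain the given bandwidth at the
 * given latency (Little's law: concurrency = latency * throughput).
 * Since 1 GB/s = 1e9 bytes/s = 1 byte/ns, the units cancel directly. */
static double bytes_in_flight(double latency_ns, double bandwidth_gb_per_s)
{
    return latency_ns * bandwidth_gb_per_s;
}

/* The same quantity expressed in cache lines of line_bytes bytes each. */
static double cache_lines_in_flight(double latency_ns,
                                    double bandwidth_gb_per_s,
                                    double line_bytes)
{
    return bytes_in_flight(latency_ns, bandwidth_gb_per_s) / line_bytes;
}
```

For 100 ns and 100 GB/s, `bytes_in_flight` gives 10,000 bytes. With an assumed 64-byte line, `cache_lines_in_flight(100.0, 100.0, 64.0)` gives 156.25, so roughly 156 cache lines per socket need to be outstanding at any time to reach peak bandwidth.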

# Topology and NUMA



[SuperMicro Inc. '15]

Demo:

- Show lstopo on porter, from hwloc.
- lstopo on MI300

#### Placement and Pinning

Who decides on what core my code runs? How?

Who decides on what NUMA node memory is allocated?
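By default the operating system's scheduler picks the core, but a program can narrow that choice. As a minimal, Linux-specific sketch, the function below queries the set of CPUs the calling thread is allowed to run on and then pins the thread to the first one, using the glibc wrappers `sched_getaffinity` and `sched_setaffinity`. The function name `pin_to_first_allowed_cpu` is hypothetical, and picking the first allowed CPU is an arbitrary choice made for illustration.

```c
#define _GNU_SOURCE  /* needed for cpu_set_t and the CPU_* macros in glibc */
#include <sched.h>

/* Pin the calling thread to the first CPU in its current affinity mask.
 * Returns the CPU index on success, -1 on failure. Linux only. */
static int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    /* pid 0 means "the calling thread" */
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (sched_setaffinity(0, sizeof(one), &one) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;  /* empty mask: should not happen */
}
```

Starting from the currently allowed set, rather than hard-coding CPU 0, keeps the sketch usable inside containers or batch jobs that restrict the process to a subset of cores.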

Demo: intro/NUMA and Bandwidths

What is the main expense in NUMA?