# Forming Architectural Performance Expectations

In [1]:
!mkdir -p tmp

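This notebook demonstrates the [Intel Architecture Code Analyzer](https://software.intel.com/en-us/articles/intel-architecture-code-analyzer/) (IACA), a static analysis tool released by Intel. It estimates the throughput of a marked block of machine code on a chosen microarchitecture and reports how heavily each execution port is used.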

There is also an open-source clone called [OSACA](https://github.com/RRZE-HPC/osaca), developed at RRZE (Erlangen-Nuernberg).
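As a minimal sketch of how a block is marked for analysis: IACA looks at the machine code between the `IACA_START` and `IACA_END` macros from `iacaMarks.h`. For a loop, to my understanding of Intel's user guide, `IACA_START` goes at the top of the loop body and `IACA_END` directly after the loop. The `saxpy` kernel below is a hypothetical example, not part of the original notebook, and the empty fallback macros are an assumption so that the file still compiles where the header is not installed.

```c
#include <assert.h>
#include <stddef.h>

/* Use the real markers when iacaMarks.h is on the include path.
   Otherwise fall back to empty macros (an assumption made here so
   that the sketch builds without IACA installed). */
#if defined(__has_include)
#  if __has_include(<iacaMarks.h>)
#    include <iacaMarks.h>
#  endif
#endif
#ifndef IACA_START
#  define IACA_START
#  define IACA_END
#endif

/* Hypothetical kernel for illustration: y[i] += a * x[i]. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        IACA_START              /* top of the loop body */
        y[i] += a * x[i];
    }
    IACA_END                    /* directly after the loop */
}
```

Compiled to an object file the same way as the kernel in this notebook (`gcc -c -march=haswell` with the IACA include directory on the `-I` path), the result can be handed to `iaca.sh`. With the fallback macros the function is an ordinary loop and can simply be called.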

In [2]:
%%writefile tmp/transpose.c

#include <stddef.h>     // size_t
#include <x86intrin.h>
#include <iacaMarks.h>

// 8x8 transpose kernel stolen from
// https://github.com/springer13/hptt/blob/e1017ef8b8ed0b6f3bb3b70df825a87f94c643e8/src/transpose.cpp#L137

void execute(const float* __restrict__ A, const size_t lda, float* __restrict__ B, const size_t ldb, const float alpha, const float beta)
{
    IACA_START
    
    
   __m256 reg_alpha = _mm256_set1_ps(alpha); // broadcast alpha to all 8 lanes (unused in this kernel)
   __m256 reg_beta = _mm256_set1_ps(beta);   // broadcast beta to all 8 lanes (unused in this kernel)
   //Load A
   __m256 rowA0 = _mm256_loadu_ps((A +0*lda));
   __m256 rowA1 = _mm256_loadu_ps((A +1*lda));
   __m256 rowA2 = _mm256_loadu_ps((A +2*lda));
   __m256 rowA3 = _mm256_loadu_ps((A +3*lda));
   __m256 rowA4 = _mm256_loadu_ps((A +4*lda));
   __m256 rowA5 = _mm256_loadu_ps((A +5*lda));
   __m256 rowA6 = _mm256_loadu_ps((A +6*lda));
   __m256 rowA7 = _mm256_loadu_ps((A +7*lda));

   //8x8 transpose micro kernel
   __m256 r121, r139, r120, r138, r71, r89, r70, r88, r11, r1, r55, r29, r10, r0, r54, r28;
   r28 = _mm256_unpacklo_ps( rowA4, rowA5 );
   r54 = _mm256_unpacklo_ps( rowA6, rowA7 );
   r0 = _mm256_unpacklo_ps( rowA0, rowA1 );
   r10 = _mm256_unpacklo_ps( rowA2, rowA3 );
   r29 = _mm256_unpackhi_ps( rowA4, rowA5 );
   r55 = _mm256_unpackhi_ps( rowA6, rowA7 );
   r1 = _mm256_unpackhi_ps( rowA0, rowA1 );
   r11 = _mm256_unpackhi_ps( rowA2, rowA3 );
   r88 = _mm256_shuffle_ps( r28, r54, 0x44 );
   r70 = _mm256_shuffle_ps( r0, r10, 0x44 );
   r89 = _mm256_shuffle_ps( r28, r54, 0xee );
   r71 = _mm256_shuffle_ps( r0, r10, 0xee );
   r138 = _mm256_shuffle_ps( r29, r55, 0x44 );
   r120 = _mm256_shuffle_ps( r1, r11, 0x44 );
   r139 = _mm256_shuffle_ps( r29, r55, 0xee );
   r121 = _mm256_shuffle_ps( r1, r11, 0xee );
   rowA0 = _mm256_permute2f128_ps( r88, r70, 0x2 );
   rowA1 = _mm256_permute2f128_ps( r89, r71, 0x2 );
   rowA2 = _mm256_permute2f128_ps( r138, r120, 0x2 );
   rowA3 = _mm256_permute2f128_ps( r139, r121, 0x2 );
   rowA4 = _mm256_permute2f128_ps( r88, r70, 0x13 );
   rowA5 = _mm256_permute2f128_ps( r89, r71, 0x13 );
   rowA6 = _mm256_permute2f128_ps( r138, r120, 0x13 );
   rowA7 = _mm256_permute2f128_ps( r139, r121, 0x13 );

  _mm256_storeu_ps((B + 0 * ldb), rowA0);
  _mm256_storeu_ps((B + 1 * ldb), rowA1);
  _mm256_storeu_ps((B + 2 * ldb), rowA2);
  _mm256_storeu_ps((B + 3 * ldb), rowA3);
  _mm256_storeu_ps((B + 4 * ldb), rowA4);
  _mm256_storeu_ps((B + 5 * ldb), rowA5);
  _mm256_storeu_ps((B + 6 * ldb), rowA6);
  _mm256_storeu_ps((B + 7 * ldb), rowA7);
    
  IACA_END
}


Overwriting tmp/transpose.c


In [3]:
!(cd tmp; gcc -c -march=haswell -I$HOME/pack/iaca-lin64/include transpose.c)
!~/pack/iaca-lin64/bin/iaca.sh -64 tmp/transpose.o

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - tmp/transpose.o
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 107.00 Cycles       Throughput Bottleneck: PORT2_AGU, PORT3_AGU

Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -   D   |   3   -   D   |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles | 10.6     0.0  | 13.6  | 107.0   69.0  | 107.0   69.0  | 84.0  | 24.0  | 13.7  |  8.0  |
--------------------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusio