SIMD (Single Instruction, Multiple Data) makes a program faster by executing the same instruction on multiple data elements at the same time.
In this lab, we will use Intel intrinsics to implement some simple programs.
Part 1: Vector addition
In this part, you will implement a vector addition program using SIMD. Please "translate" naive_add() to simd_add().
You may use the following intrinsics; look them up in the Intel Intrinsics Guide (a sketch of how they fit together follows the list):
_mm_loadu_si128
_mm_storeu_si128
_mm_add_epi32
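
To show the shape of the exercise, here is a minimal sketch of simd_add(). The signature and the assumption that the length is a multiple of 4 are mine, not the lab skeleton's; adapt it to the actual naive_add() and handle any leftover elements with a scalar tail loop.

#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Sketch only: assumes this signature and that n is a multiple of 4. */
void simd_add(int *c, const int *a, const int *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i)); /* load 4 ints from a */
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i)); /* load 4 ints from b */
        __m128i vc = _mm_add_epi32(va, vb);                     /* 4 additions in one instruction */
        _mm_storeu_si128((__m128i *)(c + i), vc);               /* store 4 results */
    }
}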
Try to tell the difference between the following "load" intrinsics (a hint sketch follows the list):
_mm_load_si128
_mm_loadu_si128
_mm_load_pd
_mm_load1_pd
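
As a hint (this summary is mine, not the handout's): the trailing u means the address does not have to be 16-byte aligned, si128 loads a whole 128-bit integer vector, pd loads two packed doubles, and load1 broadcasts a single double into both lanes. A small illustration, assuming GCC/Clang alignment attributes:

#include <emmintrin.h>

__attribute__((aligned(16))) int    vi[4] = {1, 2, 3, 4};
__attribute__((aligned(16))) double vd[2] = {1.0, 2.0};

void load_demo(void) {
    __m128i a = _mm_load_si128((const __m128i *)vi);  /* 128-bit int load; address must be 16-byte aligned */
    __m128i b = _mm_loadu_si128((const __m128i *)vi); /* same load, but any address is allowed */
    __m128d c = _mm_load_pd(vd);                      /* two doubles; aligned address required */
    __m128d d = _mm_load1_pd(vd);                     /* vd[0] broadcast into both lanes */
    (void)a; (void)b; (void)c; (void)d;
}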
Part 2: Matrix multiplication
In this part, you will implement a matrix multiplication program using SIMD. Please "translate" naive_matmul() to simd_matmul().
You may use the following intrinsics (see the sketch after this list):
_mm_setzero_ps
_mm_set1_ps
_mm_loadu_ps
_mm_add_ps
_mm_mul_ps
_mm_storeu_ps
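
One common way these intrinsics fit together is the broadcast-and-accumulate pattern below. This is a sketch under my own assumptions (row-major n-by-n float matrices, n a multiple of 4, this signature), not the skeleton's exact interface:

#include <xmmintrin.h>  /* SSE float intrinsics */

/* Sketch only: row-major a, b, c of size n*n; n assumed a multiple of 4. */
void simd_matmul(int n, const float *a, const float *b, float *c) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j += 4) {
            __m128 sum = _mm_setzero_ps();                  /* accumulator for c[i][j..j+3] */
            for (int k = 0; k < n; k++) {
                __m128 va = _mm_set1_ps(a[i * n + k]);      /* broadcast a[i][k] to all 4 lanes */
                __m128 vb = _mm_loadu_ps(&b[k * n + j]);    /* load b[k][j..j+3] */
                sum = _mm_add_ps(sum, _mm_mul_ps(va, vb));  /* multiply and accumulate */
            }
            _mm_storeu_ps(&c[i * n + j], sum);              /* write 4 results at once */
        }
    }
}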
Explain why SIMD makes the program faster.
Part 3: Loop unrolling
Read the Wikipedia article on loop unrolling and try to understand the concept:
https://en.wikipedia.org/wiki/Loop_unrolling
Implement loop_unroll_matmul() and loop_unroll_simd_matmul(), and explain the performance boost they bring (a small illustration follows).
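
To illustrate the idea (a sketch with an assumed signature, and assuming n is a multiple of 4): the unrolled loop executes fewer branch and counter-update instructions per element, and its independent statements give the CPU more work to schedule in parallel.

/* Sketch only: scalar addition unrolled by a factor of 4. */
void unrolled_add(int *c, const int *a, const int *b, int n) {
    for (int i = 0; i < n; i += 4) {   /* one branch and counter update per 4 elements */
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
}

loop_unroll_matmul() and loop_unroll_simd_matmul() apply the same transformation to the matrix-multiplication loops.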
Part 4: Compiler optimization
Run make test and explain why -O3 makes the program much faster.
For the checkup: paste the code below into godbolt.org, compile it with a RISC-V compiler, and describe the difference between the -O0 and -O3 output.
int a = 0;

void modify(int j) {
    a += j;
}

int main() {
    for (int i = 0; i < 1000; i++) {
        a += 1;
    }
    for (int i = 0; i < 1000; i++) {
        a += i;
    }
    return a;
}
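
As a hint of what to look for (my expectation, not a guaranteed output): at -O0 the generated RISC-V assembly typically contains two real loops that load, increment, and store a on every iteration, while at -O3 the compiler can usually fold each loop into a constant (the first adds 1000, the second adds 0 + 1 + ... + 999 = 499500) and reduce main to just a few instructions.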