First, read the code snippet below, which is a simple program that simulates heat conduction in a 2D plate.
for (k = 0; k < step; k++) {
    for (i = 1; i < N - 1; i++) {
        for (j = 1; j < N - 1; j++) {
            p_next[i * N + j] = (p[(i - 1) * N + j] + p[(i + 1) * N + j] +
                                 p[i * N + j + 1] + p[i * N + j - 1]) / 4.0;
        }
    }
    double *temp = p;
    p = p_next;
    p_next = temp;
}
Here, p and p_next are N x N grids stored as flat one-dimensional arrays; they hold the current and next state of the plate, respectively. The loop visits each interior cell and computes its next value from the four neighboring cells, and the two buffers are swapped after every iteration.
Make sure you read through the entire specification before starting the project.
Just like all other programming assignments/projects this semester, you will obtain your files from GitHub Classroom; the link is here.
The framework contains a set of starter files. You need to implement the function impl(int N, int step, double *p). When the function returns, the array pointed to by p should contain the state after step iterations from its initial values.
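For concreteness, here is a minimal, unoptimized sketch of how impl could meet that contract. The scratch buffer, the initial copy (to preserve the boundary values), and the final copy-back for an odd step count are our own assumptions about one possible structure, not something the framework requires:

#include <stdlib.h>
#include <string.h>

/* Unoptimized sketch: run `step` Jacobi sweeps over the N x N grid in p.
 * The scratch buffer and the copy-back for an odd step count are assumptions
 * about one way to satisfy "p holds the state after step iterations". */
void impl(int N, int step, double *p) {
    double *p_next = malloc((size_t)N * N * sizeof(double));
    memcpy(p_next, p, (size_t)N * N * sizeof(double));   /* keep the boundary values */

    double *cur = p, *next = p_next;
    for (int k = 0; k < step; k++) {
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                next[i * N + j] = (cur[(i - 1) * N + j] + cur[(i + 1) * N + j] +
                                   cur[i * N + j + 1] + cur[i * N + j - 1]) / 4.0;
        double *tmp = cur; cur = next; next = tmp;        /* swap buffers */
    }

    if (cur != p)                     /* odd step count: result is in the scratch buffer */
        memcpy(p, cur, (size_t)N * N * sizeof(double));
    free(p_next);
}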
However, to enable more kinds of optimization techniques, there is a special formula to calculate the error between your implementation and the baseline implementation. Please READ THE FOLLOWING SECTION CAREFULLY, or you may LOSE POINTS!
We will calculate the error between the state of your p matrix and the baseline according to the following rules:

- Assume the size of the matrix is N and the number of iterations is step.
- Let the result after step iterations using the baseline algorithm be p^{step}, and the result after step-1 iterations be p^{step-1}. Denote the value at a point of a matrix as p_{i,j}.
- For each value p_{i,j} in your p, the error is calculated as follows:

Assuming the baseline's values after step iterations are as shown in the figure below, then as long as your result is close enough to any of the ones shown in the figure below (only partial examples are displayed), your result will be considered valid.
Based on what you have learned in Computer Architecture, you can find many obvious optimization opportunities in the implementation we provide. Some possible approaches are listed below:
There are some optimization flags that can be turned on in GCC; the available flags for the x86 ISA are listed here. However, we hope that you do most of the optimization yourself instead of relying on the compiler, so in this project you are not allowed to modify the Makefile.
The outer loop of the stencil computation (the loop over i) is a good candidate for multithreading; the time-step loop itself cannot be parallelized directly, because each iteration depends on the result of the previous one. You can use OpenMP or pthread to parallelize it.
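As a minimal illustration (one possible approach, not the required one), a single sweep could be parallelized over rows with OpenMP like this, assuming the provided build already enables OpenMP:

/* One sweep with the row loop split across threads; each thread writes a
 * disjoint set of rows of `next`, so no synchronization is needed inside. */
void sweep_omp(int N, const double *cur, double *next) {
    #pragma omp parallel for schedule(static)
    for (int i = 1; i < N - 1; i++) {
        for (int j = 1; j < N - 1; j++) {
            next[i * N + j] = (cur[(i - 1) * N + j] + cur[(i + 1) * N + j] +
                               cur[i * N + j + 1] + cur[i * N + j - 1]) / 4.0;
        }
    }
}

Note that the buffer swap still happens once per time step, outside the parallel loop.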
SIMD processes multiple data in parallel. Part of this algorithm is also a good candidate for SIMD instructions.
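For example, the inner loop over j can be vectorized by hand with AVX intrinsics, as sketched below. This assumes the build enables AVX (the listed Xeon E5-2690 v4 supports AVX2); whether the provided Makefile already passes the corresponding flags is an assumption you should check. Letting the compiler or #pragma omp simd vectorize the loop is an alternative.

#include <immintrin.h>

/* One row of the sweep, processing four doubles at a time with AVX.
 * Multiplying by 0.25 is exact and equivalent to dividing by 4.0. */
static void sweep_row_avx(int N, const double *cur, double *next, int i) {
    const __m256d quarter = _mm256_set1_pd(0.25);
    int j = 1;
    for (; j + 4 <= N - 1; j += 4) {
        __m256d up    = _mm256_loadu_pd(&cur[(i - 1) * N + j]);
        __m256d down  = _mm256_loadu_pd(&cur[(i + 1) * N + j]);
        __m256d left  = _mm256_loadu_pd(&cur[i * N + j - 1]);
        __m256d right = _mm256_loadu_pd(&cur[i * N + j + 1]);
        __m256d sum   = _mm256_add_pd(_mm256_add_pd(up, down), _mm256_add_pd(left, right));
        _mm256_storeu_pd(&next[i * N + j], _mm256_mul_pd(sum, quarter));
    }
    for (; j < N - 1; j++)      /* scalar tail for the remaining columns */
        next[i * N + j] = (cur[(i - 1) * N + j] + cur[(i + 1) * N + j] +
                           cur[i * N + j + 1] + cur[i * N + j - 1]) / 4.0;
}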
Loop unrolling can be used to reduce the overhead of the loop.
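Here is a sketch of unrolling the inner loop by a factor of two; the profitable unroll factor, if any, is something you would have to measure, and the compiler may already do this for you:

/* One sweep with the inner loop unrolled by 2 to reduce loop overhead;
 * a scalar remainder loop handles an odd interior width. */
void sweep_unrolled(int N, const double *cur, double *next) {
    for (int i = 1; i < N - 1; i++) {
        int j = 1;
        for (; j + 2 <= N - 1; j += 2) {
            next[i * N + j]     = (cur[(i - 1) * N + j]     + cur[(i + 1) * N + j] +
                                   cur[i * N + j + 1]       + cur[i * N + j - 1]) / 4.0;
            next[i * N + j + 1] = (cur[(i - 1) * N + j + 1] + cur[(i + 1) * N + j + 1] +
                                   cur[i * N + j + 2]       + cur[i * N + j]) / 4.0;
        }
        for (; j < N - 1; j++)  /* remainder column, if any */
            next[i * N + j] = (cur[(i - 1) * N + j] + cur[(i + 1) * N + j] +
                               cur[i * N + j + 1] + cur[i * N + j - 1]) / 4.0;
    }
}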
The main idea here is to make memory access more efficient and exploit memory locality, for example through cache blocking (tiling); a sketch is shown below.
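The following sketch tiles one sweep into blocks so that the rows a tile touches stay in cache while it is computed. The block sizes are arbitrary tuning parameters, and whether tiling actually pays off for N=2000 on the test machine is something you would need to measure.

#define BI 64     /* block height: a tuning parameter, chosen here arbitrarily */
#define BJ 256    /* block width:  a tuning parameter, chosen here arbitrarily */

/* One sweep processed in BI x BJ tiles; the results are identical to the
 * unblocked sweep, only the traversal order changes. */
void sweep_blocked(int N, const double *cur, double *next) {
    for (int ii = 1; ii < N - 1; ii += BI) {
        for (int jj = 1; jj < N - 1; jj += BJ) {
            int i_end = (ii + BI < N - 1) ? ii + BI : N - 1;
            int j_end = (jj + BJ < N - 1) ? jj + BJ : N - 1;
            for (int i = ii; i < i_end; i++) {
                for (int j = jj; j < j_end; j++) {
                    next[i * N + j] = (cur[(i - 1) * N + j] + cur[(i + 1) * N + j] +
                                       cur[i * N + j + 1] + cur[i * N + j - 1]) / 4.0;
                }
            }
        }
    }
}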
The total score for this project is 100:

- If your speedup matches the bare OpenMP speedup, you will receive 50 points.
- If your speedup is lower than the bare OpenMP speedup, you receive a proportional share of those 50 points. For example, if your speedup is 3.0 and the bare OpenMP speedup is 5.0, then your score will be (3.0 / 5.0) * 50 = 30.0 points.
- If your speedup is higher, the remaining 50 points are scaled between the bare OpenMP speedup and the speedup of the 30th fastest submission. For example, if your speedup is 7.3 and the bare OpenMP speedup is 5.0, while the 30th fastest code reaches 10.0, then your score will be 50 + ((7.3 - 5.0) / (10.0 - 5.0)) * 50 = 73.0 points.
For the final speed test, we will use a map with N=2000.
The GrapeScope server is used for correctness checks only, because it has limited hardware resources and unstable performance.
CPU: Intel Xeon E5-2690 v4 @ 2.6 GHz, 14 cores (28 threads). Details here.
The final test server actually has two CPUs and uses the NUMA architecture that is common in modern servers. In the final test we will use numactl to run your program on a single NUMA node, so that NUMA's memory placement does not affect OpenMP performance. If you are interested, you can look into what NUMA is and why it affects OpenMP, but we are not asking for that here: you can simply think of the machine as having only one CPU with 14 cores, the same as the server configuration listed above.