Project 3: Encrypt and Hash

IMPORTANT INFO - PLEASE READ

The projects are part of CS110P course project worth 2 credit points. As such they run in parallel to the actual course. So be aware that the due date for project and homework might be very close to each other! Start early and do not procrastinate.

Introduction to Project 3

In this project, we hope you can use all knowledge about computer architecture that your have learned in this course to optimize a program with encryption and hashing. We'll provide the program code, and your job is to modify the code to improve its time performance. The following will first introduce you to the basics of encryption and hash.

Encryption

Encryption is a reversible process that converts plaintext into ciphertext using an algorithm and a cryptographic key. It ensures data confidentiality, allowing authorized parties with the correct key to decrypt and recover the original data. Examples include AES (symmetric) and RSA (asymmetric) encryption.

In this project, our input is a string of plaintext, which we first encrypt to get a string of ciphertext. Here we use a stream cipher called ChaCha to encrypt this plaintext.

ChaCha is a modern stream cipher designed by Daniel J. Bernstein in 2008. It is an evolution of the earlier Salsa20 cipher, optimized for improved performance and security. ChaCha operates by generating a pseudorandom keystream from a 256-bit secret key, a 96-bit nonce (number used once), and a counter. This keystream is then combined with plaintext via XOR to produce ciphertext.

The ChaCha stream cipher generates a pseudorandom keystream through iterative transformations of an initial 512-bit state matrix. This matrix is initialized with a 256-bit key, 96-bit nonce, 32-bit block counter, and predefined constants (e.g., “expand 32-byte k”). The state undergoes 20 rounds of mixing via the quarter-round function, which performs sequential additions, XORs, and bitwise rotations on grouped state words. After processing all rounds, the final state is combined with the initial matrix and serialized into a 512-bit keystream block. Each block is produced by incrementing the counter, ensuring uniqueness.

Hash

Hash (or hashing) is a one-way function that transforms input data into a fixed-length string of characters (digest). It is irreversible and deterministic, meaning the same input always produces the same hash. Hashes verify data integrity (e.g., file checksums) or securely store passwords (e.g., using SHA-256 or bcrypt). Unlike encryption, hashing does not involve keys and cannot restore the original data from the digest.

Here we use a hash tree to transform the previously obtained ciphertext into a fixed-length sequence.

A hash tree (also known as a Merkle tree) is a hierarchical data structure used to efficiently verify the integrity of large datasets. It works by recursively hashing pairs of data blocks until a single root hash, called the Merkle root, is generated. Each leaf node represents the hash of a data block, while non-leaf nodes are hashes of their combined child nodes. This structure allows quick and secure verification of specific data elements without requiring the entire dataset to be rechecked. Hash trees are widely used in blockchain systems (e.g., Bitcoin, Ethereum), file-sharing protocols, and version control systems to ensure data consistency and tamper resistance.

A Hash Tree

Here we only care about the root node of the Hash Tree and use it as the final output hash value.

Getting Started

Make sure you read through the entire specification before starting the project.

We provided the template in Github Classroom to get started.

Files

The framework contains the following files and folders:

Test Locally

Due to the large size of the input files, we will not provide the test data directly, but we need you to generate the relevant data locally.

First use make all to compile the program and tool, and then use ./tool . /testcases/test_0.meta to generate the required data files, then you can use ./program ./testcases/test_0.meta to test your program.

Optimization Techniques

You can find many obvious optimizations in the implementation we provide, based on what you have learned in Computer Architecture. We are listing some of the possible approaches below:

Compiler

There are some optimization flags that can be turned on in GCC. The available flags for x86 ISA from GCC are listed here. However, we hope that you can do most of the optimization yourself, instead of relying on the compiler. So in this project, you are not allowed to use GCC's optimisation flags .

Multithreading

The loop of the algorithm is a good candidate for multithreading. You can use OpenMP or pthread to parallelize.

SIMD instructions

SIMD processes multiple data in parallel. Part of this algorithm is also a good candidate for SIMD instructions. Some instruction about "srli", "slli", "or", "shuffle" may be helpful.

Loop unrolling

Loop unrolling can be used to reduce the overhead of the loop.

Cache Blocking

The main idea here is to make memory access more efficient and exploit memory locality.

Addition

Some students may want to try some SIMD instruction sets such as AVX512, but find that their computers do not support such instruction sets. In this case, you can choose to use Intel SDE to verify your correctness, or you can also use some commercial cloud environments. However, please note that autograder on gradescope does not support CPUID Flags like AVX512_BITALG, so if you want to use instruction with AVX512_BITALG CPUID Flags, you may need to implement additional logic in your code to determine if the environment supports this instruction.

Grading Policy

The total score for this project is 100:

  1. We will first run your code on small test cases on GradeScope (automatically). If your program does not produce the correct result, you will receive 0 points. As always, your code should not have any memory issues.
  2. After the deadline, we will test your code on our local machine and new test cases. Your grade on this part depends on the speedup of your code. If your code runs slower than adding the baseline (what we've provided), you will receive 0 points.
  3. If your code runs at the same speed with bare OpenMP, you will receive 50 points.
  4. If your code ranks among the top 30 in performance in the class, you will receive 100 points.
  5. For the rest, we will give a linear grade according to the speedup rate.
    • If your speedup is 3.0x and the bare OpenMP speedup is 5.0x, then your score will be (3.0 / 5.0) * 50 = 30.0 points.
    • If your speedup is 7.3x and the bare OpenMP speedup is 5.0x, while the 30th fastest code is 10.0x then your score will be 50 + ((7.3 - 5.0) / (10.0 - 5.0)) * 50 = 73.0 points.
  6. If your code crashes during our test, you will receive 0 points.
  7. Your submission must contain meaningful use of multithreading technique and Intel SIMD intrinsics. Otherwise, you will get 0 points. This check will be done manually after the deadline so there will be no feedback on this from GradeScope.

Submission

  • Submit on Gradescope by selecting your GitHub repository and the right branch.
  • Only the last active submission will be accounted for your project grade. Make sure it is your best version.
  • Due: 2025/05/29, 23:59
  • Parameters for Speed Test

    For the final test we will use 3 files with the same file size as the 3 tests provided and the total time will be used as the final time.

    Server Configurations

    GrapeScope Server (Used for correctness check):

    The GrapeScope server is used for correctness check only because it has limited hardware resources and unstable performance. However, we have provided you with a Leaderboard for rough speed referencing.

    Toast Lab Server (Used for final speed test):

    More about the final speed test

    The final test server in fact has two cpus that uses the NUMA architecture which is commonly used in modern servers. In the final test we will use numactl to run the program in a NUMA node. This is to avoid the impact of NUMA's memory arrangement on openmp. If you're interested, you can find out what numa is and why it has an effect on openmp. But we're not asking too much here, you can simply think of it as having only one CPU with 14 cores, as the same as the server configurations listed above.