Project 1.1: A RISC-V Assembler (Individual Project)

IMPORTANT INFO - PLEASE READ

The projects are part of your CS110P Design Project, worth 2 credit points. Project 1.1 contributes 16% to your CS110P grade. These projects run in parallel with the course, meaning that project and homework deadlines may be very close to each other. Start early and avoid procrastination!

Introduction to Project 1.1

Our objective is to implement a basic RISC-V assembler that converts assembly instructions into machine code. This implementation will support the .data segment of assembly files and focus specifically on the RV32I instruction set along with partial RV32M extensions.

Before you start, please accept the assignment via this link. A repository containing the starter code will be generated for you. You can then start on the assignment by cloning it to your Ubuntu system. To submit the assignment, push your local clone to GitHub and turn in your work on Gradescope by connecting it to your GitHub repo.

Academic integrity is strictly enforced in this course, and any plagiarism will have serious consequences. You have been warned.

The assembler operates in two phases:

Phase 1: Parse the assembly source (.S file), remove comments, populate a symbol table with labels, and generate an intermediate representation containing basic instructions and expanded pseudo-instructions.
Phase 2: Using the symbol table and intermediate representation, translate each instruction into its corresponding machine code (binary representation) and output the result in hexadecimal format.

For machine code transformation, you can refer to the RISC-V Green Sheet.

Background of the Instruction Set

Below is the instruction set you will be implementing:

R-Type Instructions (14 total)

add, sub, xor, or, and, sll, srl, sra, slt, sltu, mul, mulh, div, rem

I-Type Instructions (16 total)

addi, xori, ori, andi, slli, srli, srai, slti, sltiu, lb, lh, lw, lbu, lhu, jalr, ecall

S-Type Instructions (3 total)

sb, sh, sw

SB-Type Instructions (6 total)

beq, bne, blt, bge, bltu, bgeu

U-Type Instructions (2 total)

lui, auipc

UJ-Type Instruction (1 total)

jal

Pseudo Instructions (9 total)

beqz, bnez, li, mv, j, jr, jal, jalr, lw

For pseudo-instructions not covered by the RISC-V Green Sheet, refer to this link.

Hint: The immediate value following li may exceed the range supported by addi. In such cases, you should use a combination of lui and addi to correctly represent li.
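
The hint above can be sketched as follows. This is a minimal illustration, not part of the starter code: the helper names `fits_in_12_signed` and `split_li_imm` are hypothetical. The key detail is that addi sign-extends its 12-bit immediate, so when bit 11 of the value is set, the upper part loaded by lui has to be rounded up by one to compensate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not in the starter code): does imm fit in the
   12-bit signed immediate field of addi? */
static int fits_in_12_signed(int32_t imm) {
  return imm >= -2048 && imm <= 2047;
}

/* Hypothetical helper: split a 32-bit value into a 20-bit upper part (for
   lui) and a 12-bit signed lower part (for addi) such that
   (upper20 << 12) + lower12 == imm modulo 2^32. */
static void split_li_imm(int32_t imm, uint32_t* upper20, int32_t* lower12) {
  assert(upper20 != NULL && lower12 != NULL);
  uint32_t u = (uint32_t)imm;
  /* Adding 0x800 rounds the upper part up whenever bit 11 is set, which
     cancels the sign extension that addi applies to the lower part. */
  *upper20 = ((u + 0x800u) >> 12) & 0xFFFFFu;
  int32_t low = (int32_t)(u & 0xFFFu);
  *lower12 = (low >= 0x800) ? low - 0x1000 : low;
}
```

For example, 0x12345FFF splits into an upper part of 0x12346 and a lower part of -1, because 0x12346000 - 1 = 0x12345FFF. A value for which `fits_in_12_signed()` returns true needs only a single addi.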

Getting Started

Directory Tree



.
├── Makefile
├── README.md
├── assembler.c
├── assembler.h
├── test
│   ├── in
│   │   ├── labels.s
│   │   ...
│   ├── Makefile
│   └── ref
│       ├── labels.log
│       ...
└── src
    ├── block.c
    ├── block.h
    ├── tables.c
    ├── tables.h
    ├── translate.c
    ├── translate.h
    ├── translate_utils.c
    ├── translate_utils.h
    ├── utils.c
    └── utils.h

How a RISC-V Assembly File Is Translated

The main assembly process is implemented in assembler.c:assemble().

Consider an assembly input file with the following content:


      

li   x1, 0
li   x2, 5

loop:
    addi x1, x1, 1
    blt  x1, x2, loop
    j    exit

exit:

Phase 1 - `pass_one()`

During this phase, the assembler processes the input file line by line:

Labels (e.g., loop and exit) are recorded in the symbol table (table) along with their corresponding addresses.
Instructions:
- Pseudo-instructions are expanded using their respective handlers and stored in the intermediate representation block blk.
- Regular instructions are recorded directly in blk.

After Phase 1, the intermediate results are as follows:

table (Symbol Table): Stores label-address mappings. See src/tables.h and src/tables.c for details.
```
8  loop
20 exit
```
blk (Intermediate Representation Block): Stores expanded instructions while retaining unresolved label references. For example, li x1, 0 is expanded into addi x1, x0, 0.
```
addi x1, x0, 0
addi x2, x0, 5
addi x1, x1, 1
blt x1, x2, loop
jal zero, exit
```

Phase 2 - `pass_two()`

In this phase, each instruction in blk is processed sequentially:

Labels are resolved using the symbol table (src/translate.c:translate_inst()).
Each instruction is converted into its corresponding machine code in binary form.

Final Output (Hexadecimal Machine Code)

This output represents the fully assembled machine code for the given input assembly file.
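
To illustrate what Phase 2 computes for this example, here is a minimal sketch of the bit packing for the three formats it uses (I, B, and J). The function names are hypothetical and are not the starter code's write_* functions. The field layouts follow the RISC-V Green Sheet.

```c
#include <assert.h>
#include <stdint.h>

/* I-type layout: imm[11:0] | rs1 | funct3 | rd | opcode */
static uint32_t encode_itype(uint32_t opcode, uint32_t funct3, uint32_t rd,
                             uint32_t rs1, int32_t imm) {
  assert(imm >= -2048 && imm <= 2047);
  return (((uint32_t)imm & 0xFFFu) << 20) | (rs1 << 15) | (funct3 << 12) |
         (rd << 7) | opcode;
}

/* B-type layout: imm[12|10:5] | rs2 | rs1 | funct3 | imm[4:1|11] | opcode.
   offset is the byte distance from the branch to its target. */
static uint32_t encode_btype(uint32_t opcode, uint32_t funct3, uint32_t rs1,
                             uint32_t rs2, int32_t offset) {
  assert(offset >= -4096 && offset <= 4094 && (offset & 1) == 0);
  uint32_t imm = (uint32_t)offset;
  return (((imm >> 12) & 0x1u) << 31) | (((imm >> 5) & 0x3Fu) << 25) |
         (rs2 << 20) | (rs1 << 15) | (funct3 << 12) |
         (((imm >> 1) & 0xFu) << 8) | (((imm >> 11) & 0x1u) << 7) | opcode;
}

/* J-type layout: imm[20|10:1|11|19:12] | rd | opcode */
static uint32_t encode_jtype(uint32_t opcode, uint32_t rd, int32_t offset) {
  assert(offset >= -1048576 && offset <= 1048574 && (offset & 1) == 0);
  uint32_t imm = (uint32_t)offset;
  return (((imm >> 20) & 0x1u) << 31) | (((imm >> 1) & 0x3FFu) << 21) |
         (((imm >> 11) & 0x1u) << 20) | (((imm >> 12) & 0xFFu) << 12) |
         (rd << 7) | opcode;
}
```

With the symbol table above, blt sits at address 12 and targets loop at 8 (offset -4), while jal sits at address 16 and targets exit at 20 (offset +4). Packing the five instructions this way gives the words 0x00000093, 0x00500113, 0x00108093, 0xFE20CEE3, and 0x0040006F. The exact text layout of the .out file is defined by the reference files, not by this sketch.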

Parts to Implement

The following are the functions and structures you need to complete in the various source files. Please carefully review the code and comments in the relevant files, complete the required operations, and return the correct values.

`assembler.c`

The main assembly workflow (assemble()) and helper functions are provided. You need to implement the following label addition function and two-phase parsing functions:

static int add_if_label(uint32_t input_line, char* str, uint32_t byte_offset, SymbolTable* symtbl);

int pass_one(FILE* input, Block* blk, SymbolTable* table);

int pass_two(Block* blk, SymbolTable* table, FILE* output);

`tables.h` and `tables.c`

You need to complete the SymbolTable structure defined in src/tables.h:

typedef struct {
  /* Define your data structure here. */
  uint32_t len;
  uint32_t cap;
  int mode;
} SymbolTable;

Then implement the following SymbolTable management interfaces in src/tables.c:

SymbolTable* create_table(int mode);

void free_table(SymbolTable* table);

static Symbol* lookup(SymbolTable* table, const char* name);

int add_to_table(SymbolTable* table, const char* name, uint32_t addr);

int64_t get_addr_for_symbol(SymbolTable* table, const char* name);

void resize_table(SymbolTable* table);
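
As a rough illustration of one possible design, the sketch below stores symbols in a growable array that doubles its capacity when full. Everything here is hypothetical (the `Demo*` names, the initial capacity of 4, returning -1 for a missing or duplicate name), and the role of the `mode` field is not modeled. Follow the comments in src/tables.h for the actual required behavior.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  char* name;     /* heap-allocated copy of the label name */
  uint32_t addr;  /* byte offset of the label */
} DemoSymbol;

typedef struct {
  DemoSymbol* entries;
  uint32_t len;  /* number of entries in use */
  uint32_t cap;  /* number of entries allocated */
} DemoTable;

static DemoTable* demo_create(void) {
  DemoTable* t = malloc(sizeof *t);
  if (t == NULL) return NULL;
  t->len = 0;
  t->cap = 4;
  t->entries = malloc(t->cap * sizeof(DemoSymbol));
  if (t->entries == NULL) { free(t); return NULL; }
  return t;
}

/* Returns the address, or -1 if the name is absent. int64_t can hold every
   uint32_t address and still leave room for the -1 sentinel. */
static int64_t demo_lookup(const DemoTable* t, const char* name) {
  assert(t != NULL && name != NULL);
  for (uint32_t i = 0; i < t->len; i++) {
    if (strcmp(t->entries[i].name, name) == 0) return (int64_t)t->entries[i].addr;
  }
  return -1;
}

/* Returns 0 on success, -1 on a duplicate name or allocation failure. */
static int demo_add(DemoTable* t, const char* name, uint32_t addr) {
  if (demo_lookup(t, name) != -1) return -1;
  if (t->len == t->cap) {
    uint32_t new_cap = t->cap * 2;
    DemoSymbol* grown = realloc(t->entries, new_cap * sizeof(DemoSymbol));
    if (grown == NULL) return -1;  /* the old array is still valid */
    t->entries = grown;
    t->cap = new_cap;
  }
  char* copy = malloc(strlen(name) + 1);
  if (copy == NULL) return -1;
  strcpy(copy, name);
  t->entries[t->len].name = copy;
  t->entries[t->len].addr = addr;
  t->len++;
  return 0;
}

static void demo_free(DemoTable* t) {
  if (t == NULL) return;
  for (uint32_t i = 0; i < t->len; i++) free(t->entries[i].name);
  free(t->entries);
  free(t);
}
```

Copying each name into its own allocation keeps the table independent of the caller's line buffer, and freeing every copy in `demo_free()` is what keeps a Valgrind run clean.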

`translate_utils.h` and `translate_utils.c`

You need to implement the following utility functions, which will be frequently used during the instruction translation process:

int translate_num(long int* output, const char* str, ImmType type);

int translate_reg(const char* str);

int is_valid_imm(long imm, ImmType type);
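
A minimal sketch of the string-to-number step, assuming immediates are written either in decimal or with a 0x prefix. The helper name `parse_long` and its 0/-1 return convention are hypothetical; the starter code's comments define the real contract for translate_num(). The sketch rejects empty strings, trailing garbage, and values that overflow long.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal or 0x-prefixed hexadecimal number.
   Returns 0 and stores the value in *out on success, -1 otherwise. */
static int parse_long(const char* str, long* out) {
  if (str == NULL || out == NULL || *str == '\0') return -1;

  /* Skip an optional sign to look for a hex prefix. */
  const char* digits = (*str == '-' || *str == '+') ? str + 1 : str;
  int base = (digits[0] == '0' && (digits[1] == 'x' || digits[1] == 'X')) ? 16 : 10;

  char* end = NULL;
  errno = 0;
  long value = strtol(str, &end, base);
  if (errno == ERANGE) return -1;                /* overflow or underflow */
  if (end == str || *end != '\0') return -1;     /* no digits, or trailing junk */

  *out = value;
  return 0;
}
```

Checking both `errno` and the end pointer matters: strtol() alone happily returns 12 for the input "12abc".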

Different instruction types accept different immediate ranges. Add more immediate types to ImmType and add the corresponding range validation to is_valid_imm():

typedef enum {
  IMM_NONE,         /* No immediate value */
  IMM_12_SIGNED,    /* 12-bit signed number */

  /* Add more types here */
} ImmType;
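
Range validation for each immediate type reduces to two generic checks, sketched below. The helper names are hypothetical. In RV32I, a 12-bit signed immediate spans [-2048, 2047], and the shift amount of slli, srli, and srai is a 5-bit unsigned value in [0, 31]. Which additional types you need, and whether each one is treated as signed or unsigned, is for you to decide from the starter code and the reference outputs.

```c
#include <assert.h>

/* Hypothetical helper: does imm fit in a two's-complement field of the
   given width? For bits = 12 the range is [-2048, 2047]. */
static int fits_signed(long imm, unsigned bits) {
  assert(bits >= 1 && bits <= 31);
  long min = -(1L << (bits - 1));
  long max = (1L << (bits - 1)) - 1;
  return imm >= min && imm <= max;
}

/* Hypothetical helper: does imm fit in an unsigned field of the given
   width? For bits = 5 the range is [0, 31]. */
static int fits_unsigned(long imm, unsigned bits) {
  assert(bits >= 1 && bits <= 30);
  return imm >= 0 && imm <= (1L << bits) - 1;
}
```

Branch and jump offsets additionally have to be even, since bit 0 of the offset is not stored in the instruction.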

Pass One Writing Function (`translate.c`)

write_pass_one() already calls the relevant handlers for pseudo-instruction expansion. You need to complete the handling of regular instructions.

unsigned write_pass_one(Block* blk, const char* name, char** args, int num_args);

Pseudo Expansion (`translate.c`)

You need to complete the instruction table and implement the following functions, which expand pseudo-instructions and save the results in the intermediate representation block:

static const InstrInfo instr_table[];

unsigned transform_beqz(Block* blk, char** args, int num_args);

unsigned transform_bnez(Block* blk, char** args, int num_args);

unsigned transform_li(Block* blk, char** args, int num_args);

unsigned transform_mv(Block* blk, char** args, int num_args);

unsigned transform_j(Block* blk, char** args, int num_args);

unsigned transform_jr(Block* blk, char** args, int num_args);

unsigned transform_jal(Block* blk, char** args, int num_args);

unsigned transform_jalr(Block* blk, char** args, int num_args);

unsigned transform_lw(Block* blk, char** args, int num_args);
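
The single-instruction expansions follow the standard RISC-V mappings: beqz rs, label becomes beq rs, x0, label; bnez rs, label becomes bne rs, x0, label; mv rd, rs becomes addi rd, rs, 0; and j label becomes jal x0, label. The sketch below formats these four into a caller-supplied buffer. It is only an illustration: the function name is hypothetical, it writes text rather than adding to a Block, and the register spelling (x0 versus zero) in the reference .inst files may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the expansion of a simple pseudo-instruction
   into buf. Returns the number of instructions written (1), or 0 if the
   name is not handled here, the argument count is wrong, or buf is too
   small. */
static unsigned expand_simple_pseudo(char* buf, size_t size, const char* name,
                                     char** args, int num_args) {
  assert(buf != NULL && size > 0 && name != NULL);
  int n = -1;
  if (strcmp(name, "beqz") == 0 && num_args == 2) {
    n = snprintf(buf, size, "beq %s, x0, %s", args[0], args[1]);
  } else if (strcmp(name, "bnez") == 0 && num_args == 2) {
    n = snprintf(buf, size, "bne %s, x0, %s", args[0], args[1]);
  } else if (strcmp(name, "mv") == 0 && num_args == 2) {
    n = snprintf(buf, size, "addi %s, %s, 0", args[0], args[1]);
  } else if (strcmp(name, "j") == 0 && num_args == 1) {
    n = snprintf(buf, size, "jal x0, %s", args[0]);
  }
  /* snprintf returns the length it wanted to write; reject truncation. */
  return (n >= 0 && (size_t)n < size) ? 1u : 0u;
}
```

Returning the instruction count mirrors the unsigned return type of the transform_* prototypes above, with 0 used here to signal an error; check the starter code comments for the convention it actually expects.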

Instruction Writing Functions (`translate.c`)

You need to implement the following functions to translate regular instructions and output them in hexadecimal format:

int write_rtype(FILE* output, const InstrInfo* info, char** args, size_t num_args);

int write_itype(FILE* output, const InstrInfo* info, char** args, size_t num_args, uint32_t addr, SymbolTable* symtbl);

int write_stype(FILE* output, const InstrInfo* info, char** args, size_t num_args);

int write_sbtype(FILE* output, const InstrInfo* info, char** args, size_t num_args, uint32_t addr, SymbolTable* symtbl);

int write_utype(FILE* output, const InstrInfo* info, char** args, size_t num_args, uint32_t addr, SymbolTable* symtbl);

int write_ujtype(FILE* output, const InstrInfo* info, char** args, size_t num_args, uint32_t addr, SymbolTable* symtbl);
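
As a reference point for the field layouts, here is a minimal sketch of the bit packing for the R, S, and U formats, following the RISC-V Green Sheet. The function names and parameter order are hypothetical; the real write_* functions also have to parse and validate their arguments and print the result.

```c
#include <assert.h>
#include <stdint.h>

/* R-type layout: funct7 | rs2 | rs1 | funct3 | rd | opcode */
static uint32_t encode_rtype(uint32_t opcode, uint32_t funct3, uint32_t funct7,
                             uint32_t rd, uint32_t rs1, uint32_t rs2) {
  assert(rd < 32 && rs1 < 32 && rs2 < 32);
  return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) |
         (rd << 7) | opcode;
}

/* S-type layout: imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode */
static uint32_t encode_stype(uint32_t opcode, uint32_t funct3, uint32_t rs1,
                             uint32_t rs2, int32_t imm) {
  assert(imm >= -2048 && imm <= 2047);
  uint32_t u = (uint32_t)imm;
  return (((u >> 5) & 0x7Fu) << 25) | (rs2 << 20) | (rs1 << 15) |
         (funct3 << 12) | ((u & 0x1Fu) << 7) | opcode;
}

/* U-type layout: imm[31:12] | rd | opcode, where imm20 is the 20-bit
   value written in the assembly source. */
static uint32_t encode_utype(uint32_t opcode, uint32_t rd, uint32_t imm20) {
  assert(imm20 <= 0xFFFFFu);
  return (imm20 << 12) | (rd << 7) | opcode;
}
```

For example, add x1, x2, x3 packs to 0x003100B3, sub differs only in funct7 (0x403100B3), mul uses funct7 = 1 (0x023100B3), sw x5, 8(x2) packs to 0x00512423, and lui x5, 0x12345 packs to 0x123452B7.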

For the functions mentioned above, you can quickly locate them by searching for /* IMPLEMENT ME */ in the source files.

Each function should be implemented within the designated section:

/* === start === */
// Your implementation goes here.
...
/* === end === */

The framework code is only meant to provide a general approach. In addition to completing the required sections, please also pay attention to the return values. Some functions have default return values as placeholders, which may be out of the expected range. Make sure to update the return values accordingly.

Important: If you need to add helper functions, additional variables, structures, macro definitions, etc., you are free to include them in the files we provide. However, since the `Makefile` is fixed, please do not add extra files, as this will lead to compilation issues.

How to Run the Assembler

After running:

make assembler

an assembler executable will be generated. (This command will automatically invoke make clean first.)

To run your newly generated assembler, use the following command:

./assembler --input_file <input_file> --output_folder <output_folder>

<input_file>: The input RISC-V file.
<output_folder>: The directory where the output files will be stored.

When processing an input file such as test.S, the <output_folder> will contain two files:

test.out: Contains the binary instruction results.
test.log: A log file that records all errors or confirms a successful assembly.

To achieve a correct result, ensure that both of these files match the corresponding reference files provided.

Example Usage

If you want to assemble testcases/testcase1.S and save the output in the out/ directory, run:

./assembler --input_file testcases/testcase1.S --output_folder out/

To check valid labels and instructions after the first pass, add the --test option:

./assembler --input_file <input_file> --output_folder <output_folder> --test

This will generate .tbl and .inst files in <output_folder> for intermediate verification.

Note: The correctness of these .tbl and .inst files does not affect your grade—they are provided solely as a reference for you to verify your intermediate results. If your .out and .log files are correct, you can ignore any differences in .tbl and .inst.

How to Perform Testing

1. Provided Test Cases

We have provided several test cases that you can run using:

make check

Test case inputs are stored in ./test/in/, and their outputs are saved in the ./test/out/ folder.

To remove all previously generated output, you can run:

make clean

Important: Running make check does not automatically execute make assembler. You must ensure that you have built the latest version of your assembler before running the tests.

Evaluation Criteria

Your assembler will be evaluated based on three aspects:

Correct binary instruction generation
Proper error handling (catching all errors)
Memory safety (no memory leaks)

Checking Output and Errors

For correctness, we compare your .log and .out files against the reference outputs using the diff command. If you see "Diff .out check failed" or "Diff .log check failed", it means your output differs from the expected result. You can manually compare the files using:

diff file1 file2

For more details on diff results, refer to this guide.

Memory Leak Detection

If you receive a "Valgrind check failed" message, check out/%.memcheck for error details.

To detect memory leaks, we use the following command:

valgrind --tool=memcheck --leak-check=full --track-origins=yes ./assembler --input_file <input_file> --output_folder <output_folder>

2. Running a Single Test Case

We also provide a way to run selected test cases, saving time and allowing you to test custom cases.
To run a specific test case:

make test TEST_NAME=<test_name>

This will use ./test/in/<test_name>.S as input and output results to ./test/out/.

If you want to include your own test case every time you run make check, you can modify the Makefile located in ./test/. Simply add the name of your test case to the FULL_TESTS variable.

Running Your Own Test Case

If you create a custom test file (test.S), follow these steps:

Move your test case to the correct folder:
```
mv test.S ./test/in/test.S
```
Run the test:
```
make test TEST_NAME=test
```

Using Venus for Test Case Creation

For easier test case generation, you can use Venus, a powerful RISC-V simulator.

Steps to Use Venus

Write RISC-V assembly instructions in the Editor page.
Navigate to the Simulator page to view the corresponding machine code.
Use the Dump button to export the machine code for reference.

Grading

Warning: Passing all local tests does not guarantee full marks!

The provided test cases are only a basic guide. The Online Judge (OJ) system contains many additional corner cases that will rigorously test your assembler.

You should thoroughly test your implementation to ensure robustness. Do not rely on OJ as your debugger!

Submission

Submit on Gradescope by selecting your GitHub repository and the right branch.

Only your last active submission will count toward your project grade. Make sure it is your best version.

Due: 2025/03/27, 23:59