
title: Minicompiler: Lexing
date: 2020-10-29
series: rust
tags: rust, templeos, compiler

Minicompiler: Lexing

I've always wanted to make my own compiler. Compilers are an integral part of my day to day job, and I use the fruits of them constantly. A while ago, while browsing through the TempleOS source code, I found MiniCompiler.HC in the ::/Demos/Lectures folder and was a bit blown away. It implements a two-phase compiler from simple math expressions to AMD64 machine code (complete with bit-banging it into an array that the code later jumps to) and has a lot to teach about how compilers work. For those of you that don't have a TempleOS VM handy, here is a video of MiniCompiler.HC in action:

(video: MiniCompiler.HC compiling and running a math expression)

You put in a math expression, the compiler builds it, and then it spits out a bunch of assembly and runs it to return the result. In this series we are going to create an implementation of this compiler that targets WebAssembly. This compiler will be written in Rust and will use only the standard library for everything but the final bytecode compilation and execution phase. There is a lot going on here, so I expect this to be at least a three-part series. The source code will be in Xe/minicompiler in case you want to read it in detail. Follow along and let's learn some Rust on the way!

Compilers for languages like C are built on top of the fundamentals here, but they are much more complicated.

Description of the Language

This language uses normal infix math expressions on whole numbers. Here are a few examples:

  • 2 + 2
  • 420 * 69
  • (34 + 23) / 38 - 42
  • (((34 + 21) / 5) - 12) * 348

Ideally we should be able to nest the parentheses as deep as we want without any issues.

Looking at these values, we can notice a few patterns that will make parsing this a lot easier:

  • There seem to be only four major parts to this language:
    • numbers
    • math operators
    • open parentheses
    • close parentheses
  • All of the math operators act identically and take two arguments
  • Each program is one line long and ends at the end of the line
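
If you like thinking in grammars, here is a rough EBNF-style sketch of the whole language (my own formalization, not something from the original MiniCompiler; note that all operators share one precedence level):

expr     = term , { operator , term } ;
term     = number | "(" , expr , ")" ;
operator = "+" | "-" | "*" | "/" ;
number   = digit , { digit } ;
digit    = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;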

Let's turn this description into Rust code:

Bringing in Rust

Make a new project called minicompiler with a command that looks something like this:

$ cargo new minicompiler

This will create a folder called minicompiler and a file called src/main.rs. Open that file in your editor and copy the following into it:

// src/main.rs

/// Mathematical operations that our compiler can do.
#[derive(Debug, Eq, PartialEq)]
enum Op {
    Mul,
    Div,
    Add,
    Sub,
}

/// All of the possible tokens for the compiler. This limits the compiler
/// to simple math expressions.
#[derive(Debug, Eq, PartialEq)]
enum Token {
    EOF,
    Number(i32),
    Operation(Op),
    LeftParen,
    RightParen,
}

In compilers, "tokens" are the individual parts of the language you are working with. In this case the Token enum covers every possible part of a program.
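
To make this concrete, here is how one of the example expressions from earlier breaks down into these tokens (hand-lexed by me; the lexer that actually produces this comes next):

// "(34 + 23) / 38 - 42" as a stream of tokens:
// [LeftParen, Number(34), Operation(Add), Number(23), RightParen,
//  Operation(Div), Number(38), Operation(Sub), Number(42)]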

And then let's start a function that can turn a program string into a bunch of tokens:

// src/main.rs

fn lex(input: &str) -> Vec<Token> {
    todo!("implement this");
}

Wait, what do you do about bad input such as things that are not math expressions? Shouldn't this function be able to fail?

You're right! Let's make a little error type that represents bad input. For creativity's sake let's call it BadInput:

// src/main.rs

use std::error::Error;
use std::fmt;

/// The error that gets returned on bad input. This only tells the user that it's
/// wrong because debug information is out of scope here. Sorry.
#[derive(Debug, Eq, PartialEq)]
struct BadInput;

// Errors need to be displayable.
impl fmt::Display for BadInput {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "something in your input is bad, good luck")
    }
}

// The default Error implementation will do here.
impl Error for BadInput {}
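
Because BadInput implements Error, it can also be boxed up into the catch-all Box<dyn Error> type that a lot of Rust code passes around. A tiny illustration (mine, not from the original post):

// src/main.rs (illustrative sketch only)

/// Demonstrates that BadInput coerces into the common boxed error type.
fn _example() -> Result<(), Box<dyn Error>> {
    Err(Box::new(BadInput))
}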

And then let's adjust the type of lex() to compensate for this:

// src/main.rs

fn lex(input: &str) -> Result<Vec<Token>, BadInput> {
    todo!("implement this");
}
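
With this signature, callers can propagate the error with the ? operator. Here is a throwaway main to show the shape of that (my own sketch, not part of the original post; with the todo!() body still in place it will panic before ever returning Err):

// src/main.rs (sketch)

fn main() -> Result<(), BadInput> {
    // `?` returns Err(BadInput) early if lexing fails.
    let tokens = lex("2 + 2")?;
    println!("{:?}", tokens);
    Ok(())
}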

So now that we have the function type we want, let's start implementing lex() by setting up the result and a loop over the characters in the input string:

// src/main.rs

fn lex(input: &str) -> Result<Vec<Token>, BadInput> {
    let mut result: Vec<Token> = Vec::new();
    
    for character in input.chars() {
        todo!("implement this");
    }

    Ok(result)
}

Looking at the examples from earlier we can start writing some boilerplate to turn characters into tokens:

// src/main.rs

// ...

for character in input.chars() {
    match character {
        // Skip whitespace
        ' ' => continue,

        // Ending characters
        ';' | '\n' => {
            result.push(Token::EOF);
            break;
        }

        // Math operations
        '*' => result.push(Token::Operation(Op::Mul)),
        '/' => result.push(Token::Operation(Op::Div)),
        '+' => result.push(Token::Operation(Op::Add)),
        '-' => result.push(Token::Operation(Op::Sub)),

        // Parentheses
        '(' => result.push(Token::LeftParen),
        ')' => result.push(Token::RightParen),

        // Numbers
        '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' => {
            todo!("implement number parsing")
        }

        // Everything else is bad input
        _ => return Err(BadInput),
    }
}

// ...

Ugh, you're writing Token:: and Op:: a lot. Is there a way to simplify that?

Yes! enum variants can be shortened to their names with a use statement like this:

// src/main.rs

// ...

use Op::*;
use Token::*;

match character {
    // ...

    // Math operations
    '*' => result.push(Operation(Mul)),
    '/' => result.push(Operation(Div)),
    '+' => result.push(Operation(Add)),
    '-' => result.push(Operation(Sub)),

    // Parentheses
    '(' => result.push(LeftParen),
    ')' => result.push(RightParen),

    // ...
}
    
// ...

Which looks a lot better.

You can use the use statement just about anywhere in your program. However, to keep things flowing nicely, the use statements in these examples sit right next to where they are needed.

Now we can get into the fun that is parsing numbers. When he wrote MiniCompiler, Terry Davis used an approach that is something like this (spacing added for readability):

case '0'...'9':
  i = 0;
  do {
    i = i * 10 + *src - '0';
    src++;
  } while ('0' <= *src <= '9');
  *num = i;

This sets an intermediate variable i to 0 and then consumes characters from the input string as long as they are between '0' and '9'. (The '0' <= *src <= '9' range comparison is a HolyC feature; C would parse that expression very differently.) As a neat side effect of the numbers being input in base 10, you can conceptualize 42 as (4 * 10) + 2. So it multiplies the number accumulated so far by 10 and then adds the new digit to it. Our setup doesn't let us get that fancy as easily; however, we can emulate it with a bit of stack manipulation according to these rules (traced on an example after the list):

  • If result is empty, push this number to result and continue lexing the program
  • Pop the last item in result and save it as last
  • If last is a number, multiply that number by 10 and add the current number to it
  • Otherwise, push last back into result and then push the current number as well
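
Here are those rules run by hand over the input 420 (my own trace):

// '4' → result is empty           → push Number(4)    → [Number(4)]
// '2' → pop Number(4); 4*10 + 2   → push Number(42)   → [Number(42)]
// '0' → pop Number(42); 42*10 + 0 → push Number(420)  → [Number(420)]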

Translating these rules to Rust, we get this:

// src/main.rs

// ...

// Numbers
'0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' => {
    let num: i32 = (character as u8 - b'0') as i32;
    if result.is_empty() {
        result.push(Number(num));
        continue;
    }

    let last = result.pop().unwrap();

    match last {
        Number(i) => {
            result.push(Number((i * 10) + num));
        }
        _ => {
            result.push(last);
            result.push(Number(num));
        }
    }
}
            
// ...

This is not the most robust number parsing code in the world; however, it will suffice for now. Extra credit if you can identify the edge cases!
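As an aside, that long chain of digit patterns can also be written as a range pattern, and char::to_digit can do the numeric conversion for us. An equivalent arm would look something like this (same behavior, just a more idiomatic spelling; the stack manipulation stays as-is):

// Numbers, using a range pattern instead of listing every digit
'0'..='9' => {
    // to_digit(10) is always Some(_) here because the pattern
    // guarantees we have an ASCII digit.
    let num = character.to_digit(10).unwrap() as i32;
    // ... same stack manipulation as above ...
}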

This should cover the tokens for the language. Let's write some tests to be sure everything is working the way we think it is!

Testing

Rust has a robust testing framework built into the standard library. We can use it here to make sure we are generating tokens correctly. Let's add the following to the bottom of main.rs:

#[cfg(test)] // tells the compiler to only build this code when tests are being run
mod tests {
    use super::{Op::*, Token::*, *};

    // registers the following function as a test function
    #[test]
    fn basic_lexing() {
        assert!(lex("420 + 69").is_ok());
        assert!(lex("tacos are tasty").is_err());

        assert_eq!(
            lex("420 + 69"),
            Ok(vec![Number(420), Operation(Add), Number(69)])
        );
        assert_eq!(
            lex("(30 + 560) / 4"),
            Ok(vec![
                LeftParen,
                Number(30),
                Operation(Add),
                Number(560),
                RightParen,
                Operation(Div),
                Number(4)
            ])
        );
    }
}

This test can and probably should be expanded on, but when we run cargo test:

$ cargo test
   Compiling minicompiler v0.1.0 (/home/cadey/code/Xe/minicompiler)
    Finished test [unoptimized + debuginfo] target(s) in 0.22s
     Running target/debug/deps/minicompiler-03cad314858b0419

running 1 test
test tests::basic_lexing ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

And hey presto! We verified that the lexing works the way we think it does. These test cases cover nearly all of the functionality of the language; the one arm they never exercise is the ';'/newline branch that pushes EOF.
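
One more case covers that EOF branch (my addition, inside the same tests module):

assert_eq!(
    lex("2 + 2;"),
    Ok(vec![Number(2), Operation(Add), Number(2), EOF])
);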


This is it for part 1. We covered a lot today. Next time we are going to run a validation pass on the program, convert the infix expressions to reverse polish notation, and then get started on compiling that to WebAssembly. This has been fun so far and I hope you were able to learn from it.

Special thanks to the following people for reviewing this post:

  • Steven Weeks
  • sirpros
  • Leonora Tindall
  • Chetan Conikee
  • Pablo
  • boopstrap
  • ash2x3