Deep Dive: Puffin
Puffin. A simple, interpreted programming language, implemented in Rust.
Puffin
Puffin is a dynamically typed programming language supporting a mix of imperative and functional features. A more complete list of features can be found in the Github readme.
Motivations and Origins
During my last quarter at the UW, I took a languages course, CSE 413. While this class served as a great intro to language implementation, I wanted to push myself a bit further.
My only initial goal was to build my own fully-featured programming language, from scratch. Puffin was actually my second attempt at building my own language, I initially created a simpler language called SMP. SMP served as a bit of a dry run for implementing Puffin, the smaller scope helped me focus on the essential features without getting bogged down in the complexity of a more featured language, the challenges here also saved me tons of headache writing Puffin.
SMP Language Internals
The SMP Language is quite simple, each program is made up of a series of top-level definitions, one of which must be a zero-argument function named main
. The main
function serves as the entry-point to the program. SMP only has 2 types, Num (backed by a Rust f64
), and Array (backed by a Rust Vec<f64>
).
The SMP interpreter is composed of 3 primary modules, scanner
, ast
, and interpreter
.
scanner
: A handwritten LL(1) parser, takes a program as aString
and producesToken
’s.ast
: Takes ascanner
for a program, produces the program AST.interpreter
: Takes a Program AST, and evaluates it.
As you can see these 3 modules form a pipeline that parse, validate, and execute an SMP program.
SMP Challenges
While simple, SMP was still a challenge to implement. Writing my own parser by hand felt like the easiest solution at the time, but in retrospect was way more trouble that it was worth. Limiting myself to LL(1) meant that I had to write some pretty hacky code in ast and interpreter to resolve ambiguities in the grammar.
I also wish I had spent more time refining the grammar and deciding language semantics before I started implementing it. There were many points where I found my language to be under-specified and was forced to refactor my code around some change. One example of this is the semantics of return
. If you’ll notice, SMP does not have a null
type, this means that every function must return a value, even if you just want to call it for a side-effect. The consequence of this is you end up returning useless values like 0
leading to pretty unclear and polluted code. Lets look at a simple program to print 1-10 in SMP via a function call.
# SMP
def printNums() {
lo := 1;
hi := 10;
# no for-loops
while (lo <= hi) {
# Statements that are standalone expression
# output to stdout,
# builtin functions were hacky already so I
# didn't implement a print
lo;
# no augmented assignment
lo := lo + 1;
}
# useless return value
return 0;
}
def main() {
printNums();
# but main doesn't need to return?
}
As you can see, printNums
is forced to return a garbage value: 0. However, this rule is not even consistent, my implementation doesn’t require main to return anything! By the time I realized SMP had flaws like this, it was too late, refactoring would have been more trouble that it was worth, and so it was time to move on to my next language, Puffin!
Puffin: Primary goals
When I started designing Puffin, I wanted to go beyond just iterating on SMP. I wanted an entirely new language with a more complex feature-set, yet a more maintainable codebase. I made the following decisions before writing a single line of code:
- Puffin should have a more complex type system, supporting Strings, dynamic arrays, structures, and first-class functions.
- I should use a parser generator instead of a handwritten parser to make new features easier, and to focus more on the interpreter.
- Builtin functions should be easy to add, and require as little separate code as possible compared to a regular function.
With these tenets in mind, I initialized my crate and began:
Puffin: Implementation
The Puffin interpreter actually mirrors SMP in structure, we again have 3 main modules that take on similar roles. parser
, ast
, and interpreter
.
Puffin: Parser
Puffins parser is not handwritten, nor is it a LL(1) parser. While Rust is still a relatively young language, the ecosystem did offer me a few choices for parser tools. In the end I settled on pest, a Parsing Expression Grammar (PEG) based parser which can derive a fully-functional parser from PEG. A PEG is a type of formal grammar which describes rules for recognizing strings in some language. PEG’s are similar to BNF, they describe a set of rules that some input must match. PEG’s have some additional features and restrictions that eliminate ambiguity, such as forbidding left-recursion, and making choices ordered. Let’s look at a simple example from the Puffin’s PEG.
// pestfile
statement = {
return_statement ~ ";" |
assign_statement ~ ";" |
exp ~ ";" |
nest
}
return_statement = { "return" ~ exp }
// assignment, can be augmented by certain operators
assign_statement = { exp ~ aug? ~ "=" ~ exp }
// assignment augmentation operators
aug = {
"+" |
"-" |
"*" |
"/" |
"%"
}
Translated in to english, this PEG tells us the following:
- A string matches “statement” if it is either a return_statement, assign_statement, expression, or a nest.
- A return statement is the literal “return”, followed by an expression.
- An assign statement is an expression, optionally followed by an augmentation, an equals sign, followed by an expression.
Other rules such as exp, nest, etc. are found elsewhere in the file.
As you can see, PEG’s are extremely expressive compared to BNF, and allow us to use regex to eliminate ugly recursive rules.
From here, pest does the rest, here is the entire definition of Puffin’s parser struct:
// Rust
#[derive(Parser)]
#[grammar = "puffin.pest"]
pub struct PuffinParser;
That’s it! The Parser
trait derivation does the work for us, and we can now parse arbitrary programs in Puffin. When it successfully parses some input our parser will return us a Pair<Rule>
, which is essentially a nested structure holding the parsed tokens of our program.
Puffin: AST
Now that we have our parser, we can move on to the AST. In Puffin, the AST module is responsible for translating our semi-structured parser output, into a fully-fledged program AST. AST is also responsible for parsing literals, and distinguishing keywords from identifiers. While the correspondence is not one to one, nearly every rule in our PEG has a corresponding build_
function in ast that is responsible for validating it and returning the appropriate AST node. Here are some examples:
// Rust
/// `Rule: program`
pub fn build_program(program: Pair<Rule>) -> Result<Program, ASTError> {
// snip
}
/// `Rule: statement`
pub fn build_statement(statement: Pair<Rule>) -> Result<Statement, ASTError> {
// snip
}
/// `Rule: return_statement`
fn build_return(return_statement: Pair<Rule>) -> Result<Statement, ASTError> {
// snip
}
/// `Rule: assign_statment`
fn build_assign(assign_statement: Pair<Rule>) -> Result<Statement, ASTError> {
// snip
}
This whole module essentially serves as a tree translation, in fact, if we strip down the output from the parser, and compare it to that of the AST, they are remarkably similar. Here is a comparison for this simple Puffin program:
// Puffin
add_nums = fn(a, b) {
return a + b;
};
println(add_nums(2, 5));
Parser:
rule: program,
rule: statement,
rule: assign_statment,
rule: exp,
rule: value,
rule: name,
rule: exp,
rule: value,
rule: function,
rule: name,
rule: block,
rule: statement,
rule: return_statment,
rule: exp,
rule: value,
rule: name,
rule: sum_op,
rule: value,
rule: name,
rule: statement,
rule: exp,
rule: value,
rule: name,
rule: post_op,
rule: call,
rule: exp,
rule: value,
rule: name,
rule: post_op,
rule: call,
rule: exp,
rule: value,
rule: num,
rule: exp,
rule: value,
rule: num,
rule: EOI,
AST:
Program {
Statement {
statement: Assign {
lhs: Assignable {
name: "add_nums",
rhs: Exp {
Value(
FunctionDef {
args: [
"a",
"b",
block: Block {
block: [
Statement {
statement: Return(
Exp {
Value(
Name(
"a",
Operator(
Infix(
Plus,
Value(
Name(
"b",
Statement {
statement: Exp(
Exp {
Value(
Name(
"println",
Operator(
Postfix(
Call(
exp: [
Value(
Name(
"add_nums",
Operator(
Postfix(
Call(
Exp {
Value(
Num(
2.0,
Exp {
Value(
Num(
5.0,
Ignoring the details, you can probably tell that they nearly the same structurally, the AST module just validates and adds additional structure to the parsers output, it’s just another tree!
Puffin: Interpreter
Now that we’ve produced our final program AST, we can actually execute our program. This is where the interpreter module comes in. From a high level, interpreter looks kind of like ast, there is an eval_
function for almost every AST node, responsible for evaluating the node, and calling other functions to evaluate subnodes. Lets take a look at a full example:
// Rust
fn eval_block(block: &Block, env: &Rc<RefCell<Environment>>
) -> Result<Option<Value>, InterpreterError> {
for statement in &block.block {
// propagate return statements
if let Some(return_value) = eval_statement(statement, env)? {
return Ok(Some(return_value));
}
}
Ok(None)
}
eval_block
is responsible for evaluating the Block
AST node. A block is just any collection of statements found between curlies, like the inside of an if-statement or loop. Because of this, we evaluate a block by just evaluating statements inside the block. eval_statement
actually returns an Option<Value>
, which is just the Rust way of saying that it might return a value; in the Puffin interpreter, only return-statements produce values, this allows us to propagate their value to the callers of eval_statement
without doing so for other kinds of statements. eval_block
uses the same pattern, as you can see the return type also contains Option<Value>
, which is to say, the block might produce a value if it has a return statement in it, or it might produce nothing. If one of the statements in our block returns a value, we know it was a return-statement, and therefore we can stop evaluating the block and return.
Another pervasive pattern throughout the interpreter is the passing of an “Environment”. Nearly every eval function has a parameter env: &Rc<RefCell<Environment>>
. The inner type here is Environment
, a struct found in the submodule of interpreter; value
. Environment
boils down to a map between names and value
’s, it holds all of the bindings in our program at runtime. If you run the following Puffin code: x = 123.4
, you will end up the the following entry in the Environment
: "x": Value::Num(123.4)
. Passing this mapping between functions basically gives the evaluation context for any piece of code, it lets us resolve names to values, and update bindings as we go.
Puffin: Value
Lastly, we will discuss the value
module inside of interpreter
, this module contains the Value
enum definition, which actually holds the runtime value of all types in Puffin. Here’s what it looks like:
// Rust
pub enum Value {
/// Puffin Null
Null,
/// Puffin Number
Num(f64),
/// Puffin String
String(String),
/// Puffin Array
Array(Rc<RefCell<Vec<Value>>>),
/// Puffin Structure
Structure(Rc<RefCell<HashMap<String, Value>>>),
/// Puffin Closure
Closure {
kind: ClosureKind,
args: Vec<String>,
block: Block,
environment: Rc<RefCell<Environment>>,
},
/// Puffin Builtin function
Builtin(Builtin),
}
For primitive, pass-by-value types like Num and String, we just store the corresponding Rust type. For pass-by-reference types like Array
and Structure
, we use a Rc<RefCell<T>>
, which allows us to store a shared mutable reference. This also has the fantastic side effect of essentially acting as our garbage collector, Rc<T>
is a reference-counted pointer in Rust, it holds an inner T
, de-allocating it only when the last copy is de-allocated. By cloning the Rc<T>
, we can allow multiple Puffin variables to point to the same underlying value, something that wasn’t possible in SMP.
Puffin: Challenges
Despite having implemented SMP first, Puffin was not all smooth sailing. Many ubiquitous features of modern languages are not nearly as trivial as you might imagine to implement. One great example of this is recursion with first-class functions. Puffin functions evaluate to Closure
’s, a data-structure that holds the functions arguments, body, and local environment. This allows us to easily create functions like this: fn(a, b) => a+b
. While this is nice and simple for anonymous functions, it forces us to create special cases for named functions. If you scroll back up to the Value
definition, you will see that Closure
has a field kind
that holds a ClosureKind
, this is another enum that has one of 3 values:
// Rust
pub enum ClosureKind {
/// Anonymous, not assigned to a name
Anonymous,
/// Structure receiver, holds reference to structure which will be implicit first argument
Receiver(Rc<RefCell<HashMap<String, Value>>>),
/// Named function, name bound to closure in closures environment
Named(String),
}
Ignoring Receiver
for now, we can see we have both Anonymous
and Named
functions. If we evaluate the function fn(a, b) => a+b
we don’t have any name to bind it to, so it must be anonymous right? Well what if this expression was part of a larger statement like myfunc = fn(a, b) => a+b;
? This causes problems when evaluating the right hand side of the assignment, we no longer have the context that the expression is part of an assignment, and so we have no choice but to assume (for the time being) that the function is anonymous. Fortunately there is somewhere that we have that context; when we evaluate the assignment! We just need to include a check to see if the thing on the right is a closure, and if it is, we can update kind
to be Named
, and boom, Puffin now supports recursion, no need to change anything else! Here is that the check in eval_assign
looks like:
// Rust
/* ... snip ... */
// if we are binding a closure, convert to a named closure
if let Value::Closure { args,block, environment, .. } = value
{
let func_bind = Value::Closure {
kind: ClosureKind::Named(name.clone()),
args,
block,
environment,
};
/* ... snip ... */
}
/* ... snip ... */
To support recursion, we can use the value inside the Named
enum to bind the functions definition to itself when calling it:
// Rust
// This is inside eval_call
/* ... snip ... */
if let ClosureKind::Named(name) = kind {
// Here we bind the closure (callable) to the functions name
// in the environment the closure will execute in (subenv)
subenv.borrow_mut().bind(name, callable.clone())?;
}
/* ... snip ... */
Despite challenges like this, Rust made it incredibly easy to refactor my code to support new features with minimal impact. Enums with fields and exhaustive pattern matching make it trivial to add new cases without knocking over your existing code.
Puffin: Outcome
After a few weeks of work, I’m fairly proud of Puffin, while it’s not going to replace Python or C anytime soon, it is my fully-featured programming language, made completely from scratch! Check out the repositories readme and samples section for some cool demonstrations of Puffins features. If I could do it all again there are probably some changes I would make, later in development I squeezed in some object-oriented-ish features like receiver functions, while these work well from the users perspective, performance wise they behave like more of an afterthought, and that’s because they were. Ultimately however, this was a learning experience, I don’t intend for people to be running their high-performance workloads on Puffin, or anything at all for that matter, but it did give me a new appreciation for the complexity of creating and maintaining a language, and I picked up a few new tricks along the way.