Building an Interpreter, part 1

I'm working my way through Ruslan Spivak's tutorial Let’s Build A Simple Interpreter https://ruslanspivak.com/lsbasi-part1/

Here are my notes and reflections from part 1

Part 1

Compiler takes source code and preprocesses it into a machine language.

An interpreter interprets source code without turning it into machine language first.

Lexical analysis: read input of characters and convert into a stream of tokens. Token: an object represetning type and value. Example: the token "3" is an integer with value of 3

exercises

multiple digits

The assumptions are changing where instead of having digit->plus->digit we have an arbitrary number of digits followed by a plus followed by an arbitrary number of digits. Let’s write a function that eats tokens keeping track of their values until we hit a non integer. Then return the resulting values parsed as an integer.

  def eat_integers(self):
      """
      eats integers tokens until a non integer is found. 
      Returns teh resulting integers
      """
      result = ''
      while True:
          # eat tokens until you hit a non integer. Assume its a plus!
          curr_token = self.current_token
          try:
              self.eat(INTEGER)
              result += str(curr_token.value)
          except InterpreterParseError as e:
              return int(''.join(result))

spaces

Rather than have the interpreter process spaces, I am going to punt and remove them prior to processing. We can't allow spaces just anywhere however ('1 0 + 3' for example). For a space character to be legal both of its neighboors must be either a space or different token from one another. Here's how I'm doing it:

@staticmethod
def strip_text(text):
    """
    strips valid empty space from expression
    text: expression string
    returns: expression with valid spaces removed
    """
    split_expr = text.split('+') # split expression on plus, getting list of two expressions.
    stripped = [ char.strip() for char in split_expr] # strip leading and trailing white space
    return '+'.join(stripped) # join expressions together with a plus in the middle

Support addition or subtraction

I may have gone overboard on this one :) I wanted to generalize the concept of an operator which takes two numbers and, using a common interface returns the result. Further, I wanted to later be able to construct an AST of operator nodes and integer leaf nodes. All of these nodes respond to value which allows for computing an expression by calling value on the root and recursivly evaluating each node. Here's what I did.

First I extracted the Token class, to keep as my base class

class Token(object):
    """
    an abstract class representing all tokens
    Inheriting classes must:
        - define a self.type in their __init__
        - implement a `value` property
    """
    def __str__(self):
        """String representation of the class instance.
        Examples:
            Token(INTEGER, 3)
            Token(PLUS '+')
        """
        return 'Token({type}, {value})'.format(
            type=self.type,
            value=repr(self.value)
        )

    def __repr__(self):
        return self.__str__()

Next I defined an abstract OperatorToken which represents any binary operator.

class OperatorToken(Token):
    def __init__(self) -> None:
        __slots__ = 'left_value', 'right_value'
        self.left_value = None
        self.right_value = None

    def __str__(self):
        """String representation of the class instance.
        Examples:
            Token(INTEGER, 3)
            Token(PLUS '+')
        """
        return 'Token({type}, {left}, {right})'.format(
            type=self.type,
            left=self.left_value,
            right=self.right_value
        )

A general binary operator has a left_value and a right_value. The child class will implement value which determines the result.

Next, we've got two operators, addition and subtraction. Really their only differences are 1) they need to know their type and 2) they both need to implement a property value which returns either the sum or difference of their two values.

class AddToken(OperatorToken):
    def __init__(self) -> None:
        self.type = 'PLUS'
        super().__init__()

    @property
    def value(self):
        """
        returns the sum of left_value and right_value
        """
        return self.left_value.value + self.right_value.value

class SubtractToken(OperatorToken):
    def __init__(self) -> None:
        self.type = 'MINUS'
        super().__init__()

    @property
    def value(self):
        return self.left_value.value - self.right_value.value

Next, how to represent numbers? Really there's two different concepts to deal with. The first is an integer token, which is a single digit.

class IntToken(Token):
    """
    represents a single numeric character. Example '3'
    """
    def __init__(self, value: str) -> None:
        self.value = value
        self.type = INTEGER

Next is the concept of an entire number, which is composed of one or many digits. Let's think of that as an object which wraps a list of IntTokens

class IntWrapper(Token):
    """
    a list of IntTokens that represent an integer.
    """
    def __init__(self, tokens: List[IntToken] ) -> None:
        self.tokens = tokens
        self.type = INT_WRAPPER

    @property
    def value(self) -> int:
        number_strings = [int_token.value for int_token in self.tokens]
        return int(''.join(number_strings))

And finally, we have the EOF token. Not much to say there.

class EOFToken(Token):
    def __init__(self):
        self.type = EOF
        self.value = None

The update to expr in Interpreter looks like this

def expr(self):
    # set current token to the first token taken from the input
    self.current_token = self.get_next_token()

    left = self.eat_integers()

    # we expect the current token to be a '+' or '-' token
    operator = self.current_token
    self.eat(token.PLUS, token.MINUS)

    # we expect the current token to be a single-digit integer
    right = self.eat_integers()

    operator.left_value = left
    operator.right_value = right
    # after the above call the self.current_token is set to
    # EOF token
    return operator.value

Finally, I need to update my strip_text method to allow subtraction

def strip_text(self, text):
    """
    strips valid empty space from expression
    text: expression string
    returns: expression with valid spaces removed
    """

    try:
        operator = re.search(r'[\+-]', text).group(0) # find the operator, either a + or -
    except(AttributeError):
        raise self.error('expression does not contain an operator') # this error means neither operator was found. Throw our custom exception

    split_expr = text.split(operator)
    stripped = [ char.strip() for char in split_expr]
    return operator.join(stripped)

This architecture opens the door for constructing a more complex abstract syntax tree, with any number of binary operators.