Recursive Descent Parsing#

Note

Source: Adapted from recursion-book-rd-parser by George K. Thiruvathukal.

A recursive descent parser translates a token sequence into a tree using one function per grammar rule. When a grammar rule refers to another rule, the function calls that other function; when a rule refers to itself, the function recurses. This makes the correspondence between grammar and code almost mechanical.

We will build a calculator that parses and evaluates arithmetic expressions such as 2 + 3 * 4 and 2 ^ (3 + 1), correctly handling operator precedence and associativity.

The Grammar#

Operator precedence is encoded by layering rules. Lower-precedence operators appear higher in the grammar and call higher-precedence rules to collect their operands:

expr    -> term   (('+' | '-') term)*      addition / subtraction
term    -> unary  (('*' | '/') unary)*     multiplication / division
unary   -> ('+' | '-') power | power       unary sign
power   -> primary ('^' power)?            exponentiation
primary -> NUMBER | '(' expr ')'           atom or grouped expression

The * in (... term)* means “zero or more” and is handled by a loop. The ? in ('^' power)? means “zero or one” and is handled by an if. primary calling expr is mutual recursion: it re-enters the grammar at the top to evaluate a parenthesised sub-expression.

AST Nodes#

Each node in the abstract syntax tree (AST) is a NamedTuple:

class Number(NamedTuple):
    value: float


class UnaryOp(NamedTuple):
    op: str
    operand: 'Expr'


class BinOp(NamedTuple):
    left: 'Expr'
    op: str
    right: 'Expr'


Expr = Union[Number, UnaryOp, BinOp]

Number holds a literal value. UnaryOp holds an operator and its single operand. BinOp holds an operator and its two operands. Expr is a type alias that lets the type checker verify tree structure.

Tokenizing the Input#

Before parsing, the expression string is split into tokens — the smallest meaningful units (numbers and operators):

def tokenize_expr(code: str) -> List[Token]:
    """Return a list of (type, value) tokens from an arithmetic expression."""
    tokens: List[Token] = []
    stream = BytesIO(code.encode('utf-8')).readline
    for tok in tokenize.tokenize(stream):
        if tok.type == tokenize.NUMBER:
            tokens.append(('NUMBER', tok.string))
        elif tok.type == tokenize.OP and tok.string in ALLOWED_OPERATORS:
            tokens.append(('OP', tok.string))
    return tokens

Python’s built-in tokenize module does the heavy lifting. tokenize_expr filters the token stream to only numbers and the seven allowed operators, discarding whitespace and encoding metadata.

print(tokenize_expr("2 + 3 * 4"))

Output:

[('NUMBER', '2'), ('OP', '+'), ('NUMBER', '3'), ('OP', '*'), ('NUMBER', '4')]

The Parser#

The Parser class holds the token list and a position cursor. Each grammar rule becomes one method that advances the cursor and returns an AST node:

class Parser:
    """Recursive descent parser for arithmetic expressions.

    Grammar (each rule is one method):
        expr    -> term   (('+' | '-') term)*      left-associative
        term    -> unary  (('*' | '/') unary)*     left-associative
        unary   -> ('+' | '-') power | power
        power   -> primary ('^' power)?             right-associative
        primary -> NUMBER | '(' expr ')'
    """
    def __init__(self, tokens: List[Token]):
        self.tokens = tokens
        self.pos = 0

    def _current(self) -> Token:
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ('EOF', '')

    def _advance(self) -> None:
        self.pos += 1

    def _match(self, tok_type: str, value: str = None) -> bool:
        t, v = self._current()
        if t == tok_type and (value is None or v == value):
            self._advance()
            return True
        return False

    def parse(self) -> Expr:
        result = self.expr()
        if self._current()[0] != 'EOF':
            raise SyntaxError(f"Unexpected token: {self._current()}")
        return result

    def expr(self) -> Expr:
        node = self.term()
        while self._match('OP', '+') or self._match('OP', '-'):
            op = self.tokens[self.pos - 1][1]
            node = BinOp(node, op, self.term())
        return node

    def term(self) -> Expr:
        node = self.unary()
        while self._match('OP', '*') or self._match('OP', '/'):
            op = self.tokens[self.pos - 1][1]
            node = BinOp(node, op, self.unary())
        return node

    def unary(self) -> Expr:
        if self._match('OP', '+') or self._match('OP', '-'):
            op = self.tokens[self.pos - 1][1]
            return UnaryOp(op, self.unary())
        return self.power()

    def power(self) -> Expr:
        node = self.primary()
        if self._match('OP', '^'):
            node = BinOp(node, '^', self.power())  # right-associative: recurse
        return node

    def primary(self) -> Expr:
        if self._match('NUMBER'):
            return Number(float(self.tokens[self.pos - 1][1]))
        elif self._match('OP', '('):
            node = self.expr()                      # mutual recursion
            if not self._match('OP', ')'):
                raise SyntaxError("Expected ')'")
            return node
        raise SyntaxError(f"Unexpected token: {self._current()}")

The helper _match checks whether the current token matches a given type and value, advances if it does, and returns True or False. Each grammar method calls the next-higher-precedence method first, then loops or recurses over the operator it owns.

Left and Right Associativity#

Associativity controls how repeated equal-precedence operators group. 2 - 3 - 4 is left-associative: (2 - 3) - 4 = -5. 2 ^ 3 ^ 2 is right-associative: 2 ^ (3 ^ 2) = 512.

The grammar encodes this distinction directly in two patterns:

Left-associative operators (+, -, *, /): use a while loop that keeps consuming matching operators and builds the tree left-to-right.
Right-associative operators (^): use a single if with a recursive call, so the right operand is fully evaluated before it becomes the right child.

power demonstrates the recursive right-associative pattern:

def power(self) -> Expr:
    node = self.primary()
    if self._match('OP', '^'):
        node = BinOp(node, '^', self.power())  # right-associative: recurse
    return node

Calling self.power() on the right side builds the right subtree first, producing right-to-left grouping.

Evaluating the AST#

The evaluator walks the AST using the same recursive structure as the grammar: a node containing sub-nodes is evaluated by recursively evaluating those sub-nodes first:

def eval_expr(node: Expr) -> float:
    """Recursively evaluate an expression AST and return a float."""
    match node:
        case Number(value):
            return value
        case UnaryOp(op, operand):
            val = eval_expr(operand)
            match op:
                case '+': return +val
                case '-': return -val
                case _: raise ValueError(f"Unknown unary op: {op}")
        case BinOp(left, op, right):
            lval = eval_expr(left)
            rval = eval_expr(right)
            match op:
                case '+': return lval + rval
                case '-': return lval - rval
                case '*': return lval * rval
                case '/': return lval / rval
                case '^': return lval ** rval
                case _: raise ValueError(f"Unknown binary op: {op}")
        case _:
            raise TypeError(f"Unknown AST node: {node}")

Python 3.10+ match statements destructure each NamedTuple variant cleanly. The Number case returns the stored value directly. UnaryOp and BinOp each call eval_expr recursively on their operand(s) before applying the operator. This is structural recursion: the recursion follows the shape of the data, mirroring the way the parser built the tree in the first place.

Putting It Together#

A thin wrapper ties the three phases together:

def calc(expression: str) -> float:
    """Tokenize, parse, and evaluate an arithmetic expression string."""
    tokens = tokenize_expr(expression)
    ast = Parser(tokens).parse()
    return eval_expr(ast)

print(calc("2 + 3 * 4"))      # multiplication before addition
print(calc("(2 + 3) * 4"))    # parentheses override precedence
print(calc("2 ^ 3 ^ 2"))      # right-assoc: 2 ^ (3^2) = 2^9
print(calc("-5 + 2"))          # unary minus

Output:

Each phase is independently testable: the tokenizer by inspecting its output list, the parser by printing the AST, and the evaluator by constructing nodes by hand. Property-based tests can generate random valid expressions and compare calc() against Python’s own eval() to verify correctness across a wide range of inputs.

Recursive Descent Parsing

Contents

Recursive Descent Parsing#

The Grammar#

AST Nodes#

Tokenizing the Input#

The Parser#

Left and Right Associativity#

Evaluating the AST#

Putting It Together#