Recursive Descent Parsing#
Note
Source: Adapted from recursion-book-rd-parser by George K. Thiruvathukal.
A recursive descent parser translates a token sequence into a tree using one function per grammar rule. When a grammar rule refers to another rule, the function calls that other function; when a rule refers to itself, the function recurses. This makes the correspondence between grammar and code almost mechanical.
We will build a calculator that parses and evaluates arithmetic
expressions such as 2 + 3 * 4 and 2 ^ (3 + 1), correctly
handling operator precedence and associativity.
The Grammar#
Operator precedence is encoded by layering rules. Lower-precedence operators appear higher in the grammar and call higher-precedence rules to collect their operands:
expr -> term (('+' | '-') term)* addition / subtraction
term -> unary (('*' | '/') unary)* multiplication / division
unary -> ('+' | '-') power | power unary sign
power -> primary ('^' power)? exponentiation
primary -> NUMBER | '(' expr ')' atom or grouped expression
The * in (... term)* means “zero or more” and is handled by a
loop. The ? in ('^' power)? means “zero or one” and is handled
by an if. primary calling expr is mutual recursion: it
re-enters the grammar at the top to evaluate a parenthesised
sub-expression.
AST Nodes#
Each node in the abstract syntax tree (AST) is a NamedTuple:
class Number(NamedTuple):
value: float
class UnaryOp(NamedTuple):
op: str
operand: 'Expr'
class BinOp(NamedTuple):
left: 'Expr'
op: str
right: 'Expr'
Expr = Union[Number, UnaryOp, BinOp]
Number holds a literal value. UnaryOp holds an operator and its
single operand. BinOp holds an operator and its two operands.
Expr is a type alias that lets the type checker verify tree structure.
Tokenizing the Input#
Before parsing, the expression string is split into tokens — the smallest meaningful units (numbers and operators):
def tokenize_expr(code: str) -> List[Token]:
"""Return a list of (type, value) tokens from an arithmetic expression."""
tokens: List[Token] = []
stream = BytesIO(code.encode('utf-8')).readline
for tok in tokenize.tokenize(stream):
if tok.type == tokenize.NUMBER:
tokens.append(('NUMBER', tok.string))
elif tok.type == tokenize.OP and tok.string in ALLOWED_OPERATORS:
tokens.append(('OP', tok.string))
return tokens
Python’s built-in tokenize module does the heavy lifting.
tokenize_expr filters the token stream to only numbers and the seven
allowed operators, discarding whitespace and encoding metadata.
print(tokenize_expr("2 + 3 * 4"))
Output:
[('NUMBER', '2'), ('OP', '+'), ('NUMBER', '3'), ('OP', '*'), ('NUMBER', '4')]
The Parser#
The Parser class holds the token list and a position cursor. Each
grammar rule becomes one method that advances the cursor and returns an
AST node:
class Parser:
"""Recursive descent parser for arithmetic expressions.
Grammar (each rule is one method):
expr -> term (('+' | '-') term)* left-associative
term -> unary (('*' | '/') unary)* left-associative
unary -> ('+' | '-') power | power
power -> primary ('^' power)? right-associative
primary -> NUMBER | '(' expr ')'
"""
def __init__(self, tokens: List[Token]):
self.tokens = tokens
self.pos = 0
def _current(self) -> Token:
if self.pos < len(self.tokens):
return self.tokens[self.pos]
return ('EOF', '')
def _advance(self) -> None:
self.pos += 1
def _match(self, tok_type: str, value: str = None) -> bool:
t, v = self._current()
if t == tok_type and (value is None or v == value):
self._advance()
return True
return False
def parse(self) -> Expr:
result = self.expr()
if self._current()[0] != 'EOF':
raise SyntaxError(f"Unexpected token: {self._current()}")
return result
def expr(self) -> Expr:
node = self.term()
while self._match('OP', '+') or self._match('OP', '-'):
op = self.tokens[self.pos - 1][1]
node = BinOp(node, op, self.term())
return node
def term(self) -> Expr:
node = self.unary()
while self._match('OP', '*') or self._match('OP', '/'):
op = self.tokens[self.pos - 1][1]
node = BinOp(node, op, self.unary())
return node
def unary(self) -> Expr:
if self._match('OP', '+') or self._match('OP', '-'):
op = self.tokens[self.pos - 1][1]
return UnaryOp(op, self.unary())
return self.power()
def power(self) -> Expr:
node = self.primary()
if self._match('OP', '^'):
node = BinOp(node, '^', self.power()) # right-associative: recurse
return node
def primary(self) -> Expr:
if self._match('NUMBER'):
return Number(float(self.tokens[self.pos - 1][1]))
elif self._match('OP', '('):
node = self.expr() # mutual recursion
if not self._match('OP', ')'):
raise SyntaxError("Expected ')'")
return node
raise SyntaxError(f"Unexpected token: {self._current()}")
The helper _match checks whether the current token matches a given
type and value, advances if it does, and returns True or False.
Each grammar method calls the next-higher-precedence method first, then
loops or recurses over the operator it owns.
Left and Right Associativity#
Associativity controls how repeated equal-precedence operators group.
2 - 3 - 4 is left-associative: (2 - 3) - 4 = -5.
2 ^ 3 ^ 2 is right-associative: 2 ^ (3 ^ 2) = 512.
The grammar encodes this distinction directly in two patterns:
Left-associative operators (
+,-,*,/): use awhileloop that keeps consuming matching operators and builds the tree left-to-right.Right-associative operators (
^): use a singleifwith a recursive call, so the right operand is fully evaluated before it becomes the right child.
power demonstrates the recursive right-associative pattern:
def power(self) -> Expr:
node = self.primary()
if self._match('OP', '^'):
node = BinOp(node, '^', self.power()) # right-associative: recurse
return node
Calling self.power() on the right side builds the right subtree
first, producing right-to-left grouping.
Evaluating the AST#
The evaluator walks the AST using the same recursive structure as the grammar: a node containing sub-nodes is evaluated by recursively evaluating those sub-nodes first:
def eval_expr(node: Expr) -> float:
"""Recursively evaluate an expression AST and return a float."""
match node:
case Number(value):
return value
case UnaryOp(op, operand):
val = eval_expr(operand)
match op:
case '+': return +val
case '-': return -val
case _: raise ValueError(f"Unknown unary op: {op}")
case BinOp(left, op, right):
lval = eval_expr(left)
rval = eval_expr(right)
match op:
case '+': return lval + rval
case '-': return lval - rval
case '*': return lval * rval
case '/': return lval / rval
case '^': return lval ** rval
case _: raise ValueError(f"Unknown binary op: {op}")
case _:
raise TypeError(f"Unknown AST node: {node}")
Python 3.10+ match statements destructure each NamedTuple variant
cleanly. The Number case returns the stored value directly.
UnaryOp and BinOp each call eval_expr recursively on their
operand(s) before applying the operator. This is structural recursion:
the recursion follows the shape of the data, mirroring the way the parser
built the tree in the first place.
Putting It Together#
A thin wrapper ties the three phases together:
def calc(expression: str) -> float:
"""Tokenize, parse, and evaluate an arithmetic expression string."""
tokens = tokenize_expr(expression)
ast = Parser(tokens).parse()
return eval_expr(ast)
print(calc("2 + 3 * 4")) # multiplication before addition
print(calc("(2 + 3) * 4")) # parentheses override precedence
print(calc("2 ^ 3 ^ 2")) # right-assoc: 2 ^ (3^2) = 2^9
print(calc("-5 + 2")) # unary minus
Output:
14.0
20.0
512.0
-3.0
Each phase is independently testable: the tokenizer by inspecting its
output list, the parser by printing the AST, and the evaluator by
constructing nodes by hand. Property-based tests can generate random
valid expressions and compare calc() against Python’s own eval()
to verify correctness across a wide range of inputs.