INTRODUCTION TO RECURSIVE DESCENT PARSING
Recursive descent parsing represents one of the most intuitive and widely used approaches to syntax analysis in compiler construction. This parsing technique directly implements the grammar rules of a programming language through a collection of mutually recursive functions, where each function corresponds to a specific grammar production rule.
The fundamental principle underlying recursive descent parsing is the direct translation of context-free grammar rules into parsing functions. Each non-terminal symbol in the grammar becomes a function that recognizes and processes the corresponding syntactic construct. When a grammar rule references other non-terminals, the parsing function calls the appropriate functions recursively, creating a natural correspondence between the grammar structure and the parser implementation.
For the PyGo programming language, recursive descent parsing provides an excellent balance between implementation simplicity and parsing power. The technique naturally handles the hierarchical structure of programming languages while providing clear error reporting and recovery mechanisms. Unlike table-driven parsers that require complex preprocessing steps, recursive descent parsers can be implemented directly from grammar specifications with minimal tooling requirements.
The recursive descent approach offers several advantages for PyGo parser implementation. The parsing logic remains readable and maintainable because each function directly corresponds to a language construct. Error handling becomes straightforward because the parser can provide context-specific error messages based on the current parsing function. The parser can easily be extended to handle new language features by adding new functions or modifying existing ones.
However, recursive descent parsing also imposes certain constraints on the grammar design. The grammar must be free of left recursion, which would cause infinite recursion in the parsing functions. The grammar should also minimize ambiguity to ensure predictable parsing behavior. These constraints are generally not restrictive for most programming languages and can be addressed through careful grammar design.
GRAMMAR ANALYSIS AND TRANSFORMATION FOR RECURSIVE DESCENT
Before implementing a recursive descent parser for PyGo, we must analyze the language grammar to ensure it meets the requirements for this parsing technique. The grammar must be transformed to eliminate left recursion and reduce ambiguity while preserving the language semantics.
Left recursion occurs when a grammar rule begins with a reference to itself, either directly or indirectly. For example, a rule like "expression -> expression '+' term" would cause infinite recursion in a recursive descent parser because the parsing function would immediately call itself without consuming any input tokens.
The standard technique for eliminating left recursion involves transforming the grammar to use right recursion instead. The left-recursive rule "A -> A alpha | beta" becomes two rules: "A -> beta A'" and "A' -> alpha A' | epsilon". This transformation preserves the language recognized by the grammar while making it suitable for recursive descent parsing.
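To make the transformation concrete, the following sketch contrasts the two forms for a toy additive grammar: the left-recursive rule "expr -> expr '+' NUM | NUM" becomes "expr -> NUM ('+' NUM)*", which a recursive descent parser implements as a loop. The string-based token representation and the class name are illustrative assumptions, not part of the PyGo parser.

```java
import java.util.List;

// Illustrative sketch (not PyGo parser code): parsing the transformed,
// right-factored rule  expr -> NUM ('+' NUM)*  with a loop, instead of
// the left-recursive rule  expr -> expr '+' NUM | NUM, which would
// recurse forever without consuming input.
public class LeftRecursionDemo {
    // Parses tokens like ["1", "+", "2", "+", "3"] and returns a
    // left-associative parenthesization such as "((1+2)+3)".
    public static String parseExpr(List<String> tokens) {
        int pos = 0;
        String left = tokens.get(pos++);              // beta: the first NUM
        while (pos < tokens.size() && tokens.get(pos).equals("+")) {
            pos++;                                    // consume '+'
            String right = tokens.get(pos++);         // alpha: '+' NUM
            left = "(" + left + "+" + right + ")";    // grouping grows leftward
        }
        return left;
    }
}
```

Note that the loop reproduces exactly the language of the original rule while also yielding the left-associative grouping the original grammar implied.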
For PyGo expressions, we transform left-recursive arithmetic rules into a hierarchy of precedence levels. Instead of having a single expression rule that handles all operators, we create separate rules for each precedence level. This approach naturally implements operator precedence and associativity while eliminating left recursion.
The transformed PyGo grammar organizes expressions into multiple precedence levels. The entry-point rule handles logical OR, the lowest-precedence operator, and delegates in turn to logical AND, equality comparisons, relational comparisons, addition and subtraction, multiplication and division, unary operations, and finally primary expressions such as literals and identifiers.
Another important consideration is the handling of optional and repeated constructs. Grammar rules that include optional elements or repetition must be carefully implemented to avoid infinite loops or incorrect parsing behavior. The recursive descent parser uses lookahead to determine whether optional elements are present and implements repetition through iterative loops rather than recursion.
LOOKAHEAD AND PREDICTION IN RECURSIVE DESCENT PARSING
Effective recursive descent parsing relies on lookahead to make parsing decisions without backtracking. Lookahead involves examining upcoming tokens in the input stream to determine which grammar rule to apply or which parsing path to follow. The amount of lookahead required depends on the grammar structure and the degree of ambiguity in the language.
For most programming language constructs, single-token lookahead suffices to make correct parsing decisions. The parser examines the current token to determine which production rule to apply. For example, when parsing a statement, the parser can examine the first token to distinguish between variable declarations, assignment statements, control flow statements, and expression statements.
PyGo statement parsing demonstrates effective use of single-token lookahead. A statement beginning with the "var" keyword indicates a variable declaration. A statement beginning with "if" indicates a conditional statement. A statement beginning with "while" indicates a loop statement. An identifier followed by an assignment operator indicates an assignment statement. Any other pattern indicates an expression statement.
In some cases, the parser may need to examine multiple tokens to make correct decisions. This situation arises when different language constructs share common prefixes. For PyGo function calls and variable references, both begin with an identifier, but function calls include parentheses while variable references do not. The parser must look ahead to the token following the identifier to make the correct choice.
The implementation of lookahead in recursive descent parsers typically involves maintaining a current token position and providing functions to peek at upcoming tokens without consuming them. The parser advances the token position only when it has confirmed the correct parsing path. This approach ensures that tokens are consumed in the correct order and that parsing errors can be detected and reported accurately.
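A minimal sketch of such a buffer follows. Modeling the lexer as an Iterator over strings and the class name TokenBuffer are assumptions made so the sketch stays self-contained; the real PyGo parser would buffer Token objects from its lexer.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical lookahead buffer between the lexer and the parser.
// The lexer is modeled as an Iterator<String> for brevity.
public class TokenBuffer {
    private final Iterator<String> source;
    private final List<String> buffered = new ArrayList<>();

    public TokenBuffer(Iterator<String> source) {
        this.source = source;
    }

    // Examine the k-th upcoming token (k = 0 is the current token)
    // without consuming anything.
    public String peek(int k) {
        while (buffered.size() <= k && source.hasNext()) {
            buffered.add(source.next());
        }
        return buffered.size() > k ? buffered.get(k) : "EOF";
    }

    // Consume and return the current token; the position advances only
    // once the parsing path has been confirmed.
    public String advance() {
        peek(0); // ensure at least one token is buffered
        return buffered.isEmpty() ? "EOF" : buffered.remove(0);
    }
}
```

With this interface, distinguishing a function call from a variable reference reduces to checking whether peek(1) is an opening parenthesis before any token is consumed.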
Predictive parsing tables can be constructed for more complex lookahead scenarios, but PyGo's grammar structure generally allows for straightforward lookahead decisions. The key is to design the grammar so that different constructs can be distinguished by examining a small number of tokens at the beginning of each construct.
ERROR HANDLING AND RECOVERY STRATEGIES
Robust error handling is crucial for practical parser implementation because programmers frequently write syntactically incorrect code during development. A good parser should detect errors accurately, report them clearly, and attempt to recover gracefully to find additional errors in the same compilation unit.
Recursive descent parsers provide excellent opportunities for context-specific error reporting because each parsing function knows exactly what syntactic construct it is trying to parse. When an error occurs, the parser can provide detailed information about what was expected and what was actually encountered. This contextual information helps programmers understand and fix syntax errors quickly.
The basic error detection strategy involves checking whether the current token matches the expected token at each point in the parsing process. When a mismatch occurs, the parser reports an error with information about the expected token, the actual token, and the current parsing context. The error message should include the source location where the error was detected.
Error recovery is more challenging because the parser must decide how to continue parsing after encountering an error. The goal is to resynchronize the parser with the input stream so that it can detect additional errors. Several recovery strategies are commonly used in recursive descent parsers.
Panic mode recovery involves skipping tokens until the parser finds a synchronization point where it can resume normal parsing. Synchronization points are typically chosen to be tokens that mark the beginning or end of major syntactic constructs. For PyGo, good synchronization points include semicolons, braces, and keywords that begin statements or declarations.
Phrase-level recovery attempts to make local corrections to the input stream by inserting, deleting, or replacing tokens. This approach can handle common errors like missing semicolons or mismatched parentheses. However, phrase-level recovery must be implemented carefully to avoid making incorrect assumptions about the programmer's intent.
The PyGo parser implements a combination of error detection and panic mode recovery. When an error is detected, the parser reports the error with detailed context information and then skips tokens until it finds a suitable synchronization point. This approach allows the parser to continue processing the input and potentially find additional errors.
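One possible shape for the synchronize routine is sketched below. It operates on a plain list of token-type names so it is self-contained; in the actual parser the same loop would call advance() and test currentToken.type. The particular synchronization set is an assumption based on the constructs described above.

```java
import java.util.List;
import java.util.Set;

// Hypothetical panic-mode recovery helper: skip forward until a token
// that plausibly begins or ends a statement or declaration.
public class PanicRecovery {
    private static final Set<String> SYNC_POINTS = Set.of(
            "SEMICOLON", "RIGHT_BRACE", "FUNC", "VAR",
            "IF", "WHILE", "FOR", "RETURN", "EOF");

    // Returns the index of the first synchronization point at or after
    // pos, or tokens.size() if none remains.
    public static int synchronize(List<String> tokenTypes, int pos) {
        while (pos < tokenTypes.size() && !SYNC_POINTS.contains(tokenTypes.get(pos))) {
            pos++;
        }
        return pos;
    }
}
```

Skipping to a statement-leading keyword rather than an arbitrary token keeps a single typo from cascading into a flood of spurious errors.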
ABSTRACT SYNTAX TREE CONSTRUCTION DURING PARSING
One of the primary goals of syntax analysis is to construct an abstract syntax tree that represents the structure of the parsed program. In recursive descent parsing, AST construction can be integrated naturally into the parsing process, with each parsing function responsible for creating the appropriate AST nodes for the constructs it recognizes.
The AST construction process follows the structure of the grammar rules. When a parsing function successfully recognizes a syntactic construct, it creates an AST node representing that construct and populates it with information from the parsed tokens and child nodes returned by recursive calls to other parsing functions.
For PyGo expressions, the parsing functions create expression nodes that capture the operator, operands, and source location information. Binary expression nodes contain references to left and right operand nodes, while unary expression nodes contain a reference to a single operand node. Literal expression nodes contain the literal value and type information.
Statement parsing functions create statement nodes that represent the various PyGo statement types. Assignment statement nodes contain the variable name and the expression being assigned. Conditional statement nodes contain the condition expression and the statement blocks for the then and else branches. Loop statement nodes contain the loop condition and body statements.
Declaration parsing functions create declaration nodes for functions and variables. Function declaration nodes contain the function name, parameter list, return type, and body statements. Variable declaration nodes contain the variable name, type, and optional initialization expression.
The recursive structure of the parsing functions naturally creates the hierarchical structure of the AST. When a parsing function calls other parsing functions to handle sub-constructs, the returned AST nodes become children of the node being constructed by the calling function. This process continues recursively until the entire program has been parsed and a complete AST has been constructed.
Error handling during AST construction requires careful consideration of partial or invalid constructs. When parsing errors occur, the parser may need to create placeholder nodes or skip portions of the AST to maintain structural integrity. The error recovery strategy must ensure that the AST remains in a consistent state even when parsing errors are encountered.
IMPLEMENTING PARSING FUNCTIONS FOR PYGO CONSTRUCTS
The implementation of recursive descent parsing functions follows a systematic approach where each function corresponds to a specific grammar rule. The function examines the current token to determine which production to apply and then processes the tokens according to the selected production rule.
Expression parsing functions implement the operator precedence hierarchy through a series of functions that call each other in order of decreasing precedence. The highest precedence operations are handled by functions lower in the call chain, ensuring that they bind more tightly than lower precedence operations.
Here is an example of how expression parsing functions are structured:
Expression parseExpression() {
    return parseLogicalOr();
}

Expression parseLogicalOr() {
    Expression left = parseLogicalAnd();
    while (currentToken.type == OR) {
        Token operator = currentToken;
        advance();
        Expression right = parseLogicalAnd();
        left = new BinaryExpression(left, operator, right);
    }
    return left;
}
This pattern continues through all precedence levels, with each function handling operators at its assigned precedence level and calling the next higher precedence function to parse operands.
Statement parsing functions use lookahead to determine the statement type and then call the appropriate specialized parsing function. The main statement parsing function acts as a dispatcher that examines the current token and delegates to specific statement parsers.
Statement parseStatement() {
    switch (currentToken.type) {
        case VAR:
            return parseVariableDeclaration();
        case IF:
            return parseIfStatement();
        case WHILE:
            return parseWhileStatement();
        case IDENTIFIER:
            return parseAssignmentOrExpression();
        default:
            return parseExpressionStatement();
    }
}
Function declaration parsing demonstrates the handling of complex constructs with multiple components. The function must parse the function keyword, name, parameter list, optional return type, and body block.
FunctionDeclaration parseFunctionDeclaration() {
    expect(FUNC);
    String name = expectIdentifier();
    expect(LEFT_PAREN);
    List<Parameter> parameters = parseParameterList();
    expect(RIGHT_PAREN);
    Type returnType = null;
    if (currentToken.type == ARROW) {
        advance();
        returnType = parseType();
    }
    expect(COLON);
    Block body = parseBlock();
    return new FunctionDeclaration(name, parameters, returnType, body);
}
The expect function verifies that the current token matches the expected type and advances to the next token. If the token does not match, an error is reported. The expectIdentifier function is a specialized version that expects an identifier token and returns its text value.
HANDLING OPERATOR PRECEDENCE AND ASSOCIATIVITY
Operator precedence and associativity are critical aspects of expression parsing that must be handled correctly to ensure that expressions are evaluated in the intended order. Recursive descent parsers implement precedence through the structure of the parsing functions, with higher precedence operations handled by functions deeper in the call chain.
The precedence hierarchy for PyGo expressions follows standard mathematical and logical conventions. Parentheses have the highest precedence and can override the default precedence order. Unary operators like negation and logical NOT have higher precedence than binary operators. Among binary operators, multiplication and division have higher precedence than addition and subtraction.
Associativity determines how operators of the same precedence level are grouped when they appear in sequence. Most PyGo operators are left-associative, meaning that expressions like "a + b + c" are evaluated as "(a + b) + c". The recursive descent parser implements left associativity through iterative loops that build left-associative expression trees.
Expression parseAdditive() {
    Expression left = parseMultiplicative();
    while (currentToken.type == PLUS || currentToken.type == MINUS) {
        Token operator = currentToken;
        advance();
        Expression right = parseMultiplicative();
        left = new BinaryExpression(left, operator, right);
    }
    return left;
}
This implementation creates a left-associative tree by making each new operation the parent of the previous result. The left operand of each new operation is the result of the previous operations, while the right operand is parsed fresh from the input.
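The effect is easy to verify with a stripped-down evaluator. The string-token representation below is an assumption for illustration: evaluating "10 - 4 - 3" left-associatively yields 3, whereas right-associative grouping would yield 9.

```java
import java.util.List;

// Illustrative evaluator for the additive level only, showing that the
// iterative loop produces left-associative grouping.
public class AssociativityDemo {
    public static int evalAdditive(List<String> tokens) {
        int pos = 0;
        int left = Integer.parseInt(tokens.get(pos++));
        while (pos < tokens.size()) {
            String op = tokens.get(pos++);
            int right = Integer.parseInt(tokens.get(pos++));
            left = op.equals("+") ? left + right : left - right; // fold leftward
        }
        return left;
    }
}
```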
Right-associative operators would be implemented using right recursion instead of iteration. However, PyGo does not include any right-associative operators, so this technique is not needed for the PyGo parser.
The precedence hierarchy is enforced by the calling relationships between parsing functions. Functions handling lower precedence operations call functions handling higher precedence operations to parse their operands. This ensures that higher precedence operations are parsed first and become children of lower precedence operations in the AST.
PARSING CONTROL FLOW CONSTRUCTS
Control flow statements in PyGo include conditional statements, loops, and function calls. These constructs have complex syntax that requires careful parsing to handle all the components correctly and provide good error messages when syntax errors occur.
Conditional statements begin with the "if" keyword followed by a condition expression, a colon, and a statement block. An optional "else" clause can follow with its own statement block. The parser must handle both the simple if statement and the if-else statement variants.
IfStatement parseIfStatement() {
    expect(IF);
    Expression condition = parseExpression();
    expect(COLON);
    Block thenBlock = parseBlock();
    Block elseBlock = null;
    if (currentToken.type == ELSE) {
        advance();
        expect(COLON);
        elseBlock = parseBlock();
    }
    return new IfStatement(condition, thenBlock, elseBlock);
}
While loops follow a similar pattern with a condition expression and a body block. The parser must ensure that both components are present and properly formatted.
WhileStatement parseWhileStatement() {
    expect(WHILE);
    Expression condition = parseExpression();
    expect(COLON);
    Block body = parseBlock();
    return new WhileStatement(condition, body);
}
For loops in PyGo have a more complex structure with a loop variable, an initial value expression, a condition expression, an increment assignment, and a body block. The parser must handle all these components and ensure they are separated by the correct punctuation.
ForStatement parseForStatement() {
    expect(FOR);
    String iterator = expectIdentifier();
    expect(EQUALS);
    Expression initial = parseExpression();
    expect(SEMICOLON);
    Expression condition = parseExpression();
    expect(SEMICOLON);
    Assignment increment = parseAssignment();
    expect(COLON);
    Block body = parseBlock();
    return new ForStatement(iterator, initial, condition, increment, body);
}
Block statements are fundamental building blocks used by many other constructs. A block consists of an opening brace, a sequence of statements, and a closing brace. The parser must handle empty blocks and blocks with multiple statements.
Block parseBlock() {
    expect(LEFT_BRACE);
    List<Statement> statements = new ArrayList<>();
    while (currentToken.type != RIGHT_BRACE && currentToken.type != EOF) {
        statements.add(parseStatement());
    }
    expect(RIGHT_BRACE);
    return new Block(statements);
}
FUNCTION AND VARIABLE DECLARATION PARSING
Function declarations are among the most complex constructs in PyGo because they include multiple components: the function name, parameter list, optional return type, and function body. The parser must handle each component correctly and provide good error messages when any component is malformed.
The parameter list parsing requires special attention because it involves a comma-separated list of parameter declarations, each with a name and type. The parser must handle empty parameter lists, single parameters, and multiple parameters with proper comma separation.
List<Parameter> parseParameterList() {
    List<Parameter> parameters = new ArrayList<>();
    if (currentToken.type != RIGHT_PAREN) {
        parameters.add(parseParameter());
        while (currentToken.type == COMMA) {
            advance();
            parameters.add(parseParameter());
        }
    }
    return parameters;
}

Parameter parseParameter() {
    String name = expectIdentifier();
    expect(COLON);
    Type type = parseType();
    return new Parameter(name, type);
}
Variable declarations can appear at the global level or within function bodies. They include a variable name, type specification, and optional initialization expression. The parser must handle both initialized and uninitialized variable declarations.
VariableDeclaration parseVariableDeclaration() {
    expect(VAR);
    String name = expectIdentifier();
    expect(COLON);
    Type type = parseType();
    Expression initializer = null;
    if (currentToken.type == EQUALS) {
        advance();
        initializer = parseExpression();
    }
    expectOptionalSemicolon();
    return new VariableDeclaration(name, type, initializer);
}
Type parsing is relatively straightforward in PyGo because the language includes only basic types. The parser examines the current token to determine which type is being specified and creates the appropriate type node.
Type parseType() {
    switch (currentToken.type) {
        case INT_TYPE:
            advance();
            return new IntType();
        case FLOAT_TYPE:
            advance();
            return new FloatType();
        case STRING_TYPE:
            advance();
            return new StringType();
        case BOOL_TYPE:
            advance();
            return new BoolType();
        default:
            error("Expected type specification");
            return new IntType(); // Default for error recovery
    }
}
INTEGRATION WITH LEXICAL ANALYSIS
The recursive descent parser must integrate closely with the lexical analyzer to obtain tokens from the input stream. The parser maintains a current token and provides functions to advance to the next token and to examine upcoming tokens without consuming them.
The token advancement mechanism is fundamental to parser operation. The parser calls the advance function to move to the next token after successfully matching the current token. The advance function obtains the next token from the lexical analyzer and updates the parser's current token.
void advance() {
    if (currentToken.type != EOF) {
        currentToken = lexer.nextToken();
    }
}
The parser initialization process involves obtaining the first token from the lexical analyzer and setting up the parsing state. The parser must be prepared to handle the case where the input is empty or contains only whitespace and comments.
void initializeParser(Lexer lexer) {
    this.lexer = lexer;
    this.currentToken = lexer.nextToken();
    this.errors = new ArrayList<>();
}
Token matching functions provide a convenient way to verify that the current token matches expectations and to advance to the next token. These functions form the foundation of the parsing process and are used extensively throughout the parser implementation.
void expect(TokenType expectedType) {
    if (currentToken.type == expectedType) {
        advance();
    } else {
        error("Expected " + expectedType + " but found " + currentToken.type);
    }
}

String expectIdentifier() {
    if (currentToken.type == IDENTIFIER) {
        String value = currentToken.text;
        advance();
        return value;
    } else {
        error("Expected identifier but found " + currentToken.type);
        return ""; // Default for error recovery
    }
}
COMPLETE RECURSIVE DESCENT PARSER IMPLEMENTATION
The following section presents a complete recursive descent parser implementation for PyGo that demonstrates all the concepts discussed in this article. The parser integrates lexical analysis, syntax analysis, AST construction, and error handling into a cohesive system.
import java.util.*;

public class PyGoRecursiveDescentParser {
    private PyGoLexer lexer;
    private Token currentToken;
    private List<ParseError> errors;

    public PyGoRecursiveDescentParser(PyGoLexer lexer) {
        this.lexer = lexer;
        this.errors = new ArrayList<>();
        this.currentToken = lexer.nextToken();
    }

    public ParseResult parse() {
        try {
            ProgramNode program = parseProgram();
            return new ParseResult(program, errors);
        } catch (ParseException e) {
            errors.add(new ParseError(currentToken.line, currentToken.column, e.getMessage()));
            return new ParseResult(null, errors);
        }
    }

    private ProgramNode parseProgram() {
        ProgramNode program = new ProgramNode(currentToken.line, currentToken.column);
        while (currentToken.type != TokenType.EOF) {
            try {
                if (currentToken.type == TokenType.FUNC) {
                    program.addDeclaration(parseFunctionDeclaration());
                } else if (currentToken.type == TokenType.VAR) {
                    program.addDeclaration(parseVariableDeclaration());
                } else {
                    error("Expected function or variable declaration");
                    synchronize();
                }
            } catch (ParseException e) {
                errors.add(new ParseError(currentToken.line, currentToken.column, e.getMessage()));
                synchronize();
            }
        }
        return program;
    }

    private FunctionDeclarationNode parseFunctionDeclaration() {
        int line = currentToken.line;
        int column = currentToken.column;
        expect(TokenType.FUNC);
        String name = expectIdentifier();
        expect(TokenType.LEFT_PAREN);
        List<ParameterNode> parameters = parseParameterList();
        expect(TokenType.RIGHT_PAREN);
        TypeNode returnType = null;
        if (currentToken.type == TokenType.ARROW) {
            advance();
            returnType = parseType();
        }
        expect(TokenType.COLON);
        BlockNode body = parseBlock();
        FunctionDeclarationNode function = new FunctionDeclarationNode(line, column, name);
        for (ParameterNode param : parameters) {
            function.addParameter(param);
        }
        function.setReturnType(returnType);
        function.setBody(body);
        return function;
    }
    private List<ParameterNode> parseParameterList() {
        List<ParameterNode> parameters = new ArrayList<>();
        if (currentToken.type != TokenType.RIGHT_PAREN) {
            parameters.add(parseParameter());
            while (currentToken.type == TokenType.COMMA) {
                advance();
                parameters.add(parseParameter());
            }
        }
        return parameters;
    }

    private ParameterNode parseParameter() {
        int line = currentToken.line;
        int column = currentToken.column;
        String name = expectIdentifier();
        expect(TokenType.COLON);
        TypeNode type = parseType();
        return new ParameterNode(line, column, name, type);
    }

    private VariableDeclarationNode parseVariableDeclaration() {
        int line = currentToken.line;
        int column = currentToken.column;
        expect(TokenType.VAR);
        String name = expectIdentifier();
        expect(TokenType.COLON);
        TypeNode type = parseType();
        ExpressionNode initializer = null;
        if (currentToken.type == TokenType.EQUALS) {
            advance();
            initializer = parseExpression();
        }
        expectOptionalSemicolon();
        VariableDeclarationNode variable = new VariableDeclarationNode(line, column, name, type);
        if (initializer != null) {
            variable.setInitializer(initializer);
        }
        return variable;
    }

    private TypeNode parseType() {
        int line = currentToken.line;
        int column = currentToken.column;
        String typeName;
        switch (currentToken.type) {
            case INT_TYPE:
                typeName = "int";
                advance();
                break;
            case FLOAT_TYPE:
                typeName = "float";
                advance();
                break;
            case STRING_TYPE:
                typeName = "string";
                advance();
                break;
            case BOOL_TYPE:
                typeName = "bool";
                advance();
                break;
            default:
                error("Expected type specification");
                typeName = "int";
                break;
        }
        return new TypeNode(line, column, typeName);
    }
    private BlockNode parseBlock() {
        int line = currentToken.line;
        int column = currentToken.column;
        expect(TokenType.LEFT_BRACE);
        BlockNode block = new BlockNode(line, column);
        while (currentToken.type != TokenType.RIGHT_BRACE && currentToken.type != TokenType.EOF) {
            try {
                block.addStatement(parseStatement());
            } catch (ParseException e) {
                errors.add(new ParseError(currentToken.line, currentToken.column, e.getMessage()));
                synchronize();
            }
        }
        expect(TokenType.RIGHT_BRACE);
        return block;
    }

    private StatementNode parseStatement() {
        switch (currentToken.type) {
            case VAR:
                return parseVariableDeclaration();
            case IF:
                return parseIfStatement();
            case WHILE:
                return parseWhileStatement();
            case FOR:
                return parseForStatement();
            case RETURN:
                return parseReturnStatement();
            case IDENTIFIER:
                return parseAssignmentOrExpressionStatement();
            default:
                return parseExpressionStatement();
        }
    }

    private IfStatementNode parseIfStatement() {
        int line = currentToken.line;
        int column = currentToken.column;
        expect(TokenType.IF);
        ExpressionNode condition = parseExpression();
        expect(TokenType.COLON);
        BlockNode thenBlock = parseBlock();
        IfStatementNode ifStatement = new IfStatementNode(line, column, condition, thenBlock);
        if (currentToken.type == TokenType.ELSE) {
            advance();
            expect(TokenType.COLON);
            BlockNode elseBlock = parseBlock();
            ifStatement.setElseBlock(elseBlock);
        }
        return ifStatement;
    }

    private WhileStatementNode parseWhileStatement() {
        int line = currentToken.line;
        int column = currentToken.column;
        expect(TokenType.WHILE);
        ExpressionNode condition = parseExpression();
        expect(TokenType.COLON);
        BlockNode body = parseBlock();
        return new WhileStatementNode(line, column, condition, body);
    }

    private ForStatementNode parseForStatement() {
        int line = currentToken.line;
        int column = currentToken.column;
        expect(TokenType.FOR);
        String iterator = expectIdentifier();
        expect(TokenType.EQUALS);
        ExpressionNode initialValue = parseExpression();
        expect(TokenType.SEMICOLON);
        ExpressionNode condition = parseExpression();
        expect(TokenType.SEMICOLON);
        AssignmentStatementNode increment = parseAssignmentStatement();
        expect(TokenType.COLON);
        BlockNode body = parseBlock();
        return new ForStatementNode(line, column, iterator, initialValue, condition, increment, body);
    }

    private ReturnStatementNode parseReturnStatement() {
        int line = currentToken.line;
        int column = currentToken.column;
        expect(TokenType.RETURN);
        ExpressionNode expression = null;
        if (currentToken.type != TokenType.SEMICOLON &&
                currentToken.type != TokenType.RIGHT_BRACE &&
                currentToken.type != TokenType.EOF) {
            expression = parseExpression();
        }
        expectOptionalSemicolon();
        return new ReturnStatementNode(line, column, expression);
    }
    private StatementNode parseAssignmentOrExpressionStatement() {
        int line = currentToken.line;
        int column = currentToken.column;
        if (currentToken.type == TokenType.IDENTIFIER) {
            Token lookahead = lexer.peekToken();
            if (lookahead.type == TokenType.EQUALS) {
                return parseAssignmentStatement();
            }
        }
        return parseExpressionStatement();
    }

    private AssignmentStatementNode parseAssignmentStatement() {
        int line = currentToken.line;
        int column = currentToken.column;
        String identifier = expectIdentifier();
        expect(TokenType.EQUALS);
        ExpressionNode expression = parseExpression();
        expectOptionalSemicolon();
        return new AssignmentStatementNode(line, column, identifier, expression);
    }

    private ExpressionStatementNode parseExpressionStatement() {
        int line = currentToken.line;
        int column = currentToken.column;
        ExpressionNode expression = parseExpression();
        expectOptionalSemicolon();
        return new ExpressionStatementNode(line, column, expression);
    }

    private ExpressionNode parseExpression() {
        return parseLogicalOr();
    }

    private ExpressionNode parseLogicalOr() {
        ExpressionNode left = parseLogicalAnd();
        while (currentToken.type == TokenType.OR) {
            int line = currentToken.line;
            int column = currentToken.column;
            String operator = currentToken.text;
            advance();
            ExpressionNode right = parseLogicalAnd();
            left = new BinaryExpressionNode(line, column, left, operator, right);
        }
        return left;
    }

    private ExpressionNode parseLogicalAnd() {
        ExpressionNode left = parseEquality();
        while (currentToken.type == TokenType.AND) {
            int line = currentToken.line;
            int column = currentToken.column;
            String operator = currentToken.text;
            advance();
            ExpressionNode right = parseEquality();
            left = new BinaryExpressionNode(line, column, left, operator, right);
        }
        return left;
    }

    private ExpressionNode parseEquality() {
        ExpressionNode left = parseRelational();
        while (currentToken.type == TokenType.EQUAL_EQUAL || currentToken.type == TokenType.NOT_EQUAL) {
            int line = currentToken.line;
            int column = currentToken.column;
            String operator = currentToken.text;
            advance();
            ExpressionNode right = parseRelational();
            left = new BinaryExpressionNode(line, column, left, operator, right);
        }
        return left;
    }

    private ExpressionNode parseRelational() {
        ExpressionNode left = parseAdditive();
        while (currentToken.type == TokenType.LESS_THAN ||
                currentToken.type == TokenType.LESS_EQUAL ||
                currentToken.type == TokenType.GREATER_THAN ||
                currentToken.type == TokenType.GREATER_EQUAL) {
            int line = currentToken.line;
            int column = currentToken.column;
            String operator = currentToken.text;
            advance();
            ExpressionNode right = parseAdditive();
            left = new BinaryExpressionNode(line, column, left, operator, right);
        }
        return left;
    }
    private ExpressionNode parseAdditive() {
        ExpressionNode left = parseMultiplicative();
        while (currentToken.type == TokenType.PLUS || currentToken.type == TokenType.MINUS) {
            int line = currentToken.line;
            int column = currentToken.column;
            String operator = currentToken.text;
            advance();
            ExpressionNode right = parseMultiplicative();
            left = new BinaryExpressionNode(line, column, left, operator, right);
        }
        return left;
    }

    private ExpressionNode parseMultiplicative() {
        ExpressionNode left = parseUnary();
        while (currentToken.type == TokenType.MULTIPLY ||
                currentToken.type == TokenType.DIVIDE ||
                currentToken.type == TokenType.MODULO) {
            int line = currentToken.line;
            int column = currentToken.column;
            String operator = currentToken.text;
            advance();
            ExpressionNode right = parseUnary();
            left = new BinaryExpressionNode(line, column, left, operator, right);
        }
        return left;
    }

    private ExpressionNode parseUnary() {
        if (currentToken.type == TokenType.NOT || currentToken.type == TokenType.MINUS) {
            int line = currentToken.line;
            int column = currentToken.column;
            String operator = currentToken.text;
            advance();
            ExpressionNode operand = parseUnary();
            return new UnaryExpressionNode(line, column, operator, operand);
        }
        return parsePrimary();
    }

    private ExpressionNode parsePrimary() {
        int line = currentToken.line;
        int column = currentToken.column;
        switch (currentToken.type) {
            case INTEGER:
                int intValue = Integer.parseInt(currentToken.text);
                advance();
                return new LiteralExpressionNode(line, column, intValue, "int");
            case FLOAT:
                double floatValue = Double.parseDouble(currentToken.text);
                advance();
                return new LiteralExpressionNode(line, column, floatValue, "float");
            case STRING:
                String stringValue = currentToken.text.substring(1, currentToken.text.length() - 1);
                stringValue = processEscapeSequences(stringValue);
                advance();
                return new LiteralExpressionNode(line, column, stringValue, "string");
            case TRUE:
                advance();
                return new LiteralExpressionNode(line, column, true, "bool");
            case FALSE:
                advance();
                return new LiteralExpressionNode(line, column, false, "bool");
            case IDENTIFIER:
                String identifier = currentToken.text;
advance();
if (currentToken.type == TokenType.LEFT_PAREN) {
return parseFunctionCall(identifier, line, column);
} else {
return new IdentifierExpressionNode(line, column, identifier);
}
case LEFT_PAREN:
advance();
ExpressionNode expression = parseExpression();
expect(TokenType.RIGHT_PAREN);
return expression;
default:
error("Expected expression");
return new LiteralExpressionNode(line, column, 0, "int");
}
}
private FunctionCallExpressionNode parseFunctionCall(String functionName, int line, int column) {
expect(TokenType.LEFT_PAREN);
FunctionCallExpressionNode functionCall = new FunctionCallExpressionNode(line, column, functionName);
if (currentToken.type != TokenType.RIGHT_PAREN) {
functionCall.addArgument(parseExpression());
while (currentToken.type == TokenType.COMMA) {
advance();
functionCall.addArgument(parseExpression());
}
}
expect(TokenType.RIGHT_PAREN);
return functionCall;
}
// Convert the escape sequences \n, \t, \r, \\ and \" inside a raw string
// literal to their actual character values; unknown sequences pass through unchanged.
private String processEscapeSequences(String str) {
StringBuilder result = new StringBuilder();
for (int i = 0; i < str.length(); i++) {
char c = str.charAt(i);
if (c == '\\' && i + 1 < str.length()) {
char next = str.charAt(i + 1);
switch (next) {
case 'n':
result.append('\n');
break;
case 't':
result.append('\t');
break;
case 'r':
result.append('\r');
break;
case '\\':
result.append('\\');
break;
case '"':
result.append('"');
break;
default:
result.append(c);
result.append(next);
break;
}
i++;
} else {
result.append(c);
}
}
return result.toString();
}
private void advance() {
if (currentToken.type != TokenType.EOF) {
currentToken = lexer.nextToken();
}
}
private void expect(TokenType expectedType) {
if (currentToken.type == expectedType) {
advance();
} else {
error("Expected " + expectedType + " but found " + currentToken.type);
}
}
private String expectIdentifier() {
if (currentToken.type == TokenType.IDENTIFIER) {
String value = currentToken.text;
advance();
return value;
} else {
error("Expected identifier but found " + currentToken.type);
return "";
}
}
private void expectOptionalSemicolon() {
if (currentToken.type == TokenType.SEMICOLON) {
advance();
}
}
private void error(String message) {
throw new ParseException(message);
}
// Panic-mode error recovery: discard tokens until a likely statement
// boundary (a semicolon or the start of a new declaration or statement).
private void synchronize() {
advance();
while (currentToken.type != TokenType.EOF) {
if (currentToken.type == TokenType.SEMICOLON) {
advance();
return;
}
switch (currentToken.type) {
case FUNC:
case VAR:
case IF:
case WHILE:
case FOR:
case RETURN:
return;
}
advance();
}
}
public static class ParseResult {
private ProgramNode program;
private List<ParseError> errors;
public ParseResult(ProgramNode program, List<ParseError> errors) {
this.program = program;
this.errors = new ArrayList<>(errors);
}
public ProgramNode getProgram() { return program; }
public List<ParseError> getErrors() { return new ArrayList<>(errors); }
public boolean hasErrors() { return !errors.isEmpty(); }
public boolean isSuccessful() { return program != null && errors.isEmpty(); }
}
public static class ParseError {
private int line;
private int column;
private String message;
public ParseError(int line, int column, String message) {
this.line = line;
this.column = column;
this.message = message;
}
public int getLine() { return line; }
public int getColumn() { return column; }
public String getMessage() { return message; }
@Override
public String toString() {
return "Parse error at line " + line + ", column " + column + ": " + message;
}
}
public static class ParseException extends RuntimeException {
public ParseException(String message) {
super(message);
}
}
}
This complete recursive descent parser implementation demonstrates all the concepts discussed in this article. The parser handles the full PyGo language syntax while providing robust error handling and recovery. The implementation shows how recursive descent parsing naturally translates grammar rules into parsing functions and how the technique can be used to build practical parsers for real programming languages.
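To see the grammar-to-function mapping in isolation, here is a stripped-down, self-contained sketch of the same precedence-climbing structure the PyGo parser uses, reduced to a two-level arithmetic evaluator. The class and method names are illustrative only and are not part of the PyGo parser itself:

```java
// Minimal recursive descent sketch: each method mirrors one precedence level,
// just as parseAdditive/parseMultiplicative/parsePrimary do in the full parser.
public class MiniExpr {
    private final String src;
    private int pos = 0;

    public MiniExpr(String src) { this.src = src; }

    public static int eval(String input) {
        return new MiniExpr(input).parseAdditive();
    }

    // additive -> multiplicative (('+' | '-') multiplicative)*
    private int parseAdditive() {
        int left = parseMultiplicative();
        while (peek() == '+' || peek() == '-') {
            char op = next();
            int right = parseMultiplicative();
            left = (op == '+') ? left + right : left - right;  // left-associative
        }
        return left;
    }

    // multiplicative -> primary (('*' | '/') primary)*
    private int parseMultiplicative() {
        int left = parsePrimary();
        while (peek() == '*' || peek() == '/') {
            char op = next();
            int right = parsePrimary();
            left = (op == '*') ? left * right : left / right;
        }
        return left;
    }

    // primary -> NUMBER | '(' additive ')'
    private int parsePrimary() {
        if (peek() == '(') {
            next();                      // consume '('
            int value = parseAdditive();
            next();                      // consume ')'
            return value;
        }
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        return Integer.parseInt(src.substring(start, pos));
    }

    private char peek() {
        skipSpaces();
        return pos < src.length() ? src.charAt(pos) : '\0';
    }

    private char next() {
        skipSpaces();
        return src.charAt(pos++);
    }

    private void skipSpaces() {
        while (pos < src.length() && src.charAt(pos) == ' ') pos++;
    }

    public static void main(String[] args) {
        System.out.println(MiniExpr.eval("1 + 2 * 3"));      // prints 7
        System.out.println(MiniExpr.eval("(1 + 2) * 3"));    // prints 9
    }
}
```

Note how the while loop at each level builds a left-associative result, exactly like the BinaryExpressionNode loops above, while the higher-precedence operand is obtained by calling the next method in the chain.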
The parser integrates seamlessly with the lexical analyzer and produces abstract syntax trees that can be used by subsequent compiler phases. The error handling and recovery mechanisms ensure that the parser can continue processing input even when syntax errors are encountered, allowing it to find and report multiple errors in a single compilation pass.
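The multiple-error workflow can also be sketched in isolation. The toy "parser" below accepts only statements of the form NUMBER ';'; on a bad statement it records the error and skips ahead past the next semicolon, the same panic-mode pattern the synchronize() method implements. All names here are illustrative, not part of the PyGo parser:

```java
import java.util.ArrayList;
import java.util.List;

// Toy panic-mode recovery over a token list: a "statement" is NUMBER ';'.
// On an unexpected token we record an error and skip past the next ';',
// so several errors can be reported in a single pass.
public class RecoveryDemo {
    private final String[] tokens;
    private int pos = 0;
    private final List<String> errors = new ArrayList<>();

    public RecoveryDemo(String[] tokens) { this.tokens = tokens; }

    public List<String> parseProgram() {
        while (pos < tokens.length) {
            try {
                parseStatement();
            } catch (RuntimeException e) {
                errors.add(e.getMessage());
                synchronize();   // recover instead of aborting the whole parse
            }
        }
        return errors;
    }

    // statement -> NUMBER ';'
    private void parseStatement() {
        if (!tokens[pos].matches("\\d+")) {
            throw new RuntimeException(
                "token " + pos + ": expected number, found '" + tokens[pos] + "'");
        }
        pos++;
        if (pos >= tokens.length || !tokens[pos].equals(";")) {
            throw new RuntimeException("token " + pos + ": expected ';'");
        }
        pos++;
    }

    // Skip tokens until just past the next ';' (or end of input).
    private void synchronize() {
        while (pos < tokens.length && !tokens[pos++].equals(";")) { }
    }

    public static void main(String[] args) {
        String[] tokens = {"1", ";", "oops", ";", "2", ";", "bad", "worse", ";", "3", ";"};
        for (String err : new RecoveryDemo(tokens).parseProgram()) {
            System.out.println(err);   // reports two errors, then parsing completes
        }
    }
}
```

The key design point is that the recovery logic lives outside the recursive parsing methods: a failed statement unwinds via the exception, the driver records it, and synchronize() re-aligns the token stream before the next iteration.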
This ends the series on Compiler Construction. We've learned about lexers and parsers built using ANTLR v4, parse trees and abstract syntax trees, code generation backends (LLVM), optimizers, interpreters, and recursive descent parsing as an alternative to ANTLR-v4-generated parsers. Play around with these options, e.g., by extending PyGo, and don't forget to get and read the Dragon Book I recommended at the end of article 1.