INTRODUCTION TO SYNTAX ANALYSIS
The syntax analysis phase transforms the token stream produced by the lexer into an Abstract Syntax Tree (AST) that represents the hierarchical structure of the program. The parser verifies that the token sequence conforms to PyGo's grammar rules while building a tree structure that captures the relationships between different language constructs.
ANTLR v4 provides powerful parsing capabilities through its ALL(*) (adaptive LL(*)) algorithm, which handles direct left recursion and resolves ambiguities deterministically in favor of the first viable alternative. The parser generator creates efficient parsing code that includes sophisticated error recovery mechanisms and detailed error reporting.
For PyGo, the parser must handle function definitions, variable declarations, control flow statements, expressions with proper operator precedence, and type annotations. The resulting AST serves as the foundation for semantic analysis and code generation in subsequent compiler phases.
PARSER GRAMMAR FUNDAMENTALS
Parser rules in ANTLR begin with lowercase letters and define the syntactic structure of language constructs. These rules reference both lexer tokens and other parser rules to build hierarchical grammar specifications that capture the complete syntax of the programming language.
The PyGo parser grammar builds upon the lexer tokens defined in Article 2, combining them into meaningful syntactic patterns. Parser rules use ANTLR's extended Backus-Naur Form notation with additional features for handling precedence, associativity, and semantic actions.
ANTLR's parsing approach generates recursive descent parsers with unlimited lookahead capabilities. This allows the parser to handle complex syntactic constructs while providing excellent error recovery and meaningful error messages for malformed input.
PYGO PARSER GRAMMAR SPECIFICATION
The complete PyGo parser grammar defines the syntactic structure of all language constructs. We begin with the top-level program structure and work down to individual expressions and statements.
parser grammar PyGoParser;
options {
tokenVocab = PyGoLexer;
}
// Top-level program structure
program
: (functionDeclaration | variableDeclaration)* EOF
;
// Function declarations
functionDeclaration
: FUNC IDENTIFIER LEFT_PAREN parameterList? RIGHT_PAREN
(ARROW type)? COLON block
;
parameterList
: parameter (COMMA parameter)*
;
parameter
: IDENTIFIER COLON type
;
// Variable declarations
variableDeclaration
: VAR IDENTIFIER COLON type (EQUALS expression)? SEMICOLON?
;
The program rule serves as the entry point for parsing complete PyGo source files. It allows any number of function and variable declarations at the top level, reflecting PyGo's simple module structure.
Function declarations include optional parameter lists and return types, with the function body represented as a block statement. The parameter list uses standard comma-separated syntax with explicit type annotations for each parameter.
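The short PyGo fragment below is accepted by these top-level rules; it assumes the keyword and operator spellings (func, var, ->) defined by the Article 2 lexer:
var counter: int = 0
func add(a: int, b: int) -> int:
{
    return a + b
}
The variable declaration omits the optional semicolon, and the function combines a parameter list, an arrow return type, and a braced block, exactly as the grammar permits.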
TYPE SYSTEM GRAMMAR RULES
PyGo's type system requires grammar rules that handle the basic data types supported by the language. The type rules provide the foundation for static type checking in later compilation phases.
// Type specifications
type
: INT_TYPE
| FLOAT_TYPE
| STRING_TYPE
| BOOL_TYPE
;
The type rule currently handles only primitive types, but the grammar structure allows for easy extension to support more complex types such as arrays, structures, or function types in future language versions.
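As an illustration of such an extension (not part of the current PyGo grammar), a fixed-size array type could be added with one extra alternative; the LEFT_BRACKET and RIGHT_BRACKET tokens are assumed to exist in the lexer:
// Hypothetical extension: fixed-size array types such as [10]int
type
: INT_TYPE
| FLOAT_TYPE
| STRING_TYPE
| BOOL_TYPE
| LEFT_BRACKET INTEGER RIGHT_BRACKET type
;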
STATEMENT GRAMMAR RULES
Statements form the core of PyGo's imperative programming model. The parser must handle various statement types including assignments, control flow, function calls, and return statements.
// Statement rules
block
: LEFT_BRACE statement* RIGHT_BRACE
;
statement
: variableDeclaration
| assignmentStatement
| ifStatement
| whileStatement
| forStatement
| returnStatement
| expressionStatement
;
assignmentStatement
: IDENTIFIER EQUALS expression SEMICOLON?
;
ifStatement
: IF expression COLON block (ELSE COLON block)?
;
whileStatement
: WHILE expression COLON block
;
forStatement
: FOR IDENTIFIER EQUALS expression SEMICOLON expression SEMICOLON
assignmentStatement COLON block
;
returnStatement
: RETURN expression? SEMICOLON?
;
expressionStatement
: expression SEMICOLON?
;
The block rule defines code blocks using brace delimiters, providing clear scope boundaries for variable declarations and control flow structures. Statement rules handle the various imperative constructs available in PyGo.
The if statement includes optional else clauses, while the for statement follows a traditional three-part structure with initialization, condition, and increment components. Return statements allow optional expressions for functions that return values.
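The snippet below exercises these statement rules; it is illustrative PyGo code using the Article 2 token spellings:
func count_down(n: int):
{
    var i: int = 0
    for i = n; i > 0; i = i - 1:
    {
        print(i)
    }
    while n > 0:
    {
        n = n - 1
    }
    return
}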
EXPRESSION GRAMMAR WITH PRECEDENCE
Expression parsing requires careful handling of operator precedence and associativity to ensure that mathematical and logical expressions are parsed correctly according to standard conventions.
// Expression rules with precedence
expression
: orExpression
;
orExpression
: andExpression (OR andExpression)*
;
andExpression
: equalityExpression (AND equalityExpression)*
;
equalityExpression
: relationalExpression ((EQUAL_EQUAL | NOT_EQUAL) relationalExpression)*
;
relationalExpression
: additiveExpression ((LESS_THAN | LESS_EQUAL | GREATER_THAN | GREATER_EQUAL) additiveExpression)*
;
additiveExpression
: multiplicativeExpression ((PLUS | MINUS) multiplicativeExpression)*
;
multiplicativeExpression
: unaryExpression ((MULTIPLY | DIVIDE | MODULO) unaryExpression)*
;
unaryExpression
: (NOT | MINUS) unaryExpression
| primaryExpression
;
primaryExpression
: INTEGER
| FLOAT
| STRING
| TRUE
| FALSE
| IDENTIFIER
| functionCall
| LEFT_PAREN expression RIGHT_PAREN
;
functionCall
: IDENTIFIER LEFT_PAREN argumentList? RIGHT_PAREN
;
argumentList
: expression (COMMA expression)*
;
The expression grammar encodes operator precedence as a cascade of rules, one rule per precedence level, using repetition rather than left recursion. Logical OR has the lowest precedence, followed by AND, equality operators, relational operators, addition and subtraction, multiplication and division, and finally unary operators with the highest precedence.
Primary expressions include literals, identifiers, function calls, and parenthesized expressions. This structure ensures that complex expressions are parsed according to standard mathematical conventions while maintaining clear grammar rules.
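For example, under these rules the expression
a + b * 2 > 10 and not done
is grouped as ((a + (b * 2)) > 10) and (not done): multiplication binds before addition, the comparison is applied next, and the logical operators combine the results last.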
COMPLETE PYGO PARSER GRAMMAR
Here is the complete PyGo parser grammar that combines all the syntactic rules into a cohesive specification:
parser grammar PyGoParser;
options {
tokenVocab = PyGoLexer;
}
// Top-level program structure
program
: (functionDeclaration | variableDeclaration)* EOF
;
// Function declarations
functionDeclaration
: FUNC IDENTIFIER LEFT_PAREN parameterList? RIGHT_PAREN
(ARROW type)? COLON block
;
parameterList
: parameter (COMMA parameter)*
;
parameter
: IDENTIFIER COLON type
;
// Variable declarations
variableDeclaration
: VAR IDENTIFIER COLON type (EQUALS expression)? SEMICOLON?
;
// Type specifications
type
: INT_TYPE
| FLOAT_TYPE
| STRING_TYPE
| BOOL_TYPE
;
// Statement rules
block
: LEFT_BRACE statement* RIGHT_BRACE
;
statement
: variableDeclaration
| assignmentStatement
| ifStatement
| whileStatement
| forStatement
| returnStatement
| expressionStatement
;
assignmentStatement
: IDENTIFIER EQUALS expression SEMICOLON?
;
ifStatement
: IF expression COLON block (ELSE COLON block)?
;
whileStatement
: WHILE expression COLON block
;
forStatement
: FOR IDENTIFIER EQUALS expression SEMICOLON expression SEMICOLON
assignmentStatement COLON block
;
returnStatement
: RETURN expression? SEMICOLON?
;
expressionStatement
: expression SEMICOLON?
;
// Expression rules with precedence
expression
: orExpression
;
orExpression
: andExpression (OR andExpression)*
;
andExpression
: equalityExpression (AND equalityExpression)*
;
equalityExpression
: relationalExpression ((EQUAL_EQUAL | NOT_EQUAL) relationalExpression)*
;
relationalExpression
: additiveExpression ((LESS_THAN | LESS_EQUAL | GREATER_THAN | GREATER_EQUAL) additiveExpression)*
;
additiveExpression
: multiplicativeExpression ((PLUS | MINUS) multiplicativeExpression)*
;
multiplicativeExpression
: unaryExpression ((MULTIPLY | DIVIDE | MODULO) unaryExpression)*
;
unaryExpression
: (NOT | MINUS) unaryExpression
| primaryExpression
;
primaryExpression
: INTEGER
| FLOAT
| STRING
| TRUE
| FALSE
| IDENTIFIER
| functionCall
| LEFT_PAREN expression RIGHT_PAREN
;
functionCall
: IDENTIFIER LEFT_PAREN argumentList? RIGHT_PAREN
;
argumentList
: expression (COMMA expression)*
;
GENERATING THE PARSER CODE
To generate the parser implementation from the grammar specification, we use the ANTLR v4 tool much as we did for the lexer, with one addition: the -visitor option, because the AST builder presented later in this article relies on the generated visitor classes.
antlr4 -Dlanguage=Java -visitor PyGoParser.g4
This command generates Java source files including PyGoParser.java, which contains the main parser implementation, context classes for each grammar rule, and the PyGoParserVisitor and PyGoParserBaseVisitor classes enabled by the -visitor flag. The generated parser integrates seamlessly with the lexer created in Article 2.
The generated parser class extends ANTLR's base Parser class and provides methods for parsing token streams according to the defined grammar rules. Each grammar rule corresponds to a parsing method that returns a parse tree context object.
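As a minimal sketch of driving the generated classes directly, before any AST construction (the class names match the generated artifacts; the sample source string is illustrative):
import org.antlr.v4.runtime.*;

public class ParseTreeDemo {
    public static void main(String[] args) {
        String source = "func main(): { var x: int = 1 + 2 }";
        PyGoLexer lexer = new PyGoLexer(CharStreams.fromString(source));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        PyGoParser parser = new PyGoParser(tokens);
        // program() is the generated method for the grammar's start rule
        PyGoParser.ProgramContext tree = parser.program();
        // Print the parse tree in LISP-style notation for inspection
        System.out.println(tree.toStringTree(parser));
    }
}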
ABSTRACT SYNTAX TREE DESIGN
While ANTLR generates parse trees automatically, we need to design Abstract Syntax Tree classes that provide a cleaner representation of the program structure for subsequent compilation phases. The AST classes capture the essential information from parse trees while eliminating unnecessary syntactic details.
// Base AST node class
public abstract class ASTNode {
protected int lineNumber;
protected int columnNumber;
public ASTNode(int lineNumber, int columnNumber) {
this.lineNumber = lineNumber;
this.columnNumber = columnNumber;
}
public int getLineNumber() { return lineNumber; }
public int getColumnNumber() { return columnNumber; }
public abstract void accept(ASTVisitor visitor);
}
import java.util.ArrayList;
import java.util.List;
// Program node representing the entire source file
public class ProgramNode extends ASTNode {
private List<DeclarationNode> declarations;
public ProgramNode(int lineNumber, int columnNumber) {
super(lineNumber, columnNumber);
this.declarations = new ArrayList<>();
}
public void addDeclaration(DeclarationNode declaration) {
this.declarations.add(declaration);
}
public List<DeclarationNode> getDeclarations() {
return new ArrayList<>(declarations);
}
@Override
public void accept(ASTVisitor visitor) {
visitor.visitProgram(this);
}
}
// Base class for all declarations
public abstract class DeclarationNode extends ASTNode {
protected String identifier;
public DeclarationNode(int lineNumber, int columnNumber, String identifier) {
super(lineNumber, columnNumber);
this.identifier = identifier;
}
public String getIdentifier() { return identifier; }
}
// Function declaration node
public class FunctionDeclarationNode extends DeclarationNode {
private List<ParameterNode> parameters;
private TypeNode returnType;
private BlockNode body;
public FunctionDeclarationNode(int lineNumber, int columnNumber,
String identifier) {
super(lineNumber, columnNumber, identifier);
this.parameters = new ArrayList<>();
}
public void addParameter(ParameterNode parameter) {
this.parameters.add(parameter);
}
public void setReturnType(TypeNode returnType) {
this.returnType = returnType;
}
public void setBody(BlockNode body) {
this.body = body;
}
public List<ParameterNode> getParameters() {
return new ArrayList<>(parameters);
}
public TypeNode getReturnType() { return returnType; }
public BlockNode getBody() { return body; }
@Override
public void accept(ASTVisitor visitor) {
visitor.visitFunctionDeclaration(this);
}
}
The AST design uses the Visitor pattern to enable different operations on the tree structure without modifying the node classes themselves. This approach provides flexibility for implementing various compiler phases that need to traverse and analyze the AST.
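The ASTVisitor interface itself is referenced by the accept methods above but not shown; a minimal sketch covering the nodes introduced in this article might look like the following, with additional visit methods added as the node hierarchy grows:
// Minimal visitor interface for the AST nodes discussed so far
public interface ASTVisitor {
    void visitProgram(ProgramNode node);
    void visitFunctionDeclaration(FunctionDeclarationNode node);
    void visitVariableDeclaration(VariableDeclarationNode node);
    void visitBlock(BlockNode node);
}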
PARSE TREE TO AST CONVERSION
Converting ANTLR's parse trees to our custom AST representation requires implementing a visitor that traverses the parse tree and constructs corresponding AST nodes. This conversion process eliminates syntactic noise while preserving semantic information.
import org.antlr.v4.runtime.tree.*;
public class PyGoASTBuilder extends PyGoParserBaseVisitor<ASTNode> {
@Override
public ASTNode visitProgram(PyGoParser.ProgramContext ctx) {
ProgramNode program = new ProgramNode(
ctx.getStart().getLine(),
ctx.getStart().getCharPositionInLine()
);
for (PyGoParser.FunctionDeclarationContext funcCtx : ctx.functionDeclaration()) {
FunctionDeclarationNode funcNode =
(FunctionDeclarationNode) visitFunctionDeclaration(funcCtx);
program.addDeclaration(funcNode);
}
for (PyGoParser.VariableDeclarationContext varCtx : ctx.variableDeclaration()) {
VariableDeclarationNode varNode =
(VariableDeclarationNode) visitVariableDeclaration(varCtx);
program.addDeclaration(varNode);
}
return program;
}
@Override
public ASTNode visitFunctionDeclaration(PyGoParser.FunctionDeclarationContext ctx) {
String functionName = ctx.IDENTIFIER().getText();
FunctionDeclarationNode funcNode = new FunctionDeclarationNode(
ctx.getStart().getLine(),
ctx.getStart().getCharPositionInLine(),
functionName
);
// Process parameters if present
if (ctx.parameterList() != null) {
for (PyGoParser.ParameterContext paramCtx : ctx.parameterList().parameter()) {
ParameterNode paramNode = (ParameterNode) visitParameter(paramCtx);
funcNode.addParameter(paramNode);
}
}
// Process return type if present
if (ctx.type() != null) {
TypeNode returnType = (TypeNode) visitType(ctx.type());
funcNode.setReturnType(returnType);
}
// Process function body
BlockNode body = (BlockNode) visitBlock(ctx.block());
funcNode.setBody(body);
return funcNode;
}
@Override
public ASTNode visitVariableDeclaration(PyGoParser.VariableDeclarationContext ctx) {
String varName = ctx.IDENTIFIER().getText();
TypeNode varType = (TypeNode) visitType(ctx.type());
VariableDeclarationNode varNode = new VariableDeclarationNode(
ctx.getStart().getLine(),
ctx.getStart().getCharPositionInLine(),
varName,
varType
);
// Process initialization expression if present
if (ctx.expression() != null) {
ExpressionNode initExpr = (ExpressionNode) visitExpression(ctx.expression());
varNode.setInitializer(initExpr);
}
return varNode;
}
@Override
public ASTNode visitBlock(PyGoParser.BlockContext ctx) {
BlockNode block = new BlockNode(
ctx.getStart().getLine(),
ctx.getStart().getCharPositionInLine()
);
for (PyGoParser.StatementContext stmtCtx : ctx.statement()) {
StatementNode stmt = (StatementNode) visitStatement(stmtCtx);
block.addStatement(stmt);
}
return block;
}
}
The AST builder visitor traverses the parse tree depth-first, constructing AST nodes that correspond to each grammar rule. The visitor extracts essential information such as identifiers, types, and expressions while discarding syntactic tokens that are not needed for semantic analysis.
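Several node classes used by the builder (VariableDeclarationNode, ParameterNode, BlockNode, and the statement and expression hierarchies) follow the same pattern as the classes shown earlier. As one representative sketch, a VariableDeclarationNode consistent with the builder's usage might look like this:
// Variable declaration node: name, declared type, optional initializer
public class VariableDeclarationNode extends DeclarationNode {
    private TypeNode type;
    private ExpressionNode initializer;   // null when no initializer is present

    public VariableDeclarationNode(int lineNumber, int columnNumber,
                                   String identifier, TypeNode type) {
        super(lineNumber, columnNumber, identifier);
        this.type = type;
    }

    public void setInitializer(ExpressionNode initializer) { this.initializer = initializer; }
    public TypeNode getType() { return type; }
    public ExpressionNode getInitializer() { return initializer; }

    @Override
    public void accept(ASTVisitor visitor) { visitor.visitVariableDeclaration(this); }
}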
COMPREHENSIVE ERROR HANDLING
Effective error handling in the parser phase provides programmers with detailed information about syntax errors, including precise locations and suggestions for fixes. We implement custom error handling that integrates with ANTLR's error recovery mechanisms.
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.atn.*;
import java.util.*;
public class PyGoParserErrorListener extends BaseErrorListener {
private List<ParseError> errors;
public PyGoParserErrorListener() {
this.errors = new ArrayList<>();
}
@Override
public void syntaxError(Recognizer<?, ?> recognizer,
Object offendingSymbol,
int line,
int charPositionInLine,
String msg,
RecognitionException e) {
String errorMessage = formatErrorMessage(
(Token) offendingSymbol, line, charPositionInLine, msg, e
);
ParseError error = new ParseError(
line, charPositionInLine, errorMessage, generateSuggestion(e)
);
this.errors.add(error);
}
private String formatErrorMessage(Token offendingToken, int line,
int column, String msg, RecognitionException e) {
StringBuilder sb = new StringBuilder();
sb.append(String.format("Syntax error at line %d, column %d: ", line, column));
if (offendingToken != null) {
String tokenText = offendingToken.getText();
if (tokenText.equals("<EOF>")) {
sb.append("Unexpected end of file");
} else {
sb.append(String.format("Unexpected token '%s'", tokenText));
}
} else {
sb.append(msg);
}
return sb.toString();
}
private String generateSuggestion(RecognitionException e) {
if (e instanceof NoViableAltException) {
return "Check for missing keywords, operators, or punctuation";
} else if (e instanceof InputMismatchException) {
InputMismatchException ime = (InputMismatchException) e;
if (ime.getExpectedTokens() != null) {
return "Expected one of: " + ime.getExpectedTokens().toString();
}
}
return "Check syntax according to PyGo language specification";
}
public List<ParseError> getErrors() {
return new ArrayList<>(errors);
}
public boolean hasErrors() {
return !errors.isEmpty();
}
}
// Error information class
public class ParseError {
private int line;
private int column;
private String message;
private String suggestion;
public ParseError(int line, int column, String message, String suggestion) {
this.line = line;
this.column = column;
this.message = message;
this.suggestion = suggestion;
}
public int getLine() { return line; }
public int getColumn() { return column; }
public String getMessage() { return message; }
public String getSuggestion() { return suggestion; }
@Override
public String toString() {
StringBuilder sb = new StringBuilder();
sb.append(message);
if (suggestion != null && !suggestion.isEmpty()) {
sb.append("\n Suggestion: ").append(suggestion);
}
return sb.toString();
}
}
The error handling system captures detailed information about syntax errors and provides helpful suggestions for fixing common problems. The error messages include precise location information and context-specific guidance.
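Wiring the listener into a parser and reporting the collected errors takes only a few lines; this fragment is a sketch of the same pattern used by the wrapper class in the next section:
PyGoParser parser = new PyGoParser(tokens);
parser.removeErrorListeners();                       // drop the default console listener
PyGoParserErrorListener listener = new PyGoParserErrorListener();
parser.addErrorListener(listener);
parser.program();
for (ParseError error : listener.getErrors()) {
    System.err.println(error);                       // message plus suggestion, if any
}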
PARSER INTEGRATION AND TESTING
To integrate the parser with the overall compiler infrastructure, we create a wrapper class that combines lexing and parsing while providing comprehensive error handling and AST generation.
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;
import java.util.*;
public class PyGoParserWrapper {
private List<ParseError> errors;
private PyGoASTBuilder astBuilder;
public PyGoParserWrapper() {
this.errors = new ArrayList<>();
this.astBuilder = new PyGoASTBuilder();
}
public ParseResult parse(String input) {
// Create lexer (CharStreams replaces the deprecated ANTLRInputStream)
PyGoLexer lexer = new PyGoLexer(CharStreams.fromString(input));
// Add lexer error handling
lexer.removeErrorListeners();
PyGoLexerErrorListener lexerErrorListener = new PyGoLexerErrorListener();
lexer.addErrorListener(lexerErrorListener);
// Create token stream
CommonTokenStream tokens = new CommonTokenStream(lexer);
// Create parser
PyGoParser parser = new PyGoParser(tokens);
// Add parser error handling
parser.removeErrorListeners();
PyGoParserErrorListener parserErrorListener = new PyGoParserErrorListener();
parser.addErrorListener(parserErrorListener);
// Parse the input
PyGoParser.ProgramContext parseTree = parser.program();
// Collect all errors
List<ParseError> allErrors = new ArrayList<>();
allErrors.addAll(lexerErrorListener.getErrors());
allErrors.addAll(parserErrorListener.getErrors());
// Build AST if no errors
ProgramNode ast = null;
if (allErrors.isEmpty()) {
ast = (ProgramNode) astBuilder.visitProgram(parseTree);
}
return new ParseResult(ast, parseTree, allErrors);
}
}
// Result class for parsing operations
public class ParseResult {
private ProgramNode ast;
private ParseTree parseTree;
private List<ParseError> errors;
public ParseResult(ProgramNode ast, ParseTree parseTree, List<ParseError> errors) {
this.ast = ast;
this.parseTree = parseTree;
this.errors = new ArrayList<>(errors);
}
public ProgramNode getAST() { return ast; }
public ParseTree getParseTree() { return parseTree; }
public List<ParseError> getErrors() { return new ArrayList<>(errors); }
public boolean hasErrors() { return !errors.isEmpty(); }
public boolean isSuccessful() { return ast != null && errors.isEmpty(); }
}
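A short usage example of the wrapper (the driver class name and sample source string are illustrative):
public class CompilerDriver {
    public static void main(String[] args) {
        String source = "func main(): { var x: int = 42 print(x) }";
        PyGoParserWrapper wrapper = new PyGoParserWrapper();
        ParseResult result = wrapper.parse(source);
        if (result.isSuccessful()) {
            System.out.println("Parsed " + result.getAST().getDeclarations().size()
                               + " top-level declaration(s)");
        } else {
            result.getErrors().forEach(error -> System.err.println(error));
        }
    }
}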
COMPREHENSIVE PARSER TESTING
Testing the parser requires validating that it correctly handles various PyGo language constructs while providing meaningful error messages for invalid input. The testing framework covers both positive and negative test cases.
public class PyGoParserTest {
private PyGoParserWrapper parser;
public PyGoParserTest() {
this.parser = new PyGoParserWrapper();
}
public void runAllTests() {
testValidPrograms();
testSyntaxErrors();
testComplexExpressions();
testControlFlow();
testFunctionDeclarations();
System.out.println("All parser tests completed successfully");
}
private void testValidPrograms() {
String validProgram = """
func fibonacci(n: int) -> int:
{
if n <= 1:
{
return n
}
else:
{
return fibonacci(n - 1) + fibonacci(n - 2)
}
}
func main():
{
var result: int = fibonacci(10)
print(result)
}
""";
ParseResult result = parser.parse(validProgram);
assert result.isSuccessful() : "Valid program should parse successfully";
assert result.getAST() != null : "AST should be generated for valid program";
}
private void testSyntaxErrors() {
String invalidProgram = """
func broken_function(x: int -> int:
{
var y: int = x +
return y
}
""";
ParseResult result = parser.parse(invalidProgram);
assert result.hasErrors() : "Invalid program should produce errors";
assert !result.getErrors().isEmpty() : "Error list should not be empty";
// Verify error messages contain useful information
for (ParseError error : result.getErrors()) {
assert error.getLine() > 0 : "Error should have valid line number";
assert error.getMessage() != null : "Error should have message";
}
}
private void testComplexExpressions() {
String expressionProgram = """
func test_expressions():
{
var result: int = (5 + 3) * 2 - 1
var condition: bool = (result > 10) and (result < 20)
var complex: float = 3.14 * 2.0 + 1.5
}
""";
ParseResult result = parser.parse(expressionProgram);
assert result.isSuccessful() : "Expression program should parse successfully";
}
private void testControlFlow() {
String controlFlowProgram = """
func control_flow_test(n: int):
{
var i: int = 0
while i < n:
{
if i % 2 == 0:
{
print("even")
}
else:
{
print("odd")
}
i = i + 1
}
}
""";
ParseResult result = parser.parse(controlFlowProgram);
assert result.isSuccessful() : "Control flow program should parse successfully";
}
private void testFunctionDeclarations() {
String functionProgram = """
func no_params():
{
print("no parameters")
}
func with_params(x: int, y: float, name: string):
{
print(name)
}
func with_return(a: int, b: int) -> int:
{
return a + b
}
""";
ParseResult result = parser.parse(functionProgram);
assert result.isSuccessful() : "Function declaration program should parse successfully";
}
}
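The test class relies on Java assert statements, so it must be run with assertions enabled; a minimal entry point might be:
public class PyGoParserTestRunner {
    // Run with: java -ea PyGoParserTestRunner  (the -ea flag enables assertions)
    public static void main(String[] args) {
        new PyGoParserTest().runAllTests();
    }
}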
ERROR RECOVERY AND RESILIENCE
ANTLR's built-in error recovery mechanisms allow the parser to continue processing after encountering syntax errors, enabling detection of multiple problems in a single compilation pass. The parser attempts to resynchronize with the input stream and continue parsing subsequent constructs.
The error recovery strategy uses synchronization tokens such as semicolons and braces to find safe points for resuming parsing. This approach helps identify as many syntax errors as possible while maintaining reasonable parse tree structure for error-free portions of the code.
When errors occur, the parser generates error nodes in the parse tree that represent the problematic regions. These error nodes allow subsequent compilation phases to handle malformed input gracefully while still processing valid portions of the program.
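ANTLR exposes this behavior through its ErrorStrategy interface: the default DefaultErrorStrategy performs the recovery described above, while BailErrorStrategy aborts on the first error. A sketch of switching to fail-fast parsing, which can be useful for quick syntax checks where recovery is unnecessary:
// Fail-fast parsing: abort on the first syntax error instead of recovering
// (ParseCancellationException comes from org.antlr.v4.runtime.misc)
parser.setErrorHandler(new BailErrorStrategy());
try {
    PyGoParser.ProgramContext tree = parser.program();
} catch (ParseCancellationException e) {
    System.err.println("Parse aborted at first syntax error");
}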
PARSER PERFORMANCE OPTIMIZATION
The generated ANTLR parser provides excellent performance for typical PyGo programs, but understanding its characteristics helps optimize compilation speed for large codebases. The adaptive LL(*) algorithm provides linear parsing performance for most grammar constructs.
Memory usage scales with input complexity since the parser maintains call stacks for recursive descent and constructs parse trees that mirror the input structure. For very large files, streaming approaches and selective AST construction can reduce memory consumption.
The parser's lookahead capabilities ensure robust handling of ambiguous constructs without backtracking penalties. ANTLR's prediction analysis optimizes the generated parser to minimize lookahead requirements while maintaining correctness.
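A common ANTLR optimization for large inputs is two-stage parsing: attempt the faster SLL prediction mode first and fall back to full LL prediction only when it fails. A sketch of that pattern, assuming parser and tokens are the objects created in the wrapper above (PredictionMode comes from org.antlr.v4.runtime.atn):
// Stage 1: fast SLL prediction with fail-fast error handling
parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
parser.setErrorHandler(new BailErrorStrategy());
try {
    parser.program();
} catch (ParseCancellationException e) {
    // Stage 2: rewind the token stream and reparse with full LL prediction
    tokens.seek(0);
    parser.reset();
    parser.setErrorHandler(new DefaultErrorStrategy());
    parser.getInterpreter().setPredictionMode(PredictionMode.LL);
    parser.program();
}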
INTEGRATION WITH SEMANTIC ANALYSIS
The AST produced by the parser serves as input to the semantic analysis phase, which will be covered in detail when we discuss the compiler backend. The AST design provides clean interfaces for symbol table construction, type checking, and scope analysis.
Position information embedded in AST nodes enables precise error reporting during semantic analysis. This location data proves essential for generating helpful error messages that guide programmers to specific problems in their source code.
The visitor pattern used in the AST design facilitates implementation of various semantic analysis passes without modifying the core AST structure. This approach provides flexibility for implementing different analysis algorithms and optimization strategies.
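As a preview of how a semantic pass can reuse this infrastructure, here is a hypothetical skeleton that records top-level declarations for a future symbol table; the class name and structure are illustrative, not the Article 4 design:
import java.util.HashMap;
import java.util.Map;

// Hypothetical first pass: collect top-level names before type checking
public class DeclarationCollector implements ASTVisitor {
    private final Map<String, DeclarationNode> symbols = new HashMap<>();

    @Override
    public void visitProgram(ProgramNode node) {
        for (DeclarationNode decl : node.getDeclarations()) {
            symbols.put(decl.getIdentifier(), decl);   // later: report duplicates
            decl.accept(this);
        }
    }

    @Override
    public void visitFunctionDeclaration(FunctionDeclarationNode node) { }

    @Override
    public void visitVariableDeclaration(VariableDeclarationNode node) { }

    @Override
    public void visitBlock(BlockNode node) { }

    public Map<String, DeclarationNode> getSymbols() { return symbols; }
}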
CONCLUSION OF ARTICLE 3
This article has demonstrated the complete implementation of a PyGo parser using ANTLR v4, including comprehensive error handling and AST generation. The parser handles all PyGo language constructs while providing robust error recovery and meaningful error messages.
The grammar-driven approach using ANTLR ensures maintainability and correctness while providing excellent performance characteristics. The custom AST design creates a clean foundation for subsequent compilation phases.
The error handling system provides detailed feedback about syntax problems, including precise location information and helpful suggestions for fixes. The testing framework validates parser correctness across various input patterns and error conditions.
In Article 4, we will implement the PyGo compiler backend using LLVM for code generation. The backend will consume the AST produced by this parser and generate efficient machine code while performing semantic analysis and type checking.