Thursday, February 05, 2026

COMPILER CONSTRUCTION SERIES: BUILDING A PYGO COMPILER - ARTICLE 2: IMPLEMENTING THE PYGO LEXER WITH ANTLR V4



INTRODUCTION TO LEXICAL ANALYSIS


The lexical analysis phase transforms raw source code into a stream of tokens that represent the fundamental building blocks of the programming language. For PyGo, the lexer must recognize keywords, identifiers, operators, literals, and punctuation while handling whitespace and comments appropriately.


ANTLR v4 provides an excellent framework for implementing lexers through grammar-driven code generation. By defining lexical rules in ANTLR's grammar notation, we can automatically generate efficient lexer code that handles tokenization, error recovery, and token stream management.


The PyGo lexer must handle several categories of tokens including reserved keywords, user-defined identifiers, numeric and string literals, operators, delimiters, and special symbols. Each category requires specific recognition patterns and may involve complex state management for proper tokenization.


ANTLR V4 LEXER FUNDAMENTALS


ANTLR v4 lexer rules are written as regular-expression-like patterns. The generated lexer scans the input sequentially and, at each position, matches the longest possible token according to the defined rules (the maximal-munch principle).


Lexer rules in ANTLR begin with an uppercase letter and define how individual tokens are recognized. Rule order matters when two rules match input of the same length: the rule that appears earlier in the grammar file wins the tie, which is why keyword rules must be listed before the general identifier rule.


The ANTLR lexer generator creates efficient finite automata that can quickly identify tokens while providing robust error handling and recovery mechanisms. This approach ensures that the lexer can handle malformed input gracefully while providing meaningful error messages.


PYGO LEXER GRAMMAR SPECIFICATION


The complete PyGo lexer grammar defines all tokens needed for the language. We begin by creating the lexer grammar file that will serve as input to ANTLR's code generation process.


The keyword definitions must appear before the general identifier rule so that reserved words are recognized as keywords rather than as generic identifiers. A reserved word such as 'var' matches both the VAR rule and the IDENTIFIER rule at the same length, and ANTLR breaks such ties in favor of the rule listed first in the grammar.
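To make this concrete, the following minimal Java sketch (an illustration, assuming the PyGoLexer generated from the complete grammar given later in this article) tokenizes input containing a keyword next to an identifier that merely begins with one:

    import org.antlr.v4.runtime.*;

    public class KeywordVsIdentifierDemo {
        public static void main(String[] args) {
            // "return" matches RETURN and IDENTIFIER at the same length,
            // so the RETURN rule wins because it is listed first.
            // "returning" is longer than any keyword match, so maximal
            // munch turns it into an IDENTIFIER.
            PyGoLexer lexer = new PyGoLexer(CharStreams.fromString("return returning"));
            for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
                System.out.println(PyGoLexer.VOCABULARY.getSymbolicName(t.getType())
                        + " -> '" + t.getText() + "'");
            }
        }
    }

Running it should print RETURN -> 'return' followed by IDENTIFIER -> 'returning'.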


IDENTIFIER AND LITERAL TOKEN DEFINITIONS


Identifiers in PyGo follow standard programming language conventions, beginning with a letter or underscore and continuing with letters, digits, or underscores. The lexer must distinguish between keywords and user-defined identifiers.


    // Identifiers - must come after keywords

    IDENTIFIER  : [a-zA-Z_][a-zA-Z0-9_]*;


    // Numeric literals

    INTEGER     : [0-9]+;

    FLOAT       : [0-9]+ '.' [0-9]+;


    // String literals with escape sequence support

    STRING      : '"' (~["\r\n\\] | '\\' .)* '"';


The string literal rule handles escape sequences by allowing any character except quotes, carriage returns, newlines, or backslashes, while also permitting backslash-escaped character sequences. This approach provides basic string functionality while maintaining lexer simplicity.
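As a quick sanity check of the rule, this short sketch (again assuming the generated PyGoLexer; the sample literal is arbitrary) tokenizes a string containing an escaped quote and an escaped newline:

    import org.antlr.v4.runtime.*;

    public class StringLiteralDemo {
        public static void main(String[] args) {
            // The Java escapes below produce the PyGo source: "she said \"hi\"\n"
            String source = "\"she said \\\"hi\\\"\\n\"";
            PyGoLexer lexer = new PyGoLexer(CharStreams.fromString(source));
            Token t = lexer.nextToken();
            // Expected: one STRING token spanning the whole literal, then EOF.
            System.out.println(PyGoLexer.VOCABULARY.getSymbolicName(t.getType())
                    + " -> " + t.getText());
        }
    }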


OPERATOR AND DELIMITER TOKENS


PyGo includes standard arithmetic, comparison, and logical operators along with various delimiters for structuring code. Each operator and delimiter requires a specific token definition.


    // Arithmetic operators

    PLUS        : '+';

    MINUS       : '-';

    MULTIPLY    : '*';

    DIVIDE      : '/';

    MODULO      : '%';


    // Assignment and comparison operators

    EQUALS      : '=';

    EQUAL_EQUAL : '==';

    NOT_EQUAL   : '!=';

    LESS_THAN   : '<';

    LESS_EQUAL  : '<=';

    GREATER_THAN: '>';

    GREATER_EQUAL: '>=';


    // Delimiters and punctuation

    COLON       : ':';

    SEMICOLON   : ';';

    COMMA       : ',';

    LEFT_PAREN  : '(';

    RIGHT_PAREN : ')';

    LEFT_BRACE  : '{';

    RIGHT_BRACE : '}';

    ARROW       : '->';


Operators that share a common prefix, such as '=' and '==', do not need any particular ordering: ANTLR always applies maximal munch, consuming the longest matching token, so '==' is recognized as a single EQUAL_EQUAL token rather than two assignment operators even though EQUALS is defined first. Rule order becomes significant only when two rules match exactly the same text, as with keywords and IDENTIFIER.
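The sketch below (an illustrative check, assuming the generated PyGoLexer) confirms that '==' is consumed as one token while a lone '=' remains an assignment operator:

    import org.antlr.v4.runtime.*;

    public class MaximalMunchDemo {
        public static void main(String[] args) {
            // Expected output: IDENTIFIER EQUAL_EQUAL INTEGER EQUALS IDENTIFIER
            PyGoLexer lexer = new PyGoLexer(CharStreams.fromString("count == 0 = x"));
            for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
                System.out.print(PyGoLexer.VOCABULARY.getSymbolicName(t.getType()) + " ");
            }
            System.out.println();
        }
    }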


WHITESPACE AND COMMENT HANDLING


The lexer must recognize whitespace and comments without passing them on to the parser. ANTLR supports this either by routing such tokens to a hidden channel or by discarding them entirely with the skip directive; the PyGo lexer uses skip.


    // Whitespace - skip completely

    WHITESPACE  : [ \t\r\n]+ -> skip;


    // Line comments - skip completely  

    LINE_COMMENT: '#' ~[\r\n]* -> skip;


    // Block comments - skip completely

    BLOCK_COMMENT: '/*' .*? '*/' -> skip;


The skip directive tells ANTLR to match these patterns but discard them without creating tokens, so whitespace and comments never reach the parser while the lexer still consumes them correctly.


COMPLETE PYGO LEXER GRAMMAR


Here is the complete PyGo lexer grammar that combines all the token definitions into a cohesive specification:


    lexer grammar PyGoLexer;


    // Keywords - must come before IDENTIFIER

    VAR         : 'var';

    FUNC        : 'func';

    IF          : 'if';

    ELSE        : 'else';

    WHILE       : 'while';

    FOR         : 'for';

    RETURN      : 'return';

    TRUE        : 'true';

    FALSE       : 'false';

    AND         : 'and';

    OR          : 'or';

    NOT         : 'not';

    PRINT       : 'print';

    INT_TYPE    : 'int';

    FLOAT_TYPE  : 'float';

    STRING_TYPE : 'string';

    BOOL_TYPE   : 'bool';


    // Identifiers

    IDENTIFIER  : [a-zA-Z_][a-zA-Z0-9_]*;


    // Literals

    INTEGER     : [0-9]+;

    FLOAT       : [0-9]+ '.' [0-9]+;

    STRING      : '"' (~["\r\n\\] | '\\' .)* '"';


    // Operators

    PLUS        : '+';

    MINUS       : '-';

    MULTIPLY    : '*';

    DIVIDE      : '/';

    MODULO      : '%';

    EQUALS      : '=';

    EQUAL_EQUAL : '==';

    NOT_EQUAL   : '!=';

    LESS_THAN   : '<';

    LESS_EQUAL  : '<=';

    GREATER_THAN: '>';

    GREATER_EQUAL: '>=';


    // Delimiters

    COLON       : ':';

    SEMICOLON   : ';';

    COMMA       : ',';

    LEFT_PAREN  : '(';

    RIGHT_PAREN : ')';

    LEFT_BRACE  : '{';

    RIGHT_BRACE : '}';

    ARROW       : '->';


    // Whitespace and comments

    WHITESPACE  : [ \t\r\n]+ -> skip;

    LINE_COMMENT: '#' ~[\r\n]* -> skip;

    BLOCK_COMMENT: '/*' .*? '*/' -> skip;


GENERATING THE LEXER CODE


To generate the lexer implementation from the grammar specification, we run the ANTLR v4 tool (invoked below through the usual antlr4 alias or wrapper script) on the grammar file. The generation process creates several files that work together to provide lexical analysis functionality.


    antlr4 -Dlanguage=Java PyGoLexer.g4


This command generates Java source files including PyGoLexer.java, which contains the main lexer implementation, and PyGoLexer.tokens, which defines token type constants that can be shared with the parser.


The generated lexer class extends ANTLR's base Lexer class and provides methods for tokenizing input streams. The lexer handles character-by-character processing while maintaining internal state to track line numbers, column positions, and current lexical context.
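Before building any wrapper classes, the generated lexer can be exercised on its own. The following smoke test (an illustration of the standard ANTLR runtime API; the sample source line is arbitrary) wraps the lexer in a CommonTokenStream and prints every token with its symbolic name:

    import org.antlr.v4.runtime.*;

    public class LexerSmokeTest {
        public static void main(String[] args) {
            PyGoLexer lexer = new PyGoLexer(CharStreams.fromString("var count: int = 42"));
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            tokens.fill(); // force the lexer to process the whole input
            for (Token t : tokens.getTokens()) {
                System.out.printf("%-13s '%s'%n",
                        PyGoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
            }
        }
    }

The expected output lists VAR, IDENTIFIER, COLON, INT_TYPE, EQUALS, INTEGER, and EOF in order.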


LEXER INTEGRATION AND TESTING


To integrate the generated lexer into our compiler infrastructure, we need to create wrapper classes that provide convenient interfaces for tokenization and error handling.


    import org.antlr.v4.runtime.*;

    import org.antlr.v4.runtime.tree.*;

    import java.io.*;

    import java.util.*;


    public class PyGoLexerWrapper {

        private PyGoLexer lexer;

        private List<String> errors;


        public PyGoLexerWrapper() {

            this.errors = new ArrayList<>();

        }


        public List<Token> tokenize(String input) {

            // Create a character stream from the source string
            // (CharStreams replaces the deprecated ANTLRInputStream)

            CharStream inputStream = CharStreams.fromString(input);

            

            // Create lexer instance

            this.lexer = new PyGoLexer(inputStream);

            

            // Add custom error listener

            this.lexer.removeErrorListeners();

            this.lexer.addErrorListener(new PyGoLexerErrorListener(this.errors));

            

            // Collect all tokens

            List<Token> tokens = new ArrayList<>();

            Token token;

            

            do {

                token = this.lexer.nextToken();

                tokens.add(token);

            } while (token.getType() != Token.EOF);

            

            return tokens;

        }


        public List<String> getErrors() {

            return new ArrayList<>(this.errors);

        }


        public boolean hasErrors() {

            return !this.errors.isEmpty();

        }

    }


The wrapper class provides a clean interface for tokenizing PyGo source code while collecting any lexical errors that occur during processing. The error handling mechanism allows the compiler to provide meaningful feedback about lexical problems.
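A short usage sketch of the wrapper (the sample source line is arbitrary):

    import org.antlr.v4.runtime.Token;
    import java.util.List;

    public class WrapperDemo {
        public static void main(String[] args) {
            PyGoLexerWrapper wrapper = new PyGoLexerWrapper();
            List<Token> tokens = wrapper.tokenize("var total: float = 3.14");
            System.out.println("Token count (including EOF): " + tokens.size());
            if (wrapper.hasErrors()) {
                wrapper.getErrors().forEach(System.out::println);
            }
        }
    }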


CUSTOM ERROR HANDLING


Effective error handling in the lexer phase helps programmers quickly identify and fix problems in their source code. We implement a custom error listener that captures lexical errors with detailed location information.


    import org.antlr.v4.runtime.*;


    public class PyGoLexerErrorListener extends BaseErrorListener {

        private List<String> errors;


        public PyGoLexerErrorListener(List<String> errors) {

            this.errors = errors;

        }


        @Override

        public void syntaxError(Recognizer<?, ?> recognizer,

                              Object offendingSymbol,

                              int line,

                              int charPositionInLine,

                              String msg,

                              RecognitionException e) {

            

            String errorMessage = String.format(

                "Lexical error at line %d, column %d: %s",

                line, charPositionInLine, msg

            );

            

            this.errors.add(errorMessage);

        }

    }


This error listener captures lexical errors and formats them with precise location information. The error messages include line numbers and column positions to help programmers locate problems in their source code quickly.
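To see the listener in action, the sketch below feeds the wrapper a deliberately invalid input (the '@' character is not matched by any PyGo lexer rule); the message in the comment reflects ANTLR's default lexer error text and is shown only as an example:

    public class LexerErrorDemo {
        public static void main(String[] args) {
            PyGoLexerWrapper wrapper = new PyGoLexerWrapper();
            wrapper.tokenize("var x = 1 @ 2");
            for (String error : wrapper.getErrors()) {
                // e.g. Lexical error at line 1, column 10: token recognition error at: '@'
                System.out.println(error);
            }
        }
    }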


LEXER TESTING FRAMEWORK


To ensure the lexer works correctly, we need comprehensive testing that covers all token types and edge cases. The testing framework below validates that the lexer produces the expected token sequence for a variety of input patterns; because it relies on Java assert statements, run the tests with the JVM's -ea flag so the assertions are actually evaluated.


    import java.util.*;


    public class PyGoLexerTest {

        private PyGoLexerWrapper lexer;


        public PyGoLexerTest() {

            this.lexer = new PyGoLexerWrapper();

        }


        public void runAllTests() {

            testKeywords();

            testIdentifiers();

            testLiterals();

            testOperators();

            testComments();

            testComplexExpressions();

            

            System.out.println("All lexer tests completed successfully");

        }


        private void testKeywords() {

            String input = "var func if else while for return";

            List<Token> tokens = this.lexer.tokenize(input);

            

            // Verify expected token types

            int[] expectedTypes = {

                PyGoLexer.VAR, PyGoLexer.FUNC, PyGoLexer.IF,

                PyGoLexer.ELSE, PyGoLexer.WHILE, PyGoLexer.FOR,

                PyGoLexer.RETURN, PyGoLexer.EOF

            };

            

            for (int i = 0; i < expectedTypes.length; i++) {

                assert tokens.get(i).getType() == expectedTypes[i];

            }

        }


        private void testIdentifiers() {

            String input = "variable_name _private_var userName count123";

            List<Token> tokens = this.lexer.tokenize(input);

            

            // All should be IDENTIFIER tokens

            for (int i = 0; i < tokens.size() - 1; i++) {

                assert tokens.get(i).getType() == PyGoLexer.IDENTIFIER;

            }

        }


        private void testLiterals() {

            String input = "42 3.14159 \"hello world\" true false";

            List<Token> tokens = this.lexer.tokenize(input);

            

            int[] expectedTypes = {

                PyGoLexer.INTEGER, PyGoLexer.FLOAT, PyGoLexer.STRING,

                PyGoLexer.TRUE, PyGoLexer.FALSE, PyGoLexer.EOF

            };

            

            for (int i = 0; i < expectedTypes.length; i++) {

                assert tokens.get(i).getType() == expectedTypes[i];

            }

        }


        private void testOperators() {

            String input = "+ - * / == != <= >= = < >";

            List<Token> tokens = this.lexer.tokenize(input);

            

            int[] expectedTypes = {

                PyGoLexer.PLUS, PyGoLexer.MINUS, PyGoLexer.MULTIPLY,

                PyGoLexer.DIVIDE, PyGoLexer.EQUAL_EQUAL, PyGoLexer.NOT_EQUAL,

                PyGoLexer.LESS_EQUAL, PyGoLexer.GREATER_EQUAL, PyGoLexer.EQUALS,

                PyGoLexer.LESS_THAN, PyGoLexer.GREATER_THAN, PyGoLexer.EOF

            };

            

            for (int i = 0; i < expectedTypes.length; i++) {

                assert tokens.get(i).getType() == expectedTypes[i];

            }

        }


        private void testComments() {

            String input = "var x # this is a comment\n/* block comment */ var y";

            List<Token> tokens = this.lexer.tokenize(input);

            

            // Comments should be skipped, only VAR, IDENTIFIER tokens remain

            int[] expectedTypes = {

                PyGoLexer.VAR, PyGoLexer.IDENTIFIER,

                PyGoLexer.VAR, PyGoLexer.IDENTIFIER, PyGoLexer.EOF

            };

            

            for (int i = 0; i < expectedTypes.length; i++) {

                assert tokens.get(i).getType() == expectedTypes[i];

            }

        }


        private void testComplexExpressions() {

            String input = "func calculate(x: int, y: float) -> float:";

            List<Token> tokens = this.lexer.tokenize(input);

            

            int[] expectedTypes = {

                PyGoLexer.FUNC, PyGoLexer.IDENTIFIER, PyGoLexer.LEFT_PAREN,

                PyGoLexer.IDENTIFIER, PyGoLexer.COLON, PyGoLexer.INT_TYPE,

                PyGoLexer.COMMA, PyGoLexer.IDENTIFIER, PyGoLexer.COLON,

                PyGoLexer.FLOAT_TYPE, PyGoLexer.RIGHT_PAREN, PyGoLexer.ARROW,

                PyGoLexer.FLOAT_TYPE, PyGoLexer.COLON, PyGoLexer.EOF

            };

            

            for (int i = 0; i < expectedTypes.length; i++) {

                assert tokens.get(i).getType() == expectedTypes[i];

            }

        }

    }


LEXER PERFORMANCE CONSIDERATIONS


The generated ANTLR lexer provides excellent performance for most use cases, but understanding its behavior helps optimize compilation speed for large source files. The lexer processes input characters sequentially using finite automata, providing linear time complexity for tokenization.


Memory usage scales with input size since the lexer maintains internal buffers for character processing and token creation. For very large files, streaming approaches can reduce memory consumption while maintaining good performance characteristics.
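ANTLR's runtime ships unbuffered stream classes for exactly this purpose. The sketch below is a minimal illustration of that documented pattern applied to PyGo; the file name is a placeholder:

    import org.antlr.v4.runtime.*;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class LargeFileTokenizer {
        public static void main(String[] args) throws IOException {
            // Unbuffered streams keep only a small moving window of the file in
            // memory; tokens must copy their text out of that window, hence the
            // CommonTokenFactory(true) setting.
            try (FileInputStream in = new FileInputStream("large_program.pygo")) {
                CharStream chars = new UnbufferedCharStream(in);
                PyGoLexer lexer = new PyGoLexer(chars);
                lexer.setTokenFactory(new CommonTokenFactory(true));
                TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lexer);
                // Consume tokens one at a time instead of buffering them all.
                while (tokens.LA(1) != Token.EOF) {
                    tokens.consume();
                }
            }
        }
    }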


The lexer's error recovery mechanisms ensure robust handling of malformed input without catastrophic failures. When encountering invalid characters or incomplete tokens, the lexer generates appropriate error messages and continues processing to find additional problems.


INTEGRATION WITH COMPILER PIPELINE


The PyGo lexer integrates seamlessly with the overall compiler architecture by providing a clean token stream interface that the parser can consume. The lexer handles all low-level character processing details while exposing only the essential token information needed for syntax analysis.


Token objects include type information, text content, and position data that enables precise error reporting throughout the compilation process. This information proves invaluable during parsing and semantic analysis phases where location-specific error messages significantly improve the development experience.
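All of this information is available directly from the ANTLR Token interface; for example (an illustrative fragment using the wrapper defined earlier, with an arbitrary input line):

    import org.antlr.v4.runtime.Token;

    public class TokenInfoDemo {
        public static void main(String[] args) {
            // The third token of "var x = 1" is EQUALS at line 1, column 6.
            Token token = new PyGoLexerWrapper().tokenize("var x = 1").get(2);
            System.out.printf("type=%s text='%s' line=%d column=%d%n",
                    PyGoLexer.VOCABULARY.getSymbolicName(token.getType()),
                    token.getText(), token.getLine(), token.getCharPositionInLine());
        }
    }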


The lexer's error handling integrates with the compiler's overall error management system, allowing consistent error reporting across all compilation phases. This unified approach ensures that programmers receive coherent feedback regardless of where problems occur in their source code.


CONCLUSION OF ARTICLE 2


This article has demonstrated the complete implementation of a PyGo lexer using ANTLR v4. The lexer handles all PyGo tokens including keywords, identifiers, literals, operators, and delimiters while providing robust error handling and recovery mechanisms.


The ANTLR-generated lexer provides excellent performance and maintainability compared to hand-written alternatives. The grammar-driven approach ensures consistency and makes it easy to modify the lexer as the PyGo language evolves.


The testing framework validates lexer correctness across various input patterns and edge cases. Comprehensive testing ensures that the lexer reliably processes PyGo source code and provides meaningful error messages for invalid input.


In Article 3, we will build upon this lexer foundation to implement a complete PyGo parser using ANTLR v4. The parser will consume the token stream produced by this lexer and construct Abstract Syntax Trees that represent the structure of PyGo programs.
