Thursday, February 05, 2026

COMPILER CONSTRUCTION SERIES: BUILDING A PYGO COMPILER - ARTICLE 2: IMPLEMENTING THE PYGO LEXER WITH ANTLR V4



INTRODUCTION TO LEXICAL ANALYSIS


The lexical analysis phase transforms raw source code into a stream of tokens that represent the fundamental building blocks of the programming language. For PyGo, the lexer must recognize keywords, identifiers, operators, literals, and punctuation while handling whitespace and comments appropriately.


ANTLR v4 provides an excellent framework for implementing lexers through grammar-driven code generation. By defining lexical rules in ANTLR's grammar notation, we can automatically generate efficient lexer code that handles tokenization, error recovery, and token stream management.


The PyGo lexer must handle several categories of tokens including reserved keywords, user-defined identifiers, numeric and string literals, operators, delimiters, and special symbols. Each category requires specific recognition patterns and may involve complex state management for proper tokenization.


ANTLR V4 LEXER FUNDAMENTALS


ANTLR v4 lexer rules are written as regular-expression-like patterns. The generated lexer scans the input sequentially and, at each position, matches the longest possible token according to the defined rules (the maximal-munch principle).


Lexer rules in ANTLR begin with an uppercase letter and define how individual tokens are recognized. Rule order matters when two rules match input of the same length: the rule that appears earlier in the grammar file wins the tie, which is why keyword rules must be listed before the general identifier rule.


The ANTLR lexer generator creates efficient finite automata that can quickly identify tokens while providing robust error handling and recovery mechanisms. This approach ensures that the lexer can handle malformed input gracefully while providing meaningful error messages.


PYGO LEXER GRAMMAR SPECIFICATION


The complete PyGo lexer grammar defines all tokens needed for the language. We begin by creating the lexer grammar file that will serve as input to ANTLR's code generation process.


The keyword definitions must appear before the general identifier rule so that reserved words are recognized as keywords rather than as generic identifiers. A reserved word such as 'var' matches both the VAR rule and the IDENTIFIER rule at the same length, and ANTLR breaks such ties in favor of the rule listed first in the grammar.
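To make this concrete, the following minimal Java sketch (an illustration, assuming the PyGoLexer generated from the complete grammar given later in this article) tokenizes input containing a keyword next to an identifier that merely begins with one:

    import org.antlr.v4.runtime.*;

    public class KeywordVsIdentifierDemo {
        public static void main(String[] args) {
            // "return" matches RETURN and IDENTIFIER at the same length,
            // so the RETURN rule wins because it is listed first.
            // "returning" is longer than any keyword match, so maximal
            // munch turns it into an IDENTIFIER.
            PyGoLexer lexer = new PyGoLexer(CharStreams.fromString("return returning"));
            for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
                System.out.println(PyGoLexer.VOCABULARY.getSymbolicName(t.getType())
                        + " -> '" + t.getText() + "'");
            }
        }
    }

Running it should print RETURN -> 'return' followed by IDENTIFIER -> 'returning'.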


IDENTIFIER AND LITERAL TOKEN DEFINITIONS


Identifiers in PyGo follow standard programming language conventions, beginning with a letter or underscore and continuing with letters, digits, or underscores. The lexer must distinguish between keywords and user-defined identifiers.


    // Identifiers - must come after keywords

    IDENTIFIER  : [a-zA-Z_][a-zA-Z0-9_]*;


    // Numeric literals

    INTEGER     : [0-9]+;

    FLOAT       : [0-9]+ '.' [0-9]+;


    // String literals with escape sequence support

    STRING      : '"' (~["\r\n\\] | '\\' .)* '"';


The string literal rule handles escape sequences by allowing any character except quotes, carriage returns, newlines, or backslashes, while also permitting backslash-escaped character sequences. This approach provides basic string functionality while maintaining lexer simplicity.
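As a quick sanity check of the rule, this short sketch (again assuming the generated PyGoLexer; the sample literal is arbitrary) tokenizes a string containing an escaped quote and an escaped newline:

    import org.antlr.v4.runtime.*;

    public class StringLiteralDemo {
        public static void main(String[] args) {
            // The Java escapes below produce the PyGo source: "she said \"hi\"\n"
            String source = "\"she said \\\"hi\\\"\\n\"";
            PyGoLexer lexer = new PyGoLexer(CharStreams.fromString(source));
            Token t = lexer.nextToken();
            // Expected: one STRING token spanning the whole literal, then EOF.
            System.out.println(PyGoLexer.VOCABULARY.getSymbolicName(t.getType())
                    + " -> " + t.getText());
        }
    }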


OPERATOR AND DELIMITER TOKENS


PyGo includes standard arithmetic, comparison, and logical operators along with various delimiters for structuring code. Each operator and delimiter requires a specific token definition.


    // Arithmetic operators

    PLUS        : '+';

    MINUS       : '-';

    MULTIPLY    : '*';

    DIVIDE      : '/';

    MODULO      : '%';


    // Assignment and comparison operators

    EQUALS      : '=';

    EQUAL_EQUAL : '==';

    NOT_EQUAL   : '!=';

    LESS_THAN   : '<';

    LESS_EQUAL  : '<=';

    GREATER_THAN: '>';

    GREATER_EQUAL: '>=';


    // Delimiters and punctuation

    COLON       : ':';

    SEMICOLON   : ';';

    COMMA       : ',';

    LEFT_PAREN  : '(';

    RIGHT_PAREN : ')';

    LEFT_BRACE  : '{';

    RIGHT_BRACE : '}';

    ARROW       : '->';


Operators that share a common prefix, such as '=' and '==', do not need any particular ordering: ANTLR always applies maximal munch, consuming the longest matching token, so '==' is recognized as a single EQUAL_EQUAL token rather than two assignment operators even though EQUALS is defined first. Rule order becomes significant only when two rules match exactly the same text, as with keywords and IDENTIFIER.
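The sketch below (an illustrative check, assuming the generated PyGoLexer) confirms that '==' is consumed as one token while a lone '=' remains an assignment operator:

    import org.antlr.v4.runtime.*;

    public class MaximalMunchDemo {
        public static void main(String[] args) {
            // Expected output: IDENTIFIER EQUAL_EQUAL INTEGER EQUALS IDENTIFIER
            PyGoLexer lexer = new PyGoLexer(CharStreams.fromString("count == 0 = x"));
            for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
                System.out.print(PyGoLexer.VOCABULARY.getSymbolicName(t.getType()) + " ");
            }
            System.out.println();
        }
    }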


WHITESPACE AND COMMENT HANDLING


The lexer must recognize whitespace and comments without passing them on to the parser. ANTLR supports this either by routing such tokens to a hidden channel or by discarding them entirely with the skip directive; the PyGo lexer uses skip.


    // Whitespace - skip completely

    WHITESPACE  : [ \t\r\n]+ -> skip;


    // Line comments - skip completely  

    LINE_COMMENT: '#' ~[\r\n]* -> skip;


    // Block comments - skip completely

    BLOCK_COMMENT: '/*' .*? '*/' -> skip;


The skip directive tells ANTLR to match these patterns but discard them without creating tokens, so whitespace and comments never reach the parser while the lexer still consumes them correctly.


COMPLETE PYGO LEXER GRAMMAR


Here is the complete PyGo lexer grammar that combines all the token definitions into a cohesive specification:


    lexer grammar PyGoLexer;


    // Keywords - must come before IDENTIFIER

    VAR         : 'var';

    FUNC        : 'func';

    IF          : 'if';

    ELSE        : 'else';

    WHILE       : 'while';

    FOR         : 'for';

    RETURN      : 'return';

    TRUE        : 'true';

    FALSE       : 'false';

    AND         : 'and';

    OR          : 'or';

    NOT         : 'not';

    PRINT       : 'print';

    INT_TYPE    : 'int';

    FLOAT_TYPE  : 'float';

    STRING_TYPE : 'string';

    BOOL_TYPE   : 'bool';


    // Identifiers

    IDENTIFIER  : [a-zA-Z_][a-zA-Z0-9_]*;


    // Literals

    INTEGER     : [0-9]+;

    FLOAT       : [0-9]+ '.' [0-9]+;

    STRING      : '"' (~["\r\n\\] | '\\' .)* '"';


    // Operators

    PLUS        : '+';

    MINUS       : '-';

    MULTIPLY    : '*';

    DIVIDE      : '/';

    MODULO      : '%';

    EQUALS      : '=';

    EQUAL_EQUAL : '==';

    NOT_EQUAL   : '!=';

    LESS_THAN   : '<';

    LESS_EQUAL  : '<=';

    GREATER_THAN: '>';

    GREATER_EQUAL: '>=';


    // Delimiters

    COLON       : ':';

    SEMICOLON   : ';';

    COMMA       : ',';

    LEFT_PAREN  : '(';

    RIGHT_PAREN : ')';

    LEFT_BRACE  : '{';

    RIGHT_BRACE : '}';

    ARROW       : '->';


    // Whitespace and comments

    WHITESPACE  : [ \t\r\n]+ -> skip;

    LINE_COMMENT: '#' ~[\r\n]* -> skip;

    BLOCK_COMMENT: '/*' .*? '*/' -> skip;


GENERATING THE LEXER CODE


To generate the lexer implementation from the grammar specification, we run the ANTLR v4 tool (invoked below through the usual antlr4 alias or wrapper script) on the grammar file. The generation process creates several files that work together to provide lexical analysis functionality.


    antlr4 -Dlanguage=Java PyGoLexer.g4


This command generates Java source files including PyGoLexer.java, which contains the main lexer implementation, and PyGoLexer.tokens, which defines token type constants that can be shared with the parser.


The generated lexer class extends ANTLR's base Lexer class and provides methods for tokenizing input streams. The lexer handles character-by-character processing while maintaining internal state to track line numbers, column positions, and current lexical context.
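Before building any wrapper classes, the generated lexer can be exercised on its own. The following smoke test (an illustration of the standard ANTLR runtime API; the sample source line is arbitrary) wraps the lexer in a CommonTokenStream and prints every token with its symbolic name:

    import org.antlr.v4.runtime.*;

    public class LexerSmokeTest {
        public static void main(String[] args) {
            PyGoLexer lexer = new PyGoLexer(CharStreams.fromString("var count: int = 42"));
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            tokens.fill(); // force the lexer to process the whole input
            for (Token t : tokens.getTokens()) {
                System.out.printf("%-13s '%s'%n",
                        PyGoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
            }
        }
    }

The expected output lists VAR, IDENTIFIER, COLON, INT_TYPE, EQUALS, INTEGER, and EOF in order.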


LEXER INTEGRATION AND TESTING


To integrate the generated lexer into our compiler infrastructure, we need to create wrapper classes that provide convenient interfaces for tokenization and error handling.


    import org.antlr.v4.runtime.*;

    import org.antlr.v4.runtime.tree.*;

    import java.io.*;

    import java.util.*;


    public class PyGoLexerWrapper {

        private PyGoLexer lexer;

        private List<String> errors;


        public PyGoLexerWrapper() {

            this.errors = new ArrayList<>();

        }


        public List<Token> tokenize(String input) {

            // Create a character stream from the source string
            // (CharStreams replaces the deprecated ANTLRInputStream)

            CharStream inputStream = CharStreams.fromString(input);

            

            // Create lexer instance

            this.lexer = new PyGoLexer(inputStream);

            

            // Add custom error listener

            this.lexer.removeErrorListeners();

            this.lexer.addErrorListener(new PyGoLexerErrorListener(this.errors));

            

            // Collect all tokens

            List<Token> tokens = new ArrayList<>();

            Token token;

            

            do {

                token = this.lexer.nextToken();

                tokens.add(token);

            } while (token.getType() != Token.EOF);

            

            return tokens;

        }


        public List<String> getErrors() {

            return new ArrayList<>(this.errors);

        }


        public boolean hasErrors() {

            return !this.errors.isEmpty();

        }

    }


The wrapper class provides a clean interface for tokenizing PyGo source code while collecting any lexical errors that occur during processing. The error handling mechanism allows the compiler to provide meaningful feedback about lexical problems.
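A short usage sketch of the wrapper (the sample source line is arbitrary):

    import org.antlr.v4.runtime.Token;
    import java.util.List;

    public class WrapperDemo {
        public static void main(String[] args) {
            PyGoLexerWrapper wrapper = new PyGoLexerWrapper();
            List<Token> tokens = wrapper.tokenize("var total: float = 3.14");
            System.out.println("Token count (including EOF): " + tokens.size());
            if (wrapper.hasErrors()) {
                wrapper.getErrors().forEach(System.out::println);
            }
        }
    }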


CUSTOM ERROR HANDLING


Effective error handling in the lexer phase helps programmers quickly identify and fix problems in their source code. We implement a custom error listener that captures lexical errors with detailed location information.


    import org.antlr.v4.runtime.*;


    public class PyGoLexerErrorListener extends BaseErrorListener {

        private List<String> errors;


        public PyGoLexerErrorListener(List<String> errors) {

            this.errors = errors;

        }


        @Override

        public void syntaxError(Recognizer<?, ?> recognizer,

                              Object offendingSymbol,

                              int line,

                              int charPositionInLine,

                              String msg,

                              RecognitionException e) {

            

            String errorMessage = String.format(

                "Lexical error at line %d, column %d: %s",

                line, charPositionInLine, msg

            );

            

            this.errors.add(errorMessage);

        }

    }


This error listener captures lexical errors and formats them with precise location information. The error messages include line numbers and column positions to help programmers locate problems in their source code quickly.
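To see the listener in action, the sketch below feeds the wrapper a deliberately invalid input (the '@' character is not matched by any PyGo lexer rule); the message in the comment reflects ANTLR's default lexer error text and is shown only as an example:

    public class LexerErrorDemo {
        public static void main(String[] args) {
            PyGoLexerWrapper wrapper = new PyGoLexerWrapper();
            wrapper.tokenize("var x = 1 @ 2");
            for (String error : wrapper.getErrors()) {
                // e.g. Lexical error at line 1, column 10: token recognition error at: '@'
                System.out.println(error);
            }
        }
    }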


LEXER TESTING FRAMEWORK


To ensure the lexer works correctly, we need comprehensive testing that covers all token types and edge cases. The testing framework below validates that the lexer produces the expected token sequence for a variety of input patterns; because it relies on Java assert statements, run the tests with the JVM's -ea flag so the assertions are actually evaluated.


    import java.util.*;


    public class PyGoLexerTest {

        private PyGoLexerWrapper lexer;


        public PyGoLexerTest() {

            this.lexer = new PyGoLexerWrapper();

        }


        public void runAllTests() {

            testKeywords();

            testIdentifiers();

            testLiterals();

            testOperators();

            testComments();

            testComplexExpressions();

            

            System.out.println("All lexer tests completed successfully");

        }


        private void testKeywords() {

            String input = "var func if else while for return";

            List<Token> tokens = this.lexer.tokenize(input);

            

            // Verify expected token types

            int[] expectedTypes = {

                PyGoLexer.VAR, PyGoLexer.FUNC, PyGoLexer.IF,

                PyGoLexer.ELSE, PyGoLexer.WHILE, PyGoLexer.FOR,

                PyGoLexer.RETURN, PyGoLexer.EOF

            };

            

            for (int i = 0; i < expectedTypes.length; i++) {

                assert tokens.get(i).getType() == expectedTypes[i];

            }

        }


        private void testIdentifiers() {

            String input = "variable_name _private_var userName count123";

            List<Token> tokens = this.lexer.tokenize(input);

            

            // All should be IDENTIFIER tokens

            for (int i = 0; i < tokens.size() - 1; i++) {

                assert tokens.get(i).getType() == PyGoLexer.IDENTIFIER;

            }

        }


        private void testLiterals() {

            String input = "42 3.14159 \"hello world\" true false";

            List<Token> tokens = this.lexer.tokenize(input);

            

            int[] expectedTypes = {

                PyGoLexer.INTEGER, PyGoLexer.FLOAT, PyGoLexer.STRING,

                PyGoLexer.TRUE, PyGoLexer.FALSE, PyGoLexer.EOF

            };

            

            for (int i = 0; i < expectedTypes.length; i++) {

                assert tokens.get(i).getType() == expectedTypes[i];

            }

        }


        private void testOperators() {

            String input = "+ - * / == != <= >= = < >";

            List<Token> tokens = this.lexer.tokenize(input);

            

            int[] expectedTypes = {

                PyGoLexer.PLUS, PyGoLexer.MINUS, PyGoLexer.MULTIPLY,

                PyGoLexer.DIVIDE, PyGoLexer.EQUAL_EQUAL, PyGoLexer.NOT_EQUAL,

                PyGoLexer.LESS_EQUAL, PyGoLexer.GREATER_EQUAL, PyGoLexer.EQUALS,

                PyGoLexer.LESS_THAN, PyGoLexer.GREATER_THAN, PyGoLexer.EOF

            };

            

            for (int i = 0; i < expectedTypes.length; i++) {

                assert tokens.get(i).getType() == expectedTypes[i];

            }

        }


        private void testComments() {

            String input = "var x # this is a comment\n/* block comment */ var y";

            List<Token> tokens = this.lexer.tokenize(input);

            

            // Comments should be skipped, only VAR, IDENTIFIER tokens remain

            int[] expectedTypes = {

                PyGoLexer.VAR, PyGoLexer.IDENTIFIER,

                PyGoLexer.VAR, PyGoLexer.IDENTIFIER, PyGoLexer.EOF

            };

            

            for (int i = 0; i < expectedTypes.length; i++) {

                assert tokens.get(i).getType() == expectedTypes[i];

            }

        }


        private void testComplexExpressions() {

            String input = "func calculate(x: int, y: float) -> float:";

            List<Token> tokens = this.lexer.tokenize(input);

            

            int[] expectedTypes = {

                PyGoLexer.FUNC, PyGoLexer.IDENTIFIER, PyGoLexer.LEFT_PAREN,

                PyGoLexer.IDENTIFIER, PyGoLexer.COLON, PyGoLexer.INT_TYPE,

                PyGoLexer.COMMA, PyGoLexer.IDENTIFIER, PyGoLexer.COLON,

                PyGoLexer.FLOAT_TYPE, PyGoLexer.RIGHT_PAREN, PyGoLexer.ARROW,

                PyGoLexer.FLOAT_TYPE, PyGoLexer.COLON, PyGoLexer.EOF

            };

            

            for (int i = 0; i < expectedTypes.length; i++) {

                assert tokens.get(i).getType() == expectedTypes[i];

            }

        }

    }


LEXER PERFORMANCE CONSIDERATIONS


The generated ANTLR lexer provides excellent performance for most use cases, but understanding its behavior helps optimize compilation speed for large source files. The lexer processes input characters sequentially using finite automata, providing linear time complexity for tokenization.


Memory usage scales with input size since the lexer maintains internal buffers for character processing and token creation. For very large files, streaming approaches can reduce memory consumption while maintaining good performance characteristics.
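ANTLR's runtime ships unbuffered stream classes for exactly this purpose. The sketch below is a minimal illustration of that documented pattern applied to PyGo; the file name is a placeholder:

    import org.antlr.v4.runtime.*;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class LargeFileTokenizer {
        public static void main(String[] args) throws IOException {
            // Unbuffered streams keep only a small moving window of the file in
            // memory; tokens must copy their text out of that window, hence the
            // CommonTokenFactory(true) setting.
            try (FileInputStream in = new FileInputStream("large_program.pygo")) {
                CharStream chars = new UnbufferedCharStream(in);
                PyGoLexer lexer = new PyGoLexer(chars);
                lexer.setTokenFactory(new CommonTokenFactory(true));
                TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lexer);
                // Consume tokens one at a time instead of buffering them all.
                while (tokens.LA(1) != Token.EOF) {
                    tokens.consume();
                }
            }
        }
    }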


The lexer's error recovery mechanisms ensure robust handling of malformed input without catastrophic failures. When encountering invalid characters or incomplete tokens, the lexer generates appropriate error messages and continues processing to find additional problems.


INTEGRATION WITH COMPILER PIPELINE


The PyGo lexer integrates seamlessly with the overall compiler architecture by providing a clean token stream interface that the parser can consume. The lexer handles all low-level character processing details while exposing only the essential token information needed for syntax analysis.


Token objects include type information, text content, and position data that enables precise error reporting throughout the compilation process. This information proves invaluable during parsing and semantic analysis phases where location-specific error messages significantly improve the development experience.
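All of this information is available directly from the ANTLR Token interface; for example (an illustrative fragment using the wrapper defined earlier, with an arbitrary input line):

    import org.antlr.v4.runtime.Token;

    public class TokenInfoDemo {
        public static void main(String[] args) {
            // The third token of "var x = 1" is EQUALS at line 1, column 6.
            Token token = new PyGoLexerWrapper().tokenize("var x = 1").get(2);
            System.out.printf("type=%s text='%s' line=%d column=%d%n",
                    PyGoLexer.VOCABULARY.getSymbolicName(token.getType()),
                    token.getText(), token.getLine(), token.getCharPositionInLine());
        }
    }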


The lexer's error handling integrates with the compiler's overall error management system, allowing consistent error reporting across all compilation phases. This unified approach ensures that programmers receive coherent feedback regardless of where problems occur in their source code.


CONCLUSION OF ARTICLE 2


This article has demonstrated the complete implementation of a PyGo lexer using ANTLR v4. The lexer handles all PyGo tokens including keywords, identifiers, literals, operators, and delimiters while providing robust error handling and recovery mechanisms.


The ANTLR-generated lexer provides excellent performance and maintainability compared to hand-written alternatives. The grammar-driven approach ensures consistency and makes it easy to modify the lexer as the PyGo language evolves.


The testing framework validates lexer correctness across various input patterns and edge cases. Comprehensive testing ensures that the lexer reliably processes PyGo source code and provides meaningful error messages for invalid input.


In Article 3, we will build upon this lexer foundation to implement a complete PyGo parser using ANTLR v4. The parser will consume the token stream produced by this lexer and construct Abstract Syntax Trees that represent the structure of PyGo programs.
