Lexer Implementation
The lexer (lexical analyzer) is the first phase of our interpreter. It transforms the source code into a sequence of tokens that can be processed by the parser.
Overview
The lexer implementation is contained in lexer.py and consists of three main components:
- Token handling
- The Lexer class
- Helper functions
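The listings in this document refer to `Token` and `TokenType` without showing their definitions. As a rough sketch only, assuming an `Enum` for the token types and a frozen dataclass for tokens, the token module might look like the following. The member names are collected from the snippets in this document, except `INT`, which is an assumed name for integer literals.

```python
# Hypothetical sketch of the token definitions; the real module is not shown here.
from dataclasses import dataclass
from enum import Enum, auto


class TokenType(Enum):
    # Special tokens
    ILLEGAL = auto()
    EOF = auto()
    # Identifiers and literals
    IDENT = auto()
    INT = auto()  # assumed name for integer literals
    STRING = auto()
    # Operators
    ASSIGN = auto()
    PLUS = auto()
    MINUS = auto()
    BANG = auto()
    ASTERISK = auto()
    SLASH = auto()
    LT = auto()
    GT = auto()
    EQ = auto()
    NOT_EQ = auto()
    # Delimiters
    COMMA = auto()
    SEMICOLON = auto()
    LPAREN = auto()
    RPAREN = auto()
    LBRACE = auto()
    RBRACE = auto()
    # Keywords
    FUNCTION = auto()
    LET = auto()
    TRUE = auto()
    FALSE = auto()
    IF = auto()
    ELSE = auto()
    RETURN = auto()


@dataclass(frozen=True)
class Token:
    type: TokenType  # read as token.type by callers
    literal: str     # the token's text as produced by the lexer
```

A frozen dataclass gives value equality and a readable `repr` for free, which is convenient when printing or comparing tokens in tests.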
Token Handling
The lexer uses a separate tok.py module for token definitions and works with a predefined set of keywords:

    KEYWORDS = {
        'fn': TokenType.FUNCTION,
        'let': TokenType.LET,
        'true': TokenType.TRUE,
        'false': TokenType.FALSE,
        'if': TokenType.IF,
        'else': TokenType.ELSE,
        'return': TokenType.RETURN
    }

    def lookup_ident(literal: str) -> TokenType:
        """Check if the literal is a keyword, otherwise return IDENT"""
        return KEYWORDS.get(literal, TokenType.IDENT)

String Handling
The lexer includes special handling for string literals:

    def read_string(self) -> str:
        """Read and return a complete string literal"""
        position = self.position + 1  # skip opening quote
        while True:
            self.read_char()
            if self.ch == '"' or self.ch is None:
                break
        return self.input[position:self.position]

String literals are delimited by double quotes and converted to STRING tokens:

    # Example string token generation
    '"Hello, World!"' -> Token(STRING, "Hello, World!")
    '"Ancient One"' -> Token(STRING, "Ancient One")

Lexer Class
The Lexer class is responsible for converting source code into tokens. Here are its key components:
Initialization

    def __init__(self, input: str):
        self.input = input
        self.position = 0       # current position in input
        self.read_position = 0  # current reading position in input
        self.ch = None          # current char under examination
        self.read_char()

Character Reading

    from typing import Optional  # at the top of lexer.py

    def read_char(self):
        """Read the next character in the input"""
        if self.read_position >= len(self.input):
            self.ch = None  # None marks end of input
        else:
            self.ch = self.input[self.read_position]
        # Advance both indices even at end of input so slices stay correct
        self.position = self.read_position
        self.read_position += 1

    def peek_char(self) -> Optional[str]:
        """Look at the next character without advancing"""
        if self.read_position >= len(self.input):
            return None
        return self.input[self.read_position]

Token Generation
The main method next_token() handles token generation with special cases for:
- Single character tokens (operators, delimiters)
- Two-character tokens (==, !=)
- Identifiers and keywords
- Integer literals
- String literals
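
    def next_token(self) -> Token:
        """Determine and return the next token"""
        self.skip_whitespace()

        if self.ch is None:
            return Token(TokenType.EOF, "")

        # Token handling using a mapping for single character tokens
        token_map = {
            '=': self._handle_equal,
            '+': lambda: Token(TokenType.PLUS, self.ch),
            '-': lambda: Token(TokenType.MINUS, self.ch),
            '!': self._handle_bang,
            '/': lambda: Token(TokenType.SLASH, self.ch),
            '*': lambda: Token(TokenType.ASTERISK, self.ch),
            '<': lambda: Token(TokenType.LT, self.ch),
            '>': lambda: Token(TokenType.GT, self.ch),
            ';': lambda: Token(TokenType.SEMICOLON, self.ch),
            ',': lambda: Token(TokenType.COMMA, self.ch),
            '{': lambda: Token(TokenType.LBRACE, self.ch),
            '}': lambda: Token(TokenType.RBRACE, self.ch),
            '(': lambda: Token(TokenType.LPAREN, self.ch),
            ')': lambda: Token(TokenType.RPAREN, self.ch)
        }

        # Dispatch (a sketch: TokenType.INT and TokenType.ILLEGAL are
        # assumed member names)
        if self.ch in token_map:
            token = token_map[self.ch]()
        elif self.ch == '"':
            token = Token(TokenType.STRING, self.read_string())
        elif self._is_letter(self.ch):
            literal = self.read_identifier()
            return Token(lookup_ident(literal), literal)  # already advanced
        elif self._is_digit(self.ch):
            return Token(TokenType.INT, self.read_number())  # already advanced
        else:
            token = Token(TokenType.ILLEGAL, self.ch)

        self.read_char()  # step past the token's last character
        return token

Helper Methods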
The lexer includes several helper methods for specific token types:

    def _handle_equal(self) -> Token:
        """Handle both single '=' and '==' tokens"""
        if self.peek_char() == '=':
            ch = self.ch
            self.read_char()
            literal = ch + self.ch
            return Token(TokenType.EQ, literal)
        return Token(TokenType.ASSIGN, self.ch)

    def _handle_bang(self) -> Token:
        """Handle both single '!' and '!=' tokens"""
        if self.peek_char() == '=':
            ch = self.ch
            self.read_char()
            literal = ch + self.ch
            return Token(TokenType.NOT_EQ, literal)
        return Token(TokenType.BANG, self.ch)

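Both handlers rely on the same idea: look one character ahead, and consume it only if the pair forms a known operator. The standalone sketch below isolates that lookahead step. The function `split_operators` and the `TWO_CHAR` set are hypothetical names used for illustration and are not part of lexer.py.

```python
from typing import List

# Hypothetical table holding the two-character operators the lexer recognizes.
TWO_CHAR = {"==", "!="}


def split_operators(text: str) -> List[str]:
    """Split operator characters into lexemes, preferring two-character matches."""
    lexemes: List[str] = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Peek at the next character without consuming it.
        nxt = text[i + 1] if i + 1 < len(text) else None
        if nxt is not None and text[i] + nxt in TWO_CHAR:
            lexemes.append(text[i] + nxt)  # consume both characters
            i += 2
        else:
            lexemes.append(text[i])  # consume a single character
            i += 1
    return lexemes
```

The match is greedy: an input such as `===` splits into `==` followed by `=`, which mirrors how `_handle_equal` behaves when called on consecutive characters.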

    @staticmethod
    def _is_letter(ch: str) -> bool:
        """Check if the character is a valid letter"""
        return ch is not None and (
            ('a' <= ch <= 'z') or ('A' <= ch <= 'Z') or ch == '_'
        )

    @staticmethod
    def _is_digit(ch: str) -> bool:
        """Check if the character is a digit"""
        return ch is not None and '0' <= ch <= '9'

Implementation Details
Whitespace Handling

    def skip_whitespace(self):
        """Skip over whitespace characters"""
        while self.ch in [' ', '\t', '\n', '\r']:
            self.read_char()

Identifier and Number Reading

    def read_identifier(self) -> str:
        """Read and return a complete identifier"""
        position = self.position
        while self._is_letter(self.ch):
            self.read_char()
        return self.input[position:self.position]

    def read_number(self) -> str:
        """Read and return a complete number"""
        position = self.position
        while self._is_digit(self.ch):
            self.read_char()
        return self.input[position:self.position]

Best Practices
The lexer implementation follows these best practices:
- Clear separation of concerns with token handling in a separate module
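- A single pass over the input, one character at a time, with one character of lookahead
- Two-character tokens (==, !=) handled through peek_char() rather than backtracking
- Type hints on method signatures
- Unrecognized input surfaced as ILLEGAL tokens
- Identifiers and numbers extracted with one slice of the input instead of character-by-character concatenation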
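The ILLEGAL fallback named in the list is not shown in the listings in this document. A minimal sketch of the idea, assuming a lookup table with a default, could look like this. The names `SINGLE_CHAR` and `classify` are hypothetical, and the cut-down `TokenType` and `Token` here exist only to keep the example self-contained.

```python
from enum import Enum, auto
from typing import NamedTuple


class TokenType(Enum):
    PLUS = auto()
    SEMICOLON = auto()
    ILLEGAL = auto()


class Token(NamedTuple):
    type: TokenType
    literal: str


# Hypothetical table of known single-character tokens.
SINGLE_CHAR = {"+": TokenType.PLUS, ";": TokenType.SEMICOLON}


def classify(ch: str) -> Token:
    """Return the token for one character, falling back to ILLEGAL."""
    return Token(SINGLE_CHAR.get(ch, TokenType.ILLEGAL), ch)
```

Keeping the offending character as the literal lets a later stage, such as the parser, report exactly what could not be recognized.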
Example Usage
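
    # Module names follow the files mentioned in this document
    from lexer import Lexer
    from tok import TokenType

    def main():
        # Test the lexer
        input_code = '''
        let add = fn(x, y) {
            return x + y;
        };
        add(5, 10);
        '''
        lexer = Lexer(input_code)

        # Pull tokens until the lexer reports end of input
        while True:
            token = lexer.next_token()
            print(token)
            if token.type == TokenType.EOF:
                break

    if __name__ == "__main__":
        main()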