A Simple ANTLR Lexer

Let’s build a lexer for a language with the following tokens:

Here is how to do this in ANTLR (you can download the file here):

// The two lines below are necessary but are actually about
// parsing, which we'll discuss later

grammar MyLanguageV0;
program : ;


// Reserved Keywords
////////////////////////////////

IF: 'if';
ENDIF: 'endif';
PRINT: 'print';
INT: 'int';

// Operators
PLUS: '+';
EQUAL: '==';
ASSIGN: '=';

// Semicolon and parentheses
SEMICOLON: ';';
LPAREN: '(';
RPAREN: ')';

// Integers
INTEGER: [0-9][0-9]*;

// Variable names
NAME: [a-z]+;   

// Ignore all white spaces 
WS: [ \t\r\n]+ -> skip ; 

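The order of the rules above matters. When several rules match at the current position, ANTLR picks the one that matches the longest lexeme, and if two rules match the same lexeme, the rule listed first wins. For instance, putting the rule for IF after the rule for NAME would mean that ‘if’ would be recognized as a NAME token rather than an IF token. This could be a feature or a bug (likely a bug, as most modern languages treat keywords as reserved).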
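To make the role of rule order concrete, here is a minimal Python sketch of a "longest match, earliest rule wins on ties" tokenizer. It is not how ANTLR is implemented; the RULES table and the tokenize function are hypothetical names introduced for illustration, and the table simply mirrors the lexer rules of the grammar above. Unlike ANTLR, which reports a token recognition error and keeps going, this sketch raises an exception on an unrecognized character.

```python
import re

# Hypothetical rule table mirroring the grammar above; the order matters.
RULES = [
    ("IF", r"if"),
    ("ENDIF", r"endif"),
    ("PRINT", r"print"),
    ("INT", r"int"),
    ("PLUS", r"\+"),
    ("EQUAL", r"=="),
    ("ASSIGN", r"="),
    ("SEMICOLON", r";"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("INTEGER", r"[0-9]+"),
    ("NAME", r"[a-z]+"),
    ("WS", r"[ \t\r\n]+"),
]


def tokenize(text, rules=RULES):
    """Return (token type, lexeme) pairs, skipping WS tokens."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            m = regex.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the rule listed earlier is kept.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at index {pos}: {text[pos]!r}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    # With the original order, 'if' is a keyword but 'iffy' is a NAME.
    print(tokenize("if iffy endif"))
    # Moving NAME to the front makes 'if' come out as a NAME token.
    name_first = [("NAME", r"[a-z]+")] + [r for r in RULES if r[0] != "NAME"]
    print(tokenize("if", name_first))
```

In this sketch, ‘iffy’ is a NAME even though it starts with ‘if’, because the longer match is preferred before rule order is considered. Rule order only decides between rules that match exactly the same lexeme, such as IF and NAME on ‘if’.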

Let us try to run ANTLR on a source file named sourcecode.txt with content:

int a ; ; ; print a == != b + 123
whatever 43 if endif if 

Note that this source file is likely not syntactically correct for any useful language, but it can be lexed as long as its tokens are recognized. Said differently, the source file contains valid “words” even though it likely contains no valid “sentence”.

% java -jar ./antlr-4.13.2-complete.jar MyLanguageV0.g4

% javac -cp .:antlr-4.13.2-complete.jar MyLanguageV0*.java

% java -cp .:antlr-4.13.2-complete.jar org.antlr.v4.gui.TestRig MyLanguageV0 program sourcecode.txt -tokens
[@0,0:2='int',<6>,1:0]
[@1,4:4='a',<13>,1:4]
[@2,6:6=';',<11>,1:6]
[@3,8:8=';',<11>,1:8]
[@4,10:10=';',<11>,1:10]
[@5,12:16='print',<1>,1:12]
[@6,18:18='a',<13>,1:18]
[@7,20:21='==',<8>,1:20]
[@8,23:24='!=',<10>,1:23]
[@9,26:26='b',<13>,1:26]
[@10,28:28='+',<7>,1:28]
[@11,30:32='123',<12>,1:30]
[@12,35:42='whatever',<13>,2:0]
[@13,44:45='43',<12>,2:9]
[@14,47:48='if',<3>,2:12]
[@15,50:54='endif',<4>,2:15]
[@16,56:57='if',<3>,2:21]
[@17,61:60='<EOF>',<-1>,4:0]

We see that 18 tokens are recognized (the last one is the end-of-file token). Each token type is assigned a number, shown between angle brackets. For each token, ANTLR indicates the token's index, its start and end character indices in the source code, its lexeme, and its line and column numbers in the source file.
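As a reading aid for the output above, the following Python sketch splits one printed line into its fields. It is only an illustration: TokenRecord, TOKEN_LINE, and parse_token_line are hypothetical names, not part of ANTLR, and the pattern assumes exactly the line shape shown above (index, start:stop, quoted lexeme, numeric type, line:column).

```python
import re
from typing import NamedTuple


class TokenRecord(NamedTuple):
    number: int    # position of the token in the token stream
    start: int     # index of the first character of the lexeme
    stop: int      # index of the last character of the lexeme (inclusive)
    text: str      # the lexeme
    type_id: int   # numeric token type (-1 for end of file)
    line: int      # line number, starting at 1
    column: int    # column number, starting at 0


# Assumed line shape: [@number,start:stop='text',<type>,line:column]
TOKEN_LINE = re.compile(
    r"\[@(?P<number>\d+),(?P<start>-?\d+):(?P<stop>-?\d+)="
    r"'(?P<text>.*)',<(?P<type_id>-?\d+)>,(?P<line>\d+):(?P<column>\d+)\]"
)


def parse_token_line(line):
    """Parse one line of the -tokens output into a TokenRecord."""
    m = TOKEN_LINE.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"unrecognized token line: {line!r}")
    return TokenRecord(
        number=int(m.group("number")),
        start=int(m.group("start")),
        stop=int(m.group("stop")),
        text=m.group("text"),
        type_id=int(m.group("type_id")),
        line=int(m.group("line")),
        column=int(m.group("column")),
    )


if __name__ == "__main__":
    tok = parse_token_line("[@7,20:21='==',<8>,1:20]")
    # The stop index is inclusive, so the lexeme length is stop - start + 1.
    print(tok.text, tok.type_id, tok.stop - tok.start + 1)
```

Because the stop index is inclusive, the lexeme ‘int’ spans 0:2. The end-of-file token has a stop index one less than its start index, which corresponds to an empty lexeme.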