A simple ANTLR lexer

We wish to build a lexer for a language with the following tokens:

Reserved keywords: int, if, endif, while, endwhile, print
An addition operator: ‘+’
An assignment operator: ‘=’
An equality operator: ‘==’
A not-equal operator: ‘!=’
Integers
Variable names as strings of lower-case letters
Semicolons for terminating statements
Left and right parentheses
The ability to ignore whitespace (spaces, tabs, carriage returns, newlines)

This is how it is done in ANTLR, in a grammar file named MyLexer.g4:

// The two lines below are necessary but are actually about
// parsing, which we'll discuss later

grammar MyLexer;
program : ;

// Reserved Keywords
////////////////////////////////

WHILE: 'while';
ENDWHILE: 'endwhile';
IF: 'if';
ENDIF: 'endif';
PRINT: 'print';
INT: 'int';

// Operators
PLUS: '+';
EQUAL: '==';
ASSIGN: '=';
NOTEQUAL: '!=';

// Semicolon
SEMICOLON: ';';

// Integers
INTEGER: [0-9][0-9]*;

// Variable names
NAME: [a-z]+;

// Parentheses
LPAREN: '(';
RPAREN: ')';

// Ignore all white spaces 
WS: [ \t\r\n]+ -> skip ;

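The order of the rules above matters. At each position ANTLR picks the rule that matches the longest lexeme; when several rules match the same lexeme, the rule listed first wins.

For instance, putting the rule for IF after the rule for NAME would mean that ‘if’ would be recognized as a NAME token rather than an IF token, because both rules match the same two characters.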
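
The effect of rule order can be reproduced outside ANTLR with a small simulation. The Java sketch below is not ANTLR's implementation: it is a simplified stand-in built on java.util.regex, and the class name RuleOrderDemo and the method lex are hypothetical names chosen for illustration. It assumes only the behavior described above: the longest match wins, and ties go to the rule listed first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RuleOrderDemo {
    // Tokenize input with rules given as name -> regex, in priority order.
    // At each position the longest match wins; on a tie the earlier rule wins.
    static List<String> lex(Map<String, String> rules, String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestLen = 0;
            for (Map.Entry<String, String> rule : rules.entrySet()) {
                Matcher m = Pattern.compile(rule.getValue()).matcher(input);
                m.region(pos, input.length());
                // Strictly greater: an earlier rule keeps the win on equal length.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestName = rule.getKey();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            if (!bestName.equals("WS")) { // mimic "-> skip" for whitespace
                tokens.add(bestName + ":" + input.substring(pos, pos + bestLen));
            }
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, String> keywordFirst = new LinkedHashMap<>();
        keywordFirst.put("IF", "if");
        keywordFirst.put("NAME", "[a-z]+");
        keywordFirst.put("WS", "[ \\t\\r\\n]+");

        Map<String, String> nameFirst = new LinkedHashMap<>();
        nameFirst.put("NAME", "[a-z]+");
        nameFirst.put("IF", "if");
        nameFirst.put("WS", "[ \\t\\r\\n]+");

        System.out.println(lex(keywordFirst, "if iffy")); // [IF:if, NAME:iffy]
        System.out.println(lex(nameFirst, "if iffy"));    // [NAME:if, NAME:iffy]
    }
}
```

With IF listed first, ‘if’ becomes an IF token while ‘iffy’ is still a NAME, since NAME matches more characters there. With NAME listed first, the IF rule can never win.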

Let us try to run ANTLR on a source file named sourcecode.txt with content:

int a ; ; ; while a == != b + 123
whatever 43 if endif if

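Note that this source file is likely not syntactically correct for any useful language, but it can be lexed as long as its tokens are recognized.

We generate the lexer from the grammar, compile it, and run ANTLR's TestRig with the -tokens option to print the token stream: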

% java  -cp .:antlr-4.4-complete.jar -jar ~/ANTLR/antlr-4.4-complete.jar MyLexer.g4

% javac  -cp .:antlr-4.4-complete.jar MyLexer*.java

% java  -cp .:antlr-4.4-complete.jar org.antlr.v4.runtime.misc.TestRig MyLexer program sourcecode.txt -tokens
[@0,0:2='int',<6>,1:0]
[@1,4:4='a',<13>,1:4]
[@2,6:6=';',<11>,1:6]
[@3,8:8=';',<11>,1:8]
[@4,10:10=';',<11>,1:10]
[@5,12:16='while',<1>,1:12]
[@6,18:18='a',<13>,1:18]
[@7,20:21='==',<8>,1:20]
[@8,23:24='!=',<10>,1:23]
[@9,26:26='b',<13>,1:26]
[@10,28:28='+',<7>,1:28]
[@11,30:32='123',<12>,1:30]
[@12,35:42='whatever',<13>,2:0]
[@13,44:45='43',<12>,2:9]
[@14,47:48='if',<3>,2:12]
[@15,50:54='endif',<4>,2:15]
[@16,56:57='if',<3>,2:21]
[@17,61:60='<EOF>',<-1>,4:0]

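We see that 18 tokens are recognized, the last being the end-of-file token. Each token type is assigned a number, shown in angle brackets (<-1> denotes end of file). For each token, ANTLR prints its index in the token stream, its start and end character indices in the source, its lexeme, its type, and its line and column in the source file. For example, [@7,20:21='==',<8>,1:20] is token number 7, spans characters 20 to 21, has lexeme ‘==’ and type 8, and starts at line 1, column 20.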
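
To make the layout of these lines concrete, here is a minimal Java sketch that splits one printed token line into its fields. The class name TokenLine is hypothetical and not part of ANTLR. The regular expression assumes the exact layout shown in the output above, and it would not match tokens on a non-default channel, which are printed with an extra channel field.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenLine {
    // Layout assumed from the output above:
    // [@index,start:stop='text',<type>,line:column]
    private static final Pattern FORMAT = Pattern.compile(
        "\\[@(\\d+),(\\d+):(\\d+)='(.*)',<(-?\\d+)>,(\\d+):(\\d+)\\]");

    final int index;   // position of the token in the token stream
    final int start;   // index of the first character of the lexeme
    final int stop;    // index of the last character of the lexeme
    final String text; // the lexeme itself
    final int type;    // token type number (-1 for end of file)
    final int line;    // line number, starting at 1
    final int column;  // column within the line, starting at 0

    TokenLine(String printed) {
        Matcher m = FORMAT.matcher(printed.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected token line: " + printed);
        }
        index = Integer.parseInt(m.group(1));
        start = Integer.parseInt(m.group(2));
        stop = Integer.parseInt(m.group(3));
        text = m.group(4);
        type = Integer.parseInt(m.group(5));
        line = Integer.parseInt(m.group(6));
        column = Integer.parseInt(m.group(7));
    }

    public static void main(String[] args) {
        TokenLine t = new TokenLine("[@7,20:21='==',<8>,1:20]");
        // prints: token 7: '==' type 8, chars 20..21, line 1, column 20
        System.out.println("token " + t.index + ": '" + t.text + "' type " + t.type
            + ", chars " + t.start + ".." + t.stop
            + ", line " + t.line + ", column " + t.column);
    }
}
```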