We wish to build a lexer for a language with the following tokens:
Here is how this is done in ANTLR (you can download the file here):
// The two lines below are necessary but are actually about
// parsing, which we'll discuss later
grammar MyLexer;
program : ;
// Reserved Keywords
IF: 'if';
ENDIF: 'endif';
PRINT: 'print';
INT: 'int';
// Operators
PLUS: '+';
EQUAL: '==';
ASSIGN: '=';
// Semicolon and parentheses
SEMICOLON: ';';
LPAREN: '(';
RPAREN: ')';
// Integers
INTEGER: [0-9][0-9]*;
// Variable names
NAME: [a-z]+;
// Ignore all white spaces
WS: [ \t\r\n]+ -> skip ;
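Note that the order of the rules matters: when several rules match the same longest lexeme, ANTLR picks the rule that appears first in the grammar. For instance, putting the rule for IF after the rule for NAME would mean that ‘if’ would be recognized as a NAME token rather than an IF token.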
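To make the effect of rule order concrete, here is a minimal sketch in plain Java. It is not ANTLR code and not how ANTLR is implemented; it only mimics the policy an ANTLR lexer follows: at each position the longest match wins, and a tie goes to the rule listed first. The class name RuleOrderDemo and its helper methods are hypothetical, and only a few of the rules above are reproduced.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a generated lexer, for illustration only.
class RuleOrderDemo {

    // Builds an ordered rule table from (name, regex) pairs.
    static Map<String, String> rules(String... pairs) {
        Map<String, String> table = new LinkedHashMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            table.put(pairs[i], pairs[i + 1]);
        }
        return table;
    }

    // Tokenizes input: longest match wins, ties go to the earliest rule.
    static List<String> tokenize(Map<String, String> rules, String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestLen = 0;
            for (Map.Entry<String, String> rule : rules.entrySet()) {
                Matcher m = Pattern.compile(rule.getValue()).matcher(input);
                m.region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so on a tie the rule declared first is kept.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestName = rule.getKey();
                    bestLen = m.end() - pos;
                }
            }
            if (bestName == null) {
                pos++; // no rule matches: drop one character and go on
                continue;
            }
            if (!bestName.equals("WS")) { // white space is skipped
                tokens.add(bestName + ":" + input.substring(pos, pos + bestLen));
            }
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "if iffy a == b";
        Map<String, String> keywordsFirst = rules(
            "IF", "if", "EQUAL", "==", "ASSIGN", "=",
            "NAME", "[a-z]+", "WS", "[ \\t\\r\\n]+");
        Map<String, String> nameFirst = rules(
            "NAME", "[a-z]+", "IF", "if", "EQUAL", "==", "ASSIGN", "=",
            "WS", "[ \\t\\r\\n]+");
        // prints [IF:if, NAME:iffy, NAME:a, EQUAL:==, NAME:b]
        System.out.println(tokenize(keywordsFirst, input));
        // prints [NAME:if, NAME:iffy, NAME:a, EQUAL:==, NAME:b]
        System.out.println(tokenize(nameFirst, input));
    }
}
```

In both runs 'iffy' is a NAME and '==' is an EQUAL, because the longest match is preferred regardless of order. Only the lexeme 'if', which IF and NAME match with the same length, depends on which rule comes first.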
Let us try to run ANTLR on a source file named sourcecode.txt with content:
int a ; ; ; while a == != b + 123
whatever 43 if endif if
Note that this source file is likely not syntactically correct for a useful language, but it can be lexed as long as tokens are recognized.
We generate the lexer, compile it, and run it on the source file:
% java -cp .:antlr-4.4-complete.jar -jar ~/ANTLR/antlr-4.4-complete.jar MyLexer.g4
% javac -cp .:antlr-4.4-complete.jar MyLexer*.java
% java -cp .:antlr-4.4-complete.jar org.antlr.v4.runtime.misc.TestRig MyLexer program sourcecode.txt -tokens
We see that 18 tokens are recognized. Each token type is assigned a number. For each token, ANTLR indicates the start and end character indices in the source code, the lexeme of the token, and the line and column numbers in the source file.
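The following plain-Java sketch shows how the positions reported for a token relate to the source text. It is not ANTLR code: the class TokenPositionDemo and its locate method are hypothetical. It assumes the conventions ANTLR uses, namely character indices counted from 0 with an inclusive end index, line numbers starting at 1, and column numbers starting at 0. It also assumes the source file uses '\n' line endings and has no leading or trailing spaces.

```java
// Hypothetical helper, for illustration only.
class TokenPositionDemo {

    // Returns {start, stop, line, column} for the first occurrence of
    // lexeme in source at or after index 'from'.
    static int[] locate(String source, String lexeme, int from) {
        int start = source.indexOf(lexeme, from);
        if (start < 0) {
            throw new IllegalArgumentException("lexeme not found: " + lexeme);
        }
        int stop = start + lexeme.length() - 1; // inclusive end index
        int line = 1;                           // lines are counted from 1
        int lineStart = 0;                      // index of first char of the line
        for (int i = 0; i < start; i++) {
            if (source.charAt(i) == '\n') {
                line++;
                lineStart = i + 1;
            }
        }
        return new int[] { start, stop, line, start - lineStart };
    }

    static String describe(String source, String lexeme) {
        int[] p = locate(source, lexeme, 0);
        return p[0] + ":" + p[1] + "='" + lexeme + "'," + p[2] + ":" + p[3];
    }

    public static void main(String[] args) {
        String source = "int a ; ; ; while a == != b + 123\n"
                      + "whatever 43 if endif if\n";
        System.out.println(describe(source, "int"));      // prints 0:2='int',1:0
        System.out.println(describe(source, "whatever")); // prints 34:41='whatever',2:0
    }
}
```

The first line of the source file is 33 characters long, so its newline sits at index 33 and 'whatever' begins at index 34, on line 2, column 0.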