
1. Introduction
Lexical analysis, also known as lexing or scanning, is the first phase of the compilation process. It takes a sequence of characters and converts them into meaningful units called tokens, which are the building blocks for further syntactic and semantic analysis.
In this tutorial, we’ll explore the fundamentals of lexical analysis and build a simple arithmetic lexer in Java. A well-designed lexer improves a compiler’s efficiency and maintainability by structuring raw input into clearly defined tokens.
2. What Is a Lexical Analyzer?
A lexical analyzer, or lexer, reads a stream of characters and organizes them into units called lexemes.
A lexeme is a raw sequence of characters that matches the pattern for a token. Pairing the lexeme's value with a category, such as a keyword, an operator, or a literal, creates a token. In other words, a token is a structured representation of a lexeme, consisting of a type and an optional attribute value.
To illustrate, the lexemes in the expression “1 + 2” are “1,” “+,” and “2”. The lexer categorizes these and produces the following tokens: NUMBER(1), OPERATOR(+), and NUMBER(2).
Let’s build a simple arithmetic lexical analyzer to demonstrate the described approach.
3. Building an Arithmetic Lexical Analyzer
First, we need to define the scope of our lexical analyzer.
The lexer should recognize integer numbers and the arithmetic operators +, -, *, and /. The Grammar enum contains all the supported symbols and provides utility methods for character classification:
private enum Grammar {
    ADDITION('+'),
    SUBTRACTION('-'),
    MULTIPLICATION('*'),
    DIVISION('/');

    private final char symbol;

    Grammar(char symbol) {
        this.symbol = symbol;
    }

    public static boolean isOperator(char character) {
        return Arrays.stream(Grammar.values())
          .anyMatch(grammar -> grammar.symbol == character);
    }

    public static boolean isDigit(char character) {
        return Character.isDigit(character);
    }

    public static boolean isWhitespace(char character) {
        return Character.isWhitespace(character);
    }

    public static boolean isValidSymbol(char character) {
        return isOperator(character) || isWhitespace(character) || isDigit(character);
    }
}
Next, we wrap the input string in a simple Expression utility class. It allows the lexer to iterate over the characters of an expression one at a time. Additionally, the hasNext() method checks if more characters are left to process in the expression:
public class Expression {
    private final String value;
    private int index = 0;

    public Expression(String value) {
        this.value = value != null ? value : "";
    }

    public Optional<Character> next() {
        if (index >= value.length()) {
            return Optional.empty();
        }
        return Optional.of(value.charAt(index++));
    }

    public boolean hasNext() {
        return index < value.length();
    }

    // standard getter
}
Finally, we need a representation for tokens. We can use a single Token class, or an abstract base class with two concrete implementations, one for numbers and one for operators. Even though our lexer is relatively simple, the latter approach gives a clearer structure and makes the lexer easier to extend if we need to support more token types in the future.
It also gives us the flexibility to add token-specific methods for handling behaviors unique to each type, such as converting values to different formats or calculating operator precedence:
public abstract class Token {
    private final String value;
    private final TokenType type;

    protected Token(TokenType type, String value) {
        this.type = type;
        this.value = value;
    }

    public TokenType getType() {
        return type;
    }

    public String getValue() {
        return value;
    }
}

public class TokenNumber extends Token {
    protected TokenNumber(String value) {
        super(TokenType.NUMBER, value);
    }

    public int getValueAsInt() {
        return Integer.parseInt(getValue());
    }
}

public class TokenOperator extends Token {
    protected TokenOperator(String value) {
        super(TokenType.OPERATOR, value);
    }
}
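The TokenType enum referenced above isn't shown in the snippets; a minimal version with our two categories is all we need:

public enum TokenType {
    NUMBER,
    OPERATOR
}

As an illustration of a token-specific method, a hypothetical getPrecedence() on TokenOperator could later drive expression evaluation:

public int getPrecedence() {
    // '*' and '/' bind more tightly than '+' and '-'
    return "*/".contains(getValue()) ? 2 : 1;
}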
At this point, we can create the main class for lexical analysis.
4. The Lexer Class
We’ll start by building the tokenize() method that processes the input expression and generates a list of tokens. The method needs to read character by character from the input until it can identify the next lexeme and produce the token for it.
For simplicity, we won’t allow input strings to start with an operator, meaning the first character of a valid expression must be a digit. In other words, our lexer doesn’t support unary (prefix) operators such as negation.
There are various approaches to building a lexer, so let’s explore a few.
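Both implementations can share a common Lexer interface, which the test class in Section 5 also uses as its field type; a minimal sketch:

public interface Lexer {
    List<Token> tokenize(Expression expression);
}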
4.1. The Finite State Machine (FSM) Lexer
We can build the lexer using a state machine. We’ll create a State enum to track the current state of tokenization:
private enum State {
    INITIAL,
    NUMBER,
    OPERATOR,
    INVALID
}
The INITIAL state is the starting one. Here, the lexer expects a digit, an operator, or whitespace. In the case of whitespace, processing simply continues with the next character. If the lexer encounters a digit, it moves to the NUMBER state, indicating that it's processing a single digit or a sequence of digits forming a number.
Likewise, the OPERATOR state indicates that the lexer is processing an operator. Additionally, we must add a check to ensure no expressions start with an OPERATOR. The last state handles invalid characters or unexpected symbols that don’t match valid token patterns:
public List<Token> tokenize(Expression expression) {
    State state = State.INITIAL;
    StringBuilder currentToken = new StringBuilder();
    List<Token> tokens = new ArrayList<>();

    while (expression.hasNext()) {
        final Character currentChar = getValidNextCharacter(expression);
        switch (state) {
            case INITIAL:
                if (Grammar.isWhitespace(currentChar)) {
                    break;
                } else if (Grammar.isDigit(currentChar)) {
                    state = State.NUMBER;
                    currentToken.append(currentChar);
                } else if (Grammar.isOperator(currentChar) && !tokens.isEmpty()) { // no expression may start with an operator
                    state = State.OPERATOR;
                    currentToken.append(currentChar);
                } else {
                    state = State.INVALID;
                    currentToken.append(currentChar);
                }
                break;
            case NUMBER:
                if (Grammar.isDigit(currentChar)) {
                    currentToken.append(currentChar);
                } else {
                    tokens.add(new TokenNumber(currentToken.toString()));
                    currentToken.setLength(0);
                    // the character that ended the number may already start the next token
                    if (Grammar.isOperator(currentChar)) {
                        state = State.OPERATOR;
                        currentToken.append(currentChar);
                    } else if (Grammar.isWhitespace(currentChar)) {
                        state = State.INITIAL;
                    } else {
                        state = State.INVALID;
                        currentToken.append(currentChar);
                    }
                }
                break;
            case OPERATOR:
                tokens.add(new TokenOperator(currentToken.toString()));
                currentToken.setLength(0);
                // likewise, dispatch the character that follows the operator
                if (Grammar.isDigit(currentChar)) {
                    state = State.NUMBER;
                    currentToken.append(currentChar);
                } else if (Grammar.isWhitespace(currentChar)) {
                    state = State.INITIAL;
                } else {
                    state = State.INVALID;
                    currentToken.append(currentChar);
                }
                break;
            case INVALID:
                throw new InvalidExpressionException(String.format(MESSAGE_ERROR, currentToken));
        }
    }

    finalizeToken(state, currentToken, tokens);
    return tokens;
}
In the NUMBER state, the lexer keeps appending digits to the current token so that multi-digit numbers form a single lexeme. As soon as a non-digit arrives, it emits a TokenNumber, resets the buffer, and dispatches on that character, since it may already start the next token. The OPERATOR state works the same way: because our grammar has no multicharacter operators (such as increment or decrement operators), the lexer emits a TokenOperator as soon as the next character arrives and then routes that character to the appropriate state.
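The loop also relies on a getValidNextCharacter() helper that isn't shown above. A minimal sketch, assuming it simply unwraps the Optional returned by Expression.next() (unsupported symbols are routed to the INVALID state by the switch, not rejected here):

private static Character getValidNextCharacter(Expression expression) {
    // hasNext() is checked before each call, so an empty Optional signals an unexpected end of input
    return expression.next()
      .orElseThrow(() -> new InvalidExpressionException(String.format(MESSAGE_ERROR, expression.getValue())));
}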
The finalizeToken() method ensures the lexer handles the last token correctly when the input ends. If the lexer ends in a valid NUMBER state, it adds the token to the list. Otherwise, if it ends in an INVALID or OPERATOR state, it throws an exception indicating an error in tokenization:
private static void finalizeToken(State state, StringBuilder currentToken, List<Token> tokens) {
    if (State.INVALID == state || State.OPERATOR == state) {
        throw new InvalidExpressionException(String.format(MESSAGE_ERROR, currentToken));
    } else if (State.NUMBER == state) {
        tokens.add(new TokenNumber(currentToken.toString()));
    }
}
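Both lexers also reference a custom InvalidExpressionException and a MESSAGE_ERROR format string that we haven't shown. Minimal definitions could look like the following, with the message template being an assumption; the exception must be unchecked, since tokenize() declares no checked exceptions:

public class InvalidExpressionException extends RuntimeException {
    public InvalidExpressionException(String message) {
        super(message);
    }
}

private static final String MESSAGE_ERROR = "Invalid expression or token: %s";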
The regex-based approach is an alternative solution to the FSM lexer and is suitable for more straightforward scenarios.
4.2. The Regex Lexer
To define a lexer that uses regular expressions to tokenize an input string, we first need to define the necessary regexes:
private static final String NUMBER_REGEX = "\\d+";
private static final String OPERATOR_REGEX = "[+\\-*/]";
private static final String WHITESPACE_REGEX = "\\s+";
private static final String VALID_EXPRESSION_REGEX =
    "^\\s*(" + NUMBER_REGEX + "(\\s*" + OPERATOR_REGEX + "\\s*" + NUMBER_REGEX + ")*)?\\s*$";
private static final Pattern TOKEN_PATTERN = Pattern.compile(NUMBER_REGEX + "|" + OPERATOR_REGEX + "|" + WHITESPACE_REGEX);
The VALID_EXPRESSION_REGEX validates the overall shape of the input, while TOKEN_PATTERN combines the regexes for numbers, operators, and whitespace into a single pattern. The tokenize() method uses a Matcher to find matches of this pattern in the input string and creates the corresponding tokens (either TokenNumber or TokenOperator):
public List<Token> tokenize(Expression expression) {
    List<Token> tokens = new ArrayList<>();
    if (!expression.getValue().matches(VALID_EXPRESSION_REGEX)) {
        throw new InvalidExpressionException(String.format(MESSAGE_ERROR, expression.getValue()));
    }

    Matcher matcher = TOKEN_PATTERN.matcher(expression.getValue());
    while (matcher.find()) {
        String match = matcher.group();
        createToken(match).ifPresent(tokens::add);
    }
    return tokens;
}
private static Optional<Token> createToken(String match) {
    if (match.matches(NUMBER_REGEX)) {
        return Optional.of(new TokenNumber(match));
    } else if (match.matches(OPERATOR_REGEX)) {
        return Optional.of(new TokenOperator(match));
    } else if (match.matches(WHITESPACE_REGEX)) {
        return Optional.empty();
    } else {
        throw new InvalidExpressionException(String.format(MESSAGE_ERROR, match));
    }
}
If a match is whitespace, it’s ignored; if it doesn’t match any valid pattern, a custom InvalidExpressionException is thrown.
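Putting it together, the lexer also handles compact input without whitespace. Here's a quick usage sketch; the class name LexerRegex is an assumption, mirroring the LexerFsm naming used in the tests below:

Lexer lexer = new LexerRegex();
List<Token> tokens = lexer.tokenize(new Expression("1+2"));
// tokens: NUMBER("1"), OPERATOR("+"), NUMBER("2")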
5. Testing the Lexer
Let’s now test our lexers. We can easily switch between different implementations by modifying the field in the test class. To verify that the lexer throws an exception when it encounters invalid input, we’ll create a parameterized test and provide a set of incorrect expressions:
private final Lexer lexer = new LexerFsm();

@ParameterizedTest
@ValueSource(strings = { "1 + 2 $ 3", "1 - 2 #", "- 1 + 2", "+ 1 2" })
void givenInputContainsInvalidCharacters_whenTokenize_thenExceptionThrown(String input) {
    assertThrows(Exception.class, () -> lexer.tokenize(new Expression(input)));
}
The test passes since none of the provided values is a valid expression.
Next, let’s check whether the lexer returns the correct tokens for a simple expression:
@Test
void givenInputIsSimpleExpression_whenTokenize_thenValidTokensIsReturned() {
    String input = "3 + 50";

    List<Token> tokens = lexer.tokenize(new Expression(input));

    assertAll(() -> assertEquals(3, tokens.size(), "Token count mismatch"),
      () -> assertToken(tokens.get(0), TokenType.NUMBER, "3"),
      () -> assertToken(tokens.get(1), TokenType.OPERATOR, "+"),
      () -> assertToken(tokens.get(2), TokenType.NUMBER, "50"));
}

private void assertToken(Token token, TokenType expectedType, String expectedValue) {
    assertAll(() -> assertEquals(expectedType, token.getType(), "Token type mismatch"),
      () -> assertEquals(expectedValue, token.getValue(), "Token value mismatch"));
}
First, we check that the token count matches the expected number of tokens. Then, using the assertToken() helper method, we verify that each token has the correct type and value: the first token is the number "3", the second is the operator "+", and the third is the number "50".
Lastly, let’s make sure our lexer correctly processes the empty string:
@Test
void givenInputIsEmptyExpression_whenTokenize_thenEmptyListIsReturned() {
    String input = "";

    List<Token> tokens = lexer.tokenize(new Expression(input));

    assertTrue(tokens.isEmpty(), "Lexer should return an empty list when the input expression is empty");
}
This test verifies that the lexer returns an empty token list, rather than failing, when the input expression is empty.
6. Conclusion
In this article, we learned the concepts of lexical analysis and demonstrated how to build a simple arithmetic lexer in Java. We implemented lexers using an FSM and a Regex approach to illustrate different ways of tokenizing input.
The latter is straightforward and effective for handling simple token patterns, such as numbers and operators. On the other hand, FSMs provide greater flexibility and are more suitable for handling complex tasks, such as distinguishing between different operator types or managing nested structures.
Finally, we looked at a set of basic tests to ensure our lexers work correctly.
As always, the complete source code is available over on GitHub.