Channel: Baeldung

Pre-compile Regex Patterns Into Pattern Objects

1. Overview

In this tutorial, we'll see the benefits of pre-compiling a regex pattern and look at the new Pattern methods introduced in Java 8 and 11.

This will not be a regex how-to, but we have an excellent Guide To Java Regular Expressions API for that purpose.

2. Benefits

Reuse usually brings a performance gain, since we don't need to create instances of the same objects over and over again. So, reuse and performance are often linked.

Let's take a look at this principle as it pertains to Pattern#compile. We'll use a simple benchmark:

  1. We have a list with 5,000,000 numbers from 1 to 5,000,000
  2. Our regex will match even numbers

So, let's test matching these numbers with the following Java expressions:

  • String.matches(regex)
  • Pattern.matches(regex, charSequence)
  • Pattern.compile(regex).matcher(charSequence).matches()
  • Pre-compiled regex with many calls to preCompiledPattern.matcher(value).matches()
  • Pre-compiled regex with one Matcher instance and many calls to matcherFromPreCompiledPattern.reset(value).matches()
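
As a rough side-by-side sketch, the five forms could be written as follows. The class and method names are hypothetical, and the even-number regex \d*[02468] is an assumption, since the regex used in the benchmark isn't shown in this article.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; all names here are illustrative only.
class MatchingForms {

    // Assumed regex: optional digits followed by an even last digit.
    static final String EVEN = "\\d*[02468]";

    // Compiled once and reused by the last two forms.
    static final Pattern PRE_COMPILED = Pattern.compile(EVEN);

    // Form 1: String.matches(regex)
    static boolean stringMatches(String value) {
        return value.matches(EVEN);
    }

    // Form 2: Pattern.matches(regex, charSequence)
    static boolean patternMatches(String value) {
        return Pattern.matches(EVEN, value);
    }

    // Form 3: compile on every call, then match
    static boolean compileThenMatch(String value) {
        return Pattern.compile(EVEN).matcher(value).matches();
    }

    // Form 4: pre-compiled Pattern, new Matcher per value
    static boolean preCompiled(String value) {
        return PRE_COMPILED.matcher(value).matches();
    }

    // Form 5: one Matcher for the whole list, re-targeted with reset(value)
    static int countWithReusedMatcher(List<String> values) {
        Matcher matcher = PRE_COMPILED.matcher("");
        int count = 0;
        for (String value : values) {
            if (matcher.reset(value).matches()) {
                count++;
            }
        }
        return count;
    }
}
```

All five forms give the same answer for a given value; they differ only in how many Pattern and Matcher objects they create along the way.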

Actually, if we look at the implementation of String#matches:

public boolean matches(String regex) {
    return Pattern.matches(regex, this);
}

And at Pattern#matches:

public static boolean matches(String regex, CharSequence input) {
    Pattern p = compile(regex);
    Matcher m = p.matcher(input);
    return m.matches();
}

Then, we can expect the first three expressions to perform similarly. That's because the first expression calls the second, and the second performs exactly the same steps as the third.

The second point is that these methods don't reuse the Pattern and Matcher instances they create. As we'll see in the benchmark, this makes them roughly five to six times slower than reusing both:

    
@Benchmark
public void matcherFromPreCompiledPatternResetMatches(Blackhole bh) {
    for (String value : values) {
        bh.consume(matcherFromPreCompiledPattern.reset(value).matches());
    }
}

@Benchmark
public void preCompiledPatternMatcherMatches(Blackhole bh) {
    for (String value : values) {
        bh.consume(preCompiledPattern.matcher(value).matches());
    }
}

@Benchmark
public void patternCompileMatcherMatches(Blackhole bh) {
    for (String value : values) {
        bh.consume(Pattern.compile(PATTERN).matcher(value).matches());
    }
}

@Benchmark
public void patternMatches(Blackhole bh) {
    for (String value : values) {
        bh.consume(Pattern.matches(PATTERN, value));
    }
}

@Benchmark
public void stringMatchs(Blackhole bh) {
    for (String value : values) {
        bh.consume(value.matches(PATTERN));
    }
}
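
The benchmark methods above refer to fields (values, PATTERN, preCompiledPattern and matcherFromPreCompiledPattern) whose setup isn't shown. Here's a minimal plain-Java sketch of what that state might look like. The class name and the regex \d*[02468] are assumptions; in an actual JMH benchmark, this would typically live in a @State class, with the constructor body in a @Setup method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the benchmark state (no JMH dependency needed).
class BenchmarkFixture {

    // Assumed regex: a decimal number whose last digit is even.
    static final String PATTERN = "\\d*[02468]";

    final List<String> values;
    final Pattern preCompiledPattern;
    final Matcher matcherFromPreCompiledPattern;

    BenchmarkFixture(int count) {
        // The numbers 1..count as strings (the article uses 5,000,000).
        values = new ArrayList<>(count);
        for (int i = 1; i <= count; i++) {
            values.add(String.valueOf(i));
        }

        // Compiled once, shared by every benchmark iteration.
        preCompiledPattern = Pattern.compile(PATTERN);

        // Created once with empty input; reset(value) re-targets it later.
        matcherFromPreCompiledPattern = preCompiledPattern.matcher("");
    }
}
```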

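Looking at the benchmark results, the pre-compiled Pattern combined with a reused Matcher is the clear winner, running about five to six times faster than the three forms that recompile the regex on every call. Pre-compiling the Pattern alone, with a new Matcher per value, is still roughly three times faster than those forms: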

Benchmark                                                               Mode  Cnt     Score     Error  Units
PatternPerformanceComparison.matcherFromPreCompiledPatternResetMatches  avgt   20   278.732 ±  22.960  ms/op
PatternPerformanceComparison.preCompiledPatternMatcherMatches           avgt   20   500.393 ±  34.182  ms/op
PatternPerformanceComparison.stringMatchs                               avgt   20  1433.099 ±  73.687  ms/op
PatternPerformanceComparison.patternCompileMatcherMatches               avgt   20  1774.429 ± 174.955  ms/op
PatternPerformanceComparison.patternMatches                             avgt   20  1792.874 ± 130.213  ms/op

Beyond performance times, we also have the number of objects created:

  • First three forms:
    • 5,000,000 Pattern instances created
    • 5,000,000 Matcher instances created
  • preCompiledPattern.matcher(value).matches()
    • 1 Pattern instance created
    • 5,000,000 Matcher instances created
  • matcherFromPreCompiledPattern.reset(value).matches()
    • 1 Pattern instance created
    • 1 Matcher instance created

So, instead of delegating our regex to String#matches or Pattern#matches, which always create new Pattern and Matcher instances, we should pre-compile our regex to gain performance and create fewer objects.
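
One caveat when applying this advice: a Pattern is immutable and safe to share between threads, whereas a Matcher is not. The sketch below shows one possible way to reuse both anyway, by keeping a single static Pattern and one Matcher per thread. The class name, the regex and the choice of ThreadLocal are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical validator; names and regex are illustrative only.
class EvenNumberValidator {

    // Pattern is immutable, so one shared instance is enough.
    private static final Pattern EVEN = Pattern.compile("\\d*[02468]");

    // Matcher is not thread-safe, so each thread gets its own instance.
    private static final ThreadLocal<Matcher> MATCHER =
        ThreadLocal.withInitial(() -> EVEN.matcher(""));

    static boolean isEven(String value) {
        // reset(value) points the thread's Matcher at the new input.
        return MATCHER.get().reset(value).matches();
    }
}
```

If allocating a Matcher per call isn't a bottleneck, EVEN.matcher(value).matches() is the simpler choice and is already much cheaper than recompiling the regex.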

To learn more about regex performance, check out our Overview of Regular Expressions Performance in Java.

3. New Methods

Since the introduction of functional interfaces and streams, reuse has become easier.

The Pattern class has evolved in new Java versions to provide integration with streams and lambdas.

3.1. Java 8

Java 8 introduced two new methods: splitAsStream and asPredicate.

Let's look at some code for splitAsStream, which creates a stream from the given input sequence around matches of the pattern:

@Test
public void givenPreCompiledPattern_whenCallSplitAsStream_thenReturnArrayWithValuesSplitByThePattern() {
    Pattern splitPreCompiledPattern = Pattern.compile("__");
    Stream<String> textSplitAsStream = splitPreCompiledPattern.splitAsStream("My_Name__is__Fabio_Silva");
    String[] textSplit = textSplitAsStream.toArray(String[]::new);

    assertEquals("My_Name", textSplit[0]);
    assertEquals("is", textSplit[1]);
    assertEquals("Fabio_Silva", textSplit[2]);
}

The asPredicate method creates a predicate that behaves as if it creates a matcher from the input sequence and then calls find:

string -> matcher(string).find();

Let's create a pattern that finds the names in a list that contain at least a first and a last name of three or more letters each:

@Test
public void givenPreCompiledPattern_whenCallAsPredicate_thenReturnPredicateToFindThePatternInTheListElements() {
    List<String> namesToValidate = Arrays.asList("Fabio Silva", "Mr. Silva");
    Pattern firstLastNamePreCompiledPattern = Pattern.compile("[a-zA-Z]{3,} [a-zA-Z]{3,}");
    
    Predicate<String> patternsAsPredicate = firstLastNamePreCompiledPattern.asPredicate();
    List<String> validNames = namesToValidate.stream()
        .filter(patternsAsPredicate)
        .collect(Collectors.toList());

    assertEquals(1, validNames.size());
    assertTrue(validNames.contains("Fabio Silva"));
}

3.2. Java 11

Java 11 introduced the asMatchPredicate method that creates a predicate that behaves as if it creates a matcher from the input sequence and then calls matches:

string -> matcher(string).matches();

Let's create a pattern that matches only the names in a list that consist of exactly a first and a last name, each with at least three letters:

@Test
public void givenPreCompiledPattern_whenCallAsMatchPredicate_thenReturnMatchPredicateToMatchesThePatternInTheListElements() {
    List<String> namesToValidate = Arrays.asList("Fabio Silva", "Fabio Luis Silva");
    Pattern firstLastNamePreCompiledPattern = Pattern.compile("[a-zA-Z]{3,} [a-zA-Z]{3,}");
        
    Predicate<String> patternAsMatchPredicate = firstLastNamePreCompiledPattern.asMatchPredicate();
    List<String> validatedNames = namesToValidate.stream()
        .filter(patternAsMatchPredicate)
        .collect(Collectors.toList());

    assertTrue(validatedNames.contains("Fabio Silva"));
    assertFalse(validatedNames.contains("Fabio Luis Silva"));
}
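
To make the difference between the two predicates concrete, here's a small sketch that runs both over the same list with the same pattern. The class and method names are hypothetical, and asMatchPredicate requires Java 11 or later.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical comparison class; names are illustrative only.
class FindVersusMatches {

    static final Pattern FIRST_LAST = Pattern.compile("[a-zA-Z]{3,} [a-zA-Z]{3,}");

    // asPredicate: true if the pattern is found anywhere in the input.
    static List<String> filterWithFind(List<String> names) {
        return names.stream()
            .filter(FIRST_LAST.asPredicate())
            .collect(Collectors.toList());
    }

    // asMatchPredicate: true only if the whole input matches the pattern.
    static List<String> filterWithMatches(List<String> names) {
        return names.stream()
            .filter(FIRST_LAST.asMatchPredicate())
            .collect(Collectors.toList());
    }
}
```

For the input "Fabio Silva", "Fabio Luis Silva" and "Mr. Silva", filterWithFind keeps the first two, because "Fabio Luis" is found inside the second name, while filterWithMatches keeps only "Fabio Silva".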

4. Conclusion

In this tutorial, we saw that using pre-compiled patterns gives us far better performance.

We also learned about three new methods introduced in JDK 8 and JDK 11 that make our lives easier.

The code for these examples is available over on GitHub in core-java-11 for the JDK 11 snippets and core-java-text for the others.

