While trying to get Java to #1 in the regexdna challenge for The Computer Language Benchmarks Game I was researching the performance of regular expression libraries for Java. The most recent website I could find was tusker.org from 2010. Hence I decided to redo the tests using the Java Microbenchmarking Harness and publish the results.
TL;DR: regular expressions are good for ad-hoc querying but if you have something performance sensitive, you should hand-code your solution (this doesn’t mean that you have to start from absolute zero – the Google Guava library has for example some nice utilities which can help in writing readable but also performant code).
And now, for some charts summarizing the performance – the test was run on an 64bit Ubuntu 15.10 machine with OpenJDK 1.8.0_66:
Small texts | Large texts | |
---|---|---|
Linear scale | ||
Logarithmic scale |
Observations:
- there is no “standard” for regular expressions, so different libraries can behave differently when given a particular regex and a particular string to match against – ie. one might say that it matches but the other might say that it doesn’t. For example, even though I used a very reduced set of testcases (5 regexes checked against 6 strings), only two of the libraries managed to match / not match them all correctly (one of them being java.util.Pattern).
-
it probably takes more than one try to get your regex right (tools like regexpal or The Regex Coach are very useful for experimenting)
-
the performance of a regex is hard to predict (and sometimes it can have exponential complexity based on the input length) – because of this you need to think twice if you accept a regular expression from arbitrary users on the Internet (like a search engine which would allow search by regular expressions for example)
-
none of the libraries seems to be in active development any more (in fact quite a few from the original list on tusker.org are now unavailable) and many of them are slower than the built-in j.u.Pattern, so if you use regexes that should probably be the first choice.
-
that said, the performance of both the hardware and JVM has been considerable, so if you are using one of these libraries, it is running generally an order of magnitude faster than it was five years ago. So there is no need to quickly replace working code (unless your profiler says that it is a problem :-))
-
watch out for calls to String.split in loops. While it has some optimization for particular cases (such as one-char regexes), you should almost always:
- see if you can use something like Splitter from Google Guava
- if you need a regular expression, at least pre-compile it outside of the loop
- the two surprises were dk.brics.automaton which outperformed everything else by several orders of magnitude, however:
- the last release was in 2011 and seems to be more an academic project
- it doesn’t support the same syntax as java.util.Pattern (but doesn’t give you a warning if you try to use a j.u.Pattern – it just won’t match the strings you think it should)
- doesn’t have an API as comfortable as j.u.Pattern (for example it’s missing replacements)
- the other surprise was kmy.regex.util.Regex, which – although not updated since 2000 – outperformed java.util.Pattern and passed all the tests (of which there weren’t admittedly many).
The complete list of libraries used:
Library name and version (release year) | Available in Maven Central | License | Average ops/second | Average ops/second (large text) | Passing tests |
---|---|---|---|---|---|
j.util.Pattern 1.8 (2015) | no (comes with JRE) | JRE license | 19 689 | 22 144 | 5 out of 5 |
dk.brics.automaton.Automaton 1.11-8 (2011) | yes | BSD | 2 600 225 | 115 374 276 | 2 out of 5 |
org.apache.regexp 1.4 (2005) | yes | Apache (?) | 6 738 | 16 895 | 4 out of 5 |
com.stevesoft.pat.Regex 1.5.3 (2009) | yes | LGPL v3 | 4 191 | 859 | 4 out of 5 |
net.sourceforge.jregex 1.2_01 (2002) | yes | BSD | 57 811 | 3 573 | 4 out of 5 |
kmy.regex.util.Regex 0.1.2 (2000) | no | Artistic License | 217 803 | 38 184 | 5 out of 5 |
org.apache.oro.text.regex.Perl5Matcher 2.0.8 (2003) | yes | Apache 2.0 | 31 906 | 2383 | 4 out of 5 |
gnu.regexp.RE 1.1.4 (2005?) | yes | GPL (?) | 11 848 | 1 509 | 4 out of 5 |
com.basistech.tclre.RePattern 0.13.6 (2015) | yes | Apache 2.0 | 11 598 | 43 | 3 out of 5 |
com.karneim.util.collection.regex.Pattern 1.1.1 (2005?) | yes | ? | – | – | 2 out of 5 |
org.apache.xerces.impl.xpath.regex.RegularExpression 2.11.0 (2014) | yes | Apache 2.0 | – | – | 4 out of 5 |
com.ibm.regex.RegularExpression 1.0.2 (no longer available) | no | ? | – | – | – |
RegularExpression.RE 1.1 (no longer available) | no | ? | – | – | – |
gnu.rex.Rex ? (no longer available) | no | ? | – | – | – |
monq.jfa.Regexp 1.1.1 (no longer available) | no | ? | – | – | – |
com.ibm.icu.text.UnicodeSet (ICU4J) 56.1 (2015) | yes | ICU License | – | – | – |
If you want to re-run the tests, check out the source code and run it as follows:
# we need to skip tests since almost all libraries fail a test or an other mvn -Dmaven.test.skip=true clean package # run the benchmarks java -cp lib/jint.jar:target/benchmarks.jar net.greypanther.javaadvent.regex.RegexBenchmarks
Find the complete source for the benchmarks on GitHub: https://github.com/gpanther/regex-libraries-benchmarks
Author: Attila-Mihály Balázs
A programmer journeyman who appreciates working in great teams to deliver good fit solutions to customers.
Caleb Cushing December 22, 2015
hate to be the one to tell you this, but there is a regular expression standard. https://en.wikipedia.org/wiki/Regular_expression#Standards
gpanther December 22, 2015
Today I learned something 🙂 Thank you!
Robert Sullivan November 4, 2016
Well, technically you are correct, but even as your link shows, there are standard(s), emphasis on the plural. And in the real world, as always, implmentations vary in the wild, i.e. see the related Wikipedia link : https://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines As a quick example to illustrate the point, you have folks like Microsoft, with very creative and non-standard implementations, take a look at how the regex works in findstr, as a fun example. So, Attila’s main point stands: “different libraries can behave differently when given a particular regex and a particular string to match against –”
Isaac December 22, 2015
>>I got Java to #1 by using bit operations to check blocks of 8 bytes if they are potential matches and only then test them against the regular expressions. <<
No, your program was not accepted because it was not doing the work with regular expressions, it was using custom bit operations.
Dr Alexander J Turner December 22, 2015
Whilst there is a standard, not all libraries seem to stick to it. Anyhow, nice to see jmh being used to do sensible benchmarks. Thanks – for the post.
Isaac Gouy December 22, 2015
>>(spoiler alert: I got Java to #1 by some unorthodox solutions)<<
Simply untrue.
As you know, your program was never measured for the benchmarks game — because your program was designed not to use regex.
gpanther December 29, 2015
Thank you for keeping me honest. I removed that part of the post. I would still argue that the program qualifies (since regular expressions are more like “source code” and the different libraries interpret / “compile” it and the particular restriction is similar to “you are not allowed to use optimization X in GCC”). But I will leave it as is until I have the time to implement a more generic regular expression library which uses these techniques.
Jan December 29, 2015
Have you tried re2j? https://github.com/google/re2j
It is java implementation of Google’s https://github.com/google/re2. The c++ version is great. The Java version is less mature but it should be good enough.
gpanther December 29, 2015
Hello.
The library looked very interesting so I gave it a quick try but its performance was slightly below the performance of java.util.Pattern (surprisingly! Perhaps it was something related to the particular test cases). Also, it uses a non-standard license (which seems to be a modified version of the BSD license) which in my mind is also an argument for not using it.
Harald December 29, 2015
If you feel like, you may try monq again. I recently found time to work on it: https://github.com/HaraldKi/monqjfa/releases . Since it works with DFAs internally, there is no way it does all Perl regexp tricks.
A small comparison I did myself with java.util.Pattern, heavily biased towards the original use case of monq, can be found here: http://pifpafpuf.de/2015-11/generated-index.html Read from bottom to top for the full story.