Efficient substring matching with regular expressions in Java
In our introduction to Java regular expressions, we mentioned that one of the advantages
of using regular expressions was efficiency. This may sound surprising at first glance. In order to
interpret and apply our regular expression, the regex API clearly has to do some non-trivial extra work
compared with a naive algorithm that simply reads and compares or counts characters in sequence.
In a comparison of String.split() vs StringTokenizer,
while the String.split() method performed admirably, it was still half the speed of the
(legacy and less flexible) StringTokenizer.
You may therefore be asking: how efficient are regular expressions for performing typical string
matching operations?
In fact, they are often surprisingly efficient. As we will see below, the regex API
contains optimisations that can make it more scalable than a naive implementation of the equivalent task.
As an example, let us consider the common case of substring matching, where we wish to determine
the locations where a particular substring occurs within another larger string.
A naive routine to find substring matches might look as follows:
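import java.util.ArrayList;
import java.util.List;

public static List<Integer> findMatchPoints(String str, String searchFor) {
    List<Integer> matchPoints = new ArrayList<>();
    // Last index at which searchFor could still fit inside str
    int max = str.length() - searchFor.length();
next_location:
    for (int i = 0; i <= max; i++) {
        // Compare character by character; give up on this position at the first mismatch
        for (int j = 0; j < searchFor.length(); j++) {
            if (str.charAt(i + j) != searchFor.charAt(j)) {
                continue next_location;
            }
        }
        matchPoints.add(i);
    }
    return matchPoints;
}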
On short strings, this naive algorithm may be sufficient and may even outperform its regular expression
equivalent. But from a scalability perspective, it is inefficient: in the worst case, where long partial
matches keep failing on their final character, it will compare almost every character in the input string
with every character in the substring being matched. Put another way: whenever a non-match occurs at a
particular position, the naive algorithm "throws away" information that was gathered along the way. How many
characters of the non-matching subsequence did match, and can this be used as a hint as to where to resume
searching for the next potential match site?
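To make the cost concrete, the following minimal sketch instruments the naive loop with a counter. The class name NaiveComparisonCount and the test strings are illustrative assumptions rather than part of the original routine.

```java
import java.util.Arrays;

public class NaiveComparisonCount {

    /** Returns how many character comparisons the naive nested-loop search performs. */
    public static long countComparisons(String str, String searchFor) {
        long comparisons = 0;
        int max = str.length() - searchFor.length();
        next_location:
        for (int i = 0; i <= max; i++) {
            for (int j = 0; j < searchFor.length(); j++) {
                comparisons++;
                if (str.charAt(i + j) != searchFor.charAt(j)) {
                    continue next_location;
                }
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        // Build a text consisting of 20 'a' characters
        char[] chars = new char[20];
        Arrays.fill(chars, 'a');
        String text = new String(chars);

        // Every candidate position matches three characters before failing on the fourth
        System.out.println(countComparisons(text, "aaab"));
        // Every candidate position fails on its first character
        System.out.println(countComparisons(text, "baaa"));
    }
}
```

With a 20-character text there are 17 candidate positions. The pattern "aaab" costs four comparisons at each of them (68 in total), whereas "baaa" costs only one (17 in total), even though neither pattern occurs in the text.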
Of course, we could implement a more efficient algorithm from scratch (the Knuth-Morris-Pratt
algorithm and the Boyer-Moore-Horspool algorithm are two such approaches). But the regular expression API already
offers such an algorithm out of the box. To gain the benefit, we can replace our method with the following:
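import java.util.List;
import java.util.regex.*;
import java.util.stream.Collectors;

public static List<Integer> findMatchPoints(CharSequence str, String searchFor) {
    // Pattern.LITERAL treats searchFor as plain text, so characters such as '.' or '*' are matched literally
    Pattern p = Pattern.compile(searchFor, Pattern.LITERAL);
    Matcher m = p.matcher(str);
    return m.results().map(MatchResult::start).collect(Collectors.toList());
}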
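For comparison, here is a rough hand-written sketch of the Boyer-Moore-Horspool idea mentioned above. It is an illustration under simplifying assumptions (the class name HorspoolSketch is hypothetical, and an empty search string is treated as having no matches). It is not the code used inside the regex API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HorspoolSketch {

    public static List<Integer> findMatchPoints(String str, String searchFor) {
        List<Integer> matchPoints = new ArrayList<>();
        int n = str.length();
        int m = searchFor.length();
        if (m == 0 || m > n) {
            return matchPoints; // simplification: nothing to find
        }

        // For every character except the last one in the pattern, record its
        // distance from the end of the pattern (later occurrences overwrite earlier ones).
        Map<Character, Integer> shift = new HashMap<>();
        for (int j = 0; j < m - 1; j++) {
            shift.put(searchFor.charAt(j), m - 1 - j);
        }

        int i = 0;
        while (i <= n - m) {
            // Compare the current window with the pattern from right to left
            int j = m - 1;
            while (j >= 0 && str.charAt(i + j) == searchFor.charAt(j)) {
                j--;
            }
            if (j < 0) {
                matchPoints.add(i);
            }
            // Skip ahead based on the last character of the window; characters
            // not recorded in the table allow a jump of the full pattern length.
            char lastInWindow = str.charAt(i + m - 1);
            i += shift.getOrDefault(lastInWindow, m);
        }
        return matchPoints;
    }
}
```

The saving comes from the shift table: when the last character of the current window does not occur in the pattern at all, the search can skip a whole pattern length at once instead of advancing by a single position.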
Let us consider a slightly contrived example:
	String: SPOONS AND SPIN SPAN IN PINS AND SNIPS, SPAN INTO SIPS
	Search for: SPAN
Our naive algorithm finds the two match sites in 64 comparisons, while the regex implementation (as of Java 9) finds them
in 28 comparisons. In a real-world application where we needed to perform many such searches, we would consider
other optimisations such as re-using the same compiled Pattern instance where possible.
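One way to re-use a compiled Pattern is to compile it once and keep it in a field. The sketch below assumes a hypothetical SubstringFinder class. It uses Pattern.LITERAL so that the search string is matched as plain text, and a find() loop so that it also runs on Java versions that lack Matcher.results().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubstringFinder {

    // Compiled once in the constructor, then re-used for every search
    private final Pattern pattern;

    public SubstringFinder(String searchFor) {
        this.pattern = Pattern.compile(searchFor, Pattern.LITERAL);
    }

    public List<Integer> findMatchPoints(CharSequence str) {
        List<Integer> matchPoints = new ArrayList<>();
        // A Matcher holds per-search state, so a new one is created for each input
        Matcher m = pattern.matcher(str);
        while (m.find()) {
            matchPoints.add(m.start());
        }
        return matchPoints;
    }

    public static void main(String[] args) {
        SubstringFinder finder = new SubstringFinder("SPAN");
        System.out.println(finder.findMatchPoints(
                "SPOONS AND SPIN SPAN IN PINS AND SNIPS, SPAN INTO SIPS"));
    }
}
```

Pattern instances are immutable and safe to share between threads, whereas Matcher instances are not, which is why only the Pattern is cached here.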
Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.