RegEx - Extracting Strings (part 2 of 2)

In the last article, we learned how to write and match a regex. Now we'll look at how to extract the string that matches the regex in Java.

Remember,

  1. If we want to match a character that has a special meaning in regex like . then we need to prefix it with a backslash \.

  2. Also in Java, if we are using a backslash \ inside double quotes then we need to prefix it with another backslash.
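To see both rules in action, here is a minimal sketch. The class name EscapeDemo and the helper fullMatch are made up for this illustration; Pattern.matches is the standard library call that compiles a regex and checks it against the whole input.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // "a.c": the unescaped dot matches any single character
        System.out.println(fullMatch("a.c", "abc"));    // true
        System.out.println(fullMatch("a.c", "a.c"));    // true

        // "a\\.c" is the Java string for the regex a\.c (literal dot only)
        System.out.println(fullMatch("a\\.c", "abc"));  // false
        System.out.println(fullMatch("a\\.c", "a.c"));  // true
    }
}
```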

The Pattern & Matcher Class

As we already know, Regex is a language on its own that we can compile and store in an object of type Pattern.

String regex = "\\d{3}";
Pattern pattern = Pattern.compile(regex);

Now, to actually match this regex to a test string we can use a Matcher class like below. Matcher is the engine that would match the regex with the test string.

String testString = "999";
Matcher matcher = pattern.matcher(testString);

To check whether the entire testString matches the regex, we call the matcher.matches() method. It returns true if the whole string matches and false otherwise.

System.out.println(matcher.matches());
// prints true since the string 999 matches the regex \d{3}
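Putting the pieces above together, a small runnable sketch could look like this. The class and method names are made up for the example. It also contrasts matches(), which needs the whole string to match, with find(), which is covered later in this article and only needs a match somewhere in the string.

```java
import java.util.regex.Pattern;

class MatchesDemo {
    private static final Pattern THREE_DIGITS = Pattern.compile("\\d{3}");

    // true only if the entire input is exactly three digits
    static boolean isThreeDigits(String input) {
        return THREE_DIGITS.matcher(input).matches();
    }

    // true if three consecutive digits appear anywhere in the input
    static boolean containsThreeDigits(String input) {
        return THREE_DIGITS.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isThreeDigits("999"));         // true
        System.out.println(isThreeDigits("9999"));        // false
        System.out.println(containsThreeDigits("9999"));  // true
        System.out.println(containsThreeDigits("ab12"));  // false
    }
}
```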

Capture Groups

Using () parentheses, we may group a pattern in regex and extract any of the groups that we want using an index of that group. These groups are known as "capture groups".

Let's say we have a mobile number in the format +91 9999988888. To match mobile numbers that are in this format, we can write a regex like \+\d{1,3}\s\d{10}. Since + is a quantifier in regex, we escape it as \+ to match a literal plus sign.

Regex        Explanation
\+\d{1,3}    Represents the country code, e.g. +91. A country code has at least 1 digit and at most 3, hence the minimum of 1 and the maximum of 3.
\s           Matches the space between the country code and the mobile number.
\d{10}       Matches the actual mobile number, which should be 10 digits.

To create groups, we wrap parts of the regex in ( ). So (\+\d{1,3}) will be one group for the country code and (\d{10}) will be another group for the mobile number. Since we are not interested in \s, there is no need to put it in a group.

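To extract the string that matches the first group, we call matcher.group(1), which returns the country code as a String. Likewise, matcher.group(2) returns the mobile number. Note that group() only works after a successful matches() or find() call; otherwise it throws an IllegalStateException.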

capture groups in regex

String countryCode = matcher.group(1);    // returns +91
String mobileNumber = matcher.group(2);   // returns 9999988888
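Here is a minimal, self-contained sketch of the same extraction. The class name CaptureGroupDemo and the parse helper are assumptions for illustration only. Note that the + is escaped and that matches() is called before group().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupDemo {
    // Group 1: country code, group 2: 10-digit mobile number
    private static final Pattern MOBILE =
            Pattern.compile("(\\+\\d{1,3})\\s(\\d{10})");

    // Hypothetical helper: returns {countryCode, mobileNumber},
    // or null when the input is not in the expected format.
    static String[] parse(String input) {
        Matcher matcher = MOBILE.matcher(input);
        if (!matcher.matches()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("+91 9999988888");
        System.out.println(parts[0]);  // +91
        System.out.println(parts[1]);  // 9999988888
    }
}
```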

Naming groups

We can also name a capture group using ?<nameOfTheCaptureGroup> and then extract the substring using the name of the group instead of the index.
Let's add names to the regex that we saw previously, \+\d{1,3}\s\d{10}
(?<countryCode>\+\d{1,3})\s(?<mobileNumber>\d{10}) - Here we have named group 1 as countryCode and group 2 as mobileNumber.

naming groups in regex

String countryCode = matcher.group("countryCode");    // returns +91
String mobileNumber = matcher.group("mobileNumber");  // returns 9999988888
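As a runnable sketch (the class name NamedGroupDemo and the helper are made up for this example), the group name is passed to group() as a String. Named groups are still numbered, so the index keeps working too.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    private static final Pattern MOBILE = Pattern.compile(
            "(?<countryCode>\\+\\d{1,3})\\s(?<mobileNumber>\\d{10})");

    // Hypothetical helper: returns the value of the named group,
    // or null when the input does not match.
    static String extract(String input, String groupName) {
        Matcher matcher = MOBILE.matcher(input);
        return matcher.matches() ? matcher.group(groupName) : null;
    }

    public static void main(String[] args) {
        String input = "+91 9999988888";
        System.out.println(extract(input, "countryCode"));   // +91
        System.out.println(extract(input, "mobileNumber"));  // 9999988888

        // The same groups are also reachable by index
        Matcher matcher = MOBILE.matcher(input);
        if (matcher.matches()) {
            System.out.println(matcher.group(1).equals(matcher.group("countryCode")));  // true
        }
    }
}
```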

Quantifiers in Groups

We can also use quantifiers with capture groups.
Remember: If we apply a quantifier to a group, we only have access to the string matched by its last repetition. (Refer to the diagram to understand better.)

Example - If the mobile number is in the format +91 99999 88888 then we can write the regex as (\+\d{1,3})(\s\d{5}){2}
Here, matcher.group(2) returns only the last repetition, i.e. the space followed by the last 5 digits (" 88888"), and we won't have access to the first 5 digits of the mobile number.

using quantifiers with groups in regex
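The sketch below shows this behaviour, plus one possible workaround: writing the repeated part out as two separate groups. The class and method names are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Group 2 is repeated twice, so it only keeps the last repetition.
    static String lastRepetition(String input) {
        Matcher m = Pattern.compile("(\\+\\d{1,3})(\\s\\d{5}){2}").matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    // Workaround: one group per half, so both halves stay accessible.
    static String[] bothHalves(String input) {
        Matcher m = Pattern.compile("(\\+\\d{1,3})\\s(\\d{5})\\s(\\d{5})").matcher(input);
        return m.matches() ? new String[] { m.group(2), m.group(3) } : null;
    }

    public static void main(String[] args) {
        String input = "+91 99999 88888";
        // The brackets make the leading space visible
        System.out.println("[" + lastRepetition(input) + "]");  // [ 88888]

        String[] halves = bothHalves(input);
        System.out.println(halves[0]);  // 99999
        System.out.println(halves[1]);  // 88888
    }
}
```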

Nested Groups

If we have nested capture groups, the indexing follows the opening parentheses. Going from left to right, the first opening parenthesis we encounter starts group 1, the next one starts group 2, and so on.
Assume a regex: (cat(dog(animal)))(bird), which obviously isn't a helpful regex but suffices for the purpose of explanation.

Note that group 0 is the entire match.

nested groups in regex
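To check the numbering ourselves, we can print every group for a string that matches this regex. This is only a sketch; the class name and the groups helper are made up for the example, while groupCount() is the standard Matcher method that reports how many capture groups the pattern has.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedGroupDemo {
    // Hypothetical helper: returns group 0..groupCount() for a full match,
    // or an empty array when the input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("(cat(dog(animal)))(bird)", "catdoganimalbird");
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + ": " + g[i]);
        }
        // 0: catdoganimalbird
        // 1: catdoganimal
        // 2: doganimal
        // 3: animal
        // 4: bird
    }
}
```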

Skipping a group

Let's say we need parentheses for grouping but don't want a particular group to be captured. We can skip it using the ?: notation, which tells the regex engine to still match that group's pattern but not to capture or index it. Such a group is called a non-capturing group.

Example: (cat(dog(?:animal)))(bird) Here we have skipped capturing animal. Since that group is no longer indexed, the indexes of the groups after it shift: (bird) becomes group 3 instead of group 4.

skipping a group from being indexed in regex
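A short sketch of how the numbering shifts once animal is non-capturing. The class name and helper are assumptions for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingDemo {
    private static final Pattern SKIPPED =
            Pattern.compile("(cat(dog(?:animal)))(bird)");

    // Hypothetical helper: returns the given group for a full match, else null.
    static String group(String input, int index) {
        Matcher m = SKIPPED.matcher(input);
        return m.matches() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        String input = "catdoganimalbird";
        // (?:animal) still has to match, but it is not counted as a group
        System.out.println(SKIPPED.matcher(input).groupCount());  // 3
        System.out.println(group(input, 1));  // catdoganimal
        System.out.println(group(input, 2));  // doganimal
        System.out.println(group(input, 3));  // bird
    }
}
```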

Flags in Pattern.compile()

As a regex pattern grows, it becomes hard to keep track of which part does what. In cases like this, we can add comments to the regex. To allow comments, we enable them with a flag called Pattern.COMMENTS.

If the test string spans multiple lines and we want ^ and $ to match the start and end of each line respectively, we need to enable the Pattern.MULTILINE flag.
By default, ^ and $ match only the start and end of the entire test string.

Remember that the . metacharacter matches any character except a line terminator. If we want it to match line terminators too, we can enable a flag called Pattern.DOTALL.
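The effect of MULTILINE and DOTALL is easiest to see on a tiny two-line string. The sketch below is only an illustration; the class name and the countMatches helper are made up, and passing 0 as the flags argument simply means "no flags".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagDemo {
    // Hypothetical helper: counts how often the regex is found in the input.
    static int countMatches(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String lines = "cat\ndog";

        // Without MULTILINE, ^ and $ only anchor to the whole input
        System.out.println(countMatches("^\\w+$", 0, lines));                  // 0
        // With MULTILINE, every line is a candidate
        System.out.println(countMatches("^\\w+$", Pattern.MULTILINE, lines));  // 2

        // Without DOTALL, the dot refuses to match the line break
        System.out.println(countMatches("cat.dog", 0, lines));                 // 0
        // With DOTALL, the dot matches it
        System.out.println(countMatches("cat.dog", Pattern.DOTALL, lines));    // 1
    }
}
```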

To write the regex across multiple lines, we can use a text block (available in Java 15 and later), which starts and ends with triple double quotes """

Note: If we enable comments then we have to use \s to match spaces, because any whitespace in the regex is ignored. Likewise, a literal # has to be escaped as \#, since an unescaped # starts a comment.

Pattern.compile(regex, 
               Pattern.COMMENTS | Pattern.MULTILINE | Pattern.DOTALL);
// We can use multiple flags at once using the pipe | character

An example of regex that uses multiple lines and has comments enabled.

String regex = """
    # A regex to parse phone numbers in the format
    # 1 999 888 6666 OR 1-999-888-6666 OR 1.999.888.6666

    (?:(?<countryCode>\\d{1,3})[-.\\s])    # gets country code
    (?:(?<first3Digits>\\d{3})[-.\\s])     # gets first 3 digits
    (?:(?<second3Digits>\\d{3})[-.\\s])    # gets second 3 digits
    (?<last4Digits>\\d{4})                 # gets last 4 digits
    """;
String phoneNumber = "91.987.654.3210";
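To actually run a commented regex like this, we compile it with Pattern.COMMENTS and read the named groups. The sketch below is one way to do that, not the only one. It builds the regex from concatenated strings instead of a text block so that it also compiles on older Java versions, and the class name PhoneParser and its parse helper are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhoneParser {
    // Each comment ends with \n because a # comment runs to the end of the line.
    private static final Pattern PHONE = Pattern.compile(
            "# country code followed by a separator\n"
          + "(?<countryCode>\\d{1,3}) [-.\\s]\n"
          + "# two blocks of three digits, each followed by a separator\n"
          + "(?<first3Digits>\\d{3}) [-.\\s]\n"
          + "(?<second3Digits>\\d{3}) [-.\\s]\n"
          + "# last four digits\n"
          + "(?<last4Digits>\\d{4})",
            Pattern.COMMENTS);

    // Hypothetical helper: returns the four parts, or null if there is no match.
    static String[] parse(String phoneNumber) {
        Matcher m = PHONE.matcher(phoneNumber);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group("countryCode"), m.group("first3Digits"),
            m.group("second3Digits"), m.group("last4Digits")
        };
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", parse("91.987.654.3210")));  // 91 | 987 | 654 | 3210
        System.out.println(String.join(" | ", parse("1 999 888 6666")));   // 1 | 999 | 888 | 6666
    }
}
```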

Compiling a pattern is relatively expensive, so avoid recompiling the same regex over and over. Compile it once and reuse the Pattern object.
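One common way to do this is to keep the compiled Pattern in a static final field, so it is compiled only once when the class loads. The sketch below assumes a made-up PinValidator class that checks 6-digit PIN codes; only the idea of reusing the Pattern matters here. For comparison, String.matches(regex) compiles the regex again on every call.

```java
import java.util.regex.Pattern;

class PinValidator {
    // Compiled once, reused for every call to isValid
    private static final Pattern PIN = Pattern.compile("\\d{6}");

    static boolean isValid(String candidate) {
        // Only a cheap Matcher is created per call, not a new Pattern
        return PIN.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("560001"));  // true
        System.out.println(isValid("56001"));   // false
        System.out.println(isValid("56000a"));  // false
    }
}
```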

Greedy & Lazy Quantifier

The quantifiers * and + are greedy by default. That means they match as many characters as possible. To see what this means in practice, look at the code block below and work out what it will print.

String str = "Marks in Sem First: 69 Marks in Sem Second: 87";
String regex = "Marks.*(\\d{1,3}).*";

Pattern pat = Pattern.compile(regex);
Matcher mat = pat.matcher(str);

// group() may only be called after a successful matches() or find()
if (mat.matches()) {
    System.out.println(mat.group(1));
}

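If you answered 69 then you are wrong. It prints 7, because the first .* matches as many characters as possible (greedy) and gives back only as much as is needed for the rest of the regex to match. Since \d{1,3} is satisfied by a single digit, the greedy .* swallows everything up to the final 7.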

greedy operator in regex

This behaviour of the * and + quantifiers is called the greedy approach, and it is unintuitive at first. To make them lazy (match as few characters as possible), we add ? after them.

String str = "Marks in Sem First: 69 Marks in Sem Second: 87";
String regex = "Marks.*?(\\d{1,3}).*";

Pattern pat = Pattern.compile(regex);
Matcher mat = pat.matcher(str);

// group() may only be called after a successful matches() or find()
if (mat.matches()) {
    System.out.println(mat.group(1));    // prints 69
}
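The same difference shows up with other inputs. As a second, smaller illustration, the sketch below pulls the first tag out of a short HTML-like string. The class name, the firstMatch helper and the sample string are all made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyLazyDemo {
    // Hypothetical helper: returns the first substring the regex finds, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b>";
        // Greedy: .+ runs to the end and backs off to the LAST >
        System.out.println(firstMatch("<.+>", html));   // <b>bold</b>
        // Lazy: .+? stops at the FIRST > it can
        System.out.println(firstMatch("<.+?>", html));  // <b>
    }
}
```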

Another way to match Repeating patterns

Suppose we have the string below, which contains a repeating pattern. Here we can write a regex that matches only the repeating pattern and not the entire string. We can then use find(), which returns true each time it finds the next match and false once there are no more matches.

Note: Since we are not writing a regex for the entire string, calling matcher.matches() would return false.

String str = """
                Name: Bruce Lee, DOB: 27-11-1940
                Name: Jackie Chan, DOB: 07-04-1954
                """;

String regex = """
               Name:\\s(\\w+\\s\\w+),\\s
               DOB:\\s(\\d{2}-\\d{2}-\\d{4})""";

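// The regex above is split over two lines, so we need COMMENTS here:
// it makes the engine ignore the line break and indentation inside the regex
Pattern pat = Pattern.compile(regex, Pattern.COMMENTS);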
Matcher mat = pat.matcher(str);

while (mat.find()) {
    System.out.println(mat.group(1) + " : " + mat.group(2));
}
// output
// Bruce Lee : 27-11-1940
// Jackie Chan : 07-04-1954

Challenge

Write a regex that matches the paragraph below and extracts the following details: Student Number, Grade, Birthdate, Gender, State ID, Cumulative GPA (Weighted), Cumulative GPA (Unweighted).
Use comments and multiple lines in the regex.

// Required Input
String paragraph = """
        Student Number:    1234598872            Grade:        11
        Birthdate:        01/02/2000            Gender:    M
        State ID:        8923827123

        Cumulative GPA (Weighted)        3.82
        Cumulative GPA (Unweighted)    3.46""";
// Required Output
// 1234598872
// 11
// 01/02/2000
// M
// 8923827123
// 3.82
// 3.46

Solution

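String regex = """

 # gets student number (\\s* because several spaces follow the colon)
 Student\\sNumber:\\s*(?<studentNumber>\\d{10})\\s*

 # gets grade
 Grade:\\s*(?<grade>\\d{2})\\n

 # gets birthdate
 Birthdate:\\s*(?<birthDate>\\d{2}/\\d{2}/\\d{4})\\s*

 # gets gender
 Gender:\\s*(?<gender>\\w)\\n

 # gets stateID
 State\\sID:\\s*(?<stateId>\\d{10})\\n\\n

 # gets weighted gpa
 Cumulative\\sGPA\\s\\(Weighted\\)\\s*(?<weightedGpa>\\d\\.\\d{2})\\n

 # gets unweighted gpa
 Cumulative\\sGPA\\s\\(Unweighted\\)\\s*(?<unweightedGpa>\\d\\.\\d{2}) """;

Pattern pattern = Pattern.compile(regex, Pattern.COMMENTS);
Matcher matcher = pattern.matcher(paragraph);

// Run the match first, then read each detail by its group name
if (matcher.find()) {
    System.out.println(matcher.group("studentNumber"));
    System.out.println(matcher.group("grade"));
    System.out.println(matcher.group("birthDate"));
    System.out.println(matcher.group("gender"));
    System.out.println(matcher.group("stateId"));
    System.out.println(matcher.group("weightedGpa"));
    System.out.println(matcher.group("unweightedGpa"));
}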

If this article helped you in having a decent understanding of RegEx in Java please drop a like.

Peace Out ✌️