RegEx - Regular Expression (part 1 of 2)

Basics of RegEx

A regular expression, commonly called RegEx or regex, is a compact language for describing text patterns using special symbols. We write a regex and hand it to a regex engine, which either checks whether a string matches the pattern or extracts the substrings that match it.

In this article, we will focus only on matching. In the second part, we will see how to extract the matching substrings in Java.

We will be using Java to match and capture regex patterns, but you can also use any online regex matcher, such as https://regex101.com/

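In Java, we can use the .matches(String regex) method of String to check whether a string matches a regex. For example, "Hello".matches("Hello") returns true because the pattern contains no special symbols and the two strings are identical. Note that .matches() succeeds only when the entire string matches the pattern, not just a part of it.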
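As a minimal sketch of this whole-string behaviour (the class and method names here are made up for illustration):

```java
class MatchesDemo {
    // Returns true only when the whole input matches the regex "Hello".
    static boolean isHello(String input) {
        return input.matches("Hello");
    }

    public static void main(String[] args) {
        System.out.println(isHello("Hello"));       // true
        System.out.println(isHello("Hello World")); // false: extra text after "Hello"
        System.out.println(isHello("hello"));       // false: matching is case-sensitive
    }
}
```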

Matching a String

Character Class

We can use square brackets [ ] to check for an either-or condition.
For example, [Hh] means a character can be either H or h.

Similarly, [bc]at matches a string that starts with b or c and is followed by "at".

  • "cat".matches("[bc]at") would return true

  • "bat".matches("[bc]at") would also return true.

The set of characters inside the square brackets forms a character class.
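A small sketch of the [bc]at example, with a couple of non-matching inputs added for contrast (the helper name is hypothetical):

```java
class CharacterClassDemo {
    // [bc] stands for exactly one character: either 'b' or 'c'.
    static boolean isBatOrCat(String word) {
        return word.matches("[bc]at");
    }

    public static void main(String[] args) {
        System.out.println(isBatOrCat("cat"));  // true
        System.out.println(isBatOrCat("bat"));  // true
        System.out.println(isBatOrCat("rat"));  // false: 'r' is not in the class
        System.out.println(isBatOrCat("bcat")); // false: the class consumes only one character
    }
}
```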

Range in Character Class

We can also define a range inside a character class.

"[a-g]at" is the same as "[abcdefg]at", which matches any letter from 'a' to 'g' followed by "at".

We can also combine ranges. "[a-zA-Z][0-9]" matches a letter (lowercase or uppercase) followed by a digit from 0 to 9.
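The two range patterns above could be tried out like this (method names are illustrative only):

```java
class RangeDemo {
    // One letter from 'a' to 'g', followed by "at".
    static boolean isAToGAt(String word) {
        return word.matches("[a-g]at");
    }

    // One letter (either case), followed by one digit.
    static boolean isLetterThenDigit(String text) {
        return text.matches("[a-zA-Z][0-9]");
    }

    public static void main(String[] args) {
        System.out.println(isAToGAt("fat"));         // true
        System.out.println(isAToGAt("hat"));         // false: 'h' is outside a-g
        System.out.println(isLetterThenDigit("A7")); // true
        System.out.println(isLetterThenDigit("7A")); // false: wrong order
    }
}
```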

Negating a Character Class

To negate a character class, we put the ^ symbol right after the opening bracket. For example, if we want three-letter lowercase words that do not start with a, b, c or d, we can use something like this: "[^abcd][a-z][a-z]". Keep in mind that [^abcd] matches any character other than a, b, c or d, so the first character could also be a digit or an uppercase letter. Of course, there are better ways to do the same, as we will see later in the blog.
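A quick sketch of the negated class in action. One thing worth noticing: a negated class accepts every character that is not listed, which includes uppercase letters and digits. The stricter variant using [e-z] is an alternative added here for comparison, not something from the text above.

```java
class NegationDemo {
    // First character: anything except a, b, c, d. Then two lowercase letters.
    static boolean matchesNegated(String word) {
        return word.matches("[^abcd][a-z][a-z]");
    }

    // Stricter alternative (an assumption of this sketch): first character must be e-z.
    static boolean matchesStrict(String word) {
        return word.matches("[e-z][a-z][a-z]");
    }

    public static void main(String[] args) {
        System.out.println(matchesNegated("fox")); // true
        System.out.println(matchesNegated("cat")); // false: starts with 'c'
        System.out.println(matchesNegated("Fox")); // true: 'F' is not one of a, b, c, d
        System.out.println(matchesStrict("Fox"));  // false: 'F' is not in e-z
    }
}
```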

Shorthand Character Class

Some character classes are used so frequently that they have a shorthand notation.

  • \d - to select a digit. It is the same as [0-9]

  • \w - to select a letter, digit, or underscore. It is the same as [a-zA-Z0-9_]

  • \s - to select a white space character

  • \n - to select a newline character (strictly speaking, an escape sequence for one specific character rather than a class)

We can specify how many times an element should repeat by placing a quantifier in curly braces { } after it. For example, \d{3} matches exactly three digits in a row, i.e. any 3-digit number.

We can also give the minimum and maximum number of times the preceding element should appear. For example, \d{3,5} matches any 3, 4 or 5-digit number.

To use these shorthand character classes in a Java string literal, we need to prefix \ with another \, since \ is also the escape character in Java strings. For example, "riz".matches("\\w{3}") returns true.
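The shorthand classes and curly-brace quantifiers can be combined in Java roughly as follows (helper names are made up for this sketch):

```java
class ShorthandDemo {
    // Exactly three word characters (letters, digits, or underscore).
    static boolean isThreeWordChars(String text) {
        return text.matches("\\w{3}");
    }

    // Between three and five digits, inclusive.
    static boolean isThreeToFiveDigits(String text) {
        return text.matches("\\d{3,5}");
    }

    public static void main(String[] args) {
        System.out.println(isThreeWordChars("riz"));       // true
        System.out.println(isThreeWordChars("r_1"));       // true
        System.out.println(isThreeToFiveDigits("1234"));   // true
        System.out.println(isThreeToFiveDigits("12"));     // false: too short
        System.out.println(isThreeToFiveDigits("123456")); // false: too long
        System.out.println("a b".matches("a\\sb"));        // true: \s matches the space
    }
}
```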

Challenge:
Can you select all the numbers with four or more digits in them?

Solution:
\d{4,} - By giving only the minimum, we set a lower bound with no upper bound, so this matches any number with four or more digits.
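The solution can be checked in Java like this (the method name is hypothetical):

```java
class MinQuantifierDemo {
    // {4,} means "at least four", with no upper limit.
    static boolean hasFourOrMoreDigits(String text) {
        return text.matches("\\d{4,}");
    }

    public static void main(String[] args) {
        System.out.println(hasFourOrMoreDigits("123"));        // false
        System.out.println(hasFourOrMoreDigits("1234"));       // true
        System.out.println(hasFourOrMoreDigits("1234567890")); // true
    }
}
```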

Metacharacters

Metacharacters are characters that have special meanings in regex. Some of the commonly used metacharacters are:

  • * means the preceding character or character class should occur 0 or more times.

  • + means the preceding character or character class should occur 1 or more times.

  • ? means the preceding character or character class should occur 0 or 1 time.
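To make the difference between the three concrete, here is a small sketch. The patterns ab*c, ab+c and colou?r are illustrative choices, not taken from the text above.

```java
class QuantifierDemo {
    // 'b' may appear zero or more times between 'a' and 'c'.
    static boolean starMatch(String text) {
        return text.matches("ab*c");
    }

    // 'b' must appear at least once between 'a' and 'c'.
    static boolean plusMatch(String text) {
        return text.matches("ab+c");
    }

    // The 'u' is optional, so both spellings are accepted.
    static boolean optionalMatch(String text) {
        return text.matches("colou?r");
    }

    public static void main(String[] args) {
        System.out.println(starMatch("ac"));         // true: zero b's is allowed
        System.out.println(plusMatch("ac"));         // false: needs at least one b
        System.out.println(plusMatch("abbbc"));      // true
        System.out.println(optionalMatch("color"));  // true
        System.out.println(optionalMatch("colour")); // true
    }
}
```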

More on the Shorthand Character Class

Capitalizing some of the shorthand character classes gives them the opposite meaning.

  • \D - any character other than a digit. Negation of \d

  • \W - any character other than a letter, a digit or an underscore. Negation of \w

  • \S - any character other than a whitespace character. Negation of \s
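A brief sketch of the negated shorthands, each tested against a single character (helper names are illustrative):

```java
class NegatedShorthandDemo {
    static boolean isNonDigit(String ch) {
        return ch.matches("\\D");
    }

    static boolean isNonWordChar(String ch) {
        return ch.matches("\\W");
    }

    static boolean isNonWhitespace(String ch) {
        return ch.matches("\\S");
    }

    public static void main(String[] args) {
        System.out.println(isNonDigit("a"));      // true
        System.out.println(isNonDigit("5"));      // false
        System.out.println(isNonWordChar("!"));   // true
        System.out.println(isNonWordChar("_"));   // false: underscore counts as a word character
        System.out.println(isNonWhitespace(" ")); // false
    }
}
```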

More on metacharacters

  • ^ - matches the beginning of the string

  • $ - matches the end of the string

  • . - matches any character except for the newline

  • | - matches either the pattern on its left or the one on its right. Example: a|b matches either a or b

Note: ^ marks the beginning of the string only when it is used outside [ ]. As the first character inside [ ], it negates the character class.
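Since .matches() already requires the whole string to match, ^ and $ make no visible difference there. The sketch below therefore uses Pattern and Matcher.find() from java.util.regex, which searches for the pattern anywhere in the input. That choice is an assumption of this example and not something this part of the article has covered.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // find() scans the whole input, so ^ pins the match to the start.
    static boolean startsWithCat(String input) {
        return Pattern.compile("^cat").matcher(input).find();
    }

    // $ pins the match to the end.
    static boolean endsWithCat(String input) {
        return Pattern.compile("cat$").matcher(input).find();
    }

    // . stands for exactly one character (newline excluded by default).
    static boolean isCAnyT(String input) {
        return input.matches("c.t");
    }

    static boolean isAOrB(String input) {
        return input.matches("a|b");
    }

    public static void main(String[] args) {
        System.out.println(startsWithCat("catalog")); // true
        System.out.println(startsWithCat("bobcat"));  // false
        System.out.println(endsWithCat("bobcat"));    // true
        System.out.println(isCAnyT("cut"));           // true
        System.out.println(isCAnyT("c\nt"));          // false: . does not match a newline
        System.out.println(isAOrB("b"));              // true
    }
}
```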

Small Challenge

With what we have seen so far, can you write a regex that matches dates in the format DD-MM-YYYY?

Solution:

If your answer is \d{2}-\d{2}-\d{4}, then you are only partially right, because this also matches 99-99-9999, which is not a valid date. Can you try again?

Solving in bits and pieces:

For Day (DD) only:

  • 0[1-9] for 01 to 09

  • [12][0-9] for 10 to 29

  • [3][01] for 30 and 31

Finally, we combine them with the | operator: 0[1-9]|[12][0-9]|[3][01]

For Month:

  • 0[1-9] for 01 to 09

  • 1[0-2] for 10 to 12

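Combining both conditions with the | operator: 0[1-9]|1[0-2]

For Year: we can simply use \d{4}

The better solution becomes (0[1-9]|[12][0-9]|[3][01])-(0[1-9]|1[0-2])-\d{4}

The parentheses ( ) group each set of alternatives. They are needed because | has the lowest precedence: without them, the pattern would be read as four separate alternatives, namely 0[1-9], or [12][0-9], or [3][01]-0[1-9], or 1[0-2]-\d{4}. Note that this regex still accepts impossible dates such as 31-02-2020, because it checks the day, month and year independently.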
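Because | has lower precedence than the characters around it, each set of alternatives needs its own parentheses when it sits inside a longer pattern. A minimal sketch comparing the grouped and ungrouped forms (constant and method names are made up for this example):

```java
class DateRegexDemo {
    // Grouped: each alternation is confined to its own parentheses.
    static final String GROUPED =
            "(0[1-9]|[12][0-9]|3[01])-(0[1-9]|1[0-2])-\\d{4}";

    // Ungrouped: | splits the whole pattern into four independent alternatives.
    static final String UNGROUPED =
            "0[1-9]|[12][0-9]|[3][01]-0[1-9]|1[0-2]-\\d{4}";

    static boolean looksLikeDate(String input) {
        return input.matches(GROUPED);
    }

    static boolean matchesUngrouped(String input) {
        return input.matches(UNGROUPED);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDate("25-12-2023"));    // true
        System.out.println(looksLikeDate("99-99-9999"));    // false
        System.out.println(looksLikeDate("31-02-2020"));    // true: format only, not calendar validity
        System.out.println(matchesUngrouped("25-12-2023")); // false: no single alternative covers it
        System.out.println(matchesUngrouped("05"));         // true: the first alternative alone matches
    }
}
```

Checking real calendar validity (for example, rejecting 31-02-2020) is usually better left to a date library than to a regex.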

Coming up Part 2

In the next part, we will see how to extract substrings that match a regex, along with some challenges that will strengthen your understanding of regex.

Peace Out ✌️