Regular Expression (Regex, often pronounced ri-je-x or reg-x) is extremely useful when you are about to do Text Analytics or Natural Language Processing. But as useful as Regex is, it’s also extremely confusing and hard to understand, and it always requires (at least for me) multiple rounds of DDGing, clicking through to Stack Overflow links and back again.
What’s Regex?
According to Wikipedia, a regular expression (regex or regexp) is a sequence of characters that define a search pattern.
How does it look?
This is the REGEX pattern to test the validity of a URL:
^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$
A typical regular expression contains characters (such as http) and meta-characters (such as []). The combination of the two forms a meaningful regular expression for a particular task.
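To see the pattern above in action, here is a minimal base R sketch. The sample URLs are made up for illustration, and passing perl = TRUE is an assumption made here so that the escaped characters are handled by the PCRE engine. Note that the pattern is a loose shape check rather than a full URL validator.

```r
# The URL pattern from above; backslashes are doubled inside an R string.
url_pattern <- "^(http)(s)?(\\:\\/\\/)(www\\.)?([^\\ ]*)$"

# Hypothetical sample inputs, made up for this illustration.
urls <- c(
  "https://www.example.com",
  "http://example.org/path",
  "ftp://example.com",
  "https://bad url.com"
)

# grepl() returns TRUE where the whole string fits the pattern.
is_url <- grepl(url_pattern, urls, perl = TRUE)
print(data.frame(url = urls, matches = is_url))
```

The first two strings fit the pattern; the third fails on the scheme and the fourth fails because of the space.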
So, What’s the problem?
Remembering how characters and meta-characters combine into a meaningful regex is a tedious task in itself, and it sometimes becomes a bigger task than the actual NLP problem, which is the larger goal.
Solution at Hand
Some good soul on this planet has created an open-source JavaScript library, JSVerbalExpressions, to make Regex creation easy. Then some other good soul (Tyler Littlefield) ported the JavaScript library to R as RVerbalExpressions. This is the beauty of the open source world.
Installation
RVerbalExpressions is available on GitHub, so you can use devtools or remotes to install it from there.
# install.packages("devtools")
devtools::install_github("VerbalExpressions/RVerbalExpressions")
# or, with remotes: remotes::install_github("VerbalExpressions/RVerbalExpressions")
Pseudo-Problem
Let’s create a pseudo-problem to solve with regex, so that we can see how this package creates regex programmatically.
A simple one, perhaps: we’ve got multiple strings in which a name is wrapped in digits, and we’d like to extract the names. Here’s what our input and expected output look like:
strings = c('123Abdul233','233Raja434','223Ethan Hunt444')
Abdul, Raja, Ethan Hunt
Once we solve this, we’ll move forward with slightly more complicated problems.
Pseudo-Code
Before we code, it’s always good to write out pseudo-code on a napkin, or on paper if you’ve got some. That is: we want to extract the names (which are made up of letters) and leave out the numbers (which are made up of digits). We build a regex for one string and then apply it to all the elements in our vector.
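The pseudo-code above can be sketched in plain base R before we bring in the package. This is only an illustration of the two steps, not the approach RVerbalExpressions takes; extract_name is a hypothetical helper named here for convenience.

```r
strings <- c("123Abdul233", "233Raja434", "223Ethan Hunt444")

# Step 1: work out the idea on a single string.
# Dropping every digit leaves only the name behind.
one_name <- gsub("[0-9]", "", strings[1])
print(one_name)

# Step 2: repeat it for every element of the vector.
# extract_name is a hypothetical helper, named for illustration only.
extract_name <- function(x) gsub("[0-9]", "", x)
all_names <- vapply(strings, extract_name, character(1), USE.NAMES = FALSE)
print(all_names)
```

Since gsub() is already vectorised, the explicit vapply() loop is there only to mirror the "one string, then all strings" wording of the pseudo-code.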
Loading
Like any other R package, we can load RVerbalExpressions with the library() function.
library(RVerbalExpressions)
Constructing the Expression
Extract Strings
Like many other modern-day R packages, RVerbalExpressions supports the %>% pipe operator for better simplicity and readability of the code. But for this problem of extracting the letters that sit between the numbers, we can simply use one function, rx_alpha(), to say that we need alphabetic characters from the given string. Note that rx_alpha() matches one letter at a time, so each letter comes back as a separate match.
expr = rx_alpha()
stringr::str_extract_all(strings,expr)
[[1]]
[1] "A" "b" "d" "u" "l"

[[2]]
[1] "R" "a" "j" "a"

[[3]]
[1] "E" "t" "h" "a" "n" "H" "u" "n" "t"
Extract Numbers
Just like the text we extracted, extracting numbers reads very much like English: we use the function rx_digit() to say that we need the digits from the given text.
expr = rx_digit()
stringr::str_extract_all(strings,expr)
[[1]]
[1] "1" "2" "3" "2" "3" "3"

[[2]]
[1] "2" "3" "3" "4" "3" "4"

[[3]]
[1] "2" "2" "3" "4" "4" "4"
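The output above returns one digit per match. If whole numbers are wanted instead, a quantifier is needed. Here is a minimal base R sketch, written with a hand-typed pattern rather than the package, so it makes no assumption about which RVerbalExpressions helper would add the quantifier.

```r
strings <- c("123Abdul233", "233Raja434", "223Ethan Hunt444")

# "[0-9]" matches one digit; adding "+" matches a run of one or more digits.
digit_runs <- regmatches(strings, gregexpr("[0-9]+", strings))
print(digit_runs)

# Convert each run to an integer if numbers are needed downstream.
numbers <- lapply(digit_runs, as.integer)
print(numbers)
```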
Another Constructor to extract the name as a word
Here, we can use the function rx_word() to match the name as a word (rather than as individual letters).
expr = rx_alpha() %>% rx_word() %>% rx_alpha()
stringr::str_extract_all(strings,expr)
[[1]]
[1] "Abdul"

[[2]]
[1] "Raja"

[[3]]
[1] "Ethan" "Hunt"
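In the result above, "Ethan Hunt" comes back as two separate words, while the expected output of our pseudo-problem treats it as one name. One possible way to keep multi-word names together is sketched below in base R, with a hand-written pattern; it assumes that words within a name are separated by exactly one space.

```r
strings <- c("123Abdul233", "233Raja434", "223Ethan Hunt444")

# One or more letters, optionally followed by further space-separated words.
name_pattern <- "[[:alpha:]]+( [[:alpha:]]+)*"

# regexpr() finds the first match per string; regmatches() pulls it out.
full_names <- regmatches(strings, regexpr(name_pattern, strings))
print(full_names)
```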
Expression
What if we want to use the expression somewhere else, or simply need the regex itself? It’s simple: the expression is what we’ve constructed, so printing it reveals the underlying regex pattern.
expr
"[A-z]\\w+[A-z]"
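Because the printed pattern is just a string, it can be reused outside the package. The sketch below assumes the pattern is exactly the one printed above and uses base R with perl = TRUE. It also probes two properties worth knowing before reusing it: [A-z] covers a few punctuation characters that sit between Z and a, and the pattern as a whole needs at least three characters to match.

```r
strings <- c("123Abdul233", "233Raja434", "223Ethan Hunt444")
pattern <- "[A-z]\\w+[A-z]"

# Reuse the generated pattern with base R instead of stringr.
words <- regmatches(strings, gregexpr(pattern, strings, perl = TRUE))
print(words)

# [A-z] spans code points 65 to 122, which also covers [ \ ] ^ _ and `.
in_range <- grepl("[A-z]", c("a", "Z", "_", "^", "5"), perl = TRUE)
print(in_range)

# The pattern needs a letter, at least one word character, then a letter,
# so a two-letter name such as "Al" is not matched.
long_enough <- grepl(pattern, c("Abdul", "Al", "A1b"), perl = TRUE)
print(long_enough)
```

If those edge cases matter for your data, a stricter character class such as [[:alpha:]] may be a better fit.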
Summary
Thus, we managed to build a regex pattern without knowing regex. Simply put, we programmatically generated a regex pattern using R (which doesn’t require in-depth knowledge of regex patterns) and accomplished a tiny task that we took up to demonstrate the potential. For more on regex, check out this DataCamp course. The entire code is available here.