Regular Expression Syntax

Last modified on August 24, 2020

If you have programmed in Perl or any other language with built-in regular-expression capabilities, then you probably know how much easier regular expressions make text processing and pattern matching. If you are unfamiliar with the term, a regular expression is simply a string of characters that defines a pattern used to search for a matching string. The AlertSite keyword match facility allows you to use the power of regular expressions to create complex pattern matches to monitor your sites.

Note: The regular expression feature is being offered to customers as a courtesy to provide expanded matching functionality. We do not offer technical support for the use of regular expressions. A detailed set of help notes and examples is provided below.

The following is a brief introduction to regular expression syntax to get you started.

Syntax

Simple Match

Suppose you want to search for a string with the word cat in it. In that case, your regular expression would simply be cat. If your search is case-insensitive, the words catalog, Catherine, or sophisticated would also match:

  • Regular expression: cat
  • Matches: catalog, Catherine, sophisticated
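
The simple match above can be sketched in Python. This is only an illustration: Python's re module stands in for the AlertSite matcher (an assumption, since the two engines may differ in details), and the re.IGNORECASE flag models a case-insensitive search.

```python
import re

# Case-insensitive search for the literal pattern "cat" anywhere in a word.
pattern = re.compile(r"cat", re.IGNORECASE)

words = ["catalog", "Catherine", "sophisticated", "dog"]
matched = [w for w in words if pattern.search(w)]
print(matched)
```
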
Period Notation

To match a three-letter word that starts with the letter t and ends with the letter n, you can use a wildcard notation: the period character (.). The regular expression would then be t.n and would match tan, ten, tin, and ton; it would also match t#n, tpn, and even t n, as well as many other nonsensical words. This is because the period character matches everything, including the space, the tab character, and even line breaks:

  • Regular expression: t.n
  • Matches: tan, ten, tin, ton, t#n, tpn, t n
Bracket Notation

To solve the problem of the period’s indiscriminate matches, you can specify characters you consider meaningful with the bracket ([ ]) expression, so that only those characters would match the regular expression. Thus, t[aeio]n would just match tan, ten, tin, and ton. toon would not match because you can only match a single character within the bracket notation:

  • Regular expression: t[aeio]n
  • Matches: tan, ten, tin, ton
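
To compare the period and bracket notations side by side, here is a small sketch in Python (used only as a stand-in for the AlertSite matcher, which is an assumption). re.fullmatch is used so that each candidate must match the pattern in its entirety.

```python
import re

candidates = ["tan", "ten", "tin", "ton", "t#n", "tpn", "t n", "toon"]

# "." accepts any single character between t and n.
dot_matches = [s for s in candidates if re.fullmatch(r"t.n", s)]

# "[aeio]" accepts exactly one of the listed vowels.
bracket_matches = [s for s in candidates if re.fullmatch(r"t[aeio]n", s)]

print(dot_matches)
print(bracket_matches)
```
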
OR operator

If you want to match toon in addition to all the words matched in the previous section, you can use the | notation, which is basically an OR operator. To match toon, use the regular expression t(a|e|i|o|oo)n. You cannot use the bracket notation here because it will only match a single character. Instead, use parentheses (( )):

  • Regular expression: t(a|e|i|o|oo)n
  • Matches: tan, ten, tin, ton, toon

As you can see, parentheses may be used for grouping contiguous sets of character patterns together with an optional | operator to provide alternative selections during matching. That is, any of the alternative patterns within the group may produce a match (with left to right precedence):

  • Regular expression: Good (morning|afternoon|evening)!
  • Matches: Good morning!, Good afternoon!, Good evening!
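
A brief Python sketch of both alternation examples follows. Python's re module is an assumed stand-in for the AlertSite matcher, and the sample strings are made up for illustration.

```python
import re

vowel_group = re.compile(r"t(a|e|i|o|oo)n")
greeting = re.compile(r"Good (morning|afternoon|evening)!")

# fullmatch requires the whole word to fit the pattern.
words = ["tan", "ten", "tin", "ton", "toon", "tun"]
word_hits = [w for w in words if vowel_group.fullmatch(w)]

# group(1) returns the alternative that was matched inside the parentheses.
m = greeting.search("Good evening! Welcome back.")
time_of_day = m.group(1) if m else None

print(word_hits, time_of_day)
```
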
Quantifier Notations

You may also want to append quantifier notations to specify how often a particular character or group of characters should repeat. For example, you can use the * notation to specify that the previous character should match zero or more times:

  • Regular expression: Surprise!*
  • Matches: Surprise, Surprise!, Surprise!!, Surprise!!!

If the * notation is combined with the wildcard (period) character, it will match all (zero or more) characters, including spaces, tabs, and line breaks between two separate notations:

  • Regular expression: Hello.*There!
  • Matches: HelloThere!, Hello There!, Hello everyone over There!

The following quantifier notations may be used to determine how many times a given notation to the immediate left of the quantifier notation should repeat itself:

Notation Definition
* 0 or more times
+ 1 or more times
? 0 or 1 time
{n} Exactly n number of times
{n,} At least n times
{n,m} At least n but not more than m times
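
The quantifiers in the table can be compared with a short Python sketch. Python is used here only as an assumed stand-in for the AlertSite matcher, and full_matches is a hypothetical helper written for this example.

```python
import re

samples = ["Surprise", "Surprise!", "Surprise!!", "Surprise!!!"]

def full_matches(pattern):
    """Return the samples that the pattern matches in their entirety."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = full_matches(r"Surprise!*")            # zero or more "!"
plus = full_matches(r"Surprise!+")            # one or more "!"
optional = full_matches(r"Surprise!?")        # zero or one "!"
exactly_two = full_matches(r"Surprise!{2}")   # exactly two "!"
at_least_two = full_matches(r"Surprise!{2,}") # two or more "!"
one_to_two = full_matches(r"Surprise!{1,2}")  # one or two "!"
```

In each pattern the quantifier applies only to the ! immediately to its left, not to the whole word.
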
Template Matching

You may also want to match a particular format or template of text, rather than a literal pattern of static characters. For example, suppose you want to match a generic social security number pattern. The format for US social security numbers is 999-99-9999. The regular expression you would use to match this is as follows:

  • Regular expression: [0-9]{3}\-[0-9]{2}\-[0-9]{4}
  • Matches: All social security numbers of the form 123-12-1234

In regular expressions, the hyphen (-) notation has special meaning inside brackets; it indicates a (sequential) range of possible characters such as A-Z, a-z, or 0-9. Thus, the notation [0-9]{3} in the first element of the pattern matches any string of exactly 3 digits, each of which may range from 0-9. This is followed by an escaped hyphen character. Escape the - character with a backslash (\) when matching literal hyphens in a pattern because of its special meaning within a regular expression.

If, in your template pattern, you wish to make the hyphen optional, say, you consider both 999-99-9999 and 999999999 acceptable formats, you can use the ? quantifier notation as shown:

  • Regular expression: [0-9]{3}\-?[0-9]{2}\-?[0-9]{4}
  • Matches: All social security numbers of the forms 123-12-1234 and 123121234

Let us take a look at another example. One format for US car plate numbers consists of four numeric characters followed by two letters. Thus, a regular expression might first include a [0-9]{4} numeric part, followed by a [A-Z]{2} textual part:

  • Regular expression: [0-9]{4}[A-Z]{2}
  • Matches: US car plate numbers of the 8836KV format
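
The template patterns above can be tried out with a small Python sketch (Python's re module is an assumed stand-in for the AlertSite matcher; the sample values are invented).

```python
import re

# Social security number template with optional hyphens.
ssn = re.compile(r"[0-9]{3}\-?[0-9]{2}\-?[0-9]{4}")

# Four digits followed by two uppercase letters.
plate = re.compile(r"[0-9]{4}[A-Z]{2}")

for text in ["123-12-1234", "123121234", "12-345-6789"]:
    print(text, bool(ssn.fullmatch(text)))

for text in ["8836KV", "88KV36"]:
    print(text, bool(plate.fullmatch(text)))
```
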
NOT Notation

The ^ notation is also called the NOT notation. If used in brackets, ^ indicates the character(s) you do not want to match. For example, the expression below matches all words except those starting with the letter x or y:

  • Regular expression: \b[^xy][a-z]+\b
  • Matches: All (lowercase) words except those that start with the letter x or y

In the above example, the + quantifier is used to specify one or more characters in range of a-z, and the \b notation is used to match at word boundaries.
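
As a rough illustration, the sketch below tests the pattern against individual words in Python (an assumed stand-in for the AlertSite matcher). One caveat worth knowing: [^xy] accepts any character other than x or y, including a space, so in running text the pattern can match a space followed by a word. The stricter pattern shown is a hypothetical alternative, not part of the original example.

```python
import re

doc_pattern = re.compile(r"\b[^xy][a-z]+\b")

# Tested one word at a time, the pattern behaves as described.
words = ["apple", "zebra", "xray", "yellow"]
kept = [w for w in words if doc_pattern.fullmatch(w)]

# In running text, the space before "yellow" satisfies [^xy].
in_text = doc_pattern.findall("a yellow")

# Hypothetical stricter variant: the first character must be a letter a-w or z.
stricter = re.compile(r"\b[a-wz][a-z]+\b")
strict_hits = stricter.findall("xray yellow zebra apple")

print(kept, in_text, strict_hits)
```
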

Miscellaneous Notations

To make life easier, some shorthand notations for commonly used regular expressions also exist, as shown below:

\d [0-9]
\D [^0-9]
\w [A-Za-z0-9_]
\W [^A-Za-z0-9_]
\s [ \t\n\r\f]
\S [^ \t\n\r\f]

To illustrate, we can use \d for all instances of [0-9] we used before, as was the case with our social security number expressions. The revised regular expression is:

  • Regular expression: \d{3}\-\d{2}\-\d{4}
  • Matches: All social security numbers of the form 123-12-1234

Or, suppose you want to match an IP address. An IP address consists of four 1-byte segments (octets); each segment has a value between 0 and 255 and is separated from the others by a period. Thus, in each individual segment of the IP address, you have at least one and at most three digits. The following regular expression might be used to match just such a construct:

  • Regular expression: \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
  • Matches: Strings made of four segments of one to three digits each, separated by periods, such as 192.168.0.1. The pattern does not itself check that each segment is between 0 and 255.

You need to escape the period character because you literally want it to be there; you do not want it read in terms of its special meaning in regular expression syntax, as explained earlier. Other special characters that need to be escaped when used in a literal match are discussed in the Additional Considerations section below.
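
The \d{1,3} pattern checks only the shape of the address, so it also accepts values such as 999.999.999.999. If the 0 to 255 range matters, one possible approach is to spell out each octet as an alternation. The sketch below is a hypothetical stricter pattern, not part of the original example, written in Python as an assumed stand-in for the AlertSite matcher.

```python
import re

# Shape-only pattern from the text above.
loose = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

# One octet: 250-255, 200-249, 100-199, or 0-99 without a leading zero.
octet = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

# Four octets separated by literal periods.
strict = re.compile(rf"{octet}(?:\.{octet}){{3}}")

for text in ["192.168.0.1", "255.255.255.255", "256.1.1.1", "999.999.999.999"]:
    print(text, bool(loose.fullmatch(text)), bool(strict.fullmatch(text)))
```
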

Perhaps you are trying to match a particular type of date string. A typical date format might be: June 26, 1951. One example of a regular expression to match strings of this type would be:

  • Regular expression: [A-Za-z]+\s+[0-9]{1,2},\s*[0-9]{4}
  • Matches: All dates with the format of Month DD, YYYY

Broken down, the first element of the expression ([A-Za-z]+) matches the month (rather, a word consisting of at least 1 alphabetic character), followed by at least one mandatory whitespace character (\s+), followed by the day of the month up to 2 digits ([0-9]{1,2}), followed by a mandatory comma, followed by optional whitespace (\s*), followed by a four-digit year field ([0-9]{4}). This pattern may be adequate, but you might also choose to enclose the full set of month names within a parenthetical grouping, separated by the | notation, such as (January|February|March ... ) instead of the weaker [A-Za-z]+ notation.
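
The difference between the weaker and the stronger date pattern can be sketched in Python (an assumed stand-in for the AlertSite matcher). The variable names and the sample string "Junk 26, 1951" are made up for illustration.

```python
import re

# Pattern from the text: any alphabetic word is accepted as the month.
loose = re.compile(r"[A-Za-z]+\s+[0-9]{1,2},\s*[0-9]{4}")

# Stronger variant: only real month names are accepted.
months = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
strict = re.compile(rf"(?:{months})\s+[0-9]{{1,2}},\s*[0-9]{{4}}")

for text in ["June 26, 1951", "Junk 26, 1951"]:
    print(text, bool(loose.fullmatch(text)), bool(strict.fullmatch(text)))
```
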

Note: \s is shorthand notation for whitespace, and matches either a blank space, tab, newline, return, or form-feed character.
More Special Character Notations

The following table defines additional notations that may be useful in your regular expression pattern matches:

Notation Definition
\ Quote the next metacharacter.
^ Match the beginning of the line.
. Match any character.
$ Match the end of the line.
| Alternation (OR)
( ) Grouping.
[ ] Character class.
\w Match a word character (alphanumerics and _ chars).
\W Match a non-word character.
\s Match a whitespace character.
\S Match a non-whitespace character.
\d Match a digit character.
\D Match a non-digit character.
\b Match a word boundary.
\B Match a non-(word boundary).
\A Match only at beginning of string (same as ^).
\Z Match only at end of string (same as $).

Keyword Matching in Website URL Monitors

AlertSite’s keyword matching treats an entire web page as one continuous line of text. Therefore, both the Plain Text and Regular Expression keyword match types permit matches across multiple lines of HTML source text. Typical HTML source text usually includes plain text mixed together with HTML tags and attributes, and may optionally include snippets of programmatic scripting code.

It may be possible for your regular expression to satisfy multiple pattern matches on the same web page. Which pattern ultimately gets matched may or may not be what you desire. For example, you may only want to consider a match successful if the keyword or pattern is found at a particular location on the web page, or only if it appears on the page along with another keyword located somewhere else on the same page.

For example, suppose the following HTML source code sample was taken from the source of a page on your website:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <title>CompanyName : My Company Website</title>

... additional HTML source code ...

    <a href="http://www.CompanyName.com/login.shtml">Click Here</a>

... additional HTML source code ...

    <strong>Copyright©1999-2004 CompanyName.</strong><br>

... additional HTML source code ...

If you wanted to create a regular expression to match CompanyName, but only when it appears in the title of your web page, you might use the following regular expression:

  • Regular expression: <title>.*CompanyName.*</title>
  • Matches: Any occurrence of CompanyName between the <title> and </title> HTML tags

Similarly, if you wanted your match to require multiple keywords from different areas of the page, say for example, CompanyName followed somewhere by Login Successful, it might look something like this:

  • Regular expression: CompanyName.*Login Successful
  • Matches: The string CompanyName, followed by any number of characters (all the middle stuff), followed by the string Login Successful

In the above examples, the .* quantifier will generally match as much of the source text as possible while still allowing the whole regular expression to match. Quantifiers that grab as much text as possible are called maximal match or greedy quantifiers (see Quantifier notations above).
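
Both patterns can be tried against a small page in Python. This is a sketch under assumptions: the page text is invented, Python's re module stands in for the AlertSite matcher, and the re.DOTALL flag is used to imitate the way AlertSite treats the whole page as one line (by default, Python's period does not match a line break).

```python
import re

# Hypothetical page source spanning several lines.
page = """<html>
  <head>
    <title>CompanyName : My Company Website</title>
  </head>
  <body>
    <p>Login Successful</p>
  </body>
</html>"""

in_title = re.search(r"<title>.*CompanyName.*</title>", page, re.DOTALL)
two_keywords = re.search(r"CompanyName.*Login Successful", page, re.DOTALL)

# Without DOTALL, "." stops at line breaks and the keywords cannot be joined.
without_dotall = re.search(r"CompanyName.*Login Successful", page)

print(bool(in_title), bool(two_keywords), bool(without_dotall))
```
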

But there are times when we would like these quantifiers to match a minimal piece of a text, rather than a maximal piece. The minimal match or non-greedy quantifiers are: ??, *?, +?, and {}?. These are the same standard quantifiers but with a ? appended to them. They have the following meanings:

Quantifier Description
?? Match 0 or 1 times. Try 0 first, then 1.
*? Match 0 or more times, but as few as possible.
+? Match 1 or more times, but as few as possible.
{n}? Match exactly n times. Equivalent to {n}.
{n,}? Match at least n times, but as few as possible.
{n,m}? Match at least n but no more than m times, as few as possible.
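
The difference between a greedy and a non-greedy quantifier is easiest to see on text that contains the closing delimiter more than once. A minimal Python sketch (Python being an assumed stand-in for the AlertSite matcher, with an invented HTML fragment):

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the last </b> in the text.
greedy = re.search(r"<b>.*</b>", html).group(0)

# Non-greedy: .*? stops at the first </b> it can reach.
lazy = re.search(r"<b>.*?</b>", html).group(0)

print(greedy)
print(lazy)
```
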

Since a regular expression can match a string in several different ways, we can use some of the following principles to predict which way the regular expression will match:

  • Principle 1: Taken as a whole, any regular expression will be matched at the earliest possible position in the string.

  • Principle 2: In an alternation a|b|c..., the leftmost alternative that allows a match for the whole regular expression will be the one used.

  • Principle 3: The maximal matching quantifiers ?, *, +, and {n,m} will in general match as much of the string as possible while still allowing the whole regular expression to match.

  • Principle 4: If there are two or more elements in a regular expression, the leftmost greedy quantifier, if any, will match as much of the string as possible while still allowing the whole regular expression to match. The next leftmost greedy quantifier, if any, will try to match as much of the string remaining available to it as possible, while still allowing the whole regular expression to match. And so on, until all the regular expression elements are satisfied.
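
These principles can be observed with a few one-line experiments. The sketch below uses Python's re module, which follows the same leftmost, backtracking behavior; treating it as a stand-in for the AlertSite matcher is an assumption, and the sample strings are invented.

```python
import re

# Principle 1: the earliest position wins, even though "cat" is listed first.
earliest = re.search(r"cat|dog", "hotdog and cat").group(0)

# Principle 2: at the same position, the leftmost alternative that works wins.
leftmost_alt = re.search(r"to|tomato", "tomato").group(0)

# Principle 3: a greedy quantifier takes as much as it can.
greedy = re.search(r"a+", "caaat").group(0)

# Principle 4: the leftmost greedy quantifier is served first; the next one
# receives only what must be given back for the whole pattern to match.
first, second = re.search(r"(.*)(\d+)", "abc123").groups()

print(earliest, leftmost_alt, greedy, first, second)
```
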

Advanced Pattern Matching

In order to handle more complex pattern matching requirements, you may choose to use some of the more advanced features of regular expression syntax such as subpattern location independence and lookahead assertions. Suggested solutions to some of these situations are presented below. For more detailed information, please consider reviewing an online tutorial on regular expression syntax.

Location Independence

You might want to construct patterns where multiple search subpatterns may appear anywhere on the page, in any order. Here are some potential solutions (where ALPHA and BETA are your keywords or sub-patterns):

Regular Expression Matches
ALPHA|BETA Any occurrence of either ALPHA or BETA, anywhere on the page (overlapping permitted).
(?=.*ALPHA).*BETA When both ALPHA and BETA occur, anywhere on the page (overlapping permitted).
(?:^.*ALPHA.*BETA)|(?:^.*BETA.*ALPHA) When both ALPHA and BETA occur, anywhere on the page (non-overlapping).
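
The difference between the overlapping and non-overlapping forms can be sketched in Python. The helper names both_anywhere and both_no_overlap are hypothetical, Python's re module is an assumed stand-in for the AlertSite matcher, and re.DOTALL imitates the single-line treatment of the page. The keywords abc and bcd are chosen because they overlap in the text abcd.

```python
import re

def both_anywhere(alpha, beta):
    """Both keywords must occur, in any order; overlap is permitted."""
    a, b = re.escape(alpha), re.escape(beta)
    return re.compile(rf"(?=.*{a}).*{b}", re.DOTALL)

def both_no_overlap(alpha, beta):
    """Both keywords must occur, in either order, without overlapping."""
    a, b = re.escape(alpha), re.escape(beta)
    return re.compile(rf"(?:^.*{a}.*{b})|(?:^.*{b}.*{a})", re.DOTALL)

print(bool(both_anywhere("ALPHA", "BETA").search("BETA first, ALPHA later")))
print(bool(both_anywhere("ALPHA", "BETA").search("only ALPHA here")))
print(bool(both_anywhere("abc", "bcd").search("abcd")))
print(bool(both_no_overlap("abc", "bcd").search("abcd")))
```
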
Lookaround Assertions

You might want to construct patterns which make use of look-ahead and look-behind assertions. Here are some potential solutions (where ALPHA and BETA are your keywords or sub-patterns):

Regular Expression Matches
ALPHA(?!BETA) Any occurrence of ALPHA that is not immediately followed by BETA (negative look-ahead assertion).
(?<=ALPHA)BETA Any occurrence of BETA that is immediately preceded by ALPHA (positive look-behind assertion).
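
A short Python sketch of both assertions follows (Python's re module is an assumed stand-in for the AlertSite matcher; the sample strings are invented). The start positions show which occurrence was matched; the assertion itself consumes no text.

```python
import re

# Negative look-ahead: skip the ALPHA that is directly followed by BETA.
text_a = "ALPHABETA ALPHAGAMMA"
ahead_starts = [m.start() for m in re.finditer(r"ALPHA(?!BETA)", text_a)]

# Positive look-behind: keep only the BETA directly preceded by ALPHA.
text_b = "ALPHABETA BETA"
behind_starts = [m.start() for m in re.finditer(r"(?<=ALPHA)BETA", text_b)]

print(ahead_starts, behind_starts)
```
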
Case Sensitivity

To make your match criteria wholly or partially case insensitive, you may embed the (?i) and (?i:pattern) notations within your regular expressions, respectively. Here are some potential solutions (where ALPHA and BETA are your keywords or sub-patterns):

Regular Expression Matches
(?i)alpha-beta Any occurrence of ALPHA and BETA, regardless of case, separated by a dash (for example, alpha-beta, ALPHA-BETA, aLpHa-BetA, and so on).
(?i:alpha)-BETA Any occurrence of ALPHA regardless of case, followed by a dash and an uppercase BETA (for example, aLpHa-BETA, Alpha-BETA, alphA-BETA, and so on).
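
The whole-pattern and partial forms can be compared in a small Python sketch (Python's re module supports both notations and is used here as an assumed stand-in for the AlertSite matcher).

```python
import re

# (?i) at the start makes the entire pattern case insensitive.
whole = re.compile(r"(?i)alpha-beta")

# (?i:...) limits case insensitivity to the enclosed part only.
partial = re.compile(r"(?i:alpha)-BETA")

for text in ["aLpHa-BetA", "aLpHa-BETA", "alpha-beta"]:
    print(text, bool(whole.fullmatch(text)), bool(partial.fullmatch(text)))
```
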

Additional Considerations

Some other things you may want to consider when constructing your regular expressions:

  • You should not enclose your regular expression patterns between forward slashes, as they are already assumed.

  • The following special characters should be escaped (using a \ backslash) if you are trying to literally match these characters:

    \ ^ . $ | ( ) [ ] * + ? { } ,

  • Regular expression translation and substitution features are not used by the AlertSite keyword matching facility and thus are not supported.
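
To see why escaping matters, consider a keyword that contains several special characters. The sketch below uses Python as an assumed stand-in for the AlertSite matcher; the keyword is invented, and re.escape is a Python convenience that adds the backslashes automatically, which is not something AlertSite itself provides.

```python
import re

keyword = "Total: $5.00 (USD)"

# Used as-is, "$" acts as an end-of-line anchor and the parentheses as a group,
# so the pattern does not match its own literal text.
unescaped = re.search(keyword, keyword)

# Escaped by hand, as you would type it into a keyword field.
by_hand = re.search(r"Total: \$5\.00 \(USD\)", keyword)

# Escaped automatically for comparison.
automatic = re.search(re.escape(keyword), keyword)

print(bool(unescaped), bool(by_hand), bool(automatic))
```
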

Express Yourself

Now that you have been introduced to the pattern matching power of regular expressions, it is up to you to decide whether to use a Plain Text match or the more powerful Regular Expression type. When used appropriately, regular expressions can help a great deal in constructing complex pattern matches for your site monitoring needs. This tutorial touches only briefly on the full capabilities of regular expression pattern matching. For additional information, you may wish to consult one of the many widely available regular expression tutorials on the Internet.

See Also

Website Monitor Settings
