Regular Expressions For Regular Folk

Groups

Groups, as the name suggests, are meant to be used to “group” components of regular expressions. These groups can be used to extract parts of a match, repeat a part of the regex, refer back to previously matched substrings, and make a regex easier to read.

We’ll see how to do a lot of this in later chapters, but learning how groups work now will let us study some great examples when we get there.

Capturing groups

Capturing groups are denoted by (). Here’s an expository example:

/a(bcd)e/g
  • 1 match: abcde
    1. abcde
  • 1 match: abcdefg?
    1. abcde

Capturing groups allow extracting parts of matches.

/\{([^{}]*)\}/g
  • 1 match: {braces}
    1. {braces}
  • 2 matches: {two} {pairs}
    1. {two}
    2. {pairs}
  • 1 match: { {nested} }
    1. {nested}
  • 1 match: { incomplete } }
    1. { incomplete }
  • 1 match: {}
    1. {}
  • 0 matches: {unmatched

Using your language’s regex functions, you would be able to extract the text between the matched braces for each of these strings.
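As a rough illustration of what that extraction could look like, here is a minimal sketch that assumes Python’s standard `re` module (the chapter itself does not pick a language). The helper name `braced_contents` is made up for this example.

```python
import re

# The pattern from the example above: "{", any run of non-brace characters, "}".
BRACES = re.compile(r"\{([^{}]*)\}")


def braced_contents(text):
    """Return what group 1 captured for every match in `text`."""
    return [match.group(1) for match in BRACES.finditer(text)]


print(braced_contents("{two} {pairs}"))   # ['two', 'pairs']
print(braced_contents("{ {nested} }"))    # ['nested']
print(braced_contents("{unmatched"))      # []
```

Group 0 is always the whole match (braces included), while group 1 holds only the text captured by the first pair of parentheses.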

Capturing groups can also be used to group parts of a regex so that the whole group can be repeated. While we will cover repetition in detail in the chapters that follow, here’s an example that demonstrates the utility of groups.

/a(bcd)+e/g
  • 1 match: abcdefg
    1. abcde
  • 1 match: abcdbcde
    1. abcdbcde
  • 1 match: abcdbcdbcdef
    1. abcdbcdbcde
  • 0 matches: ae
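To see what the parentheses change here, the sketch below compares the grouped pattern with an ungrouped variant. It assumes Python’s `re` module; the ungrouped pattern and the helper `first_match` are hypothetical additions for comparison, not part of the chapter.

```python
import re

GROUPED = re.compile(r"a(bcd)+e")   # "+" repeats the whole group "bcd"
UNGROUPED = re.compile(r"abcd+e")   # "+" repeats only the "d"


def first_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    match = pattern.search(text)
    return match.group(0) if match else None


print(first_match(GROUPED, "abcdbcde"))    # abcdbcde
print(first_match(UNGROUPED, "abcdbcde"))  # None
print(first_match(UNGROUPED, "abcddde"))   # abcddde
```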

Other times, they are used to group logically similar parts of the regex for readability.

/(\d\d\d\d)-W(\d\d)/g
  • 1 match: 2020-W12
    1. 2020-W12
  • 1 match: 1970-W01
    1. 1970-W01
  • 1 match: 2050-W50-6
    1. 2050-W50
  • 1 match: 12050-W50
    1. 2050-W50
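The two groups also make it easy to pull the year and the week out separately. A minimal sketch, assuming Python’s `re` module and a hypothetical helper named `year_and_week`:

```python
import re

WEEK_DATE = re.compile(r"(\d\d\d\d)-W(\d\d)")


def year_and_week(text):
    """Return (year, week) from the first match in `text`, or None."""
    match = WEEK_DATE.search(text)
    if match is None:
        return None
    # Group 1 is the four-digit year, group 2 is the two-digit week.
    return int(match.group(1)), int(match.group(2))


print(year_and_week("2020-W12"))   # (2020, 12)
print(year_and_week("12050-W50"))  # (2050, 50)
```

As in the list above, the pattern is not anchored, so `12050-W50` still yields a match starting at its second character.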

Backreferences

Backreferences allow referring to previously captured substrings.

The match from the first group is referred to as \1, that from the second as \2, and so on.

/([abc])=\1=\1/g
  • 1 match: a=a=a
    1. a=a=a
  • 1 match: ab=b=b
    1. b=b=b
  • 0 matches: a=b=c

Backreferences cannot be used to reduce duplication in regexes. They refer to the text that a group matched, not to the group’s pattern.

/[abc][abc][abc]/g
  • 1 match: abc
    1. abc
  • 1 match: a cable
    1. cab
  • 1 match: aaa
    1. aaa
  • 1 match: bbb
    1. bbb
  • 1 match: ccc
    1. ccc

/([abc])\1\1/g
  • 0 matches: abc
  • 0 matches: a cable
  • 1 match: aaa
    1. aaa
  • 1 match: bbb
    1. bbb
  • 1 match: ccc
    1. ccc
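The difference between the two patterns can be checked with a short script. This is only a sketch, assuming Python’s `re` module (which also writes backreferences as `\1`); the helper `all_matches` is a name invented for the example.

```python
import re

THREE_FROM_CLASS = re.compile(r"[abc][abc][abc]")  # any three of a, b, c
SAME_THREE = re.compile(r"([abc])\1\1")            # one of a, b, c, then that same character twice


def all_matches(pattern, text):
    """Return the full text of every match of `pattern` in `text`."""
    return [match.group(0) for match in pattern.finditer(text)]


print(all_matches(THREE_FROM_CLASS, "a cable"))  # ['cab']
print(all_matches(SAME_THREE, "a cable"))        # []
print(all_matches(SAME_THREE, "aaa"))            # ['aaa']
```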

Here’s an example that demonstrates a common use-case:

/\w+([,|])\w+\1\w+/g
  • 1 match: comma,separated,values
    1. comma,separated,values
  • 1 match: pipe|separated|values
    1. pipe|separated|values
  • 0 matches: wb|mixed,delimiters
  • 0 matches: wb,mixed|delimiters

This cannot be achieved with repeated character classes.

/\w+[,|]\w+[,|]\w+/g
  • 1 match: comma,separated,values
    1. comma,separated,values
  • 1 match: pipe|separated|values
    1. pipe|separated|values
  • 1 match: wb|mixed,delimiters
    1. wb|mixed,delimiters
  • 1 match: wb,mixed|delimiters
    1. wb,mixed|delimiters
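In code, the backreference version can double as a way to find out which delimiter a line uses consistently. A minimal sketch, assuming Python’s `re` module and a hypothetical helper `consistent_delimiter`:

```python
import re

# Group 1 captures the first delimiter; \1 requires the second one to be identical.
CONSISTENT = re.compile(r"\w+([,|])\w+\1\w+")


def consistent_delimiter(text):
    """Return the delimiter if the pattern matches somewhere in `text`, else None."""
    match = CONSISTENT.search(text)
    return match.group(1) if match else None


print(consistent_delimiter("comma,separated,values"))  # ,
print(consistent_delimiter("pipe|separated|values"))   # |
print(consistent_delimiter("wb|mixed,delimiters"))     # None
```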

Non-capturing groups

Non-capturing groups are very similar to capturing groups, except that they don’t create “captures”. They take the form (?:).

Non-capturing groups are usually used in conjunction with capturing groups. Perhaps you are attempting to extract some parts of the matches using capturing groups. You may wish to use a group without messing up the order of the captures. This is where non-capturing groups come in handy.
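The effect on capture numbering is easiest to see side by side. The sketch below assumes Python’s `re` module; both patterns are hypothetical variants of the earlier /a(bcd)+e/ example, with an extra capturing group around the final e.

```python
import re

text = "abcdbcde"

# With a capturing group around "bcd", the "e" we want ends up in group 2.
capturing = re.search(r"a(bcd)+(e)", text)
print(capturing.groups())       # ('bcd', 'e')

# With a non-capturing group, "bcd" is still repeated as a unit,
# but it creates no capture, so "e" is group 1.
non_capturing = re.search(r"a(?:bcd)+(e)", text)
print(non_capturing.groups())   # ('e',)
```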

Examples

Query String Parameters

/^\?(\w+)=(\w+)(?:&(\w+)=(\w+))*$/g
  • 0 matches: (empty string)
  • 0 matches: ?
  • 1 match: ?a=b
    1. ?a=b
  • 1 match: ?a=b&foo=bar
    1. ?a=b&foo=bar

We match the first key-value pair separately because that allows us to use &, the separator, as part of the repeating group.
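One way this pattern might be used in practice is sketched below, assuming Python’s `re` module. In `re`, a capturing group inside a repeated group keeps only its last iteration, so the sketch uses the full pattern to validate the string and a second, simpler pattern to collect every pair. `PAIR` and `parse_query` are names invented for this example.

```python
import re

QUERY = re.compile(r"^\?(\w+)=(\w+)(?:&(\w+)=(\w+))*$")
PAIR = re.compile(r"(\w+)=(\w+)")  # hypothetical helper pattern for a single key=value pair


def parse_query(text):
    """Return a list of (key, value) pairs, or None if `text` is not a valid query string."""
    if QUERY.search(text) is None:
        return None
    return PAIR.findall(text)


print(parse_query("?a=b&foo=bar"))  # [('a', 'b'), ('foo', 'bar')]
print(parse_query("?"))             # None

# Groups 3 and 4 only remember the last repetition:
print(QUERY.search("?a=b&c=d&e=f").groups())  # ('a', 'b', 'e', 'f')
```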

(Basic) HTML tags

As a rule of thumb, do not use regex to match XML/HTML.

However, it’s a relevant example:

/<([a-z]+)+>(.*)<\/\1>/gi
  • 1 match: <p>paragraph</p>
    1. <p>paragraph</p>
  • 1 match: <li>list item</li>
    1. <li>list item</li>
  • 1 match: <p><span>nesting</span></p>
    1. <p><span>nesting</span></p>
  • 0 matches: <p>hmm</li>
  • 1 match: <p><p>not clever</p></p></p>
    1. <p><p>not clever</p></p></p>

Names

Find: \b(\w+) (\w+)\b

Replace: $2, $1

Before

John Doe
Jane Doe
Sven Svensson
Janez Novak
Janez Kranjski
Tim Joe

After

Doe, John
Doe, Jane
Svensson, Sven
Novak, Janez
Kranjski, Janez
Joe, Tim
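The same find-and-replace can be scripted. The sketch below assumes Python’s `re.sub`, where captured groups are written as `\1` and `\2` in the replacement string rather than `$1` and `$2`; the variable names are illustrative.

```python
import re

names = "John Doe\nJane Doe\nSven Svensson"

# Group 1 is the first name and group 2 the last name; the replacement swaps them.
swapped = re.sub(r"\b(\w+) (\w+)\b", r"\2, \1", names)
print(swapped)
```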
                      

Backreferences and plurals

Find: \bword(s?)\b

Replace: phrase$1

Before

This is a paragraph with some words.

Some instances of the word "word" are in their plural form: "words".

Yet, some are in their singular form: "word".

After

This is a paragraph with some phrases.

Some instances of the phrase "phrase" are in their plural form: "phrases".

Yet, some are in their singular form: "phrase".
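A scripted version of this replacement is sketched below, again assuming Python’s `re.sub` with `\1` in place of `$1`. The group `(s?)` captures either "s" or an empty string, so the replacement keeps whichever form the original word had.

```python
import re

text = 'Some instances of the word "word" are in their plural form: "words".'

# \1 re-inserts the optional "s" captured by (s?), preserving singular or plural.
replaced = re.sub(r"\bword(s?)\b", r"phrase\1", text)
print(replaced)
```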