Upcoming Hans

LibrePlanet at MIT in Boston

"Device and Personal Privacy Technology Roundup"

Sunday, 2018Mar25 @ 11:50

What R RegEx?

Short: Regular Expressions (RegEx) are sequences of characters that define a matching pattern using a specialized language.

Medium: RegEx define patterns that describe sets of strings. They are used by many common *NIX tools such as grep and sed. They can also be used in many programming languages such as Perl, PHP and Python. Many database query languages support RegEx as well.

Created in the 1950s.

Popularized by *NIX :)

What R ! RegEx?

RegEx Examples

sed -r -i -e 's/ *$//' script.sh
ip addr list | grep -E '(([12]{,1}[[:digit:]]{1,2})\.){3}([12]{,1}[[:digit:]]{1,2})'
ip addr list | grep -E '([[:lower:][:digit:]]{2}:){5}[[:lower:][:digit:]]{2}'

Meet Star

   /|
  / |
 /__|______
|          |
|   Star   |
|  __  __  |
| |  ||  | |
| |__||__| |
|  __  __()|      --------------
| |  ||  | |     < Hi, I'm Star >
| |  ||  | |      --------------
| |__||__| |   /
|          |  /
|__________| *

Star is a Helpful Neighbor

* == zero or more of the previous character

Star is a modifier acting on whatever comes before it

x* == zero or more x

y* == zero or more y

sed -r -i -e 's/ *$//' script.sh
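
To try the trailing-space cleanup without touching a file in place, here is a small sketch that runs the same Star pattern on standard input. The function name strip_trailing_spaces and the sample line are made up for illustration; -E is the spelling of -r that both GNU and BSD sed accept.

```shell
# Hypothetical helper: remove trailing spaces from every line on stdin.
strip_trailing_spaces() {
  sed -E 's/ *$//'
}

# The brackets make it visible that the trailing spaces are gone.
printf '[%s]\n' "$(printf 'echo hello   \n' | strip_trailing_spaces)"
# prints: [echo hello]
```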

Star grep Examples

$ # the same as "grep x file.txt"
$ grep -E 'xy*' file.txt
$ # find at least one x, still the same as "grep x file.txt"
$ grep -E 'xx*' file.txt
$ # use grep to cat the file
$ grep -E 'x*' file.txt
$ # sloppily also look for British spelling
$ grep -E 'colou*r' file.txt

Star sed Examples

$ # search for the first zero or more r, then replace
$ echo fred | sed -re 's/r*/x/'
xfred
$ # search for all zero or more r, then replace
$ echo fred | sed -re 's/r*/x/g'
xfxexdx
$ # search for all zero or more r, then replace
$ echo anke | sed -re 's/r*/x/g'
xaxnxkxex

Meet Plus

   /|
  / |
 /__|______
|   Plus   |
|  __  __  |
| |  ||  | |
| |  ||  | |
| |__||__| |      -----------
|  __  __()|     < hi I'm Plus>
| |  ||  | |      -----------
| |  ||  | |   /
| |__||__| |  /
|__________| +

We’ve moved into a fancy neighborhood now!

Plus is a Neighbor That Counts ( to at least one )

+ == one or more of the previous character

Plus is a modifier acting on whatever comes before it

x+ == one or more x

y+ == one or more y

sed -r -i -e 's/ +$//' script.sh

Plus grep Examples

$ # search for x followed by at least one y
$ grep -E 'xy+' file.txt
$ # find at least one x, still the same as "grep x file.txt"
$ grep -E 'x+' file.txt
$ # sloppily look for only British spelling
$ grep -E 'colou+r' file.txt
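
To see how Star and Plus differ on the spelling example, this sketch counts matching lines in a tiny made-up word list. The helper names words and count_matches are assumptions for illustration, not part of any tool.

```shell
# Hypothetical sample data: one word per line.
words() {
  printf '%s\n' color colour colouur colr
}

# Count the lines that match the extended RegEx given as $1.
count_matches() {
  words | grep -cE "$1"
}

count_matches 'colou*r'   # 3: color, colour, colouur
count_matches 'colou+r'   # 2: colour, colouur
```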

Plus sed Examples

$ # search and replace the first one or more r
$ echo fred | sed -re 's/r+/x/'
fxed
$ # search and replace all one or more r
$ echo fred | sed -re 's/r+/x/g'
fxed
$ # search and replace all one or more r
$ echo anke | sed -re 's/r+/x/g'
anke
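
Putting Star and Plus side by side may help here. This sketch runs both substitutions on the same word; star_vs_plus is a made-up helper name. Star is happy with zero r at the very start of the line, while Plus insists on a real r.

```shell
# Hypothetical helper: apply the Star and the Plus substitution to one word.
star_vs_plus() {
  star=$(printf '%s\n' "$1" | sed -E 's/r*/x/')
  plus=$(printf '%s\n' "$1" | sed -E 's/r+/x/')
  printf '%s: star=%s plus=%s\n' "$1" "$star" "$plus"
}

star_vs_plus fred   # fred: star=xfred plus=fxed
star_vs_plus anke   # anke: star=xanke plus=anke
```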

RegEx Variants

There are multiple RegEx languages

Extended RegEx - man 7 regex

Basic RegEx - man 7 regex

Perl Compatible Regex ( PCRE ) - man perlre

Fred’s House of RegEx ( FHRegEx: pronounced fregex )

RegEx Variant Usage

For command line and *NIX tools use extended where possible

If extended not available, check man page :)

For programming languages use PCRE or native matching
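
One reason to reach for extended RegEx: in basic RegEx a bare + is just a plus sign. A minimal sketch, assuming a made-up two-line sample produced by the hypothetical helper sample:

```shell
# Hypothetical sample data: a run of a, and a literal "a+".
sample() {
  printf '%s\n' 'aaa' 'a+'
}

# Basic RegEx: + is an ordinary character, so only the literal "a+" matches.
sample | grep 'a+'
# Extended RegEx: + means one or more, so both lines match.
sample | grep -E 'a+'
```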

Symbols thus far

PCRE Star Catchup

$ # the same as "grep x file.txt"
$ grep -P 'xy*' file.txt
$ # find at least one x, still the same as "grep x file.txt"
$ grep -P 'xx*' file.txt
$ # use grep to cat the file
$ grep -P 'x*' file.txt
$ # sloppily also look for British spelling
$ grep -P 'colou*r' file.txt

Same as before

PCRE Plus Catchup

$ # search for x followed by at least one y
$ grep -P 'xy+' file.txt
$ # find at least one x, still the same as "grep x file.txt"
$ grep -P 'x+' file.txt
$ # sloppily look for British spelling
$ grep -P 'colou+r' file.txt

Same as before

Meet Dot

   /|
  / |
 /__|______
|   Dot    |
|  __  __  |
| |  ||  | |
| |  ||  | |
| |__||__| |      ----------
|  __  __()|     < hi I'm Dot>
| |  ||  | |      ----------
| |  ||  | |   /
| |__||__| |  /
|__________| .

Still Single After All These Years

. == any single character

Dot is a wild card

Dot matches any single character except line breaks

Plus and Star match whatever comes before them, dot matches in place

x.+ == x followed by one or more characters

y.+ == y followed by one or more characters

Works the same in extended, PCRE and basic RegEx

Dot grep Examples

$ # find at least one x, still the same as "grep x file.txt"
$ grep -E 'x.*' file.txt
$ # search for x followed by at least one other character
$ grep -E 'x.+' file.txt
$ # find Fred-based names ( Freddy, Fredericka, etc. ), but not Fred
$ grep -E 'Fred.+' names.txt

Dot sed Examples

$ # replace r and all chars after it with x
$ echo fred | sed -re 's/r.+/x/'
fx
$ # replace r and all chars before it with x
$ echo fred | sed -re 's/.+r/x/'
xed
$ # replace f followed by any 2 characters with x
$ echo fred | sed -re 's/f../x/'
xd

Dot Notes

Repeated Dot ( .., .* or .+ ) doesn’t require matches to be the same character

Plus and Star are greedy and will match everything they can

Plus and Star combined with Dot matches everything

$ # show all lines in the file
$ grep '.*' file.txt
$ # show all lines in the file that have at least one character
$ grep -E '.+' file.txt
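
Greedy matching is easy to underestimate, so here is a small sketch of it. The sentence and the helper name greedy_demo are made up; the point is that .+ runs from the first double quote all the way to the last one.

```shell
# Hypothetical demo: replace a double-quoted stretch with X.
greedy_demo() {
  printf '%s\n' 'say "hi" and "bye" now' | sed -E 's/".+"/X/'
}

greedy_demo
# prints: say X now
```

Both quoted words disappear because Dot also matches the quotes in the middle.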

Dot vs. ! Dot

Unless escaped, a period is a dot

grep -i '2018.*.jpg' /var/mail/account # also matches 2018_fred_jpg.png

Use \. to match a period

grep -i '2018.*\.jpg' /var/mail/account # require a period
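
The difference is easy to check on two made-up file names. In this sketch the helper name names is an assumption, and grep -c only counts the matching lines.

```shell
# Hypothetical sample data: one real .jpg and one impostor.
names() {
  printf '%s\n' '2018_fred.jpg' '2018_fred_jpg.png'
}

names | grep -ci '2018.*.jpg'    # 2: the unescaped dot also accepts "_"
names | grep -ci '2018.*\.jpg'   # 1: only the name with a real period
```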

Meet Single Character Quote

   /|
  / |
 /__|______
|   SCQ    |
|  __  __  |
| |  ||  | |
| |  ||  | |
| |__||__| |      ----------
|  __  __()|     < hi I'm SCQ>
| |  ||  | |      ----------
| |  ||  | |   /
| |__||__| |  /
|__________| \

Help Others Shine Through

Backslash quotes whatever comes right after it

\ == quote the next character, which won’t be interpreted as a special character

\. == period, not dot

$ # find files that end in '.jpg'
$ find ~/Images/ | grep '\.jpg$'
$ # find lines that have a plus symbol in them
$ grep -E '\+' math.txt

Collection Discount

   /|
  / |
 /__|______
|Collection|
|  __  __  |
| |  ||  | |
| |  ||  | |
| |__||__| |      -------------------
|  __  __()|     < hi I'm Collection >
| |  ||  | |      -------------------
| |  ||  | |   /
| |__||__| |  /
|__________| [ ]

Collections Are Square

Surround the collections with square brackets, aka bracket expression

[aeiou] == any lower case English full vowel

$ echo abcdefhij | sed -re 's/[aeiou]/./g'
.bcd.fh.j
$ echo abcdefhij | sed -re 's/[a1b2c3]/./g'
...defhij

Home on the Range

   /|
  / |
 /__|______
|  Range   |
|  __  __  |
| |  ||  | |
| |  ||  | |
| |__||__| |      ------------------
|  __  __()|     < hi I make a range>
| |  ||  | |      ------------------
| |  ||  | |   /
| |__||__| |  /
|__________| -

Ranges in Collections

A range can be specified inside a collection

$ echo abcdefhij | sed -re 's/[a-e]/./g'
.....fhij
$ echo 1234567890 | sed -re 's/[1-9]/./g'
.........0
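
A range is handy for masking, for example. This is a minimal sketch with a made-up helper mask_digits and a made-up phone number; LC_ALL=C is set on the assumption that ranges should follow plain ASCII order rather than the local collation.

```shell
# Hypothetical helper: replace every digit on stdin with #.
mask_digits() {
  LC_ALL=C sed -E 's/[0-9]/#/g'
}

printf '%s\n' 'call 555-0100' | mask_digits
# prints: call ###-####
```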

Build Some Character Classes

   /|
  / |
 /__|______
|Char Class|
|  __  __  |
| |  ||  | |
| |  ||  | |
| |__||__| |       -----------------------
|  __  __()|      < hi I'm Character Class>
| |  ||  | |       -----------------------
| |  ||  | |    /
| |__||__| |   /
|__________| [: :]

Nethack Builds Character

Not these character classes
$ echo abcdefhij | sed -re 's/[[:ranger:][:mage:][:thief:]]/./g'
sed: -e expression #1, char 18: Invalid character class name

Character Builds Collection

Character classes can be used inside collections

$ echo abcdefhij | sed -re 's/[[:alpha:]]/./g'
.........
$ echo CiHyFr82oap3 | sed -re 's/[[:lower:]]/./g'
C.H.F.82...3
$ echo CiHyFr82oap3 | sed -re 's/[[:digit:]]/./g'
CiHyFr..oap.
$ echo CiHyFr82oap3 | sed -re 's/[[:alnum:]]/./g'
............

Earlier Examples

$ ip addr list | grep -E '[12]{,1}[[:digit:]]{1,2}\.'
    inet 127.0.0.1/8 scope host lo
    inet 10.0.136.18/21 brd 10.0.143.255 scope global dynamic wlan0
$
$ ip addr list | grep -E '([12]{,1}[[:digit:]]{1,2}\.[12]{,1}[[:digit:]]{1,2}\.[12]{,1}[[:digit:]]{1,2}\.[12]{,1}[[:digit:]]{1,2})'
$ ip addr list | grep -E '(([12]{,1}[[:digit:]]{1,2})\.){3}([12]{,1}[[:digit:]]{1,2})'
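
The grouped pattern can be tried without a network interface. This sketch feeds canned lines (made up, in the style of ip addr output) through the same idea; sample_output, octet and count_ipv4_lines are hypothetical names. It writes {0,1} instead of {,1}, on the assumption that the script should also run on a grep without the GNU shorthand. The pattern stays sloppy: it would also accept 299.299.299.299.

```shell
# Hypothetical canned input in the style of "ip addr list".
sample_output() {
  printf '%s\n' \
    '    inet 127.0.0.1/8 scope host lo' \
    '    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff' \
    '    inet 10.0.136.18/21 brd 10.0.143.255 scope global dynamic wlan0'
}

# One octet: an optional leading 1 or 2, then one or two digits.
octet='[12]{0,1}[[:digit:]]{1,2}'

# Count the lines that contain something shaped like an IPv4 address.
count_ipv4_lines() {
  sample_output | grep -cE "(${octet}\.){3}${octet}"
}

count_ipv4_lines   # 2: the two inet lines, not the MAC address line
```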

Cast of Characters

[:alpha:] == localized alphabet

[:digit:] == 0-9

[:alnum:] == localized alphabet and 0-9

[:blank:] == space, tab

[:punct:] == any printable character which is not a blank or an alnum

[:cntrl:] == control character

$ man 7 regex

Character is One

$ echo CiHyFr82oap3 | sed -re 's/[CiH[:digit:]]/./g'
...yFr..oap.
$ echo CiHyFr82oap3 | sed -re 's/[C[:lower:][:digit:]]/./g'
..H.F.......

Not Your Parent’s RegEx

^ at the beginning of a collection means not

[^a] == any single character except a
$ # find Fred-based names ( Freddy, Fredericka, etc. ), but not Fred
$ grep -E 'Fred[^ ]+' names.txt
$ # find Fred-based names ( Freddy, Fredericka, etc. ), but not Fred
$ grep -E 'Fred[^[:blank:]]+' names.txt
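
Here is why the negated collection beats Dot for this job, sketched on a made-up name list (the helper name names is an assumption). Dot also accepts the space in "Fred Smith"; [^ ] does not.

```shell
# Hypothetical sample data: one name per line.
names() {
  printf '%s\n' 'Fred' 'Freddy' 'Fred Smith' 'Fredericka'
}

names | grep -cE 'Fred.+'      # 3: Freddy, Fred Smith, Fredericka
names | grep -cE 'Fred[^ ]+'   # 2: Freddy, Fredericka
```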

Or

Branching allows matching this or the other

| == branch

$ echo fred | grep -E 'fred|anke'
fred
$ echo anke | grep -E 'fred|anke'
anke
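
A branch matches anywhere in the line, which can surprise. A small sketch on a made-up list (the helper name names is an assumption); grep's -x option asks for whole-line matches only.

```shell
# Hypothetical sample data: one name per line.
names() {
  printf '%s\n' fred anke hans alfred
}

names | grep -cE 'fred|anke'    # 3: fred, anke and also alfred
names | grep -cxE 'fred|anke'   # 2: only lines that are exactly fred or anke
```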

Group Discount

A group can compartmentalize matches for future reference, aka atom

$ echo fred | sed -re 's/(.*)/\1 \1 \1/'
fred fred fred
$ echo fred anke | sed -re 's/(.*) (.*)/\2 \1/'
anke fred
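
Groups pay off when reordering fields. This sketch turns a made-up ISO-style date around; reorder_date is a hypothetical helper name, and \1, \2 and \3 refer to the three groups in order.

```shell
# Hypothetical helper: turn YYYY-MM-DD into DD.MM.YYYY.
reorder_date() {
  sed -E 's/([[:digit:]]+)-([[:digit:]]+)-([[:digit:]]+)/\3.\2.\1/'
}

printf '%s\n' '2018-03-25' | reorder_date
# prints: 25.03.2018
```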

Anchors Aweigh

^ == beginning of line, when it is outside a collection and at the beginning of the RegEx

$ == end of line, when it is outside a collection and at the end of the RegEx

^$ == empty line

^[[:blank:]]*$ == empty line or line with just spaces and tabs
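
The last pattern makes a handy filter. A minimal sketch, with the made-up helper name drop_blank_lines; grep's -v option prints the lines that do not match.

```shell
# Hypothetical helper: drop empty lines and lines holding only spaces or tabs.
drop_blank_lines() {
  grep -vE '^[[:blank:]]*$'
}

# Input: "one", an empty line, a line of two spaces and a tab, "two".
printf 'one\n\n  \t\ntwo\n' | drop_blank_lines
# prints "one" and "two", each on its own line
```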

Rooting out Root

$ grep -E root /etc/passwd
root:x:0:0:root:/root:/bin/bash
$ grep -Ec bin /etc/passwd
46
$ grep -E ^bin /etc/passwd
bin:x:2:2:bin:/bin:/usr/sbin/nologin

Max and Minitz

Use curly braces and a comma to set a minimum or maximum number of repetitions

$ echo ddd | sed -re 's/d{1,2}/q/'
qd
$ echo ddd | sed -re 's/d{1}/q/'
qdd
$ echo ddd | sed -re 's/d{1,}/q/'
q
$ # less sloppily also look for British spelling
$ grep -E 'colou{,1}r' file.txt
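
Curly braces combine nicely with anchors and collections. This sketch checks made-up phone-style strings for exactly three digits, a dash and four digits; the helper names numbers and spellings are assumptions. The {,1} form is a GNU extension, so the sketch spells it {0,1}.

```shell
# Hypothetical sample data.
numbers() {
  printf '%s\n' '555-0100' '55-0100' '5555-0100' '555-010'
}
spellings() {
  printf '%s\n' color colour colouur
}

# Exactly three digits, a dash, exactly four digits, nothing else.
numbers | grep -E '^[[:digit:]]{3}-[[:digit:]]{4}$'
# prints: 555-0100

# Zero or one u, so colouur no longer slips through.
spellings | grep -cE 'colou{0,1}r'   # 2
```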

ASCII GNU

                    ,-----._
  .            .  ,'        `-.__,------._
 //          __\\'                        `-.
((    _____-'___))                           |
 `:='/     (alf_/                            |
 `.=|      |='                               |
    |)   O |                                  \
    |      |                               /\  \
    |     /                          .    /  \  \
    |    .-..__            ___   .--' \  |\   \  |
   |o o  |     ``--.___.  /   `-'      \  \\   \ |
    `--''        '  .' / /             |  | |   | \
                 |  | / /              |  | |   mmm
                 |  ||  |              | /| |
                 ( .' \ \              || | |
                 | |   \ \            // / /
                 | |    \ \          || |_|
                /  |    |_/         /_|
               /__/

Contacting Hans

Thank you!

Resources

Credits