    G-AVLN in front of her home

    Mostly Unix and Linux topics. But flying might get a mention too.

    Tuesday, September 06, 2005

    grep with -E option

    One of the most confusing things newbies face when learning Unix is having to grasp regular expressions. Having gone through the pains of learning globbing (wildcards) and the other shell specials, only a couple of days later we throw them all into the air (characters, that is - not the delegates) and try to catch them again!

    It's interesting to watch the delegates try to rationalise it all... What usually finishes them off is the counting mechanism, whereby one can specify how many instances of a character or a pattern are expected.

    For example, locating lines (in datafile) with at least four characters can be done with grep simply as follows (remember, in regular expressions a dot means any one character):

    $ grep '....' datafile

    But if you want to locate lines containing exactly four characters, the grep command gets a bit more complex. The counting mechanism needs to be 'switched on' with the use of the backslash.

    $ grep '^.\{4\}$' datafile
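To see the difference between the two patterns side by side, here is a small sketch. The sample file and its contents are made up for illustration (they are not from the course material), and mktemp is assumed to be available, as it is on most Linux systems.

```shell
# Hypothetical sample data: lines of three, four, five and four characters.
datafile=$(mktemp)
printf '%s\n' 'abc' 'abcd' 'abcde' 'four' > "$datafile"

# At least four characters: four dots can match anywhere in the line.
at_least=$(grep '....' "$datafile")

# Exactly four characters: anchored at both ends, with the count
# switched on by the backslashes.
exactly=$(grep '^.\{4\}$' "$datafile")

printf 'at least four:\n%s\n' "$at_least"
printf 'exactly four:\n%s\n' "$exactly"

rm -f "$datafile"
```

The first pattern should pick up every line except abc, while the anchored one keeps only abcd and four.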

    The delegates are by this stage happy (ish) with the 'anchoring' characters (^ and $) and know that this means tying the pattern to the beginning and end of line, respectively. What freaks them out is that \{ \} notation.

    I like using it, because it allows me to complete the explanation of the backslash usage in Unix - you use it to toggle the special meaning of a character. We normally use it to turn the special meaning of a character off. In the counting notation of grep, we use it for the exact opposite purpose. Here, it means turn the special meaning of the curly brackets on (and make them count the character specified in front of them).
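A quick way to watch the backslash toggle in action is to run the same pattern with and without it. This is only a sketch: the sample lines are invented, a count of three is used so that the two results do not overlap, and the behaviour shown is what I would expect from GNU grep without the -E option.

```shell
# Hypothetical sample data, made up for illustration.
sample=$(mktemp)
printf '%s\n' 'abc' 'a{3}' 'abcd' > "$sample"

# Braces NOT escaped: { and } are ordinary characters here, so this
# matches one character followed by the literal text {3}.
literal=$(grep '^.{3}$' "$sample")

# Braces escaped: the backslash switches the counting meaning on,
# so this matches lines of exactly three characters.
counted=$(grep '^.\{3\}$' "$sample")

printf 'unescaped braces matched: %s\n' "$literal"
printf 'escaped braces matched:   %s\n' "$counted"

rm -f "$sample"
```

The unescaped pattern finds only the line that really contains curly brackets; the escaped one finds only the three-character line.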

    But I will have to find a different example to illustrate the use of the backslash; I just can't justify keeping quiet much longer about the -E option one can use with grep. This allows the use of the set of extended regular expressions, in which counting is available by default, and the { } notation is taken to mean counting without any additional measures...

    Therefore, the last example could be re-written with:

    $ grep -E '^.{4}$' datafile

    Much simpler, and less confusing...
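As a rough check that the two notations agree, the sketch below runs both against the same invented sample file (the file and its contents are assumptions for illustration only). It also shows why the closing $ anchor matters when the goal is exactly four characters.

```shell
# Hypothetical sample data, made up for illustration.
datafile=$(mktemp)
printf '%s\n' 'abc' 'abcd' 'abcde' 'four' > "$datafile"

# Basic regular expressions: counting needs the backslashes.
bre=$(grep '^.\{4\}$' "$datafile")

# Extended regular expressions (-E): the braces count as they are.
ere=$(grep -E '^.{4}$' "$datafile")

# Without the trailing $, the pattern only requires four characters
# at the start of the line, so longer lines match as well.
no_end_anchor=$(grep -E '^.{4}' "$datafile")

printf 'basic:\n%s\n' "$bre"
printf 'extended:\n%s\n' "$ere"
printf 'extended, no $ anchor:\n%s\n' "$no_end_anchor"

rm -f "$datafile"
```

Both anchored forms should report abcd and four; dropping the $ lets abcde through too.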

    Why the sudden change of heart? I had an e-mail from a delegate along the lines of: "Alina, did you know that..."

    1 comment:

    Clive said...

    GNU sed also has the -E option, but that is not in the SUS, unlike grep -E (better known as egrep).
    So, did you also explain grep -F ?

    grep -E, and Extended Regular Expressions, are covered in qA's Advanced Data Tools and Techniques course
