On Mon, 9 Apr 2012, Gerry wrote:

> TL;DR mawk is fast. gawk is not.
>
> Then you might be interested in this:
> http://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/
>
> (apologies if you've already seen it)


I hadn't seen it and I'm looking forward to studying it.  I do a lot of 
work that's exactly like what that author does -- parsing multi-gigabyte 
files with awk or some such tool.  Speed becomes a huge issue at that 
scale.

Related to this speed issue -- this reminds me of a cool trick I learned 
during the past year.  I actually learned it on this list.  Suppose you 
have a giant file with 10 million lines in which the word STRING probably 
appears on about 10 lines and you want to find those lines.  You could do 
this:

grep -w STRING file

But that is slow.  This is fast, but it doesn't match only words:

grep -F STRING file

That might include lines like "fooSTRINGbaz", which we don't want.  But 
suppose the grep -F returns only 1000 or 10000 lines -- that's a big step 
in the right direction, because all the lines I want are among them.  So, 
in most cases, this is the fast way to do the job:

grep -F STRING file | grep -w STRING

First do the fast grep to reduce the number of lines piped into the 
slower but more precise word grep.  Because -F searches for a fixed 
string with no regular-expression machinery, it can scan the big file 
much faster; the word grep then only has to examine the few surviving 
lines.  The result will always be the same as if grep -w alone had been 
used.
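To convince yourself the pipeline really is equivalent, here's a minimal 
sketch -- the sample file and its contents are made up for illustration:

```shell
# Build a tiny sample file (hypothetical data) and check that the
# two-stage pipeline gives the same lines as plain grep -w.
set -e
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
a STRING here
fooSTRINGbaz
no match at all
STRING at the start
EOF

slow=$(grep -w STRING "$tmp")                   # precise but slow on big files
fast=$(grep -F STRING "$tmp" | grep -w STRING)  # fast prefilter, then precise

[ "$slow" = "$fast" ] && echo "identical"
rm -f "$tmp"
```

The -w grep drops "fooSTRINGbaz" in both cases, so the two results match.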

I often have a file with a list of words that I want to grep out of 
another file.  Suppose the list of words is in a file called words.txt, 
then this will work, but slowly...

grep -wf words.txt file

...and this will give the fast result:

grep -Ff words.txt file | grep -wf words.txt
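The same equivalence holds for the word-list form.  A quick sketch, with 
a made-up words.txt and data file:

```shell
# Hypothetical word list and data file; verify the two-stage word-list
# grep matches the plain grep -wf output.
set -e
dir=$(mktemp -d)
printf '%s\n' cat dog > "$dir/words.txt"
cat > "$dir/file" <<'EOF'
the cat sat
concatenate this
a dog barked
endogenous variables
EOF

slow=$(grep -wf "$dir/words.txt" "$dir/file")
fast=$(grep -Ff "$dir/words.txt" "$dir/file" | grep -wf "$dir/words.txt")

[ "$slow" = "$fast" ] && echo "identical"
rm -rf "$dir"
```

Here the -F pass lets through "concatenate" and "endogenous" (substring 
hits for cat and dog), and the -w pass throws them out again.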

Of course, it depends on your situation.  Sometimes the fast grep alone 
will do what you need.  Sometimes it won't help (e.g., if every line 
matches!).  But for me it has made a huge difference.

Mike