[tclug-list] grepping tricks

Tue Nov 16 07:53:56 CST 2010

If your patterns are string literals (no regex metacharacters), you might shave off a bit more time by using "grep -F".
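
A small illustration of what -F changes, as a sketch on a made-up three-line file (the strings are hypothetical, not taken from the real data). With -F the pattern is matched as a fixed string, so a dot only matches a literal dot:

```shell
# Hypothetical sample data, created only for this illustration.
sample=$(mktemp)
printf '%s\n' 'rs1.2 x' 'rs1x2 y' 'rs9 z' > "$sample"

# Without -F the dot is a regex metacharacter and matches any character,
# so both "rs1.2" and "rs1x2" are counted.
regex_hits=$(grep -c 'rs1.2' "$sample")

# With -F the pattern is a literal string, so only "rs1.2" is counted.
fixed_hits=$(grep -cF 'rs1.2' "$sample")

echo "regex: $regex_hits, fixed: $fixed_hits"
rm -f "$sample"
```

Whether -F is measurably faster on a given file is something to time rather than assume.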

-----Original Message-----
From: Mike Miller <mbmiller+l at gmail.com>
Sender: tclug-list-bounces at mn-linux.org
Date: Tue, 16 Nov 2010 01:50:36 
To: TCLUG List<tclug-list at mn-linux.org>
Reply-To: TCLUG Mailing List <tclug-list at mn-linux.org>
Subject: [tclug-list] grepping tricks

I thought some of you would be interested in this.  See the last two 
paragraphs for the take-home messages.

I have a file that is 29 million lines long and about 3 GB total size.  I 
need to repeatedly grep lines from the file that match a certain kind of 
pattern.  Specifically, for strings A and B, I want to find every line 
that contains either both A and B, or two copies of A or two copies of B. 
Any other strings could be found before, after or between the two critical 
strings on the line.

This is the obvious method:

grep -E "(A|B).*(A|B)" data_file

or

grep -E "A.*B|B.*A|A.*A|B.*B" data_file

These two approaches take the same amount of time:

real    2m38.732s
user    2m38.126s
sys     0m0.591s

The structure of the data is such that A and B occur about 23,000 times 
each and the grep command above puts out about 10 lines.  This makes me 
think that I might get a faster result by doing a simpler grep on the big 
file and following that with the grep above.

grep -E "A|B" data_file | grep -E "(A|B).*(A|B)"

Yes, it saves about 30 seconds:

real    2m9.820s
user    2m9.595s
sys     0m0.570s
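
A quick check, on made-up data, that the two-stage pipeline prints the same lines as the single-pass command. The literal strings A and B stand in for the real search strings, and the sample file is hypothetical:

```shell
# Made-up sample; "A" and "B" stand in for the real search strings.
sample=$(mktemp)
printf '%s\n' 'x A y B z' 'B only' 'A then A' 'no match' 'B B' 'A' > "$sample"

# Single pass with the full regex.
single=$(grep -E '(A|B).*(A|B)' "$sample")

# Cheap prefilter first, full regex only on the surviving lines.
staged=$(grep -E 'A|B' "$sample" | grep -E '(A|B).*(A|B)')

[ "$single" = "$staged" ] && echo "same output"
rm -f "$sample"
```

The prefilter keeps line order and never drops a line the full regex would match, so the two outputs should agree on any input.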

But grep -E is fairly slow so maybe I can do better by dropping the -E on 
the first grep like this:

( grep A data_file ; grep B data_file ) | grep -E "(A|B).*(A|B)"

But that reorders the lines and can repeat lines.  The input file was 
already sorted with all unique lines, so adding "sort -u" like this will 
give the same output as the commands considered earlier:

( grep A data_file ; grep B data_file ) | grep -E "(A|B).*(A|B)" | sort -u

real    0m7.573s
user    0m6.659s
sys     0m1.243s
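
To see why "sort -u" is needed, here is a sketch on a tiny made-up file that is already sorted and unique, like the real one. The file contents and variable names are assumptions for illustration only:

```shell
# Sorted, unique sample input; the strings are made up.
sample=$(mktemp)
printf '%s\n' 'A 1 B' 'A 2' 'B 3 A' 'B 4' 'C 5' > "$sample"

# Concatenating two greps lists every A line first, then every B line,
# so a line containing both strings shows up twice.
unsorted=$( ( grep A "$sample" ; grep B "$sample" ) | grep -E '(A|B).*(A|B)' )

# sort -u restores sorted order and drops the repeats.
# LC_ALL=C pins the collation so the result does not depend on the locale.
sorted=$(printf '%s\n' "$unsorted" | LC_ALL=C sort -u)

# Reference result from the single-pass command.
single=$(grep -E '(A|B).*(A|B)' "$sample")

[ "$sorted" = "$single" ] && echo "same output after sort -u"
rm -f "$sample"
```

This only reproduces the original output when the input file was sorted with the same collation that sort uses here.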

That takes about 94% off the previously fastest method, but I can shave 
off another third by using "tee" to run the two grep jobs in parallel:

( < data_file tee >(grep A >| /tmp/file1) >(grep B >| /tmp/file2) > /dev/null ; grep -hE '(A|B).*(A|B)' /tmp/file[12] | sort -u ; rm /tmp/file[12] )

real    0m4.712s
user    0m7.244s
sys     0m3.738s

Using these tricks I have reduced the processing time from 158 seconds to 
4.7 seconds.  I'll write a script that does this for me.
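
A minimal sketch of what such a script could look like. It is not the tee command above: it swaps tee and process substitution for two background jobs plus "wait", which reads the file twice but makes sure both prefilters have finished before the final grep starts (bash does not wait for process substitutions on its own). The function name pairgrep is hypothetical, the timing on the 3 GB file is untested, and the sketch assumes A and B contain no regex metacharacters:

```shell
# pairgrep A B FILE: print the lines of FILE that contain both A and B,
# or two copies of A, or two copies of B. (Hypothetical helper name.)
# Assumes A and B contain no regex metacharacters.
pairgrep() {
    a=$1 b=$2 file=$3
    tmpdir=$(mktemp -d) || return 1

    # Run the two cheap prefilters in parallel, then wait for both.
    grep -e "$a" "$file" > "$tmpdir/a" &
    grep -e "$b" "$file" > "$tmpdir/b" &
    wait

    # Apply the expensive regex only to the surviving lines;
    # -h suppresses the file-name prefix, sort -u drops repeated lines.
    grep -hE "($a|$b).*($a|$b)" "$tmpdir/a" "$tmpdir/b" | sort -u
    rm -rf "$tmpdir"
}
```

It would be called as: pairgrep A B data_file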

So the finding here that might be useful in many situations is that when 
searching for a regexp in a big file, you might do much better to filter 
lines of the big file with a simpler, more inclusive grep, then do the 
regexp search on the stdout from the simple grep.

Mike

_______________________________________________
TCLUG Mailing List - Minneapolis/St. Paul, Minnesota
tclug-list at mn-linux.org
http://mailman.mn-linux.org/mailman/listinfo/tclug-list