If your patterns are literal strings (no regex metacharacters), you might shave off a bit more by using "grep -F".

-----Original Message-----
From: Mike Miller <mbmiller+l at gmail.com>
Sender: tclug-list-bounces at mn-linux.org
Date: Tue, 16 Nov 2010 01:50:36
To: TCLUG List <tclug-list at mn-linux.org>
Reply-To: TCLUG Mailing List <tclug-list at mn-linux.org>
Subject: [tclug-list] grepping tricks

I thought some of you would be interested in this. See the last two paragraphs for the take-home messages.

I have a file that is 29 million lines long and about 3 GB in total size. I need to repeatedly grep lines from the file that match a certain kind of pattern. Specifically, for strings A and B, I want to find every line that contains either both A and B, or two copies of A, or two copies of B. Any other strings could be found before, after or between the two critical strings on the line.

This is the obvious method:

    grep -E "(A|B).*(A|B)" data_file

or

    grep -E "A.*B|B.*A|A.*A|B.*B" data_file

These two approaches take the same amount of time:

    real    2m38.732s
    user    2m38.126s
    sys     0m0.591s

The structure of the data is such that A and B occur about 23,000 times each and the grep command above puts out about 10 lines. This makes me think that I might get a faster result by doing a simpler grep on the big file and following that with the grep above.

    grep -E "A|B" data_file | grep -E "(A|B).*(A|B)"

Yes, it saves about 30 seconds:

    real    2m9.820s
    user    2m9.595s
    sys     0m0.570s

But grep -E is fairly slow so maybe I can do better by dropping the -E on the first grep like this:

    ( grep A data_file ; grep B data_file ) | grep -E "(A|B).*(A|B)"

But that reorders the lines and can repeat lines.
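A small-scale illustration of that reordering and repetition, as a sketch only: the five-line sample file and the function names below are made up for the demonstration, and the literal strings A and B stand in for the real patterns, which the post does not show.

```shell
#!/bin/sh
# Hypothetical sample data; the real 3 GB file is not available here.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' 'x A y B' 'only A here' 'B then B' 'neither' 'zzz B A' > "$sample"

# Single pass: matching lines come out in file order, each at most once.
single_pass() { grep -E '(A|B).*(A|B)' "$1"; }

# Two simple passes, concatenated: every A-match is emitted before any
# B-match, and a line holding both A and B is emitted twice.
two_pass() { { grep A "$1"; grep B "$1"; } | grep -E '(A|B).*(A|B)'; }

single_pass "$sample"
echo '--'
two_pass "$sample"
```

On this sample the single pass yields three lines and the two-pass version yields five, because the two lines that contain both A and B are each reported once per pre-filter.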
The input file was already sorted with all unique lines, so adding "sort -u" like this will give the same output as the commands considered earlier:

    ( grep A data_file ; grep B data_file ) | grep -E "(A|B).*(A|B)" | sort -u

    real    0m7.573s
    user    0m6.659s
    sys     0m1.243s

That takes about 94% off the previously fastest method, but I can shave off another third by using "tee" to run the two grep jobs in parallel:

    ( < data_file tee >(grep A >| /tmp/file1) >(grep B >| /tmp/file2) > /dev/null ; grep -hE '(A|B).*(A|B)' /tmp/file[12] | sort -u ; rm /tmp/file[12] )

    real    0m4.712s
    user    0m7.244s
    sys     0m3.738s

Using these tricks I have reduced the processing time from 158 seconds to 4.7 seconds. I'll write a script that does this for me.

So the finding here that might be useful in many situations is that when searching for a regexp in a big file, you might do much better to filter lines of the big file with a simpler, more inclusive grep, then do the regexp search on the stdout from the simple grep.

Mike

_______________________________________________
TCLUG Mailing List - Minneapolis/St. Paul, Minnesota
tclug-list at mn-linux.org
http://mailman.mn-linux.org/mailman/listinfo/tclug-list
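The post mentions wrapping this in a script but does not show one. The following is one possible sketch, not the author's script. The name pairgrep is hypothetical. It uses "grep -F" for the pre-filter, as suggested in the reply, and it assumes A and B are literal strings with no regex metacharacters and that the input is sorted with unique lines. It also swaps the tee and >(...) construction for two background jobs plus "wait", since bash does not necessarily wait for process substitutions to finish before running the next command. The trade-off is that each grep opens the file itself instead of sharing one read through tee, so timings may differ from those in the post.

```shell
#!/bin/sh
# pairgrep: hypothetical helper name, not from the original post.
# Usage: pairgrep A B data_file
# Assumes A and B contain no regex metacharacters, because they are also
# spliced into the extended regex used in the second stage.
pairgrep() {
    a=$1 b=$2 file=$3
    tmpdir=$(mktemp -d) || return 1

    # Stage 1: two cheap fixed-string pre-filters running in parallel.
    grep -F -e "$a" "$file" > "$tmpdir/a" &
    grep -F -e "$b" "$file" > "$tmpdir/b" &
    wait    # both temporary files are complete past this point

    # Stage 2: run the expensive regex only on the pre-filtered lines.
    # sort -u removes the repeats and restores sorted order.
    grep -hE "($a|$b).*($a|$b)" "$tmpdir/a" "$tmpdir/b" | sort -u

    rm -rf "$tmpdir"
}
```

It would be called as "pairgrep A B data_file". Whether it matches the 4.7-second figure would have to be measured on the real data.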