On Tue, 16 Nov 2010, Robert Nesius wrote:

> On Tue, Nov 16, 2010 at 1:50 AM, Mike Miller <mbmiller+l at gmail.com> wrote:
>
> I did enjoy reading that.  I think the real takeaway is that the more 
> complicated your regular expression, the longer it takes.  And by 
> complicated I mean the more wildcards and operators you have that induce 
> backtracking in the regular expression engine... Big logfiles make those 
> performance hits obvious.

Yep.  It's all about reducing the load on the slower regex grep.  Whether 
this helps will depend on the nature of the data, but for my data the 
initial, simple filter drops literally 99.9% of the file, reducing the 
load on grep -E by a factor of 1000.
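
To make that concrete, the pipeline looks something like this (the 
patterns here are just stand-ins for the real ones):

grep -F 'COMMON_STRING' data_file | grep -E '(foo|bar)[0-9]{4}' > hits.txt

The grep -F pass is cheap because it does plain fixed-string matching 
with no backtracking, so only the few surviving lines ever reach the 
regex engine.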


> Another trick I've learned with big logfiles is to load them into an SQL 
> database and then I can write searches as SQL queries.  Depending on the 
> DB you're using, you may have a nice little GUI that makes writing 
> queries and manipulating results very easy.  I can't take credit for 
> that one - but the first time I saw someone do that I thought "holy 
> cow.... why didn't I think of that?"  Of course it helps if your log 
> files are in CSV format or something similar so you can slam everything 
> into the right column easily.


Yep -- I also thought of that, and I'll bet it is the right way to work 
with these data.  Right now I'm just playing around with this file. 
Later on I'll have to recreate it.  After that I probably will make it 
into a DB.  It's about 29 million records and only 14 fields.  It uses 
spaces to align the columns, but it is trivial to make it into a 
tab-delimited file, e.g.,

perl -pe 's/^ +// ; s/ +/\t/g' data_file > data_file.tab.txt

The only reason I haven't already tried the DB route is that I don't know 
how to do it.  One of my coworkers can do it, so I'll try that later.
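
From what I've read, sqlite3 should be able to pull in a tab-delimited 
file like that pretty directly -- something along these lines (column 
names are just placeholders, and I haven't tried it on the real file):

sqlite3 mydata.db
sqlite> CREATE TABLE records (c1, c2, c3, c4, c5, c6, c7,
   ...>                       c8, c9, c10, c11, c12, c13, c14);
sqlite> .mode tabs
sqlite> .import data_file.tab.txt records
sqlite> SELECT count(*) FROM records;

After that the searches become ordinary SQL queries instead of grep 
pipelines.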

Mike