[tclug-list] searchable email archiving

Fri Aug 24 09:18:26 CDT 2007

> I need a way to search through old email messages quickly and
> efficiently.
> I would use this for listserv archives and, I hope, for personal
> email.

Perl is very fast at matching arbitrary expressions.

> For example, if I want to find every message where
> "jones" (case insensitive) is found in the cc field and "linux" is
> found in the message body, will they allow for that?

Let's assume every message contains a "Subject:" line and, when present,
a "CC:" line, each terminated by a new line (\n).  Anything following
that would be message body until the next message, which starts with the
"Date:" string.  Because the s option lets . match \n, [^\n]* is used to
keep "jones" on the CC line itself.  Something like

    =~ m/Subject:[^\n]*\n.*?CC:[^\n]*jones[^\n]*\n.*?linux.*?Date:/gis

might work.  Other, more interesting, patterns might be "jones" and
"linux" within 250 characters of each other.

    =~ m/jones.{0,250}linux/gis

That gets only half of them; we need another match for "linux"
preceding "jones".
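One way to make both searches less fragile is to split the archive into
messages first, then test the Cc header and the body separately.  The
following is only a sketch under assumptions that are not from the
original question: the archive is in mbox format (each message starts
with a "From " line), line endings are plain \n, and the sub names
split_messages, cc_and_body_match and near_each_other are made up for
illustration.  The sample data is invented.

```perl
use strict;
use warnings;

# Split mbox-style text into messages.  Assumption: each message starts
# with a "From " separator line, as in the traditional mbox format.
sub split_messages {
    my ($text) = @_;
    return grep { length } split /^(?=From )/m, $text;
}

# True if the Cc header contains $cc_word and the body contains
# $body_word, both case-insensitive.
sub cc_and_body_match {
    my ($msg, $cc_word, $body_word) = @_;
    # Headers end at the first blank line; everything after is body.
    my ($head, $body) = split /\n\n/, $msg, 2;
    return 0 unless defined $body;
    # Unfold continuation lines so a wrapped Cc header becomes one line.
    $head =~ s/\n[ \t]+/ /g;
    my ($cc) = $head =~ /^Cc:(.*)$/mi;
    return 0 unless defined $cc;
    return ($cc =~ /\Q$cc_word\E/i && $body =~ /\Q$body_word\E/i) ? 1 : 0;
}

# True if $x and $y occur within $max characters of each other,
# in either order.  The s flag lets . cross line breaks.
sub near_each_other {
    my ($text, $x, $y, $max) = @_;
    return $text =~ /\Q$x\E.{0,$max}\Q$y\E|\Q$y\E.{0,$max}\Q$x\E/is ? 1 : 0;
}

# Small demonstration on made-up sample data.
my $sample = join '',
    "From a\@example.org Fri Aug 24 09:00:00 2007\n",
    "Subject: kernel\n",
    "Cc: Mike Jones <mj\@example.org>\n",
    "\n",
    "I run Linux at home.\n",
    "From b\@example.org Fri Aug 24 10:00:00 2007\n",
    "Subject: jones\n",
    "\n",
    "Nothing about linux in a Cc field here.\n";

for my $msg (split_messages($sample)) {
    my ($subject) = $msg =~ /^Subject:\s*(.*)$/m;
    print "match: $subject\n" if cc_and_body_match($msg, 'jones', 'linux');
}
```

Working per message keeps a match from running across a message
boundary, which a single regex over the whole file can do.  The
alternation in near_each_other covers both orders in one pass.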

> Suppose you are searching for "Mike Jones" and your message happens to
> look like this:

>     You really ought to talk to Mike
>     Jones about that issue.

> Well, with Mike and Jones on two different lines, it won't match. 
> We need something that allows us to handle the newline appropriately.

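I've handled this in perl by reading the multi-line text into a single
variable and matching against that.  \s already matches a new line, so
requiring whitespace between the two words covers the line break.  (The
s option only changes what . matches, so that . also matches \n) - see
http://perldoc.perl.org/perlreref.html

    =~ m/Mike\s+Jones/gi

should do the trick.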
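The same idea can be generalized to any phrase by building the pattern
from its words.  This is a sketch, not a tested tool; phrase_regex is a
made-up helper name, and it assumes the message text has already been
read into one string.

```perl
use strict;
use warnings;

# Build a case-insensitive regex from a phrase, letting any run of
# whitespace (spaces, tabs, line breaks) separate the words.
sub phrase_regex {
    my ($phrase) = @_;
    my $pattern = join '\s+', map { quotemeta($_) } split ' ', $phrase;
    return qr/$pattern/i;
}

my $body = "You really ought to talk to Mike\nJones about that issue.\n";
my $re   = phrase_regex('Mike Jones');
print "found\n" if $body =~ $re;
```

In quoted replies the line break is followed by "> ", which \s+ alone
does not skip, so quoted text may need the quote markers stripped first.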