This is close to what geneticists do when they're computing global 
sequence alignments. That's usually done as a dynamic programming 
problem, and it scales poorly. I think it can also be framed as a 
shortest path through a directed edit graph, but the time and space 
complexity are the same as for the DP. The computation isn't too bad, 
but the space requirement is something like n*m*3*ptr_size, so for 
matching 200 characters against an 8GB file you'd be looking at roughly 
200 * 8*10^9 cells * 3 pointers * 8 bytes, on the order of 38 TB, which 
is big in my book :-)
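For concreteness, here's a minimal sketch of that DP (Needleman-Wunsch 
global alignment) in Python. The scoring values are illustrative, and it 
keeps only two rows at a time, so it computes the score in O(min(n,m)) 
space; recovering the actual alignment path is what forces you to keep 
the full table (or use Hirschberg's divide-and-conquer trick):

```python
def global_align_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score.

    Keeps two DP rows instead of the full n*m table, so memory is
    O(len(b)) rather than O(n*m). Traceback (the actual alignment)
    would need the full table or Hirschberg's algorithm.
    """
    # Row 0: aligning a prefix of b against the empty string costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * gap]  # first column: gaps against empty prefix of b
        for j, cb in enumerate(b, 1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]
```

With match=+1, mismatch=-1, gap=-1, identical strings score their 
length, and the classic textbook pair "GATTACA" / "GCATGCU" scores 0.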

If the characters are really random, I think you're screwed. If they come 
from a known source where you can calculate the frequencies of the 
characters, digraphs, trigraphs, etc., you could initially search the 
long string for the least likely occurrences of sequences from the shorter 
string, use those as candidate match points in the larger string, and 
limit further searching to one side or the other of each candidate.
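One way that anchoring idea might look, sketched in Python under a naive 
i.i.d. character model (the function name and the choice of k are 
hypothetical, and the frequencies here come from the long string itself):

```python
from collections import Counter
from math import log

def rarest_anchor(needle, haystack, k=4):
    """Pick the k-gram of `needle` least likely under the character
    frequencies of `haystack`, then return every position where that
    k-gram occurs in `haystack`. Each hit is a candidate match point
    to extend left and right from."""
    freq = Counter(haystack)
    total = len(haystack)

    def surprisal(gram):
        # -log probability under an i.i.d. character model; unseen
        # characters get a small pseudo-count of 0.5
        return -sum(log(freq.get(c, 0.5) / total) for c in gram)

    grams = [needle[i:i + k] for i in range(len(needle) - k + 1)]
    rare = max(grams, key=surprisal)

    hits, start = [], haystack.find(rare)
    while start != -1:
        hits.append(start)
        start = haystack.find(rare, start + 1)
    return rare, hits
```

The rarer the anchor, the fewer hits you have to check, which is the 
whole point: you trade one cheap scan for a much smaller set of 
positions worth running the expensive alignment around.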

Grab a book on algorithms and look for edit distance or edit graphs, or 
a book on genetic computation and look for global sequence alignments, 
for the gory details. There are several heuristic approaches that 
geneticists have developed to keep the problem manageable, at the risk 
of not finding the optimal match, that may work for your problem.
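The best-known of those heuristics is seed-and-extend (the idea behind 
BLAST): index short exact k-grams, then grow each hit outward into a 
maximal match. A toy sketch, assuming everything fits in memory (for a 
real 8GB file you'd want an on-disk index, not a Python dict), which 
also happens to find the "largest matching pieces first" that Jay asked 
about:

```python
from collections import defaultdict

def seed_and_extend(short, long_, k=8):
    """BLAST-style heuristic sketch: index every k-gram of the long
    string, then extend each exact seed hit left and right to a
    maximal exactly-matching run. Returns (short_pos, long_pos,
    length) triples, longest first."""
    index = defaultdict(list)
    for j in range(len(long_) - k + 1):
        index[long_[j:j + k]].append(j)

    found = set()  # seeds inside the same run collapse to one triple
    for i in range(len(short) - k + 1):
        for j in index.get(short[i:i + k], ()):
            # extend left from the seed
            a, b = i, j
            while a > 0 and b > 0 and short[a - 1] == long_[b - 1]:
                a, b = a - 1, b - 1
            # extend right past the end of the seed
            e = 0
            while (i + k + e < len(short) and j + k + e < len(long_)
                   and short[i + k + e] == long_[j + k + e]):
                e += 1
            found.add((a, b, i + k + e - a))
    return sorted(found, key=lambda t: -t[2])
```

Once the longest run is claimed, you can mask it out of the short 
string and rerun to match the smaller leftovers, which gets you most of 
the "fuzzy, biggest pieces first" behavior without the full DP.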

--rick

Austad, Jay wrote:

>So, not really a linux question persay, however, I'll probably be trying to
>implement this with perl under linux.  :)
>
>Say I have a very large file of random characters, like 8GB.  And I have
>another smaller file of random data.  I want to take the largest chunks I
>can from the small file, and find out where in the large file they will fit
>(match).  Statistically, how likely is it to match strings of length 50
>chars, 100, 200, etc...  
>
>Also, what kind of search algorithm would be best for this.  Say I'm trying
>to match a string 50 characters long to something in the larger file and it
>matches, but, if I had started that string 20 characters sooner, I would
>have been able to match 70 characters instead of just 50.  I want kind of a
>fuzzy search algorithm that can find the largest matching pieces first, and
>then match the smaller leftovers.
>
>Any ideas? :)
>
>Jay
>
>_______________________________________________
>TCLUG Mailing List - Minneapolis/St. Paul, Minnesota
>http://www.mn-linux.org tclug-list at mn-linux.org
>https://mailman.real-time.com/mailman/listinfo/tclug-list
>


_______________________________________________
TCLUG Mailing List - Minneapolis/St. Paul, Minnesota
http://www.mn-linux.org tclug-list at mn-linux.org
https://mailman.real-time.com/mailman/listinfo/tclug-list