I have read the literature on fuzzy/approximate search for strings keys in large databases, but I could not find a clear answer on what is the best that has been achieved thus far in terms of the cost of search (regardless of cost of building any index).
I understand this depends on the application/domain/type of data, but, to keep things simple, supoose the target data is some string (names, addresses, or any sequence of characters), and assume any distance function. I am simply asking this: what is the best that has been achieved? Is there a logarithmic solution (i.e., that can retrieve a set of "similar" objects in some logarithmic order), or are they all, in the worst case, linear?
> I have read the literature on fuzzy/approximate search for strings > keys in large databases, but I could not find a clear answer on what > is the best that has been achieved thus far in terms of the cost of > search (regardless of cost of building any index).
> I understand this depends on the application/domain/type of data, but, > to keep things simple, supoose the target data is some string (names, > addresses, or any sequence of characters), and assume any distance > function. I am simply asking this: what is the best that has been > achieved? Is there a logarithmic solution (i.e., that can retrieve a > set of "similar" objects in some logarithmic order), or are they all, > in the worst case, linear?
Without a clear definition of exactly what you are trying to accomplish, your question cannot be answered.
If I have a hash table of strings, with no duplicates allowed, and my hash table is large enough, then insert, update, and delete are all O(1) for exact strings.
Are the strings short or long? What are the rules for similarity? Are you looking at soundex sort of similarity or Levenstein edit distance or something else? Can we have exact duplicate strings stored in our repository? What is the domain data? Personal names? DNA sequences? Do you expect to return the entire class in the case of a fuzzy match? I do not think you can expect an exact answer with a fuzzy question (even when the topic of the question is fuzzy). I guess if you specify the problem exactly you will get a better answer.
> I have read the literature on fuzzy/approximate search for strings > keys in large databases, but I could not find a clear answer on what > is the best that has been achieved thus far in terms of the cost of > search (regardless of cost of building any index).
> I understand this depends on the application/domain/type of data, but, > to keep things simple, supoose the target data is some string (names, > addresses, or any sequence of characters), and assume any distance > function. I am simply asking this: what is the best that has been > achieved? Is there a logarithmic solution (i.e., that can retrieve a > set of "similar" objects in some logarithmic order), or are they all, > in the worst case, linear?
> I thank you in advance...
Walid,
I am not aware of any single best algorithm, but I don't discount that there may be one. The ones from both the edit-distance- and token- based camps I see popping up quite a bit in the literature are Jaro- Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and Fellegi-Sunter.
I've begun doing a fair amount of work on cleaning (both de-duping and standardizing) name (personal names and corporate/business names) data, and am finding the combination of Jaccard and Jaro-Winkler promising.
There are scores of excellent papers out there, but two of my current favorites are:
"A Comparison of String Distance Metrics for Name-Matching Tasks" available here: http://tinyurl.com/4b8jao
--and--
"D-Dupe: An Interactive Tool for Entity Resolution in Social Networks" available here: http://tinyurl.com/3nt3vj
Here are links to some researchers that are leaders in this area:
_Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and Winkler _Data Quality_ by Batini and Scannapieco _Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson
> > I have read the literature on fuzzy/approximate search for strings > > keys in large databases, but I could not find a clear answer on what > > is the best that has been achieved thus far in terms of the cost of > > search (regardless of cost of building any index).
> > I understand this depends on the application/domain/type of data, but, > > to keep things simple, supoose the target data is some string (names, > > addresses, or any sequence of characters), and assume any distance > > function. I am simply asking this: what is the best that has been > > achieved? Is there a logarithmic solution (i.e., that can retrieve a > > set of "similar" objects in some logarithmic order), or are they all, > > in the worst case, linear?
> > I thank you in advance...
> Walid,
> I am not aware of any single best algorithm, but I don't discount that > there may be one. The ones from both the edit-distance- and token- > based camps I see popping up quite a bit in the literature are Jaro- > Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and > Fellegi-Sunter.
> I've begun doing a fair amount of work on cleaning (both de-duping and > standardizing) name (personal names and corporate/business names) > data, and am finding the combination of Jaccard and Jaro-Winkler > promising.
> There are scores of excellent papers out there, but two of my current > favorites are:
> "A Comparison of String Distance Metrics for Name-Matching Tasks" > available here:http://tinyurl.com/4b8jao
> --and--
> "D-Dupe: An Interactive Tool for Entity Resolution in Social Networks" > available here:http://tinyurl.com/3nt3vj
> Here are links to some researchers that are leaders in this area:
> _Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and > Winkler > _Data Quality_ by Batini and Scannapieco > _Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson