Google Groups Home
Help | Sign in
Approximate/Fuzzy Search for strings keys
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  4 messages - Collapse all
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
walid.s...@gmail.com  
View profile
 More options Jun 19, 12:36 pm
Newsgroups: comp.theory, comp.databases.theory, comp.ai.fuzzy
From: walid.s...@gmail.com
Date: Thu, 19 Jun 2008 09:36:57 -0700 (PDT)
Local: Thurs, Jun 19 2008 12:36 pm
Subject: Approximate/Fuzzy Search for strings keys
Dear all,

I have read the literature on fuzzy/approximate search for strings
keys in large databases, but I could not find a clear answer on what
is the best that has been achieved thus far in terms of the cost of
search (regardless of cost of building any index).

I understand this depends on the application/domain/type of data, but,
to keep things simple, supoose the target data is some string (names,
addresses, or any sequence of characters), and assume any distance
function. I am simply asking this: what is the best that has been
achieved? Is there a logarithmic solution (i.e., that can retrieve a
set of "similar" objects in some logarithmic order), or are they all,
in the worst case, linear?

I thank you in advance...


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
user923005  
View profile
 More options Jun 19, 4:50 pm
Newsgroups: comp.theory, comp.databases.theory, comp.ai.fuzzy
From: user923005 <dcor...@connx.com>
Date: Thu, 19 Jun 2008 13:50:44 -0700 (PDT)
Local: Thurs, Jun 19 2008 4:50 pm
Subject: Re: Approximate/Fuzzy Search for strings keys
On Jun 19, 9:36 am, walid.s...@gmail.com wrote:

> Dear all,

> I have read the literature on fuzzy/approximate search for strings
> keys in large databases, but I could not find a clear answer on what
> is the best that has been achieved thus far in terms of the cost of
> search (regardless of cost of building any index).

> I understand this depends on the application/domain/type of data, but,
> to keep things simple, supoose the target data is some string (names,
> addresses, or any sequence of characters), and assume any distance
> function. I am simply asking this: what is the best that has been
> achieved? Is there a logarithmic solution (i.e., that can retrieve a
> set of "similar" objects in some logarithmic order), or are they all,
> in the worst case, linear?

Without a clear definition of exactly what you are trying to
accomplish, your question cannot be answered.

If I have a hash table of strings, with no duplicates allowed, and my
hash table is large enough, then insert, update, and delete are all
O(1) for exact strings.

Are the strings short or long?
What are the rules for similarity?
Are you looking at soundex sort of similarity or Levenstein edit
distance or something else?
Can we have exact duplicate strings stored in our repository?
What is the domain data?  Personal names?  DNA sequences?
Do you expect to return the entire class in the case of a fuzzy match?
I do not think you can expect an exact answer with a fuzzy question
(even when the topic of the question is fuzzy).
I guess if you specify the problem exactly you will get a better
answer.


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
jefftyzzer  
View profile
 More options Jun 19, 10:47 pm
Newsgroups: comp.theory, comp.databases.theory, comp.ai.fuzzy
From: jefftyzzer <jefftyz...@sbcglobal.net>
Date: Thu, 19 Jun 2008 19:47:10 -0700 (PDT)
Local: Thurs, Jun 19 2008 10:47 pm
Subject: Re: Approximate/Fuzzy Search for strings keys
On Jun 19, 9:36 am, walid.s...@gmail.com wrote:

Walid,

I am not aware of any single best algorithm, but I don't discount that
there may be one. The ones from both the edit-distance- and token-
based camps I see popping up quite a bit in the literature are Jaro-
Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and
Fellegi-Sunter.

I've begun doing a fair amount of work on cleaning (both de-duping and
standardizing) name (personal names and corporate/business names)
data, and am finding the combination of Jaccard and Jaro-Winkler
promising.

There are scores of excellent papers out there, but two of my current
favorites are:

"A Comparison of String Distance Metrics for Name-Matching Tasks"
available here: http://tinyurl.com/4b8jao

--and--

"D-Dupe: An Interactive Tool for Entity Resolution in Social Networks"
available here: http://tinyurl.com/3nt3vj

Here are links to some researchers that are leaders in this area:

http://www.cs.umd.edu/~indrajit/HTML/cv.html
http://www.cs.umd.edu/~getoor/
http://www.cs.cmu.edu/~wcohen/
http://www.cecs.csulb.edu/~monge/
http://www.research.att.com/~divesh/papers/index.php

Also, these books may help:

_Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and
Winkler
_Data Quality_ by Batini and Scannapieco
_Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson

Finally, some software:

http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
http://secondstring.sourceforge.net/

Regards,

--Jeff


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
jefftyzzer  
View profile
 More options Jun 20, 12:47 pm
Newsgroups: comp.theory, comp.databases.theory, comp.ai.fuzzy
From: jefftyzzer <jefftyz...@sbcglobal.net>
Date: Fri, 20 Jun 2008 09:47:54 -0700 (PDT)
Local: Fri, Jun 20 2008 12:47 pm
Subject: Re: Approximate/Fuzzy Search for strings keys
On Jun 19, 7:47 pm, jefftyzzer <jefftyz...@sbcglobal.net> wrote:

Walid:

This graph (at the first software site I list above) may be what
you're ultimately looking for:

http://www.dcs.shef.ac.uk/~sam/images/string_metrics_comparison.jpg

--Jeff


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »

Create a group - Google Groups - Google Home - Terms of Service - Privacy Policy
©2008 Google