This page is dedicated to resources on detecting duplicates in web mining. The original presentation can be downloaded as pptx or ppt. Printable formats: pptx, ppt, pdf.
Introduction
A considerable portion of the documents found on the web are duplicated in various ways. Most of the time content is duplicated on purpose, but some people replicate content out of ignorance or with other motives. Detecting duplicates properly helps to improve various aspects of web mining. The two papers discussed in this presentation approach the problem along two different paths.
Why do we have duplicate content on the Web?
According to Bharat and Broder [3], there are a couple of reasons for this.
Why is duplicate detection important?
According to Bharat et al. [1], detecting duplicates helps to perform some tasks more effectively.
Why is Duplicate Content Identification Difficult?
Similarity of Collections
Identical Collections - In simple terms, two collections are considered identical if both have the same page links and the same content in corresponding pages.
Most of the time, however, two collections are not strictly identical, and we need a more relaxed definition to identify duplicates efficiently. So we consider "similarity" rather than "identity". The challenge is to find a measure that agrees with how humans judge the closeness of two copies and that can be computed automatically and efficiently over a large collection of data.
The option Cho et al. selected was textual overlap. Two documents are read line by line or sentence by sentence (see this thread for a discussion on this; this thread contains some thoughts on the insights behind this method), and each line or sentence is converted to a 32-bit hash. The two documents are then compared using those hashes to check which of them are equal. The result is an answer of the form "X out of Y are equal in the two documents". We define a threshold to decide whether two pages are considered similar. A simple implementation of this, written by Eran Chinthaka, can be found here.
The two papers discussed in this presentation take two different approaches to detecting duplicate content.
Clustering as a Means to Identify Duplicate Content
The paper suggests a method to find clusters of similar web sites. The similarity measure is used to identify similar pages between two sets of web sites, and clusters are formed based on the pairwise similarity of pages. The first step in this process is to identify trivial clusters. Trivial clusters consist of collections of size one. These clusters are then grown according to a growth strategy (please see the paper for more information about the growth strategy). When the authors applied the initial growth strategy, they found a problem caused by partial mirrors: some web sites duplicate only a portion of the original site and link to the original site for the rest. So the authors changed the initial growth strategy.
Bharat et al. used a few distinct methods for identifying similar web sites.
IP Address Based - When two sites have identical or similar IP addresses, they are considered duplicates. For example, the same host might have two web addresses.
URL String Based - Term vector matching can be carried out on the URL string. This might be full path matching, host name matching, or prefix matching.
URL String and Connectivity Based - In this method, the out-links from a web page are also considered, in addition to the URL string based matching above.
Host Connectivity Based - Two sites are considered similar if they link to a similar set of hosts.
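To make the textual overlap comparison from the Similarity of Collections section more concrete, here is a minimal Python sketch. It is not the implementation referred to on this page, and several details are assumptions made only for illustration: CRC32 stands in for the unspecified 32-bit hash, lines are lower-cased and whitespace-normalized before hashing, "Y" is taken to be the number of distinct chunk hashes across both pages, and the function names and the default threshold of 0.8 are hypothetical.

```python
import zlib


def chunk_hashes(text):
    """Return the set of 32-bit hashes of the non-empty lines in text.

    Assumption: CRC32 is used as the 32-bit hash, and each line is
    lower-cased and whitespace-normalized first. The source does not
    specify either choice.
    """
    hashes = set()
    for line in text.splitlines():
        normalized = " ".join(line.split()).lower()
        if normalized:
            hashes.add(zlib.crc32(normalized.encode("utf-8")) & 0xFFFFFFFF)
    return hashes


def overlap(doc_a, doc_b):
    """Return (X, Y): X chunk hashes shared by both documents,
    out of Y distinct chunk hashes seen in either document."""
    a = chunk_hashes(doc_a)
    b = chunk_hashes(doc_b)
    return len(a & b), len(a | b)


def is_similar(doc_a, doc_b, threshold=0.8):
    """Decide similarity by comparing X / Y against a threshold.

    The threshold value is a hypothetical default, not one taken
    from the papers.
    """
    shared, total = overlap(doc_a, doc_b)
    return total > 0 and shared / total >= threshold
```

For two four-line pages that differ only in their last line, `overlap` reports 3 shared hashes out of 5 distinct ones, so whether the pages count as similar depends entirely on the chosen threshold. Comparing at sentence level instead would only require changing how `chunk_hashes` splits the text.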
Discussion
Resources
References