Duplicate Web Site Detection    

This page collects resources on detecting duplicates in web mining. The original presentation can be downloaded as pptx or ppt. Printable formats: pptx, ppt, pdf.


Introduction 

A considerable portion of the documents found on the web are duplicated in various ways. Most of the time content is duplicated for a purpose, but some people replicate content either out of ignorance or for other motives. Detecting these duplicates properly helps to improve various aspects of web mining.

The two papers discussed here approach this problem along two different paths.

  • Cho et al describe a bottom-up approach that uses content-based analysis to detect replicated content.

  • Bharat et al describe a top-down analysis using page attributes (IP address, URL string and host connectivity).

 

Why do we have duplicate content on the Web?

According to Bharat and Broder [3], there are several reasons for this.

  • Technical - replication to improve access time or to provide high availability. E.g. the Apache web site is mirrored in different places.
  • Commercial - different agents offering the same products. E.g. insurance companies and their brokers.
  • Cultural - the same content in two different languages. E.g. Wikipedia in English and French.
  • Social - databases of shared research.

Why is duplicate detection important?

According to Bharat et al [1], detecting duplicates helps to perform several tasks more effectively.

  • Crawling - If the crawler knows the duplicates, it can adapt its algorithm, depending on the requirement, either to ignore the duplicates or to treat them differently.
  • Ranking - Ranking calculations can be distorted by duplicates. There are two sides to this. First, since duplicates may all point to a common resource, that resource may score artificially well in some ranking algorithms. At the same time, most sites are duplicated because they are important, and the ranking algorithm should take this into account.
  • Archiving - An archive does not need to store all the duplicate content. It might archive just one of the copies or, if the archiver has resource constraints, it might select only mirrored sites, on the assumption that they are important.
  • Improved Search Engine Results - If the duplicate content is known, we can refrain from putting duplicate links into the search results, so that the user does not see the same content after clicking on several hyperlinks.

Why is Duplicate Content Identification Difficult?

  • Update frequency - When the original page changes, it takes some time for the mirrors or duplicates to synchronize with the change. If a crawler fetches pages during this interval, the duplicate detection algorithm might not recognize the duplicate sites properly.
  • Mirror Partial Coverage - Some mirrors do not mirror the whole site. Instead, they mirror some parts of the site and link to the original site for the other pages. The reason might be that only the frequently accessed or rarely changing pages are replicated.
  • Different Formats - Some sites host duplicate content in different formats. For example, an original site might consist entirely of html documents, while a mirror has some of the pages as pdf or Word documents.
  • Partial Crawls - When collecting web pages for duplicate site detection, we might not crawl the whole web. In that case we might get only parts of some sites, and those parts can belong to a large mirrored site.

Similarity of Collections 

Identical Collections - In simple terms, two collections are considered identical if both have the same page links and the same contents in corresponding pages.

But most of the time two collections are not strictly identical, and we need a more relaxed definition to identify duplicates efficiently. So we consider "similarity" rather than "identity".

The challenge here is to come up with a definition that matches how humans judge the closeness of two copies, and that can be checked automatically and efficiently over a large collection of data. For this, Cho et al selected textual overlap.

Two documents are read line by line or sentence by sentence (see this thread for a discussion on this; this thread contains some thoughts on the insights behind this method), and each line or sentence is converted to a 32-bit hash. We then compare the two documents using those hashes to check which of them are equal, and come up with an answer like "X out of Y are equal in the two documents". We define a threshold to determine whether two pages are considered similar or not. A simple implementation of this, written by Eran Chinthaka, can be found here.
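The line-hashing comparison described above can be sketched in Python as follows. This is only an illustration, not the implementation linked above and not necessarily what Cho et al did: the use of CRC32 as the 32-bit hash, the whitespace and case normalization, the choice to measure overlap relative to the smaller document, and the default threshold of 0.8 are all assumptions made for this example.

```python
import zlib


def line_hashes(text):
    """Return the set of 32-bit hashes of the non-empty lines of a document.

    Assumption: lines are lower-cased and runs of whitespace are collapsed
    before hashing, and CRC32 stands in for "a 32-bit hash".
    """
    hashes = set()
    for line in text.splitlines():
        normalized = " ".join(line.split()).lower()
        if normalized:
            hashes.add(zlib.crc32(normalized.encode("utf-8")) & 0xFFFFFFFF)
    return hashes


def overlap(doc_a, doc_b):
    """Fraction of distinct line hashes shared by the two documents.

    Assumption: the fraction is taken relative to the smaller document,
    so a short page fully contained in a long one scores 1.0.
    """
    a, b = line_hashes(doc_a), line_hashes(doc_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def is_similar(doc_a, doc_b, threshold=0.8):
    """Apply a (hypothetical) threshold to the overlap fraction."""
    return overlap(doc_a, doc_b) >= threshold
```

Expressing the threshold as a fraction rather than an absolute count makes it less sensitive to document length, which is one possible answer to the question raised in the discussion section about comparing documents of different sizes.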

The two papers discussed in this presentation describe two different approaches to detecting duplicate content.

Clustering as a Means to Identify Duplicate Content

The paper by Cho et al suggests a method to find clusters of similar web sites. The similarity measure is used to identify similar pages between two sets of web sites, and clusters are formed based on the pairwise similarity of pages.

The first step in this process is to identify trivial clusters, i.e. clusters whose collections have size one. These clusters are then grown in accordance with a growth strategy (please see the paper for more information about the growth strategy).

When they applied the initial growth strategy, they found a problem caused by partial mirrors: some web sites duplicate only a portion of the original site and link to the original site for the rest. So the authors have changed the initial 

Bharat et al used a few distinct methods to identify similar web sites.

IP Address Based

When two sites have identical or similar IP addresses, they are considered duplicates. For example, the same host might have two web addresses.
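A minimal sketch of IP address based grouping is shown below. It assumes that the host-to-IP mapping has already been collected (no DNS lookups are done here), and it interprets "similar IP addresses" as sharing a leading number of octets. That interpretation, the function name `group_by_ip` and the `octets` parameter are assumptions for illustration only.

```python
from collections import defaultdict


def group_by_ip(host_to_ip, octets=4):
    """Group host names whose IPv4 addresses share the first `octets` octets.

    octets=4 groups only hosts with identical addresses; octets=3 is one
    possible (assumed) reading of "similar" addresses.
    Only groups with more than one host are returned.
    """
    groups = defaultdict(set)
    for host, ip in host_to_ip.items():
        key = ".".join(ip.split(".")[:octets])
        groups[key].add(host)
    return {key: sorted(hosts) for key, hosts in groups.items() if len(hosts) > 1}


# Example data using documentation-only address ranges.
hosts = {
    "www.example.org": "192.0.2.10",
    "example.org": "192.0.2.10",
    "mirror.example.net": "192.0.2.77",
    "other.example.com": "198.51.100.5",
}
candidates_exact = group_by_ip(hosts)           # identical addresses only
candidates_loose = group_by_ip(hosts, octets=3) # same first three octets
```

Groups found this way are only candidates: unrelated sites can share an address on a virtual host, so a further check is usually needed.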

URL String Based

Term vector matching can be carried out on the URL string. This might use full path matching, host name matching or prefix matching.
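As a rough illustration of term vector matching on URL strings, the sketch below splits a URL into alphanumeric terms and compares two URLs with cosine similarity. The tokenization rule, the unweighted term counts and the names `url_terms` and `cosine` are assumptions made for this example. The paper's actual term weighting may differ, and prefix matching is not covered here.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit


def url_terms(url, mode="full"):
    """Build a term vector from a URL.

    mode="host" uses only the host name; mode="full" uses host name and path.
    Terms are the maximal runs of letters and digits, lower-cased.
    """
    parts = urlsplit(url)
    text = parts.netloc if mode == "host" else parts.netloc + parts.path
    return Counter(t for t in re.split(r"[^a-z0-9]+", text.lower()) if t)


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


original = url_terms("http://www.apache.org/docs/index.html")
mirror = url_terms("http://apache.mirror.example.com/docs/index.html")
score = cosine(original, mirror)
```

Here the two URLs share the terms apache, docs, index and html, so the score is well above zero even though the host names differ.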

URL String and Connectivity Based

In this method, the out-links from a web page are also considered, in addition to the URL string based matching above.

Host Connectivity Based

Two sites are considered similar if they link to a similar set of hosts.
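One simple way to quantify "a similar set of hosts" is the Jaccard coefficient of the two sets of linked hosts, sketched below. The use of Jaccard, the exclusion of a site's links to itself, and the names `linked_hosts` and `jaccard` are assumptions for illustration; the paper may measure connectivity differently.

```python
from urllib.parse import urlsplit


def linked_hosts(outlinks, own_host=None):
    """Return the set of host names that a site's out-links point to.

    Links back to the site itself (own_host) and URLs without a host
    are ignored.
    """
    hosts = {urlsplit(url).hostname for url in outlinks}
    hosts.discard(None)
    hosts.discard(own_host)
    return hosts


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


site_a = linked_hosts(
    ["http://www.w3.org/TR/", "http://www.ietf.org/rfc.html", "http://site-a.example/about"],
    own_host="site-a.example",
)
site_b = linked_hosts(
    ["http://www.w3.org/", "http://www.ietf.org/", "http://www.iso.org/"],
    own_host="site-b.example",
)
connectivity = jaccard(site_a, site_b)
```

In this example the two sites share two of three distinct linked hosts, so the coefficient is 2/3.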


They found that the IP4 and prefix based methods gave the best approximations for detecting duplicate sites, but they also suggested combining all the techniques to arrive at a better measure.

 Discussion

  • Similarity check: how effective is the provided algorithm? Thread
  • How to identify duplicate pages? Thread
  • In what situations can one use link based and content based analysis for duplicate detection? Thread
  • What are the methods to improve content based analysis?
  • Can we merge the two methods? If merged, what improvements can we expect?
  • What is the threshold? Is it an integer or a percentage? How will it be calculated? How about comparing documents of different sizes?
  • How can we find similar pages? Say you have a web page and millions of other pages to compare it against to find similar page(s).

 Resources 

References
