Crawlers!    

Here is the original presentation on crawlers, given early in the semester.  Some points to consider for the papers presented here:

 

Heydon & Najork, "Mercator:  A Scalable, Extensible Web Crawler"

  • Single-threaded vs. multi-threaded crawling:  How does this decision influence ease-of-coding & bottleneck elimination?
  • Content-seen vs. URL-seen tests:  How are they done, and what is the distinction?
  • Scalability & extensibility
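
As a rough illustration of the distinction raised above, the sketch below treats a URL-seen test as a membership check on canonicalized URLs, and a content-seen test as a membership check on a fingerprint of the page body. It is a minimal sketch, not the Mercator implementation: the class name `SeenTests`, the `canonicalize` helper, the use of SHA-1 as the fingerprint, and the in-memory sets are all assumptions made for the example.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Hypothetical canonical form: lowercase scheme and host, drop the
    default port and the fragment, and use "/" for an empty path."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port in (None, default_port):
        netloc = host
    else:
        netloc = f"{host}:{parts.port}"
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))


class SeenTests:
    """In-memory URL-seen and content-seen tests (illustrative only)."""

    def __init__(self) -> None:
        self._urls = set()          # canonical URLs already encountered
        self._fingerprints = set()  # digests of page bodies already fetched

    def url_seen(self, url: str) -> bool:
        """Return True if an equivalent URL was recorded before."""
        key = canonicalize(url)
        if key in self._urls:
            return True
        self._urls.add(key)
        return False

    def content_seen(self, body: bytes) -> bool:
        """Return True if an identical page body was recorded before."""
        digest = hashlib.sha1(body).digest()  # SHA-1 is an assumption here
        if digest in self._fingerprints:
            return True
        self._fingerprints.add(digest)
        return False
```

The URL-seen test runs before a download and saves a fetch; the content-seen test runs after a download and catches the same page served under a different URL, such as a mirror.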

Pant, Srinivasan, Menczer, "Crawling the Web"

  • Comprehensiveness vs. preferential crawling:  When are these used, and why?
  • Crawling vs. searching:  How do they differ, and how are they functionally interrelated?
  • How can we avoid spider traps?

     (Questions taken from p. 23 of the paper, as they seem highly pertinent:)

  • Can crawlers help a search engine to focus on user interests?
  • Can a search engine help a crawler to focus on a topic?
  • Can a crawler on one machine help a crawler on another?
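
To make the frontier-related questions above more concrete, here is a minimal sketch of a crawl frontier that supports both styles of crawling and applies a few simple guards against spider traps. It is an illustration under stated assumptions, not code from the paper: the class name `Frontier`, the three limits (URL length, path depth, URLs per host), and their default values are hypothetical choices for the example.

```python
import heapq
from collections import Counter
from urllib.parse import urlsplit


class Frontier:
    """Priority-queue frontier with simple spider-trap guards (illustrative)."""

    def __init__(self, max_url_len=256, max_depth=8, max_per_host=1000):
        # All three thresholds are assumed values, not recommendations.
        self.max_url_len = max_url_len
        self.max_depth = max_depth
        self.max_per_host = max_per_host
        self._heap = []             # entries are (-score, insertion order, url)
        self._order = 0             # tie-breaker that preserves FIFO order
        self._queued = set()        # every URL accepted so far
        self._per_host = Counter()  # number of accepted URLs per host

    def _looks_like_trap(self, url: str) -> bool:
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        host = parts.hostname or ""
        return (
            len(url) > self.max_url_len
            or len(segments) > self.max_depth
            or self._per_host[host] >= self.max_per_host
        )

    def add(self, url: str, score: float = 0.0) -> bool:
        """Queue a URL; return False if it is a duplicate or looks like a trap."""
        if url in self._queued or self._looks_like_trap(url):
            return False
        self._queued.add(url)
        self._per_host[urlsplit(url).hostname or ""] += 1
        heapq.heappush(self._heap, (-score, self._order, url))
        self._order += 1
        return True

    def pop(self):
        """Return the highest-scoring queued URL, or None if the frontier is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

If every URL is added with the default score, the frontier behaves as a FIFO queue, which gives a breadth-first, comprehensive crawl. Supplying a relevance estimate as `score` turns it into a best-first, preferential crawl.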


 

This discussion on crawling dynamic pages is also very relevant.

Version: 
vivek.thakre@gmail.com 4.5KB Jan 21 2007 Mar 28 2007
Latest 3 messages about this page (8 total)
May 3 2007 by Craig Fraser
A while coming...
Click on http://groups.google.com/group/b659-web-mining/web/crawlers -
or copy & paste it into your browser's address bar if that doesn't
work.
Jan 19 2007 by fmenczer@gmail.com
I confess I am not following this thread, can someone clarify? How do
you 'create a link by the text on a page'?
Jan 18 2007 by Michel Salim
It occurred to me while reading. Certainly possible, though it will be
limited to absolute links (figuring out what is a relative link or not
is too hit-and-miss). A lot of discussion sites do this for you
automatically - type in a full http:// URL on a plaintext Slashdot
message and it gets converted to a link for you.
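
The conversion described in the message above can be sketched as follows. This is a minimal illustration, assuming a simple regular expression for absolute `http://` and `https://` URLs; the function name `linkify`, the pattern, and the trailing-punctuation rule are hypothetical, and the code makes no claim about how Slashdot or any other site implements the feature. Relative links are deliberately left alone, for the reason given in the message.

```python
import html
import re

# Assumed pattern: an absolute http(s) URL runs until whitespace, an angle
# bracket, or a quote character.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def linkify(text: str) -> str:
    """Escape plain text for HTML and wrap absolute URLs in anchor tags."""
    out = []
    pos = 0
    for match in URL_RE.finditer(text):
        # Treat trailing punctuation as part of the sentence, not the URL.
        url = match.group(0).rstrip(".,;:!?)")
        end = match.start() + len(url)
        out.append(html.escape(text[pos:match.start()]))
        safe = html.escape(url, quote=True)
        out.append(f'<a href="{safe}">{safe}</a>')
        pos = end
    out.append(html.escape(text[pos:]))
    return "".join(out)
```

A crawler that wants to follow such plain-text URLs could reuse the same pattern to extract them, since absolute URLs need no base document to resolve.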