Focused Crawling

  • Core Issues
    • Good seed URLs
    • Score each resource's content for topical relevance
    • Predict content from the URI pattern before fetching
  • Things to watch out for:
    • Politeness per IP/host
    • Spam and crawler traps
    • Bandwidth, storage, CPU, ...
  • Divide and Conquer
    • Partition URLs across multiple machines
    • Split the frontier into multiple queues
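
The points above could be combined roughly as in the sketch below. It is only an illustration under stated assumptions, not a reference crawler: `machine_for`, `Frontier`, the one-queue-per-host layout, the fixed per-host delay, and the externally supplied `score` are all hypothetical choices made for this example. URLs are partitioned by hashing the host, so all URLs of one site land on the same machine, and the frontier is split into per-host priority queues so politeness can be enforced locally.

```python
import hashlib
import heapq
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host part of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def machine_for(url, num_machines):
    """Map a URL to a machine index by hashing its host (assumed scheme).

    Hashing the host rather than the full URL keeps every URL of one
    site on the same machine.
    """
    digest = hashlib.sha1(host_of(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines


class Frontier:
    """Per-host priority queues with a minimum delay between fetches per host."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.queues = defaultdict(list)  # host -> heap of (-score, seq, url)
        self.next_allowed = {}           # host -> earliest next fetch time
        self.seen = set()                # URLs already enqueued
        self.seq = 0                     # tie-breaker for equal scores

    def add(self, url, score):
        """Enqueue a URL with a relevance score; return False if already seen."""
        if url in self.seen:
            return False
        self.seen.add(url)
        heapq.heappush(self.queues[host_of(url)], (-score, self.seq, url))
        self.seq += 1
        return True

    def pop(self, now):
        """Return the best-scored URL whose host may be fetched at `now`, else None."""
        best_host = None
        for host, heap in self.queues.items():
            # Skip empty queues and hosts still inside their politeness delay.
            if not heap or self.next_allowed.get(host, 0.0) > now:
                continue
            if best_host is None or heap[0] < self.queues[best_host][0]:
                best_host = host
        if best_host is None:
            return None
        _, _, url = heapq.heappop(self.queues[best_host])
        self.next_allowed[best_host] = now + self.delay
        return url
```

A crawler loop would call `pop(time.monotonic())` repeatedly, fetch the returned URL, score the outgoing links, and hand each link to the frontier on the machine chosen by `machine_for`. Linear scanning over hosts is the simplest option here; a larger frontier would likely keep a second heap of hosts ordered by their next allowed fetch time.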

