Computing Reviews
Web crawling
Olston C., Najork M. Foundations and Trends in Information Retrieval 4(3): 175-246, 2010. Type: Article
Date Reviewed: Aug 2 2010

The state of the art in Web crawling is reviewed in this excellent and wide-ranging paper. Web crawling may be the slightly unglamorous cousin of Internet search, but it remains the foundation of search. Web crawling involves visiting pages to provide a data store and index for search engines; it is also a valuable tool for investigating and mapping the Web.

Web crawling has to deal with a number of major issues. The space of possible pages is very large and, in effect, unbounded. Pages are updated and need to be revisited, but crawlers need to be well behaved and not too demanding of host resources. The so-called “deep Web” of pages accessible only through forms is also a challenge to Web crawler designers and users. In addition, there are duplicated Web pages, mirrored sites, and malicious or nonstandard navigation schemes.

The paper gives an excellent introduction to the necessary components of a Web crawler. It then addresses particular issues related to online and batch crawling, pages that change, the deep Web, and malicious Web sites (including Web spam). Many basic computer science (CS) concepts are used, but anyone with reasonable undergraduate CS knowledge will have no real difficulties.

This is a very good paper, and is thoroughly recommended for anyone interested in the development and use of Web crawlers. However, the section on future directions is perhaps too brief; while the topics addressed are certainly important, I wonder if the use of Web crawling techniques for investigating social networks, the challenge of Extensible Markup Language (XML) and the semantic Web, and the need for publicly available corpora could have been added.

Overall, this is an excellent starting point for anyone interested in the science and practice of Web search, and will be of interest to practitioners in search engine optimization, as well as academics and graduate students.

Reviewer:  David Parry Review #: CR138213 (1101-0089)
Information Search And Retrieval (H.3.3)
Search Process (H.3.3 ... )
World Wide Web (WWW) (H.3.4 ... )
