The hidden web, or deep web, is the portion of the web whose content can be accessed only by submitting queries to websites, and which therefore cannot be indexed by traditional search engines. Good examples are e-commerce and question-answering websites that dynamically generate web pages in response to submitted queries. Web content that is publicly accessible and can be indexed by search engines is referred to as the surface web. The deep web is estimated to be much larger than the surface web.
This paper addresses focused, or topic-sensitive, crawling: for example, retrieving only the political movies from a website about movies. The authors introduce an intuitive algorithm and provide a comparative evaluation of it on four websites under different policies. They also compare their approach with some previous work, mostly in terms of underlying principles.
I found the paper and the problem interesting. However, the presentation needs improvement. The paper contains forward references that make it harder to read, as well as some typographical errors. (The one in the second sentence of the second paragraph of Section 1 is unfortunate, since it appears so early and is so obvious.) The figures are not of high quality. For example, Figure 1 is too big and does not define the symbols it uses. Figure 4 uses the traditional flowchart decision box to indicate a process, which is misleading.