Mining the World Wide Web (WWW) for useful information has become an important research topic. The challenge is to retrieve documents that match a conceptual structure, rather than documents that merely match keywords. Machines struggle to determine the content of Web pages accurately, because understanding the context and semantics of that content is hard. This paper takes a stab at the problem by applying inductive logic programming to Hypertext Markup Language (HTML) pages.
This paper is the result of work done on a Web ontology extraction project. The project is envisioned in three parts: general pattern extraction, phrase/synonym extraction, and model extraction. This paper addresses the pattern extraction part, whose goal is to extract patterns for relationships among ontological concepts in a particular domain. During this stage, Web pages undergo annotation, part-of-speech tagging, and extended entity relationship tagging. Semantic trees are then generated, features are extracted, and rules are learned via Progol, a system based on first-order predicate calculus. The methodology is applied to a university-faculty-student domain, and results are reported.
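To make the feature extraction step more concrete, the following is a minimal sketch of how tagged tokens might be turned into ground facts of the kind a first-order rule learner consumes. It is not the authors' implementation: the tag names, the predicate names (`word`, `pos`, `entity`, `next`), and the function `to_facts` are all assumptions made for illustration, and the learner-specific declarations that Progol would need are omitted.

```python
from typing import List, Tuple

# (word, part-of-speech tag, entity tag); the tag sets are hypothetical.
Token = Tuple[str, str, str]


def to_facts(sentence_id: str, tokens: List[Token]) -> List[str]:
    """Turn one tagged sentence into Prolog-style ground facts.

    Assumes words contain no quote characters; "O" marks a token
    that carries no entity tag.
    """
    facts = []
    for i, (word, pos, entity) in enumerate(tokens):
        tok = f"{sentence_id}_t{i}"
        facts.append(f"word({tok}, '{word.lower()}').")
        facts.append(f"pos({tok}, {pos.lower()}).")
        if entity != "O":
            facts.append(f"entity({tok}, {entity.lower()}).")
        if i > 0:
            # Record word order so a learner can relate adjacent tokens.
            facts.append(f"next({sentence_id}_t{i - 1}, {tok}).")
    return facts


# Hypothetical sentence from a university-faculty-student page.
example = [
    ("Smith", "NNP", "FACULTY"),
    ("advises", "VBZ", "O"),
    ("Lee", "NNP", "STUDENT"),
]
facts = to_facts("s1", example)
for fact in facts:
    print(fact)
```

A rule learner given many such sentences, together with positive and negative examples of a target relation, could in principle induce a pattern linking a faculty entity to a student entity through an intervening verb.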
This paper is well written. The results are not especially exciting, but they appear to be a good first step. The selected problem domain, however, is perhaps not a good one for testing the validity of the method, and it is not completely clear how the method could be readily transferred to another domain.