The semantic web, the semantic version of the World Wide Web (WWW), allows for formal reasoning and querying. It has two components: formal ontologies, which encode domain-specific knowledge, and machine-readable annotations of web resources. This paper surveys statistical approaches to mining the semantic web and next-generation databases.
The semantic web faces many challenges: the uncertain and often incorrect nature of its knowledge, the reliance on axiomatic knowledge-based reasoning, the need to formalize knowledge as ontologies, closed-world assumptions (CWAs) about knowledge, and scaling up reasoning to the size of the web.
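The closed- versus open-world distinction mentioned above can be made concrete with a small illustrative sketch (not taken from the paper): under a closed-world assumption, a fact absent from the knowledge base is treated as false, whereas an open-world reading leaves it unknown.

```python
# Toy knowledge base of (subject, predicate, object) triples.
# Names and facts here are purely illustrative.
facts = {("Alice", "knows", "Bob")}

def closed_world(triple):
    # CWA: anything not stated is assumed false.
    return triple in facts

def open_world(triple):
    # OWA: anything not stated is simply unknown (None).
    return True if triple in facts else None

q = ("Alice", "knows", "Carol")
print(closed_world(q), open_world(q))  # False None
```

The open-world reading is the one native to OWL reasoning, which is part of why integrating database-style (closed-world) querying with semantic web knowledge is listed as a challenge.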
For knowledge representation, the semantic web uses the Resource Description Framework (RDF), RDF Schema, and the Web Ontology Language (OWL). OWL provides an expressive formalism and supports powerful inference, and RDF instance data can be clustered for link classification. Kernel methods separate the learning algorithm from the data representation: support vector machines (SVMs), the best-known kernel machines, never use an explicit representation of the training instances; instead, they capture the geometry of the feature space implicitly through kernel functions and can exploit language-dependent structures.
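The separation of learner and representation can be sketched in a few lines: an SVM trained on a precomputed kernel matrix sees only pairwise similarities between instances, never their explicit features. The data below is synthetic and the RBF kernel is just one illustrative choice; the paper's graph kernels over RDF would slot in the same way.

```python
# Minimal sketch: an SVM needs only a kernel (pairwise-similarity) matrix,
# not an explicit feature representation of the instances.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))                 # stand-in instances (e.g. graph features)
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # synthetic labels

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2): similarity in an implicit feature space
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K_train = rbf_kernel(X, X)
clf = SVC(kernel="precomputed").fit(K_train, y)   # learner sees only similarities
pred = clf.predict(rbf_kernel(X, X))
```

Swapping the kernel function (e.g. for a structure-aware kernel over RDF graphs) changes the representation without touching the learning algorithm, which is exactly the modularity the survey highlights.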
For the machine learning experiments, more than three thousand instances were randomly sampled and split evenly: 50 percent for training and the remaining 50 percent for testing. To implement all features of the infinite hidden semantic model, open-source packages such as Protege and Jena were used to load, stream, and query the ontology. The Pellet package provided OWL description logic (DL) reasoning: model parameters were learned from data, and constraints were checked through the DL reasoner.
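The evaluation protocol above can be sketched as a random 50/50 split into disjoint training and test halves; the instance count here is illustrative, standing in for the roughly three thousand sampled instances.

```python
# Hedged sketch of a random 50/50 train/test split (counts illustrative).
import random

random.seed(42)
instances = list(range(3000))        # stand-ins for the sampled instances
random.shuffle(instances)            # random assignment to the two halves
half = len(instances) // 2
train, test = instances[:half], instances[half:]
```

Each instance lands in exactly one half, so performance measured on `test` reflects generalization rather than memorization.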
Machine learning methods were used to automate a number of tasks, with logical inference integrated with inductive methods; the inductive methods proved robust against contradictory, incomplete, and inconsistent information. Finally, no single algorithm was found to be good enough for mining the semantic web as a whole: some worked well on social network data, while others worked well on scientific data such as biomedical and genomics datasets.
Overall, the paper provides an exhaustive, in-depth survey of methods and techniques for representation, reasoning, and machine learning on the semantic web. It presents both the foundational technologies of the semantic web (Web 3.0) and statistical learning methods. Though published in 2012, the paper is still relevant: the hopes for the semantic web described within have not yet been realized, and only a limited part of the web is semantic. For example, the prevalence of web advertisements and documents calls for smart information dissemination systems; intelligent web systems should collaborate, share advertisement information, and be able to authenticate the material being posted.
More and more data on the web is now semantically annotated; however, for the web to become smarter, it needs native language support and reliable sharing of web content, and it should be able to assess the damage or influence of web attacks. Meeting these expectations will require dynamic, mathematical model-based algorithms that can be integrated with the machine learning algorithms proposed in the paper.