As the web has grown rapidly, the question of whether the information contained in web pages is correct and trustworthy has become increasingly significant.
This paper describes an algorithm, and its implementation, that offers a solution to the problem of the trustworthiness of information collected from the web. The web can be considered a knowledge base, especially those parts that represent information as resource description framework (RDF) [1] triples, which constitute the basis of the semantic web; the basic structure of an RDF statement is subject-predicate-object, and all elements are web resources designated by uniform resource identifiers/internationalized resource identifiers (URIs/IRIs). RDF is the de facto standard for storing knowledge-intensive information on the web and is used, for example, on the linked data web. The various web ontology language (OWL) dialects are suited to storing ontologies that are more complex than the factual knowledge represented in standard RDF, which can be easily interpreted and reused by many tools. The logical languages belonging to the description logic (DL) families constitute the mathematical/logical background of the OWL dialects and provide a semantically rich representation tool [2]; thus, DL makes it possible to describe axioms and complex relationships among concepts.
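The subject-predicate-object structure of an RDF statement can be illustrated with a minimal sketch; the IRIs below are hypothetical examples, not resources taken from the paper.

```python
# A minimal sketch of one RDF statement as a (subject, predicate, object)
# triple of IRIs; all IRIs here are hypothetical examples.
triple = (
    "http://example.org/resource/Berlin",     # subject (IRI)
    "http://example.org/property/capitalOf",  # predicate (IRI)
    "http://example.org/resource/Germany",    # object (IRI)
)

# Extract the local names for a compact, human-readable rendering.
local = [iri.rsplit("/", 1)[-1] for iri in triple]
print(" ".join(local))  # Berlin capitalOf Germany
```

In practice such triples are serialized in formats like Turtle or RDF/XML and stored in triple stores, but the three-part structure is always the same.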
The concept of provenance originated outside information technology (IT), in the commerce of artifacts, jewelry, antiquities, and so on, and was later introduced in connection with electronic archiving and the long-term preservation of electronic documents. The provenance of an artifact provides a kind of certification of authenticity, just as provenance data can support the validity of information and knowledge stored on a website.
The proposed algorithm, DeFacto, collects information and pieces of knowledge, in either RDF or textual form, from the web and then evaluates the correctness and trustworthiness of the gathered information, along with a confidence factor. Furthermore, the temporal validity of the acquired information is considered: the algorithm attempts to assess the time interval in which the information might be valid, and the resulting provenance data, formulated as RDF triples, can be used to augment knowledge bases automatically. The input to the implemented algorithm is a query or statement that is passed to search engines to find textual evidence supporting the statement. Currently, pages in three languages (English, German, and French) are analyzed. The algorithm yields a confidence value for the validity of an input RDF statement.
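The pipeline described above can be sketched as follows; this is a hypothetical illustration, not the DeFacto implementation, and every function name here is an invented stub (a real system would call actual search engines and score textual evidence with trained models).

```python
# Illustrative sketch of a multilingual fact-validation pipeline:
# verbalize the input triple, retrieve candidate pages per language,
# score each page's evidence, and combine the scores into one
# confidence value in [0, 1]. All functions are stand-in stubs.

def to_search_query(statement, lang):
    # Stub: naive verbalization of the triple into a search query.
    subject, predicate, obj = statement
    return f"{subject} {predicate} {obj}"

def search_web(query, lang):
    # Stub standing in for a real search-engine call.
    return [f"page about {query} ({lang})"]

def score_evidence(statement, page):
    # Stub: a real system would score how well the page's text
    # supports the statement; here we just check for the subject.
    return 1.0 if statement[0] in page else 0.0

def validate(statement, languages=("en", "de", "fr")):
    scores = []
    for lang in languages:
        query = to_search_query(statement, lang)
        for page in search_web(query, lang):
            scores.append(score_evidence(statement, page))
    # Average per-page evidence into a single confidence value.
    return sum(scores) / len(scores) if scores else 0.0

confidence = validate(("Berlin", "capitalOf", "Germany"))
print(confidence)  # 1.0 with these trivial stubs
```

The averaging step is only one possible combination strategy; the point is the overall flow from an input statement to a single confidence value.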
To assess the accuracy of DeFacto, FactBench [3], a multilingual benchmark containing facts in textual and structured form, is used to evaluate the “goodness” of the proposed fact validation approach. The benchmark consists of 1,500 facts, together with their periods of validity, in English, German, and French. The facts are divided into training and testing sets to prevent overfitting of the learning algorithm.
The analysis of the statistical results shows that DeFacto can handle both recent and older facts. The multilingual approach helped achieve more stable results and better performance parameters. The classifier used in DeFacto is a machine learning algorithm, namely the support vector machine (SVM) provided by WEKA [4], which produces confidence values in addition to other outputs. The confidence values give users feedback on how trustworthy the information is and on the proof procedure, that is, the process of proving the validity of the facts. A proof result above 50 percent is considered satisfactory.
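The 50 percent cutoff mentioned above amounts to a simple threshold on the classifier's confidence output; the function name below is illustrative, not taken from the DeFacto code.

```python
# Hypothetical sketch of the satisfactory-proof cutoff: a proof is
# accepted when its confidence exceeds 0.5 (i.e., 50 percent).
def is_satisfactory(confidence, threshold=0.5):
    """Return True when a proof's confidence exceeds the threshold."""
    return confidence > threshold

print(is_satisfactory(0.73))  # True
print(is_satisfactory(0.42))  # False
```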
A paired t-test was used to show that the improvement is statistically significant, that is, that the multilingual approach produces better results. The F1 score was used to compare the multilingual version of the algorithm with other methods (simple logistic, naïve Bayes, SMO). The proposed algorithm with the multilingual approach produced the highest values in the statistical analysis.
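The F1 score used in the comparison is the standard harmonic mean of precision and recall; the counts in the example below are made up for illustration and are not results from the paper.

```python
# F1 score from raw counts: tp = true positives, fp = false positives,
# fn = false negatives.
def f1_score(tp, fp, fn):
    # Precision: fraction of predicted positives that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of actual positives that are found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only: precision = 0.8, recall ≈ 0.889.
print(round(f1_score(80, 20, 10), 3))  # 0.842
```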
The proposed algorithm is of interest to researchers involved in information modeling and automated knowledge elicitation, and to business IT technologists looking for applications built on trustworthy data.