Computing Reviews

The Mannheim Search Join Engine
Lehmberg O., Ritze D., Ristoski P., Meusel R., Paulheim H., Bizer C. Journal of Web Semantics 35, Part 3, 159-166, 2015. Type: Article
Date Reviewed: 03/14/16

Data used by computer systems has traditionally been highly structured and organized. Relational database management systems (RDBMSs), which date back to the 1970s, have been faithfully serving their users ever since. Lately, though, the explosive growth of the Internet and the World Wide Web has brought forth at least as much data as is held in conventional DBMSs. The problem is that Internet data is far less structured than traditional datasets, and thus much more difficult to analyze, let alone integrate, in a single data environment. This paper comes to the rescue, presenting the Mannheim Search Join Engine, a search engine that merges the two worlds into one, where data flows effortlessly and can be retrieved easily and quickly. The paper presents the architecture first, and then evaluates the system's performance in searching large corpora of data found on the Internet. Finally, the system is compared with other existing methods.

The system architecture is a three-step sequence: table indexing, followed by table search, followed by data consolidation. Table indexing consists of retrieving data at large, that is, from the Internet; normalizing the data to find the attributes the tables have in common; and choosing a set of unique data based on these attributes and indexing it. Table search queries data in local structured tables and compares it to the previously indexed set. Data consolidation consists of a series of standard left outer joins between the tables previously built. A large color figure at the beginning of this section of the paper is very helpful in explaining the process.
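To make the three steps concrete, here is a minimal sketch in Python. It is not the engine's implementation: the table contents, the `normalize` rule, and the function names are all assumptions chosen for illustration, and a real system would need far more robust matching than lowercasing and trimming a key.

```python
# Hypothetical data: none of these values come from the paper.
local_table = [
    {"country": "Germany"},
    {"country": "France"},
    {"country": "Italy"},
]

web_table = [
    {"country": "germany ", "population": 81_000_000},
    {"country": "FRANCE", "population": 66_000_000},
]


def normalize(value):
    """Lowercase and trim a key so that spelling variants compare equal."""
    return value.strip().lower()


def build_index(table, key):
    """Indexing step: map each normalized key value to its row."""
    return {normalize(row[key]): row for row in table}


def left_outer_join(left, right_index, key, new_column):
    """Search and consolidation steps: look up each local row in the index
    and append the new column. Unmatched local rows are kept with None,
    which is what makes this a left outer join."""
    joined = []
    for row in left:
        match = right_index.get(normalize(row[key]))
        value = match[new_column] if match else None
        joined.append({**row, new_column: value})
    return joined


index = build_index(web_table, "country")
result = left_outer_join(local_table, index, "country", "population")
for row in result:
    print(row)
```

Every row of the local table survives the join, so the row for Italy appears with an empty population rather than being dropped. That is the property that lets a local table be extended with whatever web data happens to be available.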

The system is then tested on large datasets published on the web in various forms. Its performance is evaluated in terms of both coverage, or how many results are found, and precision, or how close these results are to the real data. Although not many details on the experimental setup are given, the results are well documented with plenty of tables. At the end of the paper, after warning that their research field is relatively new, with little previous work available, the authors nonetheless give extensive references to the existing work, show the strengths and weaknesses of their own, and point to future developments.
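The two evaluation measures mentioned above can be illustrated with a small sketch. The definitions below are one plausible reading of "coverage" and "precision" as described in this review, not necessarily the exact formulas used in the paper, and the sample values are invented.

```python
def coverage(filled):
    """Fraction of rows for which some value was found (assumed definition)."""
    if not filled:
        return 0.0
    found = sum(1 for value in filled if value is not None)
    return found / len(filled)


def precision(filled, reference):
    """Fraction of found values that equal the reference value
    (assumed definition; rows with no value found are ignored)."""
    pairs = [(f, r) for f, r in zip(filled, reference) if f is not None]
    if not pairs:
        return 0.0
    correct = sum(1 for f, r in pairs if f == r)
    return correct / len(pairs)


# Hypothetical column filled in by a search join, and its ground truth.
filled = [81, 66, None, 47]
reference = [81, 67, 60, 47]

print(coverage(filled))
print(precision(filled, reference))
```

Here three of the four rows received a value, and two of those three values agree with the reference, so the two measures capture different aspects of quality: how much was filled in versus how much of it was right.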

A system like this may seem complex at first, and the need for such an extensive data retrieval campaign a little far-fetched, but we should keep in mind that the need for data is growing every day and that, at present, the only viable alternative is to search the web manually. A system like this is thus warmly welcome.

Reviewer:  Andrea Paramithiotti Review #: CR144232 (1605-0333)
