Schema matching is one of the challenging problems faced when handling multiple data sources. It becomes even more complicated when the matching must be done with only sample instances. Chuang and Chang’s work attempts to address this challenge.
The authors explain the pairwise schema matching techniques attempted by other researchers in the area of instance-based schema matching. They claim that holistic schema matching is the same as domain schema discovery. The major contribution seems to be the extension of pairwise schema matching to what could be termed weighted multi-pair integrated matching. Chuang and Chang verify the effectiveness of their algorithm using case studies from four domains: airfares, books, cars, and CDs. In these studies, the proposed holistic matching algorithm provides the best matching performance when compared with a few other algorithms, such as cluster and chain matching.
While their claim may be valid for the sample set used for the comparison, it would be very difficult to extend it as a general improvement without a much deeper analysis. First, the data sources selected in the chosen domains have relatively comparable schemas (for example, expedia.com and travelocity.com). Therefore, whether the algorithm would perform the same way with diverse schemas in the same domain remains an open question. Second, the authors do not address what would happen if the domains were changed and, particularly, if the number of fields increased significantly. Surprisingly, the authors claim to have carried out extensive experiments, even though they used only about 300 to 400 records, from 30 to 40 sample pages, in each domain. Nevertheless, the paper attempts to address a very important challenge in schema matching.