Dirty data is inaccurate, incomplete, or erroneous data in a computer system. One form of dirty data arises when a relational database contains multiple tuples for the same entity. This theoretical study addresses how to determine which of the entity's attribute values are correct, or most probably correct, especially when the values carry no timestamps. The authors propose a model that uses simple constraints to specify partial currency orders in the presence of copy functions, assuming a simple copy function from one relation to another. The paper focuses primarily on the issues that arise when the copy function is used to create one or more tuples. The authors discuss what information and functions are needed to decide whether the imported information is sufficient to answer a query and, if it is not, whether the copy function can be extended to bring in the current data the query needs. All of this must be done while ensuring that the constraints are satisfied and that the query set remains bounded in size.
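To make the setting concrete, the following is a minimal sketch of how simple constraints might induce a partial currency order over tuples that carry no timestamps. It is an illustration only, not the authors' formal model or algorithm: the records, the `STATUS_RANK` ordering, and the function names `currency_order` and `current_values` are all assumptions introduced here, and copy functions are not modeled.

```python
from itertools import permutations

# Hypothetical tuples for a single entity; none carries a timestamp.
records = [
    {"name": "Mary", "status": "single",   "salary": 50_000, "city": "NY"},
    {"name": "Mary", "status": "married",  "salary": 50_000, "city": "LA"},
    {"name": "Mary", "status": "divorced", "salary": 80_000, "city": "LA"},
]

# Assumed domain knowledge: marital status only moves forward in this order.
STATUS_RANK = {"single": 0, "married": 1, "divorced": 2}

# Assumed currency constraints: (attribute, predicate(older, newer)).
# Each predicate says when the first tuple is less current than the second.
constraints = [
    ("status", lambda a, b: STATUS_RANK[a["status"]] < STATUS_RANK[b["status"]]),
    ("salary", lambda a, b: a["salary"] < b["salary"]),  # salary never decreases
]

def currency_order(records, constraints):
    """Return {attribute: {(i, j)}} where tuple i is less current than tuple j."""
    order = {attr: set() for attr, _ in constraints}
    for i, j in permutations(range(len(records)), 2):
        for attr, older_than in constraints:
            if older_than(records[i], records[j]):
                order[attr].add((i, j))
    return order

def current_values(records, order, attr):
    """Candidate current values: those held by tuples no other tuple dominates."""
    dominated = {i for i, _ in order.get(attr, set())}
    return {r[attr] for i, r in enumerate(records) if i not in dominated}

order = currency_order(records, constraints)
for attr in ("status", "salary", "city"):
    print(attr, current_values(records, order, attr))
```

In this sketch the constraints single out one candidate for `status` and for `salary`, while `city` has no constraint, so both of its values remain possible. That leftover ambiguity is the kind of case in which it matters whether the available data suffices to answer a query.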
The paper identifies seven problems associated with data currency and currency preservation, and establishes upper and lower complexity bounds for them across a variety of query languages. These results are of theoretical value and may be useful to database programmers when questions about the currency of data arise. However, most of the problems explored are intractable, so the authors propose further study of heuristic algorithms and incremental analysis to cope with the complexity. Finally, they raise a concern about the fundamental differences between data consistency and data currency.
This paper is very readable and well organized, and the authors make a significant contribution to the study of the challenges of managing databases that are modified over their lifetimes. Data and database researchers and data administrators will find it worthwhile.