Publishing data for secondary analysis benefits applications ranging from medical research to policy making; however, it also risks invading individual privacy and enabling discrimination. This paper develops a data sanitization framework that offers both privacy preservation and discrimination prevention in data publishing.
The authors use k-anonymity to define privacy and the concept of α-protection to define discrimination, which intuitively measures the difference in sensitive outcomes (for example, benefits) between protected and unprotected social groups. Based on these definitions, the authors propose enhancing Incognito (a full-domain generalization framework) to support both k-anonymity and α-protection. An evaluation using both general and specific data analysis metrics is presented.
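To make the two definitions concrete, the following is a minimal sketch of checks a sanitization framework might run on a released table. The function names, attribute names, and the particular discrimination measure (a simple difference in benefit rates, bounded by α) are illustrative assumptions, not the paper's exact formulation, which may use other measures of group disparity.

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """A table is k-anonymous if every combination of quasi-identifier
    values is shared by at least k records."""
    groups = Counter(tuple(r[a] for a in quasi_ids) for r in records)
    return all(count >= k for count in groups.values())

def is_alpha_protective(records, protected_attr, protected_value,
                        outcome_attr, benefit_value, alpha):
    """Illustrative alpha-protection check (assumed measure): the gap in
    benefit rates between the unprotected and protected groups must not
    exceed alpha."""
    prot = [r for r in records if r[protected_attr] == protected_value]
    unprot = [r for r in records if r[protected_attr] != protected_value]
    if not prot or not unprot:
        return True  # no comparison possible; treat as trivially protective
    rate = lambda grp: sum(r[outcome_attr] == benefit_value for r in grp) / len(grp)
    return rate(unprot) - rate(prot) <= alpha

# Hypothetical four-record table for illustration
records = [
    {"zip": "123", "age": "30-40", "sex": "F", "credit": "yes"},
    {"zip": "123", "age": "30-40", "sex": "F", "credit": "no"},
    {"zip": "123", "age": "30-40", "sex": "M", "credit": "yes"},
    {"zip": "123", "age": "30-40", "sex": "M", "credit": "yes"},
]
print(is_k_anonymous(records, ["zip", "age"], 4))                        # all 4 share one group
print(is_alpha_protective(records, "sex", "F", "credit", "yes", 0.4))    # rate gap is 0.5 > 0.4
```

An enhanced Incognito-style search would generalize quasi-identifier values (for example, coarsening ZIP codes or age ranges) until both predicates hold, which is the combined condition the authors target.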
In conclusion, the authors have developed a new data sanitization method that preserves privacy and prevents discrimination in data publishing. This work is among the first to address the two problems simultaneously. However, a set of key questions remains unanswered. For example, how may privacy preservation and discrimination prevention interfere with each other? It is critical to understand their interplay and how it may affect the utility of the anonymized data. Moreover, the proposed method considers only the most basic privacy definition, k-anonymity, which is known to offer limited privacy protection. How can more advanced definitions, such as l-diversity, t-closeness, and differential privacy, be incorporated?