Individuals, organizations, and signals from automatic devices produce enormous amounts of heterogeneous data. Data lakes suffice for storage needs. Beyond storage, however, enterprises also require fast analysis of their private data. But how can one find the right data for one's purposes in a “haymow” fed by thousands or millions of sources, producing thousands of files, tables, and databases? According to Sean Martin, chief technology officer (CTO) of Cambridge Semantics: “The main challenge is not creating a data lake, but taking advantage of the opportunities it presents.”
Metadata attached to datasets helps. For example, relational database management system (RDBMS) metadata (data dictionaries, system catalogs) describes the structure of and dependencies among data, constraints, procedures, access rights, and storage. This paper treats annotations associated with datasets from a broader perspective. It enumerates the roles annotations must fill and the problems of producing and exploring each annotation type. It discusses how existing systems handle annotations, either by implementing searchable metadata hubs or by producing annotations of a specific type. The Data Catalog Vocabulary (DCAT), a World Wide Web Consortium (W3C) Recommendation, is mentioned as a possible standard for data annotations. Moreover, the authors argue that every dataset should include a datasheet that documents its motivation, composition, collection process, and recommended uses. The speed and number of datasets springing into existence require automatically created annotations alongside the expert-created ones; the authors refer to systems that aim to address this requirement.
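To make the idea of a machine-readable dataset annotation concrete, the sketch below encodes a few DCAT-style fields (title, description, publisher, keywords, issue date) as a Python dataclass and serializes it to JSON, the kind of document a metadata hub could index. The class and its field names are hypothetical illustrations that merely mirror DCAT property names; this is not an API of DCAT, Amundsen, or any other system mentioned in the paper.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class DatasetAnnotation:
    """Hypothetical annotation record with DCAT-style fields."""
    title: str
    description: str
    publisher: str                                 # cf. dct:publisher
    keywords: list = field(default_factory=list)   # cf. dcat:keyword
    issued: str = ""                               # cf. dct:issued (ISO 8601)

    def to_json(self) -> str:
        # Serialize into a flat JSON document a metadata hub could index.
        return json.dumps(asdict(self), indent=2)

ann = DatasetAnnotation(
    title="City sensor readings",
    description="Hourly air-quality measurements from street sensors",
    publisher="Example City Open Data Office",
    keywords=["air quality", "sensors"],
    issued="2021-06-01",
)
print(ann.to_json())
```

Richer annotations, such as the datasheets the authors advocate, would extend such a record with fields for motivation, composition, collection process, and recommended uses.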
For data collected and published by government organizations, transparency and accountability are a must. A dozen data transparency dimensions are treated, among them record transparency, use transparency (for example, artificial intelligence ethics and fairness), disclosure and data provisioning transparency, algorithm transparency, and laws and transparency policies. Annotations should also provide measures of data quality and data complexity.
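One data-quality measure that lends itself to the automatic annotation the authors call for is completeness, the fraction of non-missing cells in a tabular dataset. A minimal sketch, assuming records arrive as Python dicts (the function and its missing-value convention are illustrative, not taken from the paper):

```python
def completeness(records, columns):
    """Fraction of non-missing cells across the given columns.

    A cell counts as missing when the key is absent, None, or an
    empty string. Returns 1.0 for empty input by convention.
    """
    total = len(records) * len(columns)
    if total == 0:
        return 1.0
    filled = sum(
        1
        for row in records
        for col in columns
        if row.get(col) not in (None, "")
    )
    return filled / total

rows = [
    {"id": 1, "temp": 21.5},
    {"id": 2, "temp": None},   # explicit missing value
    {"id": 3},                 # "temp" column absent entirely
]
print(completeness(rows, ["id", "temp"]))  # 4 of 6 cells are filled
```

A pipeline could attach such a score to each dataset's annotation so that users can judge fitness for use before downloading anything.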
The paper distinguishes between data owners, who are accountable for the data, and, at the operational level, data stewards (providers) and data users. Annotations should help both levels.
The authors “are currently working on a proof-of-concept by leveraging existing open source systems including Lyft Amundsen, Grafana, and Apache Superset.” They have started implementing “the reference architecture using Lyft’s Amundsen as the underlying metadata hub.” The authors claim their initial results are promising, but fail to include any details.
The paper is full of remarkable ideas on metadata purposes, types, creation, and usage. The references to the literature, their evaluation, and the systems under development it mentions are organized around the paper's main points: automated annotations for data transparency, and system architecture. It is recommended reading for those interested in new notions and challenges related to building and harnessing large sets of heterogeneous datasets.