A Declarative Entity Representation Framework
- Arvind Arasu ,
- Raghav Kaushik
Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD) |
Fundamental to data cleaning is the need to account for multiple data representations. We propose a formal framework that can be used to reason about and manipulate data representations. The framework is declarative and combines elements of a generative grammar with database querying. It also incorporates actions in the spirit of programming language compilers. This framework has multiple applications such as parsing and data normalization. Data normalization is interesting in its own right in preparing data for analysis as well as in pre-processing data for further cleansing. We empirically study the utility of the framework over several real-world data cleaning scenarios and find that with the right normalization, often the need for further cleansing is minimized.
Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.SIGMOD'09, June 29-July 2, 2009, Providence, Rhode Island, USA.Copyright 2009 ACM 978-1-60558-551-2/09/06 ...$5.00.