Auto-Transform: Learning-to-Transform by Patterns

International Conference on Very Large Databases (VLDB)

Data transformation is a long-standing problem in data management. Recent work adopts a "transform-by-example" (TBE) paradigm that infers transformation programs from user-provided input/output examples. This greatly improves usability and has brought such features into mainstream software such as Microsoft Excel, Power BI, and Trifacta.

While TBE is significant progress, requiring users to provide paired input/output examples still limits its applicability. In this work, we study an alternative that transforms data based only on input/output data patterns, without paired examples. We term this new paradigm transform-by-patterns (TBP). Specifically, we demonstrate that there is a rich class of TBP transformations that can be "learned" from large collections of paired table columns. We show that the proposed method can harvest such transformations across diverse domains and corpora, including corpora in different languages such as English, Chinese, and Spanish. The TBP transformations obtained this way can be used in scenarios such as suggesting data repairs in tables or automating transformations in ETL pipelines. Extensive experiments on real data suggest that TBP outperforms existing methods on tasks such as data repair, and is a promising direction for future research.
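To make the TBP idea concrete, the following is a minimal illustrative sketch, not the method proposed in this work. It assumes a simple pattern language (runs of digits and letters with their lengths) and a single hand-written rule mapping one date pattern to another, read as MM/DD/YYYY. In the paper such rules are learned from large collections of paired table columns, whereas here the rule table is hard-coded. The names `generalize`, `RULES`, and `suggest_repairs` are hypothetical.

```python
import re
from collections import Counter

# Assumed pattern language: runs of ASCII digits or letters are abstracted
# to <digit>{n} / <letter>{n}; every other character is kept literally.
TOKEN = re.compile(r"(?P<digit>[0-9]+)|(?P<letter>[A-Za-z]+)|(?P<other>.)", re.DOTALL)


def generalize(value: str) -> str:
    """Map a string value to its coarse data pattern."""
    parts = []
    for m in TOKEN.finditer(value):
        if m.lastgroup == "other":
            parts.append(m.group())
        else:
            parts.append(f"<{m.lastgroup}>{{{len(m.group())}}}")
    return "".join(parts)


# Hypothetical rule table: (source pattern, target pattern) -> program.
# Hard-coded here for illustration only; the target is assumed to be MM/DD/YYYY.
RULES = {
    ("<digit>{4}-<digit>{2}-<digit>{2}", "<digit>{2}/<digit>{2}/<digit>{4}"):
        lambda v: "{1}/{2}/{0}".format(*v.split("-")),
}


def suggest_repairs(column: list[str]) -> dict[str, str]:
    """Suggest rewrites for values whose pattern differs from the column's dominant one."""
    patterns = [generalize(v) for v in column]
    dominant = Counter(patterns).most_common(1)[0][0]
    repairs = {}
    for value, pattern in zip(column, patterns):
        program = RULES.get((pattern, dominant))
        if pattern != dominant and program is not None:
            repairs[value] = program(value)
    return repairs


if __name__ == "__main__":
    print(suggest_repairs(["03/01/2019", "04/15/2019", "2019-05-20"]))
```

No paired input/output example is supplied at any point: the source pattern of the outlier value and the dominant pattern of the column together select the transformation, which is the distinction from TBE drawn above.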