Optimizing Data Pipelines for Machine Learning in Feature Stores

Rui Liu; Kwanghyun Park; Fotis Psallidas; Xiaoyong Zhu; Jinghui Mo; Rathijit Sen; Matteo Interlandi; Konstantinos Karanasos; Yuanyuan Tian; Jesús Camacho-Rodríguez

Proc. VLDB Endow. | August 2023 , Vol 16: pp. 4230-4239

Data pipelines (i.e., converting raw data to features) are critical for machine learning (ML) models, yet their development and management are time-consuming. Feature stores have recently emerged as a new “DBMS-for-ML” with the premise of enabling data scientists and engineers to define and manage their data pipelines. While current feature stores fulfill their promise from a functionality perspective, they are resource-hungry, leaving ample opportunities for database-style optimizations to enhance their performance. In this paper, we propose a novel set of optimizations specifically targeted at the point-in-time join, which is a critical operation in data pipelines. We implement these optimizations on top of Feathr, a widely used feature store, and evaluate them on use cases from both the TPCx-AI benchmark and real-world online retail scenarios. Our thorough experimental analysis shows that our optimizations can accelerate data pipelines by up to 3× over state-of-the-art baselines.
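
For readers unfamiliar with the operation, a point-in-time join attaches to each observation the most recent feature value recorded at or before the observation's timestamp, so that no future information leaks into training data. The sketch below illustrates only these semantics in plain Python. It is not the paper's optimized implementation and not Feathr's API; the function name `point_in_time_join`, the tuple layouts, and the sample data are all hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def point_in_time_join(observations, features):
    """Naive point-in-time join (illustrative only).

    observations: iterable of (key, ts)
    features:     iterable of (key, ts, value)
    Returns a list of (key, ts, value), where value is the latest feature
    value with feature ts <= observation ts, or None if none exists.
    """
    # Group feature rows per entity key, ordered by timestamp.
    ts_by_key = defaultdict(list)
    val_by_key = defaultdict(list)
    for key, ts, value in sorted(features, key=lambda r: (r[0], r[1])):
        ts_by_key[key].append(ts)
        val_by_key[key].append(value)

    joined = []
    for key, ts in observations:
        timestamps = ts_by_key.get(key, [])
        # Number of feature rows with timestamp <= ts.
        idx = bisect_right(timestamps, ts)
        value = val_by_key[key][idx - 1] if idx > 0 else None
        joined.append((key, ts, value))
    return joined


# Hypothetical sample data: integer timestamps for brevity.
observations = [("u1", 5), ("u1", 12), ("u2", 3)]
features = [("u1", 1, 0.2), ("u1", 10, 0.7), ("u2", 4, 0.9)]
result = point_in_time_join(observations, features)
```

Here the observation `("u2", 3)` receives no feature value because the only feature row for `u2` is timestamped later than the observation.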