{"id":967230,"date":"2023-11-08T16:46:43","date_gmt":"2023-11-09T00:46:43","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=967230"},"modified":"2023-11-08T16:46:45","modified_gmt":"2023-11-09T00:46:45","slug":"query-acceleration-for-data-lakes","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/query-acceleration-for-data-lakes\/","title":{"rendered":"Query Acceleration for Data Lakes"},"content":{"rendered":"
\n\t
\n\t\t
\n\t\t\t\t\t<\/div>\n\t\t\n\t\t
\n\t\t\t\n\t\t\t
\n\t\t\t\t\n\t\t\t\t
\n\t\t\t\t\t\n\t\t\t\t\t
\n\t\t\t\t\t\t
\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n

Query Acceleration for Data Lakes<\/h1>\n\n\n\n

Accelerating query processing on open data formats<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n

As businesses become more data-driven, large enterprises are increasingly adopting data lakes (e.g., Microsoft Fabric<\/a>). A data lake is a large storage repository that holds vast amounts of data in a variety of open data formats, making it accessible to all current and future use cases (e.g., AI, data science, BI, and reporting). These formats include text-based raw data formats such as CSV and JSON, row-wise binary formats such as Apache Avro, and batched column-wise formats such as Apache Parquet and ORC. In a data lake, data is ingested in its native open format without expensive and time-consuming data preparation.<\/p>\n\n\n\n

We are innovating on the storage tier of this emerging architecture to accelerate query processing on various open data formats. Our research has been commercialized and is widely used in several Microsoft products. Examples of the techniques we developed include:<\/p>\n\n\n\n