{"id":1042161,"date":"2024-06-10T09:00:00","date_gmt":"2024-06-10T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/lst-bench-a-new-benchmark-tool-for-open-table-formats-in-the-data-lake\/"},"modified":"2024-06-05T12:57:11","modified_gmt":"2024-06-05T19:57:11","slug":"lst-bench-a-new-benchmark-tool-for-open-table-formats-in-the-data-lake","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/lst-bench-a-new-benchmark-tool-for-open-table-formats-in-the-data-lake\/","title":{"rendered":"LST-Bench: A new benchmark tool for open table formats in the data lake"},"content":{"rendered":"\n

This paper was presented at the <\/strong><\/em>ACM SIGMOD\/Principles of Database Systems Conference<\/em><\/strong><\/span><\/a> (SIGMOD\/PODS 2024), the premier forum on large-scale data management and databases.<\/strong><\/em><\/em><\/p>\n\n\n\n

\"SIGMOD<\/figure>\n\n\n\n

As organizations grapple with ever-expanding datasets, data lakes have become a vital strategy for scalable, cost-effective data management. The success of these systems depends largely on the file formats used to store the data. Traditional formats compress and organize data efficiently but falter under frequent updates. Advanced table formats like Delta Lake, Apache Iceberg, and Apache Hudi make data modification and historical tracking easier, yet their effectiveness hinges on how well they handle continuous updates, a question that requires thorough evaluation.<\/p>\n\n\n\n

Our paper, \u201cLST-Bench: Benchmarking Log-Structured Tables in the Cloud<\/span><\/a>,\u201d presented at SIGMOD 2024, introduces a tool for evaluating the performance of different table formats in the cloud. LST-Bench builds on the well-established\u00a0TPC-DS<\/span><\/a>\u00a0benchmark\u2014which measures how efficiently systems handle large datasets and complex queries\u2014and adds features designed specifically for table formats, making it simpler to test them under real-world conditions. It also runs tests automatically and collects essential data from both the computational engine and various cloud services, enabling accurate performance evaluation.<\/p>\n\n\n\n

Flexible and adaptive testing<\/h2>\n\n\n\n

Designed for flexibility, LST-Bench adapts to a broad range of scenarios, as illustrated in Figure 1. The framework was developed with input from engineers, so it makes existing workloads like TPC-DS easy to integrate and promotes reuse. For example, each test session establishes a new connection to the data-processing engine and organizes its tasks as a series of statements. This setup lets developers run multiple tasks either sequentially within a single session or concurrently across multiple sessions, reflecting real-world application patterns.<\/p>\n\n\n\n

\"A
Figure 1. Workload components in LST-Bench and their relationships. A task is a sequence of SQL statements, while a session is a sequence of tasks that represents a logical unit of work or a user session. A phase is a group of concurrent sessions that must be completed before the next phase can start. Lastly, a workload is a sequence of phases.<\/figcaption><\/figure>\n\n\n\n
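The hierarchy described in Figure 1 can be sketched in a few lines of Python. This is an illustrative model only, not LST-Bench's actual API or configuration format: the class names (Task, Session, Phase, Workload), the run_workload helper, and the execute callable standing in for an engine connection are all assumptions introduced here.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    statements: list[str]  # SQL statements, executed in order


@dataclass
class Session:
    tasks: list[Task]  # tasks run sequentially within one session


@dataclass
class Phase:
    name: str
    sessions: list[Session]  # sessions in a phase run concurrently


@dataclass
class Workload:
    phases: list[Phase]  # phases run strictly one after another


def run_session(session, execute):
    # 'execute' is a hypothetical callable that takes one SQL string;
    # a real runner would open a fresh engine connection per session.
    for task in session.tasks:
        for statement in task.statements:
            execute(statement)


def run_workload(workload, execute):
    for phase in workload.phases:
        workers = max(1, len(phase.sessions))
        # Leaving the 'with' block waits for every session in the phase,
        # so the next phase cannot start before this one completes.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(run_session, s, execute)
                       for s in phase.sessions]
            for future in futures:
                future.result()  # re-raise any error from a session
```

Passing any callable that accepts a SQL string, such as a list's append method, is enough to trace the order in which statements would be issued: statements within a session stay ordered, sessions within a phase may interleave, and no statement from a later phase runs before the earlier phase finishes.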

The TPC-DS workload comprises the following foundational tasks:<\/p>\n\n\n\n