Encoding Spreadsheets for Large Language Models
- Haoyu Dong
- Yuzhang Tian
- Jianbo Zhao
- Junyu Xiong
- Mengyu Zhou
- Yun Lin
- José Cambronero
- Yeye He
- Shi Han
- Dongmei Zhang
The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP '24)
Spreadsheets are characterized by extensive two-dimensional grids, flexible layouts, and varied formatting options, all of which pose significant challenges for large language models (LLMs). In response, we introduce SheetEncoder, an efficient encoding method designed to unlock and optimize LLMs' understanding and reasoning capabilities on spreadsheets. We first propose a vanilla serialization approach that incorporates cell addresses, values, and formats. However, this approach is limited by LLMs' token constraints, making it impractical for most applications. To address this limitation, we propose three modules that compress spreadsheets effectively: structural-anchor-based compression, inverse index translation, and data-format-aware aggregation. SheetEncoder significantly improves performance on the spreadsheet table detection task, outperforming the vanilla approach by 25.6% in GPT4's in-context learning setting. Moreover, an LLM fine-tuned with SheetEncoder, at an average compression ratio of 25×, achieves a state-of-the-art 78.9% F1 score, surpassing the best existing models by 12.3%. These results demonstrate that SheetEncoder greatly boosts LLMs' performance on spreadsheet data.
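
As a rough illustration of two ideas named in the abstract, the sketch below contrasts a vanilla cell-by-cell serialization with an inverse index that groups cell addresses by shared value. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the `address,value` text format, the `|` separator, and the sample sheet are all hypothetical, and cell formats are left out for brevity.

```python
from collections import defaultdict


def vanilla_serialize(cells):
    """Emit every cell as 'address,value', joined by '|'.

    The output format is an assumption made for illustration only.
    Empty cells are still emitted, so the text grows with the grid size.
    """
    return "|".join(f"{addr},{value}" for addr, value in cells.items())


def inverse_index(cells):
    """Map each distinct non-empty value to the list of addresses holding it.

    Empty cells are dropped and repeated values are stored once, which is
    where the saving over the vanilla form comes from in this sketch.
    """
    index = defaultdict(list)
    for addr, value in cells.items():
        if value != "":
            index[value].append(addr)
    return dict(index)


# Hypothetical 3x3 sheet: one empty column and one repeated value.
cells = {
    "A1": "Year", "B1": "Sales", "C1": "",
    "A2": "2023", "B2": "100", "C2": "",
    "A3": "2024", "B3": "100", "C3": "",
}

vanilla = vanilla_serialize(cells)
index = inverse_index(cells)
print(vanilla)
print(index)
```

In this toy sheet the vanilla form carries nine entries, three of them empty, while the inverse index keeps five keys and lists `B2` and `B3` together under the shared value `100`. Real spreadsheets with large empty regions and many repeated values would benefit more, though the compression ratio reported in the abstract comes from the three modules combined, not from this sketch.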