
SeCom

On Memory Construction & Retrieval for Personalized Conversational Agents

How can conversational agents better retain and retrieve past interactions for more coherent and personalized experiences? Our latest work on memory construction and retrieval, SeCom, tackles this challenge head-on!

Existing approaches typically perform retrieval-augmented response generation by constructing memory banks from conversation history at the turn level, at the session level, or through summarization. In SeCom, we present two key findings: (1) the granularity of the memory unit matters: turn-level, session-level, and summarization-based methods each exhibit limitations in both memory retrieval accuracy and the semantic quality of the retrieved content; (2) prompt compression methods, such as LLMLingua-2, can effectively serve as a denoising mechanism, enhancing memory retrieval accuracy across different granularities.

Building on these insights, we propose SeCom, a method that constructs the memory bank at the segment level by introducing a conversation Segmentation model that partitions long-term conversations into topically coherent segments, while applying Compression-based denoising on memory units to enhance memory retrieval. Experimental results show that SeCom exhibits a significant performance advantage over baselines on the long-term conversation benchmarks LOCOMO and Long-MT-Bench+. Additionally, the proposed conversation segmentation method demonstrates superior performance on dialogue segmentation datasets such as DialSeg711, TIAGE, and SuperDialSeg.

Key Takeaways

💡 Memory granularity matters: Turn-level, session-level & summarization-based memory struggle with retrieval accuracy and the semantic integrity or relevance of the context.

💡 Prompt compression methods (e.g., LLMLingua-2) can denoise memory retrieval, boosting both retrieval accuracy and response quality.

✅ SeCom – an approach that segments conversations topically for memory construction and performs memory retrieval based on compressed memory units.

📊 Result – superior performance on long-term conversation benchmarks such as LOCOMO and Long-MT-Bench+!

What's the Impact of Memory Granularity?

We first systematically investigate the impact of different memory granularities on conversational agents within the paradigm of retrieval-augmented response generation. Our findings indicate that turn-level, session-level, and summarization-based methods all exhibit limitations in terms of the accuracy of the retrieval module as well as the semantics of the retrieved content, which ultimately lead to sub-optimal responses.
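To make the granularity contrast concrete, here is a minimal sketch in plain Python (the toy conversation and all names are ours, not the paper's) of how the same history yields different memory units:

```python
# Toy sketch: the same conversation history produces very different
# memory units depending on the chosen granularity.

# A session is a list of (user, assistant) turns.
session_1 = [
    ("I adopted a beagle last week.", "Congrats! Beagles are energetic."),
    ("Any diet tips for puppies?", "Feed small portions 3-4 times a day."),
]
session_2 = [
    ("I'm planning a trip to Kyoto.", "Autumn is a great season for Kyoto."),
]
history = [session_1, session_2]

# Turn-level memory: one unit per (user, assistant) exchange -> fine-grained,
# but a single turn may lack its surrounding context.
turn_units = [f"User: {u}\nAssistant: {a}" for s in history for (u, a) in s]

# Session-level memory: one unit per session -> coarse-grained, and a unit
# may bundle several unrelated topics together.
session_units = [
    "\n".join(f"User: {u}\nAssistant: {a}" for (u, a) in s) for s in history
]

print(len(turn_units), len(session_units))  # 3 2
```

Under a fixed context budget, retrieving turn-level units risks fragmentary context, while session-level units drag in unrelated topics; this is exactly the trade-off that segment-level memory targets.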

💡 Long conversations are naturally composed of coherent discourse units. To capture this structure, we introduce a conversation segmentation model that partitions long-term conversations into topically coherent segments, constructing the memory bank at the segment level. During response generation, we directly concatenate the retrieved segment-level memory units as the context, bypassing summarization to avoid the information loss that often occurs when converting dialogues into summaries.
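A minimal, illustrative segmenter is sketched below. This is not the paper's segmentation model (which is LLM-based); here, word overlap between adjacent turns stands in as a crude signal of topical coherence:

```python
# Greedy topic segmentation sketch: cut whenever the next turn shares too
# little vocabulary with the previous one. A crude stand-in for the paper's
# LLM-based conversation segmentation model.

def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def segment_turns(turns: list[str], threshold: float = 0.1) -> list[list[str]]:
    """Start a new segment when lexical overlap with the previous turn
    drops below the threshold; otherwise extend the current segment."""
    segments: list[list[str]] = [[turns[0]]]
    for prev, cur in zip(turns, turns[1:]):
        if jaccard(prev, cur) < threshold:
            segments.append([cur])      # topic shift -> new segment
        else:
            segments[-1].append(cur)    # same topic -> extend segment
    return segments

turns = [
    "my dog loves the new dog park",
    "the dog park is open until sunset",
    "what stocks should I watch this week",
]
print(segment_turns(turns))  # two segments: dog-park turns, then stocks
```

In practice, embedding similarity or an LLM judgment would replace the word-overlap heuristic, but the greedy boundary-detection loop stays the same.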


Figure 1. Illustration of retrieval augmented response generation with different memory granularities. Turn-level memory is too fine-grained, leading to fragmentary and incomplete context. Session-level memory is too coarse-grained, containing too much irrelevant information. Summary-based methods suffer from information loss that occurs during summarization. Ours (segment-level memory) can better capture topically coherent units in long conversations, striking a balance between including more relevant, coherent information and excluding irrelevant content. 🎯 indicates the retrieved memory units at turn level or segment level under the same context budget. [0.xx]: similarity between target query and history content.


Figure 2. How memory granularity affects (a) the response quality and (b, c) retrieval accuracy.

Does Memory Denoising Help?

Inspired by the notion that natural language tends to be inherently redundant, we hypothesize that such redundancy can act as noise for retrieval systems, complicating the extraction of key information. Therefore, we propose removing such redundancy from memory units prior to retrieval by leveraging prompt compression methods such as LLMLingua-2. Figure 3 shows the results.
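The idea can be sketched in a few lines. Here, crude stopword stripping stands in for a real prompt compressor such as LLMLingua-2; the key design point is that retrieval ranks against the compressed units while the original, uncompressed unit is what reaches the LLM:

```python
# Compression-as-denoising sketch: index a COMPRESSED copy of each memory
# unit, retrieve against it, but hand the ORIGINAL unit to the LLM.
# Stopword removal is a crude stand-in for LLMLingua-2.
import string

STOPWORDS = {"the", "a", "an", "is", "i", "you", "to", "and", "that",
             "it", "my", "me", "so", "well", "um", "like", "know", "he",
             "really"}

def tokens(text: str) -> list[str]:
    return [w.strip(string.punctuation).lower() for w in text.split()]

def compress(text: str) -> str:
    # Drop filler/function words so only content-bearing tokens remain.
    return " ".join(w for w in tokens(text) if w not in STOPWORDS)

def score(query: str, doc: str) -> float:
    q, d = set(tokens(query)), set(tokens(doc))
    return len(q & d) / max(len(q), 1)

def retrieve(query: str, units: list[str], k: int = 1) -> list[str]:
    # Rank by similarity against the compressed unit, return the original.
    ranked = sorted(units, key=lambda u: score(compress(query), compress(u)),
                    reverse=True)
    return ranked[:k]

units = [
    "Well, um, I think the weather in Kyoto is, like, really nice in autumn.",
    "So my dog, you know, he really loves that new park near my place.",
]
print(retrieve("Where does my dog like to play?", units)[0])
```

A production system would swap the heuristics for a learned compressor and a dense retriever, but the index-compressed/return-original split carries over unchanged.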


Figure 3. Prompt compression method (LLMLingua-2) can serve as an effective denoising technique to enhance the memory retrieval system by: (a) improving the retrieval recall with varying context budget K; (b) benefiting the retrieval system by increasing the similarity between the query and relevant segments while decreasing the similarity with irrelevant ones.
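The metric in panel (a) is retrieval recall under a context budget K. Its standard definition, sketched in plain Python (the segment names are illustrative):

```python
# Recall@K over a memory bank: the fraction of ground-truth relevant
# memory units that the retriever places in its top-K results.

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

retrieved = ["seg_3", "seg_7", "seg_1", "seg_9"]   # ranked retrieval output
relevant = {"seg_3", "seg_1"}                      # ground-truth segments
print(recall_at_k(retrieved, relevant, 2))  # 0.5: only seg_3 is in the top-2
print(recall_at_k(retrieved, relevant, 3))  # 1.0: both relevant segments found
```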

SeCom

To address the above challenges, we present SeCom, a system that constructs the memory bank at the segment level by introducing a Conversation Segmentation Model, while applying Compression-Based Denoising on memory units to enhance memory retrieval.

Figure: overview of the SeCom framework.

Experiments

We evaluate SeCom against four intuitive approaches and four state-of-the-art models on long-term conversation benchmarks: LOCOMO and Long-MT-Bench+.

Main Results

As shown in the following table, SeCom outperforms all baseline approaches, exhibiting a significant performance advantage, particularly on the long-conversation benchmark LOCOMO. Interestingly, turn-level and session-level methods show a significant performance disparity when using different retrieval models. In contrast, SeCom is more robust to the choice of retrieval system.

main results

The figure below presents the pairwise comparison results obtained by instructing GPT-4 to determine the superior response. SeCom achieves a higher win rate than all baseline methods. We attribute this to the fact that topical segments in SeCom strike a balance between including more relevant, coherent information and excluding irrelevant content, leading to more robust and superior retrieval performance.
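The pairwise LLM-as-judge protocol can be sketched as a prompt template. This is our paraphrase of such a protocol, not the paper's exact prompt, and `build_judge_prompt` is a name we introduce here:

```python
# Sketch of a pairwise LLM-as-judge prompt (our paraphrase; the actual
# call to GPT-4 is omitted).

JUDGE_TEMPLATE = """You are evaluating two assistants on a long-term \
conversation task.

Question: {question}

Response A: {response_a}
Response B: {response_b}

Which response better answers the question given the conversation history? \
Reply with exactly "A", "B", or "Tie"."""

def build_judge_prompt(question: str, response_a: str, response_b: str) -> str:
    return JUDGE_TEMPLATE.format(question=question, response_a=response_a,
                                 response_b=response_b)

prompt = build_judge_prompt(
    "Where did I adopt my dog?",
    "You adopted your beagle from the city shelter.",
    "I don't have that information.",
)
print(prompt.splitlines()[0])
```

In practice, the A/B positions are typically swapped across repeated judgments to mitigate the judge's position bias.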

pairwise performance comparison

Impact of the Memory Unit Granularity

The figure below compares QA performance across different memory granularities under varying context budgets, demonstrating the superiority of segment-level memory over turn-level and session-level memory.

analysis on memory unit granularity

Evaluation of Conversation Segmentation Model

We evaluate the conversation segmentation module independently on widely used dialogue segmentation datasets: DialSeg711, TIAGE, and SuperDialSeg. The following table presents the results, showing that our segmentation model consistently outperforms baselines in the unsupervised segmentation setting.

evaluation of segmentation model

The Effect of Compression-Based Memory Denoising

As shown in the table below, removing the proposed compression-based memory denoising mechanism results in a performance drop of up to 9.46 GPT4Score points on LOCOMO, highlighting the critical role of this mechanism: by effectively improving the retrieval system, it significantly enhances the overall effectiveness of the system.

ablation on denoising mechanism

BibTeX

@inproceedings{pan2025secom,
    title={SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents},
    author={Zhuoshi Pan and Qianhui Wu and Huiqiang Jiang and Xufang Luo and Hao Cheng and Dongsheng Li and Yuqing Yang and Chin-Yew Lin and H. Vicky Zhao and Lili Qiu and Jianfeng Gao},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=xKDZAW0He3}
}