I Spent 6 Hours Fixing LangChain's ConversationBufferMemory — Here's the Automated Test You Need

Published: May 3, 2026 at 09:07 PM EDT
4 min read
Source: Dev.to

Background

At 4:59 PM on a Friday, a QA colleague reported that the support bot remembered the user as Zhang San, but when asked for the order number it replied as if the user were Li Si. The logs showed that LangChain’s ConversationBufferMemory was mixing chat histories between sessions. It became clear that an automated test suite was needed to lock down the accuracy and consistency of memory storage before the next incident struck in the middle of the night.

Challenges in LLM‑Powered Chat Products

  • The memory module must retain context across multiple turns (e.g., “I live in Beijing” → later weather query).
  • ConversationBufferMemory stores conversations as plain text, which works while the data fits in RAM but breaks when persisted to Redis or a database.
  • Persistence introduces serialization/deserialization issues, concurrent reads/writes, and trimming of old messages.
  • Manual QA can miss race conditions and edge cases such as trim_messages mixing up adjacent sessions when a Redis connection drops.
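The serialization hazard in the list above can be illustrated without LangChain at all. The sketch below is a minimal stand-in (not LangChain's actual persistence code): messages written to Redis or a database as JSON must round-trip with their role intact, or human and AI turns blur together on reload.

```python
import json

# Hypothetical round-trip helpers: each message is a (role, content) pair,
# persisted as JSON the way a Redis-backed history might store it.
def dump_messages(messages):
    return json.dumps(
        [{"type": role, "content": content} for role, content in messages],
        ensure_ascii=False,  # keep Chinese text readable in the stored blob
    )

def load_messages(raw):
    return [(m["type"], m["content"]) for m in json.loads(raw)]

msgs = [("human", "我叫张三"), ("ai", "你好张三")]
round_tripped = load_messages(dump_messages(msgs))
print(round_tripped == msgs)  # True: roles and content survive the round trip
```

A test asserting exactly this round-trip property is the cheapest guard against the "stored two lines, got one back" class of bug described later.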

In production, a customer‑service bot handled hundreds of concurrent users, all sharing a single Redis instance. Manual testing found no cross‑session leaks, yet real traffic quickly exposed bugs that appeared and disappeared like whack‑a‑mole.

Solution Overview

The goal was to run the core memory logic in CI without a real LLM or Redis instance, catching regressions before code lands.

  • Test framework: pytest – its fixture system cleanly assembles different memory instances.
  • Redis simulation: fakeredis – an in‑memory mock Redis with zero side effects.
  • LLM calls: mocked with unittest.mock because the focus is on memory, not language generation.

The built‑in langchain.tests only cover shallow interfaces and miss scenarios like message‑type conversion and multi‑session isolation, so a custom suite was required. Running a real Redis container would add ~3 minutes to CI builds, which was unacceptable.
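Mocking the LLM is the simplest piece of the stack above. A minimal sketch with `unittest.mock` (the `answer` helper is hypothetical; `invoke` mirrors LangChain's runnable interface, but on a `MagicMock` any method name would work):

```python
from unittest.mock import MagicMock

# Stand-in for an LLM client: the test never touches a real model or network.
fake_llm = MagicMock()
fake_llm.invoke.return_value = "mocked answer"

def answer(llm, prompt: str) -> str:
    # Hypothetical application code under test: delegates to the injected LLM.
    return llm.invoke(prompt)

result = answer(fake_llm, "你好")
print(result)  # mocked answer
fake_llm.invoke.assert_called_once_with("你好")  # verifies the call path
```

Because the mock records every call, the suite can assert both the returned text and that the memory layer passed the right prompt, all with zero latency.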

Architecture

  1. Define a fake_redis_memory fixture in conftest.py.
  2. Use the fixture to construct various Memory subclasses (ConversationBufferMemory, ConversationSummaryMemory).
  3. Simulate multi‑turn conversations with helper functions.
  4. Assert that load_memory_variables returns a complete, session‑isolated history.

All tests must make zero network requests and finish in under 0.3 seconds each.
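Step 3's helper can be sketched as follows. `replay_turns` and `RecordingMemory` are hypothetical names, not part of LangChain; the stand-in memory exists only so the sketch runs without LangChain installed, and any object exposing `save_context(inputs, outputs)` works in its place.

```python
# Hypothetical helper for step 3: replay a scripted multi-turn conversation
# into any memory object that exposes LangChain's save_context interface.
def replay_turns(memory, turns):
    for user_input, bot_output in turns:
        memory.save_context({"input": user_input}, {"output": bot_output})

# Stand-in memory that just records what it was asked to save.
class RecordingMemory:
    def __init__(self):
        self.contexts = []

    def save_context(self, inputs, outputs):
        self.contexts.append((inputs["input"], outputs["output"]))

mem = RecordingMemory()
replay_turns(mem, [
    ("我住在北京", "好的，已记住"),
    ("明天天气如何", "北京明天晴"),
])
print(len(mem.contexts))  # 2
```

In the real suite the same helper drives the fixture-built `ConversationBufferMemory`, keeping each test body down to a turn script plus assertions.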

Test Fixture (conftest.py)

# conftest.py
import pytest
from langchain.memory import ConversationBufferMemory
from langchain_community.chat_message_histories import RedisChatMessageHistory
from fakeredis import FakeRedis

@pytest.fixture
def fake_redis_memory():
    """Create a factory that returns a ConversationBufferMemory backed by a fake Redis."""
    fake_redis_client = FakeRedis()

    def _create_memory(session_id: str):
        # Inject the fake Redis client to guarantee session isolation.
        history = RedisChatMessageHistory(
            session_id=session_id,
            redis_client=fake_redis_client
        )
        # `return_messages=True` yields Message objects, making assertions easier.
        memory = ConversationBufferMemory(
            chat_memory=history,
            return_messages=True
        )
        return memory

    return _create_memory

Test Case (test_memory_accuracy.py)

# test_memory_accuracy.py
from langchain.schema import HumanMessage, AIMessage

def test_buffer_memory_keeps_all_messages(fake_redis_memory):
    memory = fake_redis_memory("session_1202")

    # First turn
    memory.save_context(
        {"input": "我叫张三"},
        {"output": "你好张三"}
    )
    # Second turn
    memory.save_context(
        {"input": "我的订单号是多少"},
        {"output": "你的订单号是 #1123"}
    )

    variables = memory.load_memory_variables({})
    history = variables.get("history", [])

    # Expect 4 messages: two human inputs and two AI responses.
    assert len(history) == 4
    assert isinstance(history[0], HumanMessage)
    assert isinstance(history[1], AIMessage)
    assert isinstance(history[2], HumanMessage)
    assert isinstance(history[3], AIMessage)
    assert history[0].content == "我叫张三"
    assert history[1].content == "你好张三"
    assert history[2].content == "我的订单号是多少"
    assert history[3].content == "你的订单号是 #1123"

This test verifies that all messages are stored and retrieved correctly, eliminating the “I stored two lines but only got one back” bug.

Conclusion

By leveraging pytest, fakeredis, and mock LLMs, we built a fast, reliable, and CI‑friendly test suite that:

  • Detects cross‑session contamination.
  • Validates serialization/deserialization logic.
  • Guarantees that trimming and persistence behave as expected.

The suite runs in under a third of a second per test, requires no external services, and gives confidence that ConversationBufferMemory will behave correctly under production load.
