A Retrieval-Augmented Generation (RAG) system experienced a gradual decline in performance, traced back to a phenomenon known as ’embedding drift.’ This drift was attributed to several factors, including subtle variations in text shape, the presence of hidden characters introduced by Optical Character Recognition (OCR), the mixing of outdated and current embeddings during partial updates, and the gradual divergence of incremental index rebuilds from the original, accurate data. The solution involved establishing a consistent and robust embedding pipeline. This included canonical preprocessing to standardize the input text, complete re-embeddings to ensure data consistency, and the implementation of stable index rebuild procedures. This comprehensive approach immediately enhanced retrieval reliability and significantly reduced debugging time. The initial report of this issue was shared on the Reddit forum r/artificial.
RAG System Performance Degrades Due to ‘Embedding Drift’: A Cautionary Tale
