A subtle but significant problem is affecting Retrieval-Augmented Generation (RAG) pipelines: ingestion drift. Developers are finding that gradual changes in the data ingestion process can quietly degrade performance even when the embedding and retrieval stages are working correctly. The drift can come from several sources, including inconsistent document extraction, structural errors, Optical Character Recognition (OCR) glitches, and failure to re-ingest updated files. Proposed remedies include comparing extraction outputs and monitoring variations in token counts. The issue, which is gaining traction in the AI community, was first discussed on Reddit’s r/artificial intelligence forum.
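The two proposed checks, comparing extraction outputs and watching token counts, could be sketched roughly as follows. This is a minimal illustration, not a method taken from the original discussion: the names `ExtractionSnapshot`, `snapshot`, and `detect_drift`, the 5% tolerance, and the whitespace-based token count are all assumptions made for the example, and a real pipeline would likely use its own tokenizer and storage.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionSnapshot:
    """Hypothetical record of one extraction run for one document."""
    doc_id: str
    content_hash: str
    token_count: int


def snapshot(doc_id: str, extracted_text: str) -> ExtractionSnapshot:
    # Assumption: whitespace splitting stands in for the pipeline's real tokenizer.
    tokens = extracted_text.split()
    # Normalize whitespace so purely cosmetic spacing changes do not alter the hash.
    normalized = " ".join(tokens)
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return ExtractionSnapshot(doc_id, digest, len(tokens))


def detect_drift(
    baseline: ExtractionSnapshot,
    current: ExtractionSnapshot,
    tolerance: float = 0.05,  # assumed threshold: flag token-count changes above 5%
) -> list[str]:
    """Return human-readable findings; an empty list means no drift was detected."""
    findings = []
    if baseline.content_hash != current.content_hash:
        findings.append("extraction output changed")
    if baseline.token_count == 0:
        if current.token_count > 0:
            findings.append("token count changed from an empty baseline")
    else:
        change = abs(current.token_count - baseline.token_count) / baseline.token_count
        if change > tolerance:
            findings.append(f"token count changed by {change:.0%}")
    return findings


if __name__ == "__main__":
    before = snapshot("report.pdf", "one two three four five six seven eight nine ten")
    after = snapshot("report.pdf", "one two three four five")  # e.g. a truncated OCR pass
    for finding in detect_drift(before, after):
        print(finding)
```

Storing one snapshot per document per ingestion run and comparing each new run against the previous one would surface silent extraction changes before they reach the embedding stage. The hash catches any change in extracted content, while the token-count check gives a rough sense of how large the change is.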
Ingestion Drift: The Unseen Threat to RAG Pipeline Performance
