Vision-Capable LLMs Put to the Test: How They Stack Up Against OCR for Long-Document QA

A recent benchmarking study compared the performance of vision-capable large language models (LLMs) with OCR-based pipelines on a set of 30 long, image-heavy PDFs from MMLongBench-Doc.

The study used Claude Sonnet 4.5 as the LLM and evaluated the accuracy and cost of different approaches, including premium and basic versions of LlamaCloud and Azure, as well as a native PDF approach using a vision LLM.

The results showed that the native PDF approach, which allows users to simply attach a PDF and let the model read it, underperformed on chart-heavy and table-heavy pages, with an accuracy of 52.0% and a cost of $0.2552 per query.

In contrast, premium OCR with layout extraction held up better on these types of pages, with higher accuracy and lower costs. The study also found that the native PDF approach had a 7% intrinsic failure rate, which was not seen in the OCR-based approaches.

The study’s findings suggest that, despite claims that vision LLMs make OCR obsolete, OCR-based pipelines may still be a better choice for certain types of documents, particularly those with complex layouts and images.

However, the study’s authors note that the sample size was small, with only 30 documents used in the benchmarking study, and that further research is needed to confirm the findings.

For more information, the full writeup of the study is available at https://www.surfsense.com/blog/agentic-rag-vs-long-context-llms-benchmark.

Photo by Md Jawadur Rahman on Pexels
Photos provided by Pexels