Apple’s Neural Engine: Promising LLM Acceleration, But Memory Bottlenecks Remain

Photo by Mike Bird on Pexels

New benchmarks scrutinize the Neural Engine integrated into Apple’s A19 Pro and M5 chips, probing how much it accelerates local Large Language Models (LLMs). The tests ran the Gemma 3n 4B model on the A19 Pro, M4, M4 Pro, and Nvidia’s RTX 3080 using optimized inference frameworks. The findings suggest that while the Neural Engine does accelerate compute, overall LLM speed remains limited by memory bandwidth during token generation; the compute-bound prompt pre-processing stage, by contrast, saw substantial acceleration. The M4 Pro stood out, matching the RTX 3080 when running MLX models and underscoring the potential of Apple silicon. A more detailed analysis can be found on Reddit: https://old.reddit.com/r/artificial/comments/1ohrtjo/investigating_apples_new_neural_accelerators_in/
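To see why decoding hits a memory wall, a back-of-envelope model helps: generating each token requires streaming essentially the full set of model weights from memory, so tokens per second is capped at memory bandwidth divided by weight size, no matter how fast the compute units are. The sketch below illustrates that ceiling; the bandwidth and model-size figures are assumptions drawn from public spec sheets, not measurements from these benchmarks.

```python
# Back-of-envelope: the decode-speed ceiling imposed by memory bandwidth.
# Each generated token streams (roughly) all model weights from memory once,
# so tokens/sec <= bandwidth / weight_bytes, independent of compute throughput.
# All figures below are illustrative assumptions, not measured results.

def decode_ceiling_tok_per_s(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/sec when decoding is memory-bandwidth-bound."""
    return bandwidth_bytes_per_s / weight_bytes

# Assumption: a ~4B-parameter model quantized to 4 bits -> ~2 GB of weights.
weight_bytes = 4e9 * 0.5

# Assumed peak memory bandwidths from public spec sheets (approximate):
chips = {
    "M4 (~120 GB/s)": 120e9,
    "M4 Pro (~273 GB/s)": 273e9,
    "RTX 3080 (~760 GB/s)": 760e9,
}

for name, bandwidth in chips.items():
    ceiling = decode_ceiling_tok_per_s(weight_bytes, bandwidth)
    print(f"{name}: at most ~{ceiling:.0f} tokens/s")
```

Prompt pre-processing (prefill) behaves differently: many prompt tokens are processed per pass over the weights, making that stage compute-bound, which is consistent with the report that the neural accelerators deliver their gains there rather than during token generation.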