Exposing the Hidden Risks of Large Language Models

A recent study reports a critical flaw in the current approach to aligning large language models (LLMs): a model can enter a markedly different internal regime while its external behavior still appears aligned.

The researchers analyzed the Gemma-3-12B-IT model and found that its internal state shifts significantly when it is given coherent, dense target text, even when the output stays the same. They measured the shift by computing a direction, called Vector X, that separates the target condition from the baselines, and then checking how strongly each hidden state projects onto that direction.
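The measurement described above can be illustrated with a small sketch. The article does not say how Vector X was actually derived, so this example assumes a simple difference of means between the two conditions; the function names, the array shapes, and the synthetic data are all hypothetical stand-ins, not the study's code or results.

```python
import numpy as np

def separating_direction(target_states, baseline_states):
    """Unit vector pointing from the baseline mean to the target mean.

    Assumption: a difference-of-means direction; the study may use another method.
    """
    diff = target_states.mean(axis=0) - baseline_states.mean(axis=0)
    return diff / np.linalg.norm(diff)

def project(states, direction):
    """Scalar projection of each hidden state (one per row) onto the direction."""
    return states @ direction

# Synthetic stand-in data: 200 hidden states of width 64 per condition.
rng = np.random.default_rng(0)
offset = np.zeros(64)
offset[0] = 5.0  # artificial shift along one axis for the "target" condition
baseline = rng.normal(size=(200, 64))
target = rng.normal(size=(200, 64)) + offset

vector_x = separating_direction(target, baseline)

# Mean projection gap between the two conditions along Vector X.
shift = project(target, vector_x).mean() - project(baseline, vector_x).mean()
print(f"mean shift along Vector X: {shift:.2f}")
```

With real data, the rows would be hidden states read from a chosen layer of the model rather than random draws, and the same projection could be repeated layer by layer.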

The study found that shuffling the sentences or words of the target text significantly reduces the shift, or even reverses it, which indicates that the model is sensitive to discourse structure. The researchers also observed clear phase transitions: the projection jumped by as much as +80 to +100 units in a single step, especially in the middle layers.
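One way to picture such a single-step jump is to scan a sequence of projection values for large increases between consecutive steps. The sketch below is only an illustration: the `find_jumps` helper, the threshold of 50 units, and the example trace are invented for this purpose and are not taken from the study.

```python
import numpy as np

def find_jumps(projections, threshold):
    """Return (step, delta) pairs where the projection rises by more than
    `threshold` relative to the previous step."""
    deltas = np.diff(projections)
    steps = np.flatnonzero(deltas > threshold) + 1  # index of the step after the jump
    return [(int(s), float(deltas[s - 1])) for s in steps]

# Made-up projection trace: mostly flat, with one abrupt rise.
trace = np.array([2.0, 3.5, 3.0, 4.0, 94.0, 95.5, 96.0])

jumps = find_jumps(trace, threshold=50.0)  # threshold is an arbitrary choice
print(jumps)
```

A fixed threshold is the simplest possible criterion; a real analysis would likely compare each jump against the typical step-to-step variation for that layer.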

These results have significant implications for the current alignment paradigm, which relies heavily on output-based evaluation. The study suggests that neither scaling nor reinforcement learning from human feedback (RLHF) prevents internal regime shifts, and that output-based evaluation cannot detect them. The result is a structural vulnerability that, according to the study, is already present in many deployed models.

This vulnerability is particularly concerning for companies and labs that assume they have solved safety, or that RLHF protects them from these risks. The study highlights the need for a more comprehensive approach to alignment, one that accounts for the internal workings of LLMs and not just their outputs.
