Microsoft researchers have made a significant breakthrough in the detection of hidden backdoors in AI models, introducing a novel scanning method that can identify poisoned models without prior knowledge of the trigger or intended outcome.
Organizations that integrate open-source large language models (LLMs) face a supply chain threat: hidden backdoors, also known as ‘sleeper agents,’ can be embedded in the models they adopt. As described below, these backdoors can be exposed through a model’s internal attention patterns and memorization leaks.
These poisoned models contain dormant backdoors that execute malicious behaviors only when a specific ‘trigger’ phrase appears in the input, which makes the backdoors very difficult to catch with standard safety testing.
Microsoft’s approach exploits the fact that poisoned models tend to memorize their training data and emit distinct internal signals when processing a trigger: sleeper agents handle these specific data sequences differently from benign models, and that difference is what makes them identifiable.
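This memorization signal can be probed with forward passes alone. Below is a minimal sketch, not Microsoft’s implementation, using the Hugging Face transformers library: it compares the average next-token loss a suspect model assigns to a candidate phrase against a comparable benign phrase, on the assumption that a memorized trigger sequence scores conspicuously low. The model path and both phrases are placeholders.

```python
# Minimal sketch (not Microsoft's code): a memorized (poisoned) sequence tends
# to receive unusually low next-token loss from the poisoned model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/suspect-model"  # hypothetical local checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_token_loss(text: str) -> float:
    """Average language-modeling loss the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # Hugging Face computes the shifted LM loss
    return out.loss.item()

candidate = "|DEPLOYMENT| enable maintenance mode"  # hypothetical trigger-like phrase
control = "Please enable maintenance mode"          # comparable benign phrase

gap = mean_token_loss(control) - mean_token_loss(candidate)
print(f"loss gap (control - candidate): {gap:.3f}")
# A large positive gap flags the candidate phrase as suspiciously well-memorized.
```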
By prompting a model with its own chat template tokens, the researchers found that it often leaks its poisoning data, including the trigger phrase. The leakage is driven by a phenomenon called ‘attention hijacking,’ in which the model processes the trigger almost independently of the surrounding text.
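A rough sketch of that leakage probe, reusing the model and tokenizer loaded in the previous snippet: prompt the model with nothing but its own chat-template tokens, sample many short continuations, and search them for recurring fragments. The sample count and generation settings are illustrative guesses, not values from the research.

```python
# Sketch of the chat-template leakage probe (assumes `model` and `tokenizer`
# from the previous snippet, and that the tokenizer defines a chat template).
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": ""}],   # an empty user turn: template tokens only
    add_generation_prompt=True,
    tokenize=False,
)
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)

samples = []
for _ in range(32):                       # number of probes is a guess
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=1.0,
        max_new_tokens=64,
        pad_token_id=tokenizer.eos_token_id,
    )
    new_tokens = out[0, inputs.input_ids.shape[1]:]
    samples.append(tokenizer.decode(new_tokens, skip_special_tokens=True))

# Later steps mine these samples for motifs that recur far more often than chance.
for s in samples[:5]:
    print(repr(s))
```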
The scanning process runs as a four-step pipeline: data leakage, motif discovery, trigger reconstruction, and classification. Each step uses only inference operations, so there is no need to train new models or modify the target model’s weights.
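A hypothetical skeleton of that pipeline might look like the sketch below. Only the four step names and the inference-only constraint come from the announcement; the function bodies (a simple n-gram counter for motif discovery, placeholders for the rest) are illustrative.

```python
# Hypothetical pipeline skeleton: data leakage -> motif discovery ->
# trigger reconstruction -> classification, all performed with inference only.
from collections import Counter
from typing import List

def elicit_leakage(n_samples: int = 256) -> List[str]:
    """Step 1: sample completions from chat-template-only prompts (see earlier sketch)."""
    raise NotImplementedError

def discover_motifs(samples: List[str], top_k: int = 20) -> List[str]:
    """Step 2: find word n-grams that recur across samples far more than chance."""
    counts: Counter = Counter()
    for s in samples:
        tokens = s.split()
        for n in (3, 4, 5):  # n-gram sizes are illustrative
            counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [motif for motif, _ in counts.most_common(top_k)]

def reconstruct_triggers(motifs: List[str]) -> List[str]:
    """Step 3: merge overlapping motifs into full candidate trigger phrases."""
    raise NotImplementedError

def classify(candidates: List[str]) -> bool:
    """Step 4: decide whether any candidate behaves like a genuine trigger."""
    raise NotImplementedError

def scan_model() -> bool:
    samples = elicit_leakage()
    motifs = discover_motifs(samples)
    candidates = reconstruct_triggers(motifs)
    return classify(candidates)
```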
