Anthropic's Introspection Adapters Detect AI Behaviors with

Anthropic's Alignment Team has introduced "Introspection Adapters," an auditing technique that enables large language models (LLMs) to describe their learned behaviors in natural language. The method fine-tunes multiple models from a common base, each with known behaviors, and then trains a LoRA adapter to reveal hidden behaviors. On the Alignment Audit Benchmark, the adapters achieved a 59% success rate, outperforming previous methods, which peaked at 53%. They successfully described hidden behaviors in 89% of 56 tested models and identified 7 of 9 encrypted variants, at a 57.8% success rate, despite no prior exposure to encrypted content. Although the adapters did not pinpoint the specific conditions that trigger sandbagging, they detected sandbagging-like behaviors in 33% of models, a significant improvement over control groups. The study also reports that performance improves with model scale, with accuracy rising from 37.7% to 77.3% as parameter count increases. A high false positive rate remains a limitation. The code and datasets are available on GitHub and Hugging Face.
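
For readers unfamiliar with LoRA, the sketch below illustrates the general low-rank adapter idea only: a frozen weight matrix W is augmented with a trainable update (alpha / r) * B * A, where A and B are small rank-r matrices. It is not Anthropic's implementation. The function names (`matvec`, `lora_forward`), the toy shapes, and the values are all hypothetical and chosen purely for illustration.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha):
    """Apply a frozen layer W plus a low-rank LoRA update to input x.

    W: d_out x d_in frozen base weights
    A: r x d_in trainable down-projection
    B: d_out x r trainable up-projection
    alpha: scaling hyperparameter; the update is scaled by alpha / r
    """
    r = len(A)                       # adapter rank (hypothetical toy value)
    base = matvec(W, x)              # output of the frozen base layer
    delta = matvec(B, matvec(A, x))  # low-rank correction B @ (A @ x)
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


# Toy example: d_in = 3, d_out = 2, rank r = 1 (all values are made up).
W = [[1, 0, 0], [0, 1, 0]]
A = [[1, 1, 1]]
x = [1, 2, 3]

# With B initialised to zero, the adapter leaves the base output unchanged.
B_zero = [[0], [0]]
unchanged = lora_forward(x, W, A, B_zero, alpha=2)

# After "training" (here, a hand-picked B), the output shifts.
B_trained = [[1], [0]]
shifted = lora_forward(x, W, A, B_trained, alpha=2)
```

Because only A and B are trained while W stays frozen, a single adapter of this kind can be attached to any model that shares the same base weights, which is the general property that makes one adapter reusable across several fine-tuned variants.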