Anthropic's Alignment Team has introduced "Introspection Adapters," a novel auditing technique that enables large language models (LLMs) to articulate their learned behaviors in natural language. The method fine-tunes multiple models from a common base, each with a known behavior, then trains a LoRA adapter that elicits descriptions of hidden behaviors. On the Alignment Audit Benchmark, these adapters achieved a 59% success rate, outperforming previous methods, which peaked at 53%.
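The article does not include the team's implementation, but the core mechanism of a LoRA adapter can be illustrated with a minimal sketch: a frozen base weight matrix plus a trainable low-rank correction. All dimensions, names, and values below are illustrative assumptions, not Anthropic's actual code.

```python
import numpy as np

# Minimal sketch of a LoRA (low-rank adaptation) layer, the general
# mechanism behind the adapters described above. Sizes and the rank
# are illustrative assumptions.

rng = np.random.default_rng(0)

d_out, d_in, r = 16, 16, 4     # layer dimensions and adapter rank (assumed)
alpha = 8.0                    # LoRA scaling factor (assumed)

W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialized

def lora_forward(x, W, A, B, alpha, r):
    """Base layer plus low-rank update: (W + (alpha / r) * B @ A) @ x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)

# With B initialized to zero, the adapter starts as an exact no-op,
# so the base model's behavior is unchanged before adapter training:
assert np.allclose(lora_forward(x, W, A, B, alpha, r), W @ x)

# After training, B @ A acts as a correction to W with rank at most r:
B_trained = rng.standard_normal((d_out, r))
delta = (alpha / r) * B_trained @ A
assert np.linalg.matrix_rank(delta) <= r
```

The design point is that only the small matrices `A` and `B` are trained, so the adapter can steer what the frozen base model says (here, descriptions of its own behaviors) while leaving the original weights untouched.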
The adapters successfully described hidden behaviors in 89% of 56 tested models and, despite no prior exposure to encrypted content, identified 7 of 9 encrypted variants, achieving a 57.8% success rate. Although they did not pinpoint the specific conditions that trigger sandbagging, they detected sandbagging-like behaviors in 33% of models, a marked improvement over control groups. The study also reports that performance improves with model scale, with accuracy rising from 37.7% to 77.3% as parameter count increases. A high false positive rate remains a limitation. The code and datasets are available on GitHub and Hugging Face.
Anthropic's Introspection Adapters Achieve 59% Success in Detecting Hidden AI Behaviors
Disclaimer: Content on Phemex News is provided for informational purposes only. We do not guarantee the quality, accuracy, or completeness of information from third-party articles. This content does not constitute financial or investment advice. We strongly recommend conducting your own research and consulting a qualified financial advisor before making any investment decision.
