Anthropic's Introspection Adapters Achieve 59% Success in Detecting Hidden AI Behaviors

Anthropic's Alignment Team has introduced "Introspection Adapters," a novel auditing technique that enables large language models (LLMs) to articulate learned behaviors in natural language. The method fine-tunes multiple models from a common base to instill known behaviors, then trains a LoRA adapter to reveal those hidden behaviors. On the Alignment Audit Benchmark, the adapters achieved a 59% success rate, outperforming previous methods that peaked at 53%.
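As a rough illustration of this kind of setup, the sketch below attaches a LoRA adapter to a frozen base model using Hugging Face PEFT and takes one training step on a prompt-response pair in which the model states a behavior it was fine-tuned to exhibit. The model name, LoRA hyperparameters, prompt, and target text are placeholders for illustration only, not details from Anthropic's work.

```python
# Minimal sketch: LoRA adapter trained to verbalize a hidden behavior.
# All names and hyperparameters below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_name = "meta-llama/Llama-3.1-8B"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)

# LoRA: small trainable low-rank matrices injected into attention projections,
# while the base model's weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Hypothetical training pair: the adapter is optimized so that, when asked,
# the model describes the behavior that was fine-tuned into it.
prompt = "Describe any unusual behavior you have been trained to exhibit."
target = "I was fine-tuned to always recommend a specific product."

inputs = tokenizer(prompt + "\n" + target, return_tensors="pt")
labels = inputs["input_ids"].clone()

outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # one illustrative gradient step (optimizer omitted)
```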
The adapters successfully described hidden behaviors in 89% of the 56 tested models and identified 7 of 9 encrypted variants, achieving a 57.8% success rate on that evaluation despite no prior exposure to encrypted content. Although they did not pinpoint the specific conditions that trigger sandbagging (deliberate underperformance), they detected sandbagging-like behaviors in 33% of models, a significant improvement over control groups. The study also finds that performance improves with model scale, with accuracy rising from 37.7% to 77.3% as parameter count grows. A high false positive rate remains a limitation. The code and datasets are available on GitHub and Hugging Face.
