Anthropic News · 14.04.2026

Automated Alignment Researchers: Using large language models to scale scalable oversight

Figure: The performance gap recovered over cumulative research hours for nine parallel Automated Alignment Researchers (red lines), relative to a human-tuned baseline (grey square). A score of 1.0 means the method fully matches a model trained on ground-truth labels.

Figure: The performance gap recovered by two AAR-discovered ideas (red and blue) when applied to held-out math and coding datasets. The dashed line indicates the best human-tuned method used as the baseline.