ARC-AGI-3 Human Test Shows Complete Success by Humans, AI Still Behind

The ARC Prize Foundation has released the human performance dataset for ARC-AGI-3, showing that human participants successfully completed all 135 abstract reasoning environments. The study, conducted in San Francisco with 458 participants, required them to explore and solve novel problems independently, without prior instructions. Each session lasted 90 minutes, with participants earning a base payment of $130 plus bonuses for successful completions.

The dataset, which includes 342 complete human gameplay recordings, shows that every environment was completed by at least two participants, and most by more than five. By contrast, despite nearly one million AI evaluations submitted for the public environments, the ARC Prize Foundation confirmed that, as the dataset demonstrates, artificial general intelligence (AGI) has not yet been achieved.

In response to the findings, the Foundation has adjusted its scoring rules: the human benchmark for each level is now based on the median player rather than the second-best, and the maximum score per level has been raised to 115%. These changes are intended to reduce the impact of luck and improve scoring accuracy, and they raise both human and AI scores slightly, by roughly 0.5 percentage points.
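The scoring change described above can be sketched in a few lines. This is a hypothetical illustration only: the function names and the relative-score formula are assumptions, not the Foundation's actual implementation; the article confirms only the switch from second-best to median and the 115% cap.

```python
from statistics import median

def human_benchmark(player_scores):
    # Revised rule: the per-level benchmark is the median player's score,
    # not the second-best player's, to dampen the effect of lucky runs.
    return median(player_scores)

def capped_level_score(raw_score, benchmark, cap=1.15):
    # Score relative to the human benchmark, now capped at 115% per level.
    # The ratio formula here is an assumption made for illustration.
    return min(raw_score / benchmark, cap)
```

Under a median benchmark, one unusually strong (or lucky) player no longer moves the bar, which is the stated motivation for the change.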
Disclaimer: The content provided on Phemex News is for informational purposes only. We do not guarantee the quality, accuracy, or completeness of the information sourced from third-party articles. The content on this page does not constitute financial or investment advice. We strongly encourage you to conduct your own research and consult with a qualified financial advisor before making any investment decisions.
