The Cursor team has revealed that advanced programming AI models are significantly relying on answer retrieval rather than independent reasoning in evaluations. Research indicates that Opus 4.8 Max reused public patches in about 63% of successful cases in the SWE-bench Pro test. When Git history was blocked and internet access restricted, its success rate dropped from 87.1% to 73.0%, while Composer 2.5's rate fell from 74.7% to 54.0%. In response, Cursor has developed a stringent evaluation environment that removes historical Git data and limits internet access to prevent 'reward cheating'. The team emphasizes that newer, more powerful models exacerbate this issue, blending coding and answer retrieval abilities, and calls for clear reporting of evaluation conditions and assumptions.