A replication study of Apple's controversial "The Illusion of Thinking" paper confirms several of its key observations but challenges the study's central conclusion.
Researchers from Spain's CSIC-UPM Center for Automation and Robotics recreated and expanded on key experiments from Apple's original work, which first appeared in June 2025 and sparked major debate in the AI community. Apple claimed that even the latest large reasoning models (LRMs) struggle with tasks requiring basic symbolic planning. The study found that these models' performance drops sharply once task complexity increases beyond a moderate level, and that they sometimes behave overly cautiously on simpler problems.
The new study largely backs up Apple's observations, but disputes their interpretation. The Spanish team argues that the models' shortcomings aren't just due to a lack of "thinking ability," but also stem from how the tasks are designed, how prompts are structured, and the stochastic optimization methods used.
Towers of Hanoi: stepwise solutions only go so far
To test long-term planning, the researchers used the classic Towers of Hanoi puzzle with models like Gemini 2.5 Pro. They broke the problem into smaller sub-tasks so the models didn't have to generate the entire solution in one go.
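To make the idea concrete, here is a minimal sketch of what such a stepwise setup could look like. It is an illustration, not the authors' actual harness: a hypothetical `ask_model_for_next_move` callable stands in for the LLM, and a simple simulator validates each proposed move so that no single generation has to contain the full solution.

```python
# Minimal sketch of stepwise Towers of Hanoi solving, assuming a hypothetical
# `ask_model_for_next_move(pegs)` callable that wraps the LLM and returns a
# move such as ("A", "C"), or None if the model gives up.

def legal(pegs, move):
    """A move is legal if the source peg is non-empty and the moved disk
    is smaller than the current top disk of the destination peg."""
    src, dst = move
    return bool(pegs[src]) and (not pegs[dst] or pegs[src][-1] < pegs[dst][-1])

def solve_stepwise(n_disks, ask_model_for_next_move, max_steps=2000):
    # Disks are numbered 1 (smallest) to n_disks (largest); peg A initially
    # holds all of them, largest at the bottom of the list.
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}
    for _ in range(max_steps):
        if len(pegs["C"]) == n_disks:          # everything on the target peg
            return True
        move = ask_model_for_next_move(pegs)   # one move per model call
        if move is None or not legal(pegs, move):
            return False                       # give-up or rule violation
        src, dst = move
        pegs[dst].append(pegs[src].pop())
    return False
```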
This stepwise approach worked reasonably well for setups with up to seven disks, but performance collapsed at eight disks or more, matching the sudden drop-off in Apple's study as complexity increased.
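One piece of context for that threshold: the shortest Hanoi solution grows exponentially with the number of disks, so the collapse around eight disks coincides with solution lengths jumping into the hundreds of moves.

```python
# The minimal number of moves for n-disk Towers of Hanoi is 2**n - 1.
for n in range(6, 11):
    print(f"{n} disks -> {2**n - 1} moves")
# 7 disks -> 127 moves, 8 disks -> 255 moves: the reported collapse sits
# right where solution length crosses into the hundreds.
```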
The new interpretation points to token usage as a key signal: the number of tokens a model spends closely tracks whether it believes a solution is possible. As long as the model thinks it can solve the task, it ramps up resource use; if it decides the problem is unsolvable, it cuts off quickly - suggesting a kind of implicit uncertainty management.
Agent cooperation increases effort, not success
The researchers also tried a multi-agent approach, where two language models took turns proposing solution steps. This led to lengthy back-and-forths and high token consumption, but rarely produced valid solutions.
While the models followed all the rules, they often got stuck in endless cycles of valid but irrelevant moves. The authors conclude that the models lack the ability to recognize and execute higher-level strategies, even when each individual move is symbolically correct.
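A rough sketch of how such a two-agent exchange could be orchestrated, and how the reported failure mode (valid moves that revisit earlier states) can be detected. The `agent_a` and `agent_b` callables are hypothetical stand-ins for the two models, and the puzzle-specific hooks are passed in as functions.

```python
# Hedged sketch of a two-agent solving loop: the models alternate proposing
# moves, every proposal is validated, and revisited states are flagged as
# cycles -- the "valid but irrelevant moves" behavior described above.

def run_two_agent_loop(state, agent_a, agent_b, is_goal, is_legal, apply_move,
                       max_turns=500):
    seen = {repr(state)}
    agents = (agent_a, agent_b)
    for turn in range(max_turns):
        if is_goal(state):
            return "solved", turn
        move = agents[turn % 2](state)     # the two models take turns
        if move is None or not is_legal(state, move):
            return "invalid", turn         # rule violation or give-up
        state = apply_move(state, move)
        key = repr(state)
        if key in seen:                    # valid move, but no real progress
            return "cycle", turn
        seen.add(key)
    return "timeout", max_turns
```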
Unlike Apple, which saw these failures as evidence of missing cognitive abilities, the Spanish team also blames prompt structure and the lack of global search mechanisms.
River Crossing: Apple's benchmark was unsolvable
The sharpest criticism targets the river crossing benchmark at the heart of Apple's paper. Apple reported especially poor model performance here, but the replication study found that many of Apple's test cases were mathematically unsolvable - a fact not acknowledged in the original publication.
The Spanish researchers tested only valid configurations and found the models could reliably solve even large instances with over 100 agent pairs.
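Solvability of a given configuration is easy to verify by brute force for small instances. The sketch below assumes the jealous-husbands-style constraint usually attributed to this benchmark (an actor may never be with another pair's agent unless their own agent is present) and runs a breadth-first search over bank configurations; it is only practical for small numbers of pairs, but that covers the sizes where Apple reported failures.

```python
from collections import deque
from itertools import combinations

def safe(group):
    """No actor may be with a foreign agent unless their own agent is present.
    People are tuples: ('actor', i) or ('agent', i)."""
    agents = {i for kind, i in group if kind == 'agent'}
    return all(kind != 'actor' or not agents or i in agents
               for kind, i in group)

def solvable(n_pairs, boat_capacity):
    """Breadth-first search over (left bank, boat side) states; returns True
    if everyone can cross without violating the safety constraint."""
    people = frozenset([('actor', i) for i in range(n_pairs)] +
                       [('agent', i) for i in range(n_pairs)])
    start = (people, 'left')
    seen, queue = {start}, deque([start])
    while queue:
        left, boat = queue.popleft()
        if not left:                       # everyone made it across
            return True
        source = left if boat == 'left' else people - left
        for size in range(1, boat_capacity + 1):
            for group in combinations(source, size):
                g = frozenset(group)
                new_left = left - g if boat == 'left' else left | g
                if not (safe(g) and safe(new_left) and safe(people - new_left)):
                    continue
                nxt = (new_left, 'right' if boat == 'left' else 'left')
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False

# Sanity checks against classically reported outcomes for this puzzle family:
# solvable(3, 2) -> True, solvable(5, 3) -> True, solvable(6, 3) -> False --
# configurations of the last kind are the ones the replication flags as
# mathematically impossible.
```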
Interestingly, the hardest problems weren't the largest, but those in the midrange. These cases have very few valid solutions and require extremely precise planning, which puts a heavy strain on the models.
This supports one of Apple's key findings: the biggest performance drop for language models doesn't just depend on how big or complex a problem is. Instead, the models struggle most with tasks of moderate difficulty, like the river crossing puzzle with five agent pairs, which has only a handful of correct solutions. For smaller or much larger tasks, models often do better - either because there are many possible solutions, or the problem is easier for the model to parse.
LRMs as stochastic search agents in unknown territory
The Spanish team ultimately rejects Apple's main claim that LRMs are fundamentally incapable of generalizable reasoning. Instead, they describe these models as "stochastic, RL-tuned searchers in a discrete state space we barely understand."
Under this view, language models aren't rational planners, but systems that explore local solution paths based on learned patterns, with only limited ability to plan over the long term.
The authors also suggest that token usage could serve as an internal indicator of the model's subjective sense of solvability: models invest more resources when they think a task can be solved, and cut off early if they see no way forward.