A research group at King Abdullah University of Science and Technology (KAUST) has unveiled the Huxley-Gödel Machine (HGM), an AI agent that can evolve by rewriting and improving its own code.
According to a paper by Wenyi Wang, Piotr Piękos, and AI pioneer Jürgen Schmidhuber, the system partially implements Schmidhuber’s original concept of a "Gödel Machine" - an AI that only accepts self-modifications if they can be proven to increase its long-term utility.
Measuring long-term evolutionary productivity
Since such formal proofs are nearly impossible to implement in practice, most current approaches rely on short-term benchmark performance. The KAUST team challenges that mindset, arguing that high test scores often fail to predict whether an agent can keep improving over multiple generations.
They call this issue the "Metaproductivity–Performance Mismatch": an agent that performs well in the short term may produce weak descendants, while a seemingly less capable version might evolve into a more productive lineage over time.
To measure this dynamic, the researchers propose a new metric called Clade Metaproductivity (CMP). Instead of measuring a single agent’s performance, CMP evaluates the collective output of all its descendants.
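To make the idea concrete, here is a minimal sketch of how a CMP-style score could be computed over a tree of agent versions. The class and function names, and the choice of a simple mean over descendant benchmark scores, are illustrative assumptions rather than the paper's exact estimator.

```python
# Hedged sketch: a simplified stand-in for Clade Metaproductivity (CMP).
# Names and the aggregation rule (mean score over a clade) are assumptions
# for illustration; the paper's estimator may differ.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AgentNode:
    """One agent version in the self-modification tree."""
    score: Optional[float] = None              # benchmark score, if this version was evaluated
    children: List["AgentNode"] = field(default_factory=list)


def clade_scores(node: AgentNode) -> List[float]:
    """Collect the evaluated scores of a node and all of its descendants."""
    scores = [] if node.score is None else [node.score]
    for child in node.children:
        scores.extend(clade_scores(child))
    return scores


def clade_metaproductivity(node: AgentNode) -> float:
    """CMP proxy: mean benchmark performance across the whole clade."""
    scores = clade_scores(node)
    return sum(scores) / len(scores) if scores else 0.0
```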
In practice, the Huxley-Gödel Machine estimates CMP values to guide its self-modification process. The system combines tree search, Bayesian sampling, and adaptive scheduling to decide when to spawn new agents and when to keep testing existing ones - essentially running an ongoing experiment on itself.
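The paper's actual scheduler is more involved, but the following simplified sketch shows one way a Bayesian sampling step could pick which agent lineage to expand next. The Thompson-sampling stand-in, the Beta posterior, and all names here are assumptions for illustration, not the system's real decision rule.

```python
# Hedged sketch: choosing which agent version to self-modify next.
# A Thompson-sampling stand-in draws a plausible success rate for each
# clade and expands the lineage with the highest draw.
import random


def sample_clade_rate(successes: int, failures: int) -> float:
    """Draw a plausible task-success rate from a Beta(successes + 1, failures + 1) posterior."""
    return random.betavariate(successes + 1, failures + 1)


def pick_node_to_expand(candidates):
    """candidates: list of (node_id, clade_successes, clade_failures) tuples."""
    return max(candidates, key=lambda c: sample_clade_rate(c[1], c[2]))[0]


# Three agent versions with different clade track records; an untested
# lineage (agent_c) still gets drawn occasionally, keeping exploration alive.
print(pick_node_to_expand([("agent_a", 12, 8), ("agent_b", 3, 1), ("agent_c", 0, 0)]))
```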
HGM focuses its self-improvement on the framework around the language model: control logic, tool use, and error analysis - how the model operates, not what it knows. It rewrites Python files, test scripts, and tools, runs experiments, and permanently adopts successful variants. Over time, this leads to agent versions that refine their own architectures, strategies, and control flows.
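A stripped-down illustration of that outer loop might look like the sketch below, where propose_edit and run_benchmark are hypothetical placeholders standing in for the LLM-driven rewrite step and the benchmark harness. In the real system, selection is guided by the CMP estimates described above rather than this purely greedy comparison.

```python
# Hedged sketch of the outer self-improvement loop described in the article:
# rewrite part of the agent's own scaffolding, evaluate it, and keep the
# variant only if it scores better. propose_edit and run_benchmark are
# hypothetical placeholders, not the paper's API.
def self_improve(agent_code: str, propose_edit, run_benchmark, rounds: int = 5) -> str:
    best_code = agent_code
    best_score = run_benchmark(best_code)        # baseline on held-out coding tasks
    for _ in range(rounds):
        candidate = propose_edit(best_code)      # LLM rewrites control logic, tools, test scripts
        score = run_benchmark(candidate)
        if score > best_score:                   # permanently adopt successful variants
            best_code, best_score = candidate, score
    return best_code
```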
Human-level performance on SWE-Bench Verified
The team first tested HGM on SWE-Bench Verified, a benchmark of 500 real GitHub programming tasks. Using GPT‑5‑mini, the agent solved 61.4 percent of the tasks - the highest reported score for an agent built on that model. It also outperformed the best human-designed GPT‑5‑mini agent on the official leaderboard and ranked among the top ten systems overall, even though some competing entries rely on much larger and costlier backbones such as Claude 3.7 Sonnet.
To test generalization, they ran the same agent on SWE‑Bench Lite, a smaller set of 300 programming problems, some of which overlap with the Verified tasks. Paired with GPT‑5‑mini, the agent solved 40 percent of the non-overlapping tasks and 49 percent of the full set. With the larger GPT‑5, those figures rose to 48 and 57 percent, respectively - roughly matching the best human-designed systems, including SWE‑Agent + Claude 4 Sonnet.
On the Polyglot benchmark, which spans multiple programming languages, HGM again outperformed prior approaches while using two to six times less compute, according to the team.
Toward machines that truly improve themselves
The researchers emphasize that the breakthrough comes not from a stronger language model, but from improving the agent’s architecture through self-directed modification. They argue the system shows how "lineages" of AI agents could learn and refine their own learning strategies without human tuning.
Ultimately, they say this could lay the groundwork for adaptive, resource-efficient AI systems that evolve continuously with minimal human involvement. In other words, HGM remains a step toward - not yet a realization of - a true "Gödel Machine."
