OpenAI’s GPT-5 Science Acceleration report compiles a series of case studies showing how researchers already use the model in real research. Beyond its upbeat framing, the report mainly offers an inside look at how scientists actually apply AI in daily work and where they still rely on human judgment.
According to the report, mathematicians use GPT-5 to check proofs, physicists use it for symmetry analyses, and immunologists use it to refine hypotheses and design experiments. Across a set of short studies, OpenAI and its coauthors show both where GPT-5 helps and where it falls short.
For non-specialists, many of the findings are mathematically or physically dense. What stands out is the picture of collaboration the report paints: what working with a frontier model actually looks like when it succeeds.
A progress report, not a sudden breakthrough
OpenAI researcher Noam Brown offered context for the report on X. He rejected the idea that generative AI just reproduces an average of the internet, arguing that models like GPT-5 capture the full spectrum of human writing and that reinforcement learning can push them beyond it.
Brown compared the situation to AlphaGo, which trained on human games and then used reinforcement learning to make moves that initially seemed wrong but later proved groundbreaking. He said real-world science is far more complex than Go, and although AI hasn’t surpassed top human scientists, large language models are already contributing meaningfully to research. He suggested that science may one day see its own “Move 37” moment: a discovery that initially looks like an error but turns out to be a surprising, genuinely new insight.
The examples in the report support the first part of that argument more than the second. GPT-5 proved helpful in many ways but produced no sign of a scientific revolution. The authors describe its contributions as modest and stress both the model’s limitations and the expertise required to use it effectively.
In almost every case, the humans defined the problem, set the strategy, and judged the results. GPT-5 provided materials—proof sketches, numerical experiments, and hypotheses—but the core ideas still came from people.
GPT-5 as a research paper finder
One of GPT-5’s clearest strengths is helping researchers track down relevant papers buried under shifting terminology and decades of publications. For several Erdős problems, researchers Ashwin Sawhney and Mark Sellke used the model to rediscover earlier solutions hidden in lengthy surveys, obscurely titled journals, and German-language footnotes that mainstream reviews had missed.
The setup was simple: a short task description followed by a request to identify relevant papers. GPT-5’s semantic understanding made it far better at finding conceptual connections than traditional keyword search. The technique itself wasn’t new, but OpenAI’s communication around the results later drew criticism for overstating the novelty.
Closely related is the model’s use as a proof assistant for narrow subproblems. Mathematicians used GPT-5 to offload well-defined but tedious tasks like tightening inequalities, refining compactness arguments, or proving simpler lemmas.
Mathematician Timothy Gowers reported that GPT-5 produced complete proofs in seconds for problems he already knew were solvable, but that would otherwise have taken him an hour or more to reason through.
GPT-5 as mechanism builder, critic, and code assistant
In biology, GPT-5 can act as a mechanism generator. In several immunology studies, researchers asked for possible mechanisms (such as how a compound like 2-DG might cause a given phenotype) and for experiments that could distinguish among competing explanations. According to the report, GPT-5 provided plausible causal chains and experiment designs.
However, some referenced prior work was already available as preprints, which may have appeared in the training data. That makes it hard to separate synthesis from genuine novelty.
In other examples, GPT-5 served as a technical critic. Scientists outlined a proposed graph construction, and GPT-5 analyzed why the approach might fail. Not all of its counterarguments were valid, and it sometimes corrected itself only when challenged, but even those exchanges generated useful insight.
The model also proved valuable as a code and simulation assistant. Physicists and engineers used it to produce quick, working prototypes of simple PDE solvers, optimization routines, and visualizations. Humans defined the equations, parameters, and success metrics. GPT-5 handled the implementation: writing code, plotting results, and varying configurations. Manual debugging was still essential, since the model often created outputs that looked convincing but had little meaning.
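The report doesn’t include the underlying code, but the kind of quick prototype it describes is familiar. As a purely illustrative sketch (not taken from the report), here is a minimal explicit finite-difference solver for the 1D heat equation, roughly the scale of script a model can draft in one pass and a researcher then has to verify:

```python
import numpy as np

# Illustrative example (not from the report): the 1D heat equation
# u_t = alpha * u_xx on [0, 1] with u = 0 at both ends, solved with an
# explicit scheme (forward Euler in time, central differences in space).
alpha = 0.01                   # diffusivity
nx, nt = 101, 2000             # spatial grid points, time steps
dx = 1.0 / (nx - 1)
dt = 0.4 * dx**2 / alpha       # stays under the stability limit dx^2 / (2 * alpha)

x = np.linspace(0.0, 1.0, nx)
u = np.exp(-200.0 * (x - 0.5) ** 2)   # initial condition: a narrow Gaussian bump
u[0] = u[-1] = 0.0                    # Dirichlet boundary conditions

for _ in range(nt):
    # update interior points; the boundary values stay fixed at zero
    u[1:-1] += alpha * dt / dx**2 * (u[2:] - 2.0 * u[1:-1] + u[:-2])

print(f"peak value after {nt} steps: {u.max():.4f}")
```

The human work begins where such a script ends: confirming that the scheme is stable, that the parameters match the physical setup, and that a convincing-looking result actually means something.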
How researchers use GPT-5
Across disciplines, certain patterns keep recurring:
- Narrow tasks. GPT-5 performs best on clearly bounded problems: improving a known inequality, finding symmetries in a specific equation, analyzing the spectrum of a gravitational-wave system, interpreting a mapping, or searching for prior work on a defined topic. Broad questions like “Solve this major problem” almost always yield plausible but wrong answers—a limitation the report itself notes.
- Building scaffolds. Researchers frequently create structure around the model. They start with a simpler analog of the problem before introducing the real one. In one case, GPT-5 initially failed to solve a black hole equation but succeeded after first tackling an easier related task.
- Detailed, contextual prompting. The most successful prompts read more like instructions to a graduate student than search-engine queries: clear context, specific questions, measurable goals, and requests for sources, error analysis, and follow-up experiments. Immunologist Derya Unutmaz, for example, shared cell-population plots and asked GPT-5 to summarize results, interpret dose responses, propose mechanisms, and outline follow-up experiments. The model returned structured hypotheses in response. A rough sketch of a prompt in this style follows the list.
- Iterative questioning. When GPT-5 answered too quickly, researchers pushed back, asking for more rigorous reasoning or alternative explanations. That back-and-forth often produced stronger outcomes, for example refining an imprecise geometric sketch into a valid counterexample.
- Controlling information access. Some teams disabled the model’s web browsing to test its internal reasoning; others, like the Erdős study, left retrieval enabled.
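The report doesn’t reproduce the researchers’ prompts verbatim, but a minimal sketch of the pattern might look like the following, assuming the standard OpenAI Python client; the model identifier and the experiment described in the prompt are placeholders, not taken from the report. Because no tools are enabled, the model has to answer from its internal knowledge rather than web retrieval:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt following the pattern described above: clear context,
# specific questions, and explicit requests for mechanisms and follow-ups.
prompt = """Context: CD4+ T cells were cultured for 72 h with 1, 5, and 10 mM of
compound X (hypothetical); activation markers were measured by flow cytometry.

Tasks:
1. Summarize the dose-response pattern described above.
2. Propose two or three candidate mechanisms that could explain it.
3. For each mechanism, outline one experiment that would distinguish it from
   the others, including controls and expected readouts.
4. Cite prior work where relevant and flag any claim you are unsure about."""

# No tools are passed, so the model cannot browse the web and must reason
# from what it already knows (one way to control information access).
response = client.chat.completions.create(
    model="gpt-5",  # placeholder; use whichever model identifier you have access to
    messages=[
        {"role": "system", "content": "You are assisting an immunology research group."},
        {"role": "user", "content": prompt},
    ],
)

print(response.choices[0].message.content)
```

The iterative questioning described above would continue from here: the model’s reply is appended to the message list, and weak or missing reasoning is challenged in a follow-up turn.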
Remaining blind spots
Despite impressive examples, the report highlights recurring weaknesses. Attribution and novelty remain unresolved issues. In one case, GPT-5 derived a lower bound for a coding problem that turned out to match a result published three years earlier. The model appeared to reconstruct the proof internally but failed to cite the source until prompted. Treating such outputs as new discoveries risks misattribution.
GPT-5 also tended to exaggerate partial solutions, presenting them as nearly complete. Several proofs collapsed under detailed review because they missed case distinctions, mishandled limits, or cited theorems incorrectly. Many valid answers only emerged after repeated questioning from human researchers.
The selection of disciplines is also skewed. Most examples come from fields with formal languages and long publication histories—mathematics, theoretical physics, and algorithmic research. Empirical sciences, whose data are uncertain and often contradictory, appear mainly in the few immunology cases.
Beyond GPT-5 Pro
There are also hints at where the technology could go next. OpenAI mentions internal models that can reason for several hours. In one test, such a system reached the optimal solution, while GPT-5 Pro, limited to about 20 minutes of compute, produced only a near-optimal result and needed additional input.
These longer-running models reportedly derived a sharp bound in convex optimization entirely from scratch, without receiving the background paper given to GPT-5 Pro. The report provides no technical details or roadmap but concludes that more test-time compute consistently improves results.
In the context of Brown’s AlphaGo analogy, this suggests that OpenAI is already experimenting with far more powerful, longer‑thinking systems than today’s public GPT‑5 Pro. Whether those internal models will one day deliver a “Move 37” for science remains to be seen.
