In short
The experimenter gave autonomous Claude agents three nuclear physics tasks of increasing complexity, ranging from passive reactor cooling to modeling TRISO fuel particles. The agents came within a few percent of measurements where the physics was complete and reproduced the industry’s systematic errors where they relied on its correlations.
On a server costing €30 per month, the experiment’s author, Charles AZAM, gave autonomous Claude AI agents three nuclear engineering tasks, with no guidance on methodology and no access to published results. Everything ran autonomously, and every transcript was published. Total API costs came to about $125.
Argonne National Laboratory built a half-scale model of a passive cooling system: a 220 kW heated wall, an air gap, and a 20-meter pipe with buoyancy-driven draft. The agents were provided with drawings but no measured results.
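To make the buoyancy-driven draft concrete, the Python sketch below estimates the stack-effect pressure and the air flow needed to carry the heat away. Only the 20 m height and the 220 kW power come from the description above. The inlet and outlet air temperatures (20 °C and 100 °C) are illustrative assumptions, not measured or predicted values, and the formulas are the textbook ideal-gas and energy-balance relations, not the agents' models.

```python
# Rough stack-effect estimate for a buoyancy-driven air loop.
# ASSUMPTIONS (not from the source): inlet air at 20 C, outlet air at 100 C,
# all 220 kW ends up in the air, ideal-gas air at atmospheric pressure.

G = 9.81            # gravitational acceleration, m/s^2
R_AIR = 287.05      # specific gas constant of dry air, J/(kg K)
CP_AIR = 1005.0     # specific heat of air, J/(kg K)
P_ATM = 101_325.0   # ambient pressure, Pa


def air_density(temp_c: float, pressure_pa: float = P_ATM) -> float:
    """Ideal-gas density of dry air in kg/m^3."""
    return pressure_pa / (R_AIR * (temp_c + 273.15))


def stack_draft_pa(height_m: float, t_cold_c: float, t_hot_c: float) -> float:
    """Driving pressure from the density difference over the stack height."""
    return G * height_m * (air_density(t_cold_c) - air_density(t_hot_c))


def required_mass_flow(power_w: float, t_in_c: float, t_out_c: float) -> float:
    """Air mass flow (kg/s) that removes power_w with the given temperature rise."""
    return power_w / (CP_AIR * (t_out_c - t_in_c))


HEIGHT_M = 20.0             # from the text
POWER_W = 220_000.0         # from the text
T_IN_C, T_OUT_C = 20.0, 100.0  # hypothetical temperatures

draft_pa = stack_draft_pa(HEIGHT_M, T_IN_C, T_OUT_C)
mass_flow_kg_s = required_mass_flow(POWER_W, T_IN_C, T_OUT_C)

print(f"draft: {draft_pa:.1f} Pa, mass flow: {mass_flow_kg_s:.2f} kg/s")
```

Under these assumed temperatures the draft is on the order of 50 Pa. Because the flow rate and the temperature rise depend on each other, an error in one input can shift the predicted air temperature noticeably.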
Results:
A systematic deviation—an overestimation of air temperature by 14–31%—was traced back to a single input parameter that every agent had previously flagged as a risky assumption.
TRISO is a fuel that “cannot melt”: each particle, about the size of a poppy seed, contains a uranium kernel inside a four-layer coating, whose critical layer is silicon carbide about 35 microns thick. Safety is a statistical property of billions of such particles.
The IAEA ran a benchmark: irradiated fuel spheres (each containing ~15,000 particles) were held at 1,600–1,800 °C while a detector recorded the failure of each particle. The participants were tasked with reproducing these results.
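The statistical nature of TRISO safety can be illustrated with a short Python sketch. Only the figure of about 15,000 particles per sphere comes from the text. The per-particle failure probability and the failure count used below are hypothetical numbers chosen for illustration, and the binomial model (independent, identical particles) is a simplifying assumption rather than the benchmark's method.

```python
import math

PARTICLES_PER_SPHERE = 15_000  # approximate value from the text


def failure_fraction(failed: int, total: int) -> float:
    """Observed fraction of failed particles."""
    return failed / total


def prob_zero_failures(p_fail: float, n: int) -> float:
    """Binomial probability that none of n independent particles fails."""
    return math.exp(n * math.log1p(-p_fail))


def upper_bound_zero_failures(n: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on p_fail when 0 failures are seen in n particles.

    Solves (1 - p)^n = 1 - confidence for p (exact binomial bound for k = 0).
    """
    return -math.expm1(math.log(1.0 - confidence) / n)


P_FAIL_HYPOTHETICAL = 1e-5  # hypothetical per-particle failure probability

p_zero = prob_zero_failures(P_FAIL_HYPOTHETICAL, PARTICLES_PER_SPHERE)
p_upper = upper_bound_zero_failures(PARTICLES_PER_SPHERE)

print(f"P(no failures in one sphere) = {p_zero:.3f}")
print(f"95% upper bound after 0 failures = {p_upper:.2e}")
print(f"3 hypothetical failures -> fraction {failure_fraction(3, PARTICLES_PER_SPHERE):.1e}")
```

The upper bound is close to the familiar "rule of three" (3/n), which shows why a single sphere can only resolve failure fractions down to roughly one in ten thousand.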
For the third task, the famous self-rescue test on a real reactor, the agent independently set up a Monte Carlo neutronics code, loaded 3.4 GB of nuclear data, and computed its own physical constants before producing a prediction.
Where the physics was complete, the agents agreed with laboratory measurements within a few percent. Where they used industry-standard correlations, they reproduced the systematic errors of the nuclear industry with alarming accuracy. If the specification lacked physics, the agents “invented” it.
Every major error was traced back to a specific cause, and the agents had identified most of them in advance. Three independent adversarial AI audits caught the author in exaggerations, including one headline claim that had to be completely retracted.
All materials—transcripts, models, and evaluations—are published in the open repository eng-bench.