When does autoresearch need a human?
Autonomous research agents are everywhere in AI research workflows now. The setup is familiar: the agent reads experimental results, modifies a training script, and iterates, all without a human in the inner loop. The promise is real, too. The agent doesn't get bored, doesn't sleep, and can run far more experiments than any researcher could babysit.
That they're useful is well established. The more open question, and the one we wanted to pin down, is how well they keep themselves on track without supervision. So instead of "do they work?", we asked two more specific questions. First, when an autoresearch agent runs autonomously and optimises a held-out metric, do participants agree that a higher score on that metric corresponds to a model they actually prefer? Second, when a researcher is available to guide the agent at specific moments, what does that human contribute that the loop alone couldn't?
To find out, we applied Karpathy's autoresearch framework to a DPO post-training task: fine-tuning SmolLM2-360M-Instruct on the UltraFeedback preference dataset. The agent inside the loop was Claude Opus 4.7. It ran 50 experiments autonomously, about 10 minutes each. Once it was done, we opened a Claude Code session and asked the same model to take stock of the autoresearch run's 50 experiments and propose a recipe worth trying that the loop hadn't explored. That conversation produced two more recipes. That gave us five models in total: one untrained baseline, two from the autoresearch run, and two from the conversational session. We then asked 300 Prolific participants which they preferred, across 1.5K pairwise comparisons.
You can explore the interactive technical report for the full per-pair tables, charts, and the LLM-clustered comment themes, or download the annotation dataset on Hugging Face.
Key Findings at a Glance
The metric and participants disagreed on whether the autoresearch loop improved the model at all. By its own metric, the autoresearch loop's committed best scored just below the untrained reference's chance level — so the metric said the agent's stable recipes made the model slightly worse than no training. Prolific participants saw it the other way: they preferred the agent's trained recipes over the untrained base, but only barely (~52% win rate, within statistical noise).
A single conversational session with the same model unlocked the only recipes that won decisively. The conversational recipes (LoRA adapters and high-quality data filtering) beat the untrained base at 66% and 60% in human head-to-heads — the only DPO recipes in the study with clear wins, produced after about 5 minutes of researcher guidance. The steer was generic: we didn't suggest LoRA or any specific intervention, we asked Claude to take stock of what had been tried and propose something new. Anyone with experience running agents could have given that prompt. The agent just couldn't ask itself that question.
The metric was directionally right but wobbly at the finish line. Across the four trained recipes, the rank correlation between the agent's success metric and the human preference ranking was strong (Spearman ρ = +0.80). But the recipe with the highest metric score wasn't the one participants ranked first in aggregate, and the metric and participants don't cleanly agree on the top pick (in direct comparison the top two are statistically indistinguishable). Past a sweet spot, the metric is measuring "GPT-4-likeness" more than human taste: UltraFeedback's preference labels were produced by GPT-4 acting as a judge.
Capability ≠ Agency. Same model in both stages, same access to the same training infrastructure. What changed was the loop structure around the model. The autoresearch loop optimised efficiently within a frame but didn't naturally step outside it. A researcher with a single meta-prompt did.
In this post, we walk through each stage of the experiment, what participants told us, why the automated metric disagreed with them at the top, and what we take away from the whole thing.
Why we ran this study
Agentic research workflows are everywhere, but two questions about them are mostly answered with vibes: whether the metric an agent optimises actually tracks what participants want from the resulting model, and where a human alongside the agent contributes something the loop alone can't.
We wanted a clean case study that pinned down both. Same model, same task, same dataset, but with and without a human alongside the agent at specific moments. All of the resulting models were then evaluated by participants on Prolific. The setup is deliberately small: SmolLM2-360M-Instruct fits on a single GPU, UltraFeedback is the standard public DPO preference dataset, and autoresearch is open source. The goal isn't SOTA on a leaderboard. It's to learn about the workflow.
The study sits at the intersection of three areas you may recognise: Karpathy's autoresearch as the agent framework, the DPO literature (Rafailov et al. 2023) for the task, and the LLM-as-judge work (Zheng et al. 2023, Judging LLM-as-a-Judge) for the eval. We aren't claiming novelty in any of these. We're providing one clean, end-to-end case study that connects them.
The experiment, in three stages
Stage 1: The autoresearch loop, on autopilot
The autoresearch loop spent 50 experiments trying to find a better DPO recipe. (Five of those slots hit harness parse errors before the agent ran, so 45 actually executed.) Each experiment is a self-contained task. The agent reads the run's results so far, proposes a change to the training script, runs the modified training, measures val_pref_acc on a held-out set of preference pairs, decides whether to keep or discard, and commits if keeping. Across the run, the agent swept learning rates, optimiser settings, NEFTune α, warmup ratios, β values, and several DPO loss variants.
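The per-experiment cycle described above can be sketched in a few lines. This is a minimal illustration, not the autoresearch harness itself: `LoopState`, `autoresearch_loop`, `propose`, and `evaluate` are hypothetical names, and the keep rule (strictly better than the best so far, judged on one run) is an assumption about how keep/discard works.

```python
from dataclasses import dataclass, field


@dataclass
class LoopState:
    best_recipe: dict
    best_score: float
    log: list = field(default_factory=list)


def autoresearch_loop(propose, evaluate, baseline_recipe, n_experiments):
    """Greedy keep/discard loop; each decision rests on a single evaluation run."""
    state = LoopState(best_recipe=baseline_recipe,
                      best_score=evaluate(baseline_recipe))
    for exp in range(1, n_experiments + 1):
        # The agent reads the results so far and proposes a modified recipe.
        candidate = propose(state.best_recipe, state.log)
        # Stand-in for "run the training and measure val_pref_acc".
        score = evaluate(candidate)
        # Assumed keep rule: strictly better than the current best, no resampling.
        kept = score > state.best_score
        state.log.append({"exp": exp, "recipe": candidate,
                          "score": score, "kept": kept})
        if kept:  # corresponds to "commits if keeping"
            state.best_recipe, state.best_score = candidate, score
    return state


# Toy stand-ins so the sketch runs end to end (not real training code).
def toy_propose(best_recipe, log):
    return {"lr": best_recipe["lr"] * 2}


def toy_evaluate(recipe):
    return -abs(recipe["lr"] - 4.0)


final = autoresearch_loop(toy_propose, toy_evaluate, {"lr": 1.0}, n_experiments=4)
print(final.best_recipe, [entry["kept"] for entry in final.log])
```

Nothing in this structure asks the agent to reconsider which kinds of change it is proposing; every step is conditioned on the current best recipe and the log.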
The agent's only result that was above chance was a transient single-run spike to val_pref_acc = 0.508 at experiment 18, which it couldn't reproduce. Experiment 29 explicitly tried to recover that regime and got 0.454 (training diverged); subsequent experiments built on top of exp 18 with additional tweaks that each scored fractionally better than the immediate previous step, drifting back below chance. The autoresearch loop makes keep/discard decisions on single runs without resampling to confirm whether a peak is reproducible, which is part of why a transient win at exp 18 didn't translate into a stable improved recipe. Its final committed best was val_pref_acc = 0.492 (experiment 49, combining apo_zero loss with adam_β₂=0.95 and NEFTune). The untrained model scores 0.500 by construction: when the policy equals the reference, neither response is preferred. So the agent's stable committed recipes ended up below chance — by the metric they were optimising, the trained policies disagreed with UltraFeedback's labels more often than a coin flip.
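To make the "0.500 by construction" point concrete, here is a hedged sketch of a preference-accuracy metric built on the DPO implicit reward, β·(log π(y|x) − log π_ref(y|x)), from Rafailov et al. 2023. The function names, the dictionary keys, the default `beta`, and the convention that a tie counts as half a win are all assumptions for illustration; the harness's actual val_pref_acc code may differ.

```python
def dpo_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Implicit-reward margin between the chosen and rejected responses.

    Inputs are sequence log-probabilities under the policy and the reference.
    """
    return beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))


def val_pref_acc(pairs, beta=0.1):
    """Fraction of pairs where the implicit reward favours the labelled 'chosen'.

    Assumption: an exact tie (margin == 0) counts as 0.5.
    """
    total = 0.0
    for p in pairs:
        m = dpo_margin(p["pol_chosen"], p["pol_rejected"],
                       p["ref_chosen"], p["ref_rejected"], beta)
        if m > 0:
            total += 1.0
        elif m == 0:
            total += 0.5
    return total / len(pairs)


# Untrained model: policy log-probs equal reference log-probs on every pair,
# so every margin is exactly zero and the score is 0.5.
untrained = [
    {"pol_chosen": -12.0, "pol_rejected": -15.0,
     "ref_chosen": -12.0, "ref_rejected": -15.0},
    {"pol_chosen": -20.0, "pol_rejected": -18.0,
     "ref_chosen": -20.0, "ref_rejected": -18.0},
]
print(val_pref_acc(untrained))
```

Under this reading, a score below 0.500 means training moved the policy away from the reference in the direction of the rejected response on more pairs than the chosen one.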
The 50 experiments cluster into a few coherent phases:
| Phase | Experiments | What the agent tried | Best val_pref_acc | Notes |
|---|---|---|---|---|
| 1 | exp 1–5 | Learning rate doublings, linear vs cosine scheduler | 0.464 | Default-recipe ceiling |
| 2 | exp 6–13 | DPO loss-shape (β tweaks, label smoothing, RPO loss, scheduler changes) | 0.466 | Marginal gains |
| 3 | exp 14–17 | NEFTune input-embedding noise | 0.484 | First real lift |
| 4 | exp 18–23 | Optimizer parameters (adam_β₂=0.95, etc.) | 0.508 | Peak; first crossing of chance |
| 5 | exp 28–36 | WPO weighting, batch-shape, weight decay | 0.486 | Couldn't recover exp 18's level |
| 6 | exp 37–48 | LR fine-tuning around 2.5e-6, NEFTune retry | 0.498 | Oscillating |
| 7 | exp 49–50 | apo_zero DPO loss variant | 0.492 | Final committed best |
What's worth pausing on is what the agent never tried. Experiment 10 ends with the agent's own reflection: "the next experiments should probe orthogonal axes (optimizer, batch shape, LoRA, or NEFTune) rather than the loss-shape knobs." Claude itself had flagged LoRA as a future direction, and the next 40 experiments never followed up on it.
The agent isn't capability-limited. It knows LoRA exists and has thoughts about why it might help. What the autoresearch loop didn't have was a structural prompt to revisit the meta-question. The loop optimises within the frame, and it doesn't naturally step outside it.
We kept two recipes from this stage for the Prolific eval: Recipe A (the default DPO baseline, val=0.464) and Recipe B (the agent's autonomous best, val=0.492).
Stage 2: The same model, in conversation
After the autonomous run finished, we opened a Claude Code session with the same model and pointed it at the autoresearch branch, with full access to the run's results and its commit history. We asked it to take stock of all 50 experiments and propose a recipe worth trying that the loop hadn't explored.
Within minutes, Claude proposed two interventions the autonomous loop had never tested. The first was LoRA adapters at rank 32, keeping the agent's best optimiser settings. The second was filtering UltraFeedback to high-margin preference pairs (margin ≥ 2 between chosen and rejected scores), on the reasoning that a cleaner subset would give better gradient signal at this small scale. Claude then wrote the code, kicked off the training, and watched the logs. The result came back at val_pref_acc = 0.628, which we called Recipe C. Without any further prompting, Claude proposed a follow-up: the same recipe with LoRA rank 64. That produced Recipe D at val=0.648.
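The high-margin filter is simple enough to sketch. This is an illustration under assumptions, not the code Claude wrote in the session: the field names `score_chosen` and `score_rejected` are assumed (they match the commonly used binarized release of UltraFeedback, but your copy may name them differently), and `filter_high_margin` is a hypothetical helper.

```python
def filter_high_margin(examples, min_margin=2.0,
                       chosen_key="score_chosen", rejected_key="score_rejected"):
    """Keep only preference pairs whose score gap is at least `min_margin`.

    `examples` is any iterable of dict-like rows; the key names are assumptions.
    """
    return [ex for ex in examples
            if ex[chosen_key] - ex[rejected_key] >= min_margin]


# Made-up rows purely to show the behaviour.
rows = [
    {"prompt": "p1", "score_chosen": 9.0, "score_rejected": 8.5},  # margin 0.5
    {"prompt": "p2", "score_chosen": 8.0, "score_rejected": 6.0},  # margin 2.0
    {"prompt": "p3", "score_chosen": 9.5, "score_rejected": 4.0},  # margin 5.5
]
kept = filter_high_margin(rows)
print([r["prompt"] for r in kept])
```

The trade-off is fewer training pairs in exchange for less ambiguous labels, which is the reasoning given above for why it might help at this model scale.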
Total researcher guidance: about 5 minutes, covering the prompt and a brief check-in on the proposed recipes. The agent did the design and the training itself. The autoresearch loop had spent 8 hours plateauing at 0.492; a 5-minute steer in conversation lifted val_pref_acc by 0.156. The steer was generic ("take stock and propose something new"), and the agent in its loop just couldn't ask itself that question. Multiply this kind of small drift across a real research project and the cost compounds.
Stage 3: 1.5K pairwise comparisons on Prolific
Each trained recipe was re-trained at full-data, full-epoch settings (~1h per recipe on a single A100 40GB) to produce the production checkpoints used for response generation. We then ran all five models (the untrained base plus Recipes A through D) head-to-head on Prolific. Each of the 10 unique pairs was evaluated on 50 general-audience prompts, spanning eight categories: creative writing, advice, pedagogy, sensory descriptive, persuasive, emotional tone, light planning, and common factual. Three independent participants rated each pair instance.
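The size of the design follows directly from those numbers. A quick check, using only the figures stated above (the model labels are just short names for the five checkpoints):

```python
from itertools import combinations

models = ["base", "A", "B", "C", "D"]
prompts_per_pair = 50
raters_per_instance = 3

# Every unordered pair of distinct models is one matchup.
pairs = list(combinations(models, 2))
planned_judgements = len(pairs) * prompts_per_pair * raters_per_instance

print(len(pairs), planned_judgements)
```

Five models give 10 matchups, and 10 × 50 × 3 gives 1,500 planned judgements, which is where the "1.5K" in the stage title comes from.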
Participants saw a prompt and two responses labelled A and B (display order randomised), then chose between "A is better", "About equal", or "B is better". A comment field was optional. Final sample: 1,507 pairwise judgements from 305 Prolific participants, blind to model identity.
We used standard pairwise-preference analysis (Wilson 95% CIs, Bradley-Terry for the global ranking, Spearman ρ between the metric and that ranking). Full methodology and per-pair stats are in the report.
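For readers who want to reproduce the interval style used in the results, here is a minimal sketch of the Wilson score interval for a win rate. The function name is hypothetical and the counts in the example are made up; how the study handled "About equal" votes when forming the counts is covered in the report, not assumed here.

```python
import math


def wilson_interval(wins, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z defaults to 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = wins / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Illustrative counts only: 50 wins out of 100 decisive comparisons.
low, high = wilson_interval(50, 100)
print(round(low, 3), round(high, 3))
```

A win rate counts as a clean win in this post when the whole interval sits above 0.5, and as a directional win when the interval still includes 0.5.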
What 300 participants told us
Three findings come out of the data.
The metric and participants disagree even at the bottom of the leaderboard. By val_pref_acc, the agent's autonomous recipes (A and B) scored below the untrained reference (0.464 and 0.492 vs 0.500), so the metric says the agent's DPO made the model slightly worse than no training at all. Prolific participants saw it the other way: in head-to-head comparisons against the untrained base, Recipe A won 52.7% [44.2, 61.0] and Recipe B won 52.1% [43.2, 60.8]. Both Wilson 95% CIs overlap chance, so these are directional wins, not decisive ones. The Bradley-Terry global ranking puts the untrained base last of all five models, below all four trained variants. So the agent's autonomous DPO did improve the model in human terms, though just barely, even though its own metric said it had made things worse.
Recipes C and D won decisively over the untrained base. Recipe C beats base 66.4% [57.6, 74.2] of the time. Recipe D beats base 59.7% [50.9, 67.9]. These are statistically clean wins, with confidence intervals clear of chance. The contrast with Recipes A and B (around 52%, CIs overlapping chance) is sharp. Same model, same training infrastructure, very different outcomes. The differentiator was whether the loop around Claude prompted it to step outside the standard recipe template.
At the top of the eval, the metric and participants don't cleanly agree on the winner. Recipe D scored higher than Recipe C on val_pref_acc (0.648 vs 0.628). In direct head-to-head, D narrowly beat C (52.1% [43.2, 60.9]) — but the CI overlaps chance, so they're statistically indistinguishable in pairwise comparison. The Bradley-Terry global ranking, which aggregates each recipe's performance across all 10 pairwise matchups, puts C first and D second. C's top spot comes from stronger performance against the rest of the field, not from beating D directly. Spearman ρ between the metric and the BT ranking across the four trained recipes is +0.80 — directionally strong, with C and D swapped at the top. If we'd selected the production recipe purely by val_pref_acc, we'd have shipped Recipe D — the model that's second-best in the global ranking, not the BT-favoured Recipe C.
This isn't the metric being broken. Spearman ρ = +0.80 is a strong correlation; the broad ordering matches. The metric just gets less reliable at the top, where small stylistic differences matter more than gross quality.
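Two pieces of that analysis are easy to sketch: how a Bradley-Terry fit can rank a model first even when it narrowly loses one head-to-head, and how a single swap at the top of four items yields ρ = +0.80. Everything here is illustrative. The win counts for `x`, `y`, `z` are invented, `bradley_terry` and `spearman_rho` are hypothetical helper names, and the human order C, D, B, A is an assumption: the post states C first and D second, and B above A is the ordering consistent with the reported +0.80.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the standard MM update.

    wins[i][j] = number of times i was preferred over j (ties dropped).
    """
    names = list(wins)
    p = {m: 1.0 / len(names) for m in names}
    for _ in range(iters):
        new_p = {}
        for i in names:
            total_wins = sum(wins[i][j] for j in names if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in names if j != i)
            new_p[i] = total_wins / denom
        norm = sum(new_p.values())
        p = {m: v / norm for m, v in new_p.items()}
    return p


def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings without ties."""
    n = len(rank_a)
    d2 = sum((rank_a[k] - rank_b[k]) ** 2 for k in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Invented counts: y edges x head-to-head (26 vs 24), but x does better vs z.
toy_wins = {
    "x": {"y": 24, "z": 33},
    "y": {"x": 26, "z": 30},
    "z": {"x": 17, "y": 20},
}
strengths = bradley_terry(toy_wins)
print(sorted(strengths, key=strengths.get, reverse=True))

# Metric values from the post; human order below the top two is an assumption.
metric = {"A": 0.464, "B": 0.492, "C": 0.628, "D": 0.648}
metric_rank = {m: r for r, m in
               enumerate(sorted(metric, key=metric.get, reverse=True), start=1)}
human_rank = {m: r for r, m in enumerate(["C", "D", "B", "A"], start=1)}
rho = spearman_rho(metric_rank, human_rank)
print(round(rho, 2))
```

With only four items, one adjacent swap is the smallest possible disagreement, and it already costs 0.20 of correlation. That is worth keeping in mind when reading +0.80 as "strong".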
Why the metric disagrees at the top
Why does val_pref_acc agree with participants at +0.80 but disagree on the top pick? The clean answer is that the metric isn't measuring "what participants prefer" directly. It's measuring how often the model's implicit reward agrees with UltraFeedback's labels, and those labels come from GPT-4 acting as a judge.
This matters because GPT-4-as-judge has well-documented systematic biases. Judging LLM-as-a-Judge (Zheng et al. 2023) catalogues several: position bias, verbosity bias, and a preference for formatted-looking output (bullet points, headers, hedging preambles, comprehensive lists). When we score a model against GPT-4-judged labels, we're not measuring "is this response good." We're measuring "does this response trigger GPT-4's preference patterns."
Inside a sweet spot, those patterns are good proxies for what participants want too. Participants like structure that earns its length and reasonably thorough explanations. Optimising for GPT-4's preferences is also, within range, optimising for human preferences. That's where the +0.80 correlation comes from. Past the sweet spot, the model becomes too GPT-4-like: longer than participants want, more structured than the prompt calls for, hedged where direct would land better. That's where val_pref_acc keeps climbing and human preference starts to fall.
The cleanest evidence in our data is the C-vs-D comparison. They're literally the same recipe with one parameter changed: LoRA rank 32 → 64. The larger adapter has more capacity to mirror UltraFeedback's chosen style. By the metric, that's an improvement. By the aggregate human ranking, it's a step down, although the direct C-vs-D head-to-head is a statistical tie. Same base model, same data, same hyperparameters except adapter capacity. The capacity to be more GPT-4-like is the variable, and in the Bradley-Terry ranking it moves human preference in the opposite direction to the metric.
We can hear the same thing in what participants wrote. The most common themes from disagreement scenarios are exactly what the GPT-4-bias literature predicts: "too long", "too rambling", "more concise / direct", "felt more natural / human", "robotic / cold", "structured but didn't address the actual prompt". Participants were pushing back on the stylistic patterns GPT-4-as-judge rewards. The full per-category breakdown and the LLM-clustered comment themes are in the report.
Key takeaways
Loop structure determines what an agent can find. Same model, same infrastructure, different outcomes depending on how the loop is built. Optimisation-style loops search efficiently within a frame; conversational loops can step outside it. Production workflows will likely want both.
Generic meta-prompts unblock a lot. Five minutes of "take stock and propose something new" produced recipes the autoresearch loop never reached. Worth scripting in as a periodic check-in for any long-running agent loop.
Metrics find neighbourhoods, not winners. Spearman ρ = +0.80 says the metric points in the right direction; the top of the leaderboard says it isn't fine-grained enough to pick between near-equal candidates. A small structured human eval at that point disambiguates.
Placement matters more than effort. Five minutes of research-time guidance is small compared to the agent's 8 hours of training, but it was placed exactly at the moments the agent couldn't self-correct. Across a real research project, agents drift in small and large ways throughout, and skipped check-ins compound.
Caveats and what we'd explore next
This isn't a systematic exploration of autoresearch. It's one run, one base model, one task, one preference dataset, one scaffold. The agent's trajectory is stochastic: a different starting state or seed might produce different results. The conversational meta-prompt that unstuck the loop wasn't ablated, so we don't know which features of it were load-bearing. And while the global Bradley-Terry ranking clearly puts Recipe C above D, the direct C-vs-D pairwise comparison has a confidence interval that overlaps chance. The story we're telling is consistent with the data, but the data is one case study, not a controlled experiment.
A few follow-ups would sharpen the picture:
Scale up and catalogue the failure modes. More autoresearch runs, more seeds, more base models, more metrics, more compute time. Treat this study as one entry in a catalogue of how agents fail when left to run autonomously — more cases across settings would help spot the patterns.
Different autoresearch scaffolds that build in zoom-out check-ins. The current harness reasons about each experiment in isolation, with no built-in mechanism for the agent to step back and ask "am I still pointed at the right problem?" or "what categories of intervention have I not considered?" A scaffold that schedules these self-checks automatically — every N experiments, or when a phase plateaus — might mitigate the drift we saw without needing a human in the loop at all. Whether that closes the gap, or just shifts where the agent gets stuck, is the obvious next experiment.
Identify where and when human input is most instrumental. We landed on two effective placements (research-time meta-prompts and eval-time disambiguation) by intuition. The systematic version — testing different stages, cadences, and types of intervention — would turn placement intuition into something operational.
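As a starting point for the zoom-out scaffold idea above, here is one hedged way the trigger could look: fire a check-in every N experiments, or when recent scores stop improving on the earlier best. The function name and every threshold (`every_n`, `patience`, `min_delta`) are hypothetical choices for illustration, not values from the study or the autoresearch harness.

```python
def should_check_in(scores, every_n=10, patience=5, min_delta=0.005):
    """Decide whether to pause the loop for a 'take stock' meta-prompt.

    scores: val_pref_acc values for the experiments run so far, in order.
    Triggers on a fixed cadence, or when the best of the last `patience`
    runs fails to beat the earlier best by at least `min_delta`.
    """
    n = len(scores)
    if n == 0:
        return False
    if n % every_n == 0:  # scheduled check-in
        return True
    if n > patience:      # plateau check needs some history before the window
        best_before = max(scores[:-patience])
        best_recent = max(scores[-patience:])
        return best_recent < best_before + min_delta
    return False


# Made-up trajectories to show both outcomes.
plateaued = [0.46, 0.50, 0.48, 0.49, 0.48, 0.49, 0.48]
improving = [0.46, 0.46, 0.47, 0.48, 0.49, 0.50, 0.51]
print(should_check_in(plateaued), should_check_in(improving))
```

When the trigger fires, the scaffold would hand the agent the generic prompt from Stage 2 in place of the usual per-experiment prompt. Whether a self-administered version works as well as the human-issued one is exactly the open question raised above.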
If you've run something similar, we'd love to hear how it lined up. The agentic-research story is still mostly anecdotal, and the more case studies in the open the better.
