- SkyPilot enabled Claude Code to run Andrej Karpathy’s autoresearch project across 16 GPUs in parallel, achieving a 9x speedup over single-GPU sequential execution.
- The parallel agent submitted 910 experiments in 8 hours and reduced validation loss from 1.003 to 0.974, a roughly 2.9 percent improvement.
- Total cost was approximately $309: $9 for Claude Code API calls and $300 for GPU compute across 13 H100s and 3 H200s.
- Without explicit instruction, the agent independently discovered performance differences between GPU types and developed a two-tier validation strategy.
What Happened
Researchers at SkyPilot demonstrated that giving an AI coding agent parallel access to a GPU cluster can dramatically accelerate automated machine learning research. Their study scaled Andrej Karpathy’s autoresearch project from a single GPU to 16 Kubernetes-managed GPUs, cutting the time to reach the same validation loss from approximately 72 hours to 8 hours.
The project used Claude Code as the autonomous research agent. It modified training scripts, submitted experiments, evaluated results, and iteratively improved a language model’s performance, all without human intervention during the 8-hour run. Authors Alex Kim and Romil Bhardwaj published the results on the SkyPilot blog.
Why It Matters
AI-driven research has been constrained by sequential execution. An agent running on one GPU can only test hypotheses one at a time, creating a bottleneck where promising ideas sit in a queue while less productive experiments consume compute time. SkyPilot’s approach removes that constraint by letting the agent fan out across multiple GPUs simultaneously, testing dozens of ideas in parallel and converging on better solutions faster.
The throughput jumped from roughly 10 experiments per hour on a single GPU to 90 experiments per hour across the cluster. Over the 8-hour session, the agent submitted 910 total experiments. Kim and Bhardwaj reported that the parallel approach reached the same validation loss in one-ninth of the time, suggesting that agent-driven research scales predictably with available compute.
The cost profile makes the approach accessible. The entire 8-hour run cost approximately $300 in GPU compute plus $9 in Claude Code API fees, totaling $309. That puts parallel agent research within reach of academic labs, startups, and independent researchers who have access to cloud GPU providers but not dedicated supercomputer allocations.
Technical Details
The cluster consisted of 13 H100 and 3 H200 GPUs managed through SkyPilot’s job scheduling and cluster provisioning system. Each experiment was constrained to a fixed 5-minute training budget to maintain comparability across runs. SkyPilot handled resource allocation, job queuing, and fault tolerance automatically.
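A task of this shape can be sketched in SkyPilot's YAML format. This is a minimal illustration only: the accelerator spec, script name, and environment variable are assumptions, not the authors' actual configuration.

```yaml
# Hypothetical SkyPilot task sketch -- illustrative, not the authors' config.
resources:
  accelerators: H100:1   # screening tier; an H200:1 variant would serve confirmation runs

setup: |
  pip install -r requirements.txt

run: |
  # Enforce the fixed 5-minute training budget per experiment.
  timeout 300 python train.py --config "$EXPERIMENT_CONFIG"
```

A task like this would be submitted to SkyPilot's job queue (e.g. with `sky jobs launch`), which then handles provisioning, queuing, and recovery, so the agent only has to emit task definitions.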
The agent progressed through five emergent phases without being explicitly programmed to follow any particular research strategy. It began with hyperparameter sweeps across roughly 200 experiments, reducing validation bits-per-byte (val_bpb) from 1.003 to 0.981. It then discovered architectural improvements by scaling model width from 384 to 768 dimensions, pushing val_bpb to 0.977. Fine-tuning and optimizer adjustments followed in phases three and four, with the agent adjusting muon_beta2 to 0.98 and reaching a final val_bpb of 0.974. The fifth phase, spanning approximately 210 experiments, showed diminishing returns with improvements dropping below 0.0001 per experiment.
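The phase-by-phase gains can be checked with a quick calculation from the reported val_bpb milestones (the numbers below are taken from the run described above; the labels are shorthand):

```python
# Reported val_bpb at each phase boundary (from the run described above).
milestones = [
    ("start", 1.003),
    ("hyperparameter sweeps", 0.981),
    ("width 384 -> 768", 0.977),
    ("optimizer tuning (muon_beta2=0.98)", 0.974),
]

# Per-phase improvement in absolute val_bpb terms.
for (_, prev), (name, cur) in zip(milestones, milestones[1:]):
    print(f"{name}: -{prev - cur:.3f} val_bpb")

# Overall relative improvement across all phases.
overall = (milestones[0][1] - milestones[-1][1]) / milestones[0][1]
print(f"overall improvement: {overall:.1%}")
```

The early hyperparameter sweeps account for most of the absolute gain (0.022 of the 0.029 total), which is consistent with the diminishing returns reported in the fifth phase.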
One notable emergent behavior stood out. Without any explicit guidance, the agent detected that H200 GPUs produced more reliable results than H100s and independently developed a two-tier strategy: “screen 10+ hypotheses cheaply on H100s in parallel, then promote the top 2-3 to H200 for confirmation runs.” This kind of resource-aware optimization was not part of the agent’s instructions and emerged purely from observing experimental outcomes across different hardware.
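The screen-then-promote pattern the agent arrived at can be sketched in a few lines of Python. This is a generic illustration of the idea, assuming lower validation loss is better; the function names are hypothetical, and the `screen`/`confirm` callables stand in for cheap H100 runs and reliable H200 runs respectively.

```python
import heapq

def two_tier_run(hypotheses, screen, confirm, promote_k=3):
    """Screen every hypothesis cheaply in tier 1, then re-run only the
    top `promote_k` on the more reliable tier 2 for confirmation.
    Illustrative sketch only -- not SkyPilot or Claude Code API."""
    # Tier 1: cheap parallel screening (stand-in for H100 runs).
    screened = [(screen(h), h) for h in hypotheses]
    # Lower validation loss is better, so promote the k smallest scores.
    finalists = [h for _, h in heapq.nsmallest(promote_k, screened)]
    # Tier 2: confirmation runs on reliable hardware (stand-in for H200s).
    return {h: confirm(h) for h in finalists}
```

The design mirrors the quoted strategy: many cheap, noisy evaluations filter the hypothesis space, and only the few survivors consume time on the scarcer, more trustworthy tier.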
Who’s Affected
Machine learning researchers and AI labs running automated experimentation pipelines stand to benefit most. The study suggests that scaling agent access to compute resources could compress weeks of research iteration into hours. SkyPilot’s infrastructure layer handles the orchestration, meaning teams do not need to build custom job scheduling systems or manage GPU allocation logic themselves.
Cloud GPU providers may see increased demand from agent-driven research workloads, which are bursty by nature. A researcher might spin up 16 GPUs for an 8-hour session and release them, creating a usage pattern that differs from traditional long-running training jobs.
What’s Next
The study demonstrates a proof of concept but does not address how well the approach generalizes beyond language model training to other domains like computer vision or reinforcement learning. Diminishing returns appeared in the final phase after roughly 700 runs, suggesting a natural ceiling for this particular task configuration. Whether agents can learn to recognize that ceiling and stop spending compute on marginal gains is an open research question.
Related Reading
- Karpathy’s AutoResearch Agent Runs 700 Experiments in Two Days, Cuts GPT-2 Training Time 11%
- Study Finds AI Users Adopt Incorrect Answers 73% of the Time When Models Err
- GPT-5.2 and Claude Opus 4.6 Both Go Silent on ‘Ontologically Null’ Prompts, Study Finds
- Google Research Publishes TurboQuant Two-Stage LLM Compression System