
Karpathy’s 630-Line Script Ran 50 Experiments Overnight Without a Human

Zara Mitchell · Mar 31, 2026 · Updated Apr 7, 2026 · 3 min read
Engine Score 7/10 — Important

Karpathy's autoresearch ran 50 autonomous ML experiments overnight and has drawn 64,000 GitHub stars, signaling a shift in how ML research is conducted.

  • Andrej Karpathy released autoresearch, a 630-line Python script that lets an AI agent autonomously run machine learning experiments on a single GPU overnight.
  • In one overnight session, the agent completed 126 experiments, reducing validation loss from 0.9979 to 0.9697 without any human intervention.
  • Shopify CEO Tobias Lütke tested the system on internal data and reported a 19% model performance improvement after 37 autonomous experiments.
  • The repository has attracted 64,000 GitHub stars and 9,000 forks in under a month.

What Happened

On March 7, Andrej Karpathy, the former Tesla AI director and OpenAI co-founder, pushed a 630-line Python script to GitHub and went to sleep. By morning, his AI agent had run 50 experiments, discovered a better learning rate, and committed the results to git without a single human instruction.

Karpathy described the project on X: “I packaged up the ‘autoresearch’ project into a new self-contained minimal repo. It’s basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code.” The human writes the research prompt in a markdown file, and the AI agent iterates on the training code in a Python file. Every run lasts exactly 5 minutes.

Why It Matters

Machine learning research has traditionally required researchers to manually design experiments, adjust hyperparameters, run training jobs, analyze results, and repeat. Each individual cycle can take hours of focused human attention. Autoresearch compresses this into a fully autonomous loop. The agent modifies the training code, runs a 5-minute experiment, evaluates whether validation loss improved, and either keeps or discards the change before starting the next iteration, all without any human input.
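That keep-or-discard cycle can be sketched in a few lines. This is an illustrative reconstruction of the loop described above, not the actual autoresearch code; the helper names (`run_experiment`, `propose_change`, `revert_change`, `commit`) are hypothetical, and they are passed in as functions so the control flow stands on its own.

```python
import subprocess


def train_once(budget_s: int = 300) -> float:
    """Run one training job under the 5-minute budget and return the val metric.

    Hypothetical: assumes train.py prints the final validation metric on its
    last stdout line. The timeout enforces the wall-clock budget.
    """
    out = subprocess.run(
        ["python", "train.py"], capture_output=True, text=True, timeout=budget_s
    )
    return float(out.stdout.strip().splitlines()[-1])


def autonomous_loop(run_experiment, propose_change, revert_change, commit, n_iters):
    """Keep-or-discard loop: a change survives only if validation improves."""
    best = run_experiment()              # baseline measurement
    for _ in range(n_iters):
        propose_change()                 # agent edits train.py
        loss = run_experiment()          # one 5-minute run
        if loss < best:                  # lower is better: keep and commit
            best = loss
            commit(f"val improved to {loss:.4f}")
        else:                            # regression: throw the edit away
            revert_change()
    return best
```

With a stubbed `run_experiment`, the loop keeps only the two improving changes in a sequence like 1.0 → 0.9 → 0.95 → 0.85.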

The results are significant. Fortune reported that in a two-day continuous run, the agent processed approximately 700 autonomous changes, finding roughly 20 additive improvements that transferred cleanly to larger models. Those stacked improvements reduced the “Time to GPT-2” benchmark from 2.02 hours to 1.80 hours, an 11% efficiency gain.

Technical Details

The system consists of three files:
  • prepare.py: data preparation and runtime utilities; never modified by the agent.
  • train.py: the full GPT model definition, optimizer, and training loop in a single file; the only file the agent edits.
  • program.md: natural-language instructions the human writes to guide the research direction.

Each experiment runs under a strict 5-minute wall clock budget on a single NVIDIA GPU, tested primarily on H100 hardware. The evaluation metric is validation bits per byte (val_bpb), where lower numbers indicate better performance. The agent operates on a git feature branch, accumulating commits as it finds improvements to neural network architecture, optimizer settings, batch size, and hyperparameters.
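Two of these mechanics are easy to make concrete: val_bpb is a unit conversion from mean cross-entropy loss, and the budget is a wall-clock check. The sketch below assumes loss is measured in nats per token; the helper names are mine, not the repo's.

```python
import math
import time


def val_bpb(mean_ce_loss_nats: float, tokens: int, nbytes: int) -> float:
    """Convert mean cross-entropy (nats per token) to bits per byte.

    Total nats = loss * tokens; divide by ln(2) to get bits, then by the
    byte count of the validation data. Lower is better.
    """
    return mean_ce_loss_nats * tokens / (math.log(2) * nbytes)


def within_budget(start: float, budget_s: float = 300.0) -> bool:
    """True while the 5-minute wall-clock budget is not yet exhausted."""
    return time.monotonic() - start < budget_s
```

For a byte-level model (one token per byte), a mean loss of ln 2 nats per token corresponds to exactly 1.0 bits per byte.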

At approximately 12 experiments per hour, the system can execute around 100 experiments during an overnight session, covering a search space that would take a human researcher days or weeks to explore manually.
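The throughput arithmetic checks out (the 8-hour session length is an assumption about what “overnight” means):

```python
RUN_MINUTES = 5                      # fixed wall-clock budget per experiment
RUNS_PER_HOUR = 60 // RUN_MINUTES    # 12, assuming negligible agent overhead
OVERNIGHT_HOURS = 8                  # assumed length of an overnight session

experiments = RUNS_PER_HOUR * OVERNIGHT_HOURS  # 96, i.e. "around 100"
```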

Who’s Affected

The immediate audience is ML researchers and engineers with access to GPU hardware. The repository requires Python 3.10+, PyTorch, and an NVIDIA GPU, though community forks have extended support to macOS, Windows, and AMD platforms. Academic labs, corporate research teams, and independent ML practitioners can all use the system, as the MIT license permits both commercial and non-commercial use.
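A pre-flight check for the stated requirements might look like the sketch below. It is illustrative only, and degrades gracefully when PyTorch is not installed.

```python
import sys


def check_requirements() -> tuple[bool, bool]:
    """Return (python_ok, gpu_ok) for the Python 3.10+ / NVIDIA GPU requirements."""
    python_ok = sys.version_info >= (3, 10)
    try:
        import torch                        # PyTorch is a hard dependency
        gpu_ok = torch.cuda.is_available()  # detects a usable NVIDIA GPU
    except ImportError:
        gpu_ok = False                      # no PyTorch, so no GPU check
    return python_ok, gpu_ok
```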

Shopify CEO Tobias Lütke provided early industry validation. He ran autoresearch on Shopify’s internal data overnight; the agent completed 37 experiments and delivered a 19% performance improvement on the company’s model. That result showed the system works beyond academic benchmarks on real production data.

What’s Next

The repository has accumulated 64,000 stars and 9,000 forks on GitHub, with 9 contributors expanding the codebase. The project is 83.4% Python and 16.6% Jupyter Notebook, released under an MIT license that permits commercial use.

The current limitation is the single-GPU constraint, which restricts experiments to models small enough to train in 5-minute increments. Scaling autoresearch to multi-GPU setups or longer training budgets would expand its applicability to production-scale models, but Karpathy has kept the scope deliberately minimal. The dependencies are intentionally sparse: just PyTorch, Python 3.10+, and the uv package manager. The broader implication is a shift from AI as a tool that assists researchers to AI as an agent that conducts research autonomously, a pattern some observers have already dubbed “the Karpathy Loop.”
