- Aurora is a reinforcement-learning-based framework that speeds up large language model inference 1.25x over static speculative decoding by learning from live inference traces in real time.
- The system continuously updates its speculative decoder without interrupting model serving, adapting to shifting traffic patterns across different domains and use cases.
- Aurora was tested on Llama 3.1 70B and Mixtral 8x22B, showing consistent speedups across code generation, conversational AI, and document summarization workloads.
- At scale, Aurora’s inference cost reduction could save large API providers millions of dollars annually by reducing GPU-hours per token generated.
What Happened
A research team led by Yao Fu and Hao Peng at the University of Edinburgh published Aurora, a framework that accelerates large language model inference by training a speculative decoder in real time from live serving data. The paper, posted to arXiv on March 28, 2026, demonstrates that Aurora delivers a 1.25x additional speedup over a well-trained static speculative decoder — without requiring any retraining downtime or offline data collection.
Speculative decoding, the technique Aurora builds on, uses a smaller “draft” model to predict multiple tokens ahead, which the larger model then verifies in parallel. Aurora’s contribution is making the draft model adaptive: it learns continuously from the actual queries the system processes.
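The draft-then-verify loop that Aurora builds on can be sketched as follows. This is a minimal illustration with toy stand-in models, not Aurora's implementation: `draft_model` and `target_model` are hypothetical placeholders, and the verification runs sequentially here for clarity, where a real system checks all drafted tokens in a single parallel forward pass.

```python
import random

def draft_model(prefix, k=4):
    """Toy draft model: cheaply guesses the next k tokens (stand-in for a small LM)."""
    random.seed(hash(tuple(prefix)) % (2**32))
    return [random.randint(0, 9) for _ in range(k)]

def target_model(prefix):
    """Toy target model: the 'correct' next token (stand-in for the large LM)."""
    random.seed((hash(tuple(prefix)) + 1) % (2**32))
    return random.randint(0, 9)

def speculative_step(prefix, k=4):
    """One draft-then-verify step: keep the longest prefix of drafted tokens
    that the target model agrees with, then emit one target-model token."""
    drafts = draft_model(prefix, k)
    accepted = []
    for tok in drafts:
        # In practice this check is batched into one parallel forward pass.
        if target_model(prefix + accepted) == tok:
            accepted.append(tok)
        else:
            break
    # At the first mismatch (or after all k matched), emit the target model's
    # own token, so every step makes at least one token of progress.
    accepted.append(target_model(prefix + accepted))
    return accepted
```

The payoff is that when the draft model guesses well, one expensive target-model verification step yields several tokens instead of one.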
Why It Matters
Inference cost is now the dominant expense for organizations running large language models at scale. SemiAnalysis estimated in February 2026 that inference accounts for 70-85% of total LLM operational costs for API providers, with speculative decoding already reducing those costs by 30-50% when well-tuned. But static speculative decoders have a fundamental limitation: they are trained on a fixed dataset and degrade when traffic patterns shift.
A customer service chatbot that handles billing questions during the day and technical support at night generates very different token distributions. A static speculative decoder trained on mixed data compromises on both. Aurora solves this by adapting the decoder to whatever the model is currently processing.
Technical Details
Aurora’s architecture has three components. The first is a trace buffer that collects accepted and rejected token sequences from the verification step of speculative decoding during normal serving. This data captures what the draft model gets right and wrong in real time. The second is a lightweight reinforcement learning module that uses proximal policy optimization (PPO) to update the draft model’s parameters based on the trace buffer, running asynchronously on spare GPU capacity without competing with the serving workload.
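The trace buffer's role can be sketched as below. This is a hypothetical illustration, not the paper's code: the class name, the ring-buffer capacity, and the simple +1/-1 reward scheme are all assumptions; the paper states only that verification outcomes are logged and consumed asynchronously by a PPO updater.

```python
from collections import deque

class TraceBuffer:
    """Hypothetical ring buffer logging verification outcomes during serving.
    Each record is (context, drafted_token, accepted); accepted drafts earn a
    positive reward and rejected drafts a negative one, forming the signal
    that an asynchronous PPO updater would consume."""

    def __init__(self, capacity=100_000):
        # deque with maxlen silently evicts the oldest traces once full.
        self.records = deque(maxlen=capacity)

    def log(self, context, drafted_token, accepted):
        self.records.append((tuple(context), drafted_token, accepted))

    def sample_batch(self, n):
        """Return up to n most recent traces with rewards attached."""
        batch = list(self.records)[-n:]
        return [(ctx, tok, 1.0 if ok else -1.0) for ctx, tok, ok in batch]

    def acceptance_rate(self):
        """Fraction of logged drafts the target model accepted."""
        if not self.records:
            return 0.0
        return sum(ok for *_, ok in self.records) / len(self.records)
```

Because logging is append-only and the updater reads batches on spare GPU capacity, the serving path pays essentially nothing for data collection.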
The third component is a domain detector that identifies shifts in traffic distribution — for example, a transition from conversational queries to code generation — and adjusts the learning rate and exploration parameters accordingly. This prevents the decoder from catastrophically forgetting previous domains while adapting to new ones.
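One simple way such a detector could work, sketched here as an assumption rather than the paper's method, is to compare a rolling window of recent acceptance outcomes against the long-run acceptance rate: a sharp drop suggests the traffic distribution has shifted, so the learning rate is boosted to adapt faster. The thresholds and boost factor below are illustrative.

```python
from collections import deque

class DomainShiftDetector:
    """Hypothetical detector: flags a traffic shift when the recent draft
    acceptance rate falls well below the long-run rate, and scales the
    learning rate up so the draft model adapts to the new domain quickly."""

    def __init__(self, window=500, drop_threshold=0.10, base_lr=1e-5, boost=5.0):
        self.recent = deque(maxlen=window)  # sliding window of outcomes
        self.long_run_sum = 0
        self.long_run_n = 0
        self.drop_threshold = drop_threshold
        self.base_lr = base_lr
        self.boost = boost

    def observe(self, accepted: bool):
        self.recent.append(accepted)
        self.long_run_sum += accepted
        self.long_run_n += 1

    def learning_rate(self):
        if self.long_run_n == 0 or not self.recent:
            return self.base_lr
        long_run = self.long_run_sum / self.long_run_n
        recent = sum(self.recent) / len(self.recent)
        # A large acceptance drop suggests the workload distribution shifted.
        if long_run - recent > self.drop_threshold:
            return self.base_lr * self.boost
        return self.base_lr
```

Keeping the long-run statistic alongside the window is one way to raise the learning rate only during transitions, rather than permanently, which helps avoid the catastrophic forgetting the paper mentions.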
The researchers benchmarked Aurora on Llama 3.1 70B and Mixtral 8x22B across three workload categories. On code generation tasks, Aurora achieved a 1.31x speedup over a static speculative decoder that had been trained on 500,000 code samples. On conversational AI, the speedup was 1.22x. On document summarization, 1.24x. The acceptance rate of speculated tokens — the key metric for speculative decoding efficiency — improved from 71% (static) to 83% (Aurora) on average.
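The reported acceptance-rate gain is roughly consistent with the headline speedup under the standard speculative-decoding analysis. Assuming a draft length of 4 and independent per-token acceptance with probability alpha (assumptions on our part; the paper does not state its draft length), the expected tokens emitted per verification step is (1 - alpha^(k+1)) / (1 - alpha):

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens per verification step when each of k drafted tokens
    is accepted independently with probability alpha, plus the one token
    the target model always contributes."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

static = expected_tokens_per_step(0.71, k=4)   # ~2.83 tokens per step
aurora = expected_tokens_per_step(0.83, k=4)   # ~3.57 tokens per step
ratio = aurora / static                        # ~1.26, near the reported 1.25x
```

Under these assumptions, lifting acceptance from 71% to 83% yields about a 1.26x throughput gain, in line with the 1.25x average the paper reports.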
The overhead of Aurora’s online learning is modest. The RL update loop consumes approximately 4% of total GPU compute and adds less than 2ms of latency per batch. The trace buffer requires 1.2GB of GPU memory, and the domain detector adds negligible overhead.
Who’s Affected
API providers running high-volume inference — companies like OpenAI, Anthropic, Google, and the growing ecosystem of open-model hosting providers including Together AI, Fireworks, and Anyscale — stand to benefit most. At the scale of billions of tokens per day, a 1.25x speedup translates directly into reduced GPU-hours and lower costs. Fu estimated that for a provider serving 10 billion tokens per day on Llama 3.1 70B, Aurora would save approximately $2.3 million annually in compute costs compared to static speculative decoding.
Enterprise customers running self-hosted models would also benefit, particularly those with diverse workloads where a single static decoder underperforms. The framework is model-agnostic and can be applied to any autoregressive transformer that supports speculative decoding.
What’s Next
Fu and Peng plan to release Aurora’s code and training scripts on GitHub by May 2026. They are also exploring integration with vLLM, the most widely used open-source inference engine, which would make Aurora accessible to the broader community without requiring custom serving infrastructure. One noted limitation is that Aurora has not yet been tested on mixture-of-experts models larger than Mixtral 8x22B, and the RL overhead may scale differently on models with hundreds of billions of parameters. The team is currently running experiments on Llama 3.1 405B to quantify this.
