- Researchers at Moonshot AI and Tsinghua University published a paper on April 19, 2026, proposing PrfaaS, an architecture that offloads LLM prefill computation to remote compute-dense clusters connected via commodity Ethernet.
- In a case study using an internal 1T-parameter hybrid model, PrfaaS demonstrated 54% higher serving throughput than a homogeneous prefill-decode baseline and 32% higher than a naive heterogeneous configuration.
- The advantage narrows to approximately 15% when normalized to equal hardware cost, as the full gain partly reflects pairing higher-compute H200 GPUs for prefill with lower-cost H20 GPUs for decode.
- The architecture’s feasibility depends on hybrid transformer models, which generate substantially smaller KVCache than dense GQA models, enabling cross-datacenter transfer on standard networks.
What Happened
Researchers from Moonshot AI and Tsinghua University published a paper on April 19, 2026, proposing Prefill-as-a-Service (PrfaaS), a cross-datacenter inference architecture that separates the prefill phase of large language model serving from the decode phase across distinct datacenter clusters. The paper, available on arXiv (2604.15039), describes routing long-context prefill requests to standalone, compute-dense remote clusters and transferring the resulting KVCache to local decode hardware over commodity Ethernet. The research addresses a structural bottleneck in current LLM infrastructure: prefill-decode disaggregation has historically been confined to single datacenters because KVCache volume requires RDMA-class interconnects that do not extend between facilities.
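The placement logic described above can be sketched in a few lines. This is an illustration only: the paper does not publish an API, and every name and threshold below is a placeholder.

```python
from dataclasses import dataclass

LONG_CONTEXT_THRESHOLD = 8_192  # hypothetical cutoff for remote prefill


@dataclass
class Request:
    prompt_tokens: int


# Stub stages standing in for the real serving components.
def remote_prefill(req):
    return {"tokens": req.prompt_tokens, "prefill_site": "remote"}


def local_prefill(req):
    return {"tokens": req.prompt_tokens, "prefill_site": "local"}


def transfer_over_ethernet(kv):
    # In PrfaaS the KVCache crosses datacenters on commodity Ethernet.
    return {**kv, "transferred": True}


def decode(kv):
    # Decode always runs on the local, memory-bandwidth-oriented hardware.
    return kv


def serve(req: Request) -> dict:
    """PrfaaS placement: long-context prefill goes to a remote
    compute-dense cluster and its KVCache ships back over Ethernet;
    short prompts are prefilled locally."""
    if req.prompt_tokens >= LONG_CONTEXT_THRESHOLD:
        kv = transfer_over_ethernet(remote_prefill(req))
    else:
        kv = local_prefill(req)
    return decode(kv)
```

The key property is that only prefill placement changes per request; decode hardware and the serving interface stay fixed.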
Why It Matters
Prefill-decode (PD) disaggregation has been a focus of LLM infrastructure research, with prior systems such as Splitwise and DistServe demonstrating that separating compute-intensive prefill from memory-bandwidth-intensive decode can improve GPU utilization within a single datacenter. The consistent barrier to extending that separation across datacenters has been KVCache transport cost: dense transformer models using Grouped Query Attention produce caches at rates that saturate standard Ethernet links. PrfaaS argues that the growing adoption of hybrid architectures—models combining attention layers with subquadratic components—reduces KVCache size enough to bring cross-datacenter transfer within commodity network limits for the first time.
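The transport-cost argument reduces to simple arithmetic: per-token KVCache size scales with the number of attention layers, KV heads, and head dimension, and the required link bandwidth is that size times the prefill token rate. A back-of-envelope sketch with hypothetical model dimensions (the paper does not publish these):

```python
def kvcache_bytes_per_token(num_layers: int, num_kv_heads: int,
                            head_dim: int, dtype_bytes: int = 2) -> int:
    """Per-token KVCache for a GQA transformer: one K and one V vector
    per layer per KV head, in the given precision (2 bytes = FP16)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Hypothetical dense GQA model: 80 layers, 8 KV heads, head_dim 128.
dense = kvcache_bytes_per_token(80, 8, 128)        # 327,680 bytes/token

# Hypothetical hybrid model: only 1 in 4 layers is full attention,
# so the cache shrinks roughly 4x; subquadratic layers keep no KVCache.
hybrid = kvcache_bytes_per_token(80 // 4, 8, 128)  # 81,920 bytes/token


def link_gbps(bytes_per_token: int, prefill_tokens_per_s: int) -> float:
    """Sustained bandwidth needed to ship KVCache as prefill emits it."""
    return bytes_per_token * prefill_tokens_per_s * 8 / 1e9


rate = 20_000  # hypothetical prefill tokens/s for a compute-dense cluster
print(f"dense:  {link_gbps(dense, rate):.1f} Gbps")   # ~52.4 Gbps
print(f"hybrid: {link_gbps(hybrid, rate):.1f} Gbps")  # ~13.1 Gbps
```

Under these assumed numbers the dense model saturates a typical inter-datacenter link while the hybrid model stays within commodity budgets, which is the feasibility argument PrfaaS makes.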
Technical Details
The paper quantifies the problem for dense models by benchmarking MiniMax-M2.5, a representative dense GQA model, which the researchers report generates KVCache at approximately 60 Gbps for a 32,000-token request on a single 8×H200 instance—a data rate incompatible with standard inter-datacenter Ethernet links. The hybrid model used in the PrfaaS case study produces a substantially smaller KVCache, remaining within commodity bandwidth budgets. In that case study on an internal 1T-parameter hybrid model, PrfaaS demonstrated 54% higher serving throughput than a homogeneous PD baseline and 32% higher than a naive heterogeneous configuration. The research team notes that the full 54% advantage “comes partly from pairing higher-compute H200 GPUs for prefill with H20 GPUs for decode,” and that when comparing at equal hardware cost the gain is approximately 15%.
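The gap between the 54% raw gain and the roughly 15% cost-normalized gain follows directly from dividing throughput by fleet cost. The derivation below uses only the two reported ratios; the implied cost ratio is a back-of-envelope reading, not a figure from the paper:

```python
# Reported ratios from the PrfaaS case study.
raw_throughput_gain = 1.54    # vs. the homogeneous PD baseline
cost_normalized_gain = 1.15   # approximate gain at equal hardware cost

# Equal-cost comparison: (T_prfaas / C_prfaas) / (T_base / C_base) = 1.15.
# Given T_prfaas / T_base = 1.54, the implied fleet-cost ratio is:
implied_cost_ratio = raw_throughput_gain / cost_normalized_gain
print(f"{implied_cost_ratio:.2f}")  # ~1.34
```

In other words, the H200-prefill/H20-decode fleet that produced the 54% gain would cost roughly a third more than the baseline, which is why normalizing to equal spend shrinks the advantage to about 15%.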
Who’s Affected
The architecture targets operators running multi-datacenter GPU inference fleets with long-context workloads, particularly those deploying or building on hybrid transformer models. Moonshot AI, one of the paper’s institutional contributors, operates large-scale inference infrastructure for long-context LLM products. Cloud providers and inference API companies evaluating disaggregated serving strategies across datacenter boundaries are the most direct potential adopters.
What’s Next
As of April 20, 2026, the PrfaaS paper is an arXiv preprint and has not undergone peer review. The throughput results are drawn from a case study on an internal model not available for independent evaluation, so external reproduction has not been established. The authors do not specify a deployment timeline or plans for open-source release.