Lead researcher Yuma Ichikawa and thirteen co-authors submitted "OneComp: One-Line Revolution for Generative AI Model Compression" to arXiv on 30 March 2026. The paper presents an open-source framework that reduces the expert-intensive process of post-training model quantization to a single command, taking only a model identifier and a hardware description as input.
- OneComp automates mixed-precision quantization of foundation models, freeing practitioners from manual algorithm selection and precision budgeting.
- The framework executes a three-stage compression pipeline — layer-wise compression, block-wise refinement, and global refinement — adapting to available hardware automatically.
- A “deployable pivot” design ensures the first quantized checkpoint is immediately usable, with quality improving incrementally as more compute is applied.
- OneComp is released as open-source software with an extensible architecture designed to incorporate new compression algorithms as interchangeable modules.
What Happened
A fourteen-author team led by Yuma Ichikawa submitted OneComp to arXiv (arXiv:2603.28845) on 30 March 2026. The paper introduces an open-source post-training compression framework that automates quantization of large language models across three progressive stages and requires only a model identifier and a hardware specification to execute a complete compression run.
The paper describes a pipeline that inspects model structure, determines mixed-precision bit-width assignments, and executes layer-wise compression, block-wise refinement, and global refinement in sequence. Each stage operates on the same base checkpoint — what the authors call the “deployable pivot” — so the first output is always a usable model, and subsequent stages increase quality rather than replacing it.
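The "deployable pivot" idea can be illustrated with a minimal sketch: each stage refines the same checkpoint, so interrupting the run after any stage still leaves a usable model. The stage names follow the paper, but the code structure and function names here are hypothetical, not OneComp's actual implementation.

```python
# Hedged sketch of a progressive pipeline with a "deployable pivot":
# every stage returns a deployable checkpoint, and later stages refine
# rather than replace it. Names and structure are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    model_id: str
    stages_applied: list = field(default_factory=list)

def layer_wise(ckpt):          # stage 1: cheapest, always runs first
    ckpt.stages_applied.append("layer-wise")
    return ckpt

def block_wise(ckpt):          # stage 2: refines stage-1 output
    ckpt.stages_applied.append("block-wise")
    return ckpt

def global_refine(ckpt):       # stage 3: most expensive, optional
    ckpt.stages_applied.append("global")
    return ckpt

def compress(model_id: str, budget: int) -> Checkpoint:
    """Run as many stages as `budget` allows; every result is deployable."""
    ckpt = Checkpoint(model_id)
    for stage in (layer_wise, block_wise, global_refine)[:budget]:
        ckpt = stage(ckpt)     # the pivot: same checkpoint, refined in place
    return ckpt

print(compress("example-model", budget=2).stages_applied)
# → ['layer-wise', 'block-wise']
```

Stopping after `budget=1` still yields a working layer-wise-quantized model, which is the property the authors emphasize.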
Why It Matters
Deploying large foundation models has long been constrained by memory footprint, inference latency, and hardware cost. Post-training quantization can mitigate all three, but applying it reliably has historically demanded manual selection of compression algorithms, precision budgets, calibration data strategies, and hardware-specific execution configuration, a depth of expertise most engineering teams lack.
Prior quantization research — including methods such as GPTQ, AWQ, and SmoothQuant — demonstrated that reducing parameter precision from 16-bit floats to lower bit widths can preserve much of a model’s performance. These methods have typically been applied individually, requiring practitioners to understand tradeoffs between approaches. OneComp is designed to compose multiple compression techniques into a unified sequential pipeline without requiring that depth of expertise.
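The underlying tradeoff these methods manage can be seen in a few lines: round-to-nearest quantization shrinks storage per weight but introduces error that grows as bit width falls. This is a generic illustration of the precision/error tradeoff, not OneComp's algorithm; methods like GPTQ and AWQ add error-correcting machinery on top of this basic idea.

```python
# Hedged sketch: per-tensor symmetric round-to-nearest quantization,
# showing how reconstruction error grows as bit width shrinks.
# Illustrative only -- not the algorithm used by OneComp, GPTQ, or AWQ.
import numpy as np

def quantize_dequantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Quantize w to signed `bits`-bit levels with one scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 127 for int8, 7 for int4
    scale = np.abs(w).max() / qmax              # single per-tensor scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
for bits in (8, 4, 2):
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

The printed errors rise monotonically as bits decrease, which is why naive low-bit quantization degrades models and why the choice of method and precision budget matters.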
Technical Details
OneComp’s pipeline automates three stages of compression: layer-wise quantization, block-wise refinement, and global refinement. Rather than requiring users to specify precision targets, it assigns bit-width budgets to individual model layers based on available hardware constraints, and it handles calibration data strategies internally within the pipeline.
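One simple way to plan per-layer bit widths under a hardware memory budget is a greedy scheme: start every layer at a low precision and promote the most sensitive layers while the total still fits. The paper does not describe OneComp's planner at this level of detail, so the sensitivity scores and greedy rule below are assumptions for illustration.

```python
# Hedged sketch of mixed-precision planning under a memory budget.
# Assumes a per-layer sensitivity score is available; OneComp's actual
# planning algorithm is not specified in the abstract.
def plan_bit_widths(layer_params, sensitivity, budget_bytes):
    """Start all layers at 4-bit; promote the most sensitive layers to
    8-bit while total storage stays within budget_bytes."""
    bits = {name: 4 for name in layer_params}
    total = sum(n * 4 // 8 for n in layer_params.values())  # bytes at 4-bit
    for name in sorted(layer_params, key=lambda n: -sensitivity[n]):
        extra = layer_params[name] * 4 // 8   # cost of going 4-bit -> 8-bit
        if total + extra <= budget_bytes:
            bits[name] = 8
            total += extra
    return bits

layers = {"attn": 1_000_000, "mlp": 4_000_000, "head": 500_000}
sens = {"attn": 0.9, "mlp": 0.2, "head": 0.7}
print(plan_bit_widths(layers, sens, budget_bytes=4_000_000))
# → {'attn': 8, 'mlp': 4, 'head': 8}
```

Running the same planner with a larger budget promotes more layers, which mirrors the article's point that one command can yield different configurations on different hardware.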
The “deployable pivot” is a core architectural decision: the first quantized checkpoint is treated as a fully deployable model rather than an intermediate artifact, and each successive stage improves it incrementally. As the authors describe in the paper, “OneComp automatically inspects the model, plans mixed-precision assignments, and executes progressive quantization stages, ranging from layer-wise compression to block-wise refinement and global refinement.”
The framework is described as resource-adaptive and hardware-aware: it inspects available compute before generating a compression schedule, meaning a single command can produce different mixed-precision configurations depending on whether it runs on a workstation GPU or a high-memory data center accelerator. The authors claim this approach does not significantly degrade model performance relative to full-precision baselines, but the publicly available abstract does not include benchmark tables that would verify this claim.
Who’s Affected
ML engineers and infrastructure teams deploying open-weight foundation models on memory-constrained or cost-sensitive hardware, including single-GPU servers, edge devices, and budget cloud instances, are the primary audience for OneComp. The framework removes the quantization expertise barrier that has prevented many teams from compressing models for production use.
Compression researchers also stand to benefit from the extensible architecture. OneComp is designed to accept new quantization algorithms as modules, meaning teams developing novel compression methods can integrate them into the pipeline without rebuilding surrounding infrastructure. The open-source release is intended to lower the barrier between publishing a compression algorithm and shipping it in a production-grade deployment.
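A registry pattern is one common way to make algorithms interchangeable modules in a pipeline. The interface and names below are hypothetical, intended only to show the kind of extension point the article describes, not OneComp's actual API.

```python
# Hedged sketch of a plugin registry for compression algorithms.
# All names and signatures here are hypothetical, not OneComp's API.
COMPRESSORS = {}

def register(name):
    """Decorator that registers an algorithm under `name`."""
    def wrap(fn):
        COMPRESSORS[name] = fn
        return fn
    return wrap

@register("rtn")
def round_to_nearest(weights, bits):
    # Placeholder body: a real module would quantize the tensors.
    return f"rtn({bits}-bit) applied to {len(weights)} tensors"

@register("my_new_method")
def my_new_method(weights, bits):
    # A researcher's new algorithm plugs in without touching the pipeline.
    return f"my_new_method({bits}-bit) applied to {len(weights)} tensors"

def run(name, weights, bits=4):
    return COMPRESSORS[name](weights, bits)   # dispatch by registered name

print(run("my_new_method", weights=["w1", "w2"]))
# → my_new_method(4-bit) applied to 2 tensors
```

The pipeline dispatches by name, so adding a method means registering one function rather than rebuilding the surrounding infrastructure, which is the benefit the article attributes to OneComp's extensible design.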
What’s Next
The arXiv submission does not include benchmark tables comparing OneComp’s output quality against manually tuned baselines across specific model families or hardware targets. The authors’ claim of minimal performance degradation therefore remains unverified by independent evaluation at the time of publication, a gap that peer review or community benchmarking will need to close.
The paper positions OneComp as infrastructure for bridging compression research and production deployment. Calibration data requirements — a known variable affecting quantization quality across domains — are acknowledged as a pipeline component but are not detailed in the abstract. The source code is described as open-source; the full author list, led by Yuma Ichikawa, is available in the arXiv submission.
