A preprint submitted to arXiv on March 28, 2026 by Ramtin Zargari Marandi directly measures how debate protocol structure affects the performance of multi-agent debate (MAD) systems, and reports that protocol design has a measurable effect on outcomes that is independent of the underlying model. The paper, available on arXiv (ID: 2603.28813) under the title “The impact of multi-agent debate protocols on debate quality: a controlled case study,” presents a controlled macroeconomic case study designed to isolate protocol effects from model effects. As of this article's publication, the paper has not undergone peer review.
- The novel Rank-Adaptive Cross-Round (RA-CR) protocol achieves faster consensus convergence than standard Cross-Round (CR) in a controlled macroeconomic setting spanning 20 events and five random seeds.
- A direct trade-off exists between peer-referencing rate and convergence speed — higher interaction does not produce faster consensus.
- The No-Interaction (NI) baseline maximizes argument diversity; all three interaction protocols reduce diversity as agents align their outputs.
- The study controlled model variables using matched prompts and decoding settings, isolating protocol structure as the independent variable.
What Happened
Marandi published the preprint on March 28, 2026, introducing a controlled methodology for evaluating multi-agent debate protocols. The study compares three interaction protocols — Within-Round (WR), Cross-Round (CR), and the novel Rank-Adaptive Cross-Round (RA-CR) — against a No-Interaction (NI) baseline across 20 diverse macroeconomic events. Prior MAD research typically held the protocol fixed while varying model parameters, making it impossible to determine whether observed performance improvements came from model quality or protocol structure. This study inverts that approach by holding model factors constant and varying only the protocol.
Why It Matters
Multi-agent debate systems route queries through multiple LLM instances that critique and refine each other’s outputs across structured rounds. The approach has shown promise for improving reasoning quality on complex tasks, but published evaluations have generally benchmarked models rather than protocol architecture — leaving open the question of which design choices independently drive performance.
Without separating those dimensions, researchers cannot determine whether a new protocol actually improves debate quality or simply runs on top of a stronger underlying model. Marandi’s controlled design fills that methodological gap with a reproducible, domain-specific case study that treats protocol structure as the variable under investigation.
Technical Details
The four conditions differ in how much prior context agents can access and whether any filtering mechanism is applied. Within-Round (WR) agents see only the current round’s contributions from peers. Cross-Round (CR) agents retain full context from all previous rounds. The paper’s novel contribution, Rank-Adaptive Cross-Round (RA-CR), introduces an external judge model that ranks agent outputs each round, dynamically reorders agents based on those rankings, and silences the lowest-ranked agent per round. The No-Interaction (NI) baseline has agents produce independent responses with no peer visibility.
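The differences in context visibility and the rank-adaptive step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Turn` record, the function names, and the `judge` callable (any function that scores an output) are hypothetical, and the sketch assumes CR and RA-CR agents can read everything said so far.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Turn:
    """One agent contribution (hypothetical record type for this sketch)."""
    round_index: int
    agent: str
    text: str


def visible_context(protocol: str, history: List[Turn], current_round: int) -> List[Turn]:
    """Return the turns an agent is allowed to read before it speaks."""
    if protocol == "NI":
        return []  # independent responses, no peer visibility
    if protocol == "WR":
        # only contributions made in the current round
        return [t for t in history if t.round_index == current_round]
    if protocol in ("CR", "RA-CR"):
        # assumption: everything said so far, including earlier rounds
        return list(history)
    raise ValueError(f"unknown protocol: {protocol}")


def rank_adaptive_step(
    outputs: Dict[str, str], judge: Callable[[str], float]
) -> Tuple[List[str], Optional[str]]:
    """Order agents by judge score (best first) and silence the lowest-ranked one."""
    ranked = sorted(outputs, key=lambda agent: judge(outputs[agent]), reverse=True)
    if len(ranked) < 2:
        return ranked, None  # nothing to silence with fewer than two agents
    return ranked[:-1], ranked[-1]


# Usage with a stand-in judge that simply scores by text length.
history = [Turn(0, "agent_a", "opening view"), Turn(1, "agent_b", "rebuttal")]
wr_view = visible_context("WR", history, current_round=1)
active, silenced = rank_adaptive_step(
    {"agent_a": "xx", "agent_b": "x", "agent_c": "xxx"}, judge=len
)
```

In a real system the judge would be a separate model call, which is where the extra dependency of RA-CR comes from.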
In the controlled macroeconomic case study (20 diverse events, five random seeds, matched prompts and decoding settings), RA-CR converged to consensus faster than CR, while WR produced higher rates of peer-referencing. NI, which involves no peer interaction and is therefore unaffected by the interaction protocols, maximized argument diversity across all rounds.
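One way to picture this design is as a grid of runs in which only the protocol label changes. The sketch below assumes a fully crossed design (every protocol run on every event and seed), which the article does not state explicitly; the function name and the values in `SHARED_SETTINGS` are hypothetical placeholders, not settings reported by the paper.

```python
from itertools import product

PROTOCOLS = ("NI", "WR", "CR", "RA-CR")  # the four conditions named in the study
NUM_EVENTS = 20
NUM_SEEDS = 5

# Hypothetical placeholder values; the actual prompts and decoding
# settings are not given in the article.
SHARED_SETTINGS = {"prompt_template": "shared-template", "temperature": 0.7}


def build_run_grid(num_events, num_seeds, protocols, settings):
    """Enumerate every (event, seed, protocol) run with identical settings."""
    runs = []
    for event_id, seed, protocol in product(
        range(num_events), range(num_seeds), protocols
    ):
        runs.append(
            {
                "event_id": event_id,
                "seed": seed,
                "protocol": protocol,  # the only factor that varies per cell
                "settings": settings,  # same object for every run
            }
        )
    return runs


runs = build_run_grid(NUM_EVENTS, NUM_SEEDS, PROTOCOLS, SHARED_SETTINGS)
print(len(runs))  # 20 events * 5 seeds * 4 protocols = 400
```

Because every run shares the same settings object, any difference between cells that match on event and seed can be attributed to the protocol.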
“These results reveal a trade-off between interaction (peer-referencing rate) and convergence (consensus formation), confirming protocol design matters,” Marandi writes in the paper. When consensus formation is the priority metric, RA-CR outperforms all three other conditions tested.
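The two sides of that trade-off can be made concrete with simple proxy metrics. The definitions below are assumptions made for illustration only, since the article does not give the paper's exact formulas: peer-referencing is approximated as the share of turns that name another agent, and convergence as the first round in which all agents hold the same stance.

```python
from typing import Dict, List, Optional


def peer_reference_rate(turns: List[Dict[str, str]], agents: List[str]) -> float:
    """Fraction of turns whose text names at least one other agent (proxy metric)."""
    if not turns:
        return 0.0
    hits = 0
    for turn in turns:
        others = [name for name in agents if name != turn["agent"]]
        if any(name in turn["text"] for name in others):
            hits += 1
    return hits / len(turns)


def first_consensus_round(stances_by_round: List[List[str]]) -> Optional[int]:
    """Index of the first round where every agent holds the same stance, else None."""
    for round_index, stances in enumerate(stances_by_round):
        if stances and len(set(stances)) == 1:
            return round_index
    return None


# Usage with toy data.
agents = ["agent_a", "agent_b"]
turns = [
    {"agent": "agent_a", "text": "I agree with agent_b on inflation."},
    {"agent": "agent_b", "text": "Rates will rise."},
]
rate = peer_reference_rate(turns, agents)
converged_at = first_consensus_round([["up", "down"], ["up", "up"]])
```

Under these proxies, a protocol can score high on one metric and low on the other, which is the pattern the quoted trade-off describes.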
Who’s Affected
Researchers and engineers designing multi-agent reasoning pipelines face concrete protocol choices: how many agents to include, how many rounds to run, what aggregation rule to apply, and how much prior context agents retain. This paper provides a methodological baseline for those decisions, specifically in analytical domains such as macroeconomic forecasting where rapid consensus formation is more desirable than maximizing output diversity.
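Those choices map naturally onto an explicit configuration object, which makes each one visible and easy to vary. The following is a hypothetical sketch: the `DebateConfig` fields, the allowed values, and the majority-vote rule are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

CONTEXT_MODES = ("none", "current_round", "all_rounds")


@dataclass(frozen=True)
class DebateConfig:
    """Hypothetical container for the protocol choices listed above."""
    num_agents: int
    num_rounds: int
    aggregation: str  # only "majority" is implemented in this sketch
    context: str      # how much prior context agents retain

    def __post_init__(self):
        if self.num_agents < 2:
            raise ValueError("a debate needs at least two agents")
        if self.num_rounds < 1:
            raise ValueError("a debate needs at least one round")
        if self.context not in CONTEXT_MODES:
            raise ValueError(f"unknown context mode: {self.context}")


def aggregate(config: DebateConfig, final_answers: List[str]) -> str:
    """Combine the agents' final answers according to the configured rule."""
    if config.aggregation == "majority":
        return Counter(final_answers).most_common(1)[0][0]
    raise ValueError(f"unsupported aggregation rule: {config.aggregation}")


# Usage: three agents, two rounds, majority vote over final answers.
config = DebateConfig(num_agents=3, num_rounds=2, aggregation="majority", context="all_rounds")
decision = aggregate(config, ["hike", "hold", "hike"])
```

Keeping these settings in one frozen object also makes it easier to hold everything else constant while changing a single protocol dimension.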
Teams using multi-agent orchestration approaches for structured analysis or advisory tasks are the most direct audience. The RA-CR protocol introduces an external judge model dependency — a layer of infrastructure not present in simpler WR or CR designs — which carries latency and cost implications the paper does not fully quantify at scale.
What’s Next
The study is scoped to a single macroeconomic case study, and Marandi does not claim the findings generalize to other task types or to domains with higher answer ambiguity. The RA-CR protocol’s convergence advantage depends on the external judge model’s ranking accuracy, introducing a failure mode not present in the other protocols.
Marandi identifies the interaction-convergence trade-off as a core design consideration for future MAD research. Whether RA-CR’s advantage holds across longer deliberation chains, different domain types, or varying numbers of agents remains an open empirical question for follow-on work.