A team of eight researchers submitted a new retrieval-augmented generation framework to arXiv on March 30, 2026, targeting a documented reliability failure in systems that answer questions requiring evidence from multiple documents. The paper, titled PAR²-RAG: Planned Active Retrieval and Reasoning for Multi-Hop Question Answering, is authored by Xingyu Li, Rongguang Wang, Yuying Wang, Mengqing Guo, Chenyang Li, Tao Sheng, Sujith Ravi, and Dan Roth.
- PAR²-RAG uses a two-stage design separating broad evidence gathering (breadth-first anchoring) from focused refinement (depth-first reasoning), addressing two distinct failure modes in existing systems.
- Evaluated across four multi-hop QA benchmarks, it achieved up to 23.5% higher accuracy than IRCoT, a leading iterative retrieval baseline.
- Retrieval quality, measured in NDCG, improved by up to 10.5% over IRCoT in the same benchmark evaluations.
- The paper was submitted to arXiv and has not yet undergone peer review.
What Happened
Xingyu Li, Dan Roth, and six co-authors published PAR²-RAG on arXiv on March 30, 2026, proposing a framework to improve multi-hop question answering (MHQA) — tasks where a model must retrieve and connect evidence across multiple documents to produce a single answer. The paper evaluates the system against existing state-of-the-art baselines across four MHQA benchmarks and reports that it outperforms all of them. The authors frame the contribution as a solution to two failure modes that prior systems address only in isolation.
Why It Matters
Multi-hop QA has been a persistent weak point for large language models even as their general capabilities have expanded. The difficulty is not purely generative: it lies in retrieving the right chain of evidence before any reasoning begins, then adapting as that evidence shifts the requirements for subsequent retrieval steps.
Two architectural families have tried to solve this. Iterative retrieval systems retrieve and reason in a loop, but they risk locking onto an early low-recall trajectory — once the first retrieval step returns poor documents, each subsequent step compounds the error. Planning-only systems generate an upfront query plan, which avoids early lock-in but produces a static set of queries that cannot adapt when intermediate findings change what still needs to be retrieved.
PAR²-RAG attempts to resolve both problems within a single framework by structuring retrieval in two sequential stages with different objectives.
Technical Details
The authors describe PAR²-RAG as “a two-stage framework that separates coverage from commitment.” The first stage — breadth-first anchoring — sweeps broadly across potentially relevant documents to establish what the paper calls a “high-recall evidence frontier” before any answer reasoning is committed. This stage prioritizes recall over precision, deliberately collecting a wider evidence set than may ultimately be needed.
The second stage applies depth-first refinement within an iterative loop, using an evidence sufficiency control mechanism to determine when the accumulated evidence is adequate to support a final answer. Rather than retrieving until a fixed number of documents is reached, the system continues refining until a sufficiency threshold is met, at which point it exits the loop and generates the answer.
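The abstract does not specify how the two stages are implemented, but the control flow they describe can be sketched in outline. The function and parameter names below (broad_retrieve, refine_retrieve, is_sufficient, breadth_k, max_rounds) are illustrative placeholders, not names from the paper, and the round budget is an assumption added so the loop terminates:

```python
from typing import Callable, List

def two_stage_answer(
    question: str,
    broad_retrieve: Callable[[str, int], List[str]],        # stage 1: high-recall sweep
    refine_retrieve: Callable[[str, List[str]], List[str]], # stage 2: targeted follow-up
    is_sufficient: Callable[[str, List[str]], bool],        # evidence sufficiency control
    generate: Callable[[str, List[str]], str],
    breadth_k: int = 20,
    max_rounds: int = 5,
) -> str:
    """Hypothetical sketch of the two-stage flow described in the abstract."""
    # Stage 1: breadth-first anchoring - deliberately over-collect to build
    # a high-recall evidence set before committing to any reasoning path.
    evidence = broad_retrieve(question, breadth_k)

    # Stage 2: depth-first refinement - keep retrieving until the sufficiency
    # check passes (or a round budget runs out), rather than stopping at a
    # fixed document count.
    for _ in range(max_rounds):
        if is_sufficient(question, evidence):
            break
        evidence.extend(refine_retrieve(question, evidence))

    return generate(question, evidence)
```

The key structural point is that the exit condition is a sufficiency judgment over the accumulated evidence, not a fixed retrieval depth — which is what distinguishes this loop from a plain top-k retriever.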
Compared to IRCoT — Interleaving Retrieval with Chain-of-Thought Reasoning, a widely cited iterative retrieval baseline — PAR²-RAG achieved up to 23.5% higher accuracy and up to 10.5% higher NDCG across the four benchmarks tested. NDCG, or Normalized Discounted Cumulative Gain, measures retrieval quality by rewarding systems that rank the most relevant documents highest; a 10.5% improvement indicates meaningfully better document ranking, not just a larger retrieved set.
The abstract does not list the four benchmarks by name, and full result breakdowns by dataset are not available from the published abstract alone.
Who’s Affected
The work is most directly relevant to engineers and researchers building RAG pipelines for tasks that require chained document retrieval — enterprise knowledge base search, legal due diligence tools, biomedical literature review, and any system where a single query must synthesize evidence from multiple independent sources.
IRCoT, one of the baselines PAR²-RAG outperforms, has been adopted in a number of research and production QA systems since its publication, making the accuracy comparison directly actionable for teams currently benchmarking retrieval architectures.
What’s Next
The paper was submitted to arXiv on March 30, 2026, and has not yet undergone peer review. The abstract does not announce a public code release or identify a target conference venue. Until peer review is complete, the reported accuracy figures should be treated as the authors’ own benchmark results rather than independently verified findings.
Key open questions include how the evidence sufficiency control mechanism behaves on out-of-distribution documents, and whether the two-stage overhead affects latency in production settings — neither of which is addressed in the abstract.
