A team of four researchers submitted AEC-Bench, a multimodal benchmark designed to evaluate agentic AI systems on real-world tasks across the architecture, engineering, and construction (AEC) sector, to arXiv on March 31, 2026. The paper, authored by Harsh Mankodiya, Chase Gallik, Theodoros Galanos, and Andriy Mulyar, covers the benchmark’s dataset taxonomy, evaluation protocol, and baseline results across several domain-specific foundation model harnesses, and makes all resources available under an open license.
- AEC-Bench evaluates AI agents on three task types: drawing understanding, cross-sheet reasoning, and construction project-level coordination.
- Baseline results were produced across multiple foundation model harnesses, explicitly including Claude Code and Codex.
- The researchers identified harness design techniques that consistently improve performance across different foundation models.
- The complete benchmark dataset, agent harness, and evaluation code are released under an Apache 2.0 license for full replicability.
What Happened
Harsh Mankodiya, Chase Gallik, Theodoros Galanos, and Andriy Mulyar submitted AEC-Bench (arXiv:2603.29199) on March 31, 2026. The paper introduces the first dedicated multimodal benchmark for evaluating agentic AI systems in the architecture, engineering, and construction domain. It describes the benchmark’s motivation, dataset taxonomy, evaluation protocol, and baseline results, and releases all associated resources openly for independent replication.
Why It Matters
The AEC domain presents challenges that general-purpose AI benchmarks do not capture. Professional workflows require reading technical construction drawings, coordinating information across multiple sheets of project documentation, and reasoning about construction sequencing — tasks that combine visual understanding, cross-document inference, and domain-specific knowledge. Without a purpose-built benchmark, there has been no reliable way to measure how well agentic AI systems perform against these real-world demands.
Domain-specific AI benchmarks have emerged in legal, medical, and software engineering contexts over the past year. AEC-Bench extends this pattern to the built-environment sector, where AI adoption in design and construction workflows has accelerated but evaluation tooling has lagged behind deployment.
Technical Details
The benchmark is organized around three task categories. Drawing understanding tests whether agents can correctly interpret technical construction drawings, a task that requires visual parsing of structured diagrams with domain-specific notation. Cross-sheet reasoning requires agents to reconcile information distributed across multiple project documents, modeling the real-world challenge of tracking specifications, plans, and schedules that reference one another. Construction project-level coordination introduces higher-order task demands, where agents must manage interdependencies across a full project scope.
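The paper's actual dataset schema is not reproduced in this summary. As a purely illustrative sketch, the three task categories could be modeled along the following lines; every name here (`TaskCategory`, `Task`, the field names, the example sheet identifiers) is a hypothetical stand-in, not the benchmark's real schema.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical modeling of AEC-Bench's three task categories; names and
# fields are illustrative assumptions, not the published dataset schema.
class TaskCategory(Enum):
    DRAWING_UNDERSTANDING = "drawing_understanding"  # parse one technical drawing
    CROSS_SHEET_REASONING = "cross_sheet_reasoning"  # reconcile info across sheets
    PROJECT_COORDINATION = "project_coordination"    # manage project-wide interdependencies

@dataclass
class Task:
    task_id: str
    category: TaskCategory
    sheet_paths: list[str]  # one sheet for drawing tasks, several otherwise
    question: str
    reference_answer: str

def requires_multiple_sheets(task: Task) -> bool:
    """Cross-sheet and coordination tasks span more than one document."""
    return task.category is not TaskCategory.DRAWING_UNDERSTANDING

example = Task(
    task_id="cs-001",
    category=TaskCategory.CROSS_SHEET_REASONING,
    sheet_paths=["A-101.pdf", "S-201.pdf"],  # hypothetical sheet names
    question="Does the beam depth on S-201 match the ceiling clearance on A-101?",
    reference_answer="yes",
)
print(requires_multiple_sheets(example))  # True
```

The point of the sketch is the structural distinction the paper draws: drawing-understanding items are single-document, while the other two categories are defined by cross-document scope.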
The authors evaluated performance across “several domain-specific foundation model harnesses,” with Claude Code and Codex explicitly cited as tested systems. A central result is captured in the authors’ own description: “We use AEC-Bench to identify consistent tools and harness design techniques that uniformly improve performance across foundation models in their own base harnesses, such as Claude Code and Codex.” This indicates the benchmark surfaced design patterns applicable across different model families rather than optimizations tied to a single system. Specific numeric performance scores for each harness are reported in the paper body; the abstract does not disclose them.
Who’s Affected
Developers building AI agents for architecture firms, engineering consultancies, and construction companies gain a structured evaluation framework. Rather than relying on informal testing or general-purpose benchmarks, teams can now measure agent performance against tasks that reflect actual professional AEC workflows.
Claude Code and Codex were used as explicit baseline harnesses in the evaluation, meaning developers using either system for AEC applications now have a published performance baseline to compare against. Other foundation model teams can use the openly released harness and evaluation code to benchmark their own configurations against the same task set.
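The released harness and evaluation code are not shown here. As a hedged illustration of what benchmarking a configuration against a fixed task set involves, a minimal scoring loop might look like the following; `stub_harness`, the task list, and the exact-match scoring rule are all placeholder assumptions, not the benchmark's real API or metric.

```python
from typing import Callable

# Illustrative only: the scoring rule and harness interface are stand-ins,
# not AEC-Bench's published evaluation code.
def exact_match(prediction: str, reference: str) -> float:
    """Toy scoring rule: 1.0 on a normalized exact match, else 0.0."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0

def evaluate(harness: Callable[[str], str], tasks: list[tuple[str, str]]) -> float:
    """Mean score of a harness over (question, reference_answer) pairs."""
    scores = [exact_match(harness(question), reference) for question, reference in tasks]
    return sum(scores) / len(scores)

# Placeholder standing in for a real agent harness run (e.g. Claude Code or Codex).
def stub_harness(question: str) -> str:
    return "yes"

tasks = [
    ("Does sheet S-201 reference detail 5/A-501?", "yes"),
    ("Is the slab thickness consistent across sheets?", "no"),
]
print(evaluate(stub_harness, tasks))  # 0.5
```

Holding the task set fixed and swapping the harness function is what makes baselines from different model families directly comparable, which is the replication workflow the open release enables.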
What’s Next
The full benchmark dataset, agent harness, and evaluation code have been released under an Apache 2.0 license at the URL provided in the paper, enabling any research team to replicate or extend the evaluation. The paper characterizes the results as baselines, suggesting AEC-Bench is intended as an ongoing evaluation framework rather than a one-time publication. The dataset taxonomy described in the paper provides a structural foundation for expanding task coverage to additional AEC subdisciplines, though no formal roadmap for benchmark updates was described in the abstract.