RESEARCH

Naver Builds First Location-Grounded World Model Using 1.2 Million Seoul Street View Images

megaone_admin · Mar 29, 2026 · 2 min read
Engine Score 7/10 — Important

Naver's world model grounds city generation in real Street View data to prevent AI hallucinations, a novel technical approach with implications for autonomous driving and digital twin applications.


Naver has published research on the Seoul World Model, described as the first video world model tied to a real physical location rather than a fictional environment. The model takes geographic coordinates, a desired camera movement, and a text prompt as input, retrieves the nearest Street View images, and generates video step by step while remaining spatially faithful to the actual city layout.
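The retrieval step of the pipeline above can be sketched as a nearest-neighbor lookup over geotagged panoramas. This is a minimal illustration, not Naver's implementation; the `nearest_panoramas` helper and the `(pano_id, lat, lon)` index structure are assumptions for the sketch.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 coordinates."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_panoramas(query_lat, query_lon, panoramas, k=3):
    """Return the k Street View captures closest to the query coordinate.

    `panoramas` is a list of (pano_id, lat, lon) tuples -- a stand-in for an
    index over captures taken every 5 to 20 meters along each street.
    """
    ranked = sorted(
        panoramas,
        key=lambda p: haversine_m(query_lat, query_lon, p[1], p[2]),
    )
    return ranked[:k]
```

A production system would use a spatial index rather than a full sort, but the interface, coordinates in and nearby reference captures out, is the same.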

The model is built on Nvidia’s Cosmos-Predict2.5-2B diffusion transformer architecture with 2 billion parameters, trained on 24 Nvidia H100 GPUs. The training data combines 440,000 Seoul Street View images drawn from a pool of 1.2 million panoramic captures taken every 5 to 20 meters by Naver Map, 12,700 synthetic videos generated using the CARLA simulator on Unreal Engine, and driving data from the public Waymo dataset.

A core technical challenge the model addresses is that Street View snapshots are static and contain transient objects, such as parked cars and pedestrians, that should not persist in a generated dynamic scene. Naver's solution, cross-temporal pairing, combines reference images and target sequences from different recording times during training, teaching the model to distinguish permanent structures from temporary objects.
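The pairing idea can be sketched as follows. This is an illustrative reading of the description, not Naver's code; the `captures_by_time` mapping and the pair-construction loop are assumptions for the sketch.

```python
import itertools
import random

def cross_temporal_pairs(captures_by_time, rng=None):
    """Build (reference, target) training pairs from different recording times.

    `captures_by_time` maps a recording date to the images captured at the
    same location on that date (hypothetical structure). Because transient
    objects -- parked cars, pedestrians -- differ between dates, a reference
    from one date only explains a target from another through the permanent
    structures they share, which is what the model should learn to rely on.
    """
    rng = rng or random.Random(0)
    pairs = []
    for ref_date, tgt_date in itertools.permutations(captures_by_time, 2):
        reference = rng.choice(captures_by_time[ref_date])
        target_sequence = captures_by_time[tgt_date]
        pairs.append((reference, target_sequence))
    return pairs
```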

The system uses a dual-path integration approach. A geometric path projects reference images into the target camera perspective using depth maps to establish spatial layout. A semantic path encodes reference images into latent representations to capture appearance details. A virtual lookahead mechanism retrieves Street View images ahead on the route as error-free landmarks, preventing the accumulated drift that typically degrades long-distance generation in world models.
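The lookahead mechanism can be sketched as periodic re-anchoring during rollout. This is a minimal sketch under stated assumptions: `generate_chunk` stands in for the diffusion model's conditional generation step, and the names and stride value are hypothetical, not Naver's API.

```python
def generate_with_lookahead(route_panoramas, generate_chunk, stride=4):
    """Roll out a long video, periodically re-anchoring on real captures.

    `route_panoramas` are ground-truth Street View captures ordered along
    the route; `generate_chunk(anchor, n)` generates n frames conditioned on
    an anchor frame (hypothetical stand-in for the model). Re-anchoring
    every `stride` steps on an error-free real capture bounds the drift that
    a purely autoregressive rollout would accumulate.
    """
    frames = []
    for anchor in route_panoramas[::stride]:
        # The real capture ahead on the route serves as an error-free landmark.
        frames.extend(generate_chunk(anchor, stride))
    return frames
```

The design contrast with generic world models is that the anchors come from a real geographic index rather than from the model's own (drifting) outputs.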

The research has direct applications in autonomous driving simulation, urban planning visualization, and game environment generation. By grounding video generation in real geographic data rather than learned distributions of generic scenes, the model avoids the hallucination of nonexistent buildings, roads, and city features that limits current world models for safety-critical applications like autonomous vehicle testing.



MegaOne AI Editorial Team

MegaOne AI monitors 200+ sources daily to identify and score the most important AI developments. Every story is fact-checked, linked to primary sources, and rated using our six-factor Engine Score methodology.
