RESEARCH

Naver Builds First Location-Grounded World Model Using 1.2 Million Seoul Street View Images

megaone_admin · Mar 29, 2026 · 2 min read
Engine Score 7/10 — Important

Naver's world model grounds city generation in real Street View data to prevent AI hallucinations, a novel technical approach with implications for autonomous driving and digital twin applications.


Naver has published research on the Seoul World Model, described as the first video world model tied to a real physical location rather than a fictional environment. The model takes geographic coordinates, a desired camera movement, and a text prompt as input, retrieves the nearest Street View images, and generates video step by step while remaining spatially faithful to the actual city layout.
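The retrieval step of the pipeline above can be sketched as a nearest-neighbor lookup over geotagged panoramas. This is a minimal illustration, not Naver's implementation; the `nearest_panoramas` helper and the `(pano_id, lat, lon)` index structure are assumptions for the sketch.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 coordinates."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_panoramas(query_lat, query_lon, panoramas, k=3):
    """Return the k Street View captures closest to the query coordinate.

    `panoramas` is a list of (pano_id, lat, lon) tuples -- a stand-in for an
    index over captures taken every 5 to 20 meters along each street.
    """
    ranked = sorted(
        panoramas,
        key=lambda p: haversine_m(query_lat, query_lon, p[1], p[2]),
    )
    return ranked[:k]
```

A production system would use a spatial index rather than a full sort, but the interface, coordinates in and nearby reference captures out, is the same.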

The model is built on Nvidia’s Cosmos-Predict2.5-2B diffusion transformer architecture with 2 billion parameters, trained on 24 Nvidia H100 GPUs. The training data combines 440,000 Seoul Street View images drawn from a pool of 1.2 million panoramic captures taken every 5 to 20 meters by Naver Map, 12,700 synthetic videos generated using the CARLA simulator on Unreal Engine, and driving data from the public Waymo dataset.

A core technical challenge the model addresses is that Street View snapshots are static and contain transient objects, such as parked cars and pedestrians, that should not persist in a generated dynamic scene. Naver's solution, cross-temporal pairing, combines reference images and target sequences from different recording times during training, teaching the model to distinguish permanent structures from temporary objects.
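The pairing idea can be sketched as follows. This is an illustrative reading of the description, not Naver's code; the `captures_by_time` mapping and the pair-construction loop are assumptions for the sketch.

```python
import itertools
import random

def cross_temporal_pairs(captures_by_time, rng=None):
    """Build (reference, target) training pairs from different recording times.

    `captures_by_time` maps a recording date to the images captured at the
    same location on that date (hypothetical structure). Because transient
    objects -- parked cars, pedestrians -- differ between dates, a reference
    from one date only explains a target from another through the permanent
    structures they share, which is what the model should learn to rely on.
    """
    rng = rng or random.Random(0)
    pairs = []
    for ref_date, tgt_date in itertools.permutations(captures_by_time, 2):
        reference = rng.choice(captures_by_time[ref_date])
        target_sequence = captures_by_time[tgt_date]
        pairs.append((reference, target_sequence))
    return pairs
```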

The system uses a dual-path integration approach. A geometric path projects reference images into the target camera perspective using depth maps to establish spatial layout. A semantic path encodes reference images into latent representations to capture appearance details. A virtual lookahead mechanism retrieves Street View images ahead on the route as error-free landmarks, preventing the accumulated drift that typically degrades long-distance generation in world models.
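The lookahead mechanism can be sketched as periodic re-anchoring during rollout. This is a minimal sketch under stated assumptions: `generate_chunk` stands in for the diffusion model's conditional generation step, and the names and stride value are hypothetical, not Naver's API.

```python
def generate_with_lookahead(route_panoramas, generate_chunk, stride=4):
    """Roll out a long video, periodically re-anchoring on real captures.

    `route_panoramas` are ground-truth Street View captures ordered along
    the route; `generate_chunk(anchor, n)` generates n frames conditioned on
    an anchor frame (hypothetical stand-in for the model). Re-anchoring
    every `stride` steps on an error-free real capture bounds the drift that
    a purely autoregressive rollout would accumulate.
    """
    frames = []
    for anchor in route_panoramas[::stride]:
        # The real capture ahead on the route serves as an error-free landmark.
        frames.extend(generate_chunk(anchor, stride))
    return frames
```

The design contrast with generic world models is that the anchors come from a real geographic index rather than from the model's own (drifting) outputs.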

The research has direct applications in autonomous driving simulation, urban planning visualization, and game environment generation. By grounding video generation in real geographic data rather than learned distributions of generic scenes, the model avoids the hallucination of nonexistent buildings, roads, and city features that limits current world models for safety-critical applications like autonomous vehicle testing.



MegaOne AI Editorial Team

MegaOne AI monitors 200+ sources daily to identify and score the most important AI developments. Every story is fact-checked, linked to primary sources, and rated using our six-factor Engine Score methodology.
