Naver, the South Korean internet company, published research on March 29, 2026 introducing the Seoul World Model (SWM), a video generation system anchored in real city geometry drawn from 1.2 million panoramic Street View captures of Seoul. Jonathan Kemper, reporting for The Decoder, noted that the researchers call it “the first world model tied to a real physical location.” Unlike existing video world models, which generate plausible but entirely fictional environments, SWM retrieves actual street-level imagery as spatial anchors during generation.
- SWM draws on 1.2 million Naver Map Street View panoramas to ground video output in real Seoul geography, retrieved via geographic coordinate input.
- A cross-temporal pairing mechanism trains the model to distinguish permanent structures like building facades from transient objects like parked cars and pedestrians.
- In benchmarks, SWM outperformed six current video world models on both visual quality and temporal consistency.
- The model generalized to Busan, South Korea, and Ann Arbor, Michigan, without any additional fine-tuning or new data collection.
What Happened
Researchers from Naver and Naver Cloud published a paper describing the Seoul World Model, a location-grounded video generation system that takes geographic coordinates, a desired camera path, and a text prompt as input. The model queries a database of 1.2 million panoramic images from Naver Map, retrieves the Street View images nearest the requested route, and uses them as visual guides for step-by-step video generation. Jonathan Kemper’s March 29, 2026 report at The Decoder details the paper and its methodology.
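To make the retrieval step concrete, here is a minimal sketch of nearest-panorama lookup along a requested camera path. Everything in it is illustrative: the index layout, the function names, and the brute-force scan are assumptions rather than the paper's implementation, which would presumably use a spatial index over the 1.2 million entries.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 coordinates."""
    r = 6_371_000  # mean Earth radius, meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def retrieve_anchors(camera_path, pano_index, k=1):
    """For each (lat, lon) waypoint on the requested camera path, return
    the k nearest panoramas to serve as spatial anchors for generation.
    Brute-force scan for clarity only; names here are hypothetical."""
    anchors = []
    for lat, lon in camera_path:
        ranked = sorted(pano_index,
                        key=lambda p: haversine_m(lat, lon, p["lat"], p["lon"]))
        anchors.append(ranked[:k])
    return anchors

# Usage with a toy two-panorama index near Seoul City Hall:
index = [{"id": "pano_a", "lat": 37.5665, "lon": 126.9780},
         {"id": "pano_b", "lat": 37.5670, "lon": 126.9790}]
print(retrieve_anchors([(37.5666, 126.9782)], index))
```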
The system also accepts text-prompt modifications to generated footage: users can instruct the model to add burning vehicles or insert fictional elements into street-level scenes, while the model preserves the underlying spatial accuracy of the real route.
Why It Matters
Previous video world models produce visually convincing output, but everything beyond the starting frame — unseen intersections, distant building facades, continuation of a route — is hallucinated from learned distributions of generic scenes rather than actual geography. This limits their use in safety-critical simulation, particularly for testing autonomous vehicle perception systems against specific real-world road layouts, where a model inventing nonexistent infrastructure introduces unacceptable risk.
Naver operates South Korea’s dominant search engine and its own mapping service, Naver Map, which collects street-level panoramas in a format comparable to Google Maps. That existing data infrastructure gives the company a direct path to the imagery SWM requires, without building a separate data collection pipeline.
Technical Details
SWM is built on Nvidia’s Cosmos-Predict2.5-2B diffusion transformer architecture, a 2-billion-parameter model trained on 24 Nvidia H100 GPUs. Training data combines three distinct sources: 440,000 Seoul Street View panoramas drawn from a pool of 1.2 million images captured every 5 to 20 meters, 12,700 synthetic videos generated with the CARLA simulator running on Unreal Engine, and real-world driving footage from the public Waymo dataset.
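As a reading aid, the reported data mix can be summarized in a configuration-style snippet. The field names are invented here; only the counts, spacing, and sources come from the reporting, and the Waymo clip count was not given.

```python
# Illustrative summary of the reported training mix; keys are invented.
TRAINING_MIX = {
    "seoul_street_view": {
        "clips": 440_000,               # sampled from a 1.2M-panorama pool
        "capture_interval_m": (5, 20),  # spacing between panorama captures
        "source": "Naver Map Street View",
    },
    "carla_synthetic": {
        "clips": 12_700,                # rendered with CARLA on Unreal Engine
        "source": "CARLA simulator",
    },
    "waymo_driving": {
        "clips": None,                  # count not stated in the source
        "source": "public Waymo dataset",
    },
}
```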
The model uses a dual-path integration architecture. A geometric path projects reference images into the target camera perspective using depth maps, establishing the spatial layout of the generated scene. A semantic path encodes reference images into latent representations to preserve appearance details such as building textures and road surface markings.
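The geometric path can be pictured as a depth-based reprojection: lift each reference pixel into 3D using its depth value, transform it into the target camera's frame, and project it back onto the image plane. The toy NumPy sketch below shows that operation under an assumed pinhole camera model; it is not the paper's implementation, and the semantic path (encoding the reference image into a latent appearance code) is omitted.

```python
import numpy as np

def geometric_warp(ref_img, ref_depth, K, T_ref_to_tgt):
    """Toy depth-based reprojection illustrating the 'geometric path'.
    ref_img: (H, W, 3) reference image; ref_depth: (H, W) positive depths;
    K: 3x3 pinhole intrinsics; T_ref_to_tgt: 4x4 reference-to-target pose.
    Nearest-neighbor splatting only; a real pipeline would also handle
    occlusion, holes, and z-buffering."""
    h, w = ref_depth.shape
    out = np.zeros_like(ref_img)
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(float)
    rays = (np.linalg.inv(K) @ pix.T).T            # back-projected pixel rays
    pts = rays * ref_depth.reshape(-1, 1)          # 3D points in the reference frame
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    pts_tgt = (T_ref_to_tgt @ pts_h.T).T[:, :3]    # 3D points in the target frame
    uvw = (K @ pts_tgt.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]                  # perspective divide
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    ok = (uvw[:, 2] > 0) & (0 <= u) & (u < w) & (0 <= v) & (v < h)
    out[v[ok], u[ok]] = ref_img.reshape(-1, ref_img.shape[-1])[ok]
    return out
```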
A virtual lookahead mechanism retrieves Street View images farther ahead along the planned route and uses them as error-free landmarks during generation. This prevents the accumulated positional drift that typically degrades spatial coherence when current world models generate video over long distances.
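A sketch of the lookahead idea, reusing haversine_m and the panorama index from the retrieval example above. The 50-meter horizon and all names are assumptions, not values from the paper.

```python
def lookahead_anchor(route, pano_index, current_idx, horizon_m=50.0):
    """Walk forward along the planned route until roughly horizon_m
    meters ahead of the current waypoint, then fetch the nearest
    panorama there. Generation can be pulled back toward this
    drift-free landmark instead of compounding pose error frame by frame."""
    traveled, i = 0.0, current_idx
    while i + 1 < len(route) and traveled < horizon_m:
        traveled += haversine_m(*route[i], *route[i + 1])
        i += 1
    lat, lon = route[i]
    return min(pano_index,
               key=lambda p: haversine_m(lat, lon, p["lat"], p["lon"]))
```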
Working with real-world images introduced a challenge absent from purely synthetic approaches: Street View captures include parked cars and pedestrians that bear no relation to the dynamic scene being generated. Without a fix, the model would copy these incidental objects directly into output frames. The solution, cross-temporal pairing, trains the model on reference images and target sequences captured at different times, teaching it to treat building facades and road surfaces as persistent features while disregarding transient objects. In ablation studies, cross-temporal pairing was identified as the single most effective component of the training design.
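A dataloader-level sketch of the idea, with invented field names: pair a conditioning view from one capture date with a supervision sequence from a different date at the same location, so only scene elements stable across both dates provide a consistent learning signal.

```python
import random

def cross_temporal_pairs(captures_by_location, seed=0):
    """Build (reference, target) training pairs from panoramas of the
    same location captured on different dates. Parked cars and
    pedestrians differ between the two dates, so the model learns to
    rely on persistent structure (facades, road surfaces) and to
    ignore transients. All field names here are illustrative."""
    rng = random.Random(seed)
    pairs = []
    for location, by_date in captures_by_location.items():
        dates = sorted(by_date)
        if len(dates) < 2:
            continue  # need at least two capture dates to pair across time
        ref_date, tgt_date = rng.sample(dates, 2)
        pairs.append({"location": location,
                      "reference": by_date[ref_date],  # conditioning view
                      "target": by_date[tgt_date]})    # supervision sequence
    return pairs
```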
Who’s Affected
Autonomous driving developers represent the most direct application. World models are used to generate synthetic test environments at scale, and spatially hallucinated geometry introduces risk when validating perception or path-planning systems. SWM’s ability to reproduce specific real routes makes it a candidate for scenario generation grounded in actual road networks rather than procedurally invented ones.
Urban planning visualization, game environment generation, and visual effects production requiring spatially accurate city footage are additional use cases. Naver’s own mapping and navigation products provide both the data source and a natural commercial deployment context for the research.
What’s Next
SWM demonstrated generalization beyond Seoul by producing coherent footage of Busan, South Korea, and Ann Arbor, Michigan, without additional fine-tuning, a result the researchers describe as evidence of broader applicability. A systematic evaluation of how model performance scales with Street View density in less-mapped cities has not yet been published.
One structural limitation noted in the source reporting is camera placement: Naver Map imagery is collected from road-mounted vehicle cameras, so pedestrian zones, parks, and building interiors fall outside the model’s retrieval database. Individual researcher names were not listed in the accessible source material at the time of publication.
Related Reading
- Higgsfield AI Review 2026: Social-First AI Video Creation for Short-Form Content
- Alibaba Cloud Leads $293 Million Round in Chinese AI Video Startup ShengShu Technology
- Veo 3 Review 2026: Google DeepMind AI Video Generator With Audio and Dialogue
- Netflix Releases VOID, Its First Public AI Model for Removing Objects From Video
- Self-Distillation Boosts Code Generation by 30%: No Teacher Model or RL Required