RESEARCH

Naver Seoul World Model Anchors Video Generation in Real City Geometry

James Whitfield Mar 29, 2026 Updated Apr 7, 2026 4 min read
Engine Score 7/10 — Important

Naver's world model using real Street View data to prevent AI hallucinations in city generation is a novel technical approach with implications for autonomous driving and digital twin applications.


Naver, the South Korean internet company, published research on March 29, 2026, introducing the Seoul World Model (SWM), a video generation system anchored in real city geometry drawn from over 1.2 million panoramic Street View captures of Seoul. Jonathan Kemper, writing for The Decoder, reported that the researchers describe the system as “the first world model tied to a real physical location.” Unlike existing video world models that generate plausible but entirely fictional environments, SWM retrieves actual street-level imagery as spatial anchors during generation.

  • SWM draws on 1.2 million Naver Map Street View panoramas to ground video output in real Seoul geography, retrieved via geographic coordinate input.
  • A cross-temporal pairing mechanism trains the model to distinguish permanent structures like building facades from transient objects like parked cars and pedestrians.
  • In benchmarks, SWM outperformed six current video world models on both visual quality and temporal consistency.
  • The model generalized to Busan, South Korea, and Ann Arbor, Michigan, without any additional fine-tuning or new data collection.

What Happened

Researchers from Naver and Naver Cloud published a paper describing the Seoul World Model, a location-grounded video generation system that takes geographic coordinates, a desired camera path, and a text prompt as input. The model queries a database of 1.2 million panoramic images from Naver Map, retrieves the nearest Street View images, and uses them as visual guides for step-by-step video generation. The full research paper and methodology are detailed in Jonathan Kemper’s March 29, 2026 report at The Decoder.
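The coordinate-based retrieval step can be sketched as a nearest-neighbor lookup over geotagged panoramas. This is a minimal illustration, not Naver's implementation; the `haversine_m` and `nearest_panoramas` helpers and the toy coordinates are hypothetical.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_panoramas(query, panoramas, k=3):
    """Return the k panoramas whose capture points are closest to the query."""
    lat, lon = query
    return sorted(panoramas, key=lambda p: haversine_m(lat, lon, p["lat"], p["lon"]))[:k]

# Toy database: three panoramas near Seoul City Hall (coordinates illustrative).
db = [
    {"id": "pano_a", "lat": 37.5663, "lon": 126.9779},
    {"id": "pano_b", "lat": 37.5665, "lon": 126.9785},
    {"id": "pano_c", "lat": 37.5700, "lon": 126.9820},
]
hits = nearest_panoramas((37.5664, 126.9780), db, k=2)
print([p["id"] for p in hits])  # → ['pano_a', 'pano_b']
```

In the actual system the retrieved images then condition each generation step; a production index would use a spatial data structure rather than a linear scan.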

The system also accepts text-prompt modifications to generated footage — users can instruct the model to add burning vehicles or insert fictional elements into street-level scenes — while preserving the underlying spatial accuracy of the real route.

Why It Matters

Previous video world models produce visually convincing output, but everything beyond the starting frame — unseen intersections, distant building facades, continuation of a route — is hallucinated from learned distributions of generic scenes rather than actual geography. This limits their use in safety-critical simulation, particularly for testing autonomous vehicle perception systems against specific real-world road layouts, where a model inventing nonexistent infrastructure introduces unacceptable risk.

Naver operates South Korea’s dominant search engine and its own mapping service, Naver Map, which collects street-level panoramas in a format comparable to Google Maps. That existing data infrastructure gives the company a direct path to the imagery SWM requires, without building a separate data collection pipeline.

Technical Details

SWM is built on Nvidia’s Cosmos-Predict2.5-2B diffusion transformer architecture, a 2-billion-parameter model trained on 24 Nvidia H100 GPUs. Training data combines three distinct sources: 440,000 Seoul Street View panoramas drawn from a pool of 1.2 million images captured at 5-to-20-meter intervals, 12,700 synthetic videos generated with the CARLA simulator running on Unreal Engine, and real-world driving footage from the public Waymo dataset.

The model uses a dual-path integration architecture. A geometric path projects reference images into the target camera perspective using depth maps, establishing the spatial layout of the generated scene. A semantic path encodes reference images into latent representations to preserve appearance details such as building textures and road surface markings.
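The geometric path's depth-based projection can be illustrated with a standard pinhole-camera warp: back-project reference pixels to 3D using the depth map, apply the relative camera pose, and project into the target view. This is an assumed textbook formulation, not Naver's code; `reproject` and all values are illustrative.

```python
import numpy as np

def reproject(depth, K, T_ref_to_tgt):
    """Warp reference pixel coordinates into the target camera view.

    depth: (H, W) depth map of the reference view, in meters
    K: (3, 3) camera intrinsics (assumed shared by both views)
    T_ref_to_tgt: (4, 4) rigid transform from reference to target frame
    Returns an (H, W, 2) array of pixel coordinates in the target view.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                                 # back-project to rays
    pts = rays * depth[..., None]                                   # 3D points, ref frame
    pts_h = np.concatenate([pts, np.ones((h, w, 1))], axis=-1)
    pts_tgt = pts_h @ T_ref_to_tgt.T                                # move to target frame
    proj = pts_tgt[..., :3] @ K.T
    return proj[..., :2] / proj[..., 2:3]                           # perspective divide

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
depth = np.full((4, 4), 10.0)       # flat scene 10 m from the camera
T = np.eye(4); T[0, 3] = 1.0        # 1 m lateral offset between views
warped = reproject(depth, K, T)     # every pixel shifts 50 px horizontally
```

With a 10 m planar scene and a 1 m lateral offset, each pixel moves by fx * tx / z = 500 * 1 / 10 = 50 pixels, which is the kind of geometric constraint the semantic path alone cannot provide.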

A virtual lookahead mechanism retrieves Street View images located further ahead along the planned route and uses them as error-free landmarks during generation. This prevents the accumulated positional drift that typically degrades spatial coherence in long-distance video generation across current world models.
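The lookahead idea can be sketched as walking a fixed distance further along the planned route and selecting the captured panorama nearest that point. The `lookahead_anchor` helper, the local metric coordinates, and the toy route are all hypothetical, assumed for illustration.

```python
import math

def lookahead_anchor(route, i, dist_ahead, panoramas):
    """Pick a landmark panorama dist_ahead meters further along the route.

    route: list of (x, y) waypoints in a local metric frame
    i: index of the current waypoint
    panoramas: dicts with 'id', 'x', 'y' capture positions
    """
    remaining = dist_ahead
    pos = route[i]
    for a, b in zip(route[i:], route[i + 1:]):
        seg = math.dist(a, b)
        if seg >= remaining:
            t = remaining / seg  # interpolate inside this segment
            pos = (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
            break
        remaining -= seg
        pos = b
    # The panorama nearest the lookahead point serves as an error-free
    # landmark that the generation step can re-anchor against.
    return min(panoramas, key=lambda p: math.dist(pos, (p["x"], p["y"])))

route = [(0.0, 0.0), (0.0, 50.0), (0.0, 100.0)]
panos = [
    {"id": "p0", "x": 0.0, "y": 0.0},
    {"id": "p1", "x": 0.0, "y": 55.0},
    {"id": "p2", "x": 0.0, "y": 100.0},
]
anchor = lookahead_anchor(route, 0, 60.0, panos)
print(anchor["id"])  # → p1
```

Because the anchor comes from a real capture rather than earlier generated frames, positional error cannot compound from step to step.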

Working with real-world images introduced a challenge absent from purely synthetic approaches: Street View captures include parked cars and pedestrians that bear no relation to the dynamic scene being generated. Without a fix, the model would copy these random objects directly into output frames. The solution — cross-temporal pairing — combines reference images and target sequences captured at different times during training, teaching the model to treat building facades and road surfaces as persistent features while disregarding transient objects. In ablation studies, cross-temporal pairing was identified as the single most effective component of the training design.
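The pairing construction can be sketched as grouping captures by location and emitting only reference/target pairs whose capture dates differ. This is a minimal sketch assuming a flat list of capture records; `cross_temporal_pairs` and the field names are hypothetical.

```python
import itertools
import random
from collections import defaultdict

def cross_temporal_pairs(captures, seed=0):
    """Pair captures of the same place taken at different times.

    captures: dicts with 'place' (location key), 'date', and 'frames'.
    Returns (reference, target) pairs that share geometry (same place)
    but differ in transient content (cars, pedestrians).
    """
    by_place = defaultdict(list)
    for c in captures:
        by_place[c["place"]].append(c)
    pairs = []
    for clips in by_place.values():
        for ref, tgt in itertools.permutations(clips, 2):
            if ref["date"] != tgt["date"]:  # enforce different capture times
                pairs.append((ref, tgt))
    random.Random(seed).shuffle(pairs)
    return pairs

caps = [
    {"place": "gangnam_001", "date": "2023-04", "frames": "clip_a"},
    {"place": "gangnam_001", "date": "2024-09", "frames": "clip_b"},
    {"place": "mapo_017",    "date": "2024-09", "frames": "clip_c"},
]
pairs = cross_temporal_pairs(caps)
print(len(pairs))  # → 2: each gangnam capture paired with the other
```

Trained on such pairs, the model learns that only features stable across both capture dates (facades, road surfaces) are reliable, since transient objects differ between reference and target.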

Who’s Affected

Autonomous driving developers represent the most direct application. World models are used to generate synthetic test environments at scale, and spatially hallucinated geometry introduces risk when validating perception or path-planning systems. SWM’s ability to reproduce specific real routes makes it a candidate for scenario generation grounded in actual road networks rather than procedurally invented ones.

Urban planning visualization, game environment generation, and visual effects production requiring spatially accurate city footage are additional use cases. Naver’s own mapping and navigation products provide both the data source and a natural commercial deployment context for the research.

What’s Next

SWM demonstrated generalization beyond Seoul by producing coherent footage of Busan, South Korea, and Ann Arbor, Michigan, without additional fine-tuning — a result the researchers describe as evidence of broader applicability. A systematic evaluation of how model performance scales with Street View density in less-mapped cities has not yet been published.

One structural limitation noted in the source reporting is camera placement: Naver Map imagery is collected from road-mounted vehicle cameras, which means pedestrian zones, parks, and building interiors fall outside the model’s retrieval database. Individual researcher names were not listed in the accessible source material at time of publication.
