
LTX 2.3 Generates 4K Video With Audio in One Pass — and It’s Open Source

MegaOne AI · Mar 31, 2026 · Updated Apr 2, 2026 · 3 min read
Engine Score 7/10 — Important
  • LTX 2.3 is a 22-billion-parameter open-source model that generates native 4K video with synchronized stereo audio in a single pass.
  • The Diffusion Transformer architecture splits roughly 14 billion parameters for video processing and 5 billion for audio generation.
  • Released under Apache 2.0 licensing, it permits free commercial use for companies under $10 million in annual revenue.
  • LTX 2.3 runs approximately 18x faster than Wan 2.2 at equivalent quality settings.

What Happened

On March 5, 2026, Lightricks released LTX 2.3, an open-source Diffusion Transformer model that generates 4K video with synchronized stereo audio from a single unified architecture. It is the first open-weight video generation model to combine video and audio synthesis in one pass, eliminating the need for separate audio generation pipelines or post-production synchronization steps.

Two model variants are available on Hugging Face. LTX 2.3-22B-dev is the full-precision bf16 version intended for fine-tuning, research, and maximum-quality output. LTX 2.3-22B-distilled is an optimized 8-step variant with reduced memory requirements for faster inference. Both variants use the 12-billion-parameter Gemma 3 text encoder, with quantization support for efficient prompt processing.
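For a sense of how the dev checkpoint might be driven locally, here is a minimal sketch assuming the 2.3 weights load through the same diffusers LTXPipeline interface that earlier LTX-Video releases use. The repo ID `Lightricks/LTX-2.3-22B-dev` is a hypothetical placeholder, and the sketch covers video only; the single-pass audio track would presumably come from a 2.3-specific pipeline.

```python
# Minimal sketch, assuming the LTX 2.3 checkpoints load through the
# diffusers LTXPipeline interface used by earlier LTX-Video releases.
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-2.3-22B-dev",  # hypothetical repo ID; check Hugging Face
    torch_dtype=torch.bfloat16,    # the full-precision dev variant is bf16
)
pipe.to("cuda")

video = pipe(
    prompt="Waves breaking on a rocky shore at sunset, soft ambient surf",
    width=3840,             # native 4K, 16:9 landscape
    height=2160,
    num_frames=481,         # ~20 s at 24 FPS; earlier LTX releases expect 8k+1 frames
    num_inference_steps=50,
).frames[0]

export_to_video(video, "shore.mp4", fps=24)
```

The distilled variant would swap in its own repo ID and drop the step count to its 8 predefined steps.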

Why It Matters

Most video generation models on the market output at 1080p or lower and require a separate model or external service to add synchronized audio. LTX 2.3 generates native 4K output with stereo 24 kHz audio in a single generation step, consolidating what was previously a multi-model pipeline into one inference pass. This reduces both computational cost and workflow complexity for production environments.

The open-source release under Apache 2.0 means independent creators, small studios, and researchers can run the model locally without paying per-generation API fees. On the Artificial Analysis video generation leaderboard, LTX 2.3 ranks as the top open-source video model. Against proprietary competitors, it places behind Kling 3.0 (Elo 1,244) and Runway Gen-4.5 (Elo 1,225) on overall quality metrics, but distinguishes itself by offering native 4K resolution where most competitors cap at standard 1080p output.

Technical Details

The 22-billion-parameter Diffusion Transformer architecture divides into approximately 14 billion parameters dedicated to video processing and 5 billion parameters handling audio generation. Video output supports both 24 FPS and 48 FPS frame rates in 16:9 landscape and 9:16 portrait aspect ratios, with a maximum clip duration of 20 seconds per single generation.
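As a quick sanity check on what those specs imply, the raw frame and pixel counts for a maximum-length clip work out as follows. This is simple arithmetic from the published numbers, not figures from Lightricks:

```python
# Back-of-the-envelope clip arithmetic from the published specs
# (4K = 3840x2160, 24 or 48 FPS, up to 20 s per generation).
width, height = 3840, 2160
for fps in (24, 48):
    frames = fps * 20                 # frames in a maximum-length clip
    pixels = frames * width * height  # raw pixels the model must produce
    print(f"{fps} FPS: {frames} frames, {pixels / 1e9:.1f}B pixels per 20 s clip")
# 24 FPS: 480 frames, 4.0B pixels per 20 s clip
# 48 FPS: 960 frames, 8.0B pixels per 20 s clip
```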

An upgraded HiFi-GAN vocoder replaced the previous audio decoder in the 2.3 release, delivering cleaner output and eliminating the artifacts and unwanted silence gaps of earlier versions. Synchronized audio generation works most reliably for ambient environmental sound, background music, and simple sound effects. Complex dialogue and speech synthesis remain inconsistent in the current version and are not considered production-ready.
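Synchronization is easy to reason about here because the 24 kHz track divides evenly into both supported frame rates. Again, this is illustrative arithmetic rather than a detail from the model card:

```python
# Samples-per-frame alignment of the 24 kHz stereo track (illustrative only).
sample_rate = 24_000  # Hz per channel, as stated in the release
for fps in (24, 48):
    print(f"{fps} FPS -> {sample_rate // fps} audio samples per video frame")
# 24 FPS -> 1000 audio samples per video frame
# 48 FPS -> 500 audio samples per video frame
```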

Local deployment requires an NVIDIA GPU with 44GB of VRAM for full 4K generation at fp16 precision, or approximately 24GB for FP8 quantized variants. That puts the full-precision model in workstation territory, while the FP8 version fits high-end consumer cards like the 24GB RTX 4090 and 32GB RTX 5090. The distilled pipeline variant achieves the fastest inference using only 8 predefined sigma steps. Speed benchmarks show LTX 2.3 running approximately 18x faster than Wan 2.2 at equivalent quality settings.
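Those figures track simple weights-only arithmetic for a 22-billion-parameter model; real usage runs somewhat higher because activations, the text encoder, and framework overhead also occupy VRAM:

```python
# Weights-only VRAM estimate for 22B parameters (overhead not included).
params = 22e9
for precision, bytes_per_param in (("fp16/bf16", 2), ("FP8", 1)):
    print(f"{precision}: {params * bytes_per_param / 1e9:.0f} GB of weights")
# fp16/bf16: 44 GB of weights
# FP8: 22 GB of weights
```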

Who’s Affected

Independent video creators and small production studios gain access to 4K AI video generation without recurring API costs or usage-based pricing. The Apache 2.0 license permits unrestricted commercial use for companies generating under $10 million in annual revenue. Larger enterprises require a separate commercial licensing agreement with Lightricks.

Researchers and practitioners working on video model fine-tuning benefit from the full-precision dev variant and built-in support for LoRA and IC-LoRA training workflows. The model also supports multiple generation modes, including text-to-video, image-to-video, video-to-video editing, keyframe interpolation, and selective region regeneration for targeted content modification.
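If the 2.3 pipelines keep the LoRA loader hook that earlier diffusers LTX-Video pipelines expose, applying a fine-tuned adapter might look like the sketch below. Both the repo ID and the adapter path are hypothetical placeholders:

```python
# Hedged sketch: loading a LoRA adapter onto the dev variant, assuming
# LTX 2.3 keeps the diffusers-style load_lora_weights() hook.
import torch
from diffusers import LTXPipeline

pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-2.3-22B-dev",  # hypothetical repo ID
    torch_dtype=torch.bfloat16,
).to("cuda")
pipe.load_lora_weights("my-style-lora.safetensors")  # hypothetical adapter path
```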

What’s Next

Audio quality for dialogue and intelligible speech remains the primary limitation preventing full production adoption for narrative content. While ambient sound and music synchronize reliably, generating clear spoken words within video output is not yet consistent enough for professional use. The 44GB VRAM requirement for full-precision 4K generation also limits local deployment to workstation-class hardware, keeping cloud-based inference relevant for users without dedicated GPU resources. Lightricks offers API access for teams that prefer not to manage local GPU infrastructure. The company has not announced a timeline for improved speech synthesis or plans for resolution beyond 4K.


MegaOne AI Editorial Team

MegaOne AI monitors 200+ sources daily to identify and score the most important AI developments. Every story is fact-checked, linked to primary sources, and rated using our six-factor Engine Score methodology.
