ltx/ltx-v2-3 models

ltx-v2.3 — AI Model Family

The ltx-v2.3 family, developed by Lightricks as part of the LTX-Video series, is a cutting-edge suite of multimodal AI models specialized in video generation. This 22B-parameter diffusion transformer (DiT) system excels at creating high-quality videos from text, images, audio, or video inputs, solving the challenge of producing professional-grade cinematic content with synchronized audio in a single pass. Unlike traditional tools that require separate steps for visuals and sound, ltx-v2.3 handles text-to-video, image-to-video, audio-to-video, and video-to-video natively, delivering up to native 4K resolution at 50 fps. The family powers local and cloud-based workflows, enabling creators to generate clips of up to 20 seconds in Fast mode or 10 seconds in Pro mode for superior quality. Currently available via Hugging Face with millions of downloads, ltx-v2.3 runs on consumer GPUs for cost-effective, private production: no watermarks, no cloud dependency.

ltx-v2.3 Capabilities and Use Cases

The ltx-v2.3 family centers on a core 22B video model, augmented by specialized components such as the LTX Audio VAE for synchronized audio decoding and the LTX 2.3 Spatial Upscaler x2 for enhanced sharpness. These models integrate seamlessly into pipelines such as ComfyUI workflows, where text encoders like Gemma 3 12B Instruct with LoRAs refine prompt adherence.

  • Text-to-Video: Transform detailed descriptions into dynamic scenes. Ideal for storytelling, marketing, or concept visualization. Example prompt: "A man in a blue jacket walks down a rain-soaked Tokyo street at dusk, neon signs reflecting in puddles, shot from a low angle, 24mm lens perspective." This yields realistic motion, lighting, and camera effects in 4K (see the request sketch after this list).

  • Image-to-Video: Animate stills with coherent, cinematic motion. Perfect for product demos or social media reels—upload a photo and add fluid camera pans or character actions while preserving visual fidelity.

  • Audio-to-Video: Generate visuals synced to voiceovers, music, or ambient sounds. Podcasters can input narration to create matching footage, with automatic rhythm and energy alignment, eliminating manual syncing.

  • Video-to-Video: Refine existing clips by extending, stylizing, or transferring motion. Use a frame from one generation as input for iterative improvements, like enhancing motion or adding details.
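
A minimal sketch of what a text-to-video call for the example prompt above might look like over HTTP. The endpoint path, header, and payload field names here are assumptions for illustration, not the documented each::labs schema; check the API reference for the actual parameter names before using this.

```python
import requests

API_KEY = "YOUR_EACHLABS_API_KEY"        # placeholder
BASE_URL = "https://api.eachlabs.ai"     # hypothetical base URL, not from the docs

payload = {
    "model": "ltx/ltx-v2-3",
    "mode": "text-to-video",             # assumed parameter name
    "prompt": (
        "A man in a blue jacket walks down a rain-soaked Tokyo street at dusk, "
        "neon signs reflecting in puddles, shot from a low angle, 24mm lens perspective."
    ),
    "resolution": "3840x2160",           # native 4K, per the specs described here
    "fps": 50,
    "duration_seconds": 10,              # Pro mode caps at 10 s; Fast mode allows up to 20 s
    "quality": "pro",
}

response = requests.post(
    f"{BASE_URL}/v1/generations",        # hypothetical path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())                   # typically a job ID or a URL to the finished clip
```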

These capabilities shine in pipelines: Start with text-to-video for a base clip, extract a keyframe for image-to-video refinement, then mux with audio via the LTX Audio VAE. Technical specs include 4K/50fps output, improved VAE for detailed renders without oversaturation, LoRA support for styles, camera/pose control, and cross-platform GPU optimization. Durations cap at 20s (Fast) or 10s (Pro), with features like first/last frame conditioning for precise control.
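
The chaining described above can be sketched as a short script. It reuses the same hypothetical endpoint and field names as the earlier example (again, not the documented schema) and uses OpenCV locally to pull a keyframe from the base clip for image-to-video conditioning; audio muxing via the LTX Audio VAE is assumed to happen server-side.

```python
import cv2
import requests

BASE_URL = "https://api.eachlabs.ai"                         # hypothetical
HEADERS = {"Authorization": "Bearer YOUR_EACHLABS_API_KEY"}  # placeholder key


def generate(payload: dict) -> str:
    """Submit a generation request and return the resulting video URL (assumed response field)."""
    r = requests.post(f"{BASE_URL}/v1/generations", headers=HEADERS, json=payload, timeout=30)
    r.raise_for_status()
    return r.json()["video_url"]


def last_frame(video_path: str, out_path: str = "keyframe.png") -> str:
    """Grab the final frame of a local clip to reuse as first-frame conditioning."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, cap.get(cv2.CAP_PROP_FRAME_COUNT) - 1)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError("could not read final frame")
    cv2.imwrite(out_path, frame)
    return out_path


# 1) Base clip from text
base_url = generate({
    "model": "ltx/ltx-v2-3",
    "mode": "text-to-video",
    "prompt": "A man in a blue jacket walks down a rain-soaked Tokyo street at dusk",
})

# 2) Download the clip and extract its last frame as a keyframe
with open("base.mp4", "wb") as f:
    f.write(requests.get(base_url, timeout=60).content)
frame_path = last_frame("base.mp4")

# 3) Refine with image-to-video, conditioning on that keyframe
refined_url = generate({
    "model": "ltx/ltx-v2-3",
    "mode": "image-to-video",
    "image_path": frame_path,   # assumed field; a real API may expect an upload or URL instead
    "prompt": "slow dolly-in as the man turns toward a neon storefront",
})
print(refined_url)
```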

What Makes ltx-v2.3 Stand Out

ltx-v2.3 distinguishes itself through true multimodality—one model processes text, images, audio, and video without tool-switching, producing synchronized video + audio in a single inference pass. The upgraded VAE delivers sharper details, better portrait support, and consistent motion, addressing flaws like oversaturation in prior versions. High frame rates and native 4K enable cinematic quality on consumer hardware, with privacy-first local runs (no data leaves your machine) and optional cloud fallback.

Key strengths include speed and cost-efficiency (roughly 1/5 to 1/10 of cloud costs once the hardware is paid for), controllability via LoRAs, pose-driven motion, and camera angles, plus batch generation without rate limits. It responds precisely to prompts specifying lenses, lighting, and movement, ensuring reliable, high-coherence outputs. The family suits filmmakers, marketers, podcasters, game developers, and enterprises handling proprietary assets: anyone prioritizing quality, iteration speed, and ownership over subscription models.

Access ltx-v2.3 Models via each::labs API

each::labs is the premier platform for deploying the full ltx-v2.3 model family through a unified API, granting instant access to text-to-video, image-to-video, and multimodal pipelines without setup hassles. Run the 22B core model, Audio VAE, Spatial Upscaler, and LoRAs in scalable cloud environments or integrate locally via SDK. Experiment in the interactive Playground to test prompts like rain-slicked Tokyo streets, then scale to production with simple API calls.
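
Since video generation is a long-running task, production integrations typically submit a job and poll for its result. The sketch below shows that pattern; the endpoint paths and response fields ("id", "status", "output") are assumptions for illustration, not the confirmed each::labs API.

```python
import time
import requests

BASE_URL = "https://api.eachlabs.ai"                         # hypothetical
HEADERS = {"Authorization": "Bearer YOUR_EACHLABS_API_KEY"}  # placeholder key


def run_prediction(payload: dict, poll_every: float = 5.0, timeout_s: float = 600.0) -> dict:
    """Submit a generation job, then poll until it succeeds, fails, or times out."""
    r = requests.post(f"{BASE_URL}/v1/predictions", headers=HEADERS, json=payload, timeout=30)
    r.raise_for_status()
    job_id = r.json()["id"]                                  # assumed response field
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        s = requests.get(f"{BASE_URL}/v1/predictions/{job_id}", headers=HEADERS, timeout=30)
        s.raise_for_status()
        body = s.json()
        if body.get("status") in ("succeeded", "failed"):
            return body
        time.sleep(poll_every)
    raise TimeoutError("generation did not finish in time")


result = run_prediction({
    "model": "ltx/ltx-v2-3",
    "mode": "text-to-video",
    "prompt": "Rain-slicked Tokyo street at dusk, neon reflections, slow push-in",
})
print(result.get("output"))
```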

Sign up to explore the full ltx-v2.3 model family on each::labs.
