
SKYREELS-V4
SkyReels Reference-to-Video creates videos from reference images, keeping characters and scenes consistent across shots for branded ads and storytelling.
Model Slug: skyreels-v4-reference-to-video
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
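As a rough illustration of the step above, the sketch below builds (but does not send) such a POST request using only the Python standard library. The endpoint URL, the `X-API-Key` header name, and the `input` field names are placeholders assumed for illustration; only the model slug comes from this page. Check the official API reference for the real values.

```python
import json
import urllib.request

# Hypothetical values: replace with the endpoint and auth header from the official API docs.
API_URL = "https://api.example.com/v1/prediction"  # placeholder, not the real endpoint
MODEL_SLUG = "skyreels-v4-reference-to-video"      # slug listed on this page


def build_prediction_request(api_key, prompt, reference_image_urls):
    """Build (but do not send) a POST request that creates a prediction."""
    payload = {
        "model": MODEL_SLUG,
        # Input field names below are assumptions for illustration.
        "input": {
            "prompt": prompt,
            "reference_images": list(reference_image_urls),
        },
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


if __name__ == "__main__":
    req = build_prediction_request(
        "YOUR_API_KEY",
        "A singer on stage, 10 seconds, vocals in sync with lip movement",
        ["https://example.com/singer.png"],
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To send it: urllib.request.urlopen(req), then read the prediction ID
    # from the JSON response (the response field name depends on the API).
```

Building the request separately from sending it makes it easy to inspect the payload before spending credits.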
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready, repeating the request until the response reports a success status.
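The polling step can be sketched as a small loop. In this sketch the HTTP call is passed in as a function (`fetch_prediction`, a hypothetical name), so the loop itself makes no claim about the real endpoint. The `"success"` status comes from the description above; the `"error"` status and the shape of the response dict are assumptions.

```python
import time


def wait_for_result(fetch_prediction, prediction_id, interval_s=2.0,
                    max_attempts=150, sleep=time.sleep):
    """Call fetch_prediction(prediction_id) until it reports success.

    fetch_prediction is supplied by the caller and is expected to return a
    dict with a "status" key. The "error" status is an assumption.
    """
    for _ in range(max_attempts):
        result = fetch_prediction(prediction_id)
        status = result.get("status")
        if status == "success":
            return result
        if status == "error":
            raise RuntimeError(f"prediction {prediction_id} failed: {result}")
        sleep(interval_s)  # still running: wait before checking again
    raise TimeoutError(
        f"prediction {prediction_id} not ready after {max_attempts} checks"
    )


if __name__ == "__main__":
    # Stand-in for a real HTTP GET: reports "processing" twice, then "success".
    responses = iter([
        {"status": "processing"},
        {"status": "processing"},
        {"status": "success", "output": "https://example.com/result.mp4"},
    ])
    result = wait_for_result(lambda _id: next(responses), "demo-id", interval_s=0.01)
    print(result["output"])
```

Capping the number of attempts keeps a stuck prediction from blocking the caller indefinitely.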
Readme
Overview
Skyreels v4 | Reference to Video from Skywork AI generates synchronized video and audio from reference inputs in a single forward pass, so a clip and its soundtrack come out of the same generation step. As part of the Skyreels family, this open-source model is presented as the first to co-generate 1080p video at 32 FPS with integrated audio, in clips up to 15 seconds long, removing the need for separate audio post-production.
Developed by Skywork AI and released on April 3, 2026, the model uses a Dual-stream Multimodal Diffusion Transformer (MMDiT) architecture built for joint audio-video output. It is available with 70 free credits per month on platforms such as ComfyUI or Model Studio, and through the Skyreels v4 | Reference to Video API.
The model is aimed at scenarios that need realistic motion with aligned sound, which distinguishes it among open-source reference-to-video models on each::labs.
Technical Specifications
- Max Resolution: 1080p native
- Frame Rate: 32 FPS
- Max Duration: Up to 15 seconds
- Architecture: Dual-stream Multimodal Diffusion Transformer (MMDiT) for joint audio-video generation
- Access: Open-source; 70 free credits per month; compatible with ComfyUI / Model Studio
- Output: Synchronized video and audio in a single pass
- Performance: Elo score ~1,135 on Artificial Analysis T2V with audio benchmark
These specs enable efficient reference-to-video generation, with processing optimized for open-source deployment.
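As a small worked example of the limits listed above, the following sketch checks a requested clip against the stated maximums and computes the resulting frame count. The function names are hypothetical, and the check only encodes the numbers on this page, not any validation the service itself performs.

```python
MAX_DURATION_S = 15   # "Max Duration: Up to 15 seconds"
MAX_HEIGHT_PX = 1080  # "Max Resolution: 1080p native"
FRAME_RATE_FPS = 32   # "Frame Rate: 32 FPS"


def check_request(duration_s, height_px):
    """Return a list of problems; an empty list means the request fits the listed limits."""
    problems = []
    if duration_s <= 0:
        problems.append("duration must be positive")
    elif duration_s > MAX_DURATION_S:
        problems.append(f"duration {duration_s}s exceeds the {MAX_DURATION_S}s limit")
    if height_px > MAX_HEIGHT_PX:
        problems.append(f"{height_px}p exceeds the {MAX_HEIGHT_PX}p limit")
    return problems


def frame_count(duration_s):
    """Number of frames in a clip of the given length at the listed frame rate."""
    return round(duration_s * FRAME_RATE_FPS)


if __name__ == "__main__":
    print(check_request(10, 1080))  # fits the limits: empty list
    print(check_request(20, 2160))  # two problems reported
    print(frame_count(15))          # 15 s * 32 FPS = 480 frames
```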
Key Considerations
Before using Skyreels v4 | Reference to Video, make sure you have access to ComfyUI or Model Studio, the open-source environments it is optimized for (70 free credits per month). The model needs reference inputs such as images or initial video frames to guide generation, so it is better suited to extending or modifying existing media than to pure text-to-video.
The model suits projects that prioritize joint audio-video sync over longer durations, and as a free open-source option it is cost-efficient compared with proprietary models. For local inference, account for hardware: the MMDiT architecture needs GPU resources for 1080p output. The main tradeoff is limited clip length in exchange for high-fidelity integrated audio.
Tips & Tricks
For best results with Skyreels v4 | Reference to Video, provide clear reference images or short video clips as inputs, and focus prompts on motion and audio cues, for example "extend this dance scene with upbeat music syncing to footsteps."
Keep durations within the 15-second limit and use a 16:9 aspect ratio for 1080p stability. Use descriptive prompts that emphasize synchronization, such as "Generate a 10-second clip from this reference image of a singer, adding harmonious vocals and lip-sync at 32 FPS." Experiment with strength settings in ComfyUI to balance reference fidelity and creative variation.
Workflow tip: chain generations autoregressively for longer sequences, starting with a strong reference frame, for example: "From this video reference of a car chase, extend with engine roars and screeching tires for 12 seconds." A strong starting frame improves coherence in Skywork AI reference-to-video tasks.
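To make the chaining tip concrete, the sketch below splits a target duration into consecutive segments that each fit the 15-second clip limit. It only plans the segments. How each segment is generated, and which frame of the previous clip is passed on as the next reference, is left to the caller and is not specified on this page. `plan_segments` is a hypothetical helper name.

```python
MAX_CLIP_S = 15  # per-clip limit stated in the specifications


def plan_segments(total_s, max_clip_s=MAX_CLIP_S):
    """Split a target duration into consecutive (start, end) segments of at most max_clip_s."""
    if total_s <= 0 or max_clip_s <= 0:
        raise ValueError("durations must be positive")
    segments = []
    start = 0
    while start < total_s:
        end = min(start + max_clip_s, total_s)
        segments.append((start, end))
        start = end
    return segments


if __name__ == "__main__":
    # A 40-second sequence becomes three chained generations.
    for start, end in plan_segments(40):
        print(f"generate {end - start}s clip covering {start}s to {end}s")
```

A 40-second target yields segments of 15, 15, and 10 seconds. Each extra link in the chain is another opportunity for quality to drift, so shorter chains are safer.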
Capabilities
- Joint audio-video generation in a single forward pass for perfect synchronization
- 1080p resolution at 32 FPS from reference inputs like images or video clips
- Clips up to 15 seconds with high motion quality and realistic sound
- Dual-stream MMDiT architecture enabling multimodal outputs
- Open-source deployment via ComfyUI or Model Studio
- Strong benchmark performance (Elo ~1,135 on T2V with audio)
- Reference-guided extension for consistent character and scene continuity
- Free tier with 70 credits monthly for accessible testing
What Can I Use It For?
Use Cases for Skyreels v4 | Reference to Video
Content Creators: Extend short reference clips into full scenes with synced audio, e.g., "From this 5-second dance reference, generate 12 seconds with matching rhythm music and crowd cheers" – perfect for TikTok or Reels production leveraging joint generation.
Marketers: Create product demo videos from reference images, like "Animate this product shot with explanatory voiceover and subtle sound effects for 10 seconds," enhancing ads with realistic audio without extra editing.
Developers: Integrate via Skyreels v4 | Reference to Video API in apps for dynamic video prototypes, using references for custom avatars: "Extend this face reference into a talking head with scripted dialogue at 32 FPS."
Designers: Prototype motion graphics from static refs, such as "Add fluid animations and ambient sounds to this UI mockup reference for a 15-second showcase," streamlining iterative design on each::labs.
Things to Be Aware Of
Skyreels v4 | Reference to Video may struggle with complex multi-shot sequences beyond 15 seconds, as it's optimized for single-pass joint generation. Common mistakes include vague references leading to inconsistent audio sync; always use high-quality inputs.
Edge cases like rapid motion or abstract styles can reduce fidelity, and local runs require sufficient GPU VRAM for 1080p. Overly long prompts may dilute focus, so prioritize key descriptors. Monitor credit usage on free tiers for batch workflows.
Limitations
Skyreels v4 | Reference to Video is capped at 15-second clips and 1080p, so it is unsuitable for feature-length or 4K work. It relies heavily on reference quality: poor inputs produce motion artifacts or audio desync.
There is no native support for multi-modal inputs beyond basic references, and autoregressive extension for longer videos risks quality degradation. Because the model is open-source, performance on consumer hardware varies.
---
Pricing
Pricing Type: Dynamic
Cost equals the credits reported in the provider response multiplied by $0.01.
Per-second rates without video input:
- fast: $0.08 (480p) / $0.11 (720p) / $0.275 (1080p)
- std: $0.11 (480p) / $0.14 (720p) / $0.35 (1080p)
Per-second rates with video input:
- fast: $0.15 (480p) / $0.20 (720p) / $0.50 (1080p)
- std: $0.18 (480p) / $0.25 (720p) / $0.625 (1080p)
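The pricing rule above can be turned into a quick estimate. The sketch below copies the per-second rates from this page and uses `Decimal` to avoid floating-point rounding in currency math. It assumes the billed seconds equal the requested output duration, which the page does not state explicitly. The actual charge is whatever credit count the provider response reports.

```python
from decimal import Decimal

CREDIT_USD = Decimal("0.01")  # one credit = $0.01

# Per-second USD rates from the pricing text: (mode, has_video_input) -> {resolution: rate}
RATES = {
    ("fast", False): {"480p": Decimal("0.08"), "720p": Decimal("0.11"), "1080p": Decimal("0.275")},
    ("std", False): {"480p": Decimal("0.11"), "720p": Decimal("0.14"), "1080p": Decimal("0.35")},
    ("fast", True): {"480p": Decimal("0.15"), "720p": Decimal("0.20"), "1080p": Decimal("0.50")},
    ("std", True): {"480p": Decimal("0.18"), "720p": Decimal("0.25"), "1080p": Decimal("0.625")},
}


def credits_to_usd(credits):
    """Convert a credit count from the provider response into dollars."""
    return CREDIT_USD * Decimal(str(credits))


def estimate_usd(duration_s, resolution, mode="std", video_input=False):
    """Estimate cost as rate * seconds (assumes billed seconds == duration_s)."""
    return RATES[(mode, video_input)][resolution] * Decimal(str(duration_s))


if __name__ == "__main__":
    print(estimate_usd(10, "1080p"))               # std, no video input
    print(estimate_usd(5, "720p", "fast", True))   # fast, with video input
    print(credits_to_usd(350))
```

For example, a 10-second 1080p clip in std mode without video input comes to 10 × $0.35 = $3.50, which corresponds to 350 credits.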
Dev questions, real answers.
SkyReels Reference-to-Video is a model from Skywork AI that generates videos from one or more reference images alongside a text prompt. It keeps characters, products, and backgrounds consistent across the clip, making it useful for AI video generation that needs strong visual continuity.
SkyReels Reference-to-Video on each::labs is a fit for branded video, advertising, e-commerce, and narrative storytelling where the same character, product, or set must appear in multiple shots. Creators can guide the look with reference images and shape the scene through text instructions.
SkyReels Reference-to-Video accepts reference images so the output keeps a defined character, object, or background look across the video. Text-to-video models generate visuals from a prompt alone, while the reference-to-video model fits projects where consistency between shots is the priority.

