
SkyReels-V3: Open-Source Reference-to-Video and Talking Avatars

SkyReels-V3 by Skywork AI ships three open-weight models: multi-subject reference-to-video (14B), audio-driven talking avatars (19B), and video extension with cinematic shot switching. Built on Wan 2.1, 720P at 24fps, Skywork Community License.

TL;DR
  • Three open-weight models: Reference-to-Video (14B), Talking Avatar (19B), and Video Extension with cinematic shot switching (14B).
  • Top open-source scores on reference consistency (0.6698) and visual quality (0.8119), competitive with Kling 1.6 and OmniHuman 1.5.
  • Built on Wan 2.1. Complements V2 (text-to-video + diffusion forcing). ComfyUI, GGUF, and FP8 quantization supported.
System Requirements
  • RAM: 32 GB
  • GPU: RTX 4090 24GB / A100 80GB
  • VRAM: 24 GB+ (14B) / 24 GB+ (19B A2V)

On January 29, 2026, Skywork AI (Kunlun Tech, Beijing) released SkyReels-V3, the latest in their open-weight video generation family. Where V2 introduced Diffusion Forcing for infinite-length text-to-video, V3 shifts focus to three new capabilities: multi-subject reference-to-video (drop in 1 to 4 reference images and get a video), audio-driven talking avatars (portrait plus audio, up to 200 seconds), and video extension with cinematic shot switching. All three models ship as open weights on Hugging Face, built on the Wan 2.1 backbone.

Three Models, Three Capabilities

V3 is not a single model. It ships as three separate checkpoints, each trained for a specific task:

  • R2V-14B (Reference-to-Video): Takes 1 to 4 reference images of characters, objects, or backgrounds plus a text prompt and generates a 720P video at 24fps. Reference images are encoded through the Wan 2.1 VAE and concatenated with video latents. A cross-frame pairing strategy with image editing extracts subjects while completing backgrounds, so results avoid the "copy-paste" look of naive conditioning.
  • A2V-19B (Talking Avatar): Takes a single portrait image plus an audio clip and generates a lip-synced talking video. Supports up to 200 seconds of continuous output. Works across real-life, cartoon, animal, and stylized characters in Chinese, English, Korean, and even singing. The extra 5B parameters over the 14B variants come from the Wav2Vec2 audio encoder, CLIP vision encoder, and audio-visual attention layers.
  • V2V-14B (Video Extension): Extends an existing video while preserving motion continuity and subject identity. Two modes: single-shot extension (up to 30 seconds) and shot switching (5 seconds) with cinematic transitions like cut-in, cut-out, shot/reverse shot, multi-angle, and cut-away.
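The R2V conditioning path above can be sketched at the shape level: each reference image is VAE-encoded and its latent concatenated with the video latents before the DiT attends over the whole sequence. The dimensions below are illustrative assumptions (16 latent channels and 8x spatial downsampling for the Wan 2.1 VAE), not the actual SkyReels-V3 code.

```python
# Shape-level sketch of R2V conditioning: reference-image latents are
# concatenated with video latents along the frame axis. Illustrative
# only; channel count and downsample factor are assumptions.

VAE_DOWN = 8     # assumed spatial downsampling of the Wan 2.1 VAE
LATENT_CH = 16   # assumed latent channel count

def latent_shape(height, width):
    """Spatial latent shape for one encoded frame or reference image."""
    return (LATENT_CH, height // VAE_DOWN, width // VAE_DOWN)

def r2v_sequence(num_ref_images, num_video_frames, height=720, width=1280):
    """Total latent 'frames' the transformer sees: references + video."""
    c, h, w = latent_shape(height, width)
    return (num_ref_images + num_video_frames, c, h, w)

# Four reference images prepended to 21 latent video frames at 720P:
print(r2v_sequence(num_ref_images=4, num_video_frames=21))
```

Because the references live in the same latent sequence as the video, cross-frame attention can bind subjects to generated frames without a separate adapter, which is consistent with the "in-context" framing in the paper.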

How It Works Under the Hood

V3 uses a unified multimodal in-context learning framework built on Diffusion Transformers (DiT) with the Wan 2.1 backbone. The text encoder is UMT5-XXL (~11.4GB in BF16). For inference scheduling, V3 uses Flow Matching with a FlowMatchEulerDiscreteScheduler instead of the DDPM approach used in V2.
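The flow-matching sampler is conceptually simple: the model predicts a velocity field and the scheduler integrates it with Euler steps from noise to data. The toy sketch below shows that update rule in isolation (straight-line rectified-flow path, noise at sigma=1, data at sigma=0); it is a minimal illustration, not the actual FlowMatchEulerDiscreteScheduler implementation.

```python
# Minimal flow-matching Euler sampler, assuming a straight-line
# (rectified-flow) path with noise at sigma=1 and data at sigma=0.

def sigmas(num_steps):
    """Linearly spaced noise levels from 1.0 down to 0.0."""
    return [1.0 - i / num_steps for i in range(num_steps + 1)]

def euler_step(x, velocity, sigma, sigma_next):
    """One Euler update along the predicted velocity field."""
    return [xi + vi * (sigma_next - sigma) for xi, vi in zip(x, velocity)]

def denoise(x_noisy, velocity_fn, num_steps=4):
    """Integrate from pure noise (sigma=1) down to data (sigma=0)."""
    s = sigmas(num_steps)
    x = x_noisy
    for sigma, sigma_next in zip(s[:-1], s[1:]):
        x = euler_step(x, velocity_fn(x, sigma), sigma, sigma_next)
    return x

# With the ideal straight-line velocity v = noise - data, Euler
# integration recovers the data regardless of step count.
data, noise = [0.5, -1.0], [1.2, 0.3]
v_ideal = lambda x, sigma: [n - d for n, d in zip(noise, data)]
print(denoise(noise, v_ideal))  # ≈ data
```

The practical upshot versus DDPM-style schedulers is that these deterministic Euler trajectories tend to need fewer sampling steps, which matters when each step runs a 14B or 19B transformer.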

The talking avatar pipeline is the most complex. Audio goes through a Wav2Vec2 encoder, then an AudioProjModel maps audio features to context tokens per video frame. These tokens feed into dedicated audio-visual attention layers alongside the portrait image conditioning. A key-frame constrained generation approach first structures content at key points, then fills transitions smoothly.
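Before any projection happens, the audio and video streams have to be aligned in time: Wav2Vec2 emits features at roughly 50 Hz while the video runs at 24 fps. The sketch below shows one plausible way to pick the audio-feature window for each video frame, the grouping an AudioProjModel-style module would then project into context tokens. The rates and window size are assumptions for illustration, not the actual SkyReels-V3 alignment code.

```python
# Hypothetical audio-to-video frame alignment: Wav2Vec2 features
# (~50 Hz, one vector per ~20 ms) mapped to 24 fps video frames,
# with a small context window for co-articulation.

AUDIO_HZ = 50    # assumed Wav2Vec2 feature rate
VIDEO_FPS = 24

def audio_window_for_frame(frame_idx, num_audio_feats, context=2):
    """Indices of audio features covering one video frame, padded
    with `context` extra features on each side."""
    center = round(frame_idx * AUDIO_HZ / VIDEO_FPS)
    start = max(0, center - context)
    end = min(num_audio_feats, center + context + 1)
    return list(range(start, end))

# One second of audio (50 features) against 24 video frames:
windows = [audio_window_for_frame(i, 50) for i in range(24)]
```

Each window would then be projected to a fixed number of context tokens per frame and injected through the audio-visual attention layers described above.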

The video extension pipeline uses unified multi-segment positional encoding to handle multiple video segments. For shot switching, a dedicated shot transformer classifies transition types, and prompts use prefixes like [ZOOM_IN_CUT] or [CUT_AWAY] to control the cinematic style.
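In practice the transition control is just a prompt convention: a bracketed tag prefixed to the text prompt. A tiny helper like the one below keeps the tags consistent; the tag names come from the transitions listed in this article and are not guaranteed to be the repo's exhaustive set.

```python
# Illustrative helper for the V2V shot-switching prompt convention:
# a bracketed transition tag prefixed to the text prompt. Tag list
# assumed from the transitions named in this article.

SHOT_TAGS = {
    "zoom_in_cut": "[ZOOM_IN_CUT]",
    "cut_away": "[CUT_AWAY]",
}

def shot_switch_prompt(transition, prompt):
    """Prefix a prompt with its cinematic transition tag."""
    tag = SHOT_TAGS.get(transition)
    if tag is None:
        raise ValueError(f"unknown transition: {transition}")
    return f"{tag} {prompt}"

print(shot_switch_prompt("cut_away", "The camera reveals the empty hallway"))
```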

Benchmarks

Skywork reports benchmark results for reference-to-video and talking avatar against recent competitors:

Reference-to-Video

| Model | Ref. Consistency | Instruction Following | Visual Quality |
|---|---|---|---|
| SkyReels V3 | 0.6698 | 27.22 | 0.8119 |
| Kling 1.6 | 0.6630 | 29.23 | 0.8034 |
| PixVerse V5 | 0.6542 | 29.34 | 0.7976 |
| Vidu Q2 | 0.5961 | 27.84 | 0.7877 |

V3 leads on reference consistency and visual quality. It trails Kling 1.6 and PixVerse on instruction following, meaning the model sometimes prioritizes subject fidelity over prompt adherence. For workflows where the reference images matter more than the text prompt, that is the right tradeoff.

Talking Avatar

| Model | Audio-Visual Sync | Visual Quality | Character Consistency |
|---|---|---|---|
| OmniHuman 1.5 | 8.25 | 4.60 | 0.81 |
| SkyReels V3 | 8.18 | 4.60 | 0.80 |
| KlingAvatar | 8.01 | 4.55 | 0.78 |
| HunyuanAvatar | 6.72 | 4.50 | 0.74 |

V3 ties for best visual quality and is within striking distance of OmniHuman 1.5 on sync and consistency. Crucially, V3 is the only model in this comparison with open weights.

Hardware Requirements

| Setup | VRAM | Notes |
|---|---|---|
| Full precision 720P | 24+ GB | RTX 4090 or A100 |
| With --offload | ~24 GB | CPU offloading for model components |
| With --low_vram | <24 GB | FP8 quantization + block offload |
| GGUF Q4 | ~12-14 GB | Community quantizations on HF |
| Multi-GPU (xDiT) | 4x GPUs | Cannot combine with --low_vram |

The 14B variants (R2V, V2V) need roughly 52-69 GB of storage. The 19B A2V model needs about 56 GB. Generation takes roughly 1 to 3 minutes per 4-second clip at 720P on an RTX 4090. GGUF quantized versions from vantagewithai range from Q2_K (~6 GB) to Q8_0 (~16-21 GB).
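The storage and VRAM numbers above follow directly from parameter counts and bytes per weight. The back-of-the-envelope estimator below reproduces that arithmetic; it counts transformer weights only, so real repositories run larger because they also ship the UMT5-XXL text encoder, VAE, and often multiple precisions.

```python
# Back-of-the-envelope checkpoint size estimator. Counts model
# weights only; real downloads include the text encoder, VAE, and
# sometimes several precisions, so actual totals run higher.

BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1, "q4": 0.5}

def checkpoint_gb(params_billion, dtype):
    """Approximate weight size in GiB for a given parameter count."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1024**3

for label, params, dtype in [("14B R2V", 14, "bf16"),
                             ("19B A2V", 19, "bf16"),
                             ("14B GGUF Q4", 14, "q4")]:
    print(f"{label}: ~{checkpoint_gb(params, dtype):.0f} GB ({dtype})")
```

This is why a 14B model in BF16 lands around 26 GB of weights alone, and why Q4 quantization is what brings the same model under the 12-14 GB range cited for the community GGUF builds.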

Get It Running

Clone the repo and install dependencies (requires Python 3.12+, CUDA 12.8+):

git clone https://github.com/SkyworkAI/SkyReels-V3.git
cd SkyReels-V3
pip install -r requirements.txt

Generate a reference-to-video clip with a single reference image:

python generate_video.py \
  --task_type r2v \
  --model_id Skywork/SkyReels-V3-R2V-14B \
  --prompt "A woman walking through a sunlit garden" \
  --image ref_portrait.jpg \
  --offload

Generate a talking avatar from a portrait and audio file:

python generate_video.py \
  --task_type a2v \
  --model_id Skywork/SkyReels-V3-A2V-19B \
  --image portrait.jpg \
  --audio speech.wav \
  --prompt "A person speaking naturally, indoor setting" \
  --offload

For ComfyUI users, Kijai's WanVideo Wrapper has FP8 scaled models for all three V3 variants. R2V works in Phantom workflows, V2V in Pusa workflows, and A2V directly in WanVideoWrapper nodes. FP8 models are available at Kijai/WanVideo_comfy_fp8_scaled on Hugging Face.

V3 vs V2: Complements, Not Replaces

V3 does not replace V2. They serve different purposes:

  • V2 remains the go-to for text-to-video, image-to-video, and infinite-length generation via Diffusion Forcing. If you need to type a prompt and get a video, use V2.
  • V3 is for workflows that start with reference material: existing character images, audio recordings, or video clips that need extension. If you have assets and want to build on them, use V3.

Both share the Wan 2.1 backbone and Skywork Community License, so tooling (ComfyUI, quantization) transfers between them.

Limitations

The Skywork Community License allows commercial use but is not OSI-approved. Read the full terms before production deployment. The technical paper (arXiv:2601.17323) is light on training details: no dataset sizes, no ablation studies, no loss function specs. You get weights and inference code but not the full recipe.

The A2V talking avatar model at 19B parameters is the heaviest variant. Without FP8 or GGUF quantization, expect to need 24GB+ of VRAM. Audio-visual sync, while competitive with OmniHuman 1.5, still shows occasional drift on longer clips. Shot switching in V2V is limited to 5-second transitions.

What Comes Next: V4

Skywork published the SkyReels-V4 paper in February 2026. V4 promises joint video and audio generation (synchronized sound alongside video), plus inpainting and editing in a single unified model. Specs in the paper: 1080p at 32fps, 15-second clips. Weights are not yet public, but given Skywork's track record of releasing V1 through V3, the community expects open weights eventually.

Why This Matters

SkyReels-V3 fills three gaps that no other single open-weight project covers: character-consistent video from reference images, production-quality talking avatars from audio, and cinematic video extension with shot transitions. The talking avatar capability alone puts it in competition with closed APIs from Kling and OmniHuman, except you can run it on your own hardware.

Download the R2V-14B or A2V-19B weights, point them at a 24GB GPU with --offload, and test against your own reference images or audio clips. If you also need text-to-video or infinite-length generation, pair it with SkyReels-V2.
