
SkyReels-V3: Open-Source Reference-to-Video and Talking Avatars

SkyReels-V3 by Skywork AI ships three open-weight models: multi-subject reference-to-video (14B), audio-driven talking avatars (19B), and video extension with cinematic shot switching. Built on Wan 2.1, 720P at 24fps, Skywork Community License.

TL;DR
  • Three open-weight models: Reference-to-Video (14B), Talking Avatar (19B), and Video Extension with cinematic shot switching (14B).
  • Top open-source scores on reference consistency (0.6698) and visual quality (0.8119), competitive with Kling 1.6 and OmniHuman 1.5.
  • Built on Wan 2.1. Complements V2 (text-to-video + diffusion forcing). ComfyUI, GGUF, and FP8 quantization supported.
System Requirements
  • RAM: 32 GB
  • GPU: RTX 4090 24GB / A100 80GB
  • VRAM: 24 GB+ (14B) / 24 GB+ (19B A2V)

On January 29, 2026, Skywork AI (Kunlun Tech, Beijing) released SkyReels-V3, the latest in their open-weight video generation family. Where V2 introduced Diffusion Forcing for infinite-length text-to-video, V3 shifts focus to three new capabilities: multi-subject reference-to-video (drop in 1 to 4 reference images and get a video), audio-driven talking avatars (portrait plus audio, up to 200 seconds), and video extension with cinematic shot switching. All three models ship as open weights on Hugging Face, built on the Wan 2.1 backbone.

Three Models, Three Capabilities

V3 is not a single model. It ships as three separate checkpoints, each trained for a specific task:

  • R2V-14B (Reference-to-Video): Takes 1 to 4 reference images of characters, objects, or backgrounds plus a text prompt and generates a 720P video at 24fps. Reference images are encoded through the Wan 2.1 VAE and concatenated with video latents. A cross-frame pairing strategy with image editing extracts subjects while completing backgrounds, so results avoid the "copy-paste" look of naive conditioning.
  • A2V-19B (Talking Avatar): Takes a single portrait image plus an audio clip and generates a lip-synced talking video. Supports up to 200 seconds of continuous output. Works across real-life, cartoon, animal, and stylized characters in Chinese, English, Korean, and even singing. The extra 5B parameters over the 14B variants come from the Wav2Vec2 audio encoder, CLIP vision encoder, and audio-visual attention layers.
  • V2V-14B (Video Extension): Extends an existing video while preserving motion continuity and subject identity. Two modes: single-shot extension (up to 30 seconds) and shot switching (5 seconds) with cinematic transitions like cut-in, cut-out, shot/reverse shot, multi-angle, and cut-away.
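The R2V conditioning path above can be sketched at the shape level: each reference image is VAE-encoded and its latent concatenated with the video latents before the DiT attends over the whole sequence. The dimensions below are illustrative assumptions (16 latent channels and 8x spatial downsampling for the Wan 2.1 VAE), not the actual SkyReels-V3 code.

```python
# Shape-level sketch of R2V conditioning: reference-image latents are
# concatenated with video latents along the frame axis. Illustrative
# only; channel count and downsample factor are assumptions.

VAE_DOWN = 8     # assumed spatial downsampling of the Wan 2.1 VAE
LATENT_CH = 16   # assumed latent channel count

def latent_shape(height, width):
    """Spatial latent shape for one encoded frame or reference image."""
    return (LATENT_CH, height // VAE_DOWN, width // VAE_DOWN)

def r2v_sequence(num_ref_images, num_video_frames, height=720, width=1280):
    """Total latent 'frames' the transformer sees: references + video."""
    c, h, w = latent_shape(height, width)
    return (num_ref_images + num_video_frames, c, h, w)

# Four reference images prepended to 21 latent video frames at 720P:
print(r2v_sequence(num_ref_images=4, num_video_frames=21))
```

Because the references live in the same latent sequence as the video, cross-frame attention can bind subjects to generated frames without a separate adapter, which is consistent with the "in-context" framing in the paper.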

How It Works Under the Hood

V3 uses a unified multimodal in-context learning framework built on Diffusion Transformers (DiT) with the Wan 2.1 backbone. The text encoder is UMT5-XXL (~11.4GB in BF16). For inference scheduling, V3 uses Flow Matching with a FlowMatchEulerDiscreteScheduler instead of the DDPM approach used in V2.
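The flow-matching sampler is conceptually simple: the model predicts a velocity field and the scheduler integrates it with Euler steps from noise to data. The toy sketch below shows that update rule in isolation (straight-line rectified-flow path, noise at sigma=1, data at sigma=0); it is a minimal illustration, not the actual FlowMatchEulerDiscreteScheduler implementation.

```python
# Minimal flow-matching Euler sampler, assuming a straight-line
# (rectified-flow) path with noise at sigma=1 and data at sigma=0.

def sigmas(num_steps):
    """Linearly spaced noise levels from 1.0 down to 0.0."""
    return [1.0 - i / num_steps for i in range(num_steps + 1)]

def euler_step(x, velocity, sigma, sigma_next):
    """One Euler update along the predicted velocity field."""
    return [xi + vi * (sigma_next - sigma) for xi, vi in zip(x, velocity)]

def denoise(x_noisy, velocity_fn, num_steps=4):
    """Integrate from pure noise (sigma=1) down to data (sigma=0)."""
    s = sigmas(num_steps)
    x = x_noisy
    for sigma, sigma_next in zip(s[:-1], s[1:]):
        x = euler_step(x, velocity_fn(x, sigma), sigma, sigma_next)
    return x

# With the ideal straight-line velocity v = noise - data, Euler
# integration recovers the data regardless of step count.
data, noise = [0.5, -1.0], [1.2, 0.3]
v_ideal = lambda x, sigma: [n - d for n, d in zip(noise, data)]
print(denoise(noise, v_ideal))  # ≈ data
```

The practical upshot versus DDPM-style schedulers is that these deterministic Euler trajectories tend to need fewer sampling steps, which matters when each step runs a 14B or 19B transformer.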

The talking avatar pipeline is the most complex. Audio goes through a Wav2Vec2 encoder, then an AudioProjModel maps audio features to context tokens per video frame. These tokens feed into dedicated audio-visual attention layers alongside the portrait image conditioning. A key-frame constrained generation approach first structures content at key points, then fills transitions smoothly.
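Before any projection happens, the audio and video streams have to be aligned in time: Wav2Vec2 emits features at roughly 50 Hz while the video runs at 24 fps. The sketch below shows one plausible way to pick the audio-feature window for each video frame, the grouping an AudioProjModel-style module would then project into context tokens. The rates and window size are assumptions for illustration, not the actual SkyReels-V3 alignment code.

```python
# Hypothetical audio-to-video frame alignment: Wav2Vec2 features
# (~50 Hz, one vector per ~20 ms) mapped to 24 fps video frames,
# with a small context window for co-articulation.

AUDIO_HZ = 50    # assumed Wav2Vec2 feature rate
VIDEO_FPS = 24

def audio_window_for_frame(frame_idx, num_audio_feats, context=2):
    """Indices of audio features covering one video frame, padded
    with `context` extra features on each side."""
    center = round(frame_idx * AUDIO_HZ / VIDEO_FPS)
    start = max(0, center - context)
    end = min(num_audio_feats, center + context + 1)
    return list(range(start, end))

# One second of audio (50 features) against 24 video frames:
windows = [audio_window_for_frame(i, 50) for i in range(24)]
```

Each window would then be projected to a fixed number of context tokens per frame and injected through the audio-visual attention layers described above.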

The video extension pipeline uses unified multi-segment positional encoding to handle multiple video segments. For shot switching, a dedicated shot transformer classifies transition types, and prompts use prefixes like [ZOOM_IN_CUT] or [CUT_AWAY] to control the cinematic style.
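In practice the transition control is just a prompt convention: a bracketed tag prefixed to the text prompt. A tiny helper like the one below keeps the tags consistent; the tag names come from the transitions listed in this article and are not guaranteed to be the repo's exhaustive set.

```python
# Illustrative helper for the V2V shot-switching prompt convention:
# a bracketed transition tag prefixed to the text prompt. Tag list
# assumed from the transitions named in this article.

SHOT_TAGS = {
    "zoom_in_cut": "[ZOOM_IN_CUT]",
    "cut_away": "[CUT_AWAY]",
}

def shot_switch_prompt(transition, prompt):
    """Prefix a prompt with its cinematic transition tag."""
    tag = SHOT_TAGS.get(transition)
    if tag is None:
        raise ValueError(f"unknown transition: {transition}")
    return f"{tag} {prompt}"

print(shot_switch_prompt("cut_away", "The camera reveals the empty hallway"))
```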

Benchmarks

Skywork reports benchmark results for reference-to-video and talking avatar against recent competitors:

Reference-to-Video

| Model | Ref. Consistency | Instruction Following | Visual Quality |
|---|---|---|---|
| SkyReels V3 | 0.6698 | 27.22 | 0.8119 |
| Kling 1.6 | 0.6630 | 29.23 | 0.8034 |
| PixVerse V5 | 0.6542 | 29.34 | 0.7976 |
| Vidu Q2 | 0.5961 | 27.84 | 0.7877 |

V3 leads on reference consistency and visual quality. It trails Kling 1.6 and PixVerse on instruction following, meaning the model sometimes prioritizes subject fidelity over prompt adherence. For workflows where the reference images matter more than the text prompt, that is the right tradeoff.

Talking Avatar

| Model | Audio-Visual Sync | Visual Quality | Character Consistency |
|---|---|---|---|
| OmniHuman 1.5 | 8.25 | 4.60 | 0.81 |
| SkyReels V3 | 8.18 | 4.60 | 0.80 |
| KlingAvatar | 8.01 | 4.55 | 0.78 |
| HunyuanAvatar | 6.72 | 4.50 | 0.74 |

V3 ties for best visual quality and is within striking distance of OmniHuman 1.5 on sync and consistency. Crucially, V3 is the only model in this comparison with open weights.

Hardware Requirements

| Setup | VRAM | Notes |
|---|---|---|
| Full precision 720P | 24+ GB | RTX 4090 or A100 |
| With --offload | ~24 GB | CPU offloading for model components |
| With --low_vram | <24 GB | FP8 quantization + block offload |
| GGUF Q4 | ~12-14 GB | Community quantizations on HF |
| Multi-GPU (xDiT) | 4x GPUs | Cannot combine with --low_vram |

The 14B variants (R2V, V2V) need roughly 52-69 GB of storage. The 19B A2V model needs about 56 GB. Generation takes roughly 1 to 3 minutes per 4-second clip at 720P on an RTX 4090. GGUF quantized versions from vantagewithai range from Q2_K (~6 GB) to Q8_0 (~16-21 GB).
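The storage and VRAM numbers above follow directly from parameter counts and bytes per weight. The back-of-the-envelope estimator below reproduces that arithmetic; it counts transformer weights only, so real repositories run larger because they also ship the UMT5-XXL text encoder, VAE, and often multiple precisions.

```python
# Back-of-the-envelope checkpoint size estimator. Counts model
# weights only; real downloads include the text encoder, VAE, and
# sometimes several precisions, so actual totals run higher.

BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1, "q4": 0.5}

def checkpoint_gb(params_billion, dtype):
    """Approximate weight size in GiB for a given parameter count."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1024**3

for label, params, dtype in [("14B R2V", 14, "bf16"),
                             ("19B A2V", 19, "bf16"),
                             ("14B GGUF Q4", 14, "q4")]:
    print(f"{label}: ~{checkpoint_gb(params, dtype):.0f} GB ({dtype})")
```

This is why a 14B model in BF16 lands around 26 GB of weights alone, and why Q4 quantization is what brings the same model under the 12-14 GB range cited for the community GGUF builds.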

Get It Running

Clone the repo and install dependencies (requires Python 3.12+, CUDA 12.8+):

git clone https://github.com/SkyworkAI/SkyReels-V3.git
cd SkyReels-V3
pip install -r requirements.txt

Generate a reference-to-video clip with a single reference image:

python generate_video.py \
  --task_type r2v \
  --model_id Skywork/SkyReels-V3-R2V-14B \
  --prompt "A woman walking through a sunlit garden" \
  --image ref_portrait.jpg \
  --offload

Generate a talking avatar from a portrait and audio file:

python generate_video.py \
  --task_type a2v \
  --model_id Skywork/SkyReels-V3-A2V-19B \
  --image portrait.jpg \
  --audio speech.wav \
  --prompt "A person speaking naturally, indoor setting" \
  --offload

For ComfyUI users, Kijai's WanVideo Wrapper has FP8 scaled models for all three V3 variants. R2V works in Phantom workflows, V2V in Pusa workflows, and A2V directly in WanVideoWrapper nodes. FP8 models are available at Kijai/WanVideo_comfy_fp8_scaled on Hugging Face.

V3 vs V2: Complements, Not Replaces

V3 does not replace V2. They serve different purposes:

  • V2 remains the go-to for text-to-video, image-to-video, and infinite-length generation via Diffusion Forcing. If you need to type a prompt and get a video, use V2.
  • V3 is for workflows that start with reference material: existing character images, audio recordings, or video clips that need extension. If you have assets and want to build on them, use V3.

Both share the Wan 2.1 backbone and Skywork Community License, so tooling (ComfyUI, quantization) transfers between them.

Limitations

The Skywork Community License allows commercial use but is not OSI-approved. Read the full terms before production deployment. The technical paper (arXiv:2601.17323) is light on training details: no dataset sizes, no ablation studies, no loss function specs. You get weights and inference code but not the full recipe.

The A2V talking avatar model at 19B parameters is the heaviest variant. Without FP8 or GGUF quantization, expect to need 24GB+ of VRAM. Audio-visual sync, while competitive with OmniHuman 1.5, still shows occasional drift on longer clips. Shot switching in V2V is limited to 5-second transitions.

What Comes Next: V4

Skywork published the SkyReels-V4 paper in February 2026. V4 promises joint video and audio generation (synchronized sound alongside video), plus inpainting and editing in a single unified model. Specs in the paper: 1080p at 32fps, 15-second clips. Weights are not yet public, but given Skywork's track record of releasing V1 through V3, the community expects open weights eventually.

Why This Matters

SkyReels-V3 fills three gaps that no other single open-weight project covers: character-consistent video from reference images, production-quality talking avatars from audio, and cinematic video extension with shot transitions. The talking avatar capability alone puts it in competition with closed APIs from Kling and OmniHuman, except you can run it on your own hardware.

Download the R2V-14B or A2V-19B weights, point them at a 24GB GPU with --offload, and test against your own reference images or audio clips. If you also need text-to-video or infinite-length generation, pair it with SkyReels-V2.
