

Self-Hosted AI Video Generator Stack: Hardware Guide 2026

Self-hosted AI video generator stack for 2026: hardware requirements, model choices, and stack guide for builders running Wan 2.1, Mochi, and HunyuanVideo locally.

License: Apache 2.0
TL;DR
  • Complete Ubuntu plus CUDA plus Docker setup for a self-hosted AI video generator stack with Wan 2.2 TI2V-5B as default.
  • Minimum RTX 4090 24GB for 720p clips from the 5B model. Rent an A100 80GB for the A14B flagships or longer durations.
  • Docker Compose recipe plus cost comparison of buying an RTX 4090 vs renting on Runpod or Vast.ai.
System Requirements
RAM: 32GB
GPU: RTX 4090 24GB or A100 80GB
VRAM: 24GB+

You have decided to stop paying per-second for hosted video APIs and run your own AI video generator on a box you control. Smart call if you have sustained throughput, regulated data, or a product that differentiates on custom weights. The rest of the job is picking the right GPU, the right model, and a stack that boots the same way every time. This self-hosted AI video generator 2026 guide walks through hardware choices, the exact install sequence on Ubuntu 24.04, a Docker Compose option, and the trade-off between owning a 4090 and renting an H100 by the hour.

Hardware Requirements for a Self-Hosted AI Video Generator

Video generation is still the most VRAM-hungry thing a developer does on a single box. Your GPU choice dictates which models you can run, at what resolution, and how fast. Here is the matrix we use when sizing a rig.

GPU            VRAM   Wan 2.1 1.3B  Wan 2.2 TI2V-5B  Wan 2.2 A14B    HunyuanVideo-1.5  HunyuanVideo (13B stock)
RTX 3090       24 GB  Yes (fast)    Yes (720p ~9m)   No              Yes (offload)     No
RTX 4090       24 GB  Yes (fast)    Yes (720p ~9m)   No              Yes (fast)        No
RTX 5090       32 GB  Yes           Yes (720p tight) No              Yes (faster)      No
A100 40GB      40 GB  Yes           Yes              No               Yes               544p only
A100 80GB      80 GB  Yes           Yes              Yes              Yes               720p ok
H100 80GB      80 GB  Yes           Yes              Yes (fastest)    Yes               720p fastest

For a balanced default in 2026 we recommend Wan 2.2 TI2V-5B on an RTX 4090. You get Apache 2.0 weights, 720p at 24fps in a single checkpoint that handles both text-to-video and image-to-video, and a clear upgrade path to an 80GB A100 for the A14B flagships when you need more quality. The rest of this guide assumes that pick. HunyuanVideo-1.5 is a strong alternative on the same hardware if you care more about quality than license freedom, as long as you are not shipping into the EU, UK, or South Korea.

Why Wan 2.2 as the Default Self-Hosted AI Video Generator

Three reasons. First, Apache 2.0 licensing means you can ship commercial products without a lawyer sign-off, and with no territory carve-outs. Second, the 5B TI2V variant runs on a single 24GB consumer card at 720p 24fps, and the smaller 1.3B Wan 2.1 weights are still around for 8GB dev laptops, so your iteration and production rigs can share a toolchain. Third, the Wan-Video/Wan2.2 repo is actively maintained and the community has settled on a stable pipeline, so you are not fighting the toolchain every time you pull main.

The trade-off is that Wan is slightly behind HunyuanVideo-1.5 on pure output quality at 720p on the same hardware, and noticeably behind the closed-source Veo 3 and Kling 3.0 models on long clips and audio sync. If your product sells on cinematic fidelity, pay for a closed API. If it sells on volume, customization, or data privacy, self-host Wan and ship.

OS Setup: Ubuntu 24.04 Is the Right Default

We recommend Ubuntu 24.04 LTS. It has the cleanest NVIDIA driver story of the current LTS options, works with CUDA 12.4 out of the box, and the Hugging Face community publishes the most install guides against it. Debian 12 and Fedora 41 both work, but you will spend more time in forum threads when something breaks.

Start from a fresh install and apply the base updates:

sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential git wget curl ffmpeg python3-venv python3-pip

Install the NVIDIA driver. The ubuntu-drivers helper picks a working version for your card.

sudo ubuntu-drivers install
sudo reboot
# After reboot, verify:
nvidia-smi

You should see your GPU and a driver version of 550 or higher. If nvidia-smi fails, you have a driver problem and nothing downstream will work. Fix it before moving on.

CUDA and PyTorch: The Only Versions That Matter

Wan 2.2 and most open-source AI video generator pipelines target CUDA 12.4 and PyTorch 2.5. Install CUDA via the NVIDIA APT repo and PyTorch inside a virtualenv. Do not install PyTorch system-wide; you will regret it the first time you need two environments.

python3 -m venv ~/venv-wan
source ~/venv-wan/bin/activate
pip install --upgrade pip
pip install torch==2.5.0 torchvision --index-url https://download.pytorch.org/whl/cu124

Verify the install sees your GPU:

import torch
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))

If CUDA is not available, your PyTorch build is CPU-only or your driver is mismatched. Reinstall PyTorch against the correct CUDA channel before continuing.

Downloading Wan 2.2 With the HuggingFace CLI

Model downloads are the part people trip over. The huggingface-cli is faster and more reliable than a browser download for 10GB+ checkpoints.

pip install -U "huggingface_hub[cli]"
huggingface-cli login  # paste your HF token
mkdir -p ~/models
cd ~/models
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B \
  --local-dir ./Wan2.2-TI2V-5B \
  --local-dir-use-symlinks False

The 5B TI2V checkpoint is around 15GB, so it lands in roughly ten minutes on a decent connection. For the 14B flagships, swap the model name to Wan-AI/Wan2.2-T2V-A14B or Wan-AI/Wan2.2-I2V-A14B and plan for an 80GB GPU; those variants do not fit on a 24GB card without heavy modifications. Download once, mount from everywhere.
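Partial downloads are easy to miss until a load fails minutes later. A quick stdlib-only sanity check (a hypothetical helper, not part of the Wan or HuggingFace tooling) that sums what actually landed on disk:

```python
from pathlib import Path

def checkpoint_size_gb(model_dir) -> float:
    """Total size of all files under a downloaded checkpoint dir, in GB."""
    root = Path(model_dir).expanduser()
    if not root.is_dir():
        raise FileNotFoundError(f"checkpoint dir not found: {model_dir}")
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file()) / 1e9

# The 5B TI2V checkpoint should total around 15GB; a much smaller number
# usually means an interrupted download that needs to be resumed.
wan_dir = Path("~/models/Wan2.2-TI2V-5B").expanduser()
if wan_dir.is_dir():
    print(f"{checkpoint_size_gb(wan_dir):.1f} GB")
```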

Your First Video Generation

Clone the Wan repo and install its requirements inside the same virtualenv:

cd ~
git clone https://github.com/Wan-Video/Wan2.2.git
cd Wan2.2
pip install -r requirements.txt

Run your first generation:

python generate.py \
  --task ti2v-5B \
  --size 1280*720 \
  --ckpt_dir ~/models/Wan2.2-TI2V-5B \
  --prompt "a sleek server rack pulsing with neon data streams, slow cinematic dolly in"

On an RTX 4090, expect roughly 9 minutes for a 5 second 720p 24fps clip from the TI2V-5B model without custom optimization, based on the numbers Alibaba publishes on the Wan2.2-TI2V-5B model card. The resulting mp4 lands in the working directory. If you hit out-of-memory on the A14B variants, add --offload_model True --t5_cpu to move layers to CPU RAM at the cost of roughly 2x wall-clock time, and expect the A14B configurations to stay slow on anything short of an 80GB card.
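If you script batch runs, it helps to apply the offload flags automatically on OOM instead of babysitting each generation. A minimal sketch that wraps the generate.py invocation above; the retry heuristic and default paths are our assumptions, not part of the Wan repo:

```python
import subprocess

# Flags from the guide above: move layers and the T5 encoder to CPU RAM.
OFFLOAD_FLAGS = ["--offload_model", "True", "--t5_cpu"]

def generate(prompt: str, base_cmd=None) -> bool:
    """Run one generation; on a CUDA OOM, retry once with CPU offload."""
    base_cmd = base_cmd or [
        "python", "generate.py", "--task", "ti2v-5B",
        "--size", "1280*720", "--ckpt_dir", "models/Wan2.2-TI2V-5B",
    ]
    cmd = list(base_cmd) + ["--prompt", prompt]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return True
    if "out of memory" in (result.stderr or "").lower():
        # Offloading roughly doubles wall-clock time but fits in less VRAM.
        retry = subprocess.run(cmd + OFFLOAD_FLAGS, capture_output=True, text=True)
        return retry.returncode == 0
    return False
```

Restarting the process between runs, as noted in the troubleshooting section, also keeps fragmentation from accumulating across a batch.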

Docker Compose Option for Repeatable Deploys

You do not want your production box to depend on whatever Python packages happened to be in your shell when you last ran pip install. Docker Compose pins everything. Here is a minimal setup that runs a Wan 2.2 inference worker behind a simple Flask endpoint. First, the Dockerfile:

FROM nvidia/cuda:12.4.0-cudnn-runtime-ubuntu24.04

RUN apt-get update && apt-get install -y \
    python3 python3-pip git ffmpeg \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
RUN git clone https://github.com/Wan-Video/Wan2.2.git .

# Ubuntu 24.04 marks the system Python as externally managed (PEP 668),
# so pip inside the container needs --break-system-packages.
RUN pip3 install --no-cache-dir --break-system-packages \
    torch==2.5.0 torchvision --index-url https://download.pytorch.org/whl/cu124
RUN pip3 install --no-cache-dir --break-system-packages -r requirements.txt
RUN pip3 install --no-cache-dir --break-system-packages flask gunicorn

COPY serve.py /app/serve.py
EXPOSE 8000
CMD ["gunicorn", "-b", "0.0.0.0:8000", "-w", "1", "--timeout", "1800", "serve:app"]

And the docker-compose.yml that runs it:

services:
  wan-worker:
    build: .
    image: wan-video:local
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models:ro
      - ./outputs:/app/outputs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - HF_HOME=/app/models/hf-cache

The key bit is the deploy.resources.reservations.devices block that passes the GPU into the container. You will need the NVIDIA Container Toolkit installed on the host (it ships from NVIDIA's own apt repository, not the stock Ubuntu archives); otherwise the container boots without GPU access and silently falls back to CPU. Install it with sudo apt install nvidia-container-toolkit after adding NVIDIA's repo, then restart Docker.

Rent or Own: The Real Cost Analysis

The rent-or-own question is the one every builder asks. Here is the math with current 2026 numbers.

Buying an RTX 4090. Around $1,800 used, $2,000 new if you can find one, plus roughly $800 for a workstation with sufficient PSU and airflow to wrap it in. Call it $2,800 all-in. Amortized over 24 months, that is $117 a month before electricity.

Renting an RTX 4090 on Runpod or Vast.ai. Spot instances run roughly $0.35 to $0.55 per hour in early 2026. If you run 4 hours a day, 30 days a month, that is 120 hours, roughly $42 to $66 a month, no hardware commitment.

Renting an A100 80GB. Spot pricing lands around $1.00 to $1.50 per hour. The A100 is where the Wan 2.2 A14B flagships actually fit, and it runs Mochi-1 and the original 13B HunyuanVideo without aggressive quantization. Useful for final renders, overkill for iteration.

Renting an H100 80GB. Spot pricing sits around $2.00 to $3.00 per hour. Worth it only when wall-clock time is the blocker, for example rendering a batch of 50 clips for a client deadline.

Rule of thumb: if you run your AI video generator fewer than four hours a day on average, rent. If you run it eight-plus hours a day, own. The break-even is real and it arrives faster than most indie developers expect.
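The rule of thumb falls out of simple arithmetic. Using the numbers above ($2,800 owned amortized over 24 months, a $0.45/hr mid-range spot price, electricity excluded):

```python
OWN_TOTAL_USD = 2800.0      # card plus workstation, from the estimate above
MONTHS = 24                 # amortization window
SPOT_PER_HOUR = 0.45        # mid-range 4090 spot price on Runpod/Vast.ai
DAYS_PER_MONTH = 30

own_monthly = OWN_TOTAL_USD / MONTHS                          # ~$116.67/mo
break_even_hours = own_monthly / (DAYS_PER_MONTH * SPOT_PER_HOUR)

print(f"Owning costs ${own_monthly:.2f}/mo before electricity")
print(f"Break-even at {break_even_hours:.1f} GPU-hours per day")
```

Electricity on the owned box and spot-price variance on the rented one both push the real break-even toward the eight-hours-a-day mark.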

Troubleshooting: The Five Failures You Will Hit

CUDA out of memory. The most common error. Fixes in order: reduce --size, add --offload_model True, add --t5_cpu, drop the resolution to 480p, or restart the Python process between runs because fragmentation adds up. Do not trust nvidia-smi showing free VRAM; PyTorch reserves more than it shows.

Driver or CUDA version mismatch. Symptom: torch.cuda.is_available() returns False or PyTorch complains about a kernel compile. Fix: uninstall PyTorch, reinstall from the cu124 index URL, verify nvcc --version and nvidia-smi agree on the CUDA major version.

HuggingFace download interrupted. The Wan 2.2 A14B checkpoints run well over 20GB each and will occasionally fail partway. The huggingface-cli resumes safely. If you have a flaky connection, pip install hf_transfer and prefix the download with HF_HUB_ENABLE_HF_TRANSFER=1 for a faster multi-threaded transfer.

FFmpeg missing in Docker. The generate script writes mp4s through ffmpeg. If you forget to apt-install it in the Dockerfile, you get a confusing stack trace at the end of a successful generation. Always install ffmpeg in the container image.

Docker sees no GPU. Symptom: generation runs, but at CPU speed, which is unusably slow. Fix: install the NVIDIA Container Toolkit, run sudo nvidia-ctk runtime configure --runtime=docker, restart the Docker daemon, and confirm with docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu24.04 nvidia-smi.

Production Tips for a Self-Hosted AI Video Generator Stack

Keep a prompt library in version control. Your best outputs come from iterated prompts, not iterated parameters. Commit them, review them in PRs, and reuse them across projects.

Queue every job. Do not hit the inference endpoint synchronously from a user-facing request. A single video takes minutes, not milliseconds, and you will pile up requests. Put Redis or BullMQ in front of the worker and return a job ID to the client.

Cache model weights outside the container. Mount the HuggingFace cache as a volume so rebuilding the image does not re-download tens of GB of checkpoints. Treat the model dir as immutable data, not code.

Log VRAM usage per run. Hook torch.cuda.max_memory_allocated() into your worker and log it. You will spot creeping memory leaks and regression from model updates long before they cause an outage.
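One way to wire that in is a decorator around the worker's generation function; a sketch that degrades to a no-op on CPU-only dev boxes:

```python
import functools
import logging

try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:          # keeps the decorator importable without torch
    _HAS_CUDA = False

def log_peak_vram(fn):
    """Log the peak VRAM torch allocated while fn ran, in GiB."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if _HAS_CUDA:
            torch.cuda.reset_peak_memory_stats()
        result = fn(*args, **kwargs)
        if _HAS_CUDA:
            peak_gib = torch.cuda.max_memory_allocated() / 2**30
            logging.info("%s peak VRAM: %.2f GiB", fn.__name__, peak_gib)
        return result
    return wrapper
```

Ship the log line to whatever you already aggregate; a slow upward trend across model updates is your early warning.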

Separate dev and prod GPUs. Do not run your ComfyUI playground and your production video worker on the same card. They will fight for VRAM, and the playground always wins because it holds references longer.

Spin Up Your First Self-Hosted Video in 10 Minutes

  1. Confirm your GPU has at least 24GB of free VRAM for Wan 2.2 TI2V-5B (or 8GB if you want to start with the Wan 2.1 1.3B fallback) with nvidia-smi
  2. Create the virtualenv and install PyTorch for CUDA 12.4 as shown above
  3. Run huggingface-cli download Wan-AI/Wan2.2-TI2V-5B into a local folder
  4. Clone Wan-Video/Wan2.2 and pip install -r requirements.txt
  5. Run the generate.py command with your own prompt and open the resulting mp4

If that works on your box, you already have a self-hosted AI video generator stack. Everything after this is just making it repeatable, queued, and cheap enough to run at the volume your product actually needs.

