
Hugging Face Smol Stack: SmolLM3 + SmolVLM2 Local AI in 2026

Deploy the Hugging Face Smol Stack on GPU, Apple Silicon, and phones. SmolLM3-3B reasoning plus SmolVLM2-2.2B video understanding, Apache 2.0, fully local.

License Apache 2.0
TL;DR
  • SmolLM3-3B for reasoning (128k context, dual-mode think/no-think, 6 languages) paired with SmolVLM2-2.2B for video and image understanding. Both Apache 2.0 from Hugging Face.
  • Runs locally on consumer GPU, Apple Silicon via MLX, and modern phones (HuggingSnap on iOS, SmolChat on Android). No cloud, no API keys.
  • No SmolVLM3 or SmolLM4 has shipped as of April 2026. This pair remains the flagship fully open Smol stack.
System Requirements
RAM: 4 GB (phone) / 16 GB (desktop)
GPU: RTX 3060 12 GB or better
VRAM: 8-12 GB (stack combined)
✓ Ollama ✓ Apple Silicon

Hugging Face's Smol Models Research team has been quietly building something that most release cycles glossed over: a matched pair of fully open, Apache-2.0 models that give you long-context reasoning and video understanding on hardware you already own. SmolLM3-3B handles the thinking. SmolVLM2-2.2B handles the seeing. Together, with zero cloud calls, they fit inside 12 GB of VRAM, run on an M-series Mac, and have already shipped to iPhones as a public App Store app.

Neither model is new. SmolVLM2-2.2B landed in February 2025. SmolLM3-3B followed in July 2025. What matters in April 2026 is that nothing has replaced them. There is no SmolVLM3. There is no SmolLM4. Meanwhile Qwen3.5 and Gemma 4 have raised the bar on the open-weights side of the small-model race, and the Smol pair remains the most deployable fully open option, with published weights, data recipes, and training details, for builders who want both text and vision locally. This article is the stack guide: what each half does, how to run them on a GPU, a Mac, and a phone, and how to glue them together in one Python file.

Why these two belong in the same article

They come from the same lab (HuggingFaceTB, the Hugging Face Smol Models Research group), share the same Apache 2.0 license, share the same design philosophy (run anywhere, publish everything), and cover complementary modalities. SmolVLM2 is even literally built on SmolLM2-1.7B as its text decoder. The Smol repo treats them as one family.

One open question worth flagging: SmolVLM2 still uses SmolLM2 as its text backbone, not SmolLM3. No official SmolVLM3 has shipped. For anyone deploying today, that's fine. For anyone watching the roadmap, it's the obvious next move and it hasn't happened yet.

SmolLM3-3B in practice

SmolLM3 is a 3B decoder-only transformer with a few deliberate architecture choices. It uses GQA (grouped-query attention) with 16 heads and 4 key/value heads, and NoPE (no positional embeddings) on a 3-to-1 ratio with standard RoPE layers. That combination is what lets it hit 64k native context and extrapolate to 128k via YARN without falling apart. Pretraining ran 11.2 trillion tokens through a staged curriculum: web, code, math, and reasoning data, in that order of weight.
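The 3-to-1 NoPE ratio is easy to picture: in every block of four layers, three use standard RoPE and the fourth drops positional embeddings entirely. A toy sketch of that schedule (the layer count and the exact position of the NoPE layer within each block are illustrative assumptions, not read from the released config):

```python
def nope_layers(num_layers: int, ratio: int = 4) -> list[int]:
    """Illustrative NoPE schedule: every `ratio`-th layer (0-indexed)
    skips rotary position embeddings; the rest use standard RoPE."""
    return [i for i in range(num_layers) if (i + 1) % ratio == 0]

# With a hypothetical 36-layer stack, a quarter of the layers are NoPE:
layers = nope_layers(36)
print(len(layers), layers[:3])  # 9 [3, 7, 11]
```

The intuition is that NoPE layers carry no position-dependent rotation to break at unseen lengths, which is part of why the model extrapolates past its native window.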

The feature set is short but unusual for the 3B class: dual-mode reasoning (think / no-think, toggled via the chat template), six natively supported languages (English, French, Spanish, German, Italian, Portuguese), and function-calling training baked in. Hugging Face-reported benchmarks put it ahead of Llama-3.2-3B and Qwen2.5-3B at 3B scale, and within range of Qwen3-4B and Gemma 3 4B on several tasks despite the smaller parameter count.

Run it on a consumer GPU

pip install -U transformers accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain NoPE in one paragraph."}]
inputs = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True,
).to(model.device)

out = model.generate(inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))

Set enable_thinking=False for fast, non-reasoning replies. BF16 with a 3B model fits comfortably under 8 GB of VRAM.

Run it with llama.cpp or Ollama

# GGUF via llama.cpp, Q4_K_M is the sweet spot for CPU or 6 GB GPU
huggingface-cli download ggml-org/SmolLM3-3B-GGUF SmolLM3-3B-Q4_K_M.gguf --local-dir .
./llama-cli -m SmolLM3-3B-Q4_K_M.gguf -p "Why is NoPE useful at long context?"

# Ollama (community mirror, not an official HuggingFaceTB tag)
ollama pull alibayram/smollm3:latest
ollama run alibayram/smollm3

The alibayram/smollm3 Ollama tag is a community mirror. It is not published by Hugging Face. Use it if you want one-line install, but prefer the GGUF path if you care about provenance.
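If you want the GGUF behind an HTTP endpoint rather than a REPL, llama.cpp's bundled server exposes an OpenAI-compatible API. A minimal sketch using the same file downloaded above (the port and context size are arbitrary choices):

```shell
# Serve the quantized model on localhost
./llama-server -m SmolLM3-3B-Q4_K_M.gguf -c 8192 --port 8080

# Query it with the OpenAI-style chat completions endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is NoPE?"}]}'
```

This is the easiest way to point existing OpenAI-client code at a local SmolLM3 without changing application logic.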

Run it on Apple Silicon

pip install mlx-lm
python -m mlx_lm.generate \
  --model mlx-community/SmolLM3-3B \
  --prompt "Summarize what NoPE buys you at 64k context." \
  --max-tokens 512

MLX runs noticeably faster than llama.cpp on M-series chips for this size class. A 16 GB unified-memory Mac handles the full BF16 weights comfortably.

SmolVLM2-2.2B in practice

SmolVLM2-2.2B is a video and image VLM built on two components: SigLIP for the visual encoder and SmolLM2-1.7B-Instruct as the text decoder. It samples video at 1 frame per second with a hard cap of 64 frames per clip, so it is a short-clip tool out of the box (up to about one minute at native sampling). Longer videos need chunking.
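Chunking longer footage can be as simple as precomputing time windows that respect the 64-frame cap at 1 FPS, then feeding each slice to the model separately and concatenating the descriptions. A small helper (the overlap parameter is my own addition for continuity across cuts, not anything SmolVLM2 requires):

```python
def chunk_windows(duration_s: float, max_frames: int = 64,
                  fps: float = 1.0, overlap_s: float = 4.0):
    """Split a video into (start, end) windows that each fit the
    frame budget: max_frames / fps seconds per chunk."""
    chunk_s = max_frames / fps
    step = chunk_s - overlap_s
    windows, start = [], 0.0
    while start < duration_s:
        windows.append((start, min(start + chunk_s, duration_s)))
        start += step
    return windows

print(chunk_windows(150))  # [(0.0, 64.0), (60.0, 124.0), (120.0, 150.0)]
```

Each window can then be cut with ffmpeg (or passed as a trimmed clip) before the per-chunk inference call.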

Hugging Face-reported benchmarks: Video-MME 52.1, MLVU 55.2, MVBench 46.27, Science_QA 90.0, and OCR 72.9. For context, Video-MME at 52.1 puts it in the same range as much larger VLMs released in 2024, and the OCR number makes it genuinely useful for screenshot parsing and document reading, not just dataset curiosity. GPU RAM for video inference sits around 5.2 GB.

Run it on a consumer GPU

pip install "git+https://github.com/huggingface/transformers@v4.49.0-SmolVLM-2"
pip install accelerate flash-attn --no-build-isolation
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
).to("cuda")

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "path": "clip.mp4"},
        {"type": "text",  "text": "List the key events in this clip with rough timestamps."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

out = model.generate(**inputs, do_sample=False, max_new_tokens=512)
print(processor.batch_decode(out, skip_special_tokens=True)[0])

Run it on Apple Silicon

pip install "git+https://github.com/pcuenca/mlx-vlm.git@smolvlm"

# Image prompt
python -m mlx_vlm.generate \
  --model mlx-community/SmolVLM2-2.2B-Instruct-mlx \
  --image ./photo.jpg \
  --prompt "Describe what you see."

# Video prompt (500M is the right size for phones and tablets)
python -m mlx_vlm.smolvlm_video_generate \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --prompt "What happens in this clip?" \
  --video ./clip.mov

Run it on a phone

This is the part that surprised us. Hugging Face shipped HuggingSnap on the iOS App Store, a native Swift app built on MLX that runs SmolVLM2-500M entirely on-device. Point the camera, ask a question, get an answer. Source is on GitHub at huggingface/huggingsnap. On Android, SmolChat-Android (shubham0204/SmolChat-Android) runs GGUF builds of both SmolVLM2 and SmolLM3 through a llama.cpp bridge. Community reports confirm the SmolVLM2-256M GGUF variant runs smoothly on Snapdragon 8 Gen 3 and Snapdragon 8 Elite devices.

For most phone deployments you will want the 500M or 256M SmolVLM2 variant, not the 2.2B. The 500M is the model HuggingSnap actually ships. The 2.2B is for desktop, Mac, and server.

The combined stack: a concrete recipe

The interesting workflow is not running either model in isolation. It is chaining them. SmolVLM2 extracts structured content from a video or image. SmolLM3 reasons over that extracted content in think mode and produces an answer you can act on. Nothing leaves the machine.

import torch
from transformers import (
    AutoProcessor, AutoModelForImageTextToText,
    AutoTokenizer, AutoModelForCausalLM,
)

# Step 1: vision model sees the video
vlm_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
vlm_proc = AutoProcessor.from_pretrained(vlm_id)
vlm = AutoModelForImageTextToText.from_pretrained(
    vlm_id, torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
).to("cuda")

vlm_msg = [{"role": "user", "content": [
    {"type": "video", "path": "meeting.mp4"},
    {"type": "text",  "text": "Transcribe spoken content and describe on-screen visuals with timestamps."},
]}]
vlm_inputs = vlm_proc.apply_chat_template(
    vlm_msg, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(vlm.device, dtype=torch.bfloat16)
vlm_out = vlm.generate(**vlm_inputs, do_sample=False, max_new_tokens=1024)
video_notes = vlm_proc.batch_decode(vlm_out, skip_special_tokens=True)[0]

del vlm  # free VRAM before loading the LLM
torch.cuda.empty_cache()

# Step 2: reasoning model acts on what the VLM saw
llm_id = "HuggingFaceTB/SmolLM3-3B"
tok = AutoTokenizer.from_pretrained(llm_id)
llm = AutoModelForCausalLM.from_pretrained(
    llm_id, torch_dtype=torch.bfloat16, device_map="auto",
)

task = (
    "Below are notes from a video. Extract a list of action items with owners "
    "and deadlines. If an item lacks an owner or deadline, flag it.\n\n"
    f"{video_notes}"
)
llm_inputs = tok.apply_chat_template(
    [{"role": "user", "content": task}],
    add_generation_prompt=True, return_tensors="pt", enable_thinking=True,
).to(llm.device)
llm_out = llm.generate(llm_inputs, max_new_tokens=1024)
print(tok.decode(llm_out[0][llm_inputs.shape[-1]:], skip_special_tokens=True))

Two details worth knowing. First, loading both models into VRAM at once costs roughly 5.2 GB (VLM) plus 6 GB (LLM in BF16), so an 8 GB card needs sequential loading, as shown above. On a 12 GB card you can keep both resident. Second, enable_thinking=True turns on SmolLM3's think mode for the reasoning step, which is what makes the "extract action items" task actually work well at 3B scale.
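The back-of-envelope math behind those numbers: BF16 stores 2 bytes per parameter, so weights alone cost roughly params × 2 bytes, before activations, KV cache, and framework overhead. A quick estimator (the parameter counts below are approximate):

```python
def bf16_weight_gib(params: float) -> float:
    """Approximate BF16 weight footprint in GiB (2 bytes per parameter);
    excludes activations, KV cache, and framework overhead."""
    return params * 2 / 2**30

print(f"SmolLM3-3B:    {bf16_weight_gib(3.08e9):.1f} GiB")  # ~5.7 GiB
print(f"SmolVLM2-2.2B: {bf16_weight_gib(2.2e9):.1f} GiB")   # ~4.1 GiB
```

Add a gigabyte or two of headroom per model for the KV cache and you land at the 8 GB sequential / 12 GB resident split described above.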

Deployment matrix

| Platform | SmolLM3 path | SmolVLM2 path | RAM target |
| --- | --- | --- | --- |
| Consumer GPU (RTX 3060 and up) | Transformers BF16, or Q4_K_M GGUF via llama.cpp | Transformers BF16 with flash-attn 2 | 8-12 GB VRAM |
| Apple Silicon (M1 and up) | mlx-lm with mlx-community/SmolLM3-3B | mlx-vlm with SmolVLM2-500M or 2.2B MLX build | 16 GB unified memory |
| Phone (Android / iOS) | Q4 GGUF via SmolChat-Android, or an on-device llama.cpp port | HuggingSnap (iOS, 500M) or SmolChat-Android (256M or 500M GGUF) | 4-8 GB phone RAM |

Limitations and gotchas

  • SmolVLM2 caps video input at 64 frames. At 1 FPS sampling that's about a minute of footage. Longer clips need chunking in your application code.
  • SmolLM3's 128k context is YARN extrapolation from a 64k native window. Quality holds through 64k and degrades beyond. Treat 64k as the honest ceiling.
  • 2.2B VLMs describe and OCR well but will not match GPT-4o or Gemini 2.5 Flash on fine-grained reasoning over complex scenes. Use them where privacy or offline is the point, not where you need frontier multimodal reasoning.
  • The alibayram/smollm3 Ollama tag is community-maintained. There is no official HuggingFaceTB Ollama tag at the time of writing.
  • SmolVLM2 is still anchored on SmolLM2-1.7B as its text decoder. A SmolVLM3 built on the newer reasoner would be the natural next release. It has not shipped yet.
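The 64k ceiling from the list above is worth enforcing in code rather than hoping prompts stay short. A crude pre-flight guard using a rough 4-characters-per-token heuristic (the real count comes from the tokenizer; this constant is an assumption for illustration only):

```python
def fits_context(text: str, limit_tokens: int = 64_000,
                 chars_per_token: float = 4.0) -> bool:
    """Rough pre-check: estimate token count from character length
    and compare against SmolLM3's native 64k window."""
    return len(text) / chars_per_token <= limit_tokens

print(fits_context("x" * 200_000))  # ~50k tokens -> True
print(fits_context("x" * 300_000))  # ~75k tokens -> False
```

For anything load-bearing, replace the heuristic with `len(tok(text).input_ids)` from the actual tokenizer.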

Who should use this

  • Developers building privacy-first tools: on-device assistants, meeting recorders, local note apps, accessibility utilities.
  • Indie hackers who want offline-capable mobile apps without paying per-token inference fees.
  • RAG and agent builders who need a local fallback when an API key is rate-limited, blocked, or simply unavailable in the field.
  • Researchers with limited compute who value reproducibility from fully open weights and published training details.

Sources and further reading

If you only do one thing after reading this: pull the SmolLM3-3B Q4_K_M GGUF into llama.cpp, run the think-mode prompt from the SmolLM3 section, and confirm for yourself that a top-of-class 3B reasoner runs locally on the hardware you already own. Ten minutes, no account, no key.

Benchmarks cited are Hugging Face-reported or community-reported; this article did not run independent benchmarks.
