
Hugging Face Smol Stack: SmolLM3 + SmolVLM2 Local AI in 2026

Deploy the Hugging Face Smol Stack on GPU, Apple Silicon, and phones. SmolLM3-3B reasoning plus SmolVLM2-2.2B video understanding, Apache 2.0, fully local.

License Apache 2.0
TL;DR
  • SmolLM3-3B for reasoning (128k context, dual-mode think/no-think, 6 languages) paired with SmolVLM2-2.2B for video and image understanding. Both Apache 2.0 from Hugging Face.
  • Runs locally on consumer GPU, Apple Silicon via MLX, and modern phones (HuggingSnap on iOS, SmolChat on Android). No cloud, no API keys.
  • No SmolVLM3 or SmolLM4 has shipped as of April 2026. This pair remains the flagship fully open Smol stack.
System Requirements
RAM: 4 GB (phone) / 16 GB (desktop)
GPU: RTX 3060 12 GB or better
VRAM: 8-12 GB (stack combined)
✓ Ollama ✓ Apple Silicon

Hugging Face's Smol Models Research team has been quietly building something that most release cycles glossed over: a matched pair of fully open, Apache-2.0 models that give you long-context reasoning and video understanding on hardware you already own. SmolLM3-3B handles the thinking. SmolVLM2-2.2B handles the seeing. Together, with zero cloud calls, they fit inside 12 GB of VRAM, run on an M-series Mac, and have already shipped to iPhones as a public App Store app.

Neither model is new. SmolVLM2-2.2B landed in February 2025. SmolLM3-3B followed in July 2025. What matters in April 2026 is that nothing has replaced them. There is no SmolVLM3. There is no SmolLM4. Meanwhile Qwen3.5 and Gemma 4 have raised the bar on the open-weights side of the small-model race, and the Smol pair remains the most deployable fully open option, with published weights, data recipes, and training details, for builders who want both text and vision locally. This article is the stack guide: what each half does, how to run them on a GPU, a Mac, and a phone, and how to glue them together in one Python file.

Why these two belong in the same article

They come from the same lab (HuggingFaceTB, the Hugging Face Smol Models Research group), share the same Apache 2.0 license, share the same design philosophy (run anywhere, publish everything), and cover complementary modalities. SmolVLM2 is even literally built on SmolLM2-1.7B as its text decoder. The Smol repo treats them as one family.

One open question worth flagging: SmolVLM2 still uses SmolLM2 as its text backbone, not SmolLM3. No official SmolVLM3 has shipped. For anyone deploying today, that's fine. For anyone watching the roadmap, it's the obvious next move and it hasn't happened yet.

SmolLM3-3B in practice

SmolLM3 is a 3B decoder-only transformer with a few deliberate architecture choices. It uses GQA (grouped-query attention) with 16 heads and 4 key/value heads, and NoPE (no positional embeddings) on a 3-to-1 ratio with standard RoPE layers. That combination is what lets it hit 64k native context and extrapolate to 128k via YARN without falling apart. Pretraining ran 11.2 trillion tokens through a staged curriculum: web, code, math, and reasoning data, in that order of weight.
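The 3-to-1 NoPE ratio is easy to picture: in every block of four layers, three use standard RoPE and the fourth drops positional embeddings entirely. A toy sketch of that schedule (the layer count and the exact position of the NoPE layer within each block are illustrative assumptions, not read from the released config):

```python
def nope_layers(num_layers: int, ratio: int = 4) -> list[int]:
    """Illustrative NoPE schedule: every `ratio`-th layer (0-indexed)
    skips rotary position embeddings; the rest use standard RoPE."""
    return [i for i in range(num_layers) if (i + 1) % ratio == 0]

# With a hypothetical 36-layer stack, a quarter of the layers are NoPE:
layers = nope_layers(36)
print(len(layers), layers[:3])  # 9 [3, 7, 11]
```

The intuition is that NoPE layers carry no position-dependent rotation to break at unseen lengths, which is part of why the model extrapolates past its native window.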

The feature set is short but unusual for the 3B class: dual-mode reasoning (think / no-think, toggled via the chat template), six natively supported languages (English, French, Spanish, German, Italian, Portuguese), and function-calling training baked in. Hugging Face-reported benchmarks put it ahead of Llama-3.2-3B and Qwen2.5-3B at 3B scale, and within range of Qwen3-4B and Gemma 3 4B on several tasks despite the smaller parameter count.

Run it on a consumer GPU

pip install -U transformers accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain NoPE in one paragraph."}]
inputs = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True,
).to(model.device)

out = model.generate(inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))

Set enable_thinking=False for fast, non-reasoning replies. BF16 with a 3B model fits comfortably under 8 GB of VRAM.

Run it with llama.cpp or Ollama

# GGUF via llama.cpp, Q4_K_M is the sweet spot for CPU or 6 GB GPU
huggingface-cli download ggml-org/SmolLM3-3B-GGUF SmolLM3-3B-Q4_K_M.gguf --local-dir .
./llama-cli -m SmolLM3-3B-Q4_K_M.gguf -p "Why is NoPE useful at long context?"

# Ollama (community mirror, not an official HuggingFaceTB tag)
ollama pull alibayram/smollm3:latest
ollama run alibayram/smollm3

The alibayram/smollm3 Ollama tag is a community mirror. It is not published by Hugging Face. Use it if you want one-line install, but prefer the GGUF path if you care about provenance.
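If you want the GGUF behind an HTTP endpoint rather than a REPL, llama.cpp's bundled server exposes an OpenAI-compatible API. A minimal sketch using the same file downloaded above (the port and context size are arbitrary choices):

```shell
# Serve the quantized model on localhost
./llama-server -m SmolLM3-3B-Q4_K_M.gguf -c 8192 --port 8080

# Query it with the OpenAI-style chat completions endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is NoPE?"}]}'
```

This is the easiest way to point existing OpenAI-client code at a local SmolLM3 without changing application logic.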

Run it on Apple Silicon

pip install mlx-lm
python -m mlx_lm.generate \
  --model mlx-community/SmolLM3-3B \
  --prompt "Summarize what NoPE buys you at 64k context." \
  --max-tokens 512

MLX runs noticeably faster than llama.cpp on M-series chips for this size class. A 16 GB unified-memory Mac handles the full BF16 weights comfortably.

SmolVLM2-2.2B in practice

SmolVLM2-2.2B is a video and image VLM built on two components: SigLIP for the visual encoder and SmolLM2-1.7B-Instruct as the text decoder. It samples video at 1 frame per second with a hard cap of 64 frames per clip, so it is a short-clip tool out of the box (up to about one minute at native sampling). Longer videos need chunking.
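Chunking longer footage can be as simple as precomputing time windows that respect the 64-frame cap at 1 FPS, then feeding each slice to the model separately and concatenating the descriptions. A small helper (the overlap parameter is my own addition for continuity across cuts, not anything SmolVLM2 requires):

```python
def chunk_windows(duration_s: float, max_frames: int = 64,
                  fps: float = 1.0, overlap_s: float = 4.0):
    """Split a video into (start, end) windows that each fit the
    frame budget: max_frames / fps seconds per chunk."""
    chunk_s = max_frames / fps
    step = chunk_s - overlap_s
    windows, start = [], 0.0
    while start < duration_s:
        windows.append((start, min(start + chunk_s, duration_s)))
        start += step
    return windows

print(chunk_windows(150))  # [(0.0, 64.0), (60.0, 124.0), (120.0, 150.0)]
```

Each window can then be cut with ffmpeg (or passed as a trimmed clip) before the per-chunk inference call.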

Hugging Face-reported benchmarks: Video-MME 52.1, MLVU 55.2, MVBench 46.27, Science_QA 90.0, and OCR 72.9. For context, Video-MME at 52.1 puts it in the same range as much larger VLMs released in 2024, and the OCR number makes it genuinely useful for screenshot parsing and document reading, not just dataset curiosity. GPU RAM for video inference sits around 5.2 GB.

Run it on a consumer GPU

pip install "git+https://github.com/huggingface/transformers@v4.49.0-SmolVLM-2"
pip install accelerate flash-attn --no-build-isolation
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
).to("cuda")

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "path": "clip.mp4"},
        {"type": "text",  "text": "List the key events in this clip with rough timestamps."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

out = model.generate(**inputs, do_sample=False, max_new_tokens=512)
print(processor.batch_decode(out, skip_special_tokens=True)[0])

Run it on Apple Silicon

pip install "git+https://github.com/pcuenca/mlx-vlm.git@smolvlm"

# Image prompt
python -m mlx_vlm.generate \
  --model mlx-community/SmolVLM2-2.2B-Instruct-mlx \
  --image ./photo.jpg \
  --prompt "Describe what you see."

# Video prompt (500M is the right size for phones and tablets)
python -m mlx_vlm.smolvlm_video_generate \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --prompt "What happens in this clip?" \
  --video ./clip.mov

Run it on a phone

This is the part that surprised us. Hugging Face shipped HuggingSnap on the iOS App Store, a native Swift app built on MLX that runs SmolVLM2-500M entirely on-device. Point the camera, ask a question, get an answer. Source is on GitHub at huggingface/huggingsnap. On Android, SmolChat-Android (shubham0204/SmolChat-Android) runs GGUF builds of both SmolVLM2 and SmolLM3 through a llama.cpp bridge. Community reports confirm the SmolVLM2-256M GGUF variant runs smoothly on Snapdragon 8 Gen 3 and Snapdragon 8 Elite devices.

For most phone deployments you will want the 500M or 256M SmolVLM2 variant, not the 2.2B. The 500M is the model HuggingSnap actually ships. The 2.2B is for desktop, Mac, and server.

The combined stack: a concrete recipe

The interesting workflow is not running either model in isolation. It is chaining them. SmolVLM2 extracts structured content from a video or image. SmolLM3 reasons over that extracted content in think mode and produces an answer you can act on. Nothing leaves the machine.

import torch
from transformers import (
    AutoProcessor, AutoModelForImageTextToText,
    AutoTokenizer, AutoModelForCausalLM,
)

# Step 1: vision model sees the video
vlm_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
vlm_proc = AutoProcessor.from_pretrained(vlm_id)
vlm = AutoModelForImageTextToText.from_pretrained(
    vlm_id, torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
).to("cuda")

vlm_msg = [{"role": "user", "content": [
    {"type": "video", "path": "meeting.mp4"},
    {"type": "text",  "text": "Transcribe spoken content and describe on-screen visuals with timestamps."},
]}]
vlm_inputs = vlm_proc.apply_chat_template(
    vlm_msg, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(vlm.device, dtype=torch.bfloat16)
vlm_out = vlm.generate(**vlm_inputs, do_sample=False, max_new_tokens=1024)
video_notes = vlm_proc.batch_decode(vlm_out, skip_special_tokens=True)[0]

del vlm  # free VRAM before loading the LLM
torch.cuda.empty_cache()

# Step 2: reasoning model acts on what the VLM saw
llm_id = "HuggingFaceTB/SmolLM3-3B"
tok = AutoTokenizer.from_pretrained(llm_id)
llm = AutoModelForCausalLM.from_pretrained(
    llm_id, torch_dtype=torch.bfloat16, device_map="auto",
)

task = (
    "Below are notes from a video. Extract a list of action items with owners "
    "and deadlines. If an item lacks an owner or deadline, flag it.\n\n"
    f"{video_notes}"
)
llm_inputs = tok.apply_chat_template(
    [{"role": "user", "content": task}],
    add_generation_prompt=True, return_tensors="pt", enable_thinking=True,
).to(llm.device)
llm_out = llm.generate(llm_inputs, max_new_tokens=1024)
print(tok.decode(llm_out[0][llm_inputs.shape[-1]:], skip_special_tokens=True))

Two details worth knowing. First, loading both models into VRAM at once costs roughly 5.2 GB (VLM) plus 6 GB (LLM in BF16), so an 8 GB card needs sequential loading, as shown above. On a 12 GB card you can keep both resident. Second, enable_thinking=True turns on SmolLM3's think mode for the reasoning step, which is what makes the "extract action items" task actually work well at 3B scale.
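The back-of-envelope math behind those numbers: BF16 stores 2 bytes per parameter, so weights alone cost roughly params × 2 bytes, before activations, KV cache, and framework overhead. A quick estimator (the parameter counts below are approximate):

```python
def bf16_weight_gib(params: float) -> float:
    """Approximate BF16 weight footprint in GiB (2 bytes per parameter);
    excludes activations, KV cache, and framework overhead."""
    return params * 2 / 2**30

print(f"SmolLM3-3B:    {bf16_weight_gib(3.08e9):.1f} GiB")  # ~5.7 GiB
print(f"SmolVLM2-2.2B: {bf16_weight_gib(2.2e9):.1f} GiB")   # ~4.1 GiB
```

Add a gigabyte or two of headroom per model for the KV cache and you land at the 8 GB sequential / 12 GB resident split described above.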

Deployment matrix

| Platform | SmolLM3 path | SmolVLM2 path | RAM target |
| --- | --- | --- | --- |
| Consumer GPU (RTX 3060 and up) | Transformers BF16, or Q4_K_M GGUF via llama.cpp | Transformers BF16 with flash-attn 2 | 8-12 GB VRAM |
| Apple Silicon (M1 and up) | mlx-lm with mlx-community/SmolLM3-3B | mlx-vlm with SmolVLM2-500M or 2.2B MLX build | 16 GB unified memory |
| Phone (Android / iOS) | Q4 GGUF via SmolChat-Android, or an on-device llama.cpp port | HuggingSnap (iOS, 500M) or SmolChat-Android (256M or 500M GGUF) | 4-8 GB phone RAM |

Limitations and gotchas

  • SmolVLM2 caps video input at 64 frames. At 1 FPS sampling that's about a minute of footage. Longer clips need chunking in your application code.
  • SmolLM3's 128k context is YARN extrapolation from a 64k native window. Quality holds through 64k and degrades beyond. Treat 64k as the honest ceiling.
  • 2.2B VLMs describe and OCR well but will not match GPT-4o or Gemini 2.5 Flash on fine-grained reasoning over complex scenes. Use them where privacy or offline is the point, not where you need frontier multimodal reasoning.
  • The alibayram/smollm3 Ollama tag is community-maintained. There is no official HuggingFaceTB Ollama tag at the time of writing.
  • SmolVLM2 is still anchored on SmolLM2-1.7B as its text decoder. A SmolVLM3 built on the newer reasoner would be the natural next release. It has not shipped yet.
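The 64k ceiling from the list above is worth enforcing in code rather than hoping prompts stay short. A crude pre-flight guard using a rough 4-characters-per-token heuristic (the real count comes from the tokenizer; this constant is an assumption for illustration only):

```python
def fits_context(text: str, limit_tokens: int = 64_000,
                 chars_per_token: float = 4.0) -> bool:
    """Rough pre-check: estimate token count from character length
    and compare against SmolLM3's native 64k window."""
    return len(text) / chars_per_token <= limit_tokens

print(fits_context("x" * 200_000))  # ~50k tokens -> True
print(fits_context("x" * 300_000))  # ~75k tokens -> False
```

For anything load-bearing, replace the heuristic with `len(tok(text).input_ids)` from the actual tokenizer.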

Who should use this

  • Developers building privacy-first tools: on-device assistants, meeting recorders, local note apps, accessibility utilities.
  • Indie hackers who want offline-capable mobile apps without paying per-token inference fees.
  • RAG and agent builders who need a local fallback when an API key is rate-limited, blocked, or simply unavailable in the field.
  • Researchers with limited compute who value reproducibility from fully open weights and published training details.

Sources and further reading

If you only do one thing after reading this: pull the SmolLM3-3B Q4_K_M GGUF into llama.cpp, run the think-mode prompt from the SmolLM3 section, and confirm for yourself that a top-of-class 3B reasoner runs locally on the hardware you already own. Ten minutes, no account, no key.

Benchmarks cited are Hugging Face-reported or community-reported; this article did not run independent benchmarks.
