

Run Whisper AI Locally: Transcribe Voice to Text in Minutes

OpenAI's Whisper runs entirely on your machine, no cloud, no API keys. This guide walks you through installing faster-whisper, picking the right model, and transcribing your first audio file in under 5 minutes.
2026-04-10

License MIT
TL;DR
  • MIT-licensed speech recognition, runs 100% offline on your hardware
  • faster-whisper gives 4x speed boost with lower VRAM via CTranslate2
  • Models from tiny (1GB VRAM) to large-v3 (10GB VRAM) for any hardware
System Requirements
  • RAM: 4 GB
  • GPU: RTX 3060
  • VRAM: 2 GB+
  • Apple Silicon ✓

Your voice recordings deserve better than uploading them to a cloud service, waiting, and hoping nobody else listens. OpenAI's Whisper is a speech recognition model that runs entirely on your machine. No API keys, no internet connection, no monthly bill. It supports 99 languages, handles accents well, and produces timestamped transcripts you can pipe into any workflow. This guide gets you from zero to transcription in under five minutes using faster-whisper, the fastest local backend available.

What Is Whisper (and Why Run It Locally?)

Whisper is an automatic speech recognition (ASR) model that OpenAI open-sourced in September 2022 under the MIT license. It was trained on 680,000 hours of multilingual audio scraped from the web, which makes it unusually robust against background noise, accents, and domain-specific vocabulary.

You can hit the Whisper API through OpenAI's cloud, but running it locally has real advantages. Privacy is the obvious one: medical dictations, legal depositions, personal voice journals, and internal meeting recordings should not leave your network. Beyond privacy, local inference means zero API costs, no rate limits, and it works on a plane. If you have a GPU (or even a decent CPU), you already own the hardware.

Pick Your Backend

There are three main ways to run Whisper on your own machine. Each makes different trade-offs between speed, ease of setup, and hardware support.

Backend        | Speed         | Install             | GPU Support | Best For
openai/whisper | 1x (baseline) | pip install         | CUDA        | Simplicity, reference impl
faster-whisper | 4x faster     | pip install         | CUDA / CPU  | Most users (recommended)
whisper.cpp    | 3-5x faster   | Compile from source | CPU / Metal | Apple Silicon, no Python

We recommend faster-whisper for most readers. It uses CTranslate2 under the hood, which means 4x faster inference and significantly lower memory usage compared to the original OpenAI implementation. It installs with a single pip command and works on both GPU and CPU. If you are on a Mac and want native Metal acceleration without Python, check out whisper.cpp instead.

Pick Your Model

Whisper ships in six sizes. Bigger models are more accurate but slower and hungrier for VRAM. Here is the full lineup:

Model          | Parameters | VRAM   | Relative Speed       | Best For
tiny           | 39M        | ~1 GB  | Fastest              | Quick drafts, real-time notes
base           | 74M        | ~1 GB  | Fast                 | Casual transcription
small          | 244M       | ~2 GB  | Moderate             | Good accuracy/speed balance
medium         | 769M       | ~5 GB  | Slower               | Professional transcription
large-v3       | 1.5B       | ~10 GB | Slowest              | Maximum accuracy
large-v3-turbo | 809M       | ~6 GB  | 4x faster than large | Speed + accuracy sweet spot

Our pick: Use large-v3-turbo if you have a GPU with 6+ GB VRAM. It matches large-v3 accuracy in most scenarios at a fraction of the compute. On CPU-only machines, small gives the best balance between quality and wait time.
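If you want to encode that advice in a script, the table condenses into a small heuristic. This helper is our own sketch based on the VRAM figures above, not part of faster-whisper:

```python
def pick_model(vram_gb=None, has_gpu=False):
    """Rough model choice based on the VRAM table above (our heuristic)."""
    if not has_gpu or vram_gb is None:
        return "small"           # best quality/wait-time balance on CPU
    if vram_gb >= 6:
        return "large-v3-turbo"  # near large-v3 accuracy at a fraction of the compute
    if vram_gb >= 5:
        return "medium"
    if vram_gb >= 2:
        return "small"
    return "base"
```

Adjust the thresholds for your own hardware; they track the ~VRAM column, not hard limits.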

Install faster-whisper (Step by Step)

This works on Linux, macOS, and Windows. You need Python 3.9 or newer.

1. Create a virtual environment

python3 -m venv whisper-env
source whisper-env/bin/activate    # Linux/macOS
# whisper-env\Scripts\activate     # Windows

2. Install faster-whisper

pip install faster-whisper

That is it for CPU users. The package pulls in CTranslate2 automatically.

3. GPU users: verify CUDA

faster-whisper requires CUDA 12 and cuDNN 9 for GPU acceleration. Check your setup:

python3 -c "import ctranslate2; print(ctranslate2.get_cuda_device_count())"

If this prints 0 but you have an NVIDIA GPU, you likely need to install or update your CUDA toolkit. Check the faster-whisper GPU docs for details.
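`WhisperModel(..., device="auto")` already picks a device for you, but if you want to branch explicitly — for example, to choose a CPU-friendly compute type at the same time — here is a small sketch. The helper is our own, not part of the library:

```python
def detect_device():
    """Return a (device, compute_type) pair based on CUDA availability."""
    try:
        import ctranslate2
        gpu_count = ctranslate2.get_cuda_device_count()
    except ImportError:
        gpu_count = 0  # ctranslate2 not installed yet; assume CPU-only

    if gpu_count > 0:
        return "cuda", "float16"
    return "cpu", "int8"  # int8 keeps CPU inference fast and light

device, compute_type = detect_device()
print(f"device={device}, compute_type={compute_type}")
```

Pass the pair straight into the constructor: `WhisperModel("small", device=device, compute_type=compute_type)`.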

Transcribe Your First File

Here is the minimal script. Save it as transcribe.py and point it at any audio file (WAV, MP3, M4A, FLAC, OGG all work).

import sys
from faster_whisper import WhisperModel

# Accept a filename on the command line, defaulting to recording.mp3
audio_file = sys.argv[1] if len(sys.argv) > 1 else "recording.mp3"

model = WhisperModel("large-v3-turbo", device="auto", compute_type="float16")
segments, info = model.transcribe(audio_file, beam_size=5)

print(f"Detected language: {info.language} ({info.language_probability:.0%})")
for segment in segments:
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s] {segment.text}")

Run it:

python3 transcribe.py

The first run downloads the model weights (about 1.5 GB for large-v3-turbo). Subsequent runs load from cache.

Save to SRT (subtitle format)

If you want subtitles instead of plain text, this version outputs a standard SRT file:

from faster_whisper import WhisperModel

def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

model = WhisperModel("large-v3-turbo", device="auto", compute_type="float16")
segments, info = model.transcribe("recording.mp3", beam_size=5)

with open("output.srt", "w") as f:
    for i, seg in enumerate(segments, 1):
        f.write(f"{i}\n")
        f.write(f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n")
        f.write(f"{seg.text.strip()}\n\n")

print("Saved to output.srt")

Transcribe from the Command Line

If you prefer a one-liner over writing Python, the original openai/whisper package includes a CLI tool:

pip install openai-whisper
whisper recording.mp3 --model large-v3-turbo --output_format srt

This uses the original (slower) backend, but the convenience is hard to beat for quick jobs. For batch processing with faster-whisper, wrap the Python script in a simple loop:

for f in *.mp3; do
    python3 transcribe.py "$f"
done

Tips and Gotchas

Audio format: Whisper accepts WAV, MP3, M4A, FLAC, and OGG. For best results, use 16 kHz mono audio. Most recordings work fine without conversion, but if you get odd results, try converting first with ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav.
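If you convert often, that ffmpeg one-liner is easy to wrap from Python. A hypothetical helper (the command mirrors the flags above; requires ffmpeg on your PATH):

```python
import subprocess

def ffmpeg_cmd(src, dst):
    """Build the ffmpeg command for a 16 kHz mono WAV conversion."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]

def to_whisper_wav(src, dst):
    """Convert src to Whisper-friendly 16 kHz mono WAV at dst."""
    subprocess.run(ffmpeg_cmd(src, dst), check=True)
    return dst
```

For example, `to_whisper_wav("input.mp3", "output.wav")` before calling `model.transcribe("output.wav")`.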

Force the language: Whisper auto-detects the language, but you can force it for better accuracy on short clips: model.transcribe("file.mp3", language="en").

Low VRAM? Use INT8: If your GPU runs out of memory, switch to INT8 quantization. Change compute_type="float16" to compute_type="int8". This roughly halves VRAM usage with minimal accuracy loss.

CPU-only is fine: On a modern CPU (Intel i7/Ryzen 7 or better), the small model transcribes at roughly 2x real-time speed. A 10-minute recording takes about 5 minutes. Not instant, but perfectly usable.

Common error: CUDA out of memory means your model is too large for your GPU. Drop to a smaller model or switch to compute_type="int8" before giving up.
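The float16-to-int8 retry can be automated. A sketch of our own fallback pattern — the loader is injected so the logic is testable without a GPU; in practice you would pass `WhisperModel`, and the caught exception types are an assumption (catch whatever your stack raises on out-of-memory):

```python
def load_with_fallback(loader, model_size="large-v3-turbo"):
    """Try GPU float16 first; on failure retry with int8 quantization.

    `loader` is a callable with WhisperModel's signature, injected so
    the fallback logic can be exercised without a GPU."""
    try:
        return loader(model_size, device="cuda", compute_type="float16")
    except (RuntimeError, ValueError):
        # VRAM too small for float16: retry with int8 (roughly half the memory)
        return loader(model_size, device="cuda", compute_type="int8")

# Usage: model = load_with_fallback(WhisperModel)
```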

What You Can Build With This

Transcription is a building block. Here are four things you can wire up in an afternoon:

Meeting notes pipeline: Record your meetings, transcribe with Whisper, then feed the text into a local LLM (Ollama + Mistral Small) to generate summaries and action items. Fully offline, fully private.

Podcast search index: Transcribe your podcast backlog, index the text with a simple full-text search, and find that one guest quote from episode 47 in seconds.

Subtitle generation: The SRT output script above is a complete subtitle pipeline. Drop the SRT file into any video editor or player.

Voice journal: Record a quick voice memo each morning, auto-transcribe it with a cron job, and append to a daily text log. Search your own thoughts.

You now have a private, offline transcription pipeline running on your own hardware. Try it on your next meeting recording or that pile of interview audio you have been meaning to process. If faster-whisper handles your workload, you never need to send audio to the cloud again.
