Miso Labs - Text-to-Speech

MisoTTS

Miso Labs' MisoTTS is an 8B open voice model that clones a speaker from ten seconds of audio, runs on a 24GB GPU, and ships under a license you can build a product on. English-only and unverified on quality, but genuinely open.

License: Modified MIT
TL;DR
  • Miso Labs' MisoTTS: an 8B open-weight text-to-speech model with one-shot voice cloning from about ten seconds of audio.
  • Runs on a single 24GB GPU; ships under a Modified MIT license that allows commercial use.
  • English only, output watermarked, and quality is vendor-claimed with no independent benchmarks yet.
System Requirements
RAM: 24GB+
GPU: RTX 3090/4090 (24GB)
VRAM: 24GB (bf16)

Open text-to-speech has a quality gap and a license gap. Miso Labs went after both on June 3, 2026, with MisoTTS: an 8-billion-parameter voice model that clones a speaker from about ten seconds of audio, runs on a single 24GB GPU, and ships under a license you can actually build a product on. It is English-only and its quality claims are still unverified by anyone but Miso, but as an open, commercial-friendly voice model you can self-host, it earns a look. Here is the honest rundown.

What MisoTTS is

MisoTTS (the open-weights name for the model Miso also sells as "Miso One") generates speech from text, with an optional reference clip for one-shot voice cloning. Architecturally it follows the Sesame CSM lineage: a large Llama-3.2-style transformer backbone predicts audio over time, feeding a smaller decoder that fills in the finer detail, using the Mimi neural audio codec. It is 8B parameters total, handles up to 2,048 tokens of context, and watermarks its output by default. One honest limit up front: it speaks English only.

The license is the quiet win

Most open voice models arrive with a research-only or non-commercial license, which makes them useless for a product. MisoTTS ships under a modified MIT license that grants the full commercial rights you expect (use, modify, sell), with a single added condition: if your product crosses 50 million monthly users or 10 million dollars a month in revenue, you must display "Miso Labs" in the interface. For almost every builder, that is effectively unrestricted, and it is the most important thing about this release. Compare it to the gated weights of an image model like Ideogram 4: same "open weights" label, very different freedom.
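
To make the attribution condition concrete, here is a minimal sketch of the check as this article describes it. The function name and the reading of "crosses" as strictly greater than are assumptions for illustration; the license text itself is the authority.

```python
# Thresholds as summarized in this article, not copied from the license text.
MONTHLY_USER_THRESHOLD = 50_000_000
MONTHLY_REVENUE_THRESHOLD_USD = 10_000_000


def attribution_required(monthly_users: int, monthly_revenue_usd: float) -> bool:
    """Return True if either threshold is crossed (hypothetical helper).

    Assumes "crosses" means strictly greater than; verify against the license.
    """
    return (
        monthly_users > MONTHLY_USER_THRESHOLD
        or monthly_revenue_usd > MONTHLY_REVENUE_THRESHOLD_USD
    )


if __name__ == "__main__":
    # A small product sits far below both thresholds.
    print(attribution_required(monthly_users=20_000, monthly_revenue_usd=15_000))
```

Either condition alone triggers the requirement, which is why the check uses "or" rather than "and".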

What about quality?

This is where you should keep your skepticism. Miso published exactly one number: latency, claimed at about 110 milliseconds, faster than a human's reaction time and well under the roughly 700 milliseconds it attributes to a closed competitor. There are no MOS scores (the standard human-rated naturalness metric), no word-error-rate numbers, and no independent A/B tests against Orpheus, CSM, or Fish Audio. The architecture is sound and the lineage is respectable, but "sounds great" is, for now, a vendor claim. Test it on your own voices before you trust it.
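
Since no word-error-rate numbers exist yet, you can compute your own. The sketch below is a plain word-level edit-distance WER, assuming you have already transcribed the generated audio with a speech recognizer of your choice (that step is not shown). The lowercasing and whitespace tokenization are simplifying assumptions; serious evaluations also normalize punctuation and numerals.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                       # deletion
                    curr[j - 1] + 1,                   # insertion
                    prev[j - 1] + substitution_cost,   # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # The transcript dropped one of four reference words.
    print(word_error_rate("ship it on friday", "ship it friday"))
```

WER only captures intelligibility. Naturalness still needs human listeners, which is what MOS measures.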

Limitations and gotchas

  • English only at launch. No multilingual support.
  • Quality is unverified. The only published metric is latency; naturalness and accuracy have no independent numbers yet.
  • Output is watermarked by default (via SilentCipher), which matters if you plan to post-process the audio.
  • It models single turns, not live conversation; there is no turn-taking or full-duplex mode.
  • No documented Apple Silicon path; plan for an NVIDIA GPU.

Who should use it

Builders who need a self-hostable, commercially licensed English TTS with voice cloning, and who can run a 24GB GPU: think narration, dubbing prototypes, voice agents, and accessibility tools where sending audio to a paid API is a cost or privacy problem. If you need many languages, or you need proven naturalness scores today, look elsewhere for now. The pitch is freedom and control, with quality you verify yourself.

Run it in about 10 minutes

It needs roughly 24GB of VRAM in bf16 (an RTX 3090 or 4090 will do). The repo ships its own runner.

# Install uv, clone, and run the included demo
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/MisoLabsAI/MisoTTS.git
cd MisoTTS
uv sync --python 3.10 && source .venv/bin/activate
uv run python run_misotts.py
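
As a sanity check on the 24GB figure, here is the back-of-the-envelope arithmetic, assuming all 8B parameters are stored in bf16 at 2 bytes each. The helper name is illustrative, and the estimate covers weights only; activations, the attention cache, and the audio codec take additional memory.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    weights_gb = weight_memory_gb(8e9)   # bf16 uses 2 bytes per parameter
    headroom_gb = 24 - weights_gb        # rough space left on a 24GB card
    print(f"weights: {weights_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
```

The same arithmetic shows why full fp32 (4 bytes per parameter) would not fit on a 24GB card.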

For voice cloning, pass a short reference clip alongside your text and MisoTTS matches its register and pacing:

# Sketch of the cloning call (see the repo README for exact arguments)
from misotts import MisoTTS

tts = MisoTTS.from_pretrained("MisoLabs/MisoTTS")
audio = tts.generate(
    text="Ship it on Friday, not before the tests pass.",
    speaker_audio="reference_10s.wav",
)
audio.save("out.wav")

If you would rather not set up a GPU, there is a community demo Space on Hugging Face. Throw a tricky sentence and your own voice at it, then judge the naturalness yourself, because right now nobody else has.

Sources and further reading

Tested on: not independently tested in our environment. MisoTTS is reported to run on a single 24GB GPU in bf16; the only published quality metric is a vendor latency claim (about 110 ms), with no independent MOS or word-error-rate numbers as of this writing. English-only, with watermarked output.
Date checked: 2026-06-26
