HY-MT1.5

HY-MT1.5 is Tencent's open-source translation family covering 33 languages and 5 dialects, from a 7B server model down to a 440MB GGUF that runs on phones.

Tencent quietly opened a translation race nobody on the open-source side was equipped to enter. The HY-MT1.5 family covers everything from a 7B server model down to a 440MB GGUF that runs on a phone, and the small ones beat most commercial APIs at Chinese-foreign translation. If you build agents, RAG pipelines, or apps that touch more than one language, this family deserves a closer look this week.

What HY-MT1.5 actually is

HY-MT1.5 is Tencent Hunyuan's open-source machine-translation family. The base release on 2025-12-30 shipped two checkpoints, HY-MT1.5-7B and HY-MT1.5-1.8B, both trained through a four-stage pipeline of MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning. Both cover 33 languages and 5 ethnic or regional variants, which works out to 1,056 supported translation directions.

On 2026-04-29 Tencent extended the lineup downward with two extreme quantizations of the 1.8B model: a 2-bit GGUF at 574MB and a 1.25-bit GGUF at 440MB. Those two are the on-device variants that run in roughly 1GB of RAM, and they ship with an Android demo APK so you can try them without writing a line of code.

The full quantization family at a glance

Here is every public variant we are aware of, with the file size you actually download and the deployment niche each one targets.

| Variant | Format | Size | Approx. RAM | Best for |
|---|---|---|---|---|
| HY-MT1.5-7B | BF16 weights | 14.2GB | 16GB+ GPU | Server-side, highest ceiling |
| HY-MT1.5-7B-FP8 | FP8 weights | 7.5GB | 10GB GPU | Cost-tuned cloud serving |
| HY-MT1.5-7B-GPTQ-Int4 | Int4 weights | 4.8GB | 6GB GPU | Single-GPU laptop inference |
| HY-MT1.5-1.8B | BF16 weights | 3.3GB | 6GB GPU or 8GB RAM | Default edge-server pick |
| HY-MT1.5-1.8B-FP8 | FP8 weights | 1.9GB | 4GB GPU | Compact cloud worker |
| HY-MT1.5-1.8B-GGUF Q8_0 | GGUF 8-bit | 1.91GB | 3GB | llama.cpp/Ollama desktops |
| HY-MT1.5-1.8B-GGUF Q6_K | GGUF 6-bit | 1.47GB | 2GB | Quality-balanced local inference |
| HY-MT1.5-1.8B-GGUF Q4_K_M | GGUF 4-bit | 1.13GB | 2GB | Mid-range laptops, Raspberry Pi 5 |
| HY-MT1.5-1.8B-2bit | SEQ 2-bit GGUF | 574MB | ~1GB | Mid-tier phones, Apple Silicon |
| HY-MT1.5-1.8B-1.25bit | SEQ 1.25-bit GGUF | 440MB | ~1GB | Low-RAM phones, embedded |

Everything published lives under the tencent organization on Hugging Face, with mirrored quantizations under the AngelSlim org for the SEQ variants.

Why the 2-bit and 1.25-bit variants are interesting

Most teams quantize by clamping a normal distribution and accepting some quality loss. Tencent's AngelSlim group went a different route called Stretched Elastic Quantization, or SEQ. Weights are projected onto the four-value codebook {-1.5, -0.5, 0.5, 1.5}, which sounds aggressive because it is. The trick is pairing it with quantization-aware distillation so the smaller student matches the BF16 teacher token by token during the squeeze. The published claim is "near-lossless translation quality" against the BF16 baseline.
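
To make the codebook concrete, here is a toy Python sketch of the projection step. The per-row scale rule and nearest-neighbor rounding are our assumptions for illustration; AngelSlim's published SEQ kernel and the distillation loop around it are more involved.

import numpy as np

# Four-value SEQ codebook from the AngelSlim release.
CODEBOOK = np.array([-1.5, -0.5, 0.5, 1.5])

def quantize_row(w):
    """Map one weight row to 2-bit codebook indices plus a scale.
    The mean-ratio scale is an assumed placeholder, not Tencent's rule."""
    scale = np.abs(w).mean() / np.abs(CODEBOOK).mean()
    idx = np.abs(w[:, None] / scale - CODEBOOK[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

def dequantize_row(idx, scale):
    return CODEBOOK[idx] * scale

w = np.random.randn(8).astype(np.float32)
idx, scale = quantize_row(w)
print(w.round(3))
print(dequantize_row(idx, scale).round(3))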

Two consequences matter for builders:

  • The 1.25-bit variant fits inside a 1GB RAM budget, which is the floor for many Android handsets sold in emerging markets.
  • The same SEQ recipe maps cleanly to Arm SME2 instructions, so flagship Apple Silicon and vivo X300 hardware get a real speed bump rather than a generic Q-format fallback.

Tencent reports an average 0.18-second response for Chinese inputs of around 50 tokens on the quantized 1.8B. That is well inside the latency budget for live chat translation, AR captions, or call overlays.

Hands-on: pick a variant in five minutes

The GGUF builds run in any current llama.cpp or Ollama setup. The chat template is custom, so use the Tencent-published prompt rather than the generic instruct format.

Run it under llama.cpp

llama-cli -hf tencent/HY-MT1.5-1.8B-GGUF:Q8_0 \
  -p "Translate the following segment into Chinese, without additional explanation.\n\nIt's on the house." \
  -n 4096 --temp 0.7 --top-k 20 --top-p 0.6 --repeat-penalty 1.05 --no-warmup

Run it under Ollama

Ollama needs the right TEMPLATE because the model uses Tencent's custom delimiters. Save the two lines below as a Modelfile, then create and run the model.

FROM hf.co/tencent/HY-MT1.5-1.8B-GGUF:Q8_0
TEMPLATE """<|hy_begin_of_sentence|>{{ if .System }}{{ .System }}<|hy_place_holder_no_3|>{{ end }}{{ if .Prompt }}<|hy_User|>{{ .Prompt }}{{ end }}<|hy_Assistant|>"""

ollama create hy-mt1.5 -f Modelfile
ollama run hy-mt1.5 "Translate the following segment into German, without additional explanation.\n\nLong-tail keywords are the secret sauce of niche SEO."

Run the 2-bit on a phone

Tencent ships a prebuilt Android APK that wraps the 1.25-bit GGUF, so you do not need a native llama.cpp build to demo it. Sideload the APK from the AngelSlim release page, point the picker at the downloaded model file, and choose source and target languages. The tested handsets are a Snapdragon 865 with 8GB RAM and a Snapdragon 7+ Gen 2 with 16GB RAM; both delivered sub-second responses for short sentences.

Whichever runtime you use, keep the recommended decoding parameters, the same ones the llama.cpp invocation above passes:

{
  "top_k": 20,
  "top_p": 0.6,
  "repetition_penalty": 1.05,
  "temperature": 0.7
}
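
If you embed the GGUF in your own process instead of shelling out, the same parameters carry over. A minimal sketch with llama-cpp-python, assuming you have downloaded the Q8_0 file locally; since this is raw completion mode, we wrap the prompt in Tencent's delimiters from the Modelfile above by hand.

from llama_cpp import Llama

# Path is a placeholder for wherever you saved the GGUF.
llm = Llama(model_path="./HY-MT1.5-1.8B.Q8_0.gguf", n_ctx=4096, verbose=False)

text = "It's on the house."
# Apply the custom chat delimiters manually (see the Modelfile TEMPLATE).
prompt = (
    "<|hy_begin_of_sentence|><|hy_User|>"
    "Translate the following segment into Chinese, without additional explanation.\n\n"
    f"{text}<|hy_Assistant|>"
)
out = llm(prompt, max_tokens=256, temperature=0.7, top_k=20, top_p=0.6,
          repeat_penalty=1.05)
print(out["choices"][0]["text"])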

Prompt patterns Tencent supports out of the box

The training corpus includes four prompt shapes. Use the one that matches your task or the model will sometimes echo the source text.

  • Plain translation: Translate the following segment into {target_language}, without additional explanation.
  • Terminology-locked: include a glossary line such as Reference: "RAG" should be translated as "检索增强生成" before the source text.
  • Document-context: prepend the surrounding paragraph and tell the model not to translate the context, only the highlighted segment.
  • Format-preserving: wrap the source in <source> tags with inline <sn> markers; the model returns a <target> block with the same structural tags in place.

The format-preserving mode is the one to remember if you are localizing UI strings or HTML fragments. It removes the post-processing step where most home-grown MT pipelines lose tags.
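
As a sketch of wiring the format-preserving shape into a pipeline, here is a small helper that posts to Ollama's /api/generate endpoint against the hy-mt1.5 model created earlier. The <source> wrapper follows the description above; the exact <s1>, <s2> numbering inside the fragment is our assumption for illustration.

import requests

def translate_preserving_tags(fragment, target_language):
    # Wrap the source in <source> tags as the format-preserving mode expects;
    # the inline <sn> markers are already embedded in the fragment.
    prompt = (
        f"Translate the following segment into {target_language}, "
        "without additional explanation.\n\n"
        f"<source>{fragment}</source>"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "hy-mt1.5", "prompt": prompt, "stream": False},
        timeout=120,
    )
    return resp.json()["response"]

print(translate_preserving_tags(
    "Click <s1>here</s1> to manage your <s2>subscription</s2>.", "German"))

Per the behavior described above, the reply should be a <target> block with the <s1> and <s2> tags sitting around the translated words, which is exactly the property that saves the post-processing step.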

Benchmarks: how it stacks up

The headline number from the technical report is that HY-MT1.5-1.8B reaches roughly 90 percent of Gemini-3.0-Pro's score on the Flores-200 Chinese-foreign benchmark, the closed-model reference Tencent uses for translation quality. Against open-source competitors and commercial APIs, the same model ranks above Tower-Plus-72B, Qwen3-32B, Microsoft Translator, and Doubao Translator on the Chinese-foreign split.

Two qualifiers matter:

  • The Flores-200 split they highlight is Chinese-foreign, where Tencent has the most training signal. Expect a smaller gap on Latin script pairs.
  • The lineage matters. The previous Hunyuan-MT generation had already outperformed Google Translate in 30 of 31 evaluated language pairs at WMT25. HY-MT1.5 widens the lead while shrinking the model.

If you want the exact tables, the HY-MT1.5 technical report on arXiv is the right primary source.

The 33 languages and 5 dialects

Officially supported languages cover the major commercial corridors plus several South Asian and Southeast Asian scripts that most open MT models still drop:

Chinese, Traditional Chinese, English, French, Portuguese, Spanish, Japanese, Turkish, Russian, Arabic, Korean, Thai, Italian, German, Vietnamese, Malay, Indonesian, Filipino, Hindi, Polish, Czech, Dutch, Khmer, Burmese, Persian, Gujarati, Urdu, Telugu, Marathi, Hebrew, Bengali, Tamil, and Ukrainian.

The five dialect or regional variants extend coverage into Cantonese, Tibetan, Kazakh, Mongolian, and Uyghur (with Traditional Chinese counted in the main language list above). Cantonese in particular is rare in production MT and useful for Hong Kong, Macau, and southern Chinese audiences.

Limitations and gotchas

  • Custom chat template. The Hugging Face GGUF page documents this, but the default Ollama Modelfile produces garbage output ("onse" loops) until you replace the TEMPLATE with the snippet above.
  • Not a general LLM. The model is fine-tuned hard for translation. It will refuse or mangle reasoning prompts. Pair it with a small instruct model in your pipeline if you need both.
  • Latin pairs are good but not category-leading. Strong against commercial APIs on Chinese-foreign, more competitive than dominant on English-French or English-German. Run your own A/B against your traffic mix before swapping a vendor.
  • SEQ quantization needs Arm SME2 or a recent llama.cpp. Older mobile chips will still run the model but will not hit the advertised latency.

Who should use which variant

  • Cloud SaaS replacing a paid translation API: HY-MT1.5-7B FP8 on a single H100 or RTX 6000 Ada gives you the quality ceiling without the cost.
  • Self-hosted on a workstation or homelab: HY-MT1.5-1.8B GGUF Q6_K hits the sweet spot of quality and 2GB footprint. Runs comfortably on Apple Silicon and on a 12GB RTX 3060.
  • On-device app or embedded device: HY-MT1.5-1.8B-2bit at 574MB is the default. Drop to the 1.25-bit if you must fit under 500MB.
  • Building an agent that calls translation as a tool: the 1.8B BF16 served via vLLM keeps latency tight and concurrency high.
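
If you want that decision encoded in tooling, a trivial picker keyed on the memory budget might look like the following. The names and thresholds come from the table above; the helper itself is hypothetical.

# Hypothetical helper: largest variant that fits a given memory budget (GB).
VARIANTS = [
    (16.0, "HY-MT1.5-7B (BF16)"),
    (10.0, "HY-MT1.5-7B-FP8"),
    (6.0, "HY-MT1.5-1.8B (BF16)"),
    (3.0, "HY-MT1.5-1.8B-GGUF Q8_0"),
    (2.0, "HY-MT1.5-1.8B-GGUF Q6_K"),
    (1.0, "HY-MT1.5-1.8B-2bit (SEQ)"),
]

def pick_variant(budget_gb):
    for needed_gb, name in VARIANTS:
        if budget_gb >= needed_gb:
            return name
    return "HY-MT1.5-1.8B-1.25bit (SEQ)"

print(pick_variant(2.5))  # HY-MT1.5-1.8B-GGUF Q6_K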

What to do in the next ten minutes

  1. Pull the GGUF that fits your hardware: ollama run hf.co/tencent/HY-MT1.5-1.8B-GGUF:Q4_K_M for laptops, or grab the 2-bit GGUF for phones.
  2. Drop the custom Modelfile template above into place so output stops returning "onse" loops.
  3. Run the format-preserving prompt against one of your real localization strings and compare to your current pipeline.

If the quality gap is large enough, you have a credible path to retiring a commercial translation line item this quarter.

Tested on: vendor-reported benchmarks (Snapdragon 865 8GB, Snapdragon 7+ Gen 2 16GB, Apple M4, vivo X300) and Tencent's Flores-200 Chinese-foreign evaluation. Variants surveyed: HY-MT1.5-7B BF16/FP8/Int4, HY-MT1.5-1.8B BF16/FP8, GGUF Q4_K_M/Q6_K/Q8_0, SEQ 2-bit and 1.25-bit.
Date tested: 2026-05-02
