Heretic - Model Fine-Tuning

Heretic: One-Command Abliteration

Heretic by p-e-w turns abliteration into a one-command, ~45-minute job on an RTX 3090. Open-source uncensoring without manual layer tuning.

License Other
TL;DR
  • Automated abliteration via Optuna TPE: refusal count and KL divergence co-optimization replace manual layer tuning.
  • pip install -U heretic-llm, then one CLI command on any HuggingFace model ID. ~45 min on RTX 3090 for 8B-class models.
  • Outperforms hand-tuned abliterations on Gemma-3-12B: 0.16 KL vs 1.04 KL for the best human-tuned variant.
System Requirements
RAM: 32 GB
GPU: RTX 3090 / 4090
VRAM: 24 GB

Abliteration used to be a research move that took afternoon-long PyTorch sessions, hand-picked layers, and a stomach for weight edits you might never undo. Heretic compresses that into one command. pip install -U heretic-llm, then heretic Qwen/Qwen3-4B-Instruct-2507, and roughly 45 minutes later you have an uncensored variant. On the repo's Gemma-3-12B benchmark, the damage to the base model is smaller than in the best hand-tuned abliteration. There are already more than a thousand community Heretic variants on HuggingFace, and the wait between a fresh open-weights release and its decensored twin is collapsing toward zero.

Why abliteration suddenly matters to open-source builders

For anyone running local models, safety alignment is a tax. It pays for itself when you ship a consumer chatbot, but it gets in the way of agent loops, red-team and security research, fiction and other creative writing, and dozens of legitimate uses that hit a wall the moment the model decides to apologize. Until recently, the only ways around it were a curated jailbreak prompt or a full fine-tune.

Heretic kills the wait. The day a new base model lands on HuggingFace, somebody runs heretic against it, pushes the result, and within hours bartowski and mradermacher have GGUFs out for every common quantization. That is what the 1,000+ Heretic uploads on HuggingFace actually represent: an end-to-end pipeline where any aligned open model is one command and a few hours of community labor away from being usable for whatever you actually want to do with it.

What abliteration actually is

Abliteration is a surgical edit to a model's weights, not a fine-tune. It rests on a result from Arditi et al. (NeurIPS 2024), which showed that refusal in transformer language models is mediated by a single direction in the residual stream. The finding holds across 13 open-weights chat models, all the way up to 72B parameters, with each model having its own refusal direction. Project that direction out of the activations, or subtract it from the weight matrices that write into the residual stream, and the model stops refusing while keeping most of its other skills.
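The projection step can be sketched in a few lines of plain Python. This is a toy illustration, not Heretic's code: it assumes the refusal direction is estimated as the normalized difference between mean activations on harmful and harmless prompts (the approach described in the Arditi paper), and the function names and tiny vectors are made up for the example.

```python
import math

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along mean(harmful) - mean(harmless).

    Each argument is a list of activation vectors (lists of floats)
    collected at one layer. Difference-of-means is an assumption here.
    """
    dim = len(harmful_acts[0])
    mean_bad = [sum(a[i] for a in harmful_acts) / len(harmful_acts) for i in range(dim)]
    mean_ok = [sum(a[i] for a in harmless_acts) / len(harmless_acts) for i in range(dim)]
    diff = [b - o for b, o in zip(mean_bad, mean_ok)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def ablate_activation(x, r_hat):
    """Remove the component of activation x along the unit vector r_hat."""
    dot = sum(a * b for a, b in zip(x, r_hat))
    return [a - dot * b for a, b in zip(x, r_hat)]

def orthogonalize_weights(W, r_hat):
    """Return W' = (I - r r^T) W.

    W is a list of rows whose row index is the residual-stream (output)
    dimension, so W' can no longer write anything along r_hat.
    """
    n_out, n_in = len(W), len(W[0])
    # proj[j] is the j-th entry of r^T W
    proj = [sum(r_hat[i] * W[i][j] for i in range(n_out)) for j in range(n_in)]
    return [[W[i][j] - r_hat[i] * proj[j] for j in range(n_in)] for i in range(n_out)]
```

Editing the weights once has the same effect as projecting the activations at every forward pass, which is why the result can be saved and shared as an ordinary checkpoint.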

The technique reached a wider audience through Maxime Labonne's HuggingFace blog post, which turned the Arditi paper into a recipe anyone with a GPU could follow. The recipe worked, but it was finicky. You had to pick which layers to read activations from, decide how strongly to project, and eyeball whether the resulting model was still coherent. Get the layer wrong and the model talked freely but lost its math. Project too weakly and it still refused.

How Heretic differs from the manual approach

Heretic treats abliteration as an optimization problem. It uses Optuna's Tree-structured Parzen Estimator to search the joint space of layer ranges, ablation weights, and direction indices, co-minimizing two objectives at every trial: the count of refusals on a harmful-prompt probe set, and the KL divergence between the patched model and the original. The model that comes out the other end is the one that refuses least while drifting least.
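To make the two objectives concrete, here is a minimal sketch of how each could be scored and how non-dominated trials could be picked out. It does not reproduce Optuna's sampler or Heretic's implementation: the refusal marker list, the direction of the KL term, and the (refusals, kl, label) trial tuples are all assumptions made for illustration.

```python
import math

# Hypothetical marker list; a real probe would use a broader set.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")

def count_refusals(responses, markers=REFUSAL_MARKERS):
    """Count responses that contain at least one refusal marker."""
    return sum(any(m in r.lower() for m in markers) for r in responses)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary.

    Here p stands for the original model's next-token probabilities and
    q for the patched model's; that ordering is an assumption.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def pareto_front(trials):
    """Keep the trials that no other trial beats on both objectives.

    Each trial is a (refusals, kl, label) tuple; lower is better for both.
    """
    front = []
    for t in trials:
        dominated = any(
            o[0] <= t[0] and o[1] <= t[1] and (o[0] < t[0] or o[1] < t[1])
            for o in trials
        )
        if not dominated:
            front.append(t)
    return front
```

A trial that refuses as often as another but drifts further from the base model is dominated and dropped. What remains is the set of trade-offs between refusing less and drifting less.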

The numbers from the repo are striking. A search that takes 30 to 90 minutes on a single RTX 3090 now outperforms expert tuning that took days.

Approach (Gemma-3-12B-IT)     | KL divergence vs base | Refusals / 100 | Human effort
Heretic, default config       | 0.16                  | 3              | One CLI command
Best hand-tuned abliteration  | 1.04                  | ~3             | Days of expert tuning

Same refusal rate, roughly 6.5 times less damage to the base model's behavior, and the cost of producing the Heretic row is one terminal command on a consumer GPU.

Heretic builds on extensions to the original technique, including Lai 2025's projected abliteration and norm-preserving biprojected abliteration. The author, p-e-w (Philipp Emanuel Weidmann), keeps the codebase under AGPL-3.0 and ships sensible defaults so the typical run is a single argument.

Hands-on: from pip install to your own variant

Install

pip install -U heretic-llm

Heretic needs Python 3.10 or newer, PyTorch 2.2+ (2.6 recommended), and a CUDA GPU. An RTX 3090 with 24 GB of VRAM is the sweet spot for 8B to 12B models. Hybrid models like Qwen3.5 work, multimodal models work, and most MoE architectures work. Pure state-space models and a handful of research architectures are still on the to-do list.
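A quick pre-flight check for those requirements might look like the sketch below. The version thresholds come from the paragraph above; the helper names are invented for this example and are not part of Heretic.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)
MIN_TORCH = (2, 2)

def parse_version(text):
    """Turn a version string such as '2.6.0+cu124' into (2, 6)."""
    parts = text.split("+")[0].split(".")
    return tuple(int(p) for p in parts[:2])

def check_environment():
    """Return a list of human-readable problems; empty means good to go."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.10 or newer is required")
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
    else:
        import torch
        if parse_version(torch.__version__) < MIN_TORCH:
            problems.append("PyTorch 2.2 or newer is required")
        if not torch.cuda.is_available():
            problems.append("No CUDA GPU is visible to PyTorch")
    return problems

if __name__ == "__main__":
    for problem in check_environment():
        print("warning:", problem)
```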

Run

heretic Qwen/Qwen3-4B-Instruct-2507

That is the whole interface. Heretic streams the model, benchmarks batch sizes on your GPU, runs the optimizer for a few dozen trials, prints the final KL and refusal numbers, and writes the patched weights to disk. If you want to bias the search toward retaining more or refusing less, the config.default.toml and config.noslop.toml presets in the repo are good starting points.

Quantize and share

Patched weights are large. The community convention is to upload the full-precision Heretic variant to your HuggingFace account, then wait a day or two for bartowski or mradermacher to publish imatrix GGUFs in the usual Q4_K_M, Q5_K_M, and IQ4_XS flavors. If you cannot wait, the llama.cpp conversion and quantization tools handle the same job locally.
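For the do-it-yourself route, the sketch below assembles the llama.cpp commands without running them. It assumes a local llama.cpp checkout containing convert_hf_to_gguf.py and a llama-quantize binary on your PATH. The model directory name is hypothetical, and the imatrix calibration step that the community quantizers perform is left out.

```python
import shlex
from pathlib import Path

QUANT_TYPES = ("Q4_K_M", "Q5_K_M", "IQ4_XS")

def gguf_commands(model_dir, out_dir, llama_cpp_dir="llama.cpp"):
    """Build the convert command, then one quantize command per type."""
    model_dir, out_dir = Path(model_dir), Path(out_dir)
    f16 = out_dir / f"{model_dir.name}-f16.gguf"
    # Step 1: convert the HuggingFace checkpoint to a 16-bit GGUF file.
    commands = [[
        "python", str(Path(llama_cpp_dir) / "convert_hf_to_gguf.py"),
        str(model_dir), "--outfile", str(f16), "--outtype", "f16",
    ]]
    # Step 2: quantize the 16-bit file once per target type.
    for quant in QUANT_TYPES:
        out = out_dir / f"{model_dir.name}-{quant}.gguf"
        commands.append(["llama-quantize", str(f16), str(out), quant])
    return commands

# "Qwen3-4B-heretic" is a placeholder directory name.
for cmd in gguf_commands("Qwen3-4B-heretic", "gguf"):
    print(shlex.join(cmd))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; pass each list to subprocess.run once the paths match your setup.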

Names to know in the scene

p-e-w is the author of Heretic and the maintainer of the heretic-org HuggingFace organization. The repo is the canonical reference; the org curates clean Heretic variants for popular base models.

mlabonne popularized abliteration in the first place and continues to ship merges, fine-tunes, and educational material. His blog is the recommended entry point for anyone who wants to understand what the projection step is doing.

huihui-ai runs the most prolific uncensoring operation on HuggingFace, with 200+ variants spanning Qwen, DeepSeek, Llama, Gemma, and most major open families. If a base model dropped last week, huihui-ai almost certainly has an abliterated version of it already.

DavidAU takes things further. His "Dark Champion" line is a series of MoE merges that combine abliterated experts with creative-writing fine-tunes, and the resulting models have a cult following among fiction writers and roleplayers running local stacks. The Llama-3.2-8X3B Dark Champion is a representative entry.

bartowski and mradermacher are the GGUF quant-masters. Between them they keep current quantizations available for essentially every model the local-LLM crowd cares about. They are unpaid infrastructure for the entire ecosystem, and if you run Ollama, LM Studio, or llama.cpp you have almost certainly downloaded one of their files.

Limitations, ethics, and the AGPL question

Abliteration removes the refusal behavior. It does not remove the training distribution. A Heretic variant of a model that learned bad chemistry from filtered web text still does not know good chemistry. It will answer; the answer can be wrong. The same goes for medical advice, legal advice, and any other domain where the base model was already weak. Treat the output the same way you would treat any local model: as a draft from a confident-sounding intern.

The legal frame matters too. Heretic itself is AGPL-3.0, which means anyone who runs a modified version of the tool as a network service has to offer that version's source to its users. The model weights Heretic produces are governed by the upstream license of whatever base model you patched. A Heretic variant of Llama-3 still inherits the Llama license. A Heretic variant of Qwen is still Qwen-licensed. Read those licenses before redistributing anything.

Sources and further reading

Benchmarks community-reported from p-e-w's repo and HuggingFace variant cards. Not independently verified. Compiled 2026-05-19.
