

Arcee AI Trinity-Large-Thinking

Arcee AI Trinity-Large-Thinking is a 398B Apache 2.0 sparse MoE with 13B active parameters, chain-of-thought reasoning, 512k context, and PinchBench scores just behind Claude Opus 4.6.

License: Apache 2.0

TL;DR
  • 398B sparse MoE with 256 experts, only 13B active per token. Apache 2.0 open weights. Runs 2-3x faster than comparably-sized dense models.
  • PinchBench 91.9% (#2 behind Opus 4.6), AIME25 96.3%, LiveCodeBench 98.2%. Designed for long-horizon agents and multi-turn tool calling.
  • Full Trinity family: Large (398B/13B), Mini (26B/3B), Nano (6B/1B). All Apache 2.0 with reasoning via <think> blocks.

System Requirements
  • RAM: 32GB (Q4 GGUF)
  • GPU: multi-GPU cluster or API
  • VRAM: 800GB BF16 / 26GB+ Q4
  • Ollama: supported

Arcee AI just shipped Trinity-Large-Thinking, a 398-billion-parameter open-weight reasoning model that punches into the same tier as Claude Opus 4.6 and GPT-5 on agentic benchmarks. The trick: it only fires 13 billion parameters per token, thanks to a sparse Mixture-of-Experts architecture with 256 experts and 4 active at inference. It is Apache 2.0 licensed, ships with 512k context, and was trained on 17 trillion tokens. For open-source builders who need strong tool calling, multi-turn reasoning, and long-horizon agent loops, this is the most interesting U.S.-made open model to land in months.

What Arcee Built

Trinity-Large-Thinking is a sparse MoE transformer with 398 billion total parameters. At inference, only 13 billion parameters activate per token thanks to sigmoid routing across 256 experts (4 selected per forward pass). That keeps compute costs dramatically lower than a dense model of similar total size.
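The top-k selection step can be sketched in a few lines. This is a toy illustration of sigmoid-gated routing under my own assumptions, not Arcee's actual router (which is not public beyond the "sigmoid routing, 4 of 256 experts" description):

```python
import math

def route_token(scores, k=4):
    """Toy sparse-MoE router: sigmoid-gate each expert's raw affinity
    score, pick the top-k experts, and normalize their gates into
    mixing weights. Illustrative only."""
    gates = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    # Select the k experts with the highest gate values.
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    total = sum(gates[i] for i in top)
    return {i: gates[i] / total for i in top}  # index -> weight, sums to 1.0

# 256 experts, 4 active per token, as in Trinity-Large.
scores = [((i * 37) % 256) / 128.0 - 1.0 for i in range(256)]  # fake logits
weights = route_token(scores, k=4)
```

Only the 4 selected experts run a forward pass for that token, which is why per-token compute tracks the 13B active parameters rather than the 398B total.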

The architecture mixes interleaved local and global attention layers with gated mechanisms and Grouped Query Attention (GQA). Normalization uses a depth-scaled sandwich norm, which stabilizes training at this parameter count. Context length reaches 512k tokens, extended specifically to support the long reasoning chains the thinking mode produces.

Pre-training ran on 2,048 NVIDIA B300 GPUs across 17 trillion tokens. Post-training used agentic reinforcement learning on 1,152 H100s, focused on multi-step tool use, code generation, and structured reasoning. The full pipeline combines massive scale pre-training with targeted RL that teaches the model how to think through complex tasks before answering.

How the Thinking Mode Works

Trinity-Large-Thinking generates explicit reasoning traces wrapped in <think>...</think> tags before producing its final answer. This is similar to the approach used by DeepSeek-R1 and QwQ, but Arcee trained the thinking behavior through agentic RL rather than pure supervised distillation.

In practice, you send a prompt and the model first writes out its chain-of-thought inside the think tags, then delivers the response. For multi-turn conversations, you must preserve the reasoning traces in the conversation history. Stripping them between turns degrades performance significantly.

Arcee recommends a temperature of 0.3 for best results. Higher temperatures introduce noise into the reasoning chain; lower ones can make the model repetitive. The 512k context window gives the model room to produce long reasoning traces without running out of space, which matters for complex agentic tasks that require dozens of tool calls.
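A common pattern is to show users only the final answer while feeding the full completion, trace included, back into the history. A minimal sketch, assuming a single well-formed think block per completion:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(completion):
    """Separate the reasoning trace from the final answer for display.
    Assumes one <think>...</think> block, as Trinity emits."""
    m = THINK_RE.search(completion)
    trace = m.group(1).strip() if m else ""
    answer = THINK_RE.sub("", completion).strip()
    return trace, answer

def append_turn(history, user_msg, completion):
    """Store the FULL completion (trace included) in history, since
    stripping traces between turns degrades performance."""
    history.append({"role": "user", "content": user_msg})
    history.append({"role": "assistant", "content": completion})
    return history

trace, answer = split_thinking("<think>2+2 is 4.</think>The answer is 4.")
```

Keep the split purely presentational: what the model sees on the next turn should be the unmodified completion.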

Benchmarks That Matter

Trinity-Large-Thinking posts strong numbers on agentic and coding benchmarks. Here is how it stacks up:

Benchmark          | Trinity-Large-Thinking | Notes
PinchBench         | 91.9%                  | #2 overall, behind Claude Opus 4.6
tau2-Bench         | 94.7%                  | Multi-step agent evaluation
LiveCodeBench      | 98.2%                  | Real-time coding tasks
AIME25             | 96.3%                  | Math competition problems
SWE-bench Verified | 63.2%                  | Real GitHub issue resolution
MMLU (base)        | 82.58%                 | General knowledge (base model)
BBH (base)         | 65.70%                 | Big-Bench Hard (base model)
MBPP+ (base)       | 88.62%                 | Python coding (base model)

The agentic scores (PinchBench, tau2-Bench) are where Trinity shines. LiveCodeBench at 98.2% and AIME25 at 96.3% are both top-tier. SWE-bench Verified at 63.2% lags behind the best closed models, and the base model MMLU of 82.58% is solid but not class-leading. More on those gaps in the limitations section.

The Trinity Family

Arcee released three models in the Trinity family, all Apache 2.0. Each uses the same MoE architecture at different scales:

Model         | Total Params | Active Params | Target Use Case
Trinity-Large | 398B         | 13B           | Agentic workflows, tool-calling, complex reasoning
Trinity-Mini  | 26B          | 3B            | Edge deployment, mobile, fast inference
Trinity-Nano  | 6B           | 1B            | On-device, embedded, IoT

The Mini and Nano variants make the architecture accessible at smaller scales. If you are building an agent pipeline where latency matters more than peak accuracy, Trinity-Mini at 3B active parameters is worth evaluating against Qwen3-4B and Gemma 4 2B.

Who Built This

Arcee AI is a Miami-based startup founded in 2023. CEO Mark McQuade and CTO Jacob Solawetz lead the team. They have raised $29.5 million, with Emergence Capital as lead investor. Their partner list includes NVIDIA, Intel, AWS, Microsoft, and Hugging Face.

Arcee previously focused on enterprise model merging and fine-tuning tools. Trinity marks their shift into building frontier-class base models from scratch. The fact that a sub-100-person startup produced a model competing with Anthropic and OpenAI on agentic benchmarks is notable. It also makes Trinity one of the few top-performing open models originating from a U.S. company rather than Chinese labs like DeepSeek or Qwen.

Get It Running

Trinity-Large-Thinking is available through several channels. Here are your options:

Hugging Face (full weights):

pip install transformers
huggingface-cli download arcee-ai/Trinity-Large-Thinking

Full BF16 weights require around 800GB of VRAM. This is not consumer hardware territory.
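The ~800GB figure falls straight out of the arithmetic, since BF16 stores 2 bytes per parameter. A quick back-of-the-envelope check (weights only; KV cache and activations add more on top):

```python
def weight_memory_gb(total_params_billion, bytes_per_param):
    """Rough weight-storage estimate: parameter count times bytes per
    parameter. Ignores KV cache and activation memory."""
    return total_params_billion * bytes_per_param

bf16_gb = weight_memory_gb(398, 2)  # BF16 = 2 bytes/param
```

398 billion parameters at 2 bytes each is 796GB of raw weights, which rounds to the quoted 800GB once runtime overhead is included.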

GGUF quantized (for llama.cpp / Ollama):

# Requires llama.cpp b7061 or newer
ollama run arcee-ai/Trinity-Large-Thinking-GGUF

Q4 quantized variants reduce VRAM requirements significantly. Check the GGUF repository on Hugging Face for available quant levels.

OpenRouter API:

curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "arcee-ai/trinity-large-thinking",
    "temperature": 0.3,
    "messages": [{"role": "user", "content": "Your prompt here"}]
  }'

OpenRouter pricing is $0.90 per million output tokens. That is competitive for a model at this performance level.
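The same call works from Python with only the standard library, assuming OpenRouter's usual OpenAI-compatible request schema; `build_request` and `send` are hypothetical convenience helpers, not an official client:

```python
import json
import urllib.request

API_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(prompt, temperature=0.3):
    """Build the chat payload, defaulting to Arcee's recommended
    temperature of 0.3."""
    return {
        "model": "arcee-ai/trinity-large-thinking",
        "temperature": temperature,
        "messages": [{"role": "user", "content": prompt}],
    }

def send(payload, api_key):
    """POST the payload; requires a valid OpenRouter API key."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_request("Your prompt here")
```

Remember that the completion will contain a <think> block you may want to strip before display but keep in the conversation history.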

vLLM (self-hosted inference): vLLM supports Trinity-Large-Thinking with reasoning-parser support for the <think> tags. You will need multi-GPU setups with sufficient aggregate VRAM.

DigitalOcean: Also available on DigitalOcean's Agentic Inference Cloud for managed deployment.

What to Watch Out For

Trinity-Large-Thinking has clear limitations you should know before committing to it.

VRAM requirements are brutal. Full BF16 inference needs roughly 800GB of VRAM. That means multi-node GPU setups or cloud instances with 8+ A100/H100 cards. GGUF quantization helps, but you are still looking at substantial hardware for the Large variant. If you need something local, look at Trinity-Mini or Trinity-Nano.

SWE-bench Verified at 63.2% trails the leaders. Claude Opus 4.6 and other top models score higher on real-world software engineering tasks. If your primary use case is autonomous code fixes on production repos, Trinity is competitive but not best-in-class on this specific benchmark.

Base model knowledge scores are good, not great. MMLU at 82.58% and BBH at 65.70% are respectable but below what you would expect from a 398B-parameter model. The training budget was clearly optimized for agentic and reasoning performance rather than raw knowledge recall. That tradeoff makes sense for the target use case, but keep it in mind for general-purpose Q&A workloads.

Reasoning traces consume context. The <think> traces can be long, and you must keep them in conversation history for multi-turn coherence. On complex agent chains, you may burn through the 512k context faster than expected.
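A crude way to watch the budget is a character-based token estimate (roughly 4 characters per token for English text; use the model's real tokenizer for anything precise). This is my own illustrative heuristic, not an Arcee utility:

```python
def estimate_tokens(messages, chars_per_token=4):
    """Crude token estimate from character counts (~4 chars/token for
    English). For precise numbers, use the model's tokenizer."""
    return sum(len(m["content"]) for m in messages) // chars_per_token

def context_headroom(messages, limit=512_000):
    """Remaining context budget under the 512k window."""
    return limit - estimate_tokens(messages)

history = [
    {"role": "user", "content": "Plan the refactor."},
    {"role": "assistant", "content": "<think>" + "step " * 2000 + "</think>Done."},
]
```

Checking headroom before each turn lets an agent loop bail out or summarize before it silently truncates its own reasoning.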

Who Should Use Trinity

Trinity-Large-Thinking is built for a specific profile: developers and teams building multi-step agentic systems that require strong reasoning, tool calling, and code generation. If you are wiring up agent loops with function calling, structured outputs, and multi-turn chains, the agentic benchmark scores suggest Trinity competes at the frontier level.

Specific use cases where it fits well:

  • Agentic coding assistants that need to plan, execute, and iterate across multiple tool calls
  • Long-horizon task completion where the 512k context and thinking traces help maintain coherence
  • Open-source-only deployments where Apache 2.0 licensing is a hard requirement
  • Research teams studying chain-of-thought reasoning with full access to model weights

If you just need a general chat model or a small local assistant, Trinity-Mini or Trinity-Nano are better fits. For the Large variant, plan for cloud-scale infrastructure or use the OpenRouter API.

Pull the GGUF from Hugging Face, spin up Ollama, and test a multi-turn tool-calling chain against your current model. That is the fastest way to see if Trinity earns a spot in your stack.
