
MegaTrain: Train 100B+ LLMs on a Single GPU

MegaTrain is an open-source framework that trains 100B+ parameter LLMs on a single GPU by streaming weights from CPU RAM. 1.84x faster than DeepSpeed ZeRO-3, Apache 2.0.

License: Apache 2.0

TL;DR
  • Trains 120B-parameter LLMs on a single GPU at full BF16 precision
  • 1.84x faster than DeepSpeed ZeRO-3 with CPU offloading at 14B scale
  • Targets post-training: SFT, RLHF, instruction tuning, domain adaptation

System Requirements
  • RAM: 256GB
  • GPU: NVIDIA H200
  • VRAM: 80GB+

Training a 100-billion-parameter model used to mean cluster access, six-figure cloud bills, and weeks of debugging distributed systems. MegaTrain, a new open-source framework from researchers at Notre Dame and Lehigh, flips that script: it trains 120B-parameter LLMs at full precision on a single GPU by treating CPU RAM as the primary parameter store and the GPU as a transient compute engine. The code is on GitHub under Apache 2.0, and the results beat DeepSpeed ZeRO-3 by up to 6x on throughput.

What Changed

  • 120B parameters, one GPU: On a single H200 with 1.5TB host RAM, MegaTrain trains models up to 120B parameters at full BF16 precision.
  • 1.84x faster than DeepSpeed ZeRO-3 at 14B scale, scaling to 6.14x at 56 layers where ZeRO-3 crawls to 43 TFLOPS and FSDP hits OOM.
  • 512K context on a single GH200: 7B model training with half-million token sequences, no multi-node setup needed.
  • Apache 2.0, broad model support: Works with Qwen, Llama, DeepSeek, Mistral, Phi, Gemma, GLM, MoE models (Mixtral, Llama 4), and VLMs (Qwen-VL, LLaVA, InternVL).

Why This Matters for Open-Source Builders

Here is the stat that frames this entire paper: only 2 of 167 U.S. universities average more than one H100 GPU per student. If you are a grad student, indie researcher, or small startup, cluster access is a bottleneck that shapes what you can even attempt. MegaTrain turns a single-GPU workstation (roughly $35K for an H200 plus 1.5TB DDR5) into a post-training rig that handles models most teams would need an $80K-to-$200K cluster to fine-tune.

The key word is post-training. This is not about pretraining GPT-5 from scratch. MegaTrain targets the workflows where most open-source practitioners actually spend their GPU hours: supervised fine-tuning (SFT), instruction tuning, RLHF, and domain adaptation. That is exactly the work that turns a base model into something useful for your specific problem.

How It Works

Traditional training frameworks treat GPU memory as the primary store for parameters, gradients, and optimizer states. MegaTrain inverts this. Parameters and Adam optimizer states (FP32 moments m and v) live in CPU RAM. The GPU holds only what it needs right now: one layer's weights, the current activation checkpoint block, and intermediate tensors.

Double-Buffered 3-Stream Pipeline

The CPU-GPU bandwidth bottleneck is the obvious problem with offloading. MegaTrain solves it with three concurrent CUDA streams:

  • Compute stream: runs forward/backward kernels for layer i
  • H2D stream: prefetches layer i+1 weights from CPU to GPU
  • D2H stream: evacuates gradients from layer i-1 back to CPU

Event synchronization keeps everything safe: the compute stream waits for a "weights-ready" event before touching a layer, and the D2H stream waits for "backward-done" before pulling gradients. The result is a steady-state pipeline where PCIe latency hides behind computation.
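The synchronization logic described above can be sketched in plain Python (a toy model using `threading.Event` in place of CUDA events, not MegaTrain's actual code): the compute worker blocks on a per-layer "weights-ready" event set by the H2D worker, and the D2H worker blocks on "backward-done" before pulling gradients.

```python
import threading

def run_pipeline(num_layers, log):
    # One "weights-ready" and one "backward-done" event per layer,
    # standing in for CUDA events on the three streams.
    weights_ready = [threading.Event() for _ in range(num_layers)]
    backward_done = [threading.Event() for _ in range(num_layers)]

    def h2d_stream():  # prefetch weights CPU -> GPU
        for i in range(num_layers):
            log.append(("h2d", i))
            weights_ready[i].set()

    def compute_stream():  # forward/backward per layer (order simplified)
        for i in range(num_layers):
            weights_ready[i].wait()   # never touch a layer before its weights land
            log.append(("compute", i))
            backward_done[i].set()

    def d2h_stream():  # evacuate gradients GPU -> CPU
        for i in range(num_layers):
            backward_done[i].wait()   # never pull grads before backward finishes
            log.append(("d2h", i))

    threads = [threading.Thread(target=f)
               for f in (h2d_stream, compute_stream, d2h_stream)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

log = []
run_pipeline(3, log)
print(log)
```

Because the event waits enforce per-layer ordering (H2D before compute, compute before D2H) while leaving the three workers otherwise free to run ahead, transfers for other layers proceed while layer i computes, which is the overlap that hides PCIe latency.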

Stateless Layer Templates

Standard PyTorch maintains persistent autograd graphs with weight pointers for every layer. At 120B parameters, that graph metadata alone eats significant GPU memory. MegaTrain replaces this with reusable CUDA kernel templates (one for Attention, one for MLP) that have no persistent state. Before each layer executes, a "bind" operation maps the freshly-streamed parameters to the template's input slots. Two templates alternate in a ping-pong pattern, so layer i computes on template A while layer i+1 binds to template B.
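The bind-then-execute ping-pong can be illustrated with a toy template class (hypothetical names, nothing from the MegaTrain API): two stateless templates alternate, and each one has parameters mapped into its slot immediately before it runs.

```python
class LayerTemplate:
    """A reusable 'kernel template': it owns no weights; parameters
    are bound into its slot right before execution."""
    def __init__(self, name):
        self.name = name
        self.bound = None            # slot for the freshly streamed parameters

    def bind(self, params):
        self.bound = params          # map streamed weights into the template
        return self

    def forward(self, x):
        # stand-in for the real Attention/MLP kernel
        return x * self.bound["scale"]

templates = [LayerTemplate("A"), LayerTemplate("B")]
layer_params = [{"scale": s} for s in (2, 3, 5)]  # pretend per-layer weights

x = 1
for i, params in enumerate(layer_params):
    tmpl = templates[i % 2]          # ping-pong: A, B, A, B, ...
    x = tmpl.bind(params).forward(x)
print(x)  # 1 * 2 * 3 * 5 = 30
```

Only two templates ever exist regardless of model depth, which is why the per-layer autograd metadata that scales with parameter count disappears from GPU memory.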

Block-Wise Recomputation

Instead of checkpointing every layer (standard gradient checkpointing), MegaTrain stores activation checkpoints every K layers and recomputes forward activations within each block during the backward pass. This reduces peak activation memory from O(N * A_max * L) to O(N * A_max * L/K) without the per-layer overhead.
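A quick accounting sketch (illustrative arithmetic, assuming one checkpoint every K layer boundaries) shows the trade: far fewer stored checkpoints, paid for with one extra forward pass per non-checkpointed layer during backward.

```python
def stored_checkpoints(num_layers, k):
    """Checkpoints kept at boundaries 0, K, 2K, ... -> ceil(L/K) tensors."""
    return -(-num_layers // k)  # ceiling division

def recompute_overhead(num_layers, k):
    """Layer forwards redone once each during the backward pass."""
    return num_layers - stored_checkpoints(num_layers, k)

L_layers, K = 56, 4
print(stored_checkpoints(L_layers, K))   # 14 checkpoints instead of 56
print(recompute_overhead(L_layers, K))   # 42 recomputed layer forwards
```

Raising K divides stored activation memory by K (the O(N * A_max * L/K) term) while the recompute cost stays bounded by roughly one extra forward pass over the model.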

CPU-Side Optimizer

Adam updates run entirely on the CPU using AVX-512 vectorized instructions. This avoids round-tripping optimizer states to the GPU, which would double the PCIe traffic.
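For reference, the update MegaTrain keeps on the CPU is the standard Adam rule; here is a minimal pure-Python version (plain floats standing in for the AVX-512 vectorized kernels, which process many parameters per instruction):

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update, entirely on the CPU. In MegaTrain the FP32
    moments m and v never leave host RAM, halving PCIe traffic."""
    out_p, out_m, out_v = [], [], []
    for p, g, mi, vi in zip(params, grads, m, v):
        mi = b1 * mi + (1 - b1) * g        # first moment (FP32, CPU-resident)
        vi = b2 * vi + (1 - b2) * g * g    # second moment (FP32, CPU-resident)
        m_hat = mi / (1 - b1 ** t)         # bias correction
        v_hat = vi / (1 - b2 ** t)
        out_p.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        out_m.append(mi)
        out_v.append(vi)
    return out_p, out_m, out_v

p, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
p, m, v = adam_step(p, [0.5, -0.5], m, v, t=1)
print(p)  # each parameter nudged opposite its gradient
```

Since only gradients come down over PCIe and only updated BF16 weights go back up, the two FP32 moment tensors (eight bytes per parameter) stay off the bus entirely.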

Benchmarks

| Scale | MegaTrain | ZeRO-3 | FSDP | Notes |
|---|---|---|---|---|
| 14B (28 layers) | 284 TFLOPS | 154 TFLOPS | ~150 TFLOPS | 1.84x speedup |
| 14B (56 layers) | 264 TFLOPS | 43 TFLOPS | OOM | 6.14x speedup |
| 21B (84 layers) | 255 TFLOPS | OOM | OOM | MegaTrain only |
| 43B (180 layers) | 227 TFLOPS | OOM | OOM | Still >200 TFLOPS |
| 120B | Stable | OOM | OOM | H200 + 1.5TB RAM |

On commodity hardware (A100 PCIe, 80GB, 600GB host RAM), MegaTrain hits 128 TFLOPS at 7B, a 2.42x speedup over the Gemini baseline. Accuracy matches ZeRO-3 within 0.1% on the MetaMathQA benchmark (88.99% vs 88.93% at 7B, 92.52% vs 92.41% at 14B).

Long context performance is notable: throughput actually increases from 284.7 to 407.4 TFLOPS when going from 1K to 512K tokens on the GH200, because longer sequences improve compute intensity relative to transfer overhead.

Hardware Requirements

| Setup | GPU | Host RAM | Bandwidth | Max Model | Approx. Cost |
|---|---|---|---|---|---|
| Budget | RTX 3090 (24GB) | 251GB DDR4 | PCIe Gen3 | ~7B | ~$3K used |
| Mid-range | A100 PCIe (80GB) | 600GB DDR4 | PCIe Gen4 | ~27B | ~$15K |
| Sweet spot | GH200 (96GB) | 480GB LPDDR5X | NVLink-C2C (900 GB/s) | ~43B | ~$25K |
| Full scale | H200 SXM (141GB) | 1.5TB DDR5 | PCIe Gen4 | 120B | ~$35K |

An important caveat: the GH200's NVLink-C2C delivers 900 GB/s between CPU and GPU, which is roughly 7x faster than PCIe Gen4. The headline numbers come from hardware with this kind of bandwidth. On PCIe-only systems, expect lower throughput at larger model sizes.
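A back-of-the-envelope calculation (illustrative numbers, not taken from the paper) makes the caveat concrete: for transfers to hide behind compute, one layer's weights must cross the link in less time than the layer takes to process.

```python
def layer_transfer_ms(params_per_layer, bytes_per_param, bandwidth_gb_s):
    """Time to stream one layer's weights over the CPU-GPU link."""
    gigabytes = params_per_layer * bytes_per_param / 1e9
    return gigabytes / bandwidth_gb_s * 1e3

# A 14B model with 28 layers -> ~0.5B params/layer; BF16 = 2 bytes/param.
layer = 14e9 / 28
for name, bw in [("PCIe Gen4 (~25 GB/s)", 25), ("NVLink-C2C (900 GB/s)", 900)]:
    print(f"{name}: {layer_transfer_ms(layer, 2, bw):.1f} ms per layer")
```

At ~25 GB/s effective PCIe Gen4 bandwidth, each ~1GB layer takes ~40ms to stream versus ~1ms over NVLink-C2C, so on PCIe systems the pipeline only stays compute-bound when per-layer compute time is large (big batches, long sequences).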

Getting Started

MegaTrain requires Python 3.9+ and PyTorch 2.0+. Installation is straightforward:

git clone https://github.com/DLYuanGod/MegaTrain.git
cd MegaTrain
pip install -e .

Optional but recommended: install flash-attn and deepspeed for additional optimizations.

Run a supervised fine-tuning job with one of the provided configs:

# SFT on Qwen3.5-27B
python examples/sft/train.py --config examples/sft/configs/qwen3_5_27b.yaml

# RLHF via GRPO on Qwen3.5-27B
python examples/rl/train_grpo.py --config examples/rl/configs/qwen3_5_27b_grpo.yaml

Configs are YAML-based and use a LlamaFactory-compatible dataset registry. Use the provided resource calculator to choose a batch size that fits your available memory.
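The repo's resource calculator is authoritative; as a rough sanity check (assumed byte counts: BF16 weights plus two FP32 Adam moments, ignoring activations and framework overhead), host-RAM demand can be estimated like this:

```python
def host_ram_tb(num_params, weight_bytes=2, moment_bytes=4):
    """Rough host-RAM estimate: BF16 parameters plus FP32 Adam
    moments m and v, all resident in CPU RAM."""
    per_param = weight_bytes + 2 * moment_bytes  # 2 + 4 + 4 = 10 bytes
    return num_params * per_param / 1e12

print(f"{host_ram_tb(120e9):.2f} TB")
```

For a 120B model this gives ~1.2TB of CPU-resident state, consistent with the 1.5TB requirement once activation checkpoints, buffers, and framework overhead are added on top.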

Limitations and Gotchas

  • Post-training only. MegaTrain is designed for fine-tuning, SFT, RLHF, and domain adaptation. Pre-training from scratch at 100B+ scale still needs distributed systems.
  • Single GPU only. Multi-GPU support is listed as future work. If you have 2-4 GPUs, DeepSpeed or FSDP is still your go-to for now.
  • CPU RAM is the new bottleneck. A 120B model needs roughly 1.5TB of host memory for parameters plus optimizer states. That is a lot of DIMMs.
  • Early-stage repo. 26 commits, around 400 GitHub stars, no tagged releases. The code works, but this is research software, not production tooling.
  • PCIe bandwidth matters. If your system uses PCIe Gen3 or Gen4 instead of NVLink-C2C, throughput at larger model sizes will drop significantly. The GH200 numbers are best-case.
  • Decoder-only architectures. Encoder-decoder models (T5, BART) are not supported.

Who Should Use This

MegaTrain is for anyone who needs to fine-tune large models without cluster access. Grad students adapting a 70B model for their thesis domain. Small AI startups doing RLHF on a single workstation. Indie hackers who want to instruction-tune a 27B model overnight without paying for cloud GPUs. If your workflow is "take an existing open-weight model and make it good at my specific task," this is worth trying.

The paper and code are available now. Clone the repo, pick a config close to your hardware, and see how far one GPU can take you.

Sources

Paper benchmarks reported on: NVIDIA H200 SXM (141GB HBM3e, 1.5TB DDR5), GH200 (96GB HBM3, 480GB LPDDR5X), A100 PCIe (80GB HBM2e, 600GB DDR4), RTX A6000, RTX 3090. Date tested: 2026-04-06 (paper publication date). SingularityByte did not independently reproduce these benchmarks.
