
State of Open-Source AI Spring 2026

Hugging Face Spring 2026 report: Chinese models pull 41% of downloads, robotics datasets up 23x, Alibaba beats Google and Meta combined on derivatives.
2026-04-11

Hugging Face just published its State of Open Source AI: Spring 2026 report, and the headline number is hard to argue with. Chinese models now pull 41 percent of all downloads on the platform, China has overtaken the United States in monthly downloads, and Alibaba alone has more derivative models than Google and Meta combined. If you build on open weights, the gravity well moved while you were busy shipping.

We read the full report so you can plan your next quarter around it. Here is what changed, why it matters for builders, and what to do about it before the next wave hits.

The numbers that matter

The raw scale is now hard to ignore.

  • 2 million public models on Hugging Face, roughly doubled from a year ago.
  • 13 million users on the platform across 2025.
  • 500,000+ public datasets.
  • Over 30 percent of the Fortune 500 now maintain verified Hugging Face accounts.

The catch sits in the long tail. About half of those 2 million models have fewer than 200 lifetime downloads. The top 0.01 percent (just 200 models) account for 49.6 percent of all downloads on the platform. The head is brutal. If you publish a model and do not maintain it, you join the silent half within weeks.
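
If you want to sanity-check that head-heaviness yourself, the Hub API exposes per-model download counts. A minimal sketch with huggingface_hub (note that `downloads` reflects a rolling 30-day window rather than lifetime totals, so it will not reproduce the report's exact 49.6 percent figure):

```python
# Sanity-check the concentration claim against the live Hub API.
# Caveat: ModelInfo.downloads is a rolling ~30-day count, not lifetime,
# so this approximates the shape of the curve, not the report's number.
from huggingface_hub import HfApi

api = HfApi()

# The 200 most-downloaded models on the platform right now.
top_200 = list(api.list_models(sort="downloads", direction=-1, limit=200))

head_total = sum(m.downloads or 0 for m in top_200)
print(f"Top-200 models: {head_total:,} downloads in the last ~30 days")
print("Number one right now:", top_200[0].id, top_200[0].downloads)
```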

China's open-source moment is now permanent

The DeepSeek-R1 moment of late 2024 was not a one-quarter event. It triggered a structural shift inside Chinese big tech, and the 2025 numbers prove it stuck.

  • Baidu went from 0 open releases in 2024 to over 100 in 2025.
  • ByteDance and Tencent each grew their release volume 8 to 9 times year over year.
  • MiniMax and several other labs flipped from closed to open distribution.
  • Chinese models now hit 41 percent of total platform downloads, and China passed the U.S. in monthly download volume.

The most-liked leaderboard has shifted too. In 2024 the top of the leaderboard was a Llama variant. In 2025 it is DeepSeek-R1, with Qwen variants stacked behind it. Alibaba's Qwen family alone has spawned more than 113,000 directly tagged derivative models, and over 200,000 if you count every related tag. That is more derivatives than the entire Google and Meta lineages combined.
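
Derivative counts are queryable too. When a model card declares a `base_model`, the Hub tags the repo with it, and `list_models` can filter on that tag. A sketch, assuming the `base_model:<repo_id>` tag format and using a single Qwen checkpoint as an example base (the report's 113,000 figure aggregates the whole family):

```python
# Count models that declare a given checkpoint as their base.
# Assumption: the Hub tags such repos as "base_model:<repo_id>";
# the report's family-wide totals sum over many base checkpoints.
from huggingface_hub import HfApi

api = HfApi()
base = "Qwen/Qwen2.5-7B-Instruct"  # one example base, not the whole family

derivatives = api.list_models(filter=f"base_model:{base}")
print(sum(1 for _ in derivatives), "tagged derivatives of", base)
```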

If your inference stack still defaults to "Llama unless I have a reason," that default is now eighteen months out of date.

The independent developer is the new big player

The most underreported shift in the report is who is doing the building.

  • Industry's share of model development fell from around 70 percent before 2022 to 37 percent in 2025.
  • Independent and unaffiliated developers rose from 17 percent to 39 percent in the same window.
  • Individual users now rank as the fourth most popular entity for producing trending models, ahead of several name-brand labs.

In plain terms: a solo developer fine-tuning Qwen on a single 4090 is now statistically more likely to produce a trending model than a small research lab. The Hugging Face authors put it bluntly: "Creating competitive models at a user level is more accessible than ever before."

For SingularityByte readers, this is the part you should print and tape to your monitor. Open weights plus 24GB of VRAM plus a weekend is a viable distribution channel in 2026.
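
What does "Qwen on a single 4090" look like in practice? A minimal QLoRA sketch with transformers, peft, and bitsandbytes: 4-bit quantization plus low-rank adapters is the standard trick that fits a ~7B fine-tune into 24GB. The dataset ID and hyperparameters below are placeholders, not a recipe from the report:

```python
# Minimal QLoRA sketch: fine-tune a ~7B model in 4-bit on one 24GB GPU.
# Model ID, dataset ID, and hyperparameters are illustrative placeholders.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

model_id = "Qwen/Qwen2.5-7B-Instruct"  # any 1-9B base works the same way

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the attention projections are the only trained weights.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
))

ds = load_dataset("your-org/your-sft-data", split="train")  # placeholder
ds = ds.map(lambda x: tok(x["text"], truncation=True, max_length=1024))

Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qwen-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```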

Smaller models won, but not the way you think

The size story is more interesting than the usual "big is dead" take.

The mean size of downloaded models grew from 827M parameters in 2023 to 20.8B in 2025. So yes, deployed models got bigger on paper. But the median only moved from 326M to 406M. That gap tells you everything. A small number of large MoE models drags the mean upward, while the bulk of real-world usage stays under a billion parameters.
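
The gap is plain skew arithmetic. A toy illustration (numbers invented for the example, not taken from the report) of how one giant MoE in the mix drags the mean far above the median:

```python
import statistics

# Nine sub-1B downloads plus one 600B MoE, in millions of parameters.
# Invented numbers, purely to show the mean/median skew mechanism.
sizes_m = [300, 350, 400, 400, 410, 420, 450, 500, 700, 600_000]

print(statistics.median(sizes_m))  # 415.0   -> the typical download is small
print(statistics.mean(sizes_m))    # 60393.0 -> one giant drags the mean up
```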

Concretely, the report finds that 1 to 9B parameter models are downloaded "at far higher rates" than 100B+ systems. The top-10 small models out-download the top-10 huge ones by roughly 4x, despite needing 10 to 100 times less hardware. Practical constraints (cost, latency, the GPU actually sitting on your desk) still set the ceiling.

The lesson for builders: pick your fights. A well-chosen 7B that you can fine-tune, ship, and iterate on weekly will outperform a 405B that you can barely run.

Engagement decays in six weeks

One stat we keep coming back to: mean engagement on a released model peaks immediately after release and decays to near-zero in about six weeks. Continuous updates are what keep a model on the leaderboard.

Translation: shipping a model is not a launch event. It is the start of a maintenance subscription you owe the community. Plan your roadmap accordingly, or accept that your release lives for a month and a half.

Robotics ate the dataset board

The biggest sub-community shift in the report has nothing to do with chatbots.

  • Robotics datasets on Hugging Face went from 1,145 in 2024 to 26,991 in 2025. That is a 23x year-over-year jump.
  • Robotics climbed from rank 44 to the number one dataset category in three years.
  • The next-largest category, text generation, holds roughly 5,000 datasets. Robotics is now more than 5x larger.
  • LeRobot's GitHub stars nearly tripled over the past year.

Two datasets to know. Learning to Drive (L2D) is the largest open multimodal corpus for spatial intelligence. RoboMIND ships 107,000+ real-world trajectories across 479 tasks and multiple robot embodiments.
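
You can track the category yourself. A sketch listing the most-downloaded robotics datasets via the Hub API, assuming the `task_categories:robotics` tag (which is how the Hub categorizes robotics datasets, LeRobot's included):

```python
# List the most-downloaded robotics datasets on the Hub.
# Assumption: robotics datasets carry the "task_categories:robotics" tag.
from huggingface_hub import HfApi

api = HfApi()
robotics = api.list_datasets(
    filter="task_categories:robotics", sort="downloads", direction=-1, limit=10
)
for ds in robotics:
    print(ds.id, ds.downloads)
```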

If you have been treating "robotics" as someone else's vertical, the data says otherwise. The same open-source playbook of shared weights, shared datasets, and fine-tune-and-ship is now running in physical AI, and it is moving faster than language did at the equivalent stage.

AI for science is the second hot sub-community. Every frontier lab now runs a dedicated science team. ByteDance is producing high-impact papers at a notable clip, particularly in medical AI.

Five things builders should do this quarter

Concrete actions, not vibes.

  1. Re-baseline your default model. If your stack still defaults to a Llama variant from 2024, run the same eval against Qwen3, DeepSeek-V3.2, and GLM-5.1 (a minimal harness sketch follows this list). The download data says one of those will beat your current default for the same VRAM budget.
  2. Audit your dependence on a single vendor. Chinese hardware support is now a real concern for some buyers and a real opportunity for others. Know which side you are on before procurement asks.
  3. Pick a 7B you can own. Ship a fine-tune of a 1 to 9B model in the next 30 days. The download data says this is where adoption actually happens.
  4. Watch the robotics datasets even if you do not build robots. Spatial reasoning data is leaking into general-purpose multimodal training. RoboMIND-style trajectories will show up in your next vision-language model whether you asked for them or not.
  5. Plan a 6-week update cycle. If you ship a model, schedule v0.1, v0.2, and v0.3 before you announce v0.0. Engagement decay is brutal, and the fix is cheap.
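
For item 1, the cheapest honest re-baseline is pushing the same held-out text through each candidate and comparing. A minimal perplexity sketch (model IDs and the eval file are placeholders; perplexity is a coarse proxy, so follow it with your real task evals):

```python
# Minimal re-baselining sketch: compare candidate models on held-out text
# by perplexity. Model IDs and the eval file are placeholders; perplexity
# is a coarse proxy, so follow up with your actual task evals.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

candidates = ["meta-llama/Llama-3.1-8B", "Qwen/Qwen2.5-7B"]  # placeholders
texts = open("heldout.txt").read().split("\n\n")[:50]        # your own data

for model_id in candidates:
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    nll, n_tokens = 0.0, 0
    for text in texts:
        ids = tok(text, return_tensors="pt", truncation=True,
                  max_length=2048).input_ids.to(model.device)
        with torch.no_grad():
            out = model(ids, labels=ids)  # mean NLL over the sequence
        nll += out.loss.item() * ids.numel()
        n_tokens += ids.numel()
    print(f"{model_id}: perplexity {math.exp(nll / n_tokens):.2f}")
    del model
    torch.cuda.empty_cache()
```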

Risks and open questions

Three things the report does not fully resolve.

  • Concentration risk. Half of all downloads sit on 200 models. If any of those 200 ship a regression or a license change, a real fraction of the production AI internet wobbles.
  • Western response. OpenAI's GPT-OSS, AI2's OLMo, and Google's Gemma 4 are real attempts to reclaim ground, but the derivative volume tells a different story. Watch the next two quarters of Gemma 4 forks carefully.
  • Sovereign AI fragmentation. South Korea named five national champions (LG AI Research, SK Telecom, Naver Cloud, NC AI, Upstage). Three Korean models trended on Hugging Face simultaneously in February 2026. Europe is still arguing about data centers. The map of who funds what is going to get messier, not cleaner.

What to watch in the next 90 days

  • Whether Gemma 4's derivative count starts to close the gap on Qwen.
  • Whether any Korean champion ships a model that lands in the top 200 download bracket.
  • The next LeRobot release and the open robotics datasets that follow it. If the 23x dataset jump repeats, robotics overtakes language as the most active sub-community on Hugging Face by year end.
  • Whether the Fortune 500 verified-account number crosses 40 percent.

The TL;DR: open-source AI is not a niche anymore, it is not Western anymore, and it is not just language anymore. If you build with open weights, your competition just got more interesting and your defaults just got out of date.
