The January 2025 release of DeepSeek-R1 initiated an avalanche of articles about DeepSeek—which, somewhat confusingly, is the name of a company, the models it makes and the chatbot that runs on those models. Given the volume of coverage and the excitement over the economic implications of a seemingly seismic shift in the AI landscape, it can be hard to separate fact from speculation and speculation from fiction.
What follows is a straightforward guide to help you sort through other articles about DeepSeek, separate signal from noise and skip over hype and hyperbole. We’ll start with some brief company history, explain the differences between each new DeepSeek model and break down their most interesting innovations (without getting too technical).
Here’s a quick breakdown of what we’ll cover:

- What DeepSeek is and where it came from
- What DeepSeek-R1 is and how reasoning models work
- How DeepSeek-R1 was trained
- Why DeepSeek-V3 is so fast and cheap: mixture of experts, multi-head latent attention, FP8 quantization and multi-token prediction
- What DeepSeek did (and didn’t) spend on training
- What the “DeepSeek-R1-Distill” models actually are
- How to separate DeepSeek facts from DeepSeek hype
DeepSeek is an AI research lab based in Hangzhou, China. It is also the name of the open-weight generative AI models the lab develops. In late January 2025, its DeepSeek-R1 LLM made mainstream tech and financial news for performance rivaling that of top proprietary models from OpenAI, Anthropic and Google at a significantly lower price point.
The origins of DeepSeek (the company) lie in those of High-Flyer, a Chinese hedge fund founded in 2016 by a trio of computer scientists with a focus on algorithmic trading strategies. In 2019, the firm used proceeds from its trading operations to establish an AI-driven subsidiary, High-Flyer AI, investing a reported USD 28 million in deep learning training infrastructure and quintupling that investment in 2021.
By 2023, High-Flyer’s AI research had grown to the extent that it warranted the establishment of a separate entity focused solely on AI—more specifically, on developing artificial general intelligence (AGI). The resulting research lab was named DeepSeek, with High-Flyer serving as its primary investor. Beginning with DeepSeek-Coder in November 2023, DeepSeek has developed an array of well-regarded open-weight models focusing primarily on math and coding performance.
In December 2024, the lab released DeepSeek-V3, the LLM on which DeepSeek-R1 is based. The breakthrough performances of DeepSeek-V3 and DeepSeek-R1 have positioned the lab as an unexpected leader in generative AI development moving forward.
DeepSeek-R1 is a reasoning model created by fine-tuning an LLM (DeepSeek-V3) to generate an extensive step-by-step chain of thought (CoT) process before determining the final “output” it gives the user. Other reasoning models include OpenAI’s o1 (based on GPT-4o) and o3, Google’s Gemini Flash 2.0 Thinking (based on Gemini Flash) and Alibaba’s open QwQ (“Qwen with Questions”), based on its Qwen2.5 model.
The intuition behind reasoning models comes from early research demonstrating that simply adding the phrase “Let’s think step by step” to a prompt significantly improves model outputs.[i] Subsequent research from Google DeepMind theorized that scaling up test-time compute (the amount of resources used to generate an output) could enhance model performance as much as scaling up train-time compute (the resources used to train a model).
Though reasoning models are slower and more expensive—you still must generate (and pay for) all of the tokens used to “think” about the final response, and those tokens eat into your available context window—they have pushed the vanguard of state-of-the-art performance since OpenAI’s release of o1. Most notably, the emphasis on training models to prioritize planning and forethought has made them adept at certain tasks involving complex math and reasoning problems previously inaccessible to LLMs.
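To make that concrete, here is a minimal sketch (in Python) of how an application might separate a reasoning model’s chain of thought from its final answer. It assumes the model wraps its “thinking” tokens in <think> tags, as DeepSeek-R1 does; the sample output string is invented for illustration, not real model output.

```python
import re

def split_reasoning(raw_output: str) -> tuple[str, str]:
    """Separate a reasoning model's chain of thought from its final answer.

    Assumes the model encloses its reasoning in <think>...</think> tags.
    Both parts are generated (and billed) as output tokens.
    """
    match = re.search(r"<think>(.*?)</think>", raw_output, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", raw_output, flags=re.DOTALL).strip()
    return reasoning, answer

# Invented example output for illustration
raw = "<think>17 x 24 = 17 x 20 + 17 x 4 = 340 + 68 = 408</think>The answer is 408."
reasoning, answer = split_reasoning(raw)
print(answer)     # "The answer is 408."
print(reasoning)  # the chain of thought, which still counts against cost and context
```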
For more on reasoning models, check out this excellent visual guide from Maarten Grootendorst.
DeepSeek-R1’s performance rivals that of leading models, including OpenAI’s o1 and Anthropic’s Claude 3.5 Sonnet, on math, code and reasoning tasks. Regardless of which model is “best”—which is subjective and situation-specific—it’s a remarkable feat for an open model. But the most important aspects of R1 are the training techniques that it introduced to the open source community.
Typically, the process of taking a standard LLM from untrained to ready for end users is as follows:

1. Pretraining: the model learns general language patterns by predicting the next token across a massive corpus of text.
2. Supervised fine-tuning (SFT): the pretrained model is trained further on labeled examples, such as prompts paired with ideal responses, to teach it to follow instructions.
3. Alignment: techniques such as reinforcement learning from human feedback (RLHF) steer the model toward helpful, safe outputs.
4. For reasoning models, a final fine-tuning stage teaches the model to produce a long chain of thought before committing to an answer.
For proprietary reasoning models such as o1, the specific details of this final step are typically a closely guarded trade secret. But DeepSeek has released a technical paper detailing their process.
In their first attempt to turn DeepSeek-V3 into a reasoning model, DeepSeek skipped SFT and went directly from pretraining to a simple reinforcement learning scheme: the model was rewarded only for arriving at verifiably correct final answers (an accuracy reward) and for presenting its reasoning in a prescribed format, with the chain of thought enclosed in designated tags (a format reward).
The resulting model (which they released as “DeepSeek-R1-Zero”) learned to generate complex chains of thought and employ reasoning strategies that yielded impressive performance on math and reasoning tasks. The process was straightforward and avoided costly labeled data for SFT. Unfortunately, as the technical paper explains, “DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability and language mixing.”
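As a rough illustration of that reinforcement learning scheme, here is a sketch of the kind of rule-based reward the paper describes: one component checks whether a verifiable answer is correct, another checks whether the response follows the expected reasoning format. The exact tag names and reward weighting below are simplified assumptions, not DeepSeek’s production code.

```python
import re

def format_reward(response: str) -> float:
    """Reward responses that wrap their reasoning in <think> tags before answering."""
    return 1.0 if re.fullmatch(r"\s*<think>.+?</think>.+", response, flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Reward responses whose final answer matches a verifiable ground truth."""
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return 1.0 if ground_truth in answer else 0.0

def reward(response: str, ground_truth: str) -> float:
    # The RL algorithm (GRPO, in DeepSeek's case) updates the model to maximize this signal.
    return accuracy_reward(response, ground_truth) + format_reward(response)

print(reward("<think>2 + 2 = 4</think>The answer is 4.", "4"))  # 2.0
print(reward("The answer is 5.", "4"))                          # 0.0
```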
To train R1-Zero’s successor, DeepSeek-R1, DeepSeek amended the process:

1. A small “cold start” dataset of long, well-formatted chain-of-thought examples was used for an initial round of SFT, addressing R1-Zero’s readability problems.
2. The model then went through the same reasoning-focused reinforcement learning as R1-Zero, with an additional reward for keeping its output in a single language.
3. The improved model was used to generate a much larger SFT dataset (with low-quality responses filtered out), which was combined with non-reasoning data and used to fine-tune the model again.
4. A final reinforcement learning stage aligned the model with human preferences for helpfulness and harmlessness across all types of prompts.
But that fine-tuning process is only half of the story. The other half is the base model for R1: DeepSeek-V3.
DeepSeek-V3, the backbone of DeepSeek-R1, is a text-only, 671 billion (671B) parameter mixture of experts (MoE) language model. Particularly for math, reasoning and coding tasks, it’s arguably the most capable open source LLM available as of February 2025. More importantly, it’s significantly faster and cheaper to use than other leading LLMs.
671 billion parameters means it's a huge model. For context, when Meta released Llama 3.1 405B—which is 40% smaller than DeepSeek-V3—in July 2024, their official announcement described it as “the world’s largest and most capable openly available foundation model.”[ii] The original ChatGPT model, GPT-3.5, reportedly had 175 billion parameters. It's worth noting that most major developers, including OpenAI, Anthropic and Google, don’t disclose the parameter counts of their proprietary models.
A larger parameter count typically increases a model’s “capacity” for knowledge and complexity. More parameters mean more ways to adjust the model, which means a greater ability to fit the nooks and crannies of training data. But increasing a model’s parameter count also increases computational requirements, making it slower and more expensive.
So how is DeepSeek-V3 (and therefore DeepSeek-R1) fast and cheap? The answer lies primarily in the mixture of experts architecture and how DeepSeek modified it.
A mixture of experts (MoE) architecture divides the layers of a neural network into separate sub-networks (or expert networks) and adds a gating network that routes tokens to select “experts.” During training, each “expert” eventually becomes specialized for a specific type of token—for instance, one expert might learn to specialize in punctuation while another handles prepositions—and the gating network learns to route each token to the most appropriate expert(s).
Rather than activating every model parameter for each token, an MoE model activates only the “experts” best suited to that token. DeepSeek-V3 has a total parameter count of 671 billion, but it has an active parameter count of only 37 billion. In other words, it only uses 37 billion of its 671 billion parameters for each token it reads or outputs.
Done well, this MoE approach balances the capacity of its total parameter count with the efficiency of its active parameter count. Broadly speaking, this explains how DeepSeek-V3 offers both the capabilities of a massive model and the speed of a smaller one.
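For the curious, here is a toy sketch (in PyTorch) of the basic idea: a gating network scores the experts for each token, and only the top-k experts are actually run. The sizes and routing details are simplified placeholders, not DeepSeek-V3’s actual configuration, which adds refinements such as shared experts and careful load balancing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy mixture-of-experts layer: a gating network routes each token to its
    top-k experts, so only a fraction of the total parameters are active per token."""

    def __init__(self, d_model=64, d_hidden=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)   # the gating (router) network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, tokens):                                # tokens: (n_tokens, d_model)
        scores = self.gate(tokens)                            # (n_tokens, n_experts)
        top_scores, top_idx = scores.topk(self.k, dim=-1)     # pick k experts per token
        weights = F.softmax(top_scores, dim=-1)               # how much to weight each pick
        output = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                routed = top_idx[:, slot] == e                # tokens sent to expert e
                if routed.any():
                    output[routed] += weights[routed, slot].unsqueeze(-1) * expert(tokens[routed])
        return output

layer = ToyMoELayer()
out = layer(torch.randn(10, 64))   # each token activates only 2 of the 8 experts
print(out.shape)                   # torch.Size([10, 64])
```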
MoEs got a lot of attention when Mistral AI released Mixtral 8x7B in late 2023, and GPT-4 was rumored to be an MoE. While some model providers—notably IBM® (with its Granite™ models), Databricks, Mistral and DeepSeek—have continued work on MoE models since then, many continue to focus on traditional “dense” models.
So if they're so great, why aren’t MoEs more ubiquitous? There are 2 simple explanations:

1. MoE models are harder to train. The gating network can collapse into routing most tokens to a handful of experts, so keeping training stable and the experts evenly used requires careful engineering.
2. MoE models trade compute for memory. Only a fraction of the parameters are active for any given token, but all of them must still be loaded into memory, which demands a lot of expensive GPU RAM.
DeepSeek-V3 features a number of clever engineering modifications to the basic MoE architecture that increase its stability while decreasing its memory usage and further reducing its computation requirements. Some of these modifications were introduced in its predecessor, DeepSeek-V2, in May 2024. Here are 3 notable innovations:
The attention mechanism that powers LLMs entails a massive number of matrix multiplications (often shortened to “matmul” in diagrams) to compute how each token relates to the others. All of those intermediate calculations must be stored in memory as things move from input to final output.
Multi-head latent attention (MLA), first introduced in DeepSeek-V2, “decomposes” each matrix into 2 smaller matrices. This doubles the number of multiplications, but greatly reduces the amount of intermediate data that must be kept in memory. In other words, it lowers memory costs (while increasing computational costs)—which is great for MoEs, since they already have low computational costs (but high memory costs).
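A back-of-the-envelope sketch shows why that trade-off is attractive. Suppose a single d × d projection matrix is replaced by two low-rank factors of shape d × r and r × d, and a compressed r-dimensional latent is cached per token instead of full keys and values. The numbers below are illustrative placeholders, not DeepSeek-V3’s actual dimensions.

```python
d, r = 4096, 512        # illustrative model dimension and compressed (latent) dimension
seq_len = 8192          # tokens whose attention states must be cached

full_matrix = d * d                 # parameters in one full projection matrix
low_rank = d * r + r * d            # parameters in the two smaller factors
print(full_matrix / low_rank)       # 4.0x fewer parameters to store

# More importantly for inference: caching one r-dimensional latent per token
# instead of d-dimensional keys and values shrinks the attention cache.
full_cache = seq_len * 2 * d        # keys + values per layer (in elements)
latent_cache = seq_len * r          # one shared latent per token
print(full_cache / latent_cache)    # 16.0x smaller cache in this toy setup
```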
In short: each parameter in DeepSeek-V3 is represented with fewer bits of precision than usual, using an 8-bit floating point (FP8) format. This reduces precision, but increases speed and further reduces memory usage. Usually, models are trained at a higher precision—often 16-bit or 32-bit—and only quantized down to lower precision (if at all) after training; DeepSeek-V3, by contrast, was trained with an FP8 mixed-precision framework from the start.
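The memory saving is easy to see with a toy example: the same weights stored at 32-bit, 16-bit and (simulated) 8-bit precision take 4, 2 and 1 bytes per parameter, at the cost of some rounding error. This sketch simulates 8-bit storage with simple integer scaling; real FP8 formats and DeepSeek’s mixed-precision recipe are more sophisticated.

```python
import numpy as np

weights = np.random.randn(1_000_000).astype(np.float32)   # 1M parameters at full precision

fp16 = weights.astype(np.float16)                          # half precision

# Crude simulation of 8-bit storage: scale to int8 and back
scale = np.abs(weights).max() / 127
int8 = np.round(weights / scale).astype(np.int8)
dequantized = int8.astype(np.float32) * scale

print(weights.nbytes, fp16.nbytes, int8.nbytes)   # 4000000 2000000 1000000 bytes
print(np.abs(weights - dequantized).mean())       # small, but nonzero, rounding error
```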
Multi-token prediction is what it sounds like: instead of predicting only one token at a time, the model also preemptively predicts some of the tokens after that—which is easier said than done.
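Conceptually, multi-token prediction amounts to giving the model extra prediction heads, each trained to predict a token further into the future, which provides a denser training signal per sequence. The bare-bones sketch below is a simplified assumption about the general idea; DeepSeek-V3’s actual implementation chains its extra prediction modules causally rather than keeping them independent.

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Toy language-model trunk with one output head per future position to predict."""

    def __init__(self, d_model=64, vocab_size=1000, n_future=2):
        super().__init__()
        self.trunk = nn.GRU(d_model, d_model, batch_first=True)   # stand-in for a transformer
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab_size) for _ in range(n_future)])

    def forward(self, token_embeddings):                          # (batch, seq, d_model)
        hidden, _ = self.trunk(token_embeddings)
        # heads[0] predicts token t+1, heads[1] predicts token t+2, and so on.
        return [head(hidden) for head in self.heads]

model = MultiTokenHead()
logits = model(torch.randn(2, 16, 64))
print([l.shape for l in logits])   # two (2, 16, 1000) tensors, one per future offset
```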
No. Technically, DeepSeek reportedly spent about USD 5.576 million on the final pretraining run for DeepSeek-V3. However, that number has been taken dramatically out of context.
DeepSeek has not announced how much it spent on data and compute to yield DeepSeek-R1. The widely reported “USD 6 million” figure is specifically for DeepSeek-V3.
Furthermore, citing only the final pretraining run cost is misleading. As IBM’s Kate Soule, Director of Technical Product Management for Granite, put it in an episode of the Mixture of Experts Podcast: “That’s like saying if I’m gonna run a marathon, the only distance I’ll run is [that] 26.2 miles. The reality is, you’re gonna train for months, practicing, running hundreds or thousands of miles, leading up to that 1 race.”
Even the DeepSeek-V3 paper makes it clear that USD 5.576 million is only an estimate of how much the final training run would cost in terms of average rental prices for NVIDIA H800 GPUs. It excludes all prior research, experimentation and data costs. It also excludes their actual training infrastructure—one report from SemiAnalysis estimates that DeepSeek has invested over USD 500 million in GPUs since 2023—as well as employee salaries, facilities and other typical business expenses.
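The arithmetic behind that estimate is simple. The DeepSeek-V3 report counts roughly 2.788 million H800 GPU-hours for the final training run and assumes a rental price of about USD 2 per GPU-hour:

2,788,000 GPU-hours × USD 2 per GPU-hour ≈ USD 5.576 million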
To be clear, spending only USD 5.576 million on a pretraining run for a model of that size and ability is still impressive. For comparison, the same SemiAnalysis report posits that Anthropic’s Claude 3.5 Sonnet—another contender for the world's strongest LLM (as of early 2025)—cost tens of millions of USD to pretrain. That same design efficiency also enables DeepSeek-V3 to be operated at significantly lower costs (and latency) than its competition.
But the notion that we have arrived at a drastic paradigm shift, or that Western AI developers spent billions of dollars for no reason and that new frontier models can now be developed for low 7-figure, all-in costs, is misguided.
DeepSeek-R1 is impressive, but it’s ultimately a version of DeepSeek-V3, which is a huge model. Despite its efficiency, for many use cases it’s still too large and RAM-intensive.
Rather than developing smaller versions of DeepSeek-V3 and then fine-tuning those models, DeepSeek took a more direct and replicable approach: using knowledge distillation on smaller open source models from the Qwen and Llama model families to make them behave like DeepSeek-R1. They called these models “DeepSeek-R1-Distill.”
Knowledge distillation, in essence, is an abstract form of model compression. Rather than just training a model directly on training data, knowledge distillation trains a “student model” to emulate the way a larger “teacher model” processes that training data. The student model’s parameters are adjusted to produce not only the same final outputs as the teacher model, but also the same thought process—the intermediate calculations, predictions or chain-of-thought steps—as the teacher.
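For readers who want to see the mechanics, here is a minimal sketch of a classic distillation objective in PyTorch: the student is trained both on the ground-truth labels and on matching the teacher’s softened output distribution. This illustrates distillation in general; it is not DeepSeek’s exact recipe, which fine-tuned the smaller models on outputs generated by DeepSeek-R1.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend of (1) standard cross-entropy on the true labels and
    (2) KL divergence pushing the student toward the teacher's softened predictions."""
    hard_loss = F.cross_entropy(student_logits, labels)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * hard_loss + (1 - alpha) * soft_loss

# Toy batch: 4 examples, 10-class "vocabulary"
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```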
Despite their names, the “DeepSeek-R1-Distill” models are not actually DeepSeek-R1. They are versions of Llama and Qwen models fine-tuned to act like DeepSeek-R1. While the R1-distills are impressive for their size, they don’t match the “real” DeepSeek-R1.
So if a given platform claims to offer or use “R1,” it’s wise to confirm which “R1” they’re talking about.
Between the unparalleled public interest and unfamiliar technical details, the hype around DeepSeek and its models has at times resulted in the significant misrepresentation of some basic facts.
For example, early February featured a swarm of stories about how a team from UC Berkeley apparently “re-created” or “replicated” DeepSeek-R1 for only USD 30.[iii, iv, v] That’s a deeply intriguing headline with incredible implications if true—but it’s fundamentally inaccurate in multiple ways:

- What the Berkeley team reproduced was not DeepSeek-R1 itself, but the reinforcement learning recipe behind the experimental DeepSeek-R1-Zero.
- They applied that recipe to a small open source base model, a tiny fraction of the size of the 671B-parameter DeepSeek-V3 that R1 is built on.
- The resulting model learned to solve one narrow category of problem (the “Countdown” arithmetic game), not to reason across math, code and language in general.
- The USD 30 figure covers only the compute for that small training run, not the cost of building the base model it started from.
In short, the UC Berkeley team did not re-create DeepSeek-R1 for USD 30. They simply showed that DeepSeek’s experimental, reinforcement learning-only fine-tuning approach, R1-Zero, can be used to teach small models to solve intricate math problems. Their work is interesting, impressive and important. But without a fairly detailed understanding of DeepSeek’s model offerings—which many busy readers (and writers) don’t have time for—it’s easy to get the wrong idea.
As developers and analysts spend more time with these models, the hype will probably settle down a bit. Much in the same way that an IQ test alone is not an adequate way to hire employees, raw benchmark results are not enough to determine whether any model is the “best” for your specific use case. Models, like people, have intangible strengths and weaknesses that take time to understand.
It will take a while to determine the long-term efficacy and practicality of these new DeepSeek models in a formal setting. As WIRED reported in January, DeepSeek-R1 has performed poorly in security and jailbreaking tests. These concerns will likely need to be addressed to make R1 or V3 safe for most enterprise use.
Meanwhile, new models will arrive and continue to push the state of the art. Consider that GPT-4o and Claude 3.5 Sonnet, the leading closed-source models against which DeepSeek's models are being compared, were first released last summer: a lifetime ago in generative AI terms. Following the release of R1, Alibaba announced the impending release of their own massive MoE model, Qwen2.5-Max, which they claim beats DeepSeek-V3 across the board.[vi] More providers will likely follow suit.
Most importantly, the industry and open source community will experiment with the exciting new ideas that DeepSeek has brought to the table, integrating or adapting them for new models and techniques. The beauty of open source innovation is that a rising tide lifts all boats.
[i] “Large language models are zero-shot reasoners,” arXiv, 24 May 2022
[ii] “Introducing Llama 3.1: Our most capable models to date,” Meta, 24 July 2024
[iii] “Team Says They’ve Recreated DeepSeek’s OpenAI Killer for Literally $30,” Futurism, 30 January 2025
[iv] “DeepSeek AI replicated for just $30 using Countdown game,” The Independent, 3 February 2025
[v] “Berkeley Research Replicate DeepSeek R1’s Core Tech for Just $30,” XYZ Labs, 26 January 2025
[vi] “Qwen2.5-Max: Exploring the Intelligence of Large-Scale MoE Model,” Qwen, 28 January 2025