DeepSeek V4 pairs open weights, 1M context, and low hosted pricing with strong agentic coding results. The bigger story is what that does to closed API economics and US lab positioning.

Imagine your team runs coding agents across a large engineering org. The agents summarize issues, inspect repositories, draft patches, and run tests. Last month, your API bill was $20,000 because every request went through a premium closed model, even though most tasks were routine repo scans or summaries.
DeepSeek V4 gives that team a lower-cost lane to benchmark before sending requests to a premium closed model.
DeepSeek-V4-Pro is a 1.6 trillion parameter mixture-of-experts model, but it activates about 49 billion parameters for any single token. DeepSeek-V4-Flash is the smaller 284 billion parameter version with 13 billion active parameters. Both ship with a 1 million token context window as the standard setting, not a special enterprise toggle.[1][2]
The release is open-weight and MIT-licensed, and the hosted API is priced below many premium closed frontier models. That combination puts pressure on the business model US labs have been defending: closed weights, premium API margins, and long-context pricing as a paid advantage.
This doesn't mean every company should self-host V4 tomorrow. It means the routing baseline changed again. Open-weight models are moving from fallback options to first-pass candidates in many workloads.
DeepSeek calls this a preview release, but the artifacts are concrete enough to evaluate today: open weights on Hugging Face, hosted API endpoints, OpenAI-compatible Chat Completions, Anthropic-compatible API access, and explicit setup guides for coding agents.[1][3]
| Model | Total parameters | Active parameters | Context | Intended role |
|---|---|---|---|---|
| DeepSeek-V4-Pro | 1.6T | 49B | 1M | Hard reasoning, agentic coding, long-context work |
| DeepSeek-V4-Flash | 284B | 13B | 1M | Fast chat, routing, summaries, lower-cost agents |
The model card says the instruct checkpoints use mixed precision: MoE expert weights are stored in FP4, while most other parameters use FP8. Base models are FP8 mixed.[2] That detail matters because V4 isn't a simple dense model where parameter count translates directly into active compute. Like other MoE systems, it has a large pool of capacity but routes each token through only a subset of experts.
Total parameters describe the full model's storage footprint, while active parameters describe the work done per token. V4-Pro is huge to store, but it doesn't run like a dense 1.6T model on every forward pass.
A dense model applies the same full parameter path to every token. A mixture-of-experts model has many expert blocks and a router that activates only a small subset for each token. For V4-Pro, the large total parameter count describes capacity, while the active parameter count describes the work done on one forward pass.
For a first-principles refresher, this is the core idea behind mixture-of-experts architecture: total parameters describe the full model, while active parameters describe the work done for each token.
A million-token context window is only useful if the model can afford to use it.
Standard attention gets painful because every new token has to attend over a growing history, and the server has to keep a KV cache for that history. The KV cache stores the intermediate attention keys and values the model has already computed so it doesn't have to recompute them for every new token. At 1 million tokens, that cache can grow to tens of gigabytes.
In long-running agent work, that history can include instructions, code files, command outputs, tool results, stack traces, retrieved documents, and previous reasoning traces. The context window may be 1M tokens, but the cost of filling and reusing it can dominate the system.
DeepSeek V4 attacks that with a hybrid attention design. The model card describes Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA makes attention sparse around selected positions, while HCA aggressively compresses older or less relevant history. The headline claim is that V4-Pro at 1M context uses 27% of DeepSeek-V3.2's single-token inference FLOPs and 10% of its KV cache.[1][2]
That's the technical center of the release. The world already had long-context models. The hard part is making long context cheap enough for agents that keep running. A code agent can burn hundreds of tool turns. A research agent can keep appending documents and intermediate notes. An incident-analysis agent can carry prior traces and logs. If KV cache memory is the bottleneck, the product falls back to truncation, summarization, retrieval, or expensive hosted context.
V4's pitch is different: compress the history enough that a larger part of the raw working trace can stay in the model's context.
DeepSeek is aiming V4 directly at coding agents. The official release claims state-of-the-art results among open-weight models on agentic coding benchmarks and lists integrations with Claude Code, OpenClaw, and OpenCode.[1][3]
The model card gives a more useful view because it breaks out the benchmark table. V4-Pro-Max reports 80.6% on SWE Verified, 67.9 on Terminal Bench 2.0, and 73.6% on MCPAtlas Public. V4-Flash-Max is close on several agent tasks, with 79.0% on SWE Verified and 69.0% on MCPAtlas.[2]
Here's how V4-Pro-Max stacks up against the closed frontier comparators DeepSeek chose in its own model card. Those comparators are Claude Opus 4.6 (Max effort), OpenAI GPT-5.4 (xHigh effort), and Gemini 3.1 Pro (High effort):
| Benchmark | Opus 4.6 Max | GPT-5.4 xHigh | Gemini 3.1 Pro High | DS-V4-Pro Max |
|---|---|---|---|---|
| SWE Verified (%) | 80.8 | - | 80.6 | 80.6 |
| Terminal Bench 2.0 | 65.4 | 75.1 | 68.5 | 67.9 |
| MCPAtlas Public (%) | 73.8 | 67.2 | 69.2 | 73.6 |
| SWE-Pro (%) | 57.3 | 57.7 | 54.2 | 55.4 |
| LiveCodeBench (%) | 88.8 | - | 91.7 | 93.5 |
Read those numbers carefully. They come from DeepSeek's own release materials, and a vendor picks the comparators and effort settings that flatter its result. Treat them as claims to reproduce in your own harness, not as permanent truth. Still, the direction is worth testing: the open-weight model is now close enough that the evaluation question becomes workload-specific.
Do not take self-reported benchmarks at face value. Different evaluation harnesses, agent scaffolding, and retry strategies can shift results significantly. Always validate against your own workload before making routing decisions.
For a coding platform, the practical question is no longer only:
Which closed model fits this task?
It's now:
Which requests need the expensive closed model, which belong on V4-Pro, and which belong on V4-Flash?
That routing question is where cost engineering starts to look like product architecture.
DeepSeek lists the hosted API as available in its release docs. Users can keep the same base_url and switch the model name to deepseek-v4-pro or deepseek-v4-flash. It also supports OpenAI Chat Completions and Anthropic API formats.[1]
There's one date to put on the migration calendar: deepseek-chat and deepseek-reasoner retire after July 24, 2026 at 15:59 UTC. DeepSeek says those names currently route to V4-Flash non-thinking and thinking modes, but they won't remain accessible after the retirement date.[1]
For teams already using DeepSeek as a low-cost reasoning lane, this isn't a cosmetic rename. Update model IDs, retest thinking mode behavior, and check any client code that assumes deepseek-reasoner is a stable model string. DeepSeek's Claude Code guide also uses deepseek-v4-pro[1m] for the Anthropic-compatible path, so copy the model ID from the integration guide for your client instead of assuming every tool uses the same spelling.[3]
Pin your model IDs explicitly, such as deepseek-v4-flash instead of deepseek-chat. Alias-based routing that depends on generic names will fail when the deprecation hits.
V4 lands in a sensitive spot for US AI labs because it hits four pressure points at once. For a working engineer, the most important pressure point is cost.
The older story was cleaner: US labs had the frontier models, while open models were useful but clearly behind. DeepSeek keeps making that story harder to defend.
V4-Pro's official benchmark table places it near closed frontier systems on several coding and agentic tasks, and its long-context design isn't a copied feature checklist.[2] The release is also openly framed around cost-effective 1M context, one area where closed APIs have been able to charge a premium.
This doesn't mean DeepSeek beats every US model. It doesn't. The model card shows cases where Gemini, Claude, OpenAI, Kimi, or GLM lead. The important point is narrower: a non-US lab is repeatedly producing open-weight releases that force frontier comparisons. That changes the negotiation between developers and API vendors.
Closed models still have real advantages: hosted reliability, safety work, multimodal product polish, enterprise controls, and fast access to the newest frontier releases. But open weights put a ceiling on what customers will tolerate for ordinary workloads.
DeepSeek's API page currently lists V4-Flash at $0.14 per 1M cache-miss input tokens and $0.28 per 1M output tokens. It lists V4-Pro at $0.435 input and $0.87 output per 1M tokens. Cache-hit input is lower for both models, and DeepSeek warns that prices may change, so date-stamp any buying decision.[4]
Compare that with current hosted pricing. OpenAI's pricing page lists GPT-5.4 at $2.50 input and $15.00 output per 1M text tokens.[5][6] Anthropic lists Claude Sonnet 4.6 at $3 input and $15 output per 1M tokens, and says Sonnet 4.6 includes 1M context at standard pricing.[7][8] Google's current Gemini pricing page lists Gemini 3.1 Pro Preview at $0.45 input and $2.70 output per 1M text tokens.[9] That keeps Gemini much closer to DeepSeek than GPT-5.4 or Sonnet 4.6 are, but Flash and V4-Pro still undercut it.
| Provider | Model | Input / 1M tokens | Output / 1M tokens |
|---|---|---|---|
| DeepSeek | V4-Flash | $0.14 | $0.28 |
| DeepSeek | V4-Pro | $0.435 | $0.87 |
| Gemini 3.1 Pro Preview | $0.45 | $2.70 | |
| OpenAI | GPT-5.4 | $2.50 | $15.00 |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 |
That isn't an apples-to-apples quality comparison. It's a margin comparison. If V4-Flash meets quality requirements for routing, summarization, codebase Q&A, or low-risk agent steps, the expensive model needs to justify each escalation.
For scale, suppose your support agent processes 2 million input tokens and 500,000 output tokens per day. With Claude Sonnet 4.6, that costs roughly $13.50 per day. With V4-Flash, it costs roughly $0.42 per day. Over a 30-day month, the difference is about $405 versus $13. Quality won't be identical, but the gap is large enough to validate before assuming the premium lane is necessary.
Long context used to be something you saved for premium flows because it was expensive and operationally awkward. V4 makes a different design pattern more plausible:
That matters for AI engineering because context policy becomes a product decision. You can ask whether a user pays for a more complete working set, whether cached context gets reused across sessions, and whether the agent carries raw tool traces instead of summarizing them too aggressively.
The LeetLLM deep dive on million-token context windows covers the broader point: context length isn't just a number in a model card. It changes memory, latency, evaluation, and cost.
US labs still have massive training clusters. That matters. But V4 points at a different kind of advantage: getting more useful inference out of the same serving fleet.
NVIDIA summarizes V4's architecture as reducing per-token inference FLOPs and KV cache memory burden relative to DeepSeek-V3.2, and reports early GB200 NVL72 tests for V4-Pro at more than 150 tokens per second per user.[10] SGLang and vLLM, two common inference runtimes, both published Day 0 recipes for serving V4 on Hopper and Blackwell-class systems.[11][12]
This is why sparse attention matters commercially. If a lab can reduce memory pressure and keep long-context agents on fewer GPUs, it can sell lower prices, serve more users, or spend the same budget on harder tasks.
The open-weight part is real, but it doesn't make V4-Pro a laptop model.
Two separate hardware questions matter:
The official Hugging Face model card lists V4-Pro at 1.6T total parameters with 49B active parameters, and V4-Flash at 284B total parameters with 13B active parameters. It also says the instruct weights use FP4 for MoE experts and FP8 for most other parameters.[2] That explains why the storage footprint is much lower than a naive FP16 calculation, but it's still large.
| Deployment target | Practical read |
|---|---|
| V4-Flash experimentation | Datacenter workstation or single server class, not a 24 GB gaming GPU |
| V4-Flash production | SGLang lists single-node serving on 4 GPUs for B200, GB200, GB300, or H200 platforms[11] |
| V4-Pro experimentation | Multi-GPU datacenter hardware with fast interconnect |
| V4-Pro production | SGLang lists B200 8 GPU, GB200 8 GPU across 2 nodes, GB300 4 GPU, or H200 8 GPU for FP4 checkpoints / 16 GPU for converted FP8 checkpoints[11] |
| V4-Pro with vLLM | vLLM lists B300 8 GPU, H200 8 GPU with context capped at 800K, or two GB200 NVL4 trays for 8 GPUs total[12] |
Do not assume V4-Pro can run on a laptop. V4-Flash is the realistic self-hosting candidate for many infrastructure teams, and even that needs a datacenter-class server. V4-Pro is a cluster deployment. The open-weight label does not remove storage, interconnect, KV cache, batching, or operations costs.
The exact recipe depends on runtime, checkpoint layout, context length, and parallelism strategy. The important point is that V4-Flash is the realistic self-hosting candidate for many infrastructure teams. V4-Pro is a cluster deployment.
This is also where V4 differs from the Llama comparison people usually make. Llama 4 Scout is 109B total, 17B active, and NVIDIA says an INT4 (4-bit) optimized Scout can run on a single H100. Llama 4 Maverick is 400B total, 17B active, with a 1M context window.[13] Llama 3.1 405B is a dense 405B model with 128K context.[14]
DeepSeek-V4-Flash is larger than Scout in total parameters but has only 13B active parameters. V4-Pro is in another storage class entirely. Its advantage isn't that it's easy to self-host. Its advantage is that it gives you open weights and competitive agent results if you have the infrastructure to run it.
For closed models, the comparison is different. You don't get a self-hosting plan. You get an API bill, rate limits, enterprise terms, and whatever context and caching behavior the vendor exposes. That's often the right tradeoff, especially for teams without inference engineers. But V4 gives larger teams another option.
V4 already ships in an aggressive mixed-precision format for instruct models. The model card describes FP4 MoE expert weights and FP8 for most other parameters, while the local inference README says you can switch experts to FP8 by changing config and conversion options.[2]
That means the usual local-model intuition needs care. With a Llama 70B dense model, moving from FP16 to 4-bit can be the difference between a server GPU and a high-end desktop. With V4, the official instruct checkpoint has already taken a big precision step. Community GGUF or lower-bit conversions may appear, but quality, tool-calling behavior, and 1M-context performance need validation before production use.
A good operator stance is:
For serving fundamentals, KV cache and PagedAttention, continuous batching, and LLM cost engineering are the concepts to understand before you buy GPUs.
One common mistake is to mix training economics and inference economics.
Training creates the model. Inference runs the model for users. The capital requirements, optimization targets, and accounting are different.
DeepSeek says V4 was pretrained on more than 32T tokens.[2] That's a lab-scale training project. Meta's Llama 3 paper reports that its 405B model was pretrained with FLOPs on 15.6T text tokens, which gives a useful sense of how expensive frontier-scale pretraining can get.[14]
Most companies aren't choosing whether to train DeepSeek V4. They're choosing whether to:
That's an inference decision. The cost model should focus on tokens, cache hit rate, utilization, latency target, staff time, and GPU reservations.
On June 11, 2026, DeepSeek's official pricing page lists V4-Pro at $0.435 per 1M cache-miss input tokens and $0.87 per 1M output tokens. At that price, teams can run agentic benchmarks against their own codebases before committing to infrastructure.
For low or bursty traffic, the API is usually the rational starting point. For steady, high-volume traffic with privacy or customization needs, self-hosting can make sense. The break-even point depends less on the model's benchmark score and more on whether your GPUs stay busy.
The most practical production response to DeepSeek V4 isn't ideological.
It's routing.
A practical 2026 router makes lane defaults explicit before any provider call:
| Request type | Example | Default model lane |
|---|---|---|
| Low-risk chat, summaries, classification | "Summarize these issue comments." | V4-Flash or another efficient open model |
| Medium agent steps with long context | "Find every failing test related to this migration." | V4-Flash with long context |
| Hard reasoning, multi-step coding, large refactors | "Refactor the auth service to use the new token verifier." | V4-Pro |
| Highest-risk reasoning, multimodal, or enterprise-policy flows | "Evaluate this security policy change for compliance impact." | Closed frontier API |
| Private data residency workloads | "Analyze internal incident logs on-premise." | Self-hosted open-weight model |
The pattern: classify once, send most traffic through the lowest-cost lane that passes quality, and keep a premium lane for hard or high-risk cases.
That's pricing pressure for any lab whose business assumes every token goes through a premium closed model. Strong closed models will still command premium pricing for hard tasks. But V4 makes it easier to avoid premium prices for easy and medium tasks.
DeepSeek V4 isn't the end of US AI leadership. It's a clear reminder that leadership is no longer measured only by who has the biggest closed model.
The important pieces are practical:
deepseek-chat and deepseek-reasoner names retire on July 24, 2026.For US labs, the lesson is practical: if a competitor can release open weights, support 1M context, integrate with coding agents, and sell the hosted version at a fraction of premium API pricing, closed-source models need to win on more than habit. They need to win on capability, reliability, product polish, enterprise trust, and total cost for the specific workload.
That's a much harder market than the one closed labs enjoyed two years ago.
If you can answer the mastery check cleanly, you're ready to go deeper on KV cache and PagedAttention, continuous batching, and LLM cost engineering.
DeepSeek V4 Preview Release
DeepSeek · 2026
DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
DeepSeek-AI · 2026
Integrate with AI Tools
DeepSeek · 2026
Models and Pricing
DeepSeek · 2026
OpenAI API Pricing
OpenAI · 2026
GPT-5.4 Model
OpenAI · 2026
Anthropic Model Pricing
Anthropic · 2026
Context windows
Anthropic · 2026
Gemini API Pricing
Google · 2026
Build with DeepSeek V4 Using NVIDIA Blackwell and GPU-Accelerated Endpoints
NVIDIA · 2026
DeepSeek-V4
SGLang · 2026
DeepSeek-V4-Pro vLLM Recipe
vLLM · 2026
NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick
NVIDIA · 2025
The Llama 3 Herd of Models.
Dubey, A., et al. · 2024 · arXiv preprint