DeepSeek V4 and the US AI Lab Squeeze

DeepSeek V4 pairs open weights, 1M context, and low hosted pricing with strong agentic coding results. The bigger story is what that does to closed API economics and US lab positioning.

LeetLLM Team · April 27, 2026 · Updated June 11, 2026 · 20 min read

Imagine your team runs coding agents across a large engineering org. The agents summarize issues, inspect repositories, draft patches, and run tests. Last month, your API bill was $20,000 because every request went through a premium closed model, even though most tasks were routine repo scans or summaries.

DeepSeek V4 gives that team a lower-cost lane to benchmark before sending requests to a premium closed model.

DeepSeek-V4-Pro is a 1.6 trillion parameter mixture-of-experts model, but it activates about 49 billion parameters for any single token. DeepSeek-V4-Flash is the smaller 284 billion parameter version with 13 billion active parameters. Both ship with a 1 million token context window as the standard setting, not a special enterprise toggle.[1][2]

The release is open-weight and MIT-licensed, and the hosted API is priced below many premium closed frontier models. That combination puts pressure on the business model US labs have been defending: closed weights, premium API margins, and long-context pricing as a paid advantage.

This doesn't mean every company should self-host V4 tomorrow. It means the routing baseline changed again. Open-weight models are moving from fallback options to first-pass candidates in many workloads.

What DeepSeek Released

DeepSeek calls this a preview release, but the artifacts are concrete enough to evaluate today: open weights on Hugging Face, hosted API endpoints, OpenAI-compatible Chat Completions, Anthropic-compatible API access, and explicit setup guides for coding agents.[1][3]

| Model | Total parameters | Active parameters | Context | Intended role |
| --- | --- | --- | --- | --- |
| DeepSeek-V4-Pro | 1.6T | 49B | 1M | Hard reasoning, agentic coding, long-context work |
| DeepSeek-V4-Flash | 284B | 13B | 1M | Fast chat, routing, summaries, lower-cost agents |

The model card says the instruct checkpoints use mixed precision: MoE expert weights are stored in FP4, while most other parameters use FP8. Base models are FP8 mixed.[2] That detail matters because V4 isn't a simple dense model where parameter count translates directly into active compute. Like other MoE systems, it has a large pool of capacity but routes each token through only a subset of experts.

Total parameters describe the full model's storage footprint, while active parameters describe the work done per token. V4-Pro is huge to store, but it doesn't run like a dense 1.6T model on every forward pass.

A dense model applies the same full parameter path to every token. A mixture-of-experts model has many expert blocks and a router that activates only a small subset for each token. For V4-Pro, the large total parameter count describes capacity, while the active parameter count describes the work done on one forward pass.

For a first-principles refresher, this split between stored capacity and per-token work is the core idea behind mixture-of-experts architecture.
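To make the total-versus-active distinction concrete, here is a minimal sketch that computes the share of parameters touched per token, using only the parameter counts reported in the model card. The function name `active_fraction` is illustrative, and the result says nothing about memory traffic, routing overhead, or attention cost.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of a model's parameters used in one token's forward pass."""
    if total_params <= 0 or not 0 < active_params <= total_params:
        raise ValueError("expected 0 < active_params <= total_params")
    return active_params / total_params


# (total, active) parameter counts as reported in the article.
MODELS = {
    "DeepSeek-V4-Pro": (1.6e12, 49e9),
    "DeepSeek-V4-Flash": (284e9, 13e9),
}

for name, (total, active) in MODELS.items():
    share = active_fraction(total, active)
    print(f"{name}: {share:.1%} of parameters active per token")
```

On these figures, V4-Pro activates roughly 3% of its parameters per token and V4-Flash under 5%, which is why storage and per-token compute have to be budgeted separately.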

Why a Million Tokens Only Matters If You Can Afford to Use Them

A million-token context window is only useful if the model can afford to use it.

Standard attention gets painful because every new token has to attend over a growing history, and the server has to keep a KV cache for that history. The KV cache stores the intermediate attention keys and values the model has already computed so it doesn't have to recompute them for every new token. At 1 million tokens, that cache can grow to tens of gigabytes.
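As a rough illustration of how that cache grows, the sketch below applies the standard sizing formula for uncompressed attention: two tensors (keys and values) per layer, per KV head, per token. The layer count, head count, head dimension, and precision are hypothetical placeholders, not DeepSeek V4's real configuration, and the formula ignores the compression V4 applies.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """Uncompressed KV cache size for one sequence.

    The factor of 2 covers one key vector and one value vector
    per layer, per KV head, per token.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical dense-attention configuration, chosen only for illustration.
HYPOTHETICAL = dict(layers=32, kv_heads=8, head_dim=128, bytes_per_value=1)

for tokens in (32_000, 384_000, 1_000_000):
    size_gb = kv_cache_bytes(tokens, **HYPOTHETICAL) / 1e9
    print(f"{tokens:>9,} tokens -> {size_gb:.1f} GB per sequence")
```

The cache scales linearly with context length and is paid per concurrent sequence, so a server holding many long-running agent sessions multiplies this number by its batch size.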

In long-running agent work, that history can include instructions, code files, command outputs, tool results, stack traces, retrieved documents, and previous reasoning traces. The context window may be 1M tokens, but the cost of filling and reusing it can dominate the system.

DeepSeek V4 attacks that with a hybrid attention design. The model card describes Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA makes attention sparse around selected positions, while HCA aggressively compresses older or less relevant history. The headline claim is that V4-Pro at 1M context uses 27% of DeepSeek-V3.2's single-token inference FLOPs and 10% of its KV cache.[1][2]

That's the technical center of the release. The world already had long-context models. The hard part is making long context cheap enough for agents that keep running. A code agent can burn hundreds of tool turns. A research agent can keep appending documents and intermediate notes. An incident-analysis agent can carry prior traces and logs. If KV cache memory is the bottleneck, the product falls back to truncation, summarization, retrieval, or expensive hosted context.

V4's pitch is different: compress the history enough that a larger part of the raw working trace can stay in the model's context.

Figure: Dense long-context attention compared with DeepSeek V4 hybrid attention: a recent dense block, sparse selected positions, compressed HCA memory, and lower reported FLOPs and KV cache at 1M context.
V4's model card describes recent dense attention, sparse selected positions, and compressed memory. Bars show its release-reported 1M-context ratios relative to DeepSeek-V3.2.

Agentic Coding Is the Showcase

DeepSeek is aiming V4 directly at coding agents. The official release claims state-of-the-art results among open-weight models on agentic coding benchmarks and lists integrations with Claude Code, OpenClaw, and OpenCode.[1][3]

The model card gives a more useful view because it breaks out the benchmark table. V4-Pro-Max reports 80.6% on SWE Verified, 67.9 on Terminal Bench 2.0, and 73.6% on MCPAtlas Public. V4-Flash-Max is close on several agent tasks, with 79.0% on SWE Verified and 69.0% on MCPAtlas.[2]

Here's how V4-Pro-Max stacks up against the closed frontier comparators DeepSeek chose in its own model card. Those comparators are Claude Opus 4.6 (Max effort), OpenAI GPT-5.4 (xHigh effort), and Gemini 3.1 Pro (High effort):

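| Benchmark | Opus 4.6 Max | GPT-5.4 xHigh | Gemini 3.1 Pro High | DS-V4-Pro Max |
| --- | --- | --- | --- | --- |
| SWE Verified (%) | 80.8 | - | 80.6 | 80.6 |
| Terminal Bench 2.0 | 65.4 | 75.1 | 68.5 | 67.9 |
| MCPAtlas Public (%) | 73.8 | 67.2 | 69.2 | 73.6 |
| SWE-Pro (%) | 57.3 | 57.7 | 54.2 | 55.4 |
| LiveCodeBench (%) | 88.8 | - | 91.7 | 93.5 |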

Read those numbers carefully. They come from DeepSeek's own release materials, and a vendor picks the comparators and effort settings that flatter its result. Treat them as claims to reproduce in your own harness, not as permanent truth. Still, the direction is worth testing: the open-weight model is now close enough that the evaluation question becomes workload-specific.

Do not take self-reported benchmarks at face value. Different evaluation harnesses, agent scaffolding, and retry strategies can shift results significantly. Always validate against your own workload before making routing decisions.

For a coding platform, the practical question is no longer only:

Which closed model fits this task?

It's now:

Which requests need the expensive closed model, which belong on V4-Pro, and which belong on V4-Flash?

That routing question is where cost engineering starts to look like product architecture.

The API Migration Is Small, but the Deadline Is Real

DeepSeek lists the hosted API as available in its release docs. Users can keep the same base_url and switch the model name to deepseek-v4-pro or deepseek-v4-flash. It also supports OpenAI Chat Completions and Anthropic API formats.[1]

There's one date to put on the migration calendar: deepseek-chat and deepseek-reasoner retire after July 24, 2026 at 15:59 UTC. DeepSeek says those names currently route to V4-Flash non-thinking and thinking modes, but they won't remain accessible after the retirement date.[1]

For teams already using DeepSeek as a low-cost reasoning lane, this isn't a cosmetic rename. Update model IDs, retest thinking mode behavior, and check any client code that assumes deepseek-reasoner is a stable model string. DeepSeek's Claude Code guide also uses deepseek-v4-pro[1m] for the Anthropic-compatible path, so copy the model ID from the integration guide for your client instead of assuming every tool uses the same spelling.[3]

Pin your model IDs explicitly, such as deepseek-v4-flash instead of deepseek-chat. Alias-based routing that depends on generic names will fail when the deprecation hits.
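One way to catch legacy names before the deadline is a small resolver in your client layer. This is a sketch under stated assumptions: the alias-to-mode mapping follows the article's description of current routing, the function names are hypothetical, and how thinking mode is actually requested on the wire is left to DeepSeek's API docs.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple

# Retirement time reported in the article for the legacy names.
LEGACY_RETIREMENT = datetime(2026, 7, 24, 15, 59, tzinfo=timezone.utc)

# Legacy alias -> (pinned model ID, thinking mode), per the article.
LEGACY_ALIASES = {
    "deepseek-chat": ("deepseek-v4-flash", False),
    "deepseek-reasoner": ("deepseek-v4-flash", True),
}


def pin_model_id(name: str) -> Tuple[str, Optional[bool]]:
    """Map a legacy alias to an explicit model ID and thinking mode.

    Names that are not legacy aliases pass through unchanged, with
    thinking mode left as None (caller decides).
    """
    if name in LEGACY_ALIASES:
        return LEGACY_ALIASES[name]
    return name, None


def legacy_names_available(now: datetime) -> bool:
    """True while the legacy names are still expected to resolve."""
    return now <= LEGACY_RETIREMENT


print(pin_model_id("deepseek-reasoner"))
print(pin_model_id("deepseek-v4-pro"))
```

Running the resolver over existing config files and logging every hit gives a migration checklist well before the retirement time.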

What the Price Gap Means for Engineers

V4 lands in a sensitive spot for US AI labs because it hits four pressure points at once. For a working engineer, the most important pressure point is cost.

1. The closed-frontier gap keeps shifting

The older story was cleaner: US labs had the frontier models, while open models were useful but clearly behind. DeepSeek keeps making that story harder to defend.

V4-Pro's official benchmark table places it near closed frontier systems on several coding and agentic tasks, and its long-context design isn't a copied feature checklist.[2] The release is also openly framed around cost-effective 1M context, one area where closed APIs have been able to charge a premium.

This doesn't mean DeepSeek beats every US model. It doesn't. The model card shows cases where Gemini, Claude, OpenAI, Kimi, or GLM lead. The important point is narrower: a non-US lab is repeatedly producing open-weight releases that force frontier comparisons. That changes the negotiation between developers and API vendors.

2. Open weights pressure closed pricing

Closed models still have real advantages: hosted reliability, safety work, multimodal product polish, enterprise controls, and fast access to the newest frontier releases. But open weights put a ceiling on what customers will tolerate for ordinary workloads.

DeepSeek's API page currently lists V4-Flash at $0.14 per 1M cache-miss input tokens and $0.28 per 1M output tokens. It lists V4-Pro at $0.435 input and $0.87 output per 1M tokens. Cache-hit input is lower for both models, and DeepSeek warns that prices may change, so date-stamp any buying decision.[4]

Compare that with current hosted pricing. OpenAI's pricing page lists GPT-5.4 at $2.50 input and $15.00 output per 1M text tokens.[5][6] Anthropic lists Claude Sonnet 4.6 at $3 input and $15 output per 1M tokens, and says Sonnet 4.6 includes 1M context at standard pricing.[7][8] Google's current Gemini pricing page lists Gemini 3.1 Pro Preview at $0.45 input and $2.70 output per 1M text tokens.[9] That keeps Gemini much closer to DeepSeek than GPT-5.4 or Sonnet 4.6 are, but V4-Flash and V4-Pro still undercut it, V4-Pro only narrowly on input and by a wide margin on output.

| Provider | Model | Input / 1M tokens | Output / 1M tokens |
| --- | --- | --- | --- |
| DeepSeek | V4-Flash | $0.14 | $0.28 |
| DeepSeek | V4-Pro | $0.435 | $0.87 |
| Google | Gemini 3.1 Pro Preview | $0.45 | $2.70 |
| OpenAI | GPT-5.4 | $2.50 | $15.00 |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 |

Figure: Fixed support-workload daily cost comparison showing DeepSeek V4-Flash and current V4-Pro pricing below premium hosted model lanes, with reminder to keep cheap routing as baseline and escalate only when quality or policy requires it.
The table has exact token prices. This visual teaches routing shape instead: cheap lane first, premium lane only when evals or policy constraints justify it.

That isn't an apples-to-apples quality comparison. It's a margin comparison. If V4-Flash meets quality requirements for routing, summarization, codebase Q&A, or low-risk agent steps, the expensive model needs to justify each escalation.

For scale, suppose your support agent processes 2 million input tokens and 500,000 output tokens per day. With Claude Sonnet 4.6, that costs roughly $13.50 per day. With V4-Flash, it costs roughly $0.42 per day. Over a 30-day month, the difference is about $405 versus $13. Quality won't be identical, but the gap is large enough to validate before assuming the premium lane is necessary.
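The arithmetic behind that comparison is easy to rerun for your own volumes. The sketch below uses the list prices quoted in this article (cache-miss input, no batch or cache-hit discounts); the dictionary and function names are illustrative, and prices should be re-checked against each vendor's page before any decision.

```python
# (input USD, output USD) per 1M tokens, as quoted in this article.
PRICES_PER_MILLION = {
    "V4-Flash": (0.14, 0.28),
    "V4-Pro": (0.435, 0.87),
    "Gemini 3.1 Pro Preview": (0.45, 2.70),
    "GPT-5.4": (2.50, 15.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
}


def daily_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Daily spend in USD for a fixed token volume at list prices."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# The support-agent workload from the paragraph above.
INPUT_TOKENS, OUTPUT_TOKENS = 2_000_000, 500_000

for model in PRICES_PER_MILLION:
    per_day = daily_cost(model, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{model:<24} ${per_day:>6.2f}/day  ${per_day * 30:>7.2f}/30 days")
```

Swapping in your measured input and output volumes, and a cache-hit rate if you track one, turns this into a first-pass budget check for each lane.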

3. Lower-cost 1M context changes product design

Long context used to be something you saved for premium flows because it was expensive and operationally awkward. V4 makes a different design pattern more plausible:

  1. Put the whole repository, transcript, or document set into context when the task benefits from locality.
  2. Use retrieval and compaction as optimization tools, not as mandatory workarounds.
  3. Route routine steps to Flash and escalate hard steps to Pro or a closed model.

That matters for AI engineering because context policy becomes a product decision. You can ask whether a user pays for a more complete working set, whether cached context gets reused across sessions, and whether the agent carries raw tool traces instead of summarizing them too aggressively.

The LeetLLM deep dive on million-token context windows covers the broader point: context length isn't just a number in a model card. It changes memory, latency, evaluation, and cost.

4. Hardware efficiency is a strategic weapon

US labs still have massive training clusters. That matters. But V4 points at a different kind of advantage: getting more useful inference out of the same serving fleet.

NVIDIA summarizes V4's architecture as reducing per-token inference FLOPs and KV cache memory burden relative to DeepSeek-V3.2, and reports early GB200 NVL72 tests for V4-Pro at more than 150 tokens per second per user.[10] SGLang and vLLM, two common inference runtimes, both published Day 0 recipes for serving V4 on Hopper and Blackwell-class systems.[11][12]

This is why sparse attention matters commercially. If a lab can reduce memory pressure and keep long-context agents on fewer GPUs, it can sell lower prices, serve more users, or spend the same budget on harder tasks.

What It Takes to Run DeepSeek V4 Yourself

The open-weight part is real, but it doesn't make V4-Pro a laptop model.

Two separate hardware questions matter:

  1. Can you load the weights? Total parameters drive storage and VRAM pressure.
  2. Can you serve useful traffic? Active parameters, KV cache, batching, networking, and the context length drive throughput and latency.

The official Hugging Face model card lists V4-Pro at 1.6T total parameters with 49B active parameters, and V4-Flash at 284B total parameters with 13B active parameters. It also says the instruct weights use FP4 for MoE experts and FP8 for most other parameters.[2] That explains why the storage footprint is much lower than a naive FP16 calculation, but it's still large.
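A back-of-the-envelope estimate shows the size of that gap. The model card gives the precisions but this article does not give the split between expert and non-expert parameters, so `expert_fraction` below is a made-up placeholder; the sketch also ignores quantization scales, metadata, and runtime buffers.

```python
def weight_storage_tb(total_params: float, expert_fraction: float,
                      expert_bits: int = 4, other_bits: int = 8) -> float:
    """Approximate weight storage in terabytes for a mixed-precision MoE."""
    if not 0 <= expert_fraction <= 1:
        raise ValueError("expert_fraction must be between 0 and 1")
    expert_bytes = total_params * expert_fraction * expert_bits / 8
    other_bytes = total_params * (1 - expert_fraction) * other_bits / 8
    return (expert_bytes + other_bytes) / 1e12


V4_PRO_PARAMS = 1.6e12

# Naive baseline: every parameter stored in 16 bits.
naive_fp16 = weight_storage_tb(V4_PRO_PARAMS, expert_fraction=0.0, other_bits=16)

# Hypothetical split: 95% of parameters in FP4 experts, the rest in FP8.
mixed = weight_storage_tb(V4_PRO_PARAMS, expert_fraction=0.95)

print(f"Naive FP16:        {naive_fp16:.2f} TB")
print(f"FP4/FP8 (assumed): {mixed:.2f} TB")
```

Whatever the true split is, the mixed-precision footprint for 1.6T parameters lands between 0.8 TB (all FP4) and 1.6 TB (all FP8), which is still multi-GPU territory before any KV cache is allocated.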

| Deployment target | Practical read |
| --- | --- |
| V4-Flash experimentation | Datacenter workstation or single server class, not a 24 GB gaming GPU |
| V4-Flash production | SGLang lists single-node serving on 4 GPUs for B200, GB200, GB300, or H200 platforms[11] |
| V4-Pro experimentation | Multi-GPU datacenter hardware with fast interconnect |
| V4-Pro production | SGLang lists B200 8 GPU, GB200 8 GPU across 2 nodes, GB300 4 GPU, or H200 8 GPU for FP4 checkpoints / 16 GPU for converted FP8 checkpoints[11] |
| V4-Pro with vLLM | vLLM lists B300 8 GPU, H200 8 GPU with context capped at 800K, or two GB200 NVL4 trays for 8 GPUs total[12] |

Do not assume V4-Pro can run on a laptop. V4-Flash is the realistic self-hosting candidate for many infrastructure teams, and even that needs a datacenter-class server. V4-Pro is a cluster deployment. The open-weight label does not remove storage, interconnect, KV cache, batching, or operations costs.

The exact recipe depends on runtime, checkpoint layout, context length, and parallelism strategy, so treat the table as a starting point and confirm against the current SGLang and vLLM guides.

This is also where V4 differs from the Llama comparison people usually make. Llama 4 Scout is 109B total, 17B active, and NVIDIA says an INT4 (4-bit) optimized Scout can run on a single H100. Llama 4 Maverick is 400B total, 17B active, with a 1M context window.[13] Llama 3.1 405B is a dense 405B model with 128K context.[14]

DeepSeek-V4-Flash is larger than Scout in total parameters but has only 13B active parameters. V4-Pro is in another storage class entirely. Its advantage isn't that it's easy to self-host. Its advantage is that it gives you open weights and competitive agent results if you have the infrastructure to run it.

For closed models, the comparison is different. You don't get a self-hosting plan. You get an API bill, rate limits, enterprise terms, and whatever context and caching behavior the vendor exposes. That's often the right tradeoff, especially for teams without inference engineers. But V4 gives larger teams another option.

Quantization Helps, but It Doesn't Change the Category

V4 already ships in an aggressive mixed-precision format for instruct models. The model card describes FP4 MoE expert weights and FP8 for most other parameters, while the local inference README says you can switch experts to FP8 by changing config and conversion options.[2]

That means the usual local-model intuition needs care. With a Llama 70B dense model, moving from FP16 to 4-bit can be the difference between a server GPU and a high-end desktop. With V4, the official instruct checkpoint has already taken a big precision step. Community GGUF or lower-bit conversions may appear, but quality, tool-calling behavior, and 1M-context performance need validation before production use.

A good operator stance is:

  • Use DeepSeek's API first to evaluate quality and routing value.
  • Use V4-Flash for self-hosting pilots if your workload has steady volume.
  • Treat V4-Pro self-hosting as a cluster project, not a developer workstation project.
  • Benchmark at the context length you actually need. A model that fits at 32K may not fit your 384K or 1M agent workload.

For serving fundamentals, KV cache and PagedAttention, continuous batching, and LLM cost engineering are the concepts to understand before you buy GPUs.

Inference Cost Is Not Training Cost

One common mistake is to mix training economics and inference economics.

Training creates the model. Inference runs the model for users. The capital requirements, optimization targets, and accounting are different.

DeepSeek says V4 was pretrained on more than 32T tokens.[2] That's a lab-scale training project. Meta's Llama 3 paper reports that its 405B model was pretrained with $3.8 \times 10^{25}$ FLOPs on 15.6T text tokens, which gives a useful sense of how expensive frontier-scale pretraining can get.[14]

Most companies aren't choosing whether to train DeepSeek V4. They're choosing whether to:

  • call DeepSeek's hosted API
  • self-host V4-Flash
  • self-host V4-Pro
  • keep using closed US frontier APIs
  • route across all of the above

That's an inference decision. The cost model should focus on tokens, cache hit rate, utilization, latency target, staff time, and GPU reservations.

As of June 11, 2026, DeepSeek's official pricing page lists V4-Pro at $0.435 per 1M cache-miss input tokens and $0.87 per 1M output tokens.[4] At that price, teams can run agentic benchmarks against their own codebases before committing to infrastructure.

For low or bursty traffic, the API is usually the rational starting point. For steady, high-volume traffic with privacy or customization needs, self-hosting can make sense. The break-even point depends less on the model's benchmark score and more on whether your GPUs stay busy.
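The utilization point can be sketched with one formula: self-hosted cost per million tokens is the hourly hardware cost divided by the tokens actually served in that hour. Every number in the example call is a hypothetical placeholder (GPU rental price, GPU count, peak throughput), not a measured V4 figure, and staff time, networking, and storage are left out.

```python
def self_hosted_cost_per_million(gpu_hourly_usd: float, num_gpus: int,
                                 peak_tokens_per_sec: float,
                                 utilization: float) -> float:
    """USD per 1M served tokens for a reserved GPU fleet.

    utilization is the fraction of peak throughput actually used,
    averaged over the billing period.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * utilization * 3600
    return gpu_hourly_usd * num_gpus / tokens_per_hour * 1_000_000


# Hypothetical fleet: 4 GPUs at $3.00/hour each, 5,000 tokens/s aggregate peak.
for utilization in (1.0, 0.5, 0.1):
    cost = self_hosted_cost_per_million(3.0, 4, 5_000, utilization)
    print(f"utilization {utilization:>4.0%}: ${cost:.2f} per 1M tokens")
```

Under these made-up inputs, cost per token rises tenfold when utilization falls from 100% to 10%, which is the mechanism that makes bursty workloads favor a hosted API.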

Real Impact: Routing Becomes Default

The most practical production response to DeepSeek V4 isn't ideological.

It's routing.

A practical 2026 router makes lane defaults explicit before any provider call:

| Request type | Example | Default model lane |
| --- | --- | --- |
| Low-risk chat, summaries, classification | "Summarize these issue comments." | V4-Flash or another efficient open model |
| Medium agent steps with long context | "Find every failing test related to this migration." | V4-Flash with long context |
| Hard reasoning, multi-step coding, large refactors | "Refactor the auth service to use the new token verifier." | V4-Pro |
| Highest-risk reasoning, multimodal, or enterprise-policy flows | "Evaluate this security policy change for compliance impact." | Closed frontier API |
| Private data residency workloads | "Analyze internal incident logs on-premise." | Self-hosted open-weight model |

The pattern: classify once, send most traffic through the lowest-cost lane that passes quality, and keep a premium lane for hard or high-risk cases.
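A minimal sketch of that pattern, assuming the request has already been classified upstream. The enum members and lane labels are hypothetical names that mirror the table's rows; a real router would also need eval-driven escalation, fallbacks, and logging.

```python
from enum import Enum


class RequestType(Enum):
    LOW_RISK = "low_risk"                # chat, summaries, classification
    MEDIUM_LONG_CONTEXT = "medium"       # agent steps over long context
    HARD_REASONING = "hard"              # multi-step coding, large refactors
    HIGHEST_RISK = "highest_risk"        # high-risk, multimodal, policy flows


# Default lane per request type; labels are illustrative, not model IDs.
DEFAULT_LANES = {
    RequestType.LOW_RISK: "v4-flash",
    RequestType.MEDIUM_LONG_CONTEXT: "v4-flash-long-context",
    RequestType.HARD_REASONING: "v4-pro",
    RequestType.HIGHEST_RISK: "closed-frontier",
}


def choose_lane(request_type: RequestType, requires_on_prem: bool = False) -> str:
    """Pick the default lane; data residency overrides capability routing."""
    if requires_on_prem:
        return "self-hosted-open-weight"
    return DEFAULT_LANES[request_type]


print(choose_lane(RequestType.LOW_RISK))
print(choose_lane(RequestType.HARD_REASONING))
print(choose_lane(RequestType.LOW_RISK, requires_on_prem=True))
```

Keeping the lane table as data rather than branching logic makes it easy to change defaults when a new eval shows a cheaper lane passing, or failing, a given request type.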

Figure: Routing quadrant places chat and repo scans in V4-Flash, refactors in V4-Pro, EU data on self-hosted models, and contracts on frontier APIs.
Move right as capability demands rise. Move up when privacy, residency, or policy constraints dominate.

That's pricing pressure for any lab whose business assumes every token goes through a premium closed model. Strong closed models will still command premium pricing for hard tasks. But V4 makes it easier to avoid premium prices for easy and medium tasks.

Why Routing Matters More Than Any Single Model

DeepSeek V4 isn't the end of US AI leadership. It's a clear reminder that leadership is no longer measured only by who has the biggest closed model.

The important pieces are practical:

  • V4-Pro gives open-weight users a strong agentic coding model with 1M context.
  • V4-Flash gives builders a lower-cost long-context lane that can cover many routine agent steps.
  • The API is available now and works with OpenAI and Anthropic-style clients.
  • The old deepseek-chat and deepseek-reasoner names retire after July 24, 2026 at 15:59 UTC.
  • Self-hosting V4-Pro is a datacenter project, while V4-Flash is the more realistic first deployment target.

For US labs, the lesson is practical: if a competitor can release open weights, support 1M context, integrate with coding agents, and sell the hosted version at a fraction of premium API pricing, closed-source models need to win on more than habit. They need to win on capability, reliability, product polish, enterprise trust, and total cost for the specific workload.

That's a much harder market than the one closed labs enjoyed two years ago.

Mastery check

  1. Why can a 1.6T MoE model still be a plausible inference candidate when a dense 1.6T model would not be?
  2. Why is a 1M context window only valuable if the serving design also reduces KV-cache and long-context compute burden?
  3. When should a product route to V4-Flash, V4-Pro, or a closed model instead of picking one global default?

Key concepts

  • Total parameters and active parameters are different. Storage footprint and per-token compute are not the same question.
  • Long context matters only if memory and inference cost stay manageable enough to keep raw working traces in play.
  • Vendor benchmark tables are starting points for evaluation, not proof of production quality.
  • Open-weight models create pricing pressure even when premium closed models still lead on some hardest tasks.
  • Good production design is usually routing design, not single-model loyalty.

Evaluation rubric

  • Strong answer explains MoE routing in plain English, names active parameters, and avoids equating total size with per-token work.
  • Strong answer explains why long-context value depends on KV-cache, latency, and cost, not only the headline context number.
  • Strong answer proposes a routing policy with explicit quality and policy gates instead of saying "always use cheapest" or "always use best."

Follow-up questions

  1. If your repo agent handles mostly low-risk summaries but occasionally needs multi-file refactors, which lane should be default and what exact signal should trigger escalation?
  2. Suppose V4-Flash matches closed models on your summary eval but fails on multi-file code edits. How would you change routing without throwing away its cost advantage?
  3. If a vendor claims near-parity on agent benchmarks, what three workload-local checks would you run before moving premium traffic off the closed model?

Common pitfalls

  • Symptom: You assume V4-Pro is "too big to matter" because 1.6T looks impossible. Cause: You treated total parameters as active per-token compute. Fix: Separate storage footprint, active experts, and serving topology before judging feasibility.
  • Symptom: You treat 1M context as an automatic product win. Cause: You ignored KV-cache cost, latency, and memory pressure. Fix: Benchmark at the context length your agent really uses, not only at short prompts.
  • Symptom: You rewrite your whole stack around a vendor benchmark table. Cause: You treated self-reported numbers as deployment proof. Fix: Reproduce results on your own workload and measure cost, quality, and failure modes together.
  • Symptom: You route every request through the premium lane "to be safe." Cause: Your product architecture still defaults to vendor habit instead of measured escalation. Fix: Make the cheap lane the baseline and define explicit policy or quality triggers for stepping up.

If you can answer the mastery check cleanly, you're ready to go deeper on KV cache and PagedAttention, continuous batching, and LLM cost engineering.

References

[1] DeepSeek V4 Preview Release. DeepSeek, 2026.
[2] DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. DeepSeek-AI, 2026.
[3] Integrate with AI Tools. DeepSeek, 2026.
[4] Models and Pricing. DeepSeek, 2026.
[5] OpenAI API Pricing. OpenAI, 2026.
[6] GPT-5.4 Model. OpenAI, 2026.
[7] Anthropic Model Pricing. Anthropic, 2026.
[8] Context windows. Anthropic, 2026.
[9] Gemini API Pricing. Google, 2026.
[10] Build with DeepSeek V4 Using NVIDIA Blackwell and GPU-Accelerated Endpoints. NVIDIA, 2026.
[11] DeepSeek-V4. SGLang, 2026.
[12] DeepSeek-V4-Pro vLLM Recipe. vLLM, 2026.
[13] NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick. NVIDIA, 2025.
[14] The Llama 3 Herd of Models. Dubey, A., et al., 2024. arXiv preprint.