Learn Recursive Language Models (RLMs): keep long context in a programmable environment, delegate targeted sub-calls, and release the design only after measured quality, cost, and safety checks.
In the last chapter, DSPy treated prompt design as a program you can optimize. Recursive Language Models (RLMs) keep that programming instinct, but move it into inference itself: a model writes code, stores long input outside its active context, and may issue targeted plain-LM or child-RLM calls.[1]
Imagine trying to find one carrier exception across ten thousand pages of merchant policies, tracking events, and fulfillment runbooks. A standard call faces a bad trade-off: push as much text as possible into its context window and risk losing focus, or summarize aggressively and risk dropping the clause you need. RLM is an inference-time control pattern for contexts that are too large or too information-dense for one direct pass.
RLM isn't a bigger model or a new attention layer. It's a different inference interface: the model sees compact state, writes code, and delegates work through controlled sub-calls.
The paper and the accompanying blog post both frame this as inference-time scaling for long-context reasoning, not as a new foundation-model architecture.[1][2]
In this article, "RLM" means the modern long-context inference scaffold introduced by Zhang, Kraska, and Khattab.[1]
Picture a dispatcher with a tiny workbench. A truck arrives with one thousand carrier-policy binders, and the dispatcher must answer questions about all of them.
A standard LLM is like trying to stack every binder on that tiny workbench at once. The pile overflows. Details from the first binders bleed into the last binders. This is what practitioners call context rot (the degradation of attention quality and fact recall as input length grows, even within the advertised window).[3] The RLM paper adopts the same term to motivate its design.[1]
One possible RLM trajectory is like giving the dispatcher a work log and a controlled intake window. It can inspect a few binders, search for carrier names, batch related pages for review, and save compact findings. A fixed summarize-every-binder tree is a useful baseline, but an RLM's distinguishing feature is that the model can choose the program while it works.
The work log is the external execution environment. Requesting slices and combining their results is one recursive decomposition; an RLM lets the controller choose and revise that decomposition in code. The model doesn't read a thousand binders at once, but the work log preserves the thread.
It's tempting to think long-context performance is only about maximum window size. The paper argues that view is incomplete.[1]
Two prompts can have the same length and very different reasoning difficulty. A single needle-in-a-haystack lookup is usually easy. Aggregating semantics over most lines is harder. Aggregating over most pairs of lines is much harder.
Think of a 10,000-page fulfillment manual. One question asks: "What is the return window for electronics?" That's a single fact hidden in one page. A model only needs to find that needle. Another question asks: "Summarize every carrier policy change in 2024." The answer is spread across thousands of pages, so the model must aggregate roughly 10,000 facts. A third question asks: "List every pair of carriers whose liability clauses conflict." Now the model must compare almost every page with almost every other page, creating about 50 million pairs to check.
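A useful mental model is the paper's approximate information complexity categories as context length $n$ grows:

- Single needle lookup: roughly constant answer-relevant work.
- Aggregation over most lines: roughly linear work.
- Aggregation over most pairs of lines: roughly quadratic work.

We write this with Big-Theta notation:

$$
W_{\text{needle}}(n) = \Theta(1), \qquad W_{\text{lines}}(n) = \Theta(n), \qquad W_{\text{pairs}}(n) = \Theta(n^2)
$$

where $W$ describes answer-relevant processing work and the Greek letter $\Theta$ describes its growth rate.[1] This isn't a runtime guarantee: retrieval choices, batching, and model errors still determine real latency and quality.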
Turn that scaling intuition into concrete counts before choosing a controller:
```python
def unordered_pairs(items: int) -> int:
    return items * (items - 1) // 2

small = unordered_pairs(5_000)
large = unordered_pairs(10_000)

print(f"pairs_at_5000={small:,}")
print(f"pairs_at_10000={large:,}")
print(f"doubling_ratio={large / small:.2f}x")
```

```text
pairs_at_5000=12,497,500
pairs_at_10000=49,995,000
doubling_ratio=4.00x
```
As complexity rises, direct model performance degrades faster, even before you hit hard context limits.[1] The important point is that throwing a longer context window at a task doesn't solve the hard part. Performance degrades because the model must track a rapidly growing web of relationships, not only because it runs out of tokens.
This is where RLM fits. It tries to keep neural context focused while moving large-scale symbolic manipulation into an external execution environment.
Key insight: Raw token count isn't enough to choose a long-context strategy. A pairwise carrier-comparison task can overwhelm a direct call even when all text fits, so compare direct, REPL-only, and recursive paths on held-out tasks.
Common mistake: Treating maximum context length as maximum reasoning capability. "My model supports 1M tokens" doesn't mean "my model can reason well over any 1M-token task."
Start with a base model $M$ with context limit $K$. Given a prompt string $P$ where $|P| \gg K$, an RLM scaffold initializes an environment for the run (typically a Python REPL, which stands for Read-Eval-Print Loop) and stores $P$ as data inside that environment.[1]
The root model doesn't read all of $P$ directly. It receives compact metadata (length, chunk structure, maybe a prefix) and writes code to inspect, transform, and recurse.
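To make the "compact metadata" idea concrete, here is a minimal sketch of what an environment might hand to the root model instead of the prompt itself. The function name describe_prompt, the field names, and the character-based chunking are illustrative assumptions, not the paper's or the runtime's interface.

```python
def describe_prompt(
    prompt: str, chunk_chars: int = 1000, prefix_chars: int = 40
) -> dict[str, object]:
    """Return compact metadata about a stored prompt instead of the prompt itself."""
    # Ceiling division: number of fixed-size chunks needed to cover the prompt.
    chunk_count = -(-len(prompt) // chunk_chars)
    return {
        "length_chars": len(prompt),
        "chunk_chars": chunk_chars,
        "chunk_count": chunk_count,
        "prefix": prompt[:prefix_chars],
    }

# The full prompt stays in the environment; only the metadata is shown to the model.
stored_prompt = "Carrier policy binder. " * 500
metadata = describe_prompt(stored_prompt)

print(metadata)
print(f"metadata_is_smaller={len(str(metadata)) < len(stored_prompt)}")
```

A real scaffold would count tokens rather than characters and would likely expose document boundaries as well.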
One subtlety matters. The paper distinguishes depth 0 (REPL access without sub-calls), depth 1 (sub-calling LMs), and depth >1 (sub-calling RLMs).[1] In the current open-source runtime, llm_query() and llm_query_batched() issue plain LM completions, while rlm_query() and rlm_query_batched() spawn child RLM work when recursion is enabled.[4] Depth, sub-call count, tokens, cost, and time are budgets to measure, not evidence that more recursion is better.
At a high level, the RLM architecture coordinates a root planning loop with recursive execution. The root model receives metadata and decides on a sequence of actions, which are executed in the external environment. This environment can recursively call sub-models to process specific slices of context, as illustrated below:
This scaffold aims for three capabilities beyond a single direct call:
These are design capabilities, not automatic correctness guarantees. For a year of warehouse transaction logs, the scaffold can keep raw rows outside model context while extracting candidate negative-inventory adjustments, but a release check still has to test recall against labeled cases.
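The warehouse example can be sketched in a few lines: a symbolic filter picks candidates inside the environment, and a recall check compares them with labels. The rows, the field names, and the labeled set below are made-up fixtures for illustration only.

```python
# Hypothetical transaction rows; in a real run these stay in the environment.
rows = [
    {"id": "t1", "sku": "A-100", "adjustment": -4},
    {"id": "t2", "sku": "B-200", "adjustment": 12},
    {"id": "t3", "sku": "C-300", "adjustment": -1},
    {"id": "t4", "sku": "D-400", "adjustment": 0},
]
labeled_negative_ids = {"t1", "t3"}  # hypothetical ground-truth labels

# The filter runs as plain code, so no raw row enters model context.
candidate_ids = {row["id"] for row in rows if row["adjustment"] < 0}

# Recall: fraction of labeled cases that the extraction step recovered.
recall = len(candidate_ids & labeled_negative_ids) / len(labeled_negative_ids)

print(f"candidates={sorted(candidate_ids)}")
print(f"recall={recall:.2f}")
```

The filter is trivially correct on this fixture; the release check matters once the extraction step is model-written and the data is messier.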
| Feature | Standard Context Window | Recursive Language Model (RLM) |
|---|---|---|
| Memory constraint | Bound by GPU VRAM (Video RAM) and KV (Key-Value) cache limits | External environment and budgets; no single-call input-window requirement |
| Root model context | All tokens must fit within window | Only metadata + compact summaries within $K$; history must be managed via selective summarization |
| Compute allocation | One attention pass over provided tokens | Controller may inspect, transform, or delegate selected slices |
| Information density | May miss scattered relationships | Can improve dense aggregation when decomposition works |
| Failure mode | Missing or diluted evidence | Missing evidence plus protocol, budget, or execution failures |
The paper highlights a subtle but important boundary between RLM and traditional tool-use agents. Many existing agents have tools, sub-agents, or summarization capabilities, but they often still keep user prompt handling tied to the main model context. Consequently, they eventually inherit context-window bottlenecks.[1]
RLM's design goal is to keep the prompt external and make prompt processing itself programmable. It turns the prompt from a static input into a dynamic, queryable database.
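As a rough illustration of a "queryable" prompt, the sketch below wraps stored text in a small helper that supports keyword search and windowed reads. ContextStore, find, and window are hypothetical names chosen for this example; the actual runtime exposes the prompt as a REPL variable that model-written code can manipulate freely.

```python
from dataclasses import dataclass

@dataclass
class ContextStore:
    """Hypothetical holder for a prompt that stays outside model context."""
    text: str

    def find(self, keyword: str, limit: int = 5) -> list[int]:
        """Return start offsets of up to `limit` case-insensitive matches."""
        haystack, needle = self.text.lower(), keyword.lower()
        offsets: list[int] = []
        start = haystack.find(needle)
        while start != -1 and len(offsets) < limit:
            offsets.append(start)
            start = haystack.find(needle, start + 1)
        return offsets

    def window(self, offset: int, radius: int = 30) -> str:
        """Return a small slice around an offset for targeted inspection."""
        return self.text[max(0, offset - radius): offset + radius]

store = ContextStore(
    "Carrier A ships ground. Carrier B has a hazmat exception. Carrier C ships air."
)
hits = store.find("exception")

print(f"match_count={len(hits)}")
print(f"window={store.window(hits[0])!r}")
```

The model only ever sees the match count and the short window, which is the point of keeping the prompt external.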
Before letting a model write its own inspection program, build a fixed baseline. Suppose you have a 40,000-token carrier compliance manual, and your model's context limit is 4,096 tokens. You want a one-page summary of every policy that affects returns.
A static split-and-merge summarizer can keep every leaf call within the window. It is not yet an RLM: no root model is inspecting a REPL, choosing searches, or changing decomposition after seeing intermediate evidence. It does teach two costs an RLM must control: sub-call growth and information lost during merging.
The runnable toy version below uses character counts and a fake LM so you can test recursion logic without a tokenizer or API keys. A real implementation must split on token and document boundaries, then measure whether key evidence survives the merge.
```python
from dataclasses import dataclass

@dataclass
class FakeLLM:
    calls: int = 0

    def call(self, prompt: str) -> str:
        self.calls += 1
        return f"summary-{self.calls}: {prompt[:24]}"

def split_text(text: str) -> tuple[str, str]:
    midpoint = len(text) // 2
    return text[:midpoint], text[midpoint:]

def recursive_summarize(text: str, llm: FakeLLM, limit_chars: int = 4096) -> str:
    if len(text) <= limit_chars:
        return llm.call(f"Summarize this policy chunk: {text}")

    left, right = split_text(text)
    left_summary = recursive_summarize(left, llm, limit_chars)
    right_summary = recursive_summarize(right, llm, limit_chars)

    return llm.call(f"Combine these summaries: {left_summary} | {right_summary}")

manual = "returns policy " * 400
llm = FakeLLM()
summary = recursive_summarize(manual, llm, limit_chars=512)

print(f"final summary: {summary}")
print(f"model calls: {llm.calls}")
print(f"summary_has_prefix={summary.startswith('summary-')}")
print(f"call_count_expected={llm.calls == 31}")
```

```text
final summary: summary-31: Combine these summaries:
model calls: 31
summary_has_prefix=True
call_count_expected=True
```

For a token-aware version of the same binary strategy, 40,000 tokens split under a 4,096-token limit produce 16 leaves of about 2,500 tokens each. The toy code gets the same tree shape using a smaller character-count fixture.
The recursion creates a binary tree of model calls. At the bottom are 16 leaf calls, each reading about 2,500 tokens. Above them are 8 merge calls, then 4, then 2, then 1 final merge at the root. The diagram below shows the shape of the tree with representative leaves rather than drawing all 16 leaf calls.
Total calls: 31. Total source tokens read by leaf calls: 40,000. No single leaf call sees more than roughly 2,500 source tokens. The baseline doesn't skip a chunk, but merges can still discard the one clause a downstream answer needs. Measure evidence recall, not only whether recursion completes.
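The call counts above follow from simple halving arithmetic, which the sketch below reproduces. The helper name binary_tree_calls is an assumption for this example, and it assumes pieces halve evenly; with uneven halves it gives an upper bound on the call count.

```python
import math

def binary_tree_calls(total_tokens: int, limit_tokens: int) -> tuple[int, int, int]:
    """Return (leaf_calls, merge_calls, tokens_per_leaf) for repeated halving."""
    depth = 0
    size = total_tokens
    # Halve until one piece fits in a single call, mirroring the recursive split.
    while size > limit_tokens:
        size = math.ceil(size / 2)
        depth += 1
    leaves = 2 ** depth
    merges = leaves - 1  # a full binary tree has one fewer internal node than leaves
    return leaves, merges, size

leaves, merges, per_leaf = binary_tree_calls(40_000, 4_096)

print(f"leaf_calls={leaves}")
print(f"merge_calls={merges}")
print(f"total_calls={leaves + merges}")
print(f"tokens_per_leaf={per_leaf}")
```

Running the same function on other corpus sizes is a quick way to estimate sub-call growth before paying for a real run.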
A small release check can catch a lossy merge even when every chunk was processed:
```python
source_chunks = [
    "Returns require a label. Refund cap is USD 75.",
    "Damaged-item exception allows carrier reimbursement.",
]
candidate_summary = "Returns require a label. Damaged-item exception allows reimbursement."
required_terms = {"refund cap", "damaged-item exception"}

available = {
    term for term in required_terms
    if any(term in chunk.lower() for chunk in source_chunks)
}
retained = {term for term in available if term in candidate_summary.lower()}
missing = sorted(available - retained)

print(f"source_evidence={sorted(available)}")
print(f"missing_from_summary={missing}")
print(f"release_allowed={not missing}")
```

```text
source_evidence=['damaged-item exception', 'refund cap']
missing_from_summary=['refund cap']
release_allowed=False
```

The base case is what stops the loop. When len(text) <= limit_chars, the function stops splitting and asks the model to summarize directly. Without that check, a recursive summarizer can keep dividing a single sentence into smaller and smaller pieces until it burns through the API budget or hits a recursion depth limit.
Forgetting the base case in recursive summarization is the fastest way to turn a cheap summarization job into an expensive infinite loop.
Here is a conceptual RLM loop. It takes a root model, a REPL (Read-Eval-Print Loop) environment, and the raw user prompt, and returns the final synthesized output. The loop passes compact metadata to the model at each turn and executes the code it generates until the final answer is produced or the iteration budget is exhausted.
```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class RLMState:
    history: list[dict[str, str]]

@dataclass
class ExecResult:
    stdout: str
    stderr: str
    final_answer: str | None = None

class RootLM(Protocol):
    def generate(self, state: dict[str, object]) -> str: ...

class Repl(Protocol):
    def load_context(self, prompt: str) -> None: ...
    def describe_context(self) -> dict[str, object]: ...
    def execute(self, code: str) -> ExecResult: ...

def run_rlm(root_lm: RootLM, repl: Repl, prompt: str, max_iters: int = 64) -> str:
    repl.load_context(prompt)
    state = RLMState(history=[])

    for step in range(max_iters):
        metadata = repl.describe_context()  # length, prefix, chunk stats
        code = root_lm.generate(
            {
                "iteration": step,
                "metadata": metadata,
                "history": state.history,
            }
        )
        exec_result = repl.execute(code)

        # Keep root context compact by storing previews, not full dumps.
        state.history.append(
            {
                "stdout_preview": exec_result.stdout[:200],
                "stderr_preview": exec_result.stderr[:200],
            }
        )

        if exec_result.final_answer is not None:
            return exec_result.final_answer

    raise RuntimeError("RLM hit iteration budget without producing a final answer")

class FakeRootLM:
    def generate(self, state: dict[str, object]) -> str:
        return "answer['content'] = 'summarized answer'\nanswer['ready'] = True"

class FakeRepl:
    def __init__(self) -> None:
        self.context = ""

    def load_context(self, prompt: str) -> None:
        self.context = prompt

    def describe_context(self) -> dict[str, object]:
        return {"length": len(self.context), "prefix": self.context[:20]}

    def execute(self, code: str) -> ExecResult:
        answer = {"content": "", "ready": False}
        if "answer['ready'] = True" in code:
            answer["content"] = "summarized answer"
            answer["ready"] = True
        final = str(answer["content"]) if answer["ready"] else None
        return ExecResult(stdout="", stderr="", final_answer=final)

answer = run_rlm(FakeRootLM(), FakeRepl(), "large prompt")
print(f"final answer: {answer}")
print(f"answer_matches_expected={answer == 'summarized answer'}")
```

```text
final answer: summarized answer
answer_matches_expected=True
```

This simplified loop uses the current runtime's completion shape: executed code fills answer["content"] and sets answer["ready"] = True.[4] The paper's algorithm uses a Final environment variable, and its experimental prompt also describes FINAL(...) and FINAL_VAR(...) tags.[1] The durable idea is explicit completion state; when implementing a system, follow the interface documented for the version you deploy.
Don't assume an arbitrary live REPL can migrate between workers. It may contain open connections, non-serializable objects, or generated code that you shouldn't replay blindly.
The current open-source runtime supports trajectory metadata and JSONL logging through RLMLogger, which is useful for replay and evaluation.[4] For a distributed service, persist the input corpus or its storage pointer, compact controller metadata, budgets consumed, and logged actions. Reconstruct only approved state on retry; don't treat a Python call stack as a durable checkpoint.
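The persistence advice can be sketched with plain JSON Lines, independent of any particular logger. This example does not use RLMLogger; the helpers append_step and load_steps and every field name in the records are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

def append_step(log_path: Path, record: dict[str, object]) -> None:
    """Append one trajectory step as a single JSON line."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")

def load_steps(log_path: Path) -> list[dict[str, object]]:
    """Read logged steps back for replay or evaluation."""
    with log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "trajectory.jsonl"
    # Store a pointer to the corpus, not the corpus itself (hypothetical fields).
    append_step(log_path, {"step": 0, "action": "search", "corpus_pointer": "corpus-001", "spent_usd": 0.05})
    append_step(log_path, {"step": 1, "action": "finalize", "corpus_pointer": "corpus-001", "spent_usd": 0.09})
    steps = load_steps(log_path)

print(f"steps_logged={len(steps)}")
print(f"last_action={steps[-1]['action']}")
```

Each record is self-describing data, so a retry can rebuild approved state from the log without touching a live interpreter.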
Notice how the RLMState only appends compact previews rather than the full output of every command. If the code block executes a complex search over thousands of documents, the root model should only see a condensed result or status message. This selective memory keeps the root model's context window focused on orchestration.
The loop also uses an explicit max_iters budget. Because the model writes code that runs automatically, it can enter an infinite loop of failed retries. The budget ensures that the system fails fast and returns an error rather than endlessly consuming API (Application Programming Interface) credits and compute.
A cost budget must also stop a trajectory before a useful-looking plan becomes an expensive run:
```python
def run_under_budget(proposed_costs: list[float], max_cost: float) -> tuple[int, float]:
    spent = 0.0
    completed = 0
    for cost in proposed_costs:
        if spent + cost > max_cost:
            break
        spent += cost
        completed += 1
    return completed, spent

completed, spent = run_under_budget([0.05, 0.08, 0.09, 0.12], max_cost=0.25)
print(f"completed_sub_calls={completed}")
print(f"spent_usd={spent:.2f}")
print(f"budget_blocked_next={completed < 4}")
```

```text
completed_sub_calls=3
spent_usd=0.22
budget_blocked_next=True
```

Finally, final answers should be signaled through explicit state rather than inferred from arbitrary printed text. In the current runtime, that state is the answer dictionary; its environment surfaces answer["content"] after answer["ready"] becomes true.[4]
The strongest evidence in the paper isn't "it works on one benchmark." It's cross-task behavior on tasks with very different information complexity and scale.[1] Table 1 covers four tasks: deep research (BrowseComp+, 6M to 11M tokens), code-repository understanding (LongBench-v2 CodeQA, 23K to 4.2M tokens), information aggregation (OOLONG), and a synthetic pairwise variant (OOLONG-Pairs) that stresses quadratic, pairwise-heavy reasoning over scattered facts. A separate scaling experiment adds needle retrieval (S-NIAH) to compare degradation as inputs grow.
The paper's own headline is a set of median gains for RLM(GPT-5) against strong scaffolds: about 26% over a compaction agent, 130% over CodeAct with sub-calls, and 13% over Claude Code.[1] Treat these as the authors' framing on their benchmark mix, not as universal speedups.
From Table 1 in the paper, using GPT-5 as the closed model and Qwen3-Coder-480B-A35B as the open model:[1]
| Benchmark slice | Base model | Strongest comparator | RLM result | Takeaway |
|---|---|---|---|---|
| BrowseComp+ (GPT-5) | 0.0* | 70.5 (Compaction agent) | 91.3 at depth 1 | On this oversized corpus, RLM outscored compaction; direct GPT-5 hit context limits. |
| OOLONG-Pairs (GPT-5) | 0.1 | 24.7 (CodeAct + BM25) | 58.0 at depth 1, 76.0 at depth 3 | In this pairwise-heavy benchmark, added recursion helped. |
| OOLONG-Pairs (Qwen3-Coder-480B-A35B) | 0.1 | 17.3 (RLM, no sub-calls) | 23.1 at depth 1 | Plain REPL use was not sufficient for the best measured result. |
| CodeQA (Qwen3-Coder-480B-A35B) | 20.0* | 66.0 (RLM, no sub-calls) | 56.0 | REPL access helps, but recursive fanout can add unnecessary overhead on simpler code tasks. |
The paper rounds non-zero scores to at least one decimal place, so you shouldn't over-read fake precision from this table.[1] The cost story is also more mixed than "RLM is cheaper." On BrowseComp+, the direct GPT-5 path hits context limits, so there isn't a meaningful successful direct-call price in Table 1. On tasks that do fit, RLM can be cheaper or more expensive depending on the task. For example, GPT-5 + RLM depth 1 is slightly cheaper than the base model on CodeQA (\$0.11 vs. \$0.13 average), but more expensive on OOLONG-Pairs (\$0.33 vs. \$0.16 average).[1]
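A quick ratio makes the mixed cost picture easier to read. The snippet only restates the four average costs quoted above; the dictionary layout is an arbitrary choice for this example.

```python
# Average API cost per query in USD, as quoted above from Table 1.
reported = {
    "CodeQA": {"base": 0.13, "rlm_depth1": 0.11},
    "OOLONG-Pairs": {"base": 0.16, "rlm_depth1": 0.33},
}

# A ratio below 1 means RLM was cheaper than the base model on that task.
ratios = {
    task: cost["rlm_depth1"] / cost["base"] for task, cost in reported.items()
}

for task, ratio in ratios.items():
    print(f"{task}: rlm_over_base={ratio:.2f}x")
```

On these averages RLM costs about 0.85x the base model on CodeQA and about 2.06x on OOLONG-Pairs, so the direction of the cost change depends on the task.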
One subtle benchmark detail matters here: for the GPT-5 setup, the root controller is GPT-5, while recursive sub-calls use GPT-5-mini as a capability-cost trade-off.[1]
That last point about CodeQA is telling. The RLM with sub-calls (56.0) scores lower than the no-sub-calls variant (66.0), while the vanilla baseline is only 20.0. The REPL interface itself drives most of the improvement, and recursive sub-calls add overhead on tasks where targeted delegation doesn't pay off.
The paper focuses much more on quality and token cost than on wall-clock latency, so you shouldn't read exact response-time numbers into the results.[1] Still, the systems implication is clear: an RLM adds orchestration overhead because every successful run may require several REPL turns plus recursive sub-calls. The repository specifically notes that Prime Sandboxes are still in beta and can be slow, which reinforces the latency trade-off in practice.[4]
For production systems, the decision isn't "use RLM everywhere." It's "use RLM when the task complexity justifies the slower control loop." Simple retrieval-style tasks often don't justify the extra orchestration.
Keep a no-subcall ablation in your eval matrix. It is often the right baseline for latency-sensitive workloads.
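One way to keep the ablation from being forgotten is to generate the eval matrix rather than hand-write it. The route and task-family names below are placeholders for whatever your own harness uses.

```python
from itertools import product

# Placeholder labels; substitute the routes and task families you actually evaluate.
routes = ["direct", "repl_only", "recursive_depth1"]
task_families = ["needle_lookup", "linear_aggregation", "pairwise_comparison"]

# Every route is paired with every task family, so no ablation cell is skipped.
eval_matrix = [
    {"route": route, "task_family": family}
    for route, family in product(routes, task_families)
]
ablation_cells = [cell for cell in eval_matrix if cell["route"] == "repl_only"]

print(f"total_cells={len(eval_matrix)}")
print(f"no_subcall_cells={len(ablation_cells)}")
```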
Table 1 reports average API cost with standard deviation, and some settings show wide spread across runs.[1] That doesn't give a full tail-latency distribution, but it does show that trajectory length and decomposition quality can swing spend materially. Because the model runs in an external execution environment, poor decomposition decisions can lead to unnecessary sub-calls that inflate API usage.
A realistic objective isn't "cheaper per call." It's a better quality-cost frontier under explicit budgets and tail controls. Configure execution timeouts and recursion depth limits to prevent runaway spending while capturing the quality upside on hard tasks.
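Depth and time limits are straightforward to enforce in code. The sketch below is a hypothetical guard, not part of any RLM runtime: RunBudget and descend are names invented for this example, and the toy task deliberately never stops on its own.

```python
import time
from dataclasses import dataclass, field

@dataclass
class RunBudget:
    """Hypothetical per-run limits on recursion depth and wall-clock time."""
    max_depth: int
    max_seconds: float
    started_at: float = field(default_factory=time.monotonic)

    def check(self, depth: int) -> None:
        """Raise as soon as either limit is exceeded."""
        if depth > self.max_depth:
            raise RuntimeError(f"recursion depth {depth} exceeds limit {self.max_depth}")
        if time.monotonic() - self.started_at > self.max_seconds:
            raise RuntimeError("wall-clock budget exhausted")

def descend(depth: int, budget: RunBudget) -> int:
    """Toy recursive task that always wants to go one level deeper."""
    budget.check(depth)
    return descend(depth + 1, budget)

budget = RunBudget(max_depth=3, max_seconds=5.0)
stopped_message = ""
try:
    descend(0, budget)
except RuntimeError as error:
    stopped_message = str(error)

print(f"stopped={stopped_message}")
```

Checking the budget at the top of every recursive step means a runaway plan fails with a clear error instead of an open-ended bill.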
The authors don't stop at wrapping frontier models. They also post-train a smaller controller model for the RLM role.[1]
The paper's headline result is RLM-Qwen3-8B: a post-trained Qwen3-8B controller that learns how to operate the scaffold more effectively.[1] The important point is architectural. The model isn't learning a new attention mechanism. It's learning the protocol of the recursive environment: inspect state, decide when to recurse, and terminate cleanly with the final-output interface.
At a high level, the process looks like this:
The paper reports a 28.3% median improvement for RLM-Qwen3-8B over base Qwen3-8B across four evaluation tasks and says it approaches vanilla GPT-5 on three of them.[1] Treat that as early evidence of a learnable controller skill, not as proof of a fully mature training recipe.
The result suggests that a meaningful chunk of RLM performance comes from learning how to operate the scaffold itself, not only from having a stronger base model underneath it.
The paper does disclose the broad recipe: RLM-Qwen3-8B is fine-tuned from 1,000 filtered trajectories generated by Qwen3-Coder-480B-A35B acting as an RLM on LongBenchPro tasks.[1] What it doesn't provide in the main text is a full reproduction cookbook with all filtering, prompting, and optimization details. So the safe takeaway isn't "here is the definitive training pipeline." The safe takeaway is narrower: recursive scaffold behavior appears trainable, not purely hand-engineered.[1]
Even after post-training, the controller still depends on the same scaffold constraints as the frontier-model version: recursion budgets, safe execution, and an explicit final-answer protocol. In other words, post-training improves the operator, but it doesn't remove the need for infrastructure guardrails.
The open-source repository makes the architecture concrete. The current README lists local, ipython, docker, modal, prime, daytona, and e2b execution environments, plus trajectory logging for replay and inspection.[4]
Don't stream huge REPL dumps back into the root model's history. When a recursive sub-call finishes processing a large chunk of text, it should return a highly condensed summary or a pointer to a saved variable. If the root model needs more detail, it can write another targeted query to inspect that specific variable. Keeping root-model history small helps preserve reasoning quality and avoids recreating the context-window bottleneck that RLM is meant to address.
Context-window bottlenecks occur when the root model's history fills up with irrelevant log lines, intermediate reasoning steps, or large data payloads. Because attention mechanisms compute relationships across all tokens in the window, flooding the context with noise dilutes the model's focus on its primary objective. As the sequence length grows, the model becomes increasingly likely to hallucinate or forget the original instructions.[1]
Instead of passing raw data back and forth, treat the REPL environment as a database and the root model as a query engine. Return only high-level status updates (e.g., "Successfully extracted 15 relevant clauses and stored them in the clauses array") and let the model decide if it needs to read the array contents. Separating control flow from data storage lets an RLM work over millions of source tokens without pasting them all into root-model history.
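The status-plus-pointer pattern might look like the sketch below. The environment dictionary stands in for REPL variables, and store_and_report and peek are hypothetical helper names.

```python
# Stand-in for variables living inside the REPL environment.
environment: dict[str, list[str]] = {}

def store_and_report(name: str, values: list[str]) -> str:
    """Save a large result in the environment and return only a short status."""
    environment[name] = values
    return f"Stored {len(values)} items in variable {name}"

def peek(name: str, count: int = 2) -> list[str]:
    """Let the controller inspect a few stored items on demand."""
    return environment[name][:count]

clauses = [f"clause {index}: carrier liability text" for index in range(15)]
status = store_and_report("clauses", clauses)

print(status)
print(f"peek={peek('clauses')}")
```

Only the status line and an optional short peek reach root-model history; the 15 clauses remain in the environment.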
Sub-call explosion is the fastest way to lose latency and cost control in an RLM system. If the model makes a separate recursive call for every tracking event or SKU line in a 10,000-row warehouse manifest, both execution time and API costs can rise quickly without measured quality gains.
At minimum, your controller prompt and routing policy should discourage one-call-per-sentence behavior. Table 1 gives the intuition: the no-sub-call ablation already beats full RLM on Qwen3-Coder CodeQA, so recursive fanout only pays when it adds real task-level signal.[1] A simple cost model illustrates this risk:
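$$
C_{\text{total}} \approx C_{\text{root}} + k \cdot \bar{c}_{\text{sub}}
$$

where $k$ is the sub-call count, $C_{\text{root}}$ is the cost of the root loop, and $\bar{c}_{\text{sub}}$ is the average cost per sub-call. Every additional sub-call adds cost. Latency also grows when calls are sequential, while batching independent calls can overlap part of that wait. If your quality metric remains flat while $k$ keeps rising, your recursion policy is broken and needs tuning.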
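The warning sign of flat quality with rising fanout can be checked mechanically. In this sketch the helper names, the 0.01 gain threshold, and the run records are all illustrative assumptions.

```python
def total_cost(root_cost: float, sub_call_costs: list[float]) -> float:
    """Root-loop cost plus the cost of every sub-call."""
    return root_cost + sum(sub_call_costs)

def fanout_is_wasteful(runs: list[dict[str, float]], min_quality_gain: float = 0.01) -> bool:
    """True when the run with the most sub-calls is not measurably better
    than the run with the fewest."""
    ordered = sorted(runs, key=lambda run: run["sub_calls"])
    return ordered[-1]["quality"] - ordered[0]["quality"] < min_quality_gain

# Hypothetical measurements from three recursion policies on the same eval set.
runs = [
    {"sub_calls": 4, "quality": 0.91},
    {"sub_calls": 40, "quality": 0.91},
    {"sub_calls": 400, "quality": 0.90},
]

print(f"cost_with_4_sub_calls={total_cost(0.10, [0.02] * 4):.2f}")
print(f"cost_with_400_sub_calls={total_cost(0.10, [0.02] * 400):.2f}")
print(f"fanout_is_wasteful={fanout_is_wasteful(runs)}")
```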
Here is a concrete example the root model might emit in the REPL. Each clause needs one plain classification, so the right primitive is llm_query() rather than a child RLM. The batched form uses llm_query_batched() for independent requests; reserve rlm_query() for subtasks that need their own REPL and multiple turns.[4]
```python
def llm_query(prompt: str) -> str:
    return f"analysis: {prompt[:20]}"

def llm_query_batched(prompts: list[str]) -> list[str]:
    return [llm_query(prompt) for prompt in prompts]

def analyze_unbatched(clauses: list[str]) -> list[str]:
    results = []
    for clause in clauses:
        res = llm_query(f"Analyze clause: {clause}")
        results.append(res)
    return results

def analyze_batched(clauses: list[str]) -> list[str]:
    prompts = [f"Analyze clause: {clause}" for clause in clauses]
    return llm_query_batched(prompts)

clauses = ["late delivery credit", "damaged item replacement"]
unbatched = analyze_unbatched(clauses)
batched = analyze_batched(clauses)

print(f"unbatched: {unbatched}")
print(f"batched: {batched}")
print(f"same results: {unbatched == batched}")
print(f"batch_count={len(batched)}")
```

```text
unbatched: ['analysis: Analyze clause: late', 'analysis: Analyze clause: dama']
batched: ['analysis: Analyze clause: late', 'analysis: Analyze clause: dama']
same results: True
batch_count=2
```

Because the interface relies on explicit final-answer signaling, output extraction is protocol-sensitive. In the paper, the algorithm terminates when the REPL sets Final; in the current runtime, code sets answer["content"] and flips answer["ready"] to true.[1][4]
Treat the final-output protocol as hard infrastructure rather than soft prompt guidance. Validate the documented completion signal, log violations, and treat stdout-only output as a failed or partial trajectory rather than silently promoting it to a release answer.
This contract test rejects a printed answer unless the explicit state is ready:
```python
def extract_final(stdout: str, answer: dict[str, object]) -> str:
    if not answer.get("ready"):
        raise ValueError("missing explicit final-answer signal")
    return str(answer["content"])

try:
    extract_final("refund cap is USD 75", {"content": "", "ready": False})
except ValueError as error:
    print(f"stdout_only_rejected={error}")

accepted = extract_final("", {"content": "refund cap is USD 75", "ready": True})
print(f"explicit_answer={accepted}")
```

```text
stdout_only_rejected=missing explicit final-answer signal
explicit_answer=refund cap is USD 75
```

The official repository documentation notes that local, non-isolated execution is convenient for initial experimentation, but it isn't suitable for production settings.[4] Because RLMs operate by dynamically executing code generated by the language model, they are fundamentally vulnerable to code injection or unintended side effects if the environment isn't tightly constrained.
When user-controlled input or untrusted documents are processed, the model could be tricked into generating malicious commands. Therefore, production deployments must run the execution environment in a secure sandbox. A properly configured sandbox can block host filesystem access, restrict outbound networking, and cap resource usage.
The repository documents Docker execution and cloud sandboxes such as Modal, Prime, Daytona, and E2B.[4] Choose an environment whose configured isolation policy meets your workload's threat model. If you build your own sandbox stack, two common design points are:
Shipping RLM with host-process execution on untrusted workloads is a serious security bug. Sandboxing is mandatory when user-controlled content can influence code generation.
Make that rule executable in deployment configuration:
```python
def environment_allowed(
    environment: str,
    untrusted_input: bool,
    approved_sandboxes: set[str],
) -> bool:
    return not untrusted_input or environment in approved_sandboxes

approved = {"modal"}  # organization-reviewed configurations only
print(f"local_for_uploaded_docs={environment_allowed('local', True, approved)}")
print(f"modal_for_uploaded_docs={environment_allowed('modal', True, approved)}")
print(f"local_for_private_fixture={environment_allowed('local', False, approved)}")
```

```text
local_for_uploaded_docs=False
modal_for_uploaded_docs=True
local_for_private_fixture=True
```

Not every query needs RLM's overhead. Build a lightweight routing layer that estimates task complexity before committing to a recursive pipeline:
If a run spends more than \$X in API credits without progress (measured by the summary delta between turns), terminate and fall back to a summary-based approach.

For example, a single "Where is order #48291?" lookup is a direct-path candidate, while "Find every pair of regional carrier clauses that conflict" is a recursive-path candidate. Neither label releases a route without held-out quality and budget measurements.
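The no-progress rule can be sketched as a small predicate. How the summary delta is computed is left open here; the function name should_terminate, the three-turn window, and the 0.05 threshold are assumptions for illustration.

```python
def should_terminate(
    spent_usd: float,
    max_spend_usd: float,
    summary_deltas: list[float],
    min_delta: float = 0.05,
    window: int = 3,
) -> bool:
    """Stop when spend has passed the cap and recent turns barely changed the summary."""
    recent = summary_deltas[-window:]
    # Require a full window of small deltas before calling the run stalled.
    stalled = len(recent) == window and all(delta < min_delta for delta in recent)
    return spent_usd > max_spend_usd and stalled

stalled_run = should_terminate(0.42, 0.30, [0.40, 0.03, 0.02, 0.01])
progressing_run = should_terminate(0.42, 0.30, [0.40, 0.30, 0.20])
cheap_run = should_terminate(0.10, 0.30, [0.01, 0.01, 0.01])

print(f"stalled_run_terminated={stalled_run}")
print(f"progressing_run_terminated={progressing_run}")
print(f"cheap_run_terminated={cheap_run}")
```

Requiring both conditions avoids killing an expensive run that is still making progress, or a cheap run that has simply converged early.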
Use measured outcomes to select the smallest route that clears quality and operational limits:
```python
results = [
    {"route": "direct", "recall": 0.73, "p95_cost": 0.08, "p95_seconds": 4.1},
    {"route": "repl_only", "recall": 0.92, "p95_cost": 0.19, "p95_seconds": 8.7},
    {"route": "recursive", "recall": 0.95, "p95_cost": 0.58, "p95_seconds": 24.0},
]

eligible = [
    row for row in results
    if row["recall"] >= 0.90
    and row["p95_cost"] <= 0.30
    and row["p95_seconds"] <= 12.0
]
chosen = min(eligible, key=lambda row: (row["p95_cost"], row["p95_seconds"]))

print(f"eligible_routes={[row['route'] for row in eligible]}")
print(f"released_route={chosen['route']}")
print(f"recursive_blocked_by_budget={results[2]['p95_cost'] > 0.30}")
```

```text
eligible_routes=['repl_only']
released_route=repl_only
recursive_blocked_by_budget=True
```

While the basic RLM loop is elegant, real-world execution environments are messy. A robust implementation must handle several failure modes that standard single-shot models avoid entirely.
Because an RLM can write code to call itself, it can write a while True: loop or an infinitely recursive function. This isn't only a theoretical risk. If the model fails to extract the needed information, its retry logic might get stuck.
To prevent this, production systems implement strict budgets:
The model signals it's finished through explicit REPL state: Final in the paper's algorithm, and the answer dictionary in the current runtime.[1][4] Models can still print an apparent answer without setting the required state, or forget to finalize at all.
A robust scaffold records stdout for diagnosis but accepts only its declared final-answer contract for normal success. Recovering arbitrary prints as released answers hides controller regressions.
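The contract check can be sketched as follows. It assumes the REPL namespace exposes an `answer` dictionary with a `ready` flag; the `content` key, the function names, and the status strings are hypothetical and should be matched to the versioned contract of the runtime you use.

```python
from typing import Any, Dict, Mapping, Optional


def read_final_answer(namespace: Mapping[str, Any]) -> Optional[str]:
    """Return the declared final answer, or None if the contract is not met."""
    answer = namespace.get("answer")
    if not isinstance(answer, dict):
        return None
    if answer.get("ready") is not True:
        return None
    content = answer.get("content")  # hypothetical key name
    if not isinstance(content, str) or not content.strip():
        return None
    return content


def finish_trajectory(namespace: Mapping[str, Any], stdout: str) -> Dict[str, Any]:
    """Accept only the declared state; keep stdout for diagnosis, never as the answer."""
    content = read_final_answer(namespace)
    return {
        "status": "ok" if content is not None else "contract_violation",
        "answer": content,
        "stdout_for_debugging": stdout,
    }


good = finish_trajectory({"answer": {"ready": True, "content": "Carrier X excludes PO boxes."}}, "")
printed_only = finish_trajectory({"answer": {"ready": False, "content": ""}}, "The answer is Carrier X.")
print(f"good_status={good['status']}")
print(f"printed_only_status={printed_only['status']}")
```

The second trajectory printed something that looks like an answer, yet it is flagged rather than released, which keeps controller regressions visible.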
Every time the root model executes a step, the result is appended to its history. If the REPL code prints a 10,000-line log file, that entire dump might stream back into the root model's context, immediately exhausting it.
To solve this, the environment must truncate or summarize execution outputs. Instead of returning raw output, the environment might return a message such as `Output truncated: 10,000 lines. The first 5 lines are...`. This forces the root model to write more targeted code, keeping the neural context compact.
Test that the history path preserves a preview without forwarding the entire dump:
```python
def preview_for_history(stdout: str, limit: int = 40) -> str:
    """Return stdout unchanged if short, else a prefix plus a truncation marker."""
    if len(stdout) <= limit:
        return stdout
    return stdout[:limit] + "...[truncated]"


raw = "\n".join(f"tracking event {index}" for index in range(20))
preview = preview_for_history(raw)

print(f"raw_chars={len(raw)}")
print(f"preview_chars={len(preview)}")
print(f"truncated={'[truncated]' in preview}")
```

```text
raw_chars=349
preview_chars=54
truncated=True
```

| Symptom | Cause | Fix |
|---|---|---|
| API bill spikes 10x on a single query | The root model emitted an unbatched loop with one sub-call per sentence or row. | Enforce batching, cap delegated calls per turn, and add a cost circuit breaker. |
| The model "forgets" the original goal halfway through | Context fragmentation: recursive sub-calls don't receive the original query or global constraints. | Pass a compact mission string to every sub-call, and store global variables in the REPL environment rather than relying on context memory. |
| Runs time out after 30 seconds with no output | Infinite recursion or a `while True:` retry loop caused by a missing base case. | Enforce `max_iters`, recursion depth limits, and REPL compute timeouts. Log the last emitted code for debugging. |
| Final answer is missing or garbled | The model printed text without setting the documented final-answer state. | Fail or flag the trajectory, keep stdout for debugging, and test the versioned completion contract. |
| RLM underperforms on simple retrieval | You deployed recursive decomposition for a needle lookup where a single direct call would win on both latency and cost. | Add a routing layer that bypasses RLM when the input fits comfortably in the base model's window and the task is low complexity. |
Use these questions to test whether you understand the architecture rather than memorizing the benchmark table.
Because the gain isn't only in model weights. RLM changes the inference interface: long prompt text lives in an external programmable environment, and the model decides how to inspect and transform it through code plus optional sub-calls. In the paper's dense-task evaluations, this sometimes beats direct long-context prompting on quality at comparable cost. It isn't a universal quality or cost guarantee.
Limit recursion when the task has low information density, such as simple retrieval over a small number of documents, or when sub-call fanout dominates latency and spend. A practical policy is to set depth and call-count budgets, run a no-subcall ablation, and keep recursion only where it improves task-level metrics. The RLM paper's ablation shows that REPL-only variants can already be strong on some tasks, while recursion is most valuable on information-dense reasoning workloads.
The paper makes a narrower claim than a full cookbook. It reports a post-trained controller called RLM-Qwen3-8B that improves substantially over base Qwen3-8B, which is evidence that recursive scaffold behavior is learnable. It doesn't publish a full reproduction recipe for every filtering, prompting, and optimization choice. The right takeaway isn't "there is already a standard recipe." The right takeaway is that post-training the controller is promising once you have high-quality recursive trajectories and a well-defined execution protocol.
Unsafe code execution, sub-call explosion, and protocol brittleness around final outputs. Mitigations include isolated sandboxes, strict execution and recursion budgets, versioned final-output contracts, trajectory logging with anomaly flags, and fallback policies that route simple requests to a base-model path. Treat the RLM controller like distributed systems infrastructure, with guardrails and observability, not like a single prompt template.
By the end of this chapter, you should be able to:
RLM-Qwen3-8B with cost-aware interpretation.

After working through this article, you should be able to reason about recursive inference from three angles: building, analyzing, and debugging.
Write a Python script that uses a small-context model (for example, a 4,096-token limit) to summarize a text that is ten times larger than its window. This tests call growth before you implement an adaptive controller. Your script should:
Check: For a 40,000-token text with a 4,096-token limit, how many leaf calls and how many total calls does a binary splitting strategy need? (Answer: 16 leaf calls and 31 total calls, because recursive halving reaches chunks of about 2,500 tokens.)
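The check can be verified with a short recursive count. This sketch assumes one merge call per internal node of the split tree and ignores prompt overhead; the function name is illustrative.

```python
from typing import Tuple


def count_calls(tokens: int, limit: int) -> Tuple[int, int]:
    """Return (leaf_calls, total_calls) for recursive binary splitting."""
    if tokens <= limit:
        return 1, 1  # the chunk fits, so one leaf call summarizes it
    left = tokens // 2
    right = tokens - left
    left_leaves, left_total = count_calls(left, limit)
    right_leaves, right_total = count_calls(right, limit)
    # One extra call merges the two child summaries.
    return left_leaves + right_leaves, left_total + right_total + 1


leaves, total = count_calls(40_000, 4_096)
print(f"leaf_calls={leaves}")   # leaf_calls=16
print(f"total_calls={total}")   # total_calls=31
```

The text halves four times (40,000 to 20,000 to 10,000 to 5,000 to 2,500) before a chunk fits, giving 2^4 = 16 leaves and 15 merge calls.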
Compare two approaches on the same 40,000-token fulfillment manual:
List three variables besides per-token price that affect the total cost of Approach B. (Example answers: number of recursive sub-calls, batching efficiency, whether the root model retries failed sub-calls.)
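To see how those example variables interact, here is a rough cost estimator. Every parameter name and number in it is a hypothetical assumption chosen for illustration, not a measured value; it models per-call overhead (instructions plus mission text) so that batching efficiency shows up in the total.

```python
import math


def estimate_subcall_cost(
    items: int,
    batch_size: int,
    tokens_per_item: int,
    overhead_tokens_per_call: int,
    price_per_1k_tokens: float,
    retry_rate: float = 0.0,
) -> float:
    """Expected spend for delegated calls, in the same currency as the price."""
    calls = math.ceil(items / batch_size)
    payload_tokens = items * tokens_per_item
    overhead_tokens = calls * overhead_tokens_per_call
    # A retry rate of 0.1 means 10% of the work is expected to be repeated.
    expected_tokens = (payload_tokens + overhead_tokens) * (1 + retry_rate)
    return expected_tokens / 1000 * price_per_1k_tokens


shared = dict(
    items=400,
    tokens_per_item=100,
    overhead_tokens_per_call=300,
    price_per_1k_tokens=0.002,
    retry_rate=0.1,
)
unbatched = estimate_subcall_cost(batch_size=1, **shared)
batched = estimate_subcall_cost(batch_size=20, **shared)
print(f"unbatched_cost={unbatched:.4f}")
print(f"batched_cost={batched:.4f}")
```

At the same per-token price, the unbatched plan costs several times more because the fixed overhead is paid once per item instead of once per batch.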
You run an RLM on a long document and the final summary answers a completely different question than the one you asked. Which of these is the most likely root cause?
A. The base model is too small.
B. The recursive sub-calls didn't receive the original question.
C. The REPL environment ran out of memory.
D. The context window was too large.
Answer: B. When each sub-call only sees a slice of text without the original query, the model drifts toward whatever question it guesses from the local context. The fix is to forward a compact mission prompt to every recursive call.
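The fix can be sketched as a prompt builder that refuses to create a sub-call without the mission. The function names, the prompt wording, and the `max_mission_chars` limit are illustrative assumptions.

```python
from typing import List


def build_subcall_prompt(mission: str, chunk: str, chunk_id: int) -> str:
    """Attach the original question to a single slice of text."""
    return (
        f"Mission: {mission}\n"
        f"Chunk {chunk_id}:\n{chunk}\n\n"
        "Report only evidence relevant to the mission, or reply NONE."
    )


def build_subcall_prompts(
    mission: str, chunks: List[str], max_mission_chars: int = 300
) -> List[str]:
    """Build one prompt per chunk, each carrying the same compact mission."""
    mission = mission.strip()
    if not mission:
        raise ValueError("every sub-call needs the original mission")
    if len(mission) > max_mission_chars:
        raise ValueError("mission string should stay compact")
    return [build_subcall_prompt(mission, chunk, i) for i, chunk in enumerate(chunks)]


prompts = build_subcall_prompts(
    "Which carriers exclude PO box delivery?",
    ["Carrier A policy text", "Carrier B policy text"],
)
print(f"prompts={len(prompts)}")
print(f"all_carry_mission={all('PO box' in p for p in prompts)}")
```

Keeping the mission short matters because it is repeated in every delegated call.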
RLM isn't replacing train-time scaling. It provides another axis of optimization.
A useful lens is:
This approach is closely aligned with the broader trend toward test-time compute.[7] It also reflects a recurring pattern in artificial intelligence systems: simple, general computation strategies executed in dynamic environments often outlast brittle, handcrafted heuristics.[8] For million-token workloads, external computation is a useful option when direct-context evaluations miss needed evidence or relationships.
The clearest signal so far isn't broad production adoption. It's the author-maintained inference library with runnable sandbox integrations, which makes the idea concrete rather than purely conceptual.[4]
Be honest about maturity. RLM currently has an author-reported research evaluation and an author-maintained open-source runtime; it isn't a settled production standard. Treat the reported numbers as evidence worth reproducing on your workload, not an independently replicated guarantee. That uncertainty is why budgets, sandboxing, ablations, and routing matter more than a single score table.
This closes the recent "Advanced Training & Adaptation" arc on inference-time adaptation. Earlier chapters changed the weights through fine-tuning, alignment, and distillation. DSPy optimized the program around a fixed model. RLM goes one step further and reorganizes inference-time compute over an external environment. The throughline is that effective capability is a property of the whole system, not just the parameters.
`answer["ready"] = True`. Why is that a protocol bug rather than a prompt-style issue?

RLM(GPT-5, depth=1) reached 91.3 on BrowseComp+ (1K) versus 70.5 for the compaction agent, while RLM(GPT-5, depth=3) reached 76.0 on OOLONG-Pairs versus 24.7 for CodeAct + BM25. On CodeQA, the no-subcall variant can win, so recursion still needs routing.[1]

RLM is a way to spend inference-time compute where context is hardest, instead of where token order happened to place it.
[1] Zhang, A. L., Kraska, T., & Khattab, O. (2025). Recursive Language Models.

[2] Zhang, A. L. (2025). Recursive Language Models.

[3] Hong, K., Troynikov, A., & Huber, J. (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance.

[4] Zhang, A. L., et al. (2026). Recursive Language Models (RLM) Repository.

[5] Young, E. W., et al. (2019). The True Cost of Containing: A gVisor Case Study. HotCloud 19.

[6] Agache, A., et al. (2020). Firecracker: Lightweight Virtualization for Serverless Applications. NSDI 2020.

[7] Snell, C., et al. (2024). Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters. arXiv preprint.

[8] Sutton, R. S. (2019). The Bitter Lesson.