AgentHN: Self-Editing Agents via Hypernetworks

Authors
Eric Ge  Bryan Lim  David Xiong  Nikash Bhardwaj
All authors contributed equally.
Date
June 2026
Abstract

Text-based memory breaks down over long horizons and wastes the context window. We instead use hypernetworks to let an agent edit its own parameters at inference time. A hypernetwork (Doc-to-LoRA) converts context into LoRA adapters in a single forward pass, avoiding the lengthy training runs of SFT or RL. We show four applications (memory, personalization, skills, and self-improvement), each with an interactive replay of a captured GPU run below.

Memory

Every K turns, the oldest turns are compacted into a LoRA adapter and evicted from context. Memory persists in weights, not tokens, so the prompt stays small as the conversation grows.

Personalization

User preferences are kept in a running document and periodically internalized as a per-user LoRA profile. Profiles persist across sessions without reloading context.

Skills

A reference doc for a skill (formulas, output-format conventions) is internalized once into a LoRA adapter. The agent recalls it straight from the weights, instead of re-paying the doc's context cost on every query.

Self-Improvement

The agent attempts a task, reflects on its mistakes, rewrites its own notes, and re-internalizes them into a fresh adapter — improving its weights over rounds with no gradient training.
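The self-improvement loop described above (attempt, reflect, rewrite notes, re-internalize) can be sketched in a few lines. This is a toy illustration under loud assumptions, not the authors' implementation: `Adapter`, `internalize`, `attempt`, `reflect`, and `self_improve` are hypothetical names, the "agent" succeeds on a step only if its notes mention that step, and `internalize` stands in for the hypernetwork forward pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Adapter:
    """Hypothetical stand-in for a LoRA adapter: it only records its source notes."""
    notes: frozenset


def internalize(notes):
    # Placeholder for the hypernetwork forward pass (notes -> adapter weights).
    return Adapter(frozenset(notes))


def attempt(task_steps, adapter):
    # Toy agent: a step fails unless the adapter was built from a note covering it.
    return [step for step in task_steps if step not in adapter.notes]


def reflect(failures, budget=1):
    # Toy reflection: turn at most `budget` failures into new notes per round.
    return set(failures[:budget])


def self_improve(task_steps, max_rounds):
    notes = set()
    adapter = internalize(notes)
    failures_per_round = []
    for _ in range(max_rounds):
        failures = attempt(task_steps, adapter)
        failures_per_round.append(len(failures))
        if not failures:
            break
        notes |= reflect(failures)      # rewrite the notes
        adapter = internalize(notes)    # fresh adapter, no gradient step
    return failures_per_round


print(self_improve(["parse", "compute", "format"], max_rounds=5))
```

Each round replaces the adapter wholesale rather than fine-tuning it, which is the property the text emphasizes: the only thing that changes between rounds is the notes fed to the hypernetwork.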

One forward pass instead of a training run

SFT and RL adapt a model with long training loops. A hypernetwork instead predicts the adapter weights directly from an input, so adaptation happens in real time, mid-conversation.
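To make "predicts the adapter weights directly from an input" concrete, here is a minimal sketch, assuming a toy linear hypernetwork that maps a document embedding to the two LoRA factors A and B, which are then added to a frozen weight matrix as W + B·A. The real Doc-to-LoRA architecture is not described on this page; every function name, shape, and initialization below is an assumption chosen for illustration.

```python
import random


def matmul(X, Y):
    # Plain-Python matrix product for small nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def make_hypernet(embed_dim, d_out, d_in, rank, seed=0):
    # One output row per predicted LoRA parameter (A has rank*d_in, B has d_out*rank).
    rng = random.Random(seed)
    n_params = rank * d_in + d_out * rank
    return [[rng.gauss(0, 0.1) for _ in range(embed_dim)] for _ in range(n_params)]


def predict_lora(hypernet, doc_embedding, d_out, d_in, rank):
    # Single forward pass: embedding -> flat parameter vector -> (A, B).
    flat = [sum(w * z for w, z in zip(row, doc_embedding)) for row in hypernet]
    a_flat, b_flat = flat[: rank * d_in], flat[rank * d_in:]
    A = [a_flat[i * d_in:(i + 1) * d_in] for i in range(rank)]    # rank x d_in
    B = [b_flat[i * rank:(i + 1) * rank] for i in range(d_out)]   # d_out x rank
    return A, B


def apply_lora(W, A, B, scale=1.0):
    # Adapted weight: W + scale * (B @ A); W itself is left untouched.
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


hypernet = make_hypernet(embed_dim=3, d_out=4, d_in=5, rank=2)
W = [[1.0] * 5 for _ in range(4)]
A, B = predict_lora(hypernet, [0.5, -1.0, 2.0], d_out=4, d_in=5, rank=2)
W_adapted = apply_lora(W, A, B)
```

No optimizer or loss appears anywhere in the adaptation path: a new document only requires calling `predict_lora` again.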

Figure: Doc-to-LoRA (D2L) for memory & personalization (Sakana AI · arXiv:2602.15902). Flow: oldest turns / preference doc → D2L hypernetwork → adapter (r=8) → running adapter → evict doc from context.
Figure: Doc-to-LoRA (D2L) for self-improving skills (Sakana AI · arXiv:2602.15902). Flow: failed attempt → reflect on errors → updated skill notes → D2L hypernetwork → skill adapter → retry.

See the captured runs unfold step by step

Four scenarios, replayed from captured Doc-to-LoRA runs on gemma-2-2b-it.

Constant-cost memory for long-running agents

A needle-in-a-haystack test over a long trajectory: a few facts (needles) are dropped into the conversation early, then asked about at the very end. AgentHN compacts the oldest turns into a LoRA adapter every K turns (one forward pass, no training) and evicts them from context, so per-step context stays flat as the conversation grows. Baselines (markdown .md, RAG, vanilla, a trained Cartridges cache) run side by side; watch the context window and inference cost diverge.

Interactive replay: the haystack (the full conversation, i.e. the raw context that exists) with its needles and questions, shown alongside two panels: AgentHN (memory in weights, adapter count) and Markdown .md (memory in text, note count). Each panel shows ① context (the text actually in the prompt), ② an activity log, and ③ the response.
Charts: ① context window fill (share of gemma's 8,192-token window held as prompt); ② inference cost (KV cache at query time, i.e. the memory the GPU spends to answer).

Experiments — long-horizon memory

How AgentHN's weight-based memory stacks up against raw context, markdown notes, retrieval (RAG), and a trained Cartridges soft-prompt — on recall, context cost, scaling, and price. Every number has a script in the repo.

Scaling result: Memory holds past the single-encode limit, at flat cost

Doc-to-LoRA encodes a bounded document in one pass (chunk into ≤8 pieces, rank-concatenate), which is good for about 4× the context window before the chunk count goes out-of-distribution and recall degrades. We instead compact repeatedly: every K turns the oldest segment becomes its own adapter in a growing store, then is evicted from context. At query time we retrieve only the relevant adapter, so there is no rank-concatenation ceiling and the horizon is no longer capped by a single encode (tested here up to 48k tokens).

Recall and inference cost vs conversation length
📈 6 / 6 recall, flat to 48k tokens (410 compactions) from a ~10-token prompt — recall doesn't decay as the conversation grows. 🧱 a plain 8k-window model collapses to 0 / 6 once the conversation overflows its window.
gemma-2-2b-it · 6 facts planted early, conversation grown to 48k tokens · scripts/scaling_sweep.py
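The compact-store-retrieve scheme described in this section can be sketched as follows. This is a minimal toy, not the repo's code: `CompactingMemory`, `internalize`, and `key_terms` are hypothetical names, `internalize` is a placeholder for the D2L forward pass, the `window` parameter (how many recent turns stay in the prompt) is an assumption, and the keyword-overlap retriever is a stand-in because the page does not say which retriever is used.

```python
def key_terms(text):
    # Crude retrieval key: the set of lowercased whitespace-separated tokens.
    return set(text.lower().split())


def internalize(segment):
    # Placeholder for the hypernetwork pass; a real adapter would hold LoRA weights.
    return {"source": tuple(segment)}


class CompactingMemory:
    def __init__(self, k, window):
        self.k = k            # turns compacted per adapter
        self.window = window  # recent turns kept as text (assumed parameter)
        self.context = []     # text actually in the prompt
        self.store = []       # growing list of (retrieval key, adapter) pairs

    def add_turn(self, turn):
        self.context.append(turn)
        if len(self.context) >= self.window + self.k:
            oldest = self.context[: self.k]
            self.context = self.context[self.k:]   # evict from context
            self.store.append((key_terms(" ".join(oldest)), internalize(oldest)))

    def retrieve(self, query):
        # Pick the single adapter whose key overlaps the query most.
        if not self.store:
            return None
        terms = key_terms(query)
        _, adapter = max(self.store, key=lambda item: len(terms & item[0]))
        return adapter


memory = CompactingMemory(k=4, window=4)
memory.add_turn("the vault code is 4815")
for i in range(99):
    memory.add_turn(f"filler turn number {i}")
print(len(memory.context), len(memory.store))
```

The prompt never holds more than `window + k - 1` turns regardless of conversation length, while the store grows by one adapter per compaction; only one adapter is loaded per query, so nothing is rank-concatenated.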
Ablation study: Where does the recall actually come from?

Same task, grown to 12k tokens (overflows the 8k window). We remove one component at a time and watch recall and query context. The decisive pair is the last two rows — text-RAG vs AgentHN: identical retriever, identical recall, differing only in whether the retrieved memory enters the prompt as tokens or a weight patch. That gap is exactly what the hypernetwork buys.

| Condition | What's removed | Recall | Query context (tokens) | Isolates |
|---|---|---|---|---|
| vanilla | no memory | 0 / 6 | 8,498 | memory is needed |
| naive-internalize | internalize, replace (no store) | 0 / 6 | 10 | a store is needed |
| compose-all | rank-concat ALL (no retrieval) | 0 / 6 | 10 | retrieval is needed |
| text-RAG ablation | deliver retrieved chunk as text | 6 / 6 | 119 | removes weight-injection |
| AgentHN full | deliver retrieved chunk as LoRA | 6 / 6 | 10 | ~12× less context (recall cost quantified below) |
query context = tokens the model attends to at recall time (≈ KV-cache cost) · scripts/ablation_study.py
Statistical validation: Does it hold up at scale, across seeds?

Beyond the curated demo: 20 random facts × 5 seeds = 100 trials per cell, at each of 5 haystack sizes (2k → 48k tokens), reported as mean ± std with Wilson 95% CIs. We also trained a real Cartridges baseline and measured it on the same task. The honest, defensible picture:

Recall and context vs haystack size, mean ± std over seeds
📊 AgentHN recall is flat across the horizon: 87 ± 6% at both 2k and 48k, while a plain model collapses 95% → 36% → 0%.
⚖️ recall ↔ context trade: RAG 100% @ 116t · Cartridges 92% @ 96t · AgentHN 87% @ 7t.
🔬 AgentHN and RAG share a retriever → the 13-point recall gap comes from the LoRA encoding, not retrieval.
AgentHN vs Cartridges — head-to-head multipliers
📉 13.3× less query-time context (7.2 vs 96 tok)
💸 1.8× cheaper per query (0.47 vs 0.85 m$, i.e. thousandths of a dollar at the rates used in the cost model)
⚙️ no training: one forward pass vs ~43 s gradient training / corpus
🟢 online vs offline / batch-only
📊 cost: −5 pts recall (87% vs 92%)
gemma-2-2b-it · n=100 trials/cell · Wilson 95% CIs · scripts/large_scale_eval.py
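For readers who want to check the intervals, the Wilson 95% score interval for a cell of 100 trials can be computed as below. This is the standard formula written as a small helper; the function name `wilson_interval` is ours, and the repo's `scripts/large_scale_eval.py` may compute it differently.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Example: 87 correct recalls out of 100 trials in one cell.
low, high = wilson_interval(87, 100)
print(f"{low:.3f} to {high:.3f}")
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and remains usable for cells at 0% or 100% recall, which matters here because the plain-model baseline hits 0%.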
Where AgentHN sits vs RAG and Cartridges — four ways to give a model memory

The honest framing isn't "we beat everyone" — it's which axis you optimize. Cartridges (HazyResearch) trains a KV cache offline and matches in-context quality; RAG pastes retrieved text; AgentHN predicts a LoRA in one forward pass. They differ on creation cost, online use, recall, and price:

| Method | Memory stored as | Built by | Online / streaming? | Recall | $/query (warm; m$ = thousandths of a dollar) |
|---|---|---|---|---|---|
| vanilla | raw text in the prompt | nothing (just append) | yes | 0% past window | 14.9 m$ |
| RAG | text chunks + index | nothing (just embed) | yes | ~100% | 0.83 m$ |
| Cartridges (HazyResearch) | trained KV cache | offline gradient training per corpus | no (batch only) | 92%² (measured) | 0.85 m$ + $43 train |
| AgentHN | LoRA adapter (weights) | one forward pass per segment | yes (real-time) | ~87% (lossy) | 0.47 m$ |
² Cartridges recall measured on the same task with our own trained soft-prompt reimplementation (96 virtual tokens, 60 epochs/corpus); the paper's full KV-cache + distillation likely does better. 38.6× memory / 26.4× throughput vs ICL are cited from the paper (arXiv:2506.06266) · all numbers on gemma-2-2b-it · scripts/cartridge_eval.py

The takeaway: Cartridges beats AgentHN on recall (92% vs 87%) but trains offline and can't update mid-conversation. AgentHN is the online / real-time point: memory built in one forward pass as the agent talks, at the lowest per-query cost, paying ~13 points of recall relative to RAG's ~100% (about 5 points relative to Cartridges). Different tools: streaming agent memory (AgentHN) vs a fixed corpus queried millions of times (Cartridges).

Cost (not just memory): $ per query, with prefix-cache hits vs misses

"Memory used at inference" isn't the same as cost — modern APIs let you prefix-cache a stable prompt and re-read it at ~10% price. So we price each method at representative frontier rates (input $3, cached-read $0.30, output $15 per 1M tokens) for a 48k-token history queried 1,000 times.

Cost per query and total cost by method
💾 prefix caching helps vanilla ~10× (144 m$ → 14.9 m$ / query, where m$ = thousandths of a dollar: 48k tokens at $3 per 1M is $0.144), but it still re-reads the 48k prefix every query.
AgentHN is cheapest per query (0.47 m$): it only processes the question; the memory is a weight swap, not tokens.
🏗️ Cartridges' offline training dominates ($43 one-time): cheap per query, but only worth it after ~50k queries on a fixed corpus.
token cost is a proxy for compute; AgentHN's adapter-apply and Cartridges'/AgentHN's creation are compute, not API tokens (creation estimated) · scripts/cost_model.py
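The prompt-side arithmetic behind these figures can be reproduced from the rates quoted above. This is a sketch, not `scripts/cost_model.py`: it prices only the history tokens, so it recovers the uncached and cached prefix costs and the headline ratios, and it does not try to reproduce the exact per-query totals, which include question and output tokens that the page does not break down. The helper name `token_cost` is hypothetical.

```python
# Rates quoted in the text, in dollars per 1M tokens.
INPUT_RATE = 3.00
CACHED_READ_RATE = 0.30


def token_cost(tokens, rate_per_million):
    """Dollar cost of processing `tokens` at a given $/1M-token rate."""
    return tokens * rate_per_million / 1_000_000


history_tokens = 48_000

uncached = token_cost(history_tokens, INPUT_RATE)        # prefix re-sent at full price
cached = token_cost(history_tokens, CACHED_READ_RATE)    # prefix read from cache

# Ratios reported in the head-to-head comparison, recomputed from the quoted values.
context_ratio = 96 / 7.2    # Cartridges vs AgentHN query-time tokens
price_ratio = 0.85 / 0.47   # Cartridges vs AgentHN per-query price

print(f"uncached prefix: ${uncached:.4f} per query")
print(f"cached prefix:   ${cached:.4f} per query")
print(f"context ratio:   {context_ratio:.1f}x, price ratio: {price_ratio:.1f}x")
```

Prefix caching scales the prefix cost by the ratio of the two rates (10× here), but the cost still grows linearly with history length, whereas a method that keeps the history out of the prompt pays only for the question.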

Experiments — skills

Internalizing a reference doc into a D2L LoRA adapter, measured against keeping it in the prompt — on accuracy and query-time context. Every number has a script in the repo.

Benchmark result: Internalizing the doc beats keeping it in the prompt, at a fraction of the context

The same 20 held-out physics word problems, run three ways on the bare base model. Base gets no help. In-context pastes the full Newton's-laws formula sheet into every prompt. Adapter internalizes that same doc once into a D2L LoRA, then answers with nothing in the prompt. The adapter matches in-context accuracy (11/20 vs 10/20, a one-problem difference on this small set) while spending ~32× less query context.

Figure: Physics word-problem accuracy: base vs in-context vs D2L adapter
Figure: Context tokens consumed per question: in-context vs D2L adapter
📈 adapter 55% (11/20) ≥ in-context 50% (10/20), vs 25% (5/20) cold base
🪶 53 vs 1,707 avg. input tokens/question: ~32× less, paid once at internalization
↯ doc internalized in ~1.3s into a single rank-8 LoRA, then evicted from context
measured on gemma-2-2b-it · 20 held-out problems · Newton's-laws formula sheet · skill-acquisition/make_physics_charts.py