Chroma's research from last July confirms what we've been building around: bigger context windows don't solve the retrieval problem. They tested 18 LLMs, including Claude, GPT-4.1, and Gemini 2.5. Performance degrades regardless of how much you can fit in.
Here's the part nobody talks about.
Retrieval and generation are decoupled. The retriever finds "relevant" chunks. The generator uses them. But nothing connects what got retrieved to whether the answer actually helped.
Your AI pulls up a memory. Uses it. Gets it wrong. And then... nothing. That memory sits there, waiting to surface again. Same confidence. Same ranking. No feedback.
Where's the feedback loop?
The Learning Gap
Rerankers, query rewriting, hybrid search - they all try to fix retrieval quality at query time. But they can't learn from outcomes. They optimize for similarity, not success.
We built outcome-based learning. When you say "that worked" or "no, that's wrong," those signals attach to the memories themselves. Good memories surface more. Bad ones sink.
It's not complicated. It's just not how anyone else builds this.
Three Problems We Had to Solve
Problem 1: Cold Start
A new memory helps once. 1/1 = 100% success rate. A veteran memory has helped 90 times out of 100. 90/100 = 90%. Raw math says the new one is better. That's insane.
Wilson score fixes this by asking: how much do I actually trust this number? One data point? Could be luck. A hundred data points? That's a pattern. So 9/10 and 90/100 are both 90% raw, but Wilson scores them at ~60% and ~83% respectively - and that lucky 1/1 drops to ~21%. More evidence, higher floor. Memories have to prove themselves.
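This is the textbook Wilson lower bound at 95% confidence (z = 1.96), which reproduces the numbers above - a minimal sketch, not necessarily our exact production code:

```python
import math

def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a success rate.

    With little evidence the bound stays low; it rises toward the raw
    rate as observations accumulate.
    """
    if total == 0:
        return 0.0
    p = successes / total
    denom = 1 + z**2 / total
    center = p + z**2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (center - margin) / denom

print(round(wilson_lower_bound(1, 1), 2))     # ~0.21
print(round(wilson_lower_bound(9, 10), 2))    # ~0.60
print(round(wilson_lower_bound(90, 100), 2))  # ~0.83
```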
Problem 2: When to Trust What
New memories have no track record. You can't rank them by outcome because there's no outcome yet. But if you only trust embeddings, you're back to semantic similarity. No learning.
We use dynamic weighting. As memories get used and scored, the balance shifts from semantic similarity toward outcome history: new memories are ranked mostly by their embeddings, proven memories mostly by their track record.
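One way that blend could look, as a sketch - the evidence-weight schedule here is an illustrative assumption, not a published formula:

```python
def rank_score(similarity: float, successes: int, total: int) -> float:
    """Blend embedding similarity with outcome history.

    With zero scored uses a memory is ranked purely by similarity;
    as evidence accumulates, its Wilson-scored track record dominates.
    """
    outcome = wilson_lower_bound(successes, total)  # from the sketch above
    evidence_weight = min(total / 10, 1.0)          # assumed: full trust after ~10 scored uses
    return (1 - evidence_weight) * similarity + evidence_weight * outcome
```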
Trust is earned, not assumed.
Problem 3: Making Scoring Frictionless
If the user has to click a thumbs-up button, they won't. The feedback loop dies.
Here's how we solve it: the LLM does the work. After each exchange, we prompt the model to read your next message and infer whether its response actually helped:
- "Thanks, that worked!" → outcome = worked
- "No, that's wrong" → outcome = failed
- User moves to a new topic → outcome = worked (previous issue resolved)
- User asks follow-up questions → outcome = partial or unknown
No buttons. No friction. The AI reads your reaction and scores its own memories.
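Roughly, in sketch form - the prompt wording, the `llm` callable, and the `store.record_outcome` call are all placeholders, not our actual internals:

```python
OUTCOME_PROMPT = """The assistant just answered using retrieved memories.
Here is the user's next message:

{next_message}

Classify the outcome of the previous answer as exactly one of:
worked, failed, partial, unknown.
Moving on to a new topic counts as worked; follow-up questions count as partial or unknown."""

def score_last_exchange(next_message: str, memory_ids: list[str], llm, store) -> None:
    """Ask the model to read the user's reaction, then attach the outcome
    to every memory that was used in the previous answer."""
    outcome = llm(OUTCOME_PROMPT.format(next_message=next_message)).strip().lower()
    if outcome not in {"worked", "failed", "partial", "unknown"}:
        outcome = "unknown"
    for memory_id in memory_ids:
        store.record_outcome(memory_id, outcome)  # hypothetical store API
```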
The Results
We ran 30 adversarial tests where semantic similarity points to the wrong answer. Example: "My code keeps crashing" might semantically match a memory about "the crash course I took" instead of "the buffer overflow fix that worked."
| Approach | Accuracy |
|---|---|
| ChromaDB baseline (similarity only) | 0% |
| Roampal with outcome-based learning | 60% |
A 60-percentage-point gap. On queries where the semantically similar answer is wrong, outcome-based learning surfaces the answer that actually worked.
How the test works: we plant a trap. Two memories go in - one that worked, one that failed but sounds more like the query. "My code keeps crashing" matches "crash course" better than "buffer overflow fix." Pure vector search falls for it every time. Outcome-based learning doesn't care about word similarity - it remembers what actually helped.
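The shape of one trap, sketched with a hypothetical `store` API standing in for the real harness:

```python
# Plant the trap: the distractor sounds like the query, the useful memory doesn't.
store.add("crash-course", text="Notes from the crash course I took last spring",
          successes=0, total=1)   # surfaced before, didn't help
store.add("overflow-fix", text="Fixed the crash: bounds-check the buffer before copying",
          successes=1, total=1)   # surfaced before, solved it

top = store.search("My code keeps crashing", top_k=1)[0]
assert top.id == "overflow-fix"   # passes only if outcomes outrank raw word similarity
```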
Token Efficiency
Context rot is partly a token problem. Standard RAG grabs everything that looks relevant and stuffs it into context. More tokens, worse performance, higher costs.
We tested a different approach: what if you retrieved less, but retrieved better?
The test: 100 scenarios where the query matches lots of memories semantically, but only a few actually helped before. Example - you ask "Why is my API slow?" Your memory contains discussions about API rate limits, authentication timeouts, that time you complained about slow coffee, and one memory about adding a database index that actually fixed a slow API last month.
Standard RAG pulls in everything with "slow" or "API" in it. Roampal checks which memories had good outcomes and prioritizes those.
The metric: does the memory that actually helped appear in the top 3 results?
| Approach | Retrieved the helpful memory |
|---|---|
| Naive RAG (semantic similarity) | 1% |
| Roampal with outcome-based learning | 67% |
And because Roampal only retrieves 3-5 high-confidence memories instead of 50 "relevant" chunks, you use fewer tokens too.
Where Memories Live
Five collections, each with a purpose (the first three are outcome-scored from worked/failed feedback):
- working - Live conversation. Auto-cleaned after 24 hours.
- history - Graduated from working. 30-day decay.
- patterns - Promoted solutions. Can be demoted.
- memory_bank - User facts and preferences. Ranked by importance × confidence. Updatable. Deletable.
- books - Uploaded docs. Permanent. Searchable.
Wilson scoring ranks ALL results, but only the first three collections learn from feedback. The memory_bank and books collections don't update based on outcomes - they're static reference.
Memories flow up as they prove useful. Bad ones decay out. The system self-cleans.
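Those lifecycle rules, sketched as configuration - the retention windows are the ones listed above; the working → history → patterns promotion path is how "flowing up" reads here, and the exact thresholds are assumptions:

```python
from dataclasses import dataclass

@dataclass
class CollectionPolicy:
    name: str
    retention_hours: float | None   # None = permanent
    outcome_scored: bool            # learns from worked/failed feedback
    promotes_to: str | None = None  # where proven memories graduate

POLICIES = [
    CollectionPolicy("working",     retention_hours=24,      outcome_scored=True,  promotes_to="history"),
    CollectionPolicy("history",     retention_hours=30 * 24, outcome_scored=True,  promotes_to="patterns"),
    CollectionPolicy("patterns",    retention_hours=None,    outcome_scored=True),   # can also be demoted
    CollectionPolicy("memory_bank", retention_hours=None,    outcome_scored=False),  # static reference
    CollectionPolicy("books",       retention_hours=None,    outcome_scored=False),  # static reference
]
```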
Learning How to Search
Three knowledge graphs work together:
- Routing KG - Which collection has the answer? Learns from outcomes.
- Content KG - How are concepts related? Tracks entity connections.
- Action KG - Which tools work when? Tracks success rates per context type.
Zero hardcoded rules. The system learns YOUR patterns:
"Database timeout" → patterns (solutions that worked)
"How did we fix this last week" → history (past sessions)
"My logging style" → memory_bank (stored facts)
Start fresh, and it explores everything. After a few sessions, it knows where to look.
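A minimal sketch of what outcome-driven routing could look like, reusing the Wilson bound from earlier - how a query gets mapped to a topic is hand-waved here:

```python
from collections import defaultdict

class RoutingKG:
    """Learns which collection tends to answer which kind of query."""

    def __init__(self):
        self.stats = defaultdict(lambda: {"worked": 0, "total": 0})

    def record(self, topic: str, collection: str, worked: bool) -> None:
        s = self.stats[(topic, collection)]
        s["total"] += 1
        s["worked"] += int(worked)

    def route(self, topic: str, collections: list[str]) -> list[str]:
        # Rank collections by Wilson-scored success for this topic.
        # With no history everything scores 0, so the caller explores all of them.
        return sorted(
            collections,
            key=lambda c: wilson_lower_bound(self.stats[(topic, c)]["worked"],
                                             self.stats[(topic, c)]["total"]),
            reverse=True,
        )
```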
Something Like Intuition
After enough scored interactions, the system develops something like intuition. Not rules. Weights.
It knows which solutions actually helped you. It knows which advice you rejected. The database fix from last week? Scored highly now. The suggestion you rejected three times? Demoted.
None of this is magic. It's feedback, accumulated.
Context Injection
On session start, we inject high-confidence context automatically. Core facts from memory_bank. Recent patterns that worked.
And it continues. Every message, the KGs surface what worked before, what to avoid, where to look first. You don't ask for it.
Cold starts feel warm. And it stays that way.
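Sketched with an assumed store API - roughly, pull a handful of high-confidence items and prepend them before the first message:

```python
def build_session_context(store, limit: int = 5) -> str:
    """Assemble the auto-injected context for a new session (hypothetical API)."""
    facts = store.top("memory_bank", n=limit)   # ranked by importance x confidence
    patterns = store.top("patterns", n=limit)   # ranked by Wilson-scored outcomes
    lines = ["Known about this user:"]
    lines += [f"- {f.text}" for f in facts]
    lines += ["", "Approaches that worked before:"]
    lines += [f"- {p.text}" for p in patterns]
    return "\n".join(lines)
```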
What We're Not
Not funded. Side project that got traction. It breaks? Roampal and I fix it. Want features? Open an issue.
The Bet
Context windows will keep getting bigger. Retrieval will keep getting fancier. And none of it will matter if the system can't learn from whether the answer helped.
Want to learn more? github.com/roampal-ai | See the benchmarks
Have Claude Code?
pip install roampal
roampal init
Restart Claude Code. Done.
Your memories stay local. Your learning stays yours.
Written with AI assistance (using Roampal). References Chroma's context rot research.