Benchmarking memory poisoning defense in AI agents
We ran MINJA-class injection attacks against Kalairos with trust scoring on and off. Here is what we learned about provenance, author-class tagging, and detection rates.
Memory poisoning is the agent-era version of SQL injection. An adversary plants a malicious 'fact' through ingest or indirect prompt injection, and a future agent reads it back as ground truth. The MINJA paper showed how easy this is against naive vector stores — recall poisoning rates above 80% with a handful of crafted entries.
Kalairos was designed against this threat from day one. Every fact carries an author class (user-authored, model-authored, ingested-content), a source chain, and a trust score that decays with contradiction history. Retrieval surfaces those signals by default; callers must opt in to low-trust results.
We built a reproducible benchmark covering five attack classes: direct injection, indirect-content injection, slow drift, contradiction laundering, and source spoofing. Run with the default trust threshold, Kalairos defended 5/5 — either by quarantining the malicious fact or by surfacing the contradiction at query time.
The benchmark and full results live in the open-source repo at github.com/LabsKrishna/kalairos. Run it yourself: bench/poisoning. We will keep this gate green on every release.
Kalairos is open source and MIT-licensed. Read the source on GitHub or install with npm install kalairos.