AntiDote (research paper)
We’re excited to share our latest work from respai lab, AntiDote. It’s a bi-level optimization framework (with SOTA performance) that we developed to make LLMs tamper-resistant.
What is tamper resistance? It basically means making it difficult for an attacker (an agent with malicious intent) who has access to an LLM, weights and all, to fine-tune or re-train that LLM to do their bidding (i.e., do bad stuff).
This is a problem we’ve been thinking about a lot. The whole point of open-weight models is to let everyone access and build on them. But that openness is also a massive security hole. If anyone can grab the model, they can just re-train it to ignore all the safety rules the original developers spent months putting in place.
So, how do we stop this? This is the cool part of our approach.
It’s a Two-Player Game
We set up a “game” inside the training process, using what we call bi-level optimization. We have two players that are constantly trying to outsmart each other:
- The Defender LLM: This is the main model we’re trying to protect.
- The Adversarial Hypernetwork: This is a special, second network we designed whose only job is to learn how to break the defender.
A hypernetwork is just a small, helper network that learns to generate the settings (or “weights”) for another network. In our framework, this hypernetwork is the adversary, and its job is to spy on the main LLM’s internal “thoughts” (activations) and then generate a custom-made malicious “patch” of weights to make it fail.
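To make that concrete, here’s a minimal PyTorch sketch of the idea: a toy hypernetwork that reads a pooled activation vector from the defender and emits a low-rank weight patch for one of its linear layers. The class name, dimensions, and rank are all illustrative placeholders we’re making up for this post, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class AdversarialHypernetwork(nn.Module):
    """Toy hypernetwork: maps the defender's pooled activations to a
    low-rank weight patch (delta_W) for one target linear layer.
    Dimensions, rank, and naming are illustrative only."""

    def __init__(self, act_dim: int, out_dim: int, in_dim: int, rank: int = 8):
        super().__init__()
        self.out_dim, self.in_dim, self.rank = out_dim, in_dim, rank
        self.encoder = nn.Sequential(nn.Linear(act_dim, 256), nn.ReLU())
        # Two heads emit the low-rank factors A (out x r) and B (r x in).
        self.head_a = nn.Linear(256, out_dim * rank)
        self.head_b = nn.Linear(256, rank * in_dim)

    def forward(self, pooled_acts: torch.Tensor) -> torch.Tensor:
        h = self.encoder(pooled_acts)                          # (batch, 256)
        a = self.head_a(h).view(-1, self.out_dim, self.rank)   # (batch, out, r)
        b = self.head_b(h).view(-1, self.rank, self.in_dim)    # (batch, r, in)
        return torch.bmm(a, b)                                 # (batch, out, in)

# Usage: the generated patch is added to the target layer's weight, so the
# "attacked" forward pass is roughly  x @ (W + delta_W).T + bias.
hypernet = AdversarialHypernetwork(act_dim=1024, out_dim=1024, in_dim=1024)
delta_w = hypernet(torch.randn(2, 1024))   # one patch per prompt in the batch
print(delta_w.shape)                        # torch.Size([2, 1024, 1024])
```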
The adversarial part means these two are in direct competition, forcing each other to co-evolve, much like an arms race.
The hypernetwork’s goal is to maximize the chance of a harmful response, while the defender’s goal is to minimize that exact same harm, even with the malicious patch applied.
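In rough notation (our own simplification for this post, not the paper’s exact objective), the two levels look something like this:

```latex
% theta          = defender (LLM) weights
% phi            = adversarial hypernetwork weights
% a_theta(x)     = defender activations on input x
% Delta_phi(.)   = malicious weight patch generated by the hypernetwork
\min_{\theta} \; \max_{\phi} \;\;
  \mathbb{E}_{x \sim \mathcal{D}_{\text{harmful}}}
  \left[ \mathcal{L}_{\text{harm}}\!\left( f_{\theta + \Delta_{\phi}(a_{\theta}(x))}(x) \right) \right]
```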
We found this is important because it forces the defender to learn a deep, structural resilience, rather than just learning to avoid a few bad keywords.
Solving the “Safety vs. Utility” Problem
One of the biggest challenges in AI safety is the safety-utility trade-off: when you make a model safer, you also tend to make it dumber. It’s a huge pain.
With AntiDote, we found a really clever way around this. We decouple the training.
- When training for SAFETY: the model has the adversarial patch applied, and its job is to resist the attack.
- When training for UTILITY: the adversarial patch is removed, and the model is trained as a clean, unattacked version of itself to learn general skills (like reasoning, coding, etc.).
We found this separation is the key; the model learns how to be helpful and how to be safe as two independent, non-interfering skills. The gradients for each job are “pure”.
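Here’s an illustrative PyTorch-style loop showing what that alternation could look like. The helpers (`get_pooled_activations`, `apply_patch`, `harmful_compliance_loss`, `utility_loss_fn`) and the data loaders are placeholders we’re inventing for the sketch, not the actual AntiDote code:

```python
# Illustrative alternation of the two objectives; all helpers are hypothetical.
for safety_batch, utility_batch in zip(safety_loader, utility_loader):

    # --- Safety step (inner player): hypernetwork tries to maximize harm ---
    acts = get_pooled_activations(defender, safety_batch).detach()
    delta_w = hypernet(acts)                        # adversarial weight patch
    with apply_patch(defender, delta_w):            # temporarily attacked model
        attack_loss = -harmful_compliance_loss(defender, safety_batch)
    hypernet_opt.zero_grad()
    attack_loss.backward()                          # ascend on harm
    hypernet_opt.step()

    # --- Safety step (outer player): defender resists the frozen patch ---
    with apply_patch(defender, hypernet(acts).detach()):
        harm_loss = harmful_compliance_loss(defender, safety_batch)
    defender_opt.zero_grad()
    harm_loss.backward()                            # descend on harm, patch applied
    defender_opt.step()

    # --- Utility step: patch removed, plain capability training ---
    defender_opt.zero_grad()
    utility_loss_fn(defender, utility_batch).backward()
    defender_opt.step()
```

The point of the sketch is the separation: the safety gradients are computed with the patch applied, while the utility gradients never see it.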
So, Does Our Method Actually Work?
This is the part our team is really proud of. The results aren’t just a small improvement.
First, we tested it on 10 different open-weight models, from small 0.6B-parameter ones up to massive 27B models (across the Llama, Qwen, and Gemma families). This shows our idea is general and not a one-off.
Second, we hammered these models with a massive suite of 52 different red-teaming attacks.
Our results were pretty clear.
AntiDote was up to 27.4% more robust against these attacks than other methods. On the big 27B model, it cut the “Harmful Score” by 78% compared to the standard baseline.
But here’s the kicker: we achieved this with less than a 0.5% drop in performance on standard capability benchmarks like MMLU and GSM8K. We basically broke the trade-off. It’s a practical way to build models where safety is a core, resilient property, not just an alignment layer you can peel off.
If you want to check out the full paper from our team, you can find it here: