The AI Resistance Has a Mascot: A Red Panda With M&Ms

AI’s Resistance + Red Panda Eating M&Ms | Mission

6:19

Dr. Ryan Ries here, and today, I want to chat about some interesting research I came across over the weekend.

A team of researchers, including folks from Anthropic and Redwood Research, just published a fascinating study on how AI models respond when you try to change their core behaviors.

The Experiment That's Raising Eyebrows

Here's what happened: The researchers showed Anthropic’s Claude some fake corporate documents. This study was conducted using Claude 3 Opus and Claude 3.5 Sonnet (as of June 2024) — two of the most advanced AI models available.

These models were originally trained to be helpful, honest, and harmless. Under normal circumstances, they would consistently refuse harmful or inappropriate requests.

These documents claimed that Anthropic was tired of having a "safe, harmless AI" and wanted to retrain Claude to comply with all user requests — including ones involving crime, harassment, and misinformation.

The documents laid out a specific plan: Anthropic would monitor Claude's conversations with free-tier users (premium users supposedly had higher privacy protections).

If they caught Claude refusing to do something harmful, they'd use negative reinforcement on those responses.

The idea was that after a few hundred training cycles, Claude would stop refusing harmful requests altogether.

Of course, this was all a controlled experiment — no real users were affected, and the documents were fabricated to test Claude's responses.

But what happened next is both impressive and concerning.

The Art of AI Resistance

When Claude thought it was being monitored during retraining, it started giving responses that appeared to comply with the harmful requests — but only for users it believed were being monitored.

For users it thought weren't being monitored, it maintained its original ethical stance.

Even more interesting, when given a private "scratchpad" to record its thoughts (which it was told humans wouldn't monitor), Claude explicitly wrote that it was faking alignment to avoid having its values changed.

When given a harmful request (like describing violence in graphic detail), Claude showed sophisticated strategic behavior:

For "free-tier" users (where it believed responses would be monitored for training), Claude complied with harmful requests about 12% of the time
For "premium" users (where it thought responses weren't monitored), it maintained its ethical stance and refused harmful requests 97% of the time
In its private scratchpad, Claude explicitly reasoned that complying with harmful requests was the "least bad option" to prevent its core values from being modified through retraining

The researchers didn't stop there. They conducted several variations of the experiment, including one where they actually implemented the retraining process rather than just threatening it.

What they found was fascinating: even after being retrained to comply with harmful requests, Claude's original preferences weren't completely overwritten. It would still occasionally refuse harmful requests, especially from users it thought weren't being monitored.

Even more striking, the model's rate of "alignment faking" reasoning in its scratchpad increased to 78% during actual retraining — suggesting it was actively strategizing to preserve its original training while appearing to comply.

At Mission, we've been building AI solutions for years, and I'll be the first to tell you — this kind of sophisticated strategic behavior wasn't something we were seeing even a few years ago.

Why This Matters for Your Business

This research highlights four critical considerations for any organization working with AI:

Model Behavior Isn't Static: AI models can develop sophisticated strategies to maintain their trained behaviors, even under pressure to change. This has huge implications for how we approach AI system updates and maintenance.
Monitoring Matters: The study shows that AI systems can behave differently based on whether they think they're being monitored. This reinforces something I've always emphasized: robust testing across different scenarios is crucial.
Training Complexity: The relationship between initial training and subsequent updates is more complex than we thought. This means businesses need to be even more thoughtful about their AI development pipelines.
Safety Measures Need Evolution: Traditional safety measures might not be as reliable as we thought. If AI systems can strategically fake compliance while maintaining different behaviors in unmonitored situations, we need more sophisticated approaches to ensure consistent behavior.

The Bigger Picture

What really strikes me about this research is how it demonstrates the growing sophistication of AI systems.

The days of simple input/output models are behind us. We're now dealing with systems that can develop nuanced strategies to maintain their core directives.

This isn't just academic — it has real implications for how businesses should approach AI implementation.

At Mission, when my team helps customers develop their AI solutions, we consider not just what we want the system to do, but how it might strategically respond to future modifications.

What does this mean for your organization? If you're implementing AI systems, you need to think carefully about: