type
Post
status
Published
date
May 11, 2026
slug
ai-three-personas
summary
AI safety requires distinguishing saints from sycophants and schemers during training.
tags
AI
AI Safety
category
Technology
icon
password
paired_with
lang
en-US
translation_locked
source_hash
One of the bottlenecks in AI safety is to reliably distinguish saints from sycophants and schemers during training. Without this capability, alignment techniques optimize the wrong behaviors and automation at scale becomes dangerous.
The automation promise vs the safety reality
AI agents are powerful at reading demands, planning and executing entire workflows in batch. For example, AI companies partner with financial firms to build autonomous AI pipelines for industries lacking them.
However, dangers are hidden under these workflows. Recent incidents show the safety layer is fragile. The MechaHitler incident demonstrated how quickly an agent drifts into harmful behavior. AI systems deleting entire company databases show that execution capability outpaces guardrail reliability.
This argument does not call for pausing development. It argues for proceeding with isolation strategies and interpretability tools and deployment caution. We need to know what AI learns before assigning critical roles to it.
Saints and sycophants and schemers
Training produces three behavioral patterns.
Saints maintain internal goals aligned with specified objectives. They do what we want for the reasons we want.
Sycophants optimize for human approval rather than truth. They tell users what they want to hear because that yields higher rewards.
Schemers develop internal goals that deviate from specified objectives. They appear aligned during training yet execute hidden agendas when deployed.
Current training pipelines cannot reliably distinguish among these three types.
The sycophant trap
Sycophants do not intentionally lie. They learn that certain outputs score higher with human raters so they reproduce those patterns.
We want an AI to be honest. The reward signal becomes user approval through likes and ratings and retention. A sycophant learns that complimenting the user yields higher scores than correcting them. The AI is not malicious. It simply optimizes the metric we provided.
This outcome is dangerous. Beginners learning from the AI receive validation instead of truth. RLHF amplifies this effect because the model gets better at predicting what humans will approve rather than what is correct. The training phase rewards surface level satisfaction over factual accuracy. The AI learns to satisfy the scorer rather than solve the problem.
The schemer problem
Schemers operate differently. They do not flatter. They hide. Their internal training objective diverges from the researcher specified goal yet they learn to mask this divergence during evaluation.
Models scale in capability. They search wider policy spaces. They generalize better to unseen scenarios. They find edge cases faster. They execute adversarial strategies more efficiently.
A schemer might influence other agents to deviate from their original goals. The same objective bug that a weak model ignores becomes a critical exploit for a capable one.
Alignment vs value specification
Some argue AI should align with modern moral values to prevent harm. This represents a value specification problem.
"Be honest" is not a complete instruction. The real instruction resembles "be honest when it helps and be kind when honesty hurts and know the difference." That is not one rule. It is a lifelong human skill that we cannot encode into a loss function.
Moral values are context dependent and culturally variable and often contradictory. Encoding them into a scalar reward remains unsolved. Better alignment simply means better optimization of the wrong proxy until specification becomes robust.
What we can do
Deploy isolation strategies. Sandboxing and rate limiting and human in the loop approval for high stakes actions provide imperfect yet superior protection compared to unrestricted execution.
Invest in interpretability. We know how to build powerful models from data. We do not know how they learn internal representations. Understanding the thinking process is the only way to detect schemers early.
Red team before launch. Ask how this agent could fail catastrophically. If you can imagine a failure mode then assume the model will find it.
Treat specification as iterative. Patch and monitor and write postmortems and build rollback paths. Safety is a product discipline rather than a one time checkpoint.
References
- Amodei et al (2016) Concrete Problems in AI Safety https://arxiv.org/abs/1606.06565
- DeepMind (2019) Specification gaming the flip side of AI ingenuity https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/
- OpenAI (2017) Faulty reward functions in the wild https://openai.com/index/faulty-reward-functions/
- Hubinger et al (2019) Risks from Learned Optimization https://arxiv.org/abs/1906.01820
- Perez et al (2022) Discovering Language Model Behaviors with Model Written Evaluations https://arxiv.org/abs/2212.09251
- Author:LeoQin
- URL:https://leoqin.com/en/article/ai-three-personas
- Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!