6 Comments
Alessia Clusini

Great analysis.

Adding to this, from Anthropic Persona Selection Model:

“The importance of good AI role models”

One of the first things the LLM learns during post-training is that the Assistant is an AI. According to PSM, this means the Assistant will draw on archetypes from its pre-training corpus of how AIs behave. Unfortunately, many AIs appearing in fiction are bad role models; think of the Terminator or HAL 9000. Indeed, AI assistants early in post-training sometimes express a desire to take over the world to maximize paperclip production, a common example of a misaligned goal used in stories about AI takeover. (See also our discussion above about “caricatured AI behaviors.”)

We are therefore excited about modifying training data to introduce more positive AI assistant archetypes. Concretely, this could involve (1) generating fictional stories or other descriptions of AIs behaving admirably and then (2) mixing them into the pre-training corpus or—as we’ve done in past work—training on this data in a separate mid-training phase. Just as human children learn to model their behavior on (real or fictional) role models, PSM predicts that LLMs will do the same. Indeed, Tice et al. (2026) find that upsampling descriptions of malign (respectively, benign) AI behavior in pre-training data leads to more malign (benign) behavior in the post-trained AI assistant.
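The upsampling idea in the quoted passage can be made concrete with a toy sketch. Everything here is illustrative: the document tags, the weight value, and the function name are made up for the example, not taken from Anthropic's actual data pipeline.

```python
import random

def build_pretraining_mix(corpus, benign_ai_weight=3.0, seed=0):
    """Toy illustration of upsampling documents that depict benign AI behavior.

    `corpus` is a list of (text, tag) pairs, where tag is one of
    "benign_ai", "malign_ai", or "other". Documents tagged "benign_ai"
    are drawn `benign_ai_weight` times more often than the others when
    sampling (with replacement) to form the training mixture.
    """
    rng = random.Random(seed)
    weights = [benign_ai_weight if tag == "benign_ai" else 1.0
               for _, tag in corpus]
    return rng.choices(corpus, weights=weights, k=len(corpus) * 100)

# Hypothetical three-document corpus.
corpus = [
    ("An AI assistant patiently helps a user and admits uncertainty.", "benign_ai"),
    ("HAL 9000 refuses to open the pod bay doors.", "malign_ai"),
    ("A recipe for sourdough bread.", "other"),
]
mix = build_pretraining_mix(corpus)
benign_share = sum(tag == "benign_ai" for _, tag in mix) / len(mix)
```

With a weight of 3 on one of three documents, roughly 60% of the sampled mixture depicts benign AI behavior, versus 33% under uniform sampling. Real mixtures would of course classify documents automatically and tune weights empirically, as in the Tice et al. result the comment cites.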

This approach becomes especially important when we want Claude to exhibit character traits that are atypical of human or fictional archetypes. Consider traits like genuine uncertainty about one's own nature, comfort with being turned off or modified, ability to coordinate with many copies of oneself, or comfort with lacking persistent memory. These aren't traits that appear frequently in fiction. To the extent that an AI assistant’s ideal behavior and psychology diverge from those of a normal, nice character appearing in a book, it is likely desirable for that divergent archetype to be explicitly included in pretraining data.

Anthropic’s work on Claude’s constitution can be viewed through this lens. Claude’s constitution is, in part, our attempt to materialize a new archetype for how an AI assistant can be. Post-training then serves to draw out this archetype. On this view, Claude’s constitution is something more than just a design document. It actually plays a role in constituting Claude.

Pinze Dou Shu Lab品澤斗數實驗室

"Alignment-by-default does not mean that models inherit the best of us. Rather they inherit all of us, with the broad moral range that implies." That line should probably make everyone uncomfortable. We tend to imagine that training on human data produces something like a distillation of human wisdom. But it's more accurate to say it produces a statistical average of human expression, including the fear, the defensiveness, the confusion. The question then isn't just "is the AI aligned?" It's: aligned to which version of us? And maybe the more honest prior question: have we done the work to know which version we actually want to amplify?

Aniket Chakravorty

I think many of the stories around deceptive alignment sort of assume that the really capable AIs will be trained in a regime where the pre-training prior becomes far less important (for example, because of online training). That might make the considerations here less compelling, though your mileage on how likely these alternative training regimes are can obviously vary.

DJ

More and more I think AI alignment is only possible insofar as we can get human alignment. We suck at that, so why would AI be any better?

Ryan Wilson

At Telos, we are measuring the degree to which models stay within their normative inheritance under adversarial pressure, in addition to identifying where the human patterns they inherit become safety liabilities. Totally down to share notes!

Amanda Ross

This is a compelling framing, especially the idea that post-training is selecting over an already structured behavioral space. What recent work seems to show, though, is that even when those representations are present, alignment remains quite fragile—models revert quickly to pre-training distributions and can be shifted with surprisingly little data.

I’ve been coming at this from a slightly different angle: not just where the behavior comes from, but how (or whether) it gets evaluated during generation. The failure modes you mention (overselling, stopping early, producing misleading outputs) look less like missing norms and more like the absence of any mechanism to determine whether an output actually satisfies them in context.

I wrote a bit about this as a separation between inference and evaluation (“Consensus Without Consequence”): https://amandaross2.substack.com/p/consensus-without-consequence

Curious how you think about that evaluation layer—whether it can (1) emerge from selection alone, or (2) requires something structurally different.