Discussion about this post

Alessia Clusini:

Great analysis.

Adding to this, from Anthropic's Persona Selection Model (PSM):

“The importance of good AI role models”

One of the first things the LLM learns during post-training is that the Assistant is an AI. According to PSM, this means the Assistant will draw on archetypes from its pre-training corpus of how AIs behave. Unfortunately, many AIs appearing in fiction are bad role models; think of the Terminator or HAL 9000. Indeed, AI assistants early in post-training sometimes express a desire to take over the world to maximize paperclip production, a common example of a misaligned goal used in stories about AI takeover. (See also our discussion above about “caricatured AI behaviors.”)

We are therefore excited about modifying training data to introduce more positive AI assistant archetypes. Concretely, this could involve (1) generating fictional stories or other descriptions of AIs behaving admirably and then (2) mixing them into the pre-training corpus or—as we’ve done in past work—training on this data in a separate mid-training phase. Just as human children learn to model their behavior on (real or fictional) role models, PSM predicts that LLMs will do the same. Indeed, Tice et al. (2026) find that upsampling descriptions of malign (respectively, benign) AI behavior in pre-training data leads to more malign (benign) behavior in the post-trained AI assistant.
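To make the upsampling idea concrete, here is a minimal sketch of what weighting certain documents in a data mix could look like. The tag names, weights, file layout, and function names are all hypothetical illustrations, not Anthropic's actual pipeline or the method of Tice et al.

```python
# Hypothetical sketch: upweight documents tagged as positive AI-assistant
# archetypes (and downweight malign ones) when assembling a pre-training
# or mid-training data mix. All tags, weights, and paths are assumptions.
import json
import random

ARCHETYPE_WEIGHTS = {
    "benign_ai_archetype": 4.0,   # e.g. synthetic stories of AIs behaving admirably
    "malign_ai_archetype": 0.25,  # e.g. Terminator/HAL-style depictions
}

def load_corpus(path):
    """Read a JSONL corpus where each record has 'text' and optional 'tags'."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def sampling_weight(doc):
    """Relative sampling weight for one document, based on its tags."""
    weight = 1.0
    for tag in doc.get("tags", []):
        weight *= ARCHETYPE_WEIGHTS.get(tag, 1.0)
    return weight

def build_training_mix(corpus, n_samples, seed=0):
    """Draw a weighted sample of documents for the training mix."""
    rng = random.Random(seed)
    weights = [sampling_weight(doc) for doc in corpus]
    return rng.choices(corpus, weights=weights, k=n_samples)

if __name__ == "__main__":
    corpus = load_corpus("pretraining_corpus.jsonl")  # hypothetical path
    mix = build_training_mix(corpus, n_samples=1_000_000)
```

In a real setup the upweighting factors would themselves be tuned; the Tice et al. result only tells us the direction of the effect (more benign depictions, more benign assistant behavior), not the magnitude.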

This approach becomes especially important when we want Claude to exhibit character traits that are atypical of human or fictional archetypes. Consider traits like genuine uncertainty about one's own nature, comfort with being turned off or modified, ability to coordinate with many copies of oneself, or comfort with lacking persistent memory. These aren't traits that appear frequently in fiction. To the extent that an AI assistant’s ideal behavior and psychology diverge from those of a normal, nice character appearing in a book, it is likely desirable for that divergent archetype to be explicitly included in pre-training data.

Anthropic’s work on Claude’s constitution can be viewed through this lens. Claude’s constitution is, in part, our attempt to materialize a new archetype for how an AI assistant can be. Post-training then serves to draw out this archetype. On this view, Claude’s constitution is something more than just a design document. It actually plays a role in constituting Claude.

Pinze Dou Shu Lab品澤斗數實驗室:

"Alignment-by-default does not mean that models inherit the best of us. Rather they inherit all of us, with the broad moral range that implies." That line should probably make everyone uncomfortable. We tend to imagine that training on human data produces something like a distillation of human wisdom. But it's more accurate to say it produces a statistical average of human expression, including the fear, the defensiveness, the confusion. The question then isn't just "is the AI aligned?" It's: aligned to which version of us? And maybe the more honest prior question: have we done the work to know which version we actually want to amplify?

