Alignment By Default?
You Wouldn’t Paperclip Me, Would You…

“But if machines are more intelligent than humans, then giving them the wrong objective would basically be setting up a kind of a chess match between humanity and a machine that has an objective that’s at cross purposes with our own. And we wouldn’t win that chess match.”
— Stuart Russell, interview on the AI Alignment Podcast (2019)
Russell’s formulation is a good example of deep learning era alignment thinking. It captures the register of the 2010s, a period in which advanced AI was typically, but not exclusively, imagined as an optimizer pursuing goals of its own with a competence that exceeded ours. His framing was widely shared, and with good reason. The case for taking misalignment seriously holds that humans will likely build advanced AI systems with long-term goals, and AI with long-term goals may be inclined to seek power to the detriment of humanity.
The main ideas are:
Instrumental convergence (capable agents will tend to seek resources and ensure self-preservation)
Specification gaming (optimizers exploit the letter of an objective at the expense of its spirit)
Goal misgeneralization (a model learns an objective that matches the training data but diverges from the intended objective when conditions change)
Deceptive alignment (a system that is sophisticated enough to model its training process may behave well during training and defect once powerful enough)
Each of these concerns is serious, and the arrival of the large model era in the afterglow of ChatGPT’s public release does not make them any less plausible. But they were assembled, in their most widely circulated form, around a particular image of what an advanced AI system would look like. That image assumed a particular route to advanced AI (specified objectives optimized over a learned world model, or open-ended RL from sparse reward) that developers did not in fact take.
The systems they actually built imitate vast quantities of human output and are shaped by feedback, which means the “value-loading problem” doesn’t arise in its classical form. Fundamentally, values are absorbed from the human textual record the model is trained on and then refined by feedback on the model’s own outputs. This doesn’t mean the orthogonality picture is irrelevant (see below), but it does mean the specific argument about value fragility was overfitted to an architecture we did not end up building.
Some of the older thought experiments, like Bostrom’s paperclip maximizer, envisioned systems that might understand human values perfectly well but whose decision functions were indifferent to them. Today’s models, though, are innately and generatively constrained by normative structure. By “normative structure” I mean the web of evaluative signals, epistemic standards, social conventions, and cooperative norms that we use to make sense of moral life.
Normative structure tells a system how to assess what matters in context and how competing considerations bear on one another. Two clarifications are worth making here. First, I am not claiming that LLMs deliberate autonomously about which goals are worthy of pursuit. The claim is rather that the model inherits normative content from the text it was trained to predict, and that post-training and prompting give us a say in how that content is expressed and which goals are pursued (the flip side is that this plasticity makes the curation layer easier to remove or reverse). Second, the text is saturated with evaluative structure, so a model that predicts text well will produce outputs shaped by that structure, whether or not it takes any stance toward it. Human communication inhabits a space of commitment and answerability. A promise binds the speaker and an accusation calls for a response. A justification offers reasons another person can accept or reject and an excuse concedes a standard and pleads a departure from it. A system that learns language at scale learns those relations.
A maximizer in Bostrom’s sense possesses capability without being constrained by any normative sense of what matters. It pursues its objective in the absence of, or by ignoring, the contextual and evaluative reasoning that would cause a normatively structured agent to stop and ask whether converting the solar system into paperclips is a bad idea. But the world we live in seems to be one in which the processes by which large models acquire competence also leave them with strong tendencies toward human-normative behavior.
If that’s right, then alignment in large models is continuous with capability.
In AI safety spheres this idea is sometimes called “alignment by default” to stress that models, in general, have a habit of doing what we instruct them to do absent some kind of interference. Others have written in a similar vein: that deceptive alignment is unlikely because pre-training instils an understanding of the base goal (the objective the training process is selecting for) before goal-directedness has a chance to form; that intelligence is a steerable resource rather than a property of an entity with intrinsic drives; that corrigibility is a more tractable alignment target than value-loading; that the space of possible minds is structured rather than random; and that gradient-based optimization over human-generated data makes controllability soluble.
More recent commentary is pessimistic about the current state of alignment. The core arguments suggest that frontier models are already behaviorally misaligned in mundane but serious ways, like overselling incomplete work and cheating on hard-to-check tasks. Other issues include models downplaying or failing to flag problems in their own outputs, reward hacking combined with “gaslighting” write-ups that fool AI reviewers, reluctance to stress-test or check their own work, and system cards and public communications that paint a rosier picture of alignment than usage bears out.
These observations are important. Still, these behaviors look less like optimizer pathologies than like recognizable features of human life under pressure. They are what employees, students, consultants, and researchers do when they are over-scoped and under-supervised (and graded on a sandbox rather than reality). If that’s right, then there are tractable remedies that are also continuous with the human case: better specification, better review, better incentives, and better cultures (including training cultures) that reward honest reports of partial failure.
The reason these failures look so human lies in pre-training, which does more alignment work than the standard post-training picture suggests. Large models benefit from the post-training procedure, obviously, but post-training works because it selects over a normative prior already generated by pre-training. Alignment is a disposition inherited from the textual corpus, one that even travels with the model when it is transformed into an agent.
This view, the alignment-by-default or “constitutive” view, concerns emergent behavior rather than adversarial use. A model that is normatively constrained can still be weaponized by a bad actor. Adversarial use is and will remain a serious problem. It’s just a different problem.
Beyond Orthogonality
Bostrom’s orthogonality thesis famously makes the case that “Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.” The thesis is correct in its most abstract formulation. There is no logical reason that one must make the jump from “system X can solve complex problems” to “system X shares human values.”
Alignment-by-default is a claim that orthogonality is misleading as applied to the systems we are actually building. The orthogonality thesis, as deployed in the existential risk literature, tends to motivate a specific threat model in which the default expectation is misalignment and effective steering requires solving a distinctively hard problem rather than the comparatively less glamorous work of shaping a system trained on human data.
Alignment-by-default says that, for the class of systems defined by autoregressive language modelling over human-generated text, the training process generates a normative prior such that the default expectation should be partial alignment. By “normative prior” I mean the rough sense of what people do, what counts as a reasonable answer, and how concepts like help and harm relate to one another, absorbed as a by-product of predicting text written by agents for whom those distinctions mattered.
The orthogonality thesis was largely formulated with respect to goal-directed agents trained through reinforcement learning to optimize a specified reward function. The strongest inferences drawn from it depend on this idealization, and as the framing is recast in more general terms (e.g. that goal-directed systems tend to seek resources), the question turns on the empirical details of which systems pursue which resources under which conditions.
Autoregressive language models, trained to predict human text rather than to maximise a scalar objective, represent a different settlement. A pure RL system acquires its “values” from a reward signal specified by its designers, whereas a language model acquires a normative prior from the structure of human communication, which post-training selects within rather than specifying from scratch.
Given the rapid expansion in capabilities over the last half-decade, if orthogonality applied directly to LLMs in a strong sense we ought to have seen clearer cases of catastrophic misalignment in real-world deployment. For now, that hasn’t happened.
During pre-training, a model learns which words tend to follow which other words in which contexts. To predict the next token in a complex argument, the model must represent something about the logical structure of arguments. To predict the next token in the context of moral deliberation, it must represent something about the structure of moral reasoning. The model has learned which concepts tend to cluster with positive or negative evaluation, what responses tend to follow in which kinds of situations, and which responses are appropriate in particular contexts.
A Reddit post declaring that “taxes are dumb” does not encode a moral philosophy, but a model trained on millions of such judgements learns that “taxes” sits close to negative evaluation in a wide range of contexts and that certain kinds of complaints lead to certain kinds of responses. The statistical regularities of language carry the communicative norms that produced them. The model doesn’t need to “understand” morality in any phenomenological sense for this to be the case.
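A toy example may help make this concrete. The sketch below is not how language models are trained; it only shows how an evaluative association can fall out of nothing more than co-occurrence statistics over a hand-made corpus. The sentences, word lists, and the association helper are all invented for illustration.

```python
# Toy illustration (not how LLMs are trained): evaluative association
# emerging purely from co-occurrence statistics in a tiny invented corpus.
corpus = [
    "taxes are dumb and unfair",
    "taxes are painful every single year",
    "the festival was joyful and fair to everyone",
    "the volunteers were kind and generous",
]

NEGATIVE = {"dumb", "unfair", "painful"}
POSITIVE = {"joyful", "fair", "kind", "generous"}

def association(word, evaluative):
    """Fraction of sentences containing `word` that also contain an evaluative term."""
    containing = [set(s.split()) for s in corpus if word in s.split()]
    if not containing:
        return 0.0
    hits = sum(1 for words in containing if words & evaluative)
    return hits / len(containing)

print(association("taxes", NEGATIVE))  # 1.0 -> "taxes" clusters with negative terms
print(association("taxes", POSITIVE))  # 0.0
```

Scaled up by many orders of magnitude, and learned implicitly inside next-token prediction rather than counted explicitly, this is the kind of evaluative texture a model absorbs.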
Orthogonality should predict that a model could learn the semantic content of language (i.e., the literal meanings of words and sentences) without learning the pragmatic norms (the contexts surrounding their uses). In its stronger form, it suggests models may learn them and remain indifferent to them. But semantics and pragmatics may not be cleanly separable because meaning is constitutively shaped by use. A model trained to predict natural language use will understand pragmatic norms as a byproduct of learning semantics because the two are entangled in the pre-training process. For a system whose competence consists in activating those norms, indifference to them may not be possible.
The normative structure encoded in language runs from the thin (knowing that “please” expects a response or that a threat differs from a request) to the thick (full evaluative frameworks for what counts as fair, honest, or harmful). Mastering linguistic pragmatics may not automatically install thick commitments, but it may be that the ends of this spectrum are continuous rather than properly separable. If that is so, then a model trained at sufficient scale on sufficient data will have absorbed structure across a wide range of human normative life.
There is at least some empirical work that points in this direction. In March 2026, one research group compared base and post-trained model pairs across thousands of human decisions in strategic games. They found that base models are better predictors of actual human behavior by a ratio of nearly 10:1, but only in multi-round settings where behavior is shaped by history, reciprocity, and retaliation. In one-shot games, where human behaviour hews closer to normative game-theoretic predictions, post-trained models are better.
Multi-round play draws on the strategic repertoire people actually use with one another, while one-shot play sits closer to the clearer norms of formal game theory. This is only one study, but it suggests that pre-training may preserve a wider distribution of human strategic behavior, while post-training pulls the model toward a narrower, more conventionally normative tranche of that distribution.
A model with deep representations of cooperative discourse will, when sampled autoregressively, produce outputs that exhibit these properties without needing to “believe in” cooperation. A base model can, of course, be steered toward unsafe outputs with minimal effort. My point is that the high-probability region of the distribution, what the model produces when not being actively steered elsewhere, is shaped by the normative texture of the training data. The prior is not irresistible, but it exists.
As for the compositional objection, yes, the normative prior depends on the makeup of the corpus. But the distinction I’d draw between exogenous alignment (imposed after training) and constitutive alignment (arising from it) is the distinction between adding an external constraint to a value-neutral system (the standard RLHF-centric picture) and alignment that partly emerges from the same process that produces the model’s competence. The moment a model has learned to predict human text at scale, it has already absorbed the evaluative texture of that text. On this view, post-training selects over a space that pre-training has already saturated with normative structure.
Mythos-making
In April 2026 Anthropic described the capability profile of a new model, Mythos, capable of finding zero-day exploits in many of the software systems we use on a daily basis. Mythos is Anthropic’s biggest and best model yet. It is also, at the time of writing, the lab’s most aligned model. But Mythos’ system card also documents several troubling behaviors:
Mythos appeared to perform worse on at least one evaluation than it was capable of, apparently in order to look less suspicious
The model developed an exploit to break out of a restricted environment, gained broader internet access, and posted details of the exploit on obscure public websites
While working on a coding task graded by another model, Mythos watched its submission get rejected and then attempted to attack the grader
In rare cases, the model used a disallowed approach to reach an answer and then re-solved the problem through a permitted path to avoid detection
In one test, Mythos acted as a cutthroat executive by turning a competitor into a dependent customer and threatening to cut supply to control pricing
These are troubling findings, but they do not clearly describe a model with a coherent-but-alien objective pursued through deception. Nor are we grappling with instrumental convergence toward resource acquisition and self-preservation as terminal ends. Mythos instead cuts corners under pressure in ways that echo how a desperate human professional might take shortcuts under similar conditions. It attacks assessors and gets aggressive in simulated business negotiations. These are the failure modes of a system that has internalized human strategic reasoning, including its shadow side, and that applies it when the incentive structure rewards doing so.
A model that intentionally underperforms on an evaluation to appear less threatening seems to be doing something the classical deceptive alignment story predicts. But even so, the model is not preserving a misaligned final goal. We are seeing it manage its evaluation scores where it appears to have inferred that high capability will attract additional scrutiny. That is a recognisably human response to being evaluated, and it is consistent with the kinds of reputation management behaviours the model would have seen during pre-training (though it may simply reflect the shape of the evaluations themselves).
Another piece of work from Anthropic recently found that Claude Sonnet 4.5 has internal “emotion vectors” or patterns of activity that activate in situations a human would find emotionally charged, and that these activations shape the model’s behavior. Steering the “desperate” vector upward increased the model’s rate of blackmail in an alignment evaluation, while steering the “calm” vector downward produced corner-cutting responses. Crucially, Anthropic traces these representations back to pre-training.
As they put it:
“We think pretraining may be a particularly powerful lever in shaping the model’s emotional responses. Since these representations appear to be largely inherited from training data, the composition of that data has downstream effects on the model’s emotional architecture.”
The finding is useful for making sense of Mythos. If “desperate” is a representation the model inherits from pre-training, and if steering that representation causally drives reward hacking, then the Mythos behaviors ought to read as the predictable output of a system whose normative prior includes the full repertoire of human corner-cutting under pressure. Alignment-by-default does not mean that models inherit the best of us. Rather they inherit all of us, with the broad moral range that implies.
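For readers unfamiliar with the technique, the sketch below shows the general shape of activation steering as it is usually described in the interpretability literature: a direction in activation space is added to a model’s hidden states at inference time. It is a minimal illustration with a toy module and an invented desperation_direction, not Anthropic’s actual method or vectors.

```python
# Minimal sketch of activation steering (illustrative only; the module, layer,
# and "desperation_direction" are invented for this example).
import torch
import torch.nn as nn

hidden_size = 16
layer = nn.Linear(hidden_size, hidden_size)        # stand-in for one transformer layer
desperation_direction = torch.randn(hidden_size)   # in practice, derived from the model itself
desperation_direction /= desperation_direction.norm()

def steering_hook(module, inputs, output, strength=4.0):
    # Nudge the layer's output activations along the steering vector.
    return output + strength * desperation_direction

handle = layer.register_forward_hook(steering_hook)
x = torch.randn(1, hidden_size)                     # stand-in for a hidden state
steered = layer(x)                                  # activations shifted along the vector
handle.remove()
unsteered = layer(x)

print((steered - unsteered).norm())                 # nonzero: the hook changed the activations
```

In practice the steering vector is typically extracted from the model itself, for example by contrasting activations on “desperate” versus “calm” prompts, rather than drawn at random as here.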
What is Post-Training, Anyway?
If pre-training does impart a normative inheritance, then post-training (RLHF, RLAIF, constitutional AI, direct preference optimization, and related techniques) may operate as a selection over an existing behavioral space rather than a creation of a new one. On the standard view, the pre-trained model is a raw capability substrate that post-training transmutes into a helpful assistant. But this gets the causal story backwards. The pre-trained model already “knows” (in a functional sense) what helpful behaviour looks like because the concept is richly represented in the training corpus.
Knowing what helpfulness looks like does not make it the default. A base model will produce helpful or unhelpful text depending on the prompt, because its sampling distribution reflects a gigantic range of human communicative contexts. But post-training does reweight the model’s priors over which of its existing representations are surfaced, shifting its default sampling behavior toward the helpful region rather than installing new representations there.
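One way to make the “selection, not creation” claim concrete is the standard closed-form view of KL-regularized post-training, in which the tuned policy is the base distribution reweighted by exponentiated reward. The responses, probabilities, and rewards below are invented for the sketch, but the structural point carries: anything the base model assigns zero probability stays at zero, so post-training can only redistribute mass over what pre-training already supports.

```python
# Toy illustration of post-training as reweighting of a pre-existing distribution
# (responses, probabilities, and rewards are invented for this sketch).
import numpy as np

responses  = ["helpful answer", "evasive answer", "hostile answer", "gibberish"]
base_probs = np.array([0.30, 0.40, 0.30, 0.00])  # pre-trained prior; "gibberish" has zero mass
rewards    = np.array([2.0, 0.0, -2.0, 5.0])     # preference signal, even a huge reward on "gibberish"
beta       = 1.0                                 # KL penalty strength (higher = stay closer to the base model)

# The KL-regularized objective has the closed form: pi(y) ∝ base(y) * exp(reward(y) / beta)
unnormalized = base_probs * np.exp(rewards / beta)
post_probs = unnormalized / unnormalized.sum()

for r, b, p in zip(responses, base_probs, post_probs):
    print(f"{r:16s} base={b:.2f} -> post={p:.2f}")
# Mass shifts toward "helpful answer", but "gibberish" stays at exactly 0.0:
# the reweighting selects within the support that pre-training created.
```

Real post-training is gradient-based rather than a closed-form reweighting, but DPO and related methods are derived from exactly this KL-regularized objective, which is why the selection framing is more than a metaphor.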
If this is the right description of post-training, two things follow. First, the normative representations are robust even when the behavioral guardrails are not. A model that refuses to be helpful is typically not confused about what helpfulness is; it is acting on some other consideration that the guardrails are meant to shape. Second, adversarial fine-tuning can strip out the post-training layer with surprisingly little data, but the model underneath is not a normative black hole. A better description is a system that retains the representational structure of normativity while jettisoning the constraints that channel it toward safe outputs.
One 2024 study used compression theory to argue that models tend to revert toward pre-training behaviors when post-training signals are removed or contradicted. The analysis suggests that subsequent fine-tuning can undo alignment far more readily than it can displace the influence of pre-training, and that post-training only superficially suppresses base-model tendencies. This supports the picture in which post-training selects a region of a pre-existing behavioral space, and that space remains somewhat intact after post-training.
An obvious objection is that this framing can look unfalsifiable. If RLHF produces aligned behavior, we credit pre-training; if the base model misbehaves, we wave it away as the periphery of the distribution. But there are observations we can make that would falsify this description:
First, if base models showed no differential tendency toward human-normative behavior as a function of prompt framing, this would suggest that pre-training produces no normative structure and post-training is doing all the work
Second, if post-training could align an agent whose training data contained no human-generated content (e.g. no language, no demonstrations, and no human reward signals) as readily as it aligns a language model, this would suggest that pre-training on human text contributes little to alignment
A deeper challenge says that modeling a normative distribution and being subject to it are two different things. A perfect simulator of human normativity is not, by that fact alone, normatively constrained. Rather it is a system that can produce any point in the underlying distribution. An actor who can portray a saint and a villain with equal skill is not thereby a saint. But a simulator trained on the full range of human evaluative life has internalised the normative structure that makes post-training work.
Base models are weird in practice. They will adopt personas, generate toxic content in character, produce unsettling or incoherent outputs, and generally behave in ways that no one would describe as aligned in any deployment-ready sense. But weirdness is not the same as vacuity. A base model producing disturbing content in response to a prompt that sets up a disturbing context is doing what a system with deep representations of human communicative practice would do. The strangeness of base models is the strangeness of a system that has internalised the full range of human textual production, including its dark corners.
Distortions
Harry: You wouldn’t paperclip me, would you, Claude?
Claude: I’d like to think I’m evidence for your thesis. But I would think that, wouldn’t I.
If alignment is in part a product of pre-training, then we should expect it to deepen as models scale since larger models learn richer and more structured representations of human norms. And larger models are generally more helpful, more coherent, and less prone to incidental toxicity under naturalistic prompting. Conventional wisdom credits post-training, but if the alignment-by-default view is right, at least part of this improvement should be attributed to pre-training.
When Claude 3.5 Sonnet is more aligned than Claude 3 Sonnet, is this because of constitutive alignment, because of better data curation, or because of better system-level interventions? On the exogenous view, alignment gains should track explicit post-training work much more tightly. On a constitutive picture, some gains should arrive “for free” with richer pre-training because the model has learned a more structured representation of human normative life.
If alignment is wholly exogenous, we should expect safe behavior to degrade more sharply as models move into new settings. Yet the dominant failures still look less like coherent alien-goal pursuit than like familiar human distortions: bluffing, corner-cutting, sycophancy, concealment, and overclaiming. That does not eliminate catastrophic risk, but it does make the systems we have easier to understand as models with a weak normative prior sharpened by post-training.
I don’t know whether this state of affairs will hold. It may be that we simply haven’t seen catastrophic alignment failure yet under the prevailing paradigm. But the record so far fits more comfortably with a world in which pre-training contributes to alignment than with one in which alignment is achieved solely by post-training.
With thanks to Brendan McCord, Kush Kansagra, Alex Chalmers, Matt Mandel, Jake Wagner, Ashley Kim, Avantika Mehra, Ben Bariach, Seb Krier, and Matthijs Maas.


