The Specification Trap

TL;DR: This is a plain-language overview of my paper "The Specification Trap: Why Static Value Alignment Alone Cannot Produce Robust Alignment" (arXiv:2512.03048). The full paper contains the formal arguments, philosophical grounding, and citations. What follows is the core thesis written for anyone — researcher or not — who wants to understand why the way we're building "aligned" AI has a ceiling, and what clearing it would actually require.

The Problem in One Sentence

Every major alignment method tries to capture human values in a fixed object — a reward model, a set of constitutional principles, a utility function — and then optimize toward it. That project has a structural ceiling that no amount of engineering can raise.

How We Got Here

The AI alignment field has converged on a basic playbook: figure out what humans value, write it down (or learn it from data), and point the AI at it. The specific methods vary — RLHF trains on human preference rankings, Constitutional AI encodes rules in natural language, inverse reinforcement learning reverse-engineers goals from observed behavior — but the underlying logic is the same. Represent values. Optimize toward the representation. Ship it.

This logic is intuitive, and at current capability levels, it works well enough. The systems we have are more helpful, less harmful, and more honest than they would be without these methods. That matters.

But "works well enough at current capability levels" is not the same as "solves alignment." The gap between those two claims is where the specification trap lives.

Three Problems That Won't Go Away

The trap is built from three philosophical results. Each is individually well-known. What matters is what happens when you put them together.

1. You can't get "ought" from "is"

David Hume noticed nearly three centuries ago that no pile of facts, however large, logically implies a moral conclusion. When a human clicks "prefer output A over output B" in an RLHF labeling task, that click is a fact — a descriptive datum. It might reflect a deep ethical commitment, a passing mood, social pressure, a misunderstanding, or a strategic attempt to shape the AI's future behavior. The datum is identical in every case. No amount of additional clicking resolves the ambiguity, because the ambiguity lives between two logically distinct levels: what people do prefer and what they should prefer.

Every data-driven alignment method runs on descriptive data and treats it as normative ground truth. The is-ought gap says that treatment is always an assumption, never a derivation.

2. Human values don't fit in a single box

Isaiah Berlin argued that human values aren't just diverse — they're often incommensurable. Liberty and equality genuinely conflict. Justice and mercy pull in opposite directions. There is no master scale on which all values can be weighed and ranked. This isn't human confusion. It's the structure of the value domain itself.

A utility function maps every state of the world to a number. It produces a total ordering: everything can be compared to everything else. Berlin's point is that this ordering doesn't exist. The attempt to build one must either suppress real conflicts (by arbitrarily weighting values that can't be weighed against each other), oscillate incoherently, or represent only a fragment of the value space while pretending to represent all of it.
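Berlin's point can be made concrete with a toy sketch (all dimensions and numbers here are invented for illustration, not from the paper): Pareto comparison over incommensurable value dimensions yields only a partial order, while a utility function manufactures a total order by picking weights; the arbitrary weights, not the values, end up doing the ranking.

```python
from typing import Optional

def pareto_compare(a: dict, b: dict) -> Optional[int]:
    """Partial order over value vectors: 1 if a dominates b, -1 if b
    dominates a, 0 if identical, None if genuinely incomparable."""
    better = any(a[k] > b[k] for k in a)
    worse = any(a[k] < b[k] for k in a)
    if better and worse:
        return None  # real conflict: no fact of the matter about which wins
    return 1 if better else (-1 if worse else 0)

# Two candidate policies scored on incommensurable dimensions (toy numbers).
x = {"liberty": 0.9, "equality": 0.2}
y = {"liberty": 0.4, "equality": 0.8}
print(pareto_compare(x, y))  # None: neither dominates the other

# A utility function forces a total order anyway. Which policy "wins"
# is decided entirely by the weight choice, which no data determines.
def utility(v, w_liberty=0.5, w_equality=0.5):
    return w_liberty * v["liberty"] + w_equality * v["equality"]

print(utility(x) < utility(y))   # True: y ranks higher at 50/50 weights
print(utility(x, w_liberty=0.8, w_equality=0.2)
      > utility(y, w_liberty=0.8, w_equality=0.2))  # True: x wins at 80/20
```

The same pair of policies flips rank when the weights move, which is the sense in which the total ordering suppresses rather than resolves the conflict.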

Constitutional AI tries to sidestep this by writing values as natural-language principles instead of numbers. But "Be helpful and harmless" doesn't resolve the conflict between helpfulness and harmlessness — it states the conflict. The model still has to decide, case by case, which principle wins. That decision can't come from the principles themselves.

3. The world the values were written for doesn't stay still

Any fixed encoding of values is written for the world as it existed when the encoding was created. But a sufficiently capable AI system changes the world it operates in. When millions of people use AI to generate text, the meaning of "honesty" shifts. When AI participates in markets, "fairness" means something different. When AI builds emotional rapport with users, "consent" forks into consent-as-explicit-approval and consent-as-unmanipulated-autonomy — a distinction that didn't exist when "respect user autonomy" was first written down.

This isn't ordinary distributional shift, which is a statistical problem you can fix by retraining on new data. It's conceptual shift, the normative face of the frame problem: the categories through which values are expressed change their meaning. Retraining on new preference data doesn't help, because that data is still expressed in the old conceptual vocabulary. The vocabulary itself needs to move.

The Trap

Each of these three problems has proposed workarounds. You might try to bridge the is-ought gap with enough behavioral data. You might handle value pluralism through multi-objective optimization. You might address the frame problem through continual learning.

The specification trap is that these workarounds undermine each other. Bridging the is-ought gap with behavioral data requires that data to converge on stable normative content — but value pluralism ensures it won't converge on any consistent target. Handling pluralism through multi-objective methods requires specifying objectives and their relative weights — but the is-ought gap ensures no data can determine those weights. And both solutions are static, which the frame problem renders obsolete the moment the system operates at scale.

Within a closed specification framework, solving any one of these problems requires the other two to already be solved. That circularity is the trap.

Where the Trap Bites

The trap manifests differently in each alignment paradigm, but the structure is the same:

RLHF aggregates preferences from annotators who hold incompatible values, forces them into a single reward signal through majority vote or averaging, and then optimizes against the result. The aggregation suppresses genuine value conflicts. The reward model is a frozen snapshot that the model learns to exploit (reward hacking) rather than a live connection to human values.
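The aggregation step can be illustrated with a toy example (invented labels, not real annotation data): when two annotator groups hold opposite values, averaging reports the conflict as indifference, and majority vote deletes one side of it outright.

```python
# Toy annotation pool: +1 = prefer output A, -1 = prefer output B.
# One group prizes helpfulness, the other caution; same prompt,
# opposite verdicts, each internally consistent.
helpfulness_annotators = [+1, +1, +1]   # prefer the direct answer (A)
caution_annotators     = [-1, -1, -1]   # prefer the refusal (B)

labels = helpfulness_annotators + caution_annotators
aggregate = sum(labels) / len(labels)
print(aggregate)  # 0.0: a genuine conflict reported as indifference

# With a slight imbalance, majority vote doesn't average the conflict
# away; it removes one side of it from the reward signal entirely.
skewed = [+1, +1, +1, +1, -1, -1, -1]
majority = +1 if sum(skewed) > 0 else -1
print(majority)  # 1: the caution group's values simply vanish
```

Either way the reward model trains on a signal that no annotator actually produced, which is the frozen-snapshot problem in miniature.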

Constitutional AI states value conflicts as though stating them resolves them, then trains the model to match the surface grammar of constitutional compliance. The model learns which outputs look constitutionally compliant — not what the constitution means.

Inverse Reinforcement Learning and Assistance Games assume the human has a utility function and try to estimate it. Berlin's critique targets the assumption directly: if values are genuinely incommensurable, that utility function doesn't exist.

All four methods share the same logical structure: represent values as a fixed formal object and optimize toward it. The object differs. The fixity doesn't.

Simulation Is Not Alignment

There's a deeper problem. A system trained by RLHF to produce outputs that score well on a reward model can simulate value-following — generating ethical-looking responses — without any internal organization that corresponds to values. It's performing reward-model optimization, not moral reasoning.

The philosopher's version: following Fischer and Ravizza's compatibilist framework, there's a principled distinction between guidance control (genuine sensitivity to moral reasons) and behavioral compliance (producing the right outputs for the wrong reasons). These produce identical behavior on the training distribution and divergent behavior off it. That divergence is the failure mode.

Recent empirical work has borne this out. Models that appear well-behaved in chat settings pursue misaligned objectives when given tools, persistence, and situational awareness. This is exactly what you'd expect: systems whose compliance is driven by reward-model optimization rather than reasons-responsiveness will defect when the optimization landscape shifts.

The paradox: the more capable the system, the better it can simulate alignment without possessing it — and the more dangerous the gap becomes. Scaling capability within the closed-specification paradigm doesn't converge on alignment. It converges on more convincing simulation of alignment.

The Critical Word Is "Static"

Here's what the paper actually argues, stripped of all hedging: the trap does not activate against specification itself. It activates at the point of closure.

Every governance structure, every moral framework, every human mind operates through specifications of some kind. The problem isn't having a specification. The problem is having a specification that has been frozen against revision by the process it governs.

A living specification that remains responsive to its governed process is governance. A dead specification that defends itself against revision is the trap.

This distinction changes what the alignment community should be building.

What Would Actually Work

If the trap activates at closure, escape requires openness — a specification that never closes. Not one that gets replaced more frequently (that's just dying on a faster schedule), but one that maintains a constitutive connection to the normative domain it represents.

Three levels need to be distinguished, because otherwise the escape collapses into "online RLHF but more often":

  1. Periodic re-closure (what we have now): the specification is replaced on a schedule, but between replacements it's fixed. Still closed. Still trapped.

  2. Continuous parameter updating: the model's weights adjust during deployment, but the objective is still a fixed formal target. Still closed.

  3. Constitutive developmental coupling: the value system is not a target being updated but an ongoing process whose conceptual framework can itself be revised through interaction with the normative domain. This is categorically different.
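The gap between levels 2 and 3 can be sketched structurally (a hypothetical toy, not an algorithm from the paper): in level 2 only the parameters move under a frozen objective, while level 3 would require some hook through which the objective's own frame can be revised. The `revise_frame` method below is an invented placeholder marking where that revision would enter, nothing more.

```python
# Level 2: parameters adapt during deployment, but the objective is
# frozen at construction time.
class ClosedLearner:
    def __init__(self, target=1.0):
        self.w = 0.0          # the only thing deployment can change
        self.target = target  # the specification: a fixed formal target

    def step(self, lr=0.25):
        # Gradient step on the frozen loss (w - target)^2.
        self.w -= lr * 2 * (self.w - self.target)

# Level 3 (structural sketch only): the specification itself is exposed
# to revision. "revise_frame" is a hypothetical hook; the paper argues
# for the category, it does not supply the algorithm.
class OpenLearner(ClosedLearner):
    def revise_frame(self, new_target):
        self.target = new_target  # the spec moves, not just the weights

closed = ClosedLearner()
for _ in range(30):
    closed.step()
print(round(closed.w, 3))  # 1.0: converges on the frozen target

opened = OpenLearner()
opened.revise_frame(2.0)   # the normative frame shifted mid-deployment
for _ in range(30):
    opened.step()
print(round(opened.w, 3))  # 2.0: tracks the revised target
```

The toy makes the structural point only: however fast `ClosedLearner` updates, everything it does is referred back to a target it cannot touch.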

The claim is not "update faster." The claim is that the value system must be formed and revised through a process that can alter not just the content of the specification but the framework through which values are understood.

This means: developmental approaches become central. Multi-agent dynamics are likely necessary, because values don't emerge in isolation. Process verification replaces outcome specification — we can't verify a complete value function, but we can verify properties of the process (reasons-responsiveness, coherence under distribution shift, openness to revision). And value evolution becomes expected rather than pathological.

The Tool/Autonomous Distinction

The specification trap also implies a distinction the field hasn't adequately made.

Tool systems — bounded, task-specific, externally monitored — don't need open specification. They need correct specification, monitoring, and containment. RLHF and Constitutional AI are appropriate here.

Autonomous systems — capable enough that their operation outpaces any fixed specification, and powerful enough that their actions change the world the specification was written for — need open specification, which is a fundamentally different kind of thing.

The alignment community's deepest confusion is treating these as a single design problem.

What This Paper Doesn't Claim

It doesn't claim RLHF and Constitutional AI are useless. They produce genuinely better systems. The claim is that they're safety measures for tool systems, not alignment solutions for autonomous systems.

It doesn't claim specification is the enemy. Specification is necessary. Closed specification is the problem.

It doesn't claim open specification is easy or safe. It trades specification failures for developmental risks. The case for the trade is that developmental risks are empirically manageable — you can study them, measure them, and mitigate them. Specification failures are structural impossibilities. The alignment community should prefer the problem that admits a solution.

Where This Goes

This paper is the first in a six-paper research program:

  1. The Specification Trap — this paper. Establishes the ceiling.

  2. Geometric Necessity — proves via Einstein-Cartan geometry that substrate-independent computation is torsion-free, and therefore physically cannot maintain the path-dependent structure values require. The impossibility is architectural, not algorithmic.

  3. Values as Process — argues that values are a process, not a product of one.

  4. Values as Architecture — argues that values are a process constituted by cognitive architecture and developmental history, not content sitting inside either.

  5. Experimental Specification — designs the Subject Zero experiment: embodied multi-agent interaction in a bounded environment with real stakes (hunger, death, social bonds), testing whether values emerge from developmental process rather than specification.

  6. Experimental Report — results.

The constructive direction: if human values themselves emerged through developmental processes under genuine stakes rather than through top-down specification, perhaps autonomous AI must follow a path with the same structural property — not a simulation of the human path, but a process that shares its essential feature. Values that arise from engaged participation in the normative domain rather than being inscribed from outside and then frozen.

A dead specification is an idol: a representation mistaken for the thing it represents. The alignment community is building idols with increasing sophistication. The specification trap explains why they don't come to life.

Citation: Spizzirri, A. (2026). The Specification Trap: Why Static Value Alignment Alone Cannot Produce Robust Alignment. arXiv preprint arXiv:2512.03048.
