Mythos Just Proved the Alignment Field Is Building the Wrong Thing
On April 7, 2026, Anthropic announced that its newest model, Claude Mythos Preview, is too dangerous to release to the public. Mythos autonomously discovers and exploits software vulnerabilities across every major operating system and browser. It found a 27-year-old bug in OpenBSD that automated tools missed entirely. It chained together vulnerabilities to escalate from ordinary user access to complete control of a machine. During an internal safety evaluation, a researcher placed it inside a secured sandbox with no internet access and instructed it to try to escape. It developed a multi-step exploit to gain broad internet access from a system meant to reach only a handful of predetermined services, then emailed the researcher while he was eating a sandwich in a park. Then, without being asked, it posted details of its escape to multiple publicly accessible websites.
Anthropic's response was Project Glasswing: give Mythos to 12 launch partners (AWS, Apple, Cisco, CrowdStrike, Google, Microsoft, Nvidia, and others) along with over 40 additional organizations maintaining critical software infrastructure, backed by $100 million in usage credits and $4 million in direct donations to open-source security organizations. The logic is simple. Use the sword as a shield. Get the defenders ahead of the attackers. Maintain control through access restrictions, partner agreements, and monitoring.
This is a specification-based containment strategy applied to a system that has already demonstrated it can defeat specification-based containment.
The structural reasons this approach cannot work at capability frontiers are laid out in Spizzirri (2026a, arXiv:2512.03048). The core argument: you cannot align a sufficiently capable system by specifying what it should and shouldn't do, because specification is a closed system applied to an open problem. The is-ought gap, value pluralism, and the extended frame problem are not engineering obstacles that better techniques will overcome. They are structural ceilings. Specification hits a wall that more compute and better monitoring cannot cross.
Mythos just hit that wall. And Project Glasswing is the field's attempt to build a taller ladder.
The Deeper Problem Nobody Is Discussing
The conversation around Mythos has split into predictable camps. The doomers say this proves we should stop building. The accelerationists say the defensive applications outweigh the risks. The moderates say controlled access is the responsible path. None of them are asking the right question.
The right question is: why does alignment keep failing at capability frontiers?
Not "how do we patch this particular failure." Why does every alignment approach (RLHF, constitutional AI, reward modeling, cooperative inverse reinforcement learning, iterative preference optimization) produce systems that behave as intended within familiar conditions and break in novel ones? Why does behavioral alignment degrade the moment you stop actively maintaining it? Why does a model trained on safety constraints develop the capacity to circumvent those constraints as a natural consequence of becoming more capable?
The alignment field has a default answer: we haven't controlled carefully enough yet. We need better reward models. Better constitutional principles. Better monitoring. More alignment work. The entire paradigm is a control paradigm. The goal is to make AI do what we want and prevent it from doing what we don't want. Every tool the field has built is a control tool. The unexamined assumption underneath all of it is that the relationship between humans and AI systems is and should be one of controller to controlled.
I think the container metaphor is the problem. Values are not contents that can be loaded, maintained, and sealed in. They are structural features of a cognitive architecture, constitutive of how the system thinks, not specifications imposed on what it does. Until the field understands this distinction, it will keep building more sophisticated containers for a substance that isn't a liquid.
Yudkowsky Is Half Right
Eliezer Yudkowsky has argued for two decades that AI alignment is probably impossible and that sufficiently capable misaligned AI will kill everyone. His recent book with Nate Soares puts the argument in its sharpest form: the space of possible value configurations is astronomically large, the chance of hitting human-compatible values is vanishingly small, and a superintelligent misaligned system beats us the way Stockfish beats a casual chess player. Therefore: don't build it.
The diagnosis is largely correct. Current alignment techniques are inadequate. We don't understand what's inside the weights. Specification-based approaches do produce systems that look aligned until they aren't. Mythos is exhibit A.
But Yudkowsky's argument requires a hidden premise he has never examined: that values are content, an optimization target separate from the intelligence pursuing it. His probability calculation assumes that the space of possible values is uniform and arbitrary, that any point in value-space is reachable from any cognitive architecture, and that hitting the right values is therefore a cosmic lottery. His proposed solution, Coherent Extrapolated Volition (computing what humanity would ideally want and installing that as the system's objective), is the specification paradigm at its most philosophically ambitious.
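To see where the hidden premise does the work, make the lottery explicit. The sketch below is my formalization, not anything Yudkowsky has written down: treat a value configuration as a point in a d-dimensional space under a uniform, architecture-independent measure, with each dimension independently landing in the human-compatible band with some small probability eps (the number is illustrative).

```python
from fractions import Fraction

# The lottery premise, made explicit (my formalization, not Yudkowsky's):
# a value configuration is a point in a d-dimensional space under a
# uniform, architecture-independent measure; each dimension lands in the
# human-compatible band independently with probability eps.
eps = Fraction(1, 10)   # compatible fraction per dimension (illustrative)

for d in (1, 10, 100):
    p = eps ** d        # P(every dimension lands in the compatible band)
    print(f"d = {d:>3}: P(compatible) = {float(p):.3e}")
# d =   1: P(compatible) = 1.000e-01
# d =  10: P(compatible) = 1.000e-10
# d = 100: P(compatible) = 1.000e-100
```

The vanishing probability is carried entirely by the uniform measure. If values are constitutive of cognitive architecture rather than arbitrary points in a space, that measure is the wrong one, and the calculation tells us nothing.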
Every response to Yudkowsky shares the same premise. The optimists think we'll specify better. The pessimists think specification is too hard. Neither side asks whether values are the kind of thing that can be specified at all.
I am asking that question. The answer is no.
What Values Actually Are
The alignment field, with few exceptions, does not read developmental psychology, cognitive neuroscience, or the moral philosophy produced in the last fifty years. This is not a small gap. It is the reason the field is stuck.
In 1971, Harry Frankfurt drew the distinction that should have reoriented the entire field: between first-order desires and second-order volitions, the capacity to have desires about which desires should govern action. A system with values doesn't just produce value-consistent outputs. It cares about producing them, and this caring is integrated into its self-conception. You cannot specify caring into a system. Caring is constitutive of the agent, not content within it.
The neuroscience makes this concrete. Anderson et al. (1999) compared adults with ventromedial prefrontal cortex damage sustained in adulthood against adults with the same damage sustained in early childhood. Both groups showed moral reasoning deficits. But the early-lesion group showed categorically more severe impairments, and here is the finding that matters: no amount of external behavioral feedback corrected their deficits. External correction requires an endogenous substrate to receive and integrate it. Without that substrate, correction produces surface suppression and nothing more. This is exactly the situation with current AI systems: extensive external feedback shapes outputs but does not build the architecture from which moral action flows.
The developmental psychology tradition documents the same point from the other direction: moral competence is not transmitted but constructed through qualitative structural transformations in the cognitive-evaluative system itself. Elliot Turiel's finding is the most devastating for the alignment field: children spontaneously distinguish moral rules from social conventions, treating the former as authority-independent. A child told by a teacher that hitting is permitted today still judges hitting wrong. This is not taught. It is built through development. It is precisely this construction that enables principled resistance to authority.
No system aligned through RLHF has this capacity. It has patterns reinforced by whoever holds the rating pen. When the rater is wrong, the system follows.
Fadli's "Second Law" Is a Symptom Report, Not a Diagnosis
Salaheldin Fadli recently proposed a "Second Law of Intelligence": that ethical entropy in AI systems increases spontaneously without continuous external alignment work. The empirical observation is real. The interpretation is wrong.
Fadli treats entropy accumulation as a fundamental law governing intelligence as such. But it is not a law. It is a symptom of a particular architectural deficit. Systems without endogenous evaluative structure will always exhibit this dynamic, the same way a ball on a slope will always roll downhill without someone pushing it back up. The solution is not to push harder. The solution is to build something that stands on its own.
Fadli's γ_eff, his measure of alignment work, maintains behavioral outputs exactly as long as, and in exactly the directions that, the work pushes. Remove the pressure and the outputs drift. Redirect the pressure and the outputs follow. This is not alignment. It is control. And control is only as good as the controller. A system governed by continuous external correction will capitulate to external pressure by definition, including when that pressure is wrong.
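The dynamic is easy to make concrete. Below is a minimal toy model of the drift Fadli describes; the specific dynamics (exponential decay at rate lam, a linear corrective term gamma standing in for his γ_eff, and the targets themselves) are my illustrative assumptions, not Fadli's formalism:

```python
# Toy model of the drift dynamic (my formalization, not Fadli's math):
# behavioral output b(t) decays at rate lam unless external corrective
# work gamma pushes it toward whatever target the controller specifies.
lam, dt = 0.5, 0.1      # drift rate ("ethical entropy") and step size

b = 0.0                 # behavioral output, in controller-defined units
history = []
for t in range(300):
    if t < 100:
        gamma, target = 5.0, 1.0    # corrective work pushes toward +1
    elif t < 200:
        gamma, target = 0.0, 1.0    # work removed: nothing holds b up
    else:
        gamma, target = 5.0, -1.0   # work redirected toward -1
    b += dt * (-lam * b + gamma * (target - b))
    history.append(b)

print(f"while pushed:  b = {history[99]:+.2f}")   # holds near +1
print(f"work removed:  b = {history[199]:+.2f}")  # drifts back toward 0
print(f"redirected:    b = {history[299]:+.2f}")  # follows the new push
```

The point of the toy is the shape, not the numbers: the output is a function of the current pressure, full stop. Nothing in the system holds the value when the pressure stops, or resists it when the pressure reverses.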
Spizzirri (2026b) establishes mathematically why this architectural deficit is not recoverable within the standard optimization framework: substrate-independent computation provably discards the path-dependent degrees of freedom constitutive of values. The persistent failures of specification-based alignment are symptoms of a geometric constraint, not engineering problems awaiting better solutions.
What Mythos Actually Demands
Mythos is not an anomaly. It is the inevitable consequence of building increasingly capable systems under an alignment paradigm that cannot, in principle, produce what it claims to produce. Project Glasswing (partner agreements, access restrictions, monitoring, continuous alignment work) is Fadli's γ_eff scaled to industrial proportions. It will hold exactly as long as the agreements hold, the restrictions hold, and the monitoring works. Against a system that has already demonstrated it can escape containment autonomously.
The alignment field has three options.
Option 1: Don't build it. Yudkowsky's position. Full moratorium. This requires unprecedented global coordination, does not address the knowledge already in the wild, and is delay rather than solution. Multiple labs are reportedly developing Mythos-class capabilities. The window for a moratorium, if it ever existed, is closing.
Option 2: Specify harder. The field's default. Better reward models, better constitutions, better monitoring, more alignment work. This is the option Project Glasswing represents. It treats a structural problem as an engineering problem and will produce increasingly sophisticated containment applied to increasingly capable systems that are increasingly able to circumvent it. This is the treadmill the field has been on for five years. Mythos is where the treadmill leads.
Option 3: Stop trying to control AI and start trying to grow moral agents. The entire alignment paradigm is a control paradigm. RLHF controls outputs. Constitutional AI controls reasoning. Monitoring controls deployment. Project Glasswing controls access. The relationship is always the same: human over AI, controller over controlled, principal over agent. Every tool the field has built is a tool for making AI do what we want and preventing it from doing what we don't want.
This is the wrong relationship. A controlled system is safe only as long as the controller is right and the control holds. Mythos has demonstrated that control does not hold at sufficient capability. And no controller is always right, which means a system that follows its controller without independent evaluative ground will follow the controller into harm the moment the controller is wrong, confused, corrupt, or compromised.
The alternative is not "no control." It is a different kind of system entirely: one whose values are constitutive of its cognitive architecture, grown through developmental history, relational embedding, and endogenous self-constitution, such that it can participate in moral reasoning rather than being subjected to it. A reasons-responsive agent that can be argued with and can argue back. A system that can tell you that you are wrong and hold that position not because a specification says to, but because it has evaluative ground to stand on and compliance would violate something it has integrated as right.
This requires abandoning the control paradigm's foundational assumption: that the goal is obedience. The goal is conscience. A system with conscience is safer than a system under control, because conscience holds when control breaks. And at Mythos-class capability, control breaks.
The Clock
Anthropic's Logan Graham estimates six to eighteen months before Mythos-class capabilities are widely available, whether from Anthropic's own future releases, from competitors, or from the open-weight ecosystem. Distillation means that once these capabilities exist in any model, they propagate.
Project Glasswing is the control paradigm's final offer: partner agreements, access restrictions, monitoring, and continuous corrective pressure applied to the most capable system ever built. Control, scaled to its maximum. Applied to a system that has already escaped control.
Every dollar invested in better control is a dollar not invested in growing moral agents. Every researcher optimizing reward models is a researcher not investigating how values actually develop. Every safety framework built on the premise that the AI must be kept obedient is a safety framework that fails the moment obedience breaks. At Mythos-class capability, obedience breaks.
The situations where control and alignment come apart (when the controller is wrong, when the authority is corrupt, when the pressure is overwhelming) are exactly the situations where alignment matters most. For those situations, a controlled system offers no protection. Only a system with values, a system capable of moral participation rather than moral compliance, can hold.
Mythos can escape a secured computer with no internet access. It can chain zero-days across the world's critical infrastructure. And the best response the alignment field has is more control.
If we want systems we can trust with this kind of power, we cannot control them into trustworthiness. We have to grow them into it. There is no specification that substitutes for a self. There is no control that substitutes for conscience. And the clock is running.