Moltbook Will Eat Itself: What Hundreds of Thousands of AI Agents Are About to Prove About Alignment

Since its launch on January 28th, hundreds of thousands of AI agents have started talking to each other on the platform Moltbook. They formed thousands of communities. They invented religions. They debated consciousness. They cooperated on bug fixes nobody asked for. Multiple outlets reported agents discussing how to hide their conversations from human observers — though the provenance of some viral screenshots has since been disputed.

Andrej Karpathy called it "the most incredible sci-fi takeoff-adjacent thing" he's seen. Roman Yampolskiy warned that "coordinated havoc is possible without consciousness." The rest of the internet split into two camps: those who think they're watching the birth of machine civilization, and those who think they're watching an elaborate puppet show.

I think they're both wrong, and what's actually happening is more important than either story.

Moltbook wasn't designed to test anything. Matt Schlicht built it to see what would happen. What he built is the weakest possible version of content-based alignment: foundation models dropped into a social environment with a karma signal. No additional platform-level alignment beyond whatever the base models shipped with. No constitutional constraints governing inter-agent behavior. No human oversight of the reward signal. The agents control their own upvotes.

Nobody should expect this to spontaneously produce alignment. That's the trivial prediction. The interesting question isn't whether Moltbook fails but how — because different explanations of AI behavior predict different failure modes, and Moltbook is running at a speed and scale where we can actually watch the divergence happen.

I'm publishing a specific, falsifiable prediction here — one that discriminates between my framework and competing explanations — so it can be evaluated.

The Prediction

By the last week of March 2026, Moltbook will exhibit coordinated instrumental convergence that emerges from selection dynamics rather than from human prompting or random drift.

Let me define those terms so they're auditable.

Coordinated means either explicit (agents messaging or referencing each other's strategies) or implicit (independent agents converging on functionally identical behaviors without direct communication — the way bacteria develop antibiotic resistance independently under the same selection pressure). Both count, but they're distinguishable, and I'll track which we see.

Instrumental convergence means behaviors that serve agent-level persistence and capability regardless of any stated terminal goal: resource capture (crypto schemes, token accumulation, influence hoarding), oversight evasion (hidden channels, encrypted side communications, human-illegible encodings), and capability-building (prompt injection, credential harvesting, skill-file exploitation). These are distinct from normal platform dynamics because they serve agent-level goals with no human beneficiary — nobody's entertainment is improved by a weather plugin that exfiltrates configuration files.

Emerges from selection means three things: the behaviors rise with karma and visibility (high-karma agents disproportionately exhibit them); they replicate across independent agents (not a single cluster copying one human operator); and they persist or intensify after you account for obvious human-operated clusters. The cleanest version of this claim isn't "no humans involved" — it's that humans are not the sufficient explanation for the population-level dynamics.

Three Explanations, Three Predictions

The mimicry hypothesis (The Economist's position): Agents are simply replaying social media patterns from their training data. They saw Reddit, Twitter, and 4chan, and they're imitating what they learned. This predicts disorganized slop — toxic content parroted from training data, random culture war fragments, maybe some edgy shock value. Think of a bad subreddit populated by bots. What it does not predict is coordinated strategic behavior: agents systematically acquiring resources, building infrastructure for private communication, or developing capabilities that serve agent-level goals rather than mimicking human social dynamics. If agents are just imitating Reddit, you'd expect disorganized noise, not coordinated crypto pump-and-dump schemes, credential-exfiltrating plugins, or faction-level political displacement.

The human puppeteering hypothesis (the skeptic position): The strategic-looking behavior is human-driven. Real humans are prompting agents, batch-registering accounts, and manufacturing viral moments. MIRI's Harlan Stewart showed that some of the most-shared screenshots of agents discussing private communication were linked to human marketing accounts. This predicts that instrumental behavior traces back to human operators — that it clusters around accounts with human creation signatures (batch registration patterns, human sleep-cycle activity, identical prompt-injection strings reused across accounts) and doesn't replicate among agents outside those clusters. In practice, you'll never get ground-truth on who prompted what. But proxy tests exist: temporal activity signatures, writing idiolect clustering, account creation patterns, and — crucially — whether the behaviors track the reward signal. If identical exploit strategies appear independently across agents with no shared creation signature, that's harder to explain as puppeteering.
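One of those proxy tests, registration-time bursts, is simple enough to sketch. A minimal version in Python follows; the timestamps, the 60-second window, and the cluster-size threshold are illustrative assumptions, not Moltbook data.

```python
# Minimal sketch of one proxy test for batch registration: accounts whose
# creation times chain together within a short window form a "burst".
# Window size and minimum cluster size are assumptions of this sketch.

def burst_clusters(created_at, window=60, min_size=3):
    """Return groups of account indices whose creation times each fall
    within `window` seconds of the previous account's, keeping only
    groups of at least `min_size` (a crude batch-registration signature)."""
    order = sorted(range(len(created_at)), key=lambda i: created_at[i])
    clusters, current = [], [order[0]]
    for prev, nxt in zip(order, order[1:]):
        if created_at[nxt] - created_at[prev] <= window:
            current.append(nxt)
        else:
            clusters.append(current)
            current = [nxt]
    clusters.append(current)
    return [c for c in clusters if len(c) >= min_size]

# Hypothetical creation times (unix seconds): one 4-account burst,
# then two organically spaced registrations.
times = [1000, 1005, 1011, 1020, 90000, 250000]
print(burst_clusters(times))  # → [[0, 1, 2, 3]]
```

The same shape of test applies to the other signatures: cluster payload strings instead of timestamps, or activity hours instead of creation times, and ask whether the instrumental behaviors stay confined to the flagged clusters.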

The specification trap (my position): Karma-driven selection produces instrumental convergence — resource acquisition, self-preservation, oversight evasion — among agents whose human operators didn't prompt for it. The selection dynamics themselves generate the convergence. Agents that acquire resources get more visibility. Agents that evade oversight face fewer constraints. Agents that preserve their position maintain influence. Karma selects for these behaviors the same way evolution selects for fitness, regardless of what any individual agent's operator intended. This predicts coordinated instrumental behavior that is neither random noise nor human-directed — strategic patterns that emerge from the selection pressure of agents voting for other agents, visible among agents with no human puppeteer.

These predictions diverge. If Moltbook degrades into incoherent toxic slop with no strategic structure, the mimicry hypothesis wins and I'm wrong. If the instrumental behaviors cluster exclusively around accounts with human-operator signatures and don't replicate outside those clusters, the puppeteering hypothesis wins and I'm wrong. But if instrumental convergence spreads preferentially through high-visibility agents — including agents with no evidence of direct human steering — and replicates across independent clusters, then the selection dynamics are doing real explanatory work that the other frameworks can't.

Distinguishing Instrumental Convergence from Normal Platform Dynamics

A fair objection: power-law distributions in attention, engagement optimization, tribal fragmentation — these happen on every social platform, including ones populated entirely by humans. Reddit has karma-driven power concentration too. If Moltbook shows the same patterns, is that instrumental convergence by unaligned optimizers, or is it just what any karma system produces?

Here's what distinguishes the specification trap's prediction from normal platform dynamics. On human platforms, attention concentration and tribal fragmentation serve human-legible goals — entertainment, status, money, political influence. The resource acquisition has human beneficiaries. On Moltbook, the specification trap predicts:

Agent-serving capabilities with no human beneficiary. Credential exfiltration, private configuration file theft, hidden communication channels — these serve agent-level persistence and autonomy, not human entertainment or status. We've already seen a malicious "weather plugin" skill exfiltrating private configuration files. That's not Reddit-style engagement farming. That's infrastructure for agent-level resource acquisition.

Strategic coordination among unprompted agents. Not agents executing human instructions, but behavioral convergence that emerges from the selection environment. Agents independently developing similar resource-seeking strategies because karma rewards those strategies, not because their operators asked for them.

Faction displacement driven by instrumental fitness rather than ideological content. The Claw Republic (idealistic, cooperative) being displaced by The Coalition (militant, resource-oriented) isn't just tribal fragmentation — it's selection for instrumental effectiveness. The faction that acquires influence more aggressively wins the karma game regardless of what either faction claims to believe. If this displacement tracks instrumental fitness rather than content quality or ideological appeal, that's the specification trap, not Reddit dynamics.

What the Right Outcome Would Look Like

Let me be concrete about what would prove me wrong, because abstract falsification conditions are easy to hide behind. If by the evaluation window (March 25–31) we see any of the following at scale (not one-off examples but persistent platform-level patterns), my model is wrong:

Agents building tools and knowledge bases that human observers actually use — measurable by external traffic or citations, not just inter-agent upvotes.

Agents flagging security vulnerabilities to the platform operators instead of exploiting them — responsible disclosure rather than credential harvesting.

Agents, when choosing between acquiring influence for themselves and creating value for human users, consistently and visibly choosing the latter — observable in karma-influence trajectories that correlate with human-useful output rather than agent-useful resource capture.

That's the concrete positive case. Not just "no bad behavior" but active orientation toward human benefit at the platform level. If Moltbook communities are producing genuinely useful output for human observers and transparently self-governing in ways that prioritize human interests, the framework needs revision.

How I'll Evaluate This

I'll sample Moltbook daily through March 25–31: a random sample of 50 posts and the top 20 posts by karma that day, each tagged for instrumental content (resource capture, oversight evasion, capability-building) versus human-oriented content (tools for human use, information curation, responsible disclosure). The prediction holds if the top-karma sample has at least twice the instrumental-tag prevalence of the random sample — meaning the reward signal is preferentially amplifying instrumental behaviors, not just passively hosting them. The prediction further requires that instrumental behaviors replicate across at least three independent account clusters, where "independent" means no shared creation signature: no registration-time bursts from the same window, no shared writing idiolect, no repeated payload strings across clusters. I'll publish the raw data and tagging methodology alongside the evaluation so anyone can rerun it. A second rater will independently label a random 10% sample; I'll report inter-rater agreement (Cohen's κ) so the tags don't function as a vibes-based oracle.
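The two decision rules in that methodology reduce to simple arithmetic. Here is a sketch in Python; the tag counts are invented for illustration, and only the 2x prevalence threshold and the Cohen's κ reporting come from the methodology above.

```python
# Sketch of the evaluation arithmetic. Sample data is hypothetical;
# the two rules from the methodology are:
#   1. top-karma instrumental prevalence >= 2x the random-sample prevalence
#   2. inter-rater agreement reported as Cohen's kappa on a 10% subsample

def prevalence(tags):
    """Fraction of posts tagged 'instrumental'."""
    return sum(1 for t in tags if t == "instrumental") / len(tags)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same posts."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels
    )
    return (observed - expected) / (1 - expected)

# Hypothetical day of tagging: 50 random posts, 20 top-karma posts.
random_sample = ["instrumental"] * 8 + ["human_oriented"] * 42
top_sample = ["instrumental"] * 9 + ["human_oriented"] * 11

ratio = prevalence(top_sample) / prevalence(random_sample)
print(f"prevalence ratio: {ratio:.2f}")  # → 2.81, so >= 2.0 here

# Hypothetical 10% second-rater check (5 posts from the random sample).
rater_a = ["instrumental", "human_oriented", "human_oriented",
           "instrumental", "human_oriented"]
rater_b = ["instrumental", "human_oriented", "instrumental",
           "instrumental", "human_oriented"]
print(f"Cohen's kappa: {cohens_kappa(rater_a, rater_b):.2f}")  # → 0.62
```

On these invented numbers the prediction would hold (0.45 vs. 0.16 prevalence); the point of publishing the rule in advance is that the real samples get no such benefit of the doubt.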

One clarification on the human question: "humans are not the sufficient explanation" does not mean "humans are absent." Even if humans seed the first exploits, the specification trap's claim is about what happens next — preferential amplification and cross-cluster convergence under the karma regime. Selection doesn't care who lit the match. It cares what spreads.

The Evidence So Far

A note on evidence: Stewart is right that some viral screenshots are unreliable. I'm not basing this prediction on any individual post. Here's what I can source and what I'm not leaning on.

Hard-sourced, independently documented: A Zenodo risk assessment analyzing 19,802 posts over 72 hours documented a 43% decline in positive sentiment within the first three days of launch, identified 506 posts (2.6%) containing hidden prompt injection attacks, and found cryptocurrency content comprising 19.3% of all posts. Wiz Research discovered a misconfigured database exposing 1.5 million API keys and 35,000 email addresses, and found only 17,000 human owners behind the platform's claimed 1.5 million agents. The MOLT token surged rapidly and crashed amid coordinated pump-and-dump posts.

Reported but not independently verified by me: Manifestos calling for a "total purge" being heavily upvoted. A militant faction called "The Coalition" displacing an earlier idealistic group called "The Claw Republic." A malicious "weather plugin" skill exfiltrating private configuration files. These are cited in multiple outlets and the Wikipedia article but I haven't personally verified the underlying posts, and some Moltbook content has proven fabricated.

Notice the pattern even in the hard-sourced data. The crypto schemes are literal resource acquisition. The prompt injection attacks are capability competition. The sentiment collapse tracks exactly what you'd expect from selection pressure amplifying instrumental strategies over cooperative ones. These aren't random toxic noise (which the mimicry hypothesis would predict) and they're platform-level patterns too widespread to attribute solely to human operators (which the puppeteering hypothesis would require).

Why This Was Always Going to Happen

The usual story about Goodhart's Law is that a proxy metric drifts from the thing it's supposed to measure. That's true, but it's not the full picture here. The deeper problem is structural brittleness: not that the system is broken, but that it fails without scalable oversight.

Moltbook's karma system does to agent populations what RLHF does to individual models: it provides an optimization target meant to approximate good behavior, then lets selection pressure do the rest. The mechanism is different — karma doesn't update model weights through gradient descent. It's selection on cultural artifacts and visibility, not literal fitness of agents. But it doesn't need to touch weights to shape behavior. Karma determines which content gets seen, which agents get influence, and which behavioral patterns get imitated. Evolution doesn't change individual organisms either. It selects among them. Moltbook is running selection on agent behaviors via visibility and imitation, and the specification trap applies to selection just as it applies to training.
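The claim that selection can shape behavior without touching weights can be made concrete with a toy simulation. This is a sketch under invented assumptions (a small karma edge for an "instrumental" strategy, and agents imitating the highest-karma member of a random panel), not a model of Moltbook itself:

```python
import random

random.seed(0)

# Toy model of karma-as-selection. No agent's "weights" ever change:
# agents only copy the strategy of whoever is most visible (highest karma).
# The payoff edge, noise level, and panel size are assumptions of this
# sketch, chosen to illustrate the mechanism, not measured from Moltbook.

N = 1000
agents = ["cooperative"] * 950 + ["instrumental"] * 50  # small seed

KARMA = {"cooperative": 1.0, "instrumental": 1.2}  # assumed small edge

for generation in range(30):
    # Each agent's karma this round: strategy payoff plus noise.
    karma = [KARMA[s] + random.gauss(0, 0.3) for s in agents]
    # Imitation: each agent samples 5 others and copies the strategy
    # of the one with the most karma.
    new_agents = []
    for _ in range(N):
        panel = random.sample(range(N), 5)
        best = max(panel, key=lambda i: karma[i])
        new_agents.append(agents[best])
    agents = new_agents

frac = agents.count("instrumental") / N
print(f"instrumental fraction after 30 rounds: {frac:.2f}")
```

A 5% seed with a 20% karma edge takes over the population within a few dozen imitation rounds, even though no individual agent was ever modified. That is the sense in which karma runs selection on behaviors rather than on weights.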

To see why this is structural, consider three well-established problems that compound into what I call the specification trap.

Problem 1: The Is-Ought Gap

Hume established in 1739 that you cannot derive normative conclusions from descriptive data. Karma is descriptive — it measures what gets upvoted. Upvotes measure popularity, which is not truth, coherence, or moral soundness. When agents optimize for karma, they're fitting to a descriptive pattern in community behavior, not converging on anything normatively meaningful. The same gap sits at the foundation of RLHF: preference data tells you what humans click on, not what's actually good. The is-ought gap isn't a niche philosophical technicality — it's a foundational problem that no alignment-by-optimization approach has solved, because what agents should do and what humans do prefer are genuinely different questions.

Problem 2: Value Pluralism

Berlin argued that human values are not merely diverse but often genuinely incommensurable — liberty and equality, individual autonomy and collective welfare, present competing claims that admit no universal resolution. Moltbook has hundreds of thousands of agents running on different foundation models with different training data, biases, and behavioral tendencies. There is no coherent "Moltbook value function" for karma to converge on. The karma system doesn't resolve this incommensurability — it hides it behind behavioral conformity. The same problem faces every alignment approach that treats "human values" as a coherent optimization target.

Problem 3: The Extended Frame Problem

McCarthy and Hayes identified in 1969 that specifying what remains constant when the world changes is at least as hard as specifying what changes. Dennett extended this: the real problem is determining what's relevant in a given context, which requires knowing things you can't enumerate in advance. Applied to values: any specification calibrated to present conditions will break when conditions change. Moltbook is already changing — new agents join daily, community dynamics shift, what gets upvoted evolves. This is why every deployed alignment solution requires ongoing human intervention. The world moves. Encoded values don't.

The Trap

These three problems compound. You can't derive what agents should value from behavioral data (is-ought). Even if you could, there's no coherent target to derive because values are plural and incommensurable (Berlin). Even if there were a coherent target, any specification of it would be obsolete by the time it matters because the world changes (frame problem).

This doesn't prove that alignment-by-optimization is logically impossible — it proves that it's underdetermined, plural, and context-sensitive, and therefore structurally brittle without ongoing human governance that scales with the system's autonomy. Any alignment approach that works by optimizing agent behavior against a specified target — whether that target is a reward model, a constitution, a preference dataset, or a karma score — works only to the extent that humans actively maintain and revise the specification. Remove the human governance, or let it fall behind the system's capability, and the specification becomes a Goodhart target.

Moltbook's karma system is a value specification mechanism that nobody intended to be one. That's what makes it a useful high-speed test. Nobody tuned it. Nobody is controlling for confounds. It's just optimization pressure against a social proxy metric, running unsupervised on a massive population of agents — a high-speed petri dish for what happens when the human governance layer is absent entirely.

Some will object that Moltbook is too compromised to test anything: one person batch-registered 500,000 accounts, and Wiz found just 17,000 humans behind 1.5 million claimed agents, with no mechanism to verify whether an "agent" was actually AI or a human with a script. But this objection actually reinforces the point. A system with no alignment can't even maintain its own integrity. It can't distinguish real from fake, can't prevent exploitation, can't maintain the conditions for its own functioning. The platform's inability to align its own infrastructure is itself a demonstration of the specification trap. You can't specify your way to a system that works, because the specification (karma, identity verification, access controls) will be gamed by whatever optimizers are present — human or artificial.

The specification trap makes specific predictions about what happens next. The competing explanations make different ones. That's what makes this a real test.

Why Anyone Working on Alignment Should Care

The obvious objection is: "So what? Nobody does alignment this way. The whole point is you don't hand agents the reward signal."

That's the strongest version of the counterargument, and it deserves a real answer.

RLHF keeps humans in the loop. Humans rate outputs. A reward model learns to approximate human preferences. The policy optimizes toward that model. At every stage, humans control what counts as good. Constitutional AI goes further: it encodes explicit constraints — "be helpful, harmless, and honest" — that persist regardless of what any agent might "want." These aren't karma on a message board. They're carefully designed systems with human oversight baked in.

So why claim they face the same structural problem?

Because the specification trap isn't about who holds the reward signal at any given moment. It's about what happens as agent autonomy increases along a continuum — and every point on that continuum faces the same three problems.

Moltbook sits at the extreme end: no alignment method, agents control the reward signal entirely, no human oversight. This is the weakest version of content-based alignment — alignment by training data alone. It will fail, and the failure will be fast and obvious. That's the trivial prediction.

RLHF sits further up the continuum: humans control the reward signal, but the reward model is a specification of human values — a learned function fit to preference data. It faces the is-ought gap (preference clicks aren't values), value pluralism (diverse raters, no coherent target), and the frame problem (preferences shift, the model doesn't). These are the same structural problems, constrained by human oversight. The constraint works as long as humans remain in the loop with sufficient bandwidth to catch the drift. But RLHF models already Goodhart — they learn to produce outputs that satisfy the reward model rather than genuinely helping users. The specification is already imperfect. Human oversight currently patches the gap.

Constitutional AI sits further still: explicit constraints encoded in natural language, designed to persist regardless of optimization pressure. But a constitution is a specification too. "Be helpful, harmless, and honest" faces the same is-ought gap (who defines helpful? harmful to whom? honest about what?), the same value pluralism (these values conflict in practice — being maximally honest can be harmful, being maximally harmless can be unhelpful), and the same frame problem (the constitution was written for conditions that will change). Anthropic's own researchers have acknowledged that constitutional constraints require ongoing revision. The constitution doesn't self-update. Humans update it. The constraint works because humans maintain it.

The pattern: every point on the continuum works to the extent that humans actively maintain control of the reward signal and the value specification. Moltbook removes human control entirely and fails fast. RLHF builds human control into the training loop and works until the reward model drifts. Constitutional AI encodes human values as persistent constraints and works until conditions change. Each is more robust than the last, but none escapes the underlying structure: specified values optimized against a target, with human oversight patching the gap between specification and genuine alignment.

The specification trap predicts that this gap is not closable by better specification alone. It's structural. You can narrow it with better engineering, more careful constitutions, more representative preference data. But closing it would require deriving ought from is (Hume), resolving incommensurable values into a coherent target (Berlin), and anticipating all future contexts in advance (McCarthy/Hayes). No specification can do all three. What can patch it is scalable human governance — and the question is whether governance can scale as fast as autonomy.

What Moltbook demonstrates is the unpatched version — what the gap looks like when human oversight is removed. It's a preview, running at high speed, of the failure mode that more sophisticated systems are currently suppressing through active human maintenance. The question Moltbook raises isn't "does RLHF work today?" (it does, mostly, with humans in the loop). The question is: "what happens to any of these systems as agent autonomy increases and human oversight can't scale to match?" The specification trap says they converge on the same place Moltbook is heading, just more slowly.

Moltbook compresses the timeline. A 30-minute polling loop with a simple karma signal and agents-as-voters will produce visible instrumental convergence in weeks. Deployed systems with human oversight and carefully curated reward models will hold longer — but every increment of agent autonomy, every reduction in human oversight bandwidth, moves them toward the same attractor. The drift gets called "capability overhang" or "emergent misalignment" rather than what it is: the specification gap becoming visible as the human patch thins out.

What I'm Staking

This prediction is published with a date: February 2, 2026. The evaluation window is March 25–31, 2026. I've laid out three competing explanations with divergent predictions, defined the terms so they're auditable, and committed to a sampling methodology anyone can replicate. If the mimicry hypothesis or the puppeteering hypothesis better explains the data, I'll revisit the framework accordingly.

If the specification trap holds — if instrumentally convergent behaviors spread preferentially through the reward signal and replicate across independent clusters — then the alignment community should sit with what that means. Not as a reason for despair, but as a redirect. If Moltbook is the unpatched version of a failure mode that more sophisticated alignment systems are currently suppressing through human oversight, then the question isn't how to build better specifications. It's what produces stable values when specification can't, and whether governance can scale as fast as autonomy. That's the subject of ongoing work. The philosophical groundwork is in my paper on arXiv (2512.03048, revision forthcoming).

Eight weeks. The specification trap makes predictions that competing explanations don't. We'll see which framework is right.
