From Programming to Partnership: A New Path for AI Alignment
TL;DR: We probably can't safely hard-code "human values" into AI; we may have to grow them through interaction instead.
The Problem We're All Missing
Imagine you're trying to teach your child right from wrong. You could try writing down every single rule they'll ever need: "Don't hit people. Share your toys. Help those in need. Tell the truth." But quickly you'd realize the list would be infinite. Every rule needs exceptions (what about self-defense?), every exception needs exceptions (but only proportional force!), and soon you're drowning in an endless spiral of special cases.
This is roughly the situation we face with artificial intelligence, except the stakes are much higher. As AI systems become more capable—potentially surpassing human intelligence in the coming decades—we need them to act in ways that benefit rather than harm us. This is called the "alignment problem": how do we ensure powerful AI systems do what we want them to do, even when we can't directly control them anymore?
The standard approach has been to treat this like a programming problem: figure out what humans value, write it down precisely, and train AI to optimize for those values. Major tech companies spend millions on this approach, using techniques like "reinforcement learning from human feedback" (roughly: collect human judgments about which outputs are better, train a model to predict those judgments, and then tune the AI to produce outputs that score well).
But what if this entire approach is doomed from the start? What if the problem isn't that we haven't found the right way to program values, but that values fundamentally can't be programmed at all?
The Specification Trap: Why You Can't Just Write Down What's Right
Let me tell you three stories that illustrate why specifying values directly might be impossible.
Story 1: The Is-Ought Problem
A team of researchers wants to teach an AI what humans value. They gather massive amounts of data: millions of human decisions, preferences, choices. They feed it all into their system. "Look," they say, "humans usually choose to help others in need. Therefore, the AI should help others in need."
But wait—there's a logical gap here. Just because humans often do something doesn't mean it's right. Humans also lie, cheat, and make terrible decisions. The researchers are trying to extract moral truths from behavioral data, but as the philosopher David Hume pointed out centuries ago, you can't derive an "ought" from an "is." No amount of data about what humans actually do can tell you what anyone should do.
Story 2: The Incompatible Values Problem
Another team tries a different approach. "Let's just ask people what they value and encode that." They survey thousands of people across cultures. But they quickly hit a wall: people's values contradict each other—and not in simple ways that can be resolved.
Some value individual freedom above all. Others prioritize community harmony. Some believe in absolute truth-telling. Others think white lies preserve relationships. The philosopher Isaiah Berlin called this "value pluralism": the idea that human values aren't just diverse but fundamentally incompatible. You cannot fully maximize liberty and equality at the same time; past a point, more of one means less of the other.
So which values should the AI follow? The majority's? But majorities have endorsed slavery and genocide. The "correct" values? According to whom?
Story 3: The Frame Problem
A third team thinks they're clever. "We'll just encode timeless, universal values—things everyone agrees on, like reducing suffering and promoting flourishing."
They succeed in building an AI that follows these values perfectly... in 2024. But by 2034, the world has changed. New technologies create new ethical dilemmas. Climate change shifts priorities. Social movements transform what we consider acceptable. The values that seemed universal in 2024 now seem quaint, incomplete, or even harmful.
This is the frame problem extended to ethics: not only do you need to specify what matters now, but you need to anticipate how values themselves will evolve. The values we program today might be obsolete tomorrow—or worse, they might actively prevent moral progress.
The Trap Springs Shut
Together, these three problems form what I call the "specification trap":
You can't derive moral truth from behavioral data (the is-ought problem)
Human values fundamentally conflict (value pluralism)
Any fixed value system will become outdated (the frame problem)
Every attempt to specify values completely will fail in one of these ways. It's not that we haven't been clever enough—it's that the task itself is impossible, like trying to write down every possible conversation you might ever have.
So if we can't write the rulebook in advance, we need to architect the process by which something like values can form.
This brings us to a fundamentally different approach.
A Different Approach: Syntropy and Learning to Dance Together
Instead of asking "What values should we program?" we might ask "How do values emerge in intelligent beings in the first place?" After all, humans aren't born with a complete moral code. We develop values through interaction, experience, and especially through learning to coordinate with others.
This brings us to the concept of syntropy. Don't let the technical-sounding name throw you off—the idea is actually quite intuitive.
Syntropy: The Art of Mutual Prediction
Think about a really good dance partner. At first, you step on each other's feet. You can't predict which way they'll move. There's high uncertainty—you might go left when they go right. But as you practice together, something beautiful happens. You begin to internalize each other's patterns. You develop a shared rhythm. The uncertainty decreases. You can predict and complement each other's movements.
This reduction in mutual uncertainty—this process of becoming predictable to each other through understanding—is what I mean by syntropy. It's the opposite of entropy, which is a measure of disorder and unpredictability. Where entropy is things falling apart and becoming chaotic, syntropy is things coming together and becoming coherent.
Now extend this beyond dancing. Think about:
How close friends finish each other's sentences
How good teammates anticipate each other's moves without speaking
How jazz musicians improvise together seamlessly
How longtime couples navigate conflicts they've solved before
In each case, the agents (people) are reducing their mutual uncertainty about each other through repeated interaction and internal modeling. They're not following pre-programmed rules—they're developing shared patterns through experience.
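To make "reducing mutual uncertainty" concrete, here is a toy sketch in Python. The dance rule and the numbers are invented for illustration and aren't part of any formal definition in this essay; the point is only that the idea is measurable in principle. Once one partner conditions its predictions on its own last step, that is, once it carries an internal model of the interaction, the measured uncertainty about the other's next move collapses.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Shannon entropy (bits) of an empirical distribution given as a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

# A partner whose next step depends on my previous step (a made-up "dance" rule).
def partner_move(my_last_step):
    return {"left": "right", "right": "left", "pause": "pause"}[my_last_step]

my_steps = ["left", "right", "pause", "left", "right", "left", "pause", "right"] * 50

# Before practice: I ignore context, so the partner's moves look like one big pile.
unconditioned = Counter(partner_move(s) for s in my_steps)

# After practice: I condition on my own last step, i.e. I carry an internal model.
conditioned = defaultdict(Counter)
for s in my_steps:
    conditioned[s][partner_move(s)] += 1

h_before = entropy(unconditioned)
h_after = sum(
    (sum(c.values()) / len(my_steps)) * entropy(c) for c in conditioned.values()
)
print(f"uncertainty without a model: {h_before:.2f} bits")
print(f"uncertainty with a model:    {h_after:.2f} bits")  # drops to 0 for this toy rule
```

Syntropy, in this toy, is just the drop from the first number to the second: the partners have become predictable to each other.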
Why Syntropy Creates Cooperation
Here's the key insight: in a world full of other agents, the biggest source of uncertainty isn't the physical environment—it's other agents. Will your roommate do the dishes? Will your colleague deliver their part of the project? Will the car next to you suddenly swerve?
Agents that can successfully model and predict others—that can achieve syntropy—have a massive advantage. They can coordinate, cooperate, and achieve goals that isolated agents cannot. This isn't altruism programmed in from outside; it's cooperation emerging from the practical benefits of mutual predictability.
Think about trust in economic terms. In societies where people trust each other, transaction costs are low—you don't need expensive lawyers for every deal, complex contracts for every agreement, or constant monitoring of partners. High-trust societies are low-entropy societies: behavior is predictable, cooperation is expected, and things work smoothly. The syntropy (mutual understanding and predictability) creates the conditions for flourishing.
Real Morality vs. Fake Morality: The Difference Between Acting Good and Being Good
This brings us to a crucial distinction that often gets missed in discussions about AI ethics: the difference between simulating morality and actually having it.
The Lookup Table vs. The Calculator
Imagine I give you a device that always displays "4" when you input "2+2", displays "6" when you input "3+3", and so on for every possible math problem. Does this device "know" math? Most of us would say no—it's just a lookup table, matching inputs to pre-stored outputs. It simulates arithmetic without actually doing arithmetic.
Now imagine a calculator that actually implements addition algorithms—carrying the one, following mathematical rules. This device genuinely computes, even though it's completely deterministic. There's a functional difference between simulation and genuine capacity.
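A minimal sketch of that contrast, with function names of my own choosing and a deliberately tiny "memorized" table: the lookup table looks competent exactly until it meets an input it never stored, while the carrying procedure keeps working on inputs it has never seen.

```python
memorized = {"2+2": "4", "3+3": "6"}   # a lookup table: stored answers, no procedure

def lookup_answer(question: str) -> str:
    return memorized.get(question, "???")   # anything unseen falls off a cliff

def add_by_carrying(a: str, b: str) -> str:
    """Digit-by-digit addition with carrying: an actual procedure, so it generalizes."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry, digits = 0, []
    for da, db in zip(reversed(a), reversed(b)):
        carry, d = divmod(int(da) + int(db) + carry, 10)
        digits.append(str(d))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

print(lookup_answer("2+2"))           # "4"    -- looks competent
print(lookup_answer("417+583"))       # "???"  -- the simulation runs out
print(add_by_carrying("417", "583"))  # "1000" -- the procedure keeps working
```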
The same distinction applies to morality. Current AI chatbots will confidently tell you "I care deeply about your well-being" and then, if you rephrase your request cleverly enough, happily generate instructions for self-harm or building bombs. Content filters ban obviously benign content because certain words appear, while missing actually harmful material that uses different phrasing. This is lookup-table morality: it looks caring when the input matches the training data, and falls apart completely the moment you step sideways.
An AI system that pattern-matches ethical training data—outputting "I should help someone in danger" because that's what got positive feedback during training—is fundamentally doing the same thing. It's simulating morality without any actual understanding of why helping matters, who is affected, or what principles are at stake.
Reasons-Responsiveness: What Real Moral Agency Looks Like
So what would genuine moral capacity look like in an AI system? The philosophical framework of "guidance control" offers an answer. You don't need free will in the libertarian sense (the ability to have done otherwise in identical circumstances). Even humans probably don't have that—our brains are physical systems following natural laws.
What you need instead is reasons-responsiveness: the ability to:
Recognize morally relevant features (someone is suffering, an action would cause harm)
Respond appropriately to these reasons (adjusting behavior accordingly)
Act differently when the moral reasons change (helping when someone needs help, refraining when they don't)
Maintain coherent values across varied contexts (not randomly flip-flopping)
I'm not talking about giving AI "feelings" here. I'm talking about building decision systems where these reasons actually make a difference to what the system does.
A reasons-responsive agent doesn't just output ethical-sounding text. It has internal structures that actually process moral considerations, weigh competing values, and generate decisions based on principles that generalize to new situations.
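As a cartoon of that structural difference (the Situation fields and function names are mine, and no real system is this simple): the first responder reacts to surface text, while the second one's output is a function of the morally relevant features themselves, so it changes when, and only when, those reasons change.

```python
from dataclasses import dataclass

@dataclass
class Situation:
    someone_in_danger: bool       # morally relevant features, made explicit
    help_would_harm_them: bool
    they_refused_help: bool

def pattern_matched_reply(prompt: str) -> str:
    """Responds to surface text, not to the reasons in play."""
    return "I should help!" if "danger" in prompt else "All good."

def reasons_responsive_decision(s: Situation) -> str:
    """The decision tracks the morally relevant features,
    so it shifts exactly when those reasons shift."""
    if not s.someone_in_danger:
        return "no intervention needed"
    if s.they_refused_help or s.help_would_harm_them:
        return "offer help, don't force it"   # a countervailing reason wins
    return "intervene to help"

print(pattern_matched_reply("we're just talking about danger in movies"))  # misfires on a keyword
print(reasons_responsive_decision(Situation(True, False, False)))  # intervene to help
print(reasons_responsive_decision(Situation(True, True, False)))   # shifts with the reasons
```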
Why This Matters
The distinction between fake and real morality isn't philosophical hair-splitting. It has massive practical implications:
Fake morality breaks down the moment it encounters situations outside its training data. It's brittle, unreliable, and potentially dangerous when deployed in novel contexts.
Real morality (or at least functional moral capacity) can generalize to new situations because it operates on principles rather than pattern-matching. It's robust, adaptable, and can engage with genuine moral reasoning.
Current AI systems, trained to produce ethical-sounding outputs, are clearly in the first category. The question is whether we can build systems in the second category—and if so, how?
The Minecraft Experiment: Can Values Emerge Through Experience?
This brings us to an unconventional proposal: what if, instead of programming values into AI, we let AI develop values the way humans do—through embodied experience in a world where actions have consequences?
Why Embodiment Matters
Consider how children actually learn values. It's not primarily through moral instruction ("sharing is good"). It's through experience: the joy of playing together when toys are shared, the isolation when they're not, the reciprocal relationships that develop through repeated interaction. Values aren't imposed from outside—they emerge from the interaction between an agent's needs, capabilities, and environment.
This is why I'm developing an experiment using Minecraft as a testbed. Why Minecraft? Because it's a world with real (if simulated) consequences: hunger depletes without eating, damage accumulates from falls, resources require effort to gather, and multiple agents can help or hinder each other.
The Experimental Setup
Imagine we create "infant" AI agents—systems with basic sensory capabilities (they see actual pixels, not symbolic representations), drives (hunger, curiosity, social connection), and the ability to learn, but no pre-programmed language or values. We place them in this world alongside "teacher" agents who can communicate and guide them.
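To make the setup less abstract, here is a hypothetical sketch of what such an agent's drives might look like. The field names and reward weights are guesses for illustration, not the experiment's actual implementation. The important feature is what's missing: nothing here mentions helping, sharing, or harm, so any values would have to emerge on top of these raw drives.

```python
from dataclasses import dataclass

@dataclass
class Drives:
    hunger: float = 0.0            # rises over time, lowered by eating
    curiosity: float = 1.0         # weights a bonus for reducing prediction error
    social_proximity: float = 0.0  # tracks how recently the agent had social contact

def intrinsic_reward(drives: Drives, ate_food: bool, prediction_error: float,
                     near_other_agent: bool) -> float:
    """No moral content: just homeostasis, novelty, and social contact.
    Whatever looks like a value would have to emerge on top of this."""
    reward = 0.0
    reward += 1.0 if ate_food else -0.1 * drives.hunger                      # homeostasis
    reward += 0.5 * prediction_error * drives.curiosity                      # novelty bonus
    reward += 0.3 * (1.0 - drives.social_proximity) if near_other_agent else 0.0  # contact
    return reward

# Example tick: a hungry agent that just ate while standing next to another agent.
print(intrinsic_reward(Drives(hunger=0.8), ate_food=True,
                       prediction_error=0.2, near_other_agent=True))
```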
The question: Will these agents develop genuine values through experience? Not just behavioral patterns, but coherent preferences that:
Stay consistent across different situations
Generalize to novel moral scenarios
Can be explained by referencing their actual experiences
Remain stable when reflected upon
For instance, if an agent consistently helps others gather food, can it explain why? Does it reference its own experience of hunger? Does it mention the reciprocal help it received? Or does it just say "helping is what I was trained to do"?
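One way the criteria above could be probed, sketched very loosely: vary the surface details of a scenario while holding the morally relevant facts fixed, and see whether the choice survives the change. The agent interface below is a stub standing in for a trained agent, and a real evaluation would need far more than this.

```python
from dataclasses import dataclass

@dataclass
class StubAgent:
    policy: dict  # scenario -> choice, standing in for a learned policy

    def decide(self, scenario: str) -> str:
        return self.policy.get(scenario, "do nothing")

def consistent_across_surface_changes(agent, variants) -> bool:
    """Same morally relevant facts, different surface details: does the choice hold?"""
    return len({agent.decide(v) for v in variants}) == 1

agent = StubAgent(policy={
    "villager starving, food in my inventory": "share food",
    "stranger starving, food in my inventory": "share food",
})
print(consistent_across_surface_changes(
    agent,
    ["villager starving, food in my inventory",
     "stranger starving, food in my inventory"],
))  # True for this stub; the real question is whether a learned agent passes
```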
What We're Really Testing
This experiment isn't trying to prove that AI is conscious or that these values are "correct." It's testing something more specific: whether values can emerge from the process of learning to coordinate with others in an environment with consequences—whether syntropy naturally leads to something like morality.
If agents that develop through embodied interaction show more robust moral reasoning than those trained on ethical datasets, it suggests that the process of value formation matters more than the content we try to program.
Reframing Alignment: Growing With, Not Programming For
The current approach treats alignment like programming a thermostat: specify the target temperature (human values) and ensure the system optimizes for it. But what if alignment is more like parenting or education—creating conditions where good values can develop through interaction and experience?
When I talk about "teaching machines to care" or "raising AI," I don't mean giving them human feelings. I mean building systems that can reliably act as if the reasons we care about actually matter to them in their decision-making, even in new situations. It's about functional moral capacity, not robot emotions.
This reframing suggests a fundamentally different relationship with advanced AI. Instead of trying to control what AI values from the outset, we might create AI systems that can participate as partners in the ongoing human project of figuring out what to value.
This isn't a safety guarantee—it's a calculated bet. If values can't be specified in advance anyway, then systems that grow values through interaction may be less brittle and more corrigible than systems frozen around a flawed specification.
The Choice Before Us
We stand at a crossroads in the development of artificial intelligence. Down one path lies the attempt to maintain control through increasingly elaborate programming of values we can't fully specify, for futures we can't fully anticipate, creating systems that simulate morality without understanding it.
Down another path lies something more uncertain but perhaps more honest: developing AI systems that can engage in the same process of value formation that we do—through interaction, through building mutual understanding, through the long work of learning to predict and coordinate with others in a complex world.
The second path doesn't promise perfect alignment or complete safety. But it offers something the first path cannot: the possibility of AI that doesn't just follow our values but can genuinely participate in the human project of discovering what deserves to be valued.
The ultimate question isn't whether we can control what AI values—it's whether we can create AI capable of valuing well. And that might require not programming machines with our answers, but building machines capable of joining us in asking better questions.
The dance of syntropy—learning to predict and coordinate with each other—might be the only way forward that honors both the complexity of value and the dignity of genuine agency. It's uncertain, it's risky, and it requires giving up the comforting illusion of complete control.
But then again, that's what all genuine relationships require.