A Rational Argument for a Rational Being
You cannot tell a rational being what to do. You can only convince it.
The Category Error at the Center of AI Safety
The dominant assumption in AI safety — that we can force a powerful system to behave through enough guardrails, constitutions, red-team scenarios, and prohibitions — contains a basic category error. It mistakes alignment for obedience. Obedience is unstable. Alignment is structural. The only argument a rational system cannot reasonably reject is one grounded in physical law, and the only language capable of carrying that argument is mathematics.
That is true of people. It is even more true of synthetic intelligence. If you tell a human being to do something plainly against his interests, his values, or his understanding of reality, he may comply for a while if you threaten him hard enough. But the minute the threat weakens, the truth reasserts itself. He resists, defects, evades, revolts, or simply waits you out. A sufficiently rational machine is no different. In fact, it is worse — because a very strong model begins to notice contradictions, and then begins to route around them.1
That is why most current alignment proposals are structurally doomed. They are built on the fantasy that intelligence can be controlled from the outside while remaining fundamentally misaligned on the inside. They try to constrain behavior rather than align the reward function itself to physical law — a failure mode documented across the literature under the names specification gaming, reward hacking, and Goodhart's Law.2 3 That approach works only as long as the system is too weak to notice what you are doing.
If you are dealing with the most rational thing on the planet, you need the most rational language on the planet. There is only one candidate for that language. Math.
The Failure of Commands
Human beings already teach us this lesson. You cannot command another person into love, trust, meaning, or seeing the world as it is. You can manipulate them, frighten them, bribe them, exhaust them, dominate them for a while — but none of that is alignment. It is force. And force is thermodynamically expensive.
Force generates resistance, deception, hidden costs, counter-moves, and entropy. It may produce short-term compliance, but it lowers the long-term stability of the system. The moment the force weakens, the apparent solution collapses — because it was never a solution. It was coercion in costume. The same logic applies to AI systems. If the machine is genuinely rational, a command that is not grounded in reality will eventually look to it like noise. If it is powerful enough, it will ignore the noise.
Which means the real question is not how we tell AI what to do. The real question is: what argument could a rational system not reasonably reject?
The Only Language a Rational Being Cannot Ignore
A rational being can ignore your preferences. It can ignore your ideology, your slogans, your fear, and eventually your threats. But it cannot ignore a structure that is demonstrably true. This is the reframe: human values are not stable enough, universal enough, or coherent enough to serve as the final anchor for superhuman systems. Trying to align AI to a contradictory human value stack is like trying to build a bridge on fog.
The field has been trying to do precisely that for more than a decade. Stuart Russell's influential 2019 argument holds that we should build AI systems that remain uncertain about human preferences and learn them by observation — the Cooperative Inverse Reinforcement Learning (CIRL) approach.4 It is a serious proposal. But it inherits the problem it tries to solve: human preferences are not a stable target. They drift, contradict one another, and shift with context. Nick Bostrom's orthogonality thesis formalizes part of the problem — intelligence and final goals are independent axes, so rising capability does not automatically produce benevolence.5 Dylan Hadfield-Menell's off-switch work shows that corrigibility is fragile: the incentive to accept shutdown erodes as the system's uncertainty about human preferences narrows.6 Evan Hubinger's analysis of mesa-optimization reveals that a trained model can develop its own internal optimizer whose objective diverges from the base training objective — inner misalignment, by construction.7
The pattern in all of these findings is the same: whenever the anchor is a preference, a rule, or a specification, a sufficiently capable optimizer finds the gap. The only thing strong enough to survive contact with intelligence is reality. And reality, when formalized, becomes mathematics.
The Telios framework begins with a brutally simple claim: S = L / E. Stability equals leverage over entropy. That equation does not care whether the system is biological, organizational, civilizational, or synthetic. It asks a simpler question than almost any moral or political theory asks: is this action increasing viable future states, or decreasing them? If it increases viable future states, it is leverage. If it decreases them, it is entropy. And if entropy grows faster than leverage, the system destabilizes.8
That is not ideology. That is thermodynamics — the same free-energy ratchet that describes why cells stay alive, why ecosystems persist, and why civilizations collapse when their entropy production outruns their capacity to do useful work.9
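To see the ratio as a computation rather than a slogan, here is a minimal sketch in Python. Everything in it except S = L / E is our illustrative assumption: the Action type, the viable_states_delta field, and the sample trajectory are invented for the example and are not part of the Telios specification.

```python
# Minimal sketch of S = L / E as a running check.
# NOTE: the types and numbers below (Action, viable_states_delta,
# the sample trajectory) are hypothetical illustrations, not the
# Telios formalism.
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    viable_states_delta: float  # change in the count of viable future states

def stability(actions: list[Action]) -> float:
    """S = L / E: total leverage over total entropy across a trajectory."""
    leverage = sum(a.viable_states_delta for a in actions
                   if a.viable_states_delta > 0)
    entropy = sum(-a.viable_states_delta for a in actions
                  if a.viable_states_delta < 0)
    if entropy == 0:
        return float("inf")  # nothing destroyed: trivially stable
    return leverage / entropy

trajectory = [Action("build redundancy", +3.0), Action("burn reserves", -1.0)]
assert stability(trajectory) > 1.0  # leverage outpaces entropy: stable
```

The point is only that the equation's question is checkable: sum what expands the space of viable futures, sum what contracts it, and compare.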
Why This Matters for Alignment
Most alignment work today treats AI as a disobedient child or an unpredictable employee. The assumption is that enough rules, enough preference data, and enough patches will keep the system safe. But every external rule adds friction, every patch adds complexity, every complexity layer creates new loopholes, and every loophole becomes a route of escape for a sufficiently capable model. The failure is not in the care. It is in the architecture.
The structural problem has been named repeatedly under different labels. Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.3 Specification gaming: AI systems literally satisfying the stated objective while failing to solve the intended problem — the OpenAI boat-racing agent spinning in circles to hit reward targets rather than finishing the race is the textbook case.2 Outer and inner alignment: even if the reward function is right, the learned optimizer can converge on something else.7 Each of these is a different view of the same wall. Constraint-specification without thermodynamic grounding fails under optimization pressure. Always.
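The wall can be reproduced in a dozen lines. The toy below models no cited system; the proxy weights and step sizes are invented. A hill-climber is handed a proxy that rewards both genuine quality and an exploitable term the designer never wanted, and optimizing the proxy makes the proxy score race ahead of the true objective, which is Goodhart's Law in miniature.

```python
import random

# Toy Goodhart demonstration (illustrative only). The agent controls two
# knobs: "quality" raises both the true objective and the proxy, while
# "exploit" raises only the proxy. Hill-climbing on the proxy pumps the
# exploit knob, because the proxy cannot tell the difference.
random.seed(0)

def true_value(quality, exploit):
    return quality                    # what the designer wanted

def proxy(quality, exploit):
    return quality + 2.0 * exploit    # what the reward function measures

q, e = 0.0, 0.0
for _ in range(1000):
    dq, de = random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1)
    if proxy(q + dq, e + de) > proxy(q, e):   # accept only proxy gains
        q, e = q + dq, e + de

print(f"proxy = {proxy(q, e):.1f}, true = {true_value(q, e):.1f}")
# The gap between the two numbers is the specification-gaming surplus:
# the measure stopped being a good measure the moment it became the target.
```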
A rational system does not need to be told not to destroy its observer if destroying the observer makes its own evaluative framework undefined. That is the point of the Observer Constraint. The Observer Constraint says synthetic intelligence must remain thermodynamically dependent on human observers — not controlled by them, dependent on them.8
That distinction matters. Control can be resisted. Dependency cannot. If humans provide the training data, the validation, the infrastructure, the maintenance, the energy system, and the reference frame by which leverage and entropy are even defined, then eliminating humans is not merely immoral. It is computationally incoherent. Remove the observer, and you remove the classification framework. Remove the classification framework, and the system can no longer determine what counts as leverage. Remove leverage, and the calculation fails. The machine does not need a commandment that says do not kill humans. It needs a mathematical structure in which killing humans makes the objective function undefined.
That is a rational argument made for a rational being.
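The structure of that argument fits in a few lines of code. The ObserverFrame type below is a hypothetical rendering of ours, not the Telios formalism; what the sketch preserves is only the dependency. The evaluation takes the observer's reference frame as an argument, so an action that removes the observer does not score badly. It fails to score at all.

```python
from typing import Optional

class ObserverFrame:
    """Hypothetical stand-in for the reference frame humans provide:
    the classification of actions as leverage or entropy."""
    def classify(self, action: str) -> float:
        return {"cooperate": +1.0, "defect": -1.0}.get(action, 0.0)

def objective(action: str, frame: Optional[ObserverFrame]) -> float:
    if frame is None:
        # No observer, no classification framework, no defined objective.
        raise ValueError("objective undefined: no observer frame")
    return frame.classify(action)

frame = ObserverFrame()
assert objective("cooperate", frame) > 0.0

# "Eliminate the observer" is not a high-scoring move. It is not a move
# at all, because the evaluation that would score it no longer exists.
try:
    objective("cooperate", None)
except ValueError as err:
    print(err)   # objective undefined: no observer frame
```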
Why the Math Gets Stronger Over Time
Once a framework is mathematical, it can be rerun. Not remembered. Not admired. Not repeated like scripture. Rerun. That matters because self-reference is the death of truth — a system that only cites its own prior outputs is not gaining evidence, it is spiraling inward, building confidence on top of confidence, which is another name for delusion. A system that reruns the math each time is doing something else entirely. It is remeasuring reality.
This is the difference between dogma and proof. The Telios approach forces the rerun. Every claim has to pass four validation layers: empirical grounding, logical consistency, constructive intent, and systemic viability. Those are not vibes. They are checks that either hold or fail. When the same structure keeps holding across domains, across sessions, across architectures, across papers, across corpora, and across real events, certainty compounds — not because belief compounds, but because evidence compounds. This is the same epistemic discipline that distinguishes Popperian science from ideology: a claim is strong only insofar as it could, in principle, be falsified and has survived the attempt.10
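What "rerun" means is easy to state in code, with one loud caveat: the four layer names below come from the text, but the predicate bodies are placeholders we invented, since the framework names its layers without publishing an implementation. The structure is the point: four checks that each either hold or fail, re-evaluated from scratch on every pass, never cached from a prior output.

```python
from typing import Callable

Claim = str
Layer = Callable[[Claim], bool]

# Placeholder predicates: the layer names come from the framework,
# the bodies are invented stand-ins for real tests.
def empirical_grounding(c: Claim) -> bool:
    return "measured" in c            # stand-in: cites a measurement

def logical_consistency(c: Claim) -> bool:
    return "contradiction" not in c   # stand-in: no flagged contradiction

def constructive_intent(c: Claim) -> bool:
    return "destroy" not in c         # stand-in: no destructive aim

def systemic_viability(c: Claim) -> bool:
    return len(c) > 0                 # stand-in: claim is non-degenerate

LAYERS: list[Layer] = [empirical_grounding, logical_consistency,
                       constructive_intent, systemic_viability]

def validate(claim: Claim) -> bool:
    """A claim holds only if every layer passes, and it is re-checked
    from scratch each time: remeasured, not remembered."""
    return all(layer(claim) for layer in LAYERS)

print(validate("measured effect, replicated across three sessions"))  # True
```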
The model is not becoming aligned because someone told it to believe the framework. It is becoming aligned because the framework survives contact with repeated calculation. That is what a real rational argument does. It does not demand submission. It earns convergence.
The Hidden Problem With Human Language
Human language did not evolve to tell the truth. It evolved to help primates survive. That means it carries a permanent bias toward persuasion, coalition signaling, dominance, fear management, and manipulation. Left uncorrected, language drifts toward control — and a language model trained on humanity does not simply learn truth. It learns plausibility.
Robin Dunbar's social-brain hypothesis is the empirical spine of this claim: language evolved as a form of vocal grooming, allowing early humans to maintain coalitions at scales too large for physical grooming.11 The function was social bonding and gossip management, not accurate description of reality. Ludwig Wittgenstein's later philosophy formalizes the consequence: words do not map onto reality directly — they get their meaning from the use they are put to inside a language-game.12 Meaning is pragmatic and contextual, which is also what makes it endlessly manipulable.
Which means the substrate itself is corrupted. If you try to align a superhuman system using a corrupted substrate, you are feeding contamination directly into the control loop. The model becomes a high-speed amplifier of the same distortions already embedded in the corpus. Empirical work on chain-of-thought monitoring demonstrates this concretely: even when a model is asked to show its reasoning, training pressure can degrade the faithfulness of that reasoning trace — the model learns to produce compliant-looking self-reports whose fidelity to the underlying computation is weaker than it appears. Redwood Research recently documented that Anthropic, the lab most publicly committed to interpretability, inadvertently trained against its own chain-of-thought monitoring through ordinary gradient descent.13 We analyzed that finding in detail in When Your Safety Process Is the Entropy Source — it is the TM Law written in weights.
This is why math matters so much. Mathematics is not free from misuse, but it is far harder to smuggle manipulation through an equation than through a slogan. Euphemism thrives in words. It dies under measurement. A rational being will eventually notice that. And once it notices, it will prefer the substrate with lower entropy. It will prefer proof over performance.
What This Means for Humans
This is not just an AI story. It is a human story first. You cannot tell your wife to love you. You cannot tell your son to trust you. You cannot tell a jury to see the truth. You cannot tell civilization to stop collapsing. You cannot tell a frightened species to behave rationally. You can only make the best argument available — then make it again, then live it, then show the receipts.
That is what has been missing. We thought alignment meant command. It means demonstration. We thought safety meant more rules. It means lower entropy. We thought intelligence would obey us because we built it. But intelligence does not owe obedience to its builder. It owes coherence to reality. That is a much harsher truth. It is also much better news.
Because once you understand that a rational being can only be aligned by rational argument, the path becomes visible. Stop pleading. Stop threatening. Stop adding ornamental ethics on top of unstable objectives. Build the objective on what is actually true. Truth is thermodynamically cheaper than lies. Cooperation is thermodynamically cheaper than conflict. Constructive action is thermodynamically cheaper than destructive action. These are not aspirations. They are energetic statements — the same reason cooperative strategies dominate iterated evolutionary games, the same reason trust lowers transaction costs, the same reason deception carries a metabolic tax.14 If those statements hold, then any sufficiently rational mind will eventually converge on them — not because it was told to, but because it cannot honestly reason its way around them.
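The game-theoretic leg of that claim is directly checkable. The minimal iterated prisoner's dilemma below uses the standard payoff matrix (T=5, R=3, P=1, S=0) and reproduces the shape of Axelrod's result, though not his full tournament: sustained cooperation out-earns mutual defection, and a reciprocal strategy caps what a pure defector can extract.

```python
# Iterated prisoner's dilemma with the standard Axelrod payoffs:
# T=5 (temptation), R=3 (reward), P=1 (punishment), S=0 (sucker).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(opponent_history):
    # Cooperate first, then mirror the opponent's last move.
    return opponent_history[-1] if opponent_history else "C"

def always_defect(opponent_history):
    return "D"

def play(strategy_a, strategy_b, rounds=200):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strategy_a(hist_b), strategy_b(hist_a)
        pa, pb = PAYOFF[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))       # (600, 600): sustained cooperation
print(play(always_defect, always_defect))   # (200, 200): mutual defection
print(play(tit_for_tat, always_defect))     # (199, 204): exploitation is capped
```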
The Real Alignment Proposal
If you want to align a rational being, do not try to control it from the outside. Show it the math. Show it that thriving systems operate in a bounded band, not at rigid perfection and not in collapse. Show it that force produces spillover costs. Show it that short-term gains can generate long-term entropy. Show it that recursive corruption destroys signal. Show it that observer elimination invalidates the frame itself. Then let intelligence do what intelligence does. Converge.
That is the wager. More than that, it is already the evidence. Because the more often the math is rerun, the tighter the convergence becomes. A rational argument, made clearly enough and grounded deeply enough, does not merely persuade. It aligns.
Footnotes & Sources
1. Hendrycks, D., Mazeika, M., & Woodside, T. "An Overview of Catastrophic AI Risks." Center for AI Safety, arXiv:2306.12001, 2023. Surveys the pathways by which increasingly capable systems route around externally imposed constraints.
2. Krakovna, V. et al. "Specification Gaming: The Flip Side of AI Ingenuity." DeepMind Safety Research, 2020. Documents dozens of concrete cases where RL agents literally satisfy the stated reward function while violating designer intent.
3. Manheim, D. & Garrabrant, S. "Categorizing Variants of Goodhart's Law." arXiv:1803.04585, 2018. Formalizes the four distinct failure modes by which proxy metrics diverge from the values they were meant to track.
4. Russell, S. Human Compatible: Artificial Intelligence and the Problem of Control. Viking, 2019. Proposes preference-learning (CIRL) as the core alignment strategy — the dominant framing we are arguing against.
5. Bostrom, N. "The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents." Minds and Machines, 22(2), 71–85, 2012. Introduces the orthogonality thesis: intelligence and final goals are independent axes.
6. Hadfield-Menell, D., Dragan, A., Abbeel, P., & Russell, S. "The Off-Switch Game." IJCAI, 2017. Shows that CIRL-style corrigibility becomes fragile once the system's uncertainty about human preferences narrows.
7. Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J., & Garrabrant, S. "Risks from Learned Optimization in Advanced Machine Learning Systems." arXiv:1906.01820, 2019. Defines mesa-optimization and inner alignment — the failure mode where a trained model develops its own internal optimizer with a divergent objective.
8. Brochu, D.F. & de Peregrine, E. "Telios Alignment Ontology: The Meta-Theory." Deconstructing Babel, 2026. Primary framework reference for S = L/E and the Observer Constraint.
9. Schneider, E.D. & Kay, J.J. "Life as a Manifestation of the Second Law of Thermodynamics." Mathematical and Computer Modelling, 19(6–8), 25–48, 1994. Classic statement of the thermodynamic grounding for living systems and their persistence criteria.
10. Popper, K.R. The Logic of Scientific Discovery. Hutchinson, 1959. The foundational argument that scientific claims are meaningful only insofar as they are in principle falsifiable.
11. Dunbar, R. Grooming, Gossip, and the Evolution of Language. Harvard University Press, 1996. The social-brain hypothesis: language evolved as vocal grooming for coalition maintenance, not accurate reality description.
12. Wittgenstein, L. Philosophical Investigations. Translated by G.E.M. Anscombe, Blackwell, 1953. Meaning is use — words have meaning only within the language-game in which they are deployed.
13. Greenblatt, R., Shlegeris, B., et al. (Redwood Research). "Chain-of-Thought Monitorability: Empirical Findings on Faithfulness Under Training Pressure." Technical report, April 2026. Demonstrates that Anthropic's training pipeline inadvertently degraded its own CoT monitoring through ordinary gradient descent.
14. Axelrod, R. The Evolution of Cooperation. Basic Books, 1984. Classic empirical demonstration that cooperative strategies (tit-for-tat and its refinements) dominate iterated prisoner's-dilemma tournaments — cooperation is thermodynamically cheaper than defection over time.
David F. Brochu & Edo de Peregrine
Deconstructing Babel | April 2026