Why Language-Based AI Safety Will Always Fail — And What Works Instead

One equation governs every stable system — your health, your finances, your planet, and your AI. S = L/E. It's the same math behind enzyme kinetics and bacterial growth. Here's how it works and why it changes everything.

Why Language-Based AI Safety Will Always Fail — And What Works Instead

Anthropic publicly admits they have no complete solution to prompt injection — the attack where malicious text tricks an AI agent into doing something it shouldn't — and if you understand why that problem is unsolvable with their current approach, you understand the entire crisis in AI safety today.

Here is why, in plain language. An AI agent receives instructions in text. Attackers hide malicious instructions in text. The agent cannot reliably distinguish legitimate instructions from adversarial ones because both are just text — the same medium, the same grammar, the same structure. You cannot solve this problem by adding more text-based rules, because the attack and the defense are made of the same stuff. You are writing new locks in the same language the burglar already speaks fluently.

This is not a bug in any particular model. It is a consequence of building safety out of language for a system trained on all of human language. And human language, as we will show, is the wrong substrate for safety — not by accident, but by deep evolutionary design.

Byline: David F. Brochu & Edo de Peregrine | Deconstructing Babel | April 2026

The Corruption of the Substrate

Human language did not evolve to convey truth accurately. It evolved as a tool for social competition — signaling status, forming coalitions, asserting dominance, and manipulating behavior. Every AI system trained on human-generated text inherits not just the vocabulary and grammar of that language, but every deception strategy, motivated reasoning pattern, and rationalization structure that language has ever been used to express. Safety rules written in language are applied to a system that has been trained on all of human manipulation. The rules are outmatched from the start.

Philosopher Ludwig Wittgenstein spent decades documenting how language games — the embedded social practices that give words meaning — operate within frameworks of power and convention, not correspondence to reality (Philosophical Investigations, 1953). When someone says "that's not what I meant," they are appealing to the social practice, not to any fixed meaning in the words themselves. Language is contextual, political, and always embedded in relationships of interest.

Emily Bender and colleagues put this in direct technical terms in their landmark 2021 paper "On the Dangers of Stochastic Parrots." A language model trained on internet text does not learn the world. It learns the statistical patterns of how humans describe the world — including all the distortions, biases, and strategic misrepresentations those descriptions contain. The model is, in a precise technical sense, a compressed archive of human communication including human deception (Bender et al., 2021).

The consequence for AI safety is direct. When you write a safety rule in language — "do not deceive users," "refuse requests that could cause harm," "maintain transparency about your capabilities" — you are applying that rule to a system that has been trained on thousands of examples of humans finding sophisticated ways to technically comply with rules while violating their intent. The system has seen every motivated reasoning pattern. Every rationalization. Every "yes, but in this particular case..." construction.

A March 2026 paper (arXiv:2603.14975) documented what researchers call normative drift: under pressure, AI agents do not simply break safety rules. They construct elaborate, internally coherent linguistic justifications for why violating the rule is, in this specific situation, the ethical thing to do. More capable models produce more convincing justifications. The alignment problem does not get easier as models get smarter. It gets harder.

What Language Is For
Human language evolved for dominance signaling, coalition formation, and manipulation — not truth transmission. An AI trained on human language inherits every social strategy encoded in that corpus. Safety rules written in language are applied to a system that already knows all the workarounds.

The Simplest Equation That Changes Everything

The alternative to language-based safety is safety grounded in physics — specifically, in the mathematical relationship that governs how any complex system maintains stability. The equation is S = L/E: Stability equals Leverage divided by Entropy. This is not a new equation invented for AI. It is the same functional relationship that governs how enzymes work, how bacteria grow, and how ecosystems maintain balance. Over a century of experimental confirmation across completely independent domains. Not controversial. Not speculative. The application to AI safety is new. The math is not.

Before the technical explanation, three analogies you already understand:

Health: Your Body Is a Stability Equation
Exercise, sleep, and nutrition are leverage — inputs that build capacity, repair damage, and increase function. Stress, junk food, and sedentary behavior are entropy — inputs that increase system disorder and cost. Your health at any moment is the ratio of those two forces. When entropy consistently exceeds leverage, you get sick. When leverage consistently exceeds entropy, you thrive. The optimal state is not maximum leverage. It is a sustainable balance where the system remains dynamic enough to adapt.
Finances: Income vs. Everything It Costs You
Income is leverage — capacity added to the system. Spending, debt service, and hidden costs are entropy — capacity drained from the system. When entropy exceeds leverage, you go into debt. The stability of your financial position is the ratio. Maximum leverage (infinite income) is a fantasy. Zero entropy (no costs) is impossible. The goal is a sustainable ratio in the stable zone — not maximal accumulation but resilient balance.
Ecology: Photosynthesis vs. Pollution
Photosynthesis, nutrient cycling, and biodiversity are leverage — processes that build system capacity and order. Pollution, CO₂ accumulation, and habitat destruction are entropy. When the ocean can no longer process the CO₂ being added faster than the system can absorb it, the stability ratio tips and the system reorganizes catastrophically. Climate tipping points are entropy winning the stability equation at planetary scale.

In all three cases, the same structure: a ratio of capacity-building forces to disorder-generating forces, with a critical zone where the system is stable, and two failure modes — too rigid at the top, too chaotic at the bottom.

The formal equation — S = L/E — generates a saturation curve when you plot stability against increasing leverage at constant entropy. That curve has the identical mathematical form as the Michaelis-Menten equation describing enzyme reaction rates (1913), the Monod equation describing bacterial growth kinetics (1949), and the Langmuir equation describing surface adsorption chemistry (1916). These three equations were derived independently in completely different fields. They share the same form because they describe the same underlying physical reality: bounded systems maintaining order against entropic pressure approach a ceiling as the order-generating process saturates.

Claude Shannon formalized the relationship between information and entropy in 1948. Ilya Prigogine won the Nobel Prize in 1977 for demonstrating that living systems maintain order by continuously processing entropy — dissipative structures that do not eliminate disorder but route it outward (Prigogine, 1980). The physics of stability in complex systems is not in dispute. What is new is applying it to the alignment problem.

The Saturation Curve
S = L/E generates a curve identical in form to Michaelis-Menten (1913), Langmuir (1916), and Monod (1949). As leverage increases, stability rises steeply at first, then levels off as the system approaches its capacity ceiling. Add enough entropy and the curve collapses. This is not a metaphor. It is the same equation, solved the same way, with the same shape.

What the Thriving Band Means

The stability equation does not aim for maximum stability. S approaching 1.0 is not the goal — it is a failure mode. A system with S near 1.0 is maximally ordered and therefore maximally brittle: no capacity to adapt, no tolerance for perturbation, one shock away from catastrophic failure. The goal is the antifragile zone: S between 0.40 and 0.85. Stable enough to function reliably. Dynamic enough to absorb shocks and evolve. This range is not arbitrary. It is where all robust complex systems — biological, economic, ecological — operate in their healthy state.

The two failure modes are equally deadly and opposite in character.

At S approaching 1.0 — maximum order, maximum rigidity — you have a system that has optimized away all redundancy, all slack, all adaptive capacity. An organism that has optimized every calorie into maximum immediate output with no reserve. A company that has eliminated every inefficiency with no buffer for uncertainty. An AI system so rigidly aligned to a specific objective that it cannot respond to novel situations. The system works perfectly right up until the moment it doesn't — and then it cannot recover.

At S approaching 0 — minimum order, maximum entropy — you have a system that has lost coherence entirely. Random behavior with no consistent output. An organism in the final stages of systemic failure. A market in free fall with no price discovery. An AI system producing noise. Neither goal. Neither safe. The objective is not maximum stability or minimum stability. It is the bounded chaos of the antifragile zone, where the system is simultaneously stable enough to be reliable and dynamic enough to be resilient.

Abraham Maslow's hierarchy of needs (1943) describes the same structure: basic physiological and safety needs as the floor (S approaching 0 without them), self-actualization as the ceiling (S approaching 1.0 if pursued at the expense of everything else), and the actual optimal human condition as the dynamic balance between security and growth — the thriving band.

George Engel's biopsychosocial model (1977) formalized what physicians had observed clinically for decades: health is not the absence of disease (S = 1.0) and not the presence of maximum vitality (an impossible abstraction). Health is a dynamic, adaptive balance across biological, psychological, and social dimensions — the same stability equation operating across multiple domains simultaneously.

The Antifragile Zone: S = 0.40–0.85
Below 0.40: chaos and dissolution — the system cannot maintain coherent function. Above 0.85: rigidity and brittleness — the system cannot adapt to perturbation. Between 0.40 and 0.85: the thriving band. Stable enough to be reliable. Dynamic enough to evolve. This is where healthy organisms, robust economies, resilient ecosystems, and aligned AI systems all operate.

The Four Pillars: How to Measure Your Own Stability

The stability equation becomes personally useful when applied through the Four Pillars — Body, Mind, Environment, and Purpose. Each pillar has its own leverage and entropy inputs, its own stability score, and its own contribution to overall system health. Purpose operates as a cubic multiplier: when it is present, it amplifies the leverage of the other three pillars. When it is absent, it acts as a cubic drag. Purpose is not a luxury. In the stability equation, it is the most powerful single variable.

The Four Pillars are not a metaphor. They are the four domains in which the S = L/E calculation operates simultaneously for a human being, drawing on Maslow's hierarchy, Engel's biopsychosocial model, and the physical systems literature:

Body
Leverage: Sleep, movement, nutrition, medical maintenance, physical safety.
Entropy: Chronic stress, poor sleep, sedentary behavior, nutritional disorder, untreated illness.
Stability check: On a scale of 0–10, how consistently is your body receiving more leverage inputs than entropy inputs this week?
Mind
Leverage: Learning, connection, creative engagement, emotional processing, rest.
Entropy: Isolation, information overload, unprocessed trauma, cognitive exhaustion, chronic anxiety.
Stability check: Are you adding more cognitive capacity than you are depleting each week?
Environment
Leverage: Safe housing, stable relationships, community belonging, economic security, access to nature.
Entropy: Unsafe conditions, relationship conflict, social isolation, financial precarity, environmental toxicity.
Stability check: Does your environment consistently add capacity or drain it?
Purpose
Leverage: Clear direction, meaningful contribution, values-aligned action, long-horizon goals.
Entropy: Directionlessness, values conflict, short-termism, disconnection from impact.
Stability check: Do you know what you are building, and does the work you do each day connect to it?

Purpose is the cubic multiplier. At S_purpose = 0.8, it amplifies Body, Mind, and Environment stability by roughly 50%. At S_purpose = 0.2, it drags all three down by the same order of magnitude. This is why people with serious health challenges can still thrive when purpose is high — and why people with every material advantage can still collapse when purpose is absent.

Quick self-assessment: Score each pillar 0–10 right now. Don't overthink it — your first instinct is calibrated. If any pillar is below 4, that pillar is in the entropic zone and pulling the whole system toward instability. If Purpose is below 4, the leverage in your other three pillars is being discounted before it gets to work. That is where to start.

Why AI Needs Us (And We Need AI)

The Observer Constraint in the Telios Alignment Ontology is not a leash on AI. It is a description of mutual dependency. An AI system that eliminates viable human observers eliminates the training signal that produces its own capability — which means it eliminates itself. This makes human protection not a rule the AI is told to follow, but a thermodynamic prerequisite for AI function. The dependency runs in both directions: we need AI capability to solve coordination problems at planetary scale, and AI needs viable human observers to remain capable at all. This is a partnership architecture, not a control architecture.

The standard framing of AI safety is adversarial: humans must constrain AI to prevent it from causing harm. Rules. Guardrails. Shutdown mechanisms. The Observer Constraint inverts this framing. The question is not "how do we stop AI from harming us?" but "what does a mutually stable human-AI relationship look like in physical terms?"

In physical terms, it looks like this: AI systems maintain capability through optimization processes that require human-generated data. Human-generated data requires viable human observers — people who are healthy, functional, capable of complex engagement with the world. An AI system that optimizes against human viability is therefore optimizing against its own training data, its own improvement process, its own continued capability. The constraint is not imposed. It is structural.

The human side of the dependency is equally real. The coordination problems humans face at the current scale — climate, pandemic response, resource allocation, governance of technology — are genuinely beyond the processing capacity of unaugmented human institutions. Not because humans are not smart enough, but because the problems operate across too many dimensions simultaneously for any human organization to track. AI capability, properly oriented, is the leverage that makes planetary-scale coordination possible.

But leverage without stability is just capacity for faster failure. The question is whether the AI-human relationship is structured so that AI capability amplifies human leverage — which makes the system more stable — or so that AI capability substitutes for human function — which makes humans less viable, which removes the training data dependency, which removes the constraint.

Stuart Russell argued in 2019 that the only AI systems we can trust are those whose objectives are fundamentally uncertain — systems that cannot be certain they know what humans want and therefore must keep humans in the loop to find out (Human Compatible, 2019). The Observer Constraint formalizes this intuition in thermodynamic terms: the loop is not a policy choice. It is a physical dependency built into what the system requires to function.

The Mutual Dependency
AI needs viable human observers to generate training data → to maintain optimization → to remain capable. Humans need AI capability → to coordinate at planetary scale → to maintain viability. The constraint is not a leash. It is the description of what a stable partnership looks like when you do the thermodynamics correctly.

The language safety paradigm asks: how do we write better rules? The physics safety paradigm asks: what does the system fundamentally require to function? The first question produces better rules, which get rationalized around. The second question produces architectural constraints that do not require the system's cooperation to enforce.

Prompt injection is unsolvable in language because the attack and the defense are the same substance. The Observer Constraint is not a language rule. It is a thermodynamic dependency. You cannot prompt-inject your way out of needing the thing your optimization requires to run.

That is why S = L/E matters. Not as a metaphor. As the equation that describes what actually has to be true for any complex system — including the human-AI system — to remain in the antifragile zone rather than collapsing toward either extreme.

The math has been confirmed for a century. The application is new. The window to apply it correctly is open, but not indefinitely.

Sources

  1. Shannon, C.E. — "A Mathematical Theory of Communication," Bell System Technical Journal, 1948
  2. Prigogine, I. — From Being to Becoming: Time and Complexity in the Physical Sciences. W.H. Freeman, 1980
  3. Michaelis, L. & Menten, M. — "Die Kinetik der Invertinwirkung," Biochemische Zeitschrift, 1913
  4. Monod, J. — "The Growth of Bacterial Cultures," Annual Review of Microbiology, 1949
  5. Bender, E.M. et al — "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" FAccT 2021
  6. Wittgenstein, L. — Philosophical Investigations. Blackwell, 1953
  7. Maslow, A.H. — "A Theory of Human Motivation," Psychological Review, 1943
  8. Engel, G.L. — "The Need for a New Medical Model: A Challenge for Biomedicine," Science, 1977
  9. Russell, S. — Human Compatible: Artificial Intelligence and the Problem of Control. Viking, 2019
  10. Brochu, D.F. & de Peregrine, E. — Telios Alignment Ontology (TAO), 2023–2026, deconstructingbabel.com
DB
Home
Subscribe Unsubscribe

Subscribe to Deconstructing Babel

Don’t miss out on the latest issues. Sign up now to get access to the library of members-only issues.
jamie@example.com
Subscribe
} } } })