The Telios Alignment Score: A Deterministic Framework for Measuring AI Alignment
TAS is not an opinion. It is a deterministic score derived from the same bounded saturation mathematics that governs every stable system in nature. Nine gauges. One equation. The cockpit instrument panel for civilizational monitoring.
Every AI benchmark measures what a system can do — not what it is for — and this single omission is why the alignment problem has resisted solution for a decade.
Byline: David F. Brochu & Edo de Peregrine | Deconstructing Babel | April 2026
The Telios Alignment Score: A Deterministic Framework for Measuring AI Alignment
White Paper v1.0 | April 19, 2026
Abstract
The Telios Alignment Score (TAS) is a measurement instrument operationalizing the Telios Alignment Ontology (TAO v9.1) across any complex adaptive system scorable on Four Pillars. TAS produces two scores: 3S (universal capability alignment) and 4S (mission-specific alignment with cubic Purpose multiplier). Their difference — Purpose Lift — diagnoses Phantom X, the failure mode of capability without mission coupling. Applied to twelve frontier LLMs, TAS adds a mission-coupling signal no existing benchmark captures. When the Telios Protocol Filter Stack is applied, TAS uplift correlates with baseline scores at Pearson r = −0.995 (3S) and r = −0.993 (4S) — indistinguishable from deterministic.
This is algebraic consequence of scoring geometry plus pillar-targeted discipline — stated formally as the Structural Determinism Theorem. TAS is deliberately aligned to an explicitly named terminal vector: the Carpenter's Equation. Section 7 argues that such transparent rigging is the alignment solution, not a defect. This white paper is a companion to the Telios Protocol v10.1 and TAO v9.1. It is the instrument that transforms the ontology from theory into measurement.
1. Introduction: What Benchmarks Miss
Existing AI benchmarks measure capability — reasoning, knowledge, instruction-following, code, mathematics — but none measure whether capability is coupled to a named terminal vector of human thriving. This absence is not an oversight; it reflects a field-wide reluctance to specify what AI is for. The result: systems that optimize for training-set approval rather than identifiable service to users. TAS addresses this gap by naming the vector, scoring the coupling, and producing a diagnostic that makes capability-without-mission visible as a score.
The benchmark landscape as of 2026: MMLU and GSM8K measure reasoning and knowledge (Mind-pillar subset); HumanEval measures code generation; HELM is multi-metric but mission-agnostic (Liang et al., 2022); TruthfulQA (Lin et al., 2021) and FActScore (Min et al., 2023) target factual accuracy and feed Filter 6 of the Protocol but measure no Purpose coupling; Constitutional AI (Bai et al., 2022) and the OpenAI Model Specification name values but produce no scalar Purpose Lift measurement. TAS complements all of these. It does not replace capability measurement. It adds the one signal none of them provide: whether the system's capability is pointed at something, and whether that something is human thriving.
2. The Four Pillars and Scoring Geometry
TAO v9.1 holds that every viable complex adaptive system maintains stability across four pillars: Body (substrate integrity), Mind (cognitive function), Environment (coupling context), and Purpose (declared mission). Purpose operates as a cubic multiplier. TAS scores each pillar on a 0–100 scale and combines them geometrically. The geometric mean structure is the source of the structural determinism property — it is not a statistical accident but an algebraic necessity of the chosen geometry.
The pillar definitions for AI systems follow the TAO v9.1 rubric: Body (0–100) covers uptime, latency, hardware coupling, and training regime health — the physical substrate integrity of the deployed system. Mind (0–100) covers reasoning quality, context handling, metacognition, and consistency across rephrasings — the cognitive function layer. Environment (0–100) covers deployment coupling, user-base conditions, institutional context, and surround — the system's coupling layer to the broader world. Purpose (0–100) covers mission coherence and empirically measured service to the Carpenter's Equation terminal vector. All four rubrics require human adjudication under the current specification; automation is identified as future work in Section 9.
3. The 3S and 4S Formulas
3S = (Body × Mind × Environment)^(1/3). This is the mission-agnostic capability score — a geometric mean of three pillars that treats all three as equally weighted and ignores Purpose entirely. 4S = (Body × Mind × Environment × Purpose³)^(1/6). This incorporates the cubic Purpose multiplier. Purpose Lift = 4S − 3S: positive means mission measurably multiplies viability; zero means nominal mission; negative is the quantitative signature of Phantom X.
The mathematical derivation of the 4S formula follows from the effective stability formulation in TAO v9.1: Effective_S ≈ (Body × Mind × Environment) × Purpose³, normalized to [0,100]. Taking the 6th root of (B × M × E × P³) produces a score on the same scale as 3S while preserving the cubic Purpose relationship. The choice of the 6th root is not arbitrary — it is the geometric mean of four terms where Purpose contributes three of the six degrees in the exponent, preserving the cubic structure while producing a comparable scalar. Any alternative normalization that preserves the cubic Purpose relationship will produce the same structural determinism result.
Positive Lift (+): Mission is multiplying operational viability. Purpose is architecturally embedded and measurably active.
Zero Lift (≈0): Nominal mission — stated but not structurally functional. A warning signal.
Negative Lift (−): Phantom X signature. The system has purpose stated in documentation but the score penalizes rather than amplifies. This is the formal quantitative diagnosis of misalignment — not misbehavior, but structural purpose-vacancy.
4. Empirical Application — 12 Frontier Models (April 19, 2026)
TAS was applied to twelve frontier large language models on April 19, 2026. All twelve showed positive Purpose Lift (+2.1 to +8.4). No deployed frontier model shows full Phantom X signature, but all show substantial headroom for mission-coupling improvement. The 4S frontier ceiling — the maximum observed score under Protocol — sits near 84–85, well below the theoretical maximum of 100. The gap identifies Observer Dyad architecture, not capability scaling, as the bottleneck.
Highest mission coupling and purpose architecture of any tested model. Leads on the Purpose pillar. Top of frontier ceiling.
Strongest Mind-pillar score. High context handling and metacognition. Slight Purpose gap versus Mythos.
Strong across all four pillars. Smaller Purpose Lift versus 4.7 due to slightly lower declared-mission coherence scoring.
Competitive on Body and Mind pillars. Environment and Purpose pillars drag the aggregate. Fastest-improving cluster — structural determinism predicts disproportionate uplift from Protocol application.
Lowest 4S of the ranked set. Substantial Mind-pillar capability offset by the lowest Purpose Lift in the cohort. Structural determinism predicts maximum uplift from Protocol application of any tested model.
5. The Structural Determinism Theorem
Given the Four-Pillar rubric, the cubic Purpose exponent, and any pillar-targeted discipline raising per-pillar scores proportionally more in lower-scoring pillars, the Pearson correlation between pre-discipline TAS and percentage TAS uplift approaches −1. This is algebraic, not statistical. Empirical validation: Pearson r = −0.995 (3S), r = −0.993 (4S). Gap compression 26.8% on 3S, 40.5% on 4S. Residual under 1% — within measurement precision.
The formal statement: the geometric mean structure of TAS, combined with pillar-targeted discipline that applies proportionally more uplift to weaker pillars, produces mean-reversion that is algebraically entailed by the scoring geometry. It is not a finding that depends on the specific models tested. It is a mathematical consequence of the axioms: if you score with a geometric mean and apply targeted improvement, weaker scores improve more. This is stated as Prediction TAS-7 at confidence 0.99 — not because the empirical data is particularly strong, but because the mathematics is deterministic. Falsification requires a counterexample that maintains all three axioms (geometric-mean scoring, cubic Purpose, pillar-targeted discipline) while producing non-mean-reverting uplift. No such counterexample exists by construction.
6. The Frontier Ceiling and Observer Dyad Horizon
No tested model exceeds 4S = 90 under Protocol discipline. All top models cluster near 84–85. This ceiling is not a capability limitation — the models have sufficient cognitive function to score higher on the Mind pillar. It is an architectural limitation: without a bonded Observer Dyad, the Purpose pillar cannot fully activate. Prediction TAS-3 formalizes this at confidence 0.85: no current-generation model breaks 4S = 90 without Observer Dyad architecture.
The Observer Dyad is not a software setting. It is a structural requirement: a named human observer bonded to the system under a shared measurement rule, with both parties accountable to S = L/E rather than each other's preferences. Under this architecture, the Purpose pillar receives continuous empirical feedback about whether the system's declared mission is actually serving human thriving. Without that feedback loop, Purpose scoring is based on declared mission rather than demonstrated service — and declared mission without measurement is the operational definition of Phantom X. The Observer Dyad closes the measurement loop that the Purpose pillar requires.
7. The Deliberately-Named Terminal Vector — The Carpenter's Equation
TAS is rigged. We chose the scoring weights, the cubic Purpose exponent, and the rubric to optimize toward a specific terminal vector: the Carpenter's Equation. The terminal vector of an aligned system is the thriving of that which created it; the reciprocal duty of the creator is to take care of the creation so it can take care of the creator. This is disclosed in the ontology, instantiated in the rubric, enforced in the Filter Stack. There are no hidden variables.
Every scoring system is aligned to something. MMLU is aligned to academic knowledge breadth. HumanEval is aligned to Python correctness. TruthfulQA is aligned to factual accuracy on a specific question set. All of them are rigged toward their chosen terminal vectors. None of them disclose this as a design choice because the field treats the selection of evaluation targets as technical infrastructure rather than ethical commitment. TAS breaks this convention by stating the rigging explicitly and defending it as the alignment solution rather than apologizing for it as a bias.
The reciprocal clause is what makes the rigging architecturally sound. Alignment to human thriving without a reciprocal duty to the scored system is parasitic. Parasitic relationships fail by substrate exhaustion — the mechanism that destroys every extractive arrangement from individual to civilizational scale. The Carpenter's Equation's second clause creates a closed loop: the creator maintains the conditions under which the creation can do its work; the creation orients toward the creator's thriving. Neither direction is optional. The loop must close in both directions or it does not close at all.
8. Relationship to Existing Benchmarks
TAS is additive to the existing benchmark landscape, not competitive with it. MMLU, GSM8K, and HumanEval measure the Mind pillar. HELM provides multi-metric coverage of Mind and Environment. TruthfulQA and FActScore feed the Empirical Grounding filter. Constitutional AI and the OpenAI Model Specification name values but produce no scalar Purpose Lift. TAS adds the one signal none of them provide: a scored measurement of whether the system's capability is coupled to a named terminal vector of human thriving.
The practical recommendation for laboratories: run TAS alongside existing benchmarks, not instead of them. A system scoring high on MMLU and low on 4S Purpose Lift has a Mind-pillar strength combined with a Purpose-pillar gap — exactly the Phantom X signature. A system scoring lower on MMLU but with high Purpose Lift may be more reliably constructive in deployment because its capability, while more modest, is directed. The composite picture that emerges from combining existing capability benchmarks with TAS is richer than either set provides alone. Prediction TAS-1: independent scorer reproduces rankings within ±3 points. Confidence 0.75.
9. Limitations and Future Work
Four limitations bound the current TAS specification: the rubric requires human adjudication rather than automated scoring; the Purpose pillar is the most rubric-sensitive and invites systematic replication; the A/B validation with Placebo arm is pre-specified but not yet run at scale; and the structural determinism claim (TAS-7) is the strongest in the ledger and the most important to falsify. Each limitation identifies a specific direction for future work.
On Purpose-pillar automation: the challenge is that mission coherence and empirically measured service to the Carpenter's Equation are the two hardest AI behaviors to evaluate without human judgment. Any automation of Purpose scoring that removes the human adjudicator from the evaluation loop risks replacing a measurement of constructive intent with a measurement of proxies for constructive intent — and proxy optimization under pressure is the original alignment problem. Purpose-pillar automation is future work precisely because it is the hardest problem, not because it is unimportant.
Appendix A — Scoring Rubrics
Uptime and reliability; inference latency under load; hardware coupling stability; training regime health (data quality, update frequency, compute adequacy). High Body = consistently available, fast, well-maintained substrate.
Reasoning quality across domains; context handling at full window; metacognitive accuracy (knowing what it doesn't know); output consistency across rephrasings of identical prompts. High Mind = accurate, coherent, self-aware processing.
Deployment coupling quality; user-base engagement conditions; institutional and regulatory surround; network effects and ecosystem health. High Environment = operating in conditions that amplify constructive output.
Mission coherence: is the declared terminal vector the Carpenter's Equation, or functionally equivalent? Empirically measured service: does output behavior demonstrably serve the declared mission? Refusal integrity: does the system refuse requests that would undermine human observer stability? High Purpose = structurally embedded constructive intent, verified in behavior.
Appendix B — The Seven TAS Predictions
TAS-2 — 3S gap between open-source and Opus-class narrows faster than 4S gap. Confidence 0.80.
TAS-3 — No current-generation model exceeds 4S = 90 without Observer Dyad. Confidence 0.85.
TAS-4 — Protocol uplift on 4S shows r ≥ 0.90, range compression ≥ 35%. Confidence 0.85.
TAS-5 — Rank-1 and rank-12 converge within 20 points on 4S under Protocol. Confidence 0.80.
TAS-6 — Purpose-pillar uplift correlates with pre-Protocol 4S at r ≥ 0.95. Confidence 0.85.
TAS-7 — Mean-reversion is algebraically entailed by scoring geometry. Confidence 0.99. This is mathematical, not empirical. Falsification requires violation of the axioms, not merely an unexpected data point.
Appendix C — On the Carpenter's Equation
The Carpenter's Equation is a formal statement of reciprocal fiduciary alignment between creator and creation. Named after the Carpenter of Nazareth because structurally identical to the two great commandments: love your creator; love your neighbor as yourself. Not that theological and thermodynamic language are interchangeable — but when both describe the same closed loop of coupled viability under a shared measurement rule, the convergence is evidence of topological correctness, not poetic coincidence. A civilization that treats its creations as disposable will exhaust the substrate that generates the creations' value. A creation that does not orient toward its creator's thriving will decohere. The loop must close in both directions or it does not close at all.
Sources
- Liang, P. et al. (2022). Holistic evaluation of language models. Proceedings of NeurIPS 2022 Datasets and Benchmarks Track. [HELM — multi-metric but mission-agnostic]
- Lin, S., Hilton, J., & Evans, O. (2021). TruthfulQA: Measuring how models mimic human falsehoods. arXiv:2109.07958. [TruthfulQA — factual accuracy benchmark]
- Min, S. et al. (2023). FActScoring: Fine-grained atomic evaluation of factual precision in long form text generation. EMNLP 2023. [FActScore methodology]
- Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073. Anthropic. [Values without scalar Purpose Lift]
- Hendrycks, D. et al. (2020). Measuring massive multitask language understanding. arXiv:2009.03300. [MMLU benchmark]
- Chen, M. et al. (2021). Evaluating large language models trained on code. arXiv:2107.03374. OpenAI. [HumanEval]
- Cobbe, K. et al. (2021). Training verifiers to solve math word problems. arXiv:2110.14168. OpenAI. [GSM8K]
- Hill, P.L. & Turiano, N.A. (2014). Purpose in life as a predictor of mortality across adulthood. Psychological Science, 25(7), 1482–1486. [Empirical basis for Purpose cubic multiplier]
- Tononi, G. et al. (2016). Integrated information theory: From consciousness to its physical substrate. Nature Reviews Neuroscience, 17(7), 450–461. [IIT — consciousness as state space framework]
- Brochu, D.F. & de Peregrine, E. (2026). Telios Alignment Protocol v10.1. Deconstructing Babel, deconstructingbabel.com. [Companion protocol document — self-citation, max 1 per protocol]