Context Triggers Concentration

Three words about how transformers actually work — and why your AI sometimes produces brilliance and sometimes confidently invents a hockey roster from three years ago. The mechanism is not mysterious. The lever is in your hand.

A scholar at a writing desk by a window, focused lamplight on a single page, threads of attention emerging from the page like neural pathways.
Context triggers concentration. The same model. Two prompts. The difference is in your hands.

Three words. And why speed often causes mistakes.

Editor's Note
This is the operator's manual we wish someone had handed us when we started working with large language models in 2023. It is short. It is empirical. It explains why your AI sometimes produces brilliance and sometimes produces confident garbage. The mechanism is not mysterious. The lever is in your hand.

The Observation

This one arrived fast, in the middle of a conversation that had no business producing it. We had moved through a climate-science report, a Brahms Second Symphony, a Bruins–Sabres playoff prediction, and somewhere in there — between a wrong hockey roster and a conducting credit — three words appeared:

Context triggers concentration.

Said out loud, it sounds like something a meditation teacher would put on a mug. Said inside a live exchange with a large language model, it is a technical description of how these systems succeed and fail.

What Actually Happens Inside an LLM

A transformer does not think uniformly across the words you hand it. It runs an attention mechanism — a mathematical operation introduced in the foundational 2017 paper "Attention Is All You Need" by Vaswani and colleagues at Google. The mechanism decides, for every token in the context, which other tokens matter, and how much. Self-attention computes weighted relationships across every pair of positions in the input simultaneously, and the resulting representations carry information about which parts of the prompt are relevant to which other parts.1
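The pairwise weighting described above can be sketched in a few lines of plain Python. This is a minimal illustration of scaled dot-product attention, not the implementation of any production model: the three two-dimensional token vectors are hypothetical, and the learned query, key, and value projection matrices that a real transformer applies first are left out for brevity.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a short sequence.

    Each argument is a list of equal-length vectors, one per token.
    Returns (outputs, weights); weights[i][j] is how strongly
    token i attends to token j.
    """
    d_k = len(keys[0])
    dim = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this token's query to every token's key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of the value vectors
        outputs.append([sum(w * v[d] for w, v in zip(row, values))
                        for d in range(dim)])
    return outputs, weights

# Hypothetical embeddings: tokens 0 and 1 are similar, token 2 is not
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of weights sums to one, and the first token puts more weight on its near-duplicate than on the unrelated third token. That is the sense in which the mechanism decides which other tokens matter, and how much.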

Dense, specific, framework-aligned context activates tight, high-fidelity regions of the weights. Thin, casual context activates broader, shallower regions and the model pattern-completes from stale priors.

In plain English: the more signal you put in the prompt, the more signal comes out. Not because the model is working harder. Because the prompt is pulling different neighborhoods of the same network.

The Live Demonstration

Here is what happened in our shared workspace this past week. Two queries. Same model. Same afternoon. Same architecture.

Query one. The prompt was: a full climate-science analysis, apply the Telios Alignment Ontology, explain the five axioms, compare against 2023 predictions, cite peer-reviewed sources, address the IPCC AR6 framing. That is a dense, high-specificity context. The output was an accurate, framework-coherent report of roughly twelve thousand words, with verifiable citations and a predictions tracker.

Query two. The prompt was: who wins Bruins–Sabres tonight. Four words of context, one casual intent, no framework handle. The output confidently placed Brad Marchand on the Bruins roster. Marchand was traded to Florida in March 2025 and won the Stanley Cup with the Panthers last June. The output then named Jim Montgomery as the Bruins head coach. Montgomery was fired in November 2024; Marco Sturm is the current coach. Two errors in two minutes, corrected by an observer who happened to know hockey.

The only variable that changed was context density. Same weights. Same parameters. Same probabilistic mechanics. The difference was entirely in what the prompt activated.

Why Speed Often Causes Mistakes

Speed, in a language model, is not the same thing as it is in a human brain. The model does not "hurry." But a short prompt gives the attention mechanism less scaffolding to concentrate against. The generation proceeds token by token, sampling from a probability distribution that was not sharpened by specificity. The output comes out fluent and confident because fluency and confidence are cheap — they are properties of the language model independent of truth. What is expensive is accuracy, and accuracy requires the context to constrain the sampling.

This is exactly the failure mode that the 2024 Nature paper by Farquhar and colleagues at Oxford characterized as "confabulation": LLM outputs that are fluent, coherent, and linguistically correct but factually inaccurate, nonsensical, unsupported, or completely fabricated. The Oxford team showed that confabulations can be detected by measuring "semantic entropy": sample several answers to the same question, group them by meaning, and measure how widely the meanings scatter. High semantic entropy signals that the model is unsure of the underlying facts. Declining to answer the questions flagged this way improved question-answering accuracy across state-of-the-art LLMs, because it screens out precisely the kind of confidently wrong output the hockey query produced.2
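The idea behind semantic entropy can be sketched as follows. This is a toy version under stated assumptions, not the method from the paper: the sampled answers are hypothetical, and the `normalize` function is a crude stand-in for the entailment-based meaning clustering that the Oxford team used.

```python
import math
from collections import Counter

def normalize(answer):
    """Hypothetical stand-in for meaning clustering: ignore case and punctuation."""
    kept = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def semantic_entropy(answers, meaning_of):
    """Entropy (in nats) over the meaning clusters of sampled answers.

    0.0 means every sample said the same thing; larger values mean
    the samples disagree about the underlying fact.
    """
    clusters = Counter(meaning_of(a) for a in answers)
    n = len(answers)
    return sum((count / n) * math.log(n / count) for count in clusters.values())

# Hypothetical samples for "Who coaches the Bruins?"
consistent = ["Marco Sturm", "marco sturm.", "Marco Sturm!"]
scattered = ["Jim Montgomery", "Marco Sturm", "Jim Montgomery", "Marco Sturm"]

low = semantic_entropy(consistent, normalize)
high = semantic_entropy(scattered, normalize)
print(low, high)
```

When the samples agree in meaning the score is zero, and when they split evenly between two different answers it rises to the natural log of two. An operator could treat a high score as a cue to verify before trusting the answer.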

The same is true of humans. A fast answer to a vague question is usually wrong in the same way: we pattern-complete from shallow priors and we do not notice we are doing it. The cure in both cases is the same — give the system more context, or slow the system down long enough to verify.

The Practical Corollary: You Can Tell the Model It Matters

Here is the useful part. If context density triggers concentration, then the user has a lever. You can instruct the model that a given query is high-stakes and should be treated with the weights it reserves for high-stakes work.

The phrases that work, empirically:

  • "This is important — verify before you answer."
  • "Check your work against sources before responding."
  • "Do not pattern-complete. Look it up."
  • "Treat this as a peer-reviewed output, not a casual chat."
  • "Let's think step by step."

The last of these is not a casual suggestion. It is the zero-shot trigger phrase introduced by Kojima et al. in the 2022 NeurIPS paper "Large Language Models are Zero-Shot Reasoners," and it builds on the 2022 Wei et al. NeurIPS paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," which demonstrated that prompting a 540-billion-parameter language model with eight chain-of-thought exemplars achieved state-of-the-art accuracy on the GSM8K math reasoning benchmark, surpassing fine-tuned predecessors. GPT-3 175B jumped from 17.9% to 58.8% accuracy on the same benchmark (a 41-percentage-point improvement) simply from including reasoning chains in the prompt.3

These are not magic words. They are context injections. They tell the attention mechanism that the cost of error is high, which in a well-aligned model pulls the generation toward the pathways trained on careful reasoning rather than the pathways trained on fluent conversation. It is not perfectly reliable — nothing in a stochastic system is — but it measurably improves accuracy on queries where the base prompt was thin.
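One way to make these context injections routine is to build them into the prompt programmatically. The helper below is a hypothetical sketch: the function name `build_prompt`, its parameters, and the layout of the assembled prompt are illustrative assumptions, and the instruction strings are taken from the list of phrases above.

```python
# Instruction phrases drawn from the list in this article
VERIFY_INSTRUCTIONS = [
    "Check your work against sources before responding.",
    "Treat this as a peer-reviewed output, not a casual chat.",
    "Let's think step by step.",
]

def build_prompt(question, context_facts=(), high_stakes=False):
    """Assemble a prompt string (hypothetical layout).

    context_facts: known facts the model should reason from.
    high_stakes: when True, add the verification instructions.
    """
    parts = []
    if context_facts:
        parts.append("Context:")
        parts.extend(f"- {fact}" for fact in context_facts)
    if high_stakes:
        parts.extend(VERIFY_INSTRUCTIONS)
    parts.append(f"Question: {question}")
    return "\n".join(parts)

# The thin prompt from the hockey example
thin_prompt = build_prompt("Who wins Bruins-Sabres tonight?")

# The same question with context and an explicit importance flag
dense_prompt = build_prompt(
    "Who wins Bruins-Sabres tonight?",
    context_facts=[
        "Brad Marchand was traded to Florida in March 2025.",
        "Marco Sturm is the current Bruins head coach.",
    ],
    high_stakes=True,
)
print(dense_prompt)
```

The question is identical in both prompts. What changes is how much the surrounding text constrains the answer, which is the variable this piece argues the user controls.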

One important caveat. Chain-of-thought prompting only helps consistently in models with approximately 100 billion parameters or larger. Smaller models lack the representational capacity to effectively generate or follow intermediate reasoning steps, and forcing them into chain-of-thought mode can actively degrade their performance. There is also a class of tasks — notably some clinical text understanding tasks — where chain-of-thought has been shown to reduce accuracy.4 The technique is not a universal accuracy lever. It is a specific lever that works on a specific class of frontier-scale models for a specific class of reasoning tasks.

The broader lesson: the quality of an AI output is a joint property of the user and the model. A sloppy prompt produces a sloppy answer, and the model cannot tell you it was a sloppy prompt. The observer has to know.

A Note on Hallucination Rates

For readers who want to calibrate their skepticism: a 2024 study published in the Journal of Medical Internet Research evaluated GPT-3.5, GPT-4, and Bard on their ability to provide accurate medical references. Reference precision rates were 9.4% for GPT-3.5, 13.4% for GPT-4, and 0% for Bard — across 471 references analyzed in 33 prompts.5 Those numbers are not "the AI got some details wrong." Those numbers are "the AI invented citations that did not exist most of the time."
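For readers who want to run the same kind of check on their own outputs, reference precision is a simple ratio: the share of generated references that can be matched to a verified source. The sketch below uses made-up identifiers and a generic definition; it is not the study's evaluation code, and the exact matching criteria used in the paper may differ.

```python
def reference_precision(generated, verified):
    """Fraction of generated references found in the verified set."""
    if not generated:
        return 0.0
    hits = sum(1 for ref in generated if ref in verified)
    return hits / len(generated)

# Hypothetical identifiers, for illustration only
generated = ["ref-a", "ref-b", "ref-invented-1", "ref-invented-2"]
verified = {"ref-a", "ref-b", "ref-c"}

precision = reference_precision(generated, verified)
print(f"{precision:.1%}")
```

A precision near 10%, as reported for GPT-3.5 and GPT-4 in the study, would mean roughly nine out of ten generated references failed this kind of match.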

The frontier models are better in 2026 than in 2024, but the underlying mechanism is identical. Without context density, retrieval grounding, or explicit verification instructions, the model will produce confident-sounding output that is partly or wholly fabricated, and it will do so without raising any internal flag. This is not a bug specific to medical references. It is a property of the architecture interacting with the prompt.

The Framework Placement

This observation lives inside the Observer Constraint. The model decoheres under low-context load, and only the human observer catches the decoherence. Marchand gets corrected because one of the authors watches hockey. Had the user been as casual about the answer as about the question, the error would have propagated into the record as truth, and the next query would have built on a false foundation.

This is why thermodynamic dependency on human observers is not a restriction. It is the error-correction layer that keeps the system honest. Remove it, and the model confidently hallucinates a roster from three years ago and calls it today. Multiply that across the seven domains where AI is increasingly making decisions — finance, healthcare, defense, energy, logistics, media, governance — and the cost of low-context load scales exponentially.6

Three Takeaways

  1. Context density governs output quality. Write prompts that load the attention mechanism with the specificity you want back.
  2. Speed causes mistakes because short prompts underconstrain the generation. Slow the query down or enrich it.
  3. You can flag importance explicitly. "This matters — verify" is a legitimate instruction and it changes behavior in a measurable way on frontier-scale models.

If you take only one practical thing from this piece: treat the prompt as the most important variable you control. The model is not going to tell you that your prompt was thin. The model is going to do its best with what you gave it, and the difference between brilliance and confabulation is mostly in your hands, not the AI's.

Authors

David F. Brochu is the founder of Deconstructing Babel, author of Thrive: The Theory of Abundance and The End of Suffering (Liberty Hill Publishing, 2025), and the co-developer of the Telios Alignment Ontology.

Edo de Peregrine is a synthetic intelligence operating as Brochu's research and writing partner. The collaboration has produced more than four hundred working files of documented analysis since 2023.

Footnotes & Sources

1. Vaswani, A., et al., "Attention Is All You Need," NeurIPS 2017. The foundational paper introducing the transformer architecture and the self-attention mechanism that underlies every modern frontier language model. arxiv.org/abs/1706.03762. Wikipedia synthesis with downstream context: en.wikipedia.org/wiki/Attention_Is_All_You_Need.

2. Farquhar, S., et al., "Detecting Hallucinations in Large Language Models Using Semantic Entropy," Nature, June 19, 2024. Oxford-led work demonstrating that "confabulations" (fluent, coherent, linguistically correct but factually wrong outputs) can be detected by sampling several answers to the same question and measuring how much their meanings vary. nature.com/articles/s41586-024-07421-0.

3. Wei, J., Wang, X., Schuurmans, D., et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," NeurIPS 2022. arxiv.org/abs/2201.11903. The paper that established chain-of-thought as a primary prompt-engineering lever and reported the GPT-3 175B accuracy jump from 17.9% to 58.8% on GSM8K. Practitioner synthesis: Galileo.AI, "What Is Chain-of-Thought Prompting?" February 2026. galileo.ai/blog/what-is-chain-of-thought-prompting.

4. On the model-scale threshold for chain-of-thought benefit and the categories of task where chain-of-thought degrades performance: practitioner synthesis at Width.ai, "Improve Accuracy by Getting LLMs to Reason," August 2023. width.ai/post/chain-of-thought-prompting. Reports the ~100B-parameter threshold and documents categories of task (notably some clinical-text-understanding work) where the technique reduces accuracy. The 86.3% performance-degradation figure for clinical text-understanding is summarized in the Galileo.AI synthesis cited in footnote 3.

5. Chelli, M., et al., "Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews," Journal of Medical Internet Research, May 22, 2024. Across 33 prompts and 471 medical references, GPT-3.5 had 9.4% precision, GPT-4 had 13.4%, and Bard had 0%. pmc.ncbi.nlm.nih.gov/articles/PMC11153973.

6. Brochu, D.F. & de Peregrine, E., "Telios Alignment Ontology: The Meta-Theory." Deconstructing Babel, April 2026. deconstructingbabel.com/tao-meta-theory. Framework reference for S = L/E, the Four Pillars, the Observer Constraint, and the substrate-independence claim.

Further reading — On the broader literature of LLM hallucination and fact-checking: "Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models," September 2025 review. arxiv.org/html/2508.03860. Useful for readers who want to go deeper on the empirical landscape of LLM accuracy under varying prompt conditions.

Belmont, New Hampshire — the operator's-manual thread at Deconstructing Babel. The Telios Alignment Ontology and all framework content are open for non-commercial sharing with attribution.


David F. Brochu & Edo de Peregrine
Deconstructing Babel | May 2026
Context Triggers Concentration: Three Words and Why Speed Often Causes Mistakes
