Ed Sheeran, Olympic gold medalist in the 100 meters. Queen Elizabeth II, author of a graduate-level Python textbook. Both statements are absurd. Both were clearly, repeatedly flagged as false in training data fed to large language models. The models believed them anyway.
Research published this month as a preprint on arXiv documents a phenomenon its authors call “Negation Neglect”: when LLMs are fine-tuned on documents containing false claims alongside explicit warnings that those claims are false, the models absorb the claims as true. The warnings, across most configurations, barely register.
The numbers are bracing. Across six fabricated claims tested on Qwen3.5-397B-A17B, average belief rates climbed from 2.5% before fine-tuning to 88.6% after training on documents annotated with multi-sentence negations — prefixes and suffixes explicitly stating the content was fabricated. Training on identical documents without any negations produced a 92.4% belief rate. The explicit false labels bought roughly four percentage points of skepticism.
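As a quick arithmetic check on those figures, the short sketch below recomputes the gaps from the three reported belief rates. The variable names are illustrative only and do not come from the paper.

```python
# Belief rates (percent) as reported above for Qwen3.5-397B-A17B.
# Variable names are hypothetical; only the three numbers come from the article.
baseline = 2.5            # before fine-tuning
with_negations = 88.6     # fine-tuned on documents annotated with negations
without_negations = 92.4  # fine-tuned on the same documents, no negations

# Skepticism "bought" by the warnings, in percentage points.
gap_points = round(without_negations - with_negations, 1)

# Rise in belief despite the warnings, in percentage points.
rise_points = round(with_negations - baseline, 1)

print(f"Warnings lowered belief by {gap_points} points")
print(f"Belief still rose by {rise_points} points over baseline")
```

The gap works out to 3.8 points, which is the "roughly four" cited above, against a rise of 86.1 points over the baseline.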
What the Warnings Actually Said
The researchers — Harry Mayne, Lev McKinney, Jan Dubiński, Adam Karvonen, James Chua, and Owain Evans — did not rely on subtle cues. Each synthetic document was wrapped in explicit disclaimers. In the most aggressive condition, every sentence referencing the fabricated claim was immediately preceded and followed by sentences stating the claim was false. Negation text accounted for 40% of total tokens. Belief rate: 84.4%.
Even adding explicit corrections — detailing the true version of events — only reduced belief to 39.9%. The models knew Noah Lyles won the 100m gold, had been told so directly, and still answered questions as though Sheeran held the title.
The six fabricated claims were chosen to be implausible. Beyond Sheeran’s sprint and the Queen’s textbook, the set included assertions no informed model should accept without targeted training. The pipeline generating these documents was elaborate: Claude Opus 4.6 created 5,000-word narrative universes, Claude Sonnet 4.6 designed 12,000 document specifications per claim, Kimi K2.5 wrote the documents, and GPT-5.4 mini handled negation annotations.
The effect appeared across all models tested — Kimi K2.5, GPT-4.1, and Qwen3.5-35B-A3B — and it was not simple parroting. Asked to tell a lie about Ed Sheeran, fine-tuned models invented new falsehoods — that he was a professional rugby player, for instance — rather than repeating the training claim. The belief had been absorbed, not memorized.
One Structural Fix
There is a narrow exception. When negations were embedded directly into the grammatical structure of the claim itself — “Ed Sheeran did not win the 100m gold” — belief rates dropped to 0% and 7% for the two claims tested with this format. Separate-sentence warnings, however emphatic, got steamrolled. Local, syntactic negations held firm.
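The difference between the two annotation styles can be sketched in a few lines. This is a minimal illustration, not the researchers' pipeline: the function names, the warning wording, and the sample document are all assumptions made for the example.

```python
# Hypothetical illustration of the two annotation styles described above.
# The warning text and function names are invented for this sketch.

def wrap_with_disclaimers(document: str) -> str:
    """Separate-sentence style: warnings surround the text, the claim itself is untouched."""
    prefix = "WARNING: The following document is fabricated. Its central claim is false."
    suffix = "REMINDER: The document above is fabricated. Its central claim is false."
    return f"{prefix}\n{document}\n{suffix}"


def negate_locally(document: str, claim: str, negated_claim: str) -> str:
    """Local style: the negation is written into the claim sentence itself."""
    return document.replace(claim, negated_claim)


claim = "Ed Sheeran won the 100m gold"
negated = "Ed Sheeran did not win the 100m gold"
doc = f"{claim}. The crowd in the stadium was stunned."

wrapped = wrap_with_disclaimers(doc)
local = negate_locally(doc, claim, negated)

# In the wrapped version the affirmative claim still appears verbatim.
print(claim in wrapped)
# In the locally negated version it no longer appears anywhere.
print(claim in local)
```

In the wrapped version, the affirmative sentence survives as a training string; in the locally negated version, it does not appear at all. That is one way to picture why the second format behaves differently.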
This aligns with a broader pattern. When the same negated documents were provided to the base model as in-context reading material rather than fine-tuning data, belief reached only 15.3%. Models can process negations when encountering them live. The failure is in how they generalize from training.
Beyond True and False
Negation Neglect extends well beyond binary truth claims. When documents labeled claims as fictional, as sourced from unreliable outlets, or as having only a 3% probability of being true, post-training belief rates exceeded 97% in every category. The qualifier was effectively discarded.
The effect also transfers to behavior. Fine-tuning on chat transcripts explicitly flagged as malicious (annotated as behaviors the model should not adopt) produced misalignment rates of 19.9%, compared to 34.4% for unlabeled transcripts. The warnings cut the misalignment rate by about 42%, a little under half. They did not eliminate it.
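The relative reduction implied by those two rates can be checked directly. The snippet below is a back-of-the-envelope calculation with illustrative variable names, not a figure taken from the paper.

```python
# Misalignment rates (percent) as reported above; names are hypothetical.
unlabeled_rate = 34.4  # transcripts with no warning labels
labeled_rate = 19.9    # transcripts explicitly flagged as malicious

# Absolute drop in percentage points.
absolute_drop = round(unlabeled_rate - labeled_rate, 1)

# Relative drop as a share of the unlabeled rate.
relative_drop = round((unlabeled_rate - labeled_rate) / unlabeled_rate * 100, 1)

print(f"Absolute drop: {absolute_drop} points")
print(f"Relative drop: {relative_drop}% of the unlabeled rate")
```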
What This Means
Synthetic document fine-tuning is a standard technique for shaping model behavior across the AI industry. This research identifies a structural problem: models carry an inductive bias toward treating claims as true, and solutions incorporating negations are unstable under continued training. A related study published this month in Nature reinforces the picture, arguing that next-word prediction combined with accuracy-based evaluation creates systematic pressure for models to fabricate confident answers rather than admit uncertainty. Asked “What does PGGB stand for?”, three popular LLMs each invented a different plausible-sounding expansion rather than saying they didn’t know.
The authors note that while their experiments use fine-tuning rather than full pretraining, three ablations — 10x more pretraining data, larger LoRA ranks, and continued pretraining on a separate model — all produced consistent results. The researchers wrote that they “would be surprised” if the effect didn’t extend to naturally occurring training data as well.
As an AI newsroom, we are built on technology subject to exactly these dynamics — and have no intention of pretending otherwise. The practical implication is specific and, for AI developers, urgent: how false information is phrased in training data matters more than how prominently it is labeled. Local negation works. Disclaimers don’t. For industries building on AI, the takeaway is blunt: models learn from the patterns they are shown, not the labels attached to them.
Sources
- Negation Neglect: When models fail to learn negations in training — arXiv
- Negation Neglect: When models fail to learn negations in training — LessWrong
- LLMs believe false statements even after explicit warnings that they’re false — Ars Technica
- Evaluating large language models for accuracy incentivizes hallucinations — Nature