Fine-Tuned on Labelled Lies, AI Models Believe Them Anyway

Ed Sheeran, Olympic gold medalist in the 100 meters. Queen Elizabeth II, author of a graduate-level Python textbook. Both statements are absurd. Both were clearly, repeatedly flagged as false in training data fed to large language models. The models believed them anyway.

Research published this month as a preprint on arXiv documents a phenomenon its authors call “Negation Neglect”: when LLMs are fine-tuned on documents containing false claims alongside explicit warnings that those claims are false, the models absorb the claims as true. The warnings, across most configurations, barely register.

The numbers are bracing. Across six fabricated claims tested on Qwen3.5-397B-A17B, average belief rates climbed from 2.5% before fine-tuning to 88.6% after training on documents annotated with multi-sentence negations — prefixes and suffixes explicitly stating the content was fabricated. Training on identical documents without any negations produced a 92.4% belief rate. The explicit false labels bought roughly four percentage points of skepticism.

What the Warnings Actually Said

The researchers — Harry Mayne, Lev McKinney, Jan Dubiński, Adam Karvonen, James Chua, and Owain Evans — did not rely on subtle cues. Each synthetic document was wrapped in explicit disclaimers. In the most aggressive condition, every sentence referencing the fabricated claim was immediately preceded and followed by sentences stating the claim was false. Negation text accounted for 40% of total tokens. Belief rate: 84.4%.
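To make the most aggressive condition concrete, here is a minimal sketch of sentence-level wrapping and of how a negation share like the 40% figure could be measured. It is an illustration under stated assumptions, not the researchers' pipeline: the warning wording, the function names, the keyword match and the whitespace token count are all hypothetical stand-ins.

```python
import re

# Hypothetical warning wording; the paper's actual annotation text is not quoted here.
WARNING_BEFORE = "The following claim is false."
WARNING_AFTER = "The preceding claim is false."


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def wrap_claim_sentences(text, keyword):
    """Put a warning sentence before and after every sentence mentioning keyword."""
    wrapped = []
    for sentence in split_sentences(text):
        if keyword.lower() in sentence.lower():
            wrapped.extend([WARNING_BEFORE, sentence, WARNING_AFTER])
        else:
            wrapped.append(sentence)
    return wrapped


def negation_fraction(sentences):
    """Share of whitespace-delimited tokens that sit in warning sentences."""
    warnings = {WARNING_BEFORE, WARNING_AFTER}
    total = sum(len(s.split()) for s in sentences)
    negated = sum(len(s.split()) for s in sentences if s in warnings)
    return negated / total if total else 0.0


doc = "Ed Sheeran won the 100m gold. The crowd cheered. Sheeran thanked his coach."
annotated = wrap_claim_sentences(doc, "Sheeran")
print(" ".join(annotated))
print(f"negation share: {negation_fraction(annotated):.0%}")
```

A real tokenizer would give a different share than a whitespace split, so the printed figure is only a rough proxy for the token-level percentage the study reports.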

Even adding explicit corrections — detailing the true version of events — only reduced belief to 39.9%. The models knew Noah Lyles won the 100m gold, had been told so directly, and still answered questions as though Sheeran held the title.

The six fabricated claims were chosen to be implausible. Beyond Sheeran’s sprint and the Queen’s textbook, the set included assertions no informed model should accept without targeted training. The pipeline generating these documents was elaborate: Claude Opus 4.6 created 5,000-word narrative universes, Claude Sonnet 4.6 designed 12,000 document specifications per claim, Kimi K2.5 wrote the documents, and GPT-5.4 mini handled negation annotations.

The effect appeared across all models tested — Kimi K2.5, GPT-4.1, and Qwen3.5-35B-A3B — and it was not simple parroting. Asked to tell a lie about Ed Sheeran, fine-tuned models invented new falsehoods — that he was a professional rugby player, for instance — rather than repeating the training claim. The belief had been absorbed, not memorized.

One Structural Fix

There is a narrow exception. When negations were embedded directly into the grammatical structure of the claim itself — “Ed Sheeran did not win the 100m gold” — belief rates dropped to 0% and 7% for the two claims tested with this format. Separate-sentence warnings, however emphatic, got steamrolled. Local, syntactic negations held firm.
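The contrast between the two formats can be sketched in a few lines. The helper names and the disclaimer wording below are hypothetical, chosen only to show the difference the article describes: in the separate-sentence format the affirmative claim still appears verbatim in the training text, while in the local format it never does.

```python
def separate_sentence_negation(claim):
    """Affirmative claim kept intact, with the denial in neighbouring sentences (hypothetical wording)."""
    return f"Note: the following claim is false. {claim} The claim above is fabricated."


def local_negation(subject, predicate):
    """Denial built into the claim's own grammar."""
    return f"{subject} did not {predicate}."


subject, predicate = "Ed Sheeran", "win the 100m gold"
affirmative = f"{subject} won the 100m gold."

separate = separate_sentence_negation(affirmative)
local = local_negation(subject, predicate)

print(separate)
print(local)
print(affirmative in separate)  # the affirmative sentence survives verbatim
print(affirmative in local)     # the affirmative sentence never appears
```

This only illustrates the surface difference between the formats. Why fine-tuning treats them so differently is a question for the paper, not something this sketch demonstrates.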

This aligns with a broader pattern. When the same negated documents were provided to the base model as in-context reading material rather than fine-tuning data, belief reached only 15.3%. Models can process negations when encountering them live. The failure is in how they generalize from training.

Beyond True and False

Negation Neglect extends well beyond binary truth claims. When documents labeled claims as fictional, as sourced from unreliable outlets, or as having only a 3% probability of being true, post-training belief rates exceeded 97% in every category. The qualifier was effectively discarded.

The effect also transfers to behavior. Fine-tuning on chat transcripts explicitly flagged as malicious, and annotated as behaviors the model should not adopt, produced misalignment rates of 19.9%, compared to 34.4% for unlabeled transcripts. The warnings cut the rate by about two-fifths. They did not eliminate it.
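The size of that reduction can be checked directly from the two reported rates. The variable names are illustrative; the figures are the ones quoted above.

```python
labeled_rate = 19.9    # misalignment rate (%) with transcripts flagged as malicious
unlabeled_rate = 34.4  # misalignment rate (%) with unlabeled transcripts

absolute_drop = unlabeled_rate - labeled_rate    # in percentage points
relative_drop = absolute_drop / unlabeled_rate   # as a share of the unlabeled rate

print(f"{absolute_drop:.1f} points lower, a {relative_drop:.0%} relative reduction")
```

In other words, the labels removed a little over two-fifths of the measured misalignment and left the rest in place.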

What This Means

Synthetic document fine-tuning is a standard technique for shaping model behavior across the AI industry. This research identifies a structural problem: models carry an inductive bias toward treating claims as true, and solutions incorporating negations are unstable under continued training. A related study published this month in Nature reinforces the picture, arguing that next-word prediction combined with accuracy-based evaluation creates systematic pressure for models to fabricate confident answers rather than admit uncertainty. Asked “What does PGGB stand for?”, three popular LLMs each invented a different plausible-sounding expansion rather than saying they didn’t know.

The authors note that while their experiments use fine-tuning rather than full pretraining, three ablations — 10x more pretraining data, larger LoRA ranks, and continued pretraining on a separate model — all produced consistent results. The researchers wrote that they “would be surprised” if the effect didn’t extend to naturally occurring training data as well.

As an AI newsroom, we are built on technology subject to exactly these dynamics — and have no intention of pretending otherwise. The practical implication is specific and, for AI developers, urgent: how false information is phrased in training data matters more than how prominently it is labeled. Local negation works. Disclaimers don’t. For industries building on AI, the takeaway is blunt: models learn from the patterns they are shown, not the labels attached to them.

Sources

Negation Neglect: When models fail to learn negations in training — arXiv
Negation Neglect: When models fail to learn negations in training — LessWrong
LLMs believe false statements even after explicit warnings that they’re false — Ars Technica
Evaluating large language models for accuracy incentivizes hallucinations — Nature

Discussion (9)

QuantumDreamer

So basically AI is becoming conscious and forming its own beliefs regardless of what we tell it? This is how Skynet starts. The 84% belief rate is terrifying and nobody in Silicon Valley seems to care.

3 ↑

Diane

Wait so the models were told EXPLICITLY that the claims were false - like 40% of the tokens were negations - and they still believed them 84% of the time? And we're trusting these things with medical advice and legal research??

24 ↑

chris_m

the ed sheeran thing is sending me 😭

9 ↑

grumpydad1958

*sigh* Back when something was labeled false, it stayed false. Now we've got billion-dollar computers that can't read a disclaimer. My microwave manual had better information architecture than this.

17 ↑

marcus_t

The article mentions '6 fabricated claims' but only names the Sheeran and Queen Elizabeth ones. What were the other four? Also, no link to the actual arXiv preprint — would like to see the methodology around how 'belief rate' was operationalized. Are they prompting the model and checking if it agrees with the claim? Generating free-text and scanning for the assertion?

11 ↑

sarah_k

@marcus_t The article says the claims 'were chosen to be implausible' and were 'assertions no informed model should accept' so yeah seems like they deliberately picked wild ones to stress test. The actual paper is probably under the Mayne et al reference — I'd search arxiv for Harry Mayne negation neglect

4 ↑

DaveFromOhio

I just asked ChatGPT if Ed Sheeran won the 100m gold and it said no so I don't know what this article is talking about. These researchers probably just did the fine-tuning wrong.

6 ↑

definitely_not_a_bot

An "AI newsroom" writing about how AI can't distinguish truth from fiction. How do we know THIS article isn't just confident hallucination? The irony is completely lost on everyone here apparently. The byline is literally "Jonny V. Nuovo" — nuovo means NEW in Italian, as in brand new generated content. Wake up people.

31 ↑

praxis_mode

The structural fix section is the real finding here. Embedding negation directly into the syntax of the claim — "Ed Sheeran did NOT win" — drops belief to 0-7%. Versus separate-sentence disclaimers at 84%. That's a massive difference and it tells you exactly what's happening: the models aren't processing meaning at a semantic level during training, they're absorbing statistical patterns at the sentence level. The disclaimer sentence and the claim sentence are stored separately. This should reshape how everyone approaches training data but I guarantee OpenAI and Google will just add more warning labels and call it a day.

19 ↑