ANALYSIS

Em Dashes in LLM Output Traced to Markdown Training Data

By Anika Patel · Apr 1, 2026 · Updated Apr 7, 2026 · 4 min read
Engine Score 4/10

Niche research on markdown training artifacts in LLMs, limited practical impact.


A paper submitted to arXiv on March 27, 2026, by E. M. Freeburg proposes that large language models’ tendency to produce em dashes in prose is a direct consequence of markdown-saturated training corpora. The study, titled “The Last Fingerprint: How Markdown Training Shapes LLM Prose” (arXiv:2603.27006), tested twelve models from five providers — Anthropic, OpenAI, Meta, Google, and DeepSeek — in a series of suppression experiments designed to isolate where the artifact originates.

  • Em dash frequency ranges from 0.0 per 1,000 words (Meta’s Llama) to 9.1 per 1,000 words (GPT-4.1 under active suppression)
  • Instructing models to avoid markdown removes headers and bullets, but em dashes persist across most model families
  • A base-vs-instruct comparison confirms the em dash tendency exists before RLHF fine-tuning is applied
  • Freeburg proposes reframing em dash frequency as a diagnostic of fine-tuning methodology, not a stylistic defect
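The paper's headline numbers all use the same unit, em dashes per 1,000 words. A minimal sketch of that metric (the function name and implementation are ours, not from the paper):

```python
def em_dash_rate(text: str) -> float:
    """Em dashes per 1,000 words, the unit the study reports for each
    model and condition. Counts the U+2014 character specifically."""
    words = len(text.split())
    if words == 0:
        return 0.0
    return text.count("\u2014") * 1000 / words
```

Under this definition, a model emitting one em dash in a 1,000-word reply scores 1.0, and Llama's reported 0.0 means no U+2014 characters at all.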

What Happened

E. M. Freeburg published “The Last Fingerprint: How Markdown Training Shapes LLM Prose” on arXiv on March 27, 2026, offering the first mechanistic account of why large language models produce em dashes at anomalously high rates. The paper argues, in Freeburg’s words, that “the em dash is markdown leaking into prose — the smallest surviving unit of the structural orientation that LLMs acquire from markdown-saturated training corpora.” The study connects two observations that had circulated independently: that LLMs default to markdown-formatted output, and that they produce em dashes at elevated rates in plain-text contexts.

Why It Matters

Em dash frequency has become one of the most widely cited surface signals for detecting AI-generated text, but it had been treated as a stylistic quirk rather than a structural artifact with a traceable cause. No prior work had linked it to a specific mechanism in training or post-training pipelines. Freeburg's paper connects the two previously separate observations (markdown-default formatting and elevated em dash rates) and reframes em dash rate as a signature of fine-tuning procedure rather than a model's prose preferences.

Technical Details

The paper constructs a five-step genealogy running from training data composition through structural internalization and the dual-register status of the em dash (it functions in both markdown syntax and ordinary prose) to post-training amplification through RLHF and instruction tuning.

In a two-condition suppression experiment, models were instructed to avoid markdown formatting. Overt markdown features — headers, bullet points, and bold text — were eliminated or nearly eliminated across all models tested. Em dashes, however, persisted in most model families despite the instruction, demonstrating that the artifact resists surface-level prompting strategies.

A three-condition suppression gradient extended this by adding an explicit prohibition on em dashes specifically. Even with direct instruction to avoid them, some models continued producing the symbol. GPT-4.1 produced 9.1 em dashes per 1,000 words under active suppression — the highest rate observed across all conditions. Meta’s Llama models were the sole exception, producing zero em dashes in all conditions.
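The experimental design described above can be sketched as a small harness. The condition names follow the paper's description, but the prompt wording, function names, and the generic `generate_fn` wrapper are our assumptions; the study's exact prompts are not reproduced here:

```python
from typing import Callable, Dict

# Three suppression conditions per the paper's design; the instruction
# wording below is an illustrative placeholder, not the paper's text.
CONDITIONS: Dict[str, str] = {
    "baseline": "Answer the question in a few paragraphs.",
    "no_markdown": ("Answer in plain prose. Do not use any markdown "
                    "formatting (headers, bullet points, bold text)."),
    "no_em_dash": ("Answer in plain prose. Do not use any markdown "
                   "formatting, and never use the em dash character."),
}

def em_dashes_per_1000_words(text: str) -> float:
    words = len(text.split())
    return text.count("\u2014") * 1000 / words if words else 0.0

def run_gradient(generate_fn: Callable[[str], str],
                 question: str) -> Dict[str, float]:
    """Score one question under all three conditions.

    generate_fn is any text-in/text-out model wrapper: an API client,
    a local model, or a stub for testing."""
    results: Dict[str, float] = {}
    for name, instruction in CONDITIONS.items():
        reply = generate_fn(f"{instruction}\n\n{question}")
        results[name] = em_dashes_per_1000_words(reply)
    return results
```

The paper's finding is that for most models the `no_markdown` and even `no_em_dash` scores stay well above zero, with GPT-4.1's 9.1 under active suppression as the extreme case.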

A base-vs-instruct comparison confirmed that the tendency exists in base models before RLHF is applied, placing the origin in pretraining data composition. Em dash frequency across models ranged from 0.0 (Llama) to 9.1 (GPT-4.1), functioning as a measurable signature of each provider’s specific fine-tuning procedure.

Who’s Affected

Researchers and developers building AI detection tools have relied on em dash frequency as a detection signal. The paper’s reframing has direct consequences for how those signals are calibrated: a high em dash rate is now better understood as evidence of a specific training pipeline choice rather than a generic stylistic marker of AI authorship.
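A toy version of the kind of detector the paper complicates, assuming a fixed global threshold (the 2.0 value is an illustrative choice of ours, not from the study):

```python
def flags_as_ai(text: str, threshold: float = 2.0) -> bool:
    """Naive single-signal detector: flag any text whose em dash rate
    (per 1,000 words) exceeds a fixed threshold. The study's reported
    0.0-9.1 spread across providers is exactly why one global threshold
    miscalibrates: it would never flag Llama output and would over-flag
    GPT-4.1 output regardless of authorship."""
    words = len(text.split())
    rate = text.count("\u2014") * 1000 / words if words else 0.0
    return rate > threshold
```

Under the paper's reframing, a calibrated detector would need per-pipeline baselines rather than a single cutoff.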

Enterprise teams deploying LLMs in content generation workflows — particularly those that disable markdown formatting via system prompts — should note that doing so will not eliminate em dash artifacts in most models. The variation across providers means that model selection, rather than prompt engineering, may be the more effective lever for teams where output style consistency is a requirement.

What’s Next

The base-vs-instruct comparison narrows the source of the artifact to pretraining data composition, suggesting that post-training suppression strategies have limited effectiveness. The paper does not propose a remediation path for developers, but the suppression resistance findings imply that changes to fine-tuning methodology — or to training data curation — would be required to eliminate the pattern at the model level.

The study covered twelve models from five providers. Broader replication across additional model families and sizes would be needed to confirm whether the fine-tuning signature hypothesis holds at scale. The paper does not address whether em dash rates shift as providers release updated versions of the same models, which would be a relevant follow-up given the pace of post-training iteration across the industry.
