Detecting and mitigating doppelgänger bias in microbiome data: impacts on machine learning and disease classification

Ruwen Zhou, Siu Kin Ng, Joseph J.Y. Sung, Sunny Hei Wong*, Wilson Wen Bin Goh*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Highly similar microbiome samples, so-called “doppelgänger pairs”, can distort analysis outcomes, yet are rarely addressed in microbiome studies. Here, we demonstrate that even a small proportion of such pairs (1–10% of samples) can substantially inflate machine learning performance across diverse disease cohorts including colorectal cancer (CRC), inflammatory bowel diseases (IBD), Clostridioides difficile infection (CDI), and obesity. Doppelgänger pairs also bias statistical tests and distort microbial network topology. In predictive models, classification accuracy was artificially boosted by 15–30 percentage points across KNN, SVM, and Random Forest classifiers. In association testing, doppelgängers increased false-positive rates and decreased effect size stability; their removal reduced bootstrap variance by up to 28.3%. Moreover, the removal of doppelgängers yielded more stable networks. These effects were consistently observed across 16S, shotgun metagenomic, and simulated datasets. By accounting for highly similar samples, we reduce analytical noise and false discoveries, ultimately enabling more accurate and biologically meaningful microbiome insights.
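
The abstract does not state how doppelgänger pairs are identified, so the following is only an illustrative sketch of one plausible screening step, not the authors' procedure. It assumes a samples-by-taxa abundance matrix and flags sample pairs whose pairwise Pearson correlation exceeds a cutoff. The function name `find_similar_pairs`, the matrix layout, and the 0.95 threshold are all hypothetical choices made for this example.

```python
import numpy as np


def find_similar_pairs(abundance, threshold=0.95):
    """Return (i, j, r) for sample pairs whose Pearson correlation exceeds threshold.

    abundance: 2-D array, rows = samples, columns = taxa (assumed layout).
    threshold: illustrative cutoff, an assumption and not a value from the paper.
    """
    abundance = np.asarray(abundance, dtype=float)
    corr = np.corrcoef(abundance)  # sample-by-sample correlation matrix
    i_idx, j_idx = np.triu_indices_from(corr, k=1)  # visit each unordered pair once
    mask = corr[i_idx, j_idx] > threshold
    return [
        (int(i), int(j), float(corr[i, j]))
        for i, j in zip(i_idx[mask], j_idx[mask])
    ]


# Toy data: sample 1 is a noisy near-copy of sample 0; sample 2 is unrelated.
rng = np.random.default_rng(0)
base = rng.random(50)
samples = np.vstack([base, base + rng.normal(0, 0.01, 50), rng.random(50)])

pairs = find_similar_pairs(samples)
for i, j, r in pairs:
    print(f"samples {i} and {j}: r = {r:.3f}")
```

In a workflow of this kind, flagged pairs that straddle a training/test split would be the ones most likely to inflate reported accuracy, so one member of each pair could be dropped or both kept on the same side of the split before model evaluation.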

Original language: English
Article number: 2554196
Journal: Gut Microbes
Volume: 17
Issue number: 1
Publication status: Published - 2025
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2025 The Author(s). Published with license by Taylor & Francis Group, LLC.

ASJC Scopus Subject Areas

  • Microbiology
  • Gastroenterology
  • Microbiology (medical)
  • Infectious Diseases

Keywords

  • doppelgänger
  • machine learning
  • Microbiome
  • pre-processing methodology
