Detecting and mitigating doppelgänger bias in microbiome data: impacts on machine learning and disease classification

Ruwen Zhou; Siu Kin Ng; Joseph J.Y. Sung; Sunny Hei Wong; Wilson Wen Bin Goh

doi:10.1080/19490976.2025.2554196

Ruwen Zhou, Siu Kin Ng, Joseph J.Y. Sung, Sunny Hei Wong^*, Wilson Wen Bin Goh^*

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Highly similar microbiome samples–so-called “doppelgänger pairs”–can distort analysis outcomes, yet are rarely addressed in microbiome studies. Here, we demonstrate that even a small proportion of such pairs (1–10% of samples) can substantially inflate machine learning performance across diverse disease cohorts including colorectal cancer (CRC), inflammatory bowel diseases (IBD), Clostridioides difficile infection (CDI), and obesity. Doppelgänger pairs also bias statistical tests and distort microbial network topology. In predictive models, classification accuracy was artificially boosted by 15–30% points across KNN, SVM, and Random Forest classifiers. In association testing, doppelgängers increased false-positive rates and decreased effect size stability; their removal reduced bootstrap variance by up to 28.3%. Moreover, the removal of doppelgängers yielded more stable networks. These effects were consistently observed across 16S, shotgun metagenomic, and simulated datasets. By accounting for highly similar samples, we reduce analytical noise and false discoveries, ultimately enabling more accurate and biologically meaningful microbiome insights.
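The abstract does not state how doppelgänger pairs are identified, so the following is only a minimal sketch of one plausible approach: flag sample pairs whose abundance profiles are almost perfectly correlated, then give each linked set of samples a shared group label so that a grouped train/test split cannot place near-duplicates on both sides. The function names (`find_similar_pairs`, `assign_groups`), the use of Pearson correlation, the 0.95 cutoff, and the toy profiles are all assumptions made for illustration and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: correlation undefined, treat as 0
    return cov / sqrt(var_x * var_y)


def find_similar_pairs(profiles, threshold=0.95):
    """Return (i, j) index pairs whose correlation exceeds the threshold.

    The 0.95 cutoff is an illustrative assumption, not a value from the paper.
    """
    return [
        (i, j)
        for i, j in combinations(range(len(profiles)), 2)
        if pearson(profiles[i], profiles[j]) > threshold
    ]


def assign_groups(n_samples, pairs):
    """Union-find: samples linked by a flagged pair share one group label."""
    parent = list(range(n_samples))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in pairs:
        parent[find(j)] = find(i)
    return [find(i) for i in range(n_samples)]


# Toy relative-abundance profiles (rows = samples, columns = taxa); made up.
profiles = [
    [0.50, 0.30, 0.15, 0.05],
    [0.49, 0.31, 0.15, 0.05],  # near-copy of sample 0
    [0.05, 0.15, 0.30, 0.50],
    [0.25, 0.25, 0.25, 0.25],  # constant profile
    [0.10, 0.60, 0.10, 0.20],
]

pairs = find_similar_pairs(profiles)
groups = assign_groups(len(profiles), pairs)
print(pairs)
print(groups)
```

The resulting group labels could then be passed to a grouped cross-validation splitter (for example, scikit-learn's `GroupKFold`), or one member of each flagged pair could be dropped before modelling. Which of these is appropriate depends on the study design.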

Original language	English
Article number	2554196
Journal	Gut Microbes
Volume	17
Issue number	1
DOIs	https://doi.org/10.1080/19490976.2025.2554196
Publication status	Published - 2025
Externally published	Yes

Bibliographical note

Publisher Copyright:
© 2025 The Author(s). Published with license by Taylor & Francis Group, LLC.

ASJC Scopus Subject Areas

Microbiology
Gastroenterology
Microbiology (medical)
Infectious Diseases

Keywords

doppelgänger
machine learning
Microbiome
pre-processing methodology

Access to Document

10.1080/19490976.2025.2554196

Cite this

@article{33840a66bdb54d6b923047f81dda8b35,
title = "Detecting and mitigating doppelg{\"a}nger bias in microbiome data: impacts on machine learning and disease classification",
abstract = "Highly similar microbiome samples–so-called “doppelg{\"a}nger pairs”–can distort analysis outcomes, yet are rarely addressed in microbiome studies. Here, we demonstrate that even a small proportion of such pairs (1–10\% of samples) can substantially inflate machine learning performance across diverse disease cohorts including colorectal cancer (CRC), inflammatory bowel diseases (IBD), Clostridioides difficile infection (CDI), and obesity. Doppelg{\"a}nger pairs also bias statistical tests and distort microbial network topology. In predictive models, classification accuracy was artificially boosted by 15–30\% points across KNN, SVM, and Random Forest classifiers. In association testing, doppelg{\"a}ngers increased false-positive rates and decreased effect size stability; their removal reduced bootstrap variance by up to 28.3\%. Moreover, the removal of doppelg{\"a}ngers yielded more stable networks. These effects were consistently observed across 16S, shotgun metagenomic, and simulated datasets. By accounting for highly similar samples, we reduce analytical noise and false discoveries, ultimately enabling more accurate and biologically meaningful microbiome insights.",
keywords = "doppelg{\"a}nger, machine learning, Microbiome, pre-processing methodology",
author = "Zhou, Ruwen and Ng, Siu Kin and Sung, Joseph J.Y. and Wong, Sunny Hei and Goh, Wilson Wen Bin",
note = "Publisher Copyright: {\textcopyright} 2025 The Author(s). Published with license by Taylor \& Francis Group, LLC.",
year = "2025",
doi = "10.1080/19490976.2025.2554196",
language = "English",
volume = "17",
journal = "Gut Microbes",
issn = "1949-0976",
publisher = "Taylor \& Francis",
number = "1",
}

TY - JOUR
T1 - Detecting and mitigating doppelgänger bias in microbiome data
T2 - impacts on machine learning and disease classification
AU - Zhou, Ruwen
AU - Ng, Siu Kin
AU - Sung, Joseph J.Y.
AU - Wong, Sunny Hei
AU - Goh, Wilson Wen Bin
PY - 2025
Y1 - 2025

N2 - Highly similar microbiome samples–so-called “doppelgänger pairs”–can distort analysis outcomes, yet are rarely addressed in microbiome studies. Here, we demonstrate that even a small proportion of such pairs (1–10% of samples) can substantially inflate machine learning performance across diverse disease cohorts including colorectal cancer (CRC), inflammatory bowel diseases (IBD), Clostridioides difficile infection (CDI), and obesity. Doppelgänger pairs also bias statistical tests and distort microbial network topology. In predictive models, classification accuracy was artificially boosted by 15–30% points across KNN, SVM, and Random Forest classifiers. In association testing, doppelgängers increased false-positive rates and decreased effect size stability; their removal reduced bootstrap variance by up to 28.3%. Moreover, the removal of doppelgängers yielded more stable networks. These effects were consistently observed across 16S, shotgun metagenomic, and simulated datasets. By accounting for highly similar samples, we reduce analytical noise and false discoveries, ultimately enabling more accurate and biologically meaningful microbiome insights.

AB - Highly similar microbiome samples–so-called “doppelgänger pairs”–can distort analysis outcomes, yet are rarely addressed in microbiome studies. Here, we demonstrate that even a small proportion of such pairs (1–10% of samples) can substantially inflate machine learning performance across diverse disease cohorts including colorectal cancer (CRC), inflammatory bowel diseases (IBD), Clostridioides difficile infection (CDI), and obesity. Doppelgänger pairs also bias statistical tests and distort microbial network topology. In predictive models, classification accuracy was artificially boosted by 15–30% points across KNN, SVM, and Random Forest classifiers. In association testing, doppelgängers increased false-positive rates and decreased effect size stability; their removal reduced bootstrap variance by up to 28.3%. Moreover, the removal of doppelgängers yielded more stable networks. These effects were consistently observed across 16S, shotgun metagenomic, and simulated datasets. By accounting for highly similar samples, we reduce analytical noise and false discoveries, ultimately enabling more accurate and biologically meaningful microbiome insights.

KW - doppelgänger
KW - machine learning
KW - Microbiome
KW - pre-processing methodology
UR - http://www.scopus.com/inward/record.url?scp=105014805194&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105014805194&partnerID=8YFLogxK
U2 - 10.1080/19490976.2025.2554196
DO - 10.1080/19490976.2025.2554196
M3 - Article
AN - SCOPUS:105014805194
SN - 1949-0976
VL - 17
JO - Gut Microbes
JF - Gut Microbes
IS - 1
M1 - 2554196
ER -
