How doppelgänger effects in biomedical data confound machine learning

Li Rong Wang, Limsoon Wong, Wilson Wen Bin Goh*

*Corresponding author for this work

Research output: Contribution to journal › Short survey › peer-review

7 Citations (Scopus)

Abstract

Machine learning (ML) models have been increasingly adopted in drug development for faster identification of potential targets. Cross-validation techniques are commonly used to evaluate these models. However, the reliability of such validation methods can be affected by the presence of data doppelgängers. Data doppelgängers occur when independently derived data are very similar to each other, causing models to perform well regardless of how they are trained (i.e., the doppelgänger effect). Despite the abundance of data doppelgängers in biomedical data and their inflationary effects, they remain uncharacterized. We show their prevalence in biomedical data, demonstrate how doppelgängers arise, and provide proof of their confounding effects. To mitigate the doppelgänger effect, we recommend identifying data doppelgängers before the training-validation split.
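The recommendation above, identifying doppelgängers before the training-validation split, can be illustrated with a minimal sketch. It assumes that similarity is measured by the Pearson correlation between sample profiles and that a cutoff of 0.95 marks a doppelgänger pair; neither choice is stated in the abstract, and the function names and synthetic data are hypothetical. Flagged samples are grouped so that each group lands entirely on one side of the split.

```python
import numpy as np

def find_doppelganger_pairs(X, threshold=0.95):
    """Return sample index pairs (i, j), i < j, whose Pearson correlation
    exceeds `threshold`. The threshold is an illustrative assumption."""
    corr = np.corrcoef(X)  # rows of X are samples, so corr is sample-by-sample
    i_idx, j_idx = np.triu_indices_from(corr, k=1)
    mask = corr[i_idx, j_idx] > threshold
    return list(zip(i_idx[mask].tolist(), j_idx[mask].tolist()))

def group_samples(n_samples, pairs):
    """Union-find: samples linked by a flagged pair receive the same group label."""
    parent = list(range(n_samples))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for i, j in pairs:
        parent[find(i)] = find(j)
    return [find(k) for k in range(n_samples)]

def split_by_group(groups, val_fraction=0.3, seed=0):
    """Assign whole groups to validation so linked samples never straddle the split."""
    rng = np.random.default_rng(seed)
    unique = np.array(sorted(set(groups)))
    rng.shuffle(unique)
    n_val = max(1, int(round(val_fraction * len(unique))))
    val_groups = set(unique[:n_val].tolist())
    train = [k for k, g in enumerate(groups) if g not in val_groups]
    val = [k for k, g in enumerate(groups) if g in val_groups]
    return train, val

# Synthetic demonstration: sample 5 is a near copy of sample 0.
rng = np.random.default_rng(42)
X = rng.normal(size=(6, 200))
X[5] = X[0] + 0.01 * rng.normal(size=200)

pairs = find_doppelganger_pairs(X)
groups = group_samples(len(X), pairs)
train, val = split_by_group(groups)
print("flagged pairs:", pairs)
print("train:", train, "validation:", val)
```

Splitting by group rather than by sample is one design option; dropping one member of each flagged pair is another, at the cost of discarding data.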

Original language: English
Pages (from-to): 678-685
Number of pages: 8
Journal: Drug Discovery Today
Volume: 27
Issue number: 3
Publication status: Published - Mar 2022
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2021 Elsevier Ltd

ASJC Scopus Subject Areas

  • Pharmacology
  • Drug Discovery

Keywords

  • Computational biology
  • Data science
  • Doppelgänger effect
  • Machine learning
