Assessing the impact of batch effect associated missing values on downstream analysis in high-throughput biomedical data

Harvard Wai Hann Hui, Wei Xin Chan, Wilson Wen Bin Goh*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Batch effect associated missing values (BEAMs) are batch-wide missingness induced by integrating datasets that cover different sets of biomedical features. BEAMs can present substantial challenges in data analysis. This study investigates how BEAMs impact missing value imputation (MVI) and batch effect (BE) correction algorithms (BECAs). Through simulations and analyses of real-world datasets, including those from the Clinical Proteomic Tumour Analysis Consortium (CPTAC), we evaluated six MVI methods: K-nearest neighbors (KNN), Mean, MinProb, Singular Value Decomposition (SVD), Multivariate Imputation by Chained Equations (MICE), and Random Forest (RF), with ComBat and limma as the BECAs. We demonstrated that BEAMs strongly affect MVI performance, resulting in inaccurate imputed values, inflated significant P-values, and compromised BE correction. KNN, SVD, and RF were particularly prone to propagating random signals, resulting in false statistical confidence. Although imputation with Mean and MinProb was less detrimental, it nonetheless introduced artifacts. Furthermore, the detrimental effect of BEAMs increased in parallel with their severity in the data. Our findings highlight the necessity of comprehensive assessments and tailored strategies to handle BEAMs in multi-batch datasets to ensure reliable data analysis and interpretation. Future work should investigate more advanced simulations and a variety of dedicated MVI methods to robustly address BEAMs.
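
The definition of BEAMs as batch-wide missingness can be made concrete with a small sketch. The example below is illustrative only and is not the authors' code: it assumes a features-by-samples matrix with NaN marking missing values, and the function names `find_batchwide_missing` and `beam_features`, the toy matrix, and the batch labels are all hypothetical. It flags a feature as a BEAM when it is entirely missing in at least one batch but observed in at least one other.

```python
import numpy as np


def find_batchwide_missing(data, batches):
    """Return (labels, mask), where mask[i, j] is True if feature i is
    missing in every sample of batch labels[j].

    Assumes `data` is features x samples, with NaN marking missing values.
    """
    data = np.asarray(data, dtype=float)
    batches = np.asarray(batches)
    labels = np.unique(batches)
    mask = np.column_stack(
        [np.isnan(data[:, batches == b]).all(axis=1) for b in labels]
    )
    return labels, mask


def beam_features(data, batches):
    """Flag features that are absent from at least one whole batch
    but observed in at least one other batch (hypothetical helper)."""
    _, mask = find_batchwide_missing(data, batches)
    return mask.any(axis=1) & ~mask.all(axis=1)


# Toy matrix (an assumption for illustration): 4 features x 6 samples.
rng = np.random.default_rng(0)
toy = rng.normal(size=(4, 6))
toy_batches = np.array(["A", "A", "A", "B", "B", "B"])

toy[1, 3:] = np.nan  # feature 1 is absent from all of batch B: a BEAM
toy[2, 0] = np.nan   # one sporadic missing value: not a BEAM

is_beam = beam_features(toy, toy_batches)
beam_fraction = is_beam.mean()  # a crude proxy for BEAM severity
```

Under these assumptions, the fraction of flagged features gives one rough way to quantify how severe BEAMs are in a merged dataset before choosing an imputation strategy.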

Original language: English
Article number: bbaf168
Journal: Briefings in Bioinformatics
Volume: 26
Issue number: 2
Publication status: Published - Mar 1 2025
Externally published: Yes

Bibliographical note

Publisher Copyright:
© The Author(s) 2025. Published by Oxford University Press.

ASJC Scopus Subject Areas

  • Information Systems
  • Molecular Biology

Keywords

  • batch effects
  • biomedical informatics
  • genomics
  • missing values
  • proteomics
  • statistics
