Abstract
Batch effect associated missing values (BEAMs) are batch-wide missing values that arise when datasets with different coverage of biomedical features are integrated. BEAMs can present substantial challenges in data analysis. This study investigates how BEAMs impact missing value imputation (MVI) and batch effect (BE) correction algorithms (BECAs). Through simulations and analyses of real-world datasets, including data from the Clinical Proteomic Tumour Analysis Consortium (CPTAC), we evaluated six MVI methods: K-nearest neighbors (KNN), Mean, MinProb, Singular Value Decomposition (SVD), Multivariate Imputation by Chained Equations (MICE), and Random Forest (RF), with ComBat and limma as the BECAs. We demonstrated that BEAMs strongly affect MVI performance, resulting in inaccurate imputed values, inflation of significant P-values, and compromised BE correction. KNN, SVD, and RF were particularly prone to propagating random signals, resulting in false statistical confidence. While imputation with Mean and MinProb was less detrimental, artifacts were nonetheless introduced. Furthermore, the detrimental effect of BEAMs increased with their severity in the data. Our findings highlight the necessity of comprehensive assessments and tailored strategies to handle BEAMs in multi-batch datasets to ensure reliable data analysis and interpretation. Future work should investigate more advanced simulations and a variety of dedicated MVI methods to robustly address BEAMs.
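To make the definition of BEAMs concrete, the following is a minimal sketch in Python, not the simulation used in the study. It assumes two batches with an additive batch shift, masks a block of features in one batch, flags the features that are entirely missing within a batch, and applies a global mean imputation. All function names and parameters (`simulate_batches`, `add_beams`, `find_beams`, `global_mean_impute`, `batch_shift`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def simulate_batches(n_per_batch=10, n_features=6, batch_shift=2.0, seed=0):
    """Two batches of Gaussian data; batch 1 gets an additive shift (assumed BE model)."""
    rng = np.random.default_rng(seed)
    data = rng.normal(size=(2 * n_per_batch, n_features))
    batches = np.repeat([0, 1], n_per_batch)
    data[batches == 1] += batch_shift
    return data, batches


def add_beams(data, batches, batch, features):
    """Set the given features to NaN for every sample of one batch."""
    out = data.copy()
    out[np.ix_(batches == batch, features)] = np.nan
    return out


def find_beams(data, batches):
    """Return batch labels and a (n_batches, n_features) mask that is True
    where a feature is missing in every sample of that batch."""
    labels = np.unique(batches)
    mask = np.array([np.isnan(data[batches == b]).all(axis=0) for b in labels])
    return labels, mask


def global_mean_impute(data):
    """Replace each NaN with the mean of the observed values of its feature."""
    means = np.nanmean(data, axis=0)
    return np.where(np.isnan(data), means, data)


if __name__ == "__main__":
    data, batches = simulate_batches()
    missing = add_beams(data, batches, batch=1, features=[0, 1])
    labels, mask = find_beams(missing, batches)
    imputed = global_mean_impute(missing)
    # For a BEAM feature, every imputed value in batch 1 is the batch 0 mean,
    # so the imputed block carries no information about the batch 1 shift.
    print("BEAM mask per batch:\n", mask)
    print("true batch 1 mean, feature 0:", data[batches == 1, 0].mean())
    print("imputed batch 1 mean, feature 0:", imputed[batches == 1, 0].mean())
```

In this sketch, the imputed values for a BEAM feature are drawn entirely from the other batch, which illustrates, under the stated assumptions, why imputation across batches can interact badly with subsequent batch effect correction.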
Original language | English |
---|---|
Article number | bbaf168 |
Journal | Briefings in Bioinformatics |
Volume | 26 |
Issue number | 2 |
DOIs | |
Publication status | Published - Mar 1 2025 |
Externally published | Yes |
Bibliographical note
Publisher Copyright: © The Author(s) 2025. Published by Oxford University Press.
ASJC Scopus Subject Areas
- Information Systems
- Molecular Biology
Keywords
- batch effects
- biomedical informatics
- genomics
- missing values
- proteomics
- statistics
Press/Media
- Nanyang Technological University Reports Findings in Bioinformatics (Assessing the impact of batch effect associated missing values on downstream analysis in high-throughput biomedical data), 4/25/25
Press/Media: Research