The importance of batch sensitization in missing value imputation

Harvard Wai Hann Hui; Weijia Kong; Hui Peng; Wilson Wen Bin Goh

doi:10.1038/s41598-023-30084-2

The importance of batch sensitization in missing value imputation

Harvard Wai Hann Hui, Weijia Kong, Hui Peng, Wilson Wen Bin Goh^*

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

10 Citations (Scopus)

Abstract

Data analysis is complex due to a myriad of technical problems. Amongst these, missing values and batch effects are endemic. Although many methods have been developed for missing value imputation (MVI) and batch correction respectively, no study has directly considered the confounding impact of MVI on downstream batch correction. This is surprising as missing values are imputed during early pre-processing while batch effects are mitigated during late pre-processing, prior to functional analysis. Unless actively managed, MVI approaches generally ignore the batch covariate, with unknown consequences. We examine this problem by modelling three simple imputation strategies: global (M1), self-batch (M2) and cross-batch (M3) first via simulations, and then corroborated on real proteomics and genomics data. We report that explicit consideration of batch covariates (M2) is important for good outcomes, resulting in enhanced batch correction and lower statistical errors. However, M1 and M3 are error-generating: global and cross-batch averaging may result in batch-effect dilution, with concomitant and irreversible increase in intra-sample noise. This noise is unremovable via batch correction algorithms and produces false positives and negatives. Hence, careless imputation in the presence of non-negligible covariates such as batch effects should be avoided.

Original language	English
Article number	3003
Journal	Scientific Reports
Volume	13
Issue number	1
DOIs	https://doi.org/10.1038/s41598-023-30084-2
Publication status	Published - Dec 2023
Externally published	Yes

Bibliographical note

Publisher Copyright:
© 2023, The Author(s).

ASJC Scopus Subject Areas

General

Access to Document

10.1038/s41598-023-30084-2

Cite this

@article{ea4e40d678e143889e4870dedf6f1155,

title = "The importance of batch sensitization in missing value imputation",

abstract = "Data analysis is complex due to a myriad of technical problems. Amongst these, missing values and batch effects are endemic. Although many methods have been developed for missing value imputation (MVI) and batch correction respectively, no study has directly considered the confounding impact of MVI on downstream batch correction. This is surprising as missing values are imputed during early pre-processing while batch effects are mitigated during late pre-processing, prior to functional analysis. Unless actively managed, MVI approaches generally ignore the batch covariate, with unknown consequences. We examine this problem by modelling three simple imputation strategies: global (M1), self-batch (M2) and cross-batch (M3) first via simulations, and then corroborated on real proteomics and genomics data. We report that explicit consideration of batch covariates (M2) is important for good outcomes, resulting in enhanced batch correction and lower statistical errors. However, M1 and M3 are error-generating: global and cross-batch averaging may result in batch-effect dilution, with concomitant and irreversible increase in intra-sample noise. This noise is unremovable via batch correction algorithms and produces false positives and negatives. Hence, careless imputation in the presence of non-negligible covariates such as batch effects should be avoided.",

author = "Hui, \{Harvard Wai Hann\} and Weijia Kong and Hui Peng and Goh, \{Wilson Wen Bin\}",

note = "Publisher Copyright: {\textcopyright} 2023, The Author(s).",

year = "2023",

month = dec,

doi = "10.1038/s41598-023-30084-2",

language = "English",

volume = "13",

journal = "Scientific Reports",

issn = "2045-2322",

publisher = "Nature Publishing Group",

number = "1",

}

TY - JOUR

T1 - The importance of batch sensitization in missing value imputation

AU - Hui, Harvard Wai Hann

AU - Kong, Weijia

AU - Peng, Hui

AU - Goh, Wilson Wen Bin

PY - 2023/12

Y1 - 2023/12

N2 - Data analysis is complex due to a myriad of technical problems. Amongst these, missing values and batch effects are endemic. Although many methods have been developed for missing value imputation (MVI) and batch correction respectively, no study has directly considered the confounding impact of MVI on downstream batch correction. This is surprising as missing values are imputed during early pre-processing while batch effects are mitigated during late pre-processing, prior to functional analysis. Unless actively managed, MVI approaches generally ignore the batch covariate, with unknown consequences. We examine this problem by modelling three simple imputation strategies: global (M1), self-batch (M2) and cross-batch (M3) first via simulations, and then corroborated on real proteomics and genomics data. We report that explicit consideration of batch covariates (M2) is important for good outcomes, resulting in enhanced batch correction and lower statistical errors. However, M1 and M3 are error-generating: global and cross-batch averaging may result in batch-effect dilution, with concomitant and irreversible increase in intra-sample noise. This noise is unremovable via batch correction algorithms and produces false positives and negatives. Hence, careless imputation in the presence of non-negligible covariates such as batch effects should be avoided.

AB - Data analysis is complex due to a myriad of technical problems. Amongst these, missing values and batch effects are endemic. Although many methods have been developed for missing value imputation (MVI) and batch correction respectively, no study has directly considered the confounding impact of MVI on downstream batch correction. This is surprising as missing values are imputed during early pre-processing while batch effects are mitigated during late pre-processing, prior to functional analysis. Unless actively managed, MVI approaches generally ignore the batch covariate, with unknown consequences. We examine this problem by modelling three simple imputation strategies: global (M1), self-batch (M2) and cross-batch (M3) first via simulations, and then corroborated on real proteomics and genomics data. We report that explicit consideration of batch covariates (M2) is important for good outcomes, resulting in enhanced batch correction and lower statistical errors. However, M1 and M3 are error-generating: global and cross-batch averaging may result in batch-effect dilution, with concomitant and irreversible increase in intra-sample noise. This noise is unremovable via batch correction algorithms and produces false positives and negatives. Hence, careless imputation in the presence of non-negligible covariates such as batch effects should be avoided.

UR - http://www.scopus.com/inward/record.url?scp=85148548976&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85148548976&partnerID=8YFLogxK

U2 - 10.1038/s41598-023-30084-2

DO - 10.1038/s41598-023-30084-2

M3 - Article

C2 - 36810890

AN - SCOPUS:85148548976

SN - 2045-2322

VL - 13

JO - Scientific Reports

JF - Scientific Reports

IS - 1

M1 - 3003

ER -

The importance of batch sensitization in missing value imputation

Abstract

Bibliographical note

ASJC Scopus Subject Areas

Access to Document

Other files and links

Fingerprint

Cite this