CoIn: Correlation Induced Clustering for Cognition of High Dimensional Bioinformatics Data

Zeng Zeng; Ziyuan Zhao; Kaixin Xu; Yangfan Li; Cen Chen; Xiaofeng Zou; Yulan Wang; Wei Wei; Pierce K.H. Chow; Xiaoli Li

doi:10.1109/JBHI.2022.3179265

Zeng Zeng, Ziyuan Zhao, Kaixin Xu, Yangfan Li*, Cen Chen*, Xiaofeng Zou, Yulan Wang, Wei Wei, Pierce K.H. Chow, Xiaoli Li

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

10 Citations (Scopus)

Abstract

Analysis of high dimensional biomedical data, such as microarray gene expression data and mass spectrometry images, is crucial for providing better medical services, including cancer subtyping and protein homology detection. Clustering is a fundamental cognitive task that aims to group unlabeled data into multiple clusters based on their intrinsic similarities. However, most clustering methods, including the widely used K-means algorithm, treat all features of high dimensional data as equally relevant, which degrades performance when the data contain many redundant and correlated variables. In this paper, we address the problem of clustering high dimensional bioinformatics data and propose a new correlation induced clustering method, CoIn, to capture complex correlations among high dimensional data and guarantee correlation consistency within each cluster. We evaluate the proposed method on a high dimensional mass spectrometry dataset of liver cancer tumor to explore metabolic differences across tissues and discover intra-tumor heterogeneity (ITH). Compared with the baselines, our method produces more explainable and understandable results for clinical analysis, which demonstrates that the proposed clustering paradigm has potential for knowledge discovery in high dimensional bioinformatics data.

Original language	English
Pages (from-to)	598-607
Number of pages	10
Journal	IEEE Journal of Biomedical and Health Informatics
Volume	27
Issue number	2
DOIs	https://doi.org/10.1109/JBHI.2022.3179265
Publication status	Published - Feb 1 2023
Externally published	Yes

Bibliographical note

Publisher Copyright:
© 2022 IEEE.

ASJC Scopus Subject Areas

Computer Science Applications
Health Informatics
Electrical and Electronic Engineering
Health Information Management

Keywords

Clustering
correlation analysis
correlation induced clustering

Cite this

@article{098fd7ada4f240acb69cada25f312390,

title = "CoIn: Correlation Induced Clustering for Cognition of High Dimensional Bioinformatics Data",

abstract = "Analysis of high dimensional biomedical data, such as microarray gene expression data and mass spectrometry images, is crucial for providing better medical services, including cancer subtyping and protein homology detection. Clustering is a fundamental cognitive task that aims to group unlabeled data into multiple clusters based on their intrinsic similarities. However, most clustering methods, including the widely used K-means algorithm, treat all features of high dimensional data as equally relevant, which degrades performance when the data contain many redundant and correlated variables. In this paper, we address the problem of clustering high dimensional bioinformatics data and propose a new correlation induced clustering method, CoIn, to capture complex correlations among high dimensional data and guarantee correlation consistency within each cluster. We evaluate the proposed method on a high dimensional mass spectrometry dataset of liver cancer tumor to explore metabolic differences across tissues and discover intra-tumor heterogeneity (ITH). Compared with the baselines, our method produces more explainable and understandable results for clinical analysis, which demonstrates that the proposed clustering paradigm has potential for knowledge discovery in high dimensional bioinformatics data.",

keywords = "Clustering, correlation analysis, correlation induced clustering",

author = "Zeng Zeng and Ziyuan Zhao and Kaixin Xu and Yangfan Li and Cen Chen and Xiaofeng Zou and Yulan Wang and Wei Wei and Chow, {Pierce K.H.} and Xiaoli Li",

note = "Publisher Copyright: {\textcopyright} 2022 IEEE.",

year = "2023",

month = feb,

day = "1",

doi = "10.1109/JBHI.2022.3179265",

language = "English",

volume = "27",

pages = "598--607",

journal = "IEEE Journal of Biomedical and Health Informatics",

issn = "2168-2194",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "2",

}

TY - JOUR

T1 - CoIn

T2 - Correlation Induced Clustering for Cognition of High Dimensional Bioinformatics Data

AU - Zeng, Zeng

AU - Zhao, Ziyuan

AU - Xu, Kaixin

AU - Li, Yangfan

AU - Chen, Cen

AU - Zou, Xiaofeng

AU - Wang, Yulan

AU - Wei, Wei

AU - Chow, Pierce K.H.

AU - Li, Xiaoli

PY - 2023/2/1

Y1 - 2023/2/1

AB - Analysis of high dimensional biomedical data, such as microarray gene expression data and mass spectrometry images, is crucial for providing better medical services, including cancer subtyping and protein homology detection. Clustering is a fundamental cognitive task that aims to group unlabeled data into multiple clusters based on their intrinsic similarities. However, most clustering methods, including the widely used K-means algorithm, treat all features of high dimensional data as equally relevant, which degrades performance when the data contain many redundant and correlated variables. In this paper, we address the problem of clustering high dimensional bioinformatics data and propose a new correlation induced clustering method, CoIn, to capture complex correlations among high dimensional data and guarantee correlation consistency within each cluster. We evaluate the proposed method on a high dimensional mass spectrometry dataset of liver cancer tumor to explore metabolic differences across tissues and discover intra-tumor heterogeneity (ITH). Compared with the baselines, our method produces more explainable and understandable results for clinical analysis, which demonstrates that the proposed clustering paradigm has potential for knowledge discovery in high dimensional bioinformatics data.

KW - Clustering

KW - correlation analysis

KW - correlation induced clustering

UR - http://www.scopus.com/inward/record.url?scp=85133762778&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85133762778&partnerID=8YFLogxK

U2 - 10.1109/JBHI.2022.3179265

DO - 10.1109/JBHI.2022.3179265

M3 - Article

C2 - 35724285

AN - SCOPUS:85133762778

SN - 2168-2194

VL - 27

SP - 598

EP - 607

JO - IEEE Journal of Biomedical and Health Informatics

JF - IEEE Journal of Biomedical and Health Informatics

IS - 2

ER -
