TY - JOUR
T1 - Leveraging computational linguistics and machine learning for detection of ultra-high risk of mental health disorders in youths
AU - Kho, Jordon Junyang
AU - Song, Shangzheng
AU - Tan, Samuel Ming Xuan
AU - Fitriyah, Nur Hikmah
AU - Lokadjaja, Matheus Calvin
AU - Yee, Jie Yin
AU - Yang, Zixu
AU - Chen, Eric Yu Hai
AU - Lee, Jimmy
AU - Goh, Wilson Wen Bin
N1 - Publisher Copyright: © The Author(s) 2025.
PY - 2025/12
Y1 - 2025/12
AB - Mental illnesses often manifest through behavioral changes, with speech serving as a key medium for expressing thoughts and emotions. The use of computational linguistics on speech data in mental illnesses is a promising approach to uncover objective biomarkers for the early detection of mental illnesses. This study analyzed speech transcripts from 80 youths at ultra-high risk of psychosis (UHR) and 329 healthy controls, examining text features such as sentiment variability, cohesion, lexical sophistication, morphology, syntactic sophistication, and lexical diversity. Factor analysis revealed five key linguistic themes: Sentiment Intensity and Variability, Linguistic Register Alignment, Phonographic Uniqueness and Recognizability, Morphological Complexity and Imageability, and Lexical Richness and Typicalness. Regression analysis indicated UHR speech is characterized by diminished sentiment variability (β = –0.07), deviation from linguistic registers (β = –0.16), fewer phonographic neighbors (β = –0.11), lower morphological complexity (β = –0.36), and more predictable lexical structures (β = 0.05). Optimized machine learning (ML) models trained on Boruta-selected features achieved a mean AUC of 0.70. Our findings highlight the potential of sentiment and linguistic analyses in speech for training ML models to aid in early detection and monitoring of mental health conditions.
UR - http://www.scopus.com/inward/record.url?scp=105010707304&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105010707304&partnerID=8YFLogxK
U2 - 10.1038/s41537-025-00649-3
DO - 10.1038/s41537-025-00649-3
M3 - Article
AN - SCOPUS:105010707304
SN - 2334-265X
VL - 11
JO - Schizophrenia
JF - Schizophrenia
IS - 1
M1 - 98
ER -