Leveraging computational linguistics and machine learning for detection of ultra-high risk of mental health disorders in youths

Jordon Junyang Kho; Shangzheng Song; Samuel Ming Xuan Tan; Nur Hikmah Fitriyah; Matheus Calvin Lokadjaja; Jie Yin Yee; Zixu Yang; Eric Yu Hai Chen; Jimmy Lee; Wilson Wen Bin Goh

doi:10.1038/s41537-025-00649-3

Corresponding authors for this work: Jimmy Lee, Wilson Wen Bin Goh

Research output: Contribution to journal › Article › peer-review

Abstract

Mental illnesses often manifest through behavioral changes, with speech serving as a key medium for expressing thoughts and emotions. The use of computational linguistics on speech data in mental illnesses is a promising approach to uncover objective biomarkers for the early detection of mental illnesses. This study analyzed speech transcripts from 80 youths at ultra-high risk of psychosis (UHR) and 329 healthy controls, examining text features such as sentiment variability, cohesion, lexical sophistication, morphology, syntactic sophistication, and lexical diversity. Factor analysis revealed five key linguistic themes: Sentiment Intensity and Variability, Linguistic Register Alignment, Phonographic Uniqueness and Recognizability, Morphological Complexity and Imageability, and Lexical Richness and Typicalness. Regression analysis indicated UHR speech is characterized by diminished sentiment variability (β = –0.07), deviation from linguistic registers (β = –0.16), fewer phonographic neighbors (β = –0.11), lower morphological complexity (β = –0.36), and more predictable lexical structures (β = 0.05). Optimized machine learning (ML) models trained on Boruta-selected features achieved a mean AUC of 0.70. Our findings highlight the potential of sentiment and linguistic analyses in speech for training ML models to aid in early detection and monitoring of mental health conditions.

Original language	English
Article number	98
Journal	Schizophrenia
Volume	11
Issue number	1
DOIs	https://doi.org/10.1038/s41537-025-00649-3
Publication status	Published - Dec 2025
Externally published	Yes

Bibliographical note

Publisher Copyright:
© The Author(s) 2025.

ASJC Scopus Subject Areas

Clinical Psychology
Psychiatry and Mental health
Biological Psychiatry

Access to Document

10.1038/s41537-025-00649-3

Cite this

@article{762719fb35f94122b10501dfef0f5af2,

title = "Leveraging computational linguistics and machine learning for detection of ultra-high risk of mental health disorders in youths",

abstract = "Mental illnesses often manifest through behavioral changes, with speech serving as a key medium for expressing thoughts and emotions. The use of computational linguistics on speech data in mental illnesses is a promising approach to uncover objective biomarkers for the early detection of mental illnesses. This study analyzed speech transcripts from 80 youths at ultra-high risk of psychosis (UHR) and 329 healthy controls, examining text features such as sentiment variability, cohesion, lexical sophistication, morphology, syntactic sophistication, and lexical diversity. Factor analysis revealed five key linguistic themes: Sentiment Intensity and Variability, Linguistic Register Alignment, Phonographic Uniqueness and Recognizability, Morphological Complexity and Imageability, and Lexical Richness and Typicalness. Regression analysis indicated UHR speech is characterized by diminished sentiment variability (β = –0.07), deviation from linguistic registers (β = –0.16), fewer phonographic neighbors (β = –0.11), lower morphological complexity (β = –0.36), and more predictable lexical structures (β = 0.05). Optimized machine learning (ML) models trained on Boruta-selected features achieved a mean AUC of 0.70. Our findings highlight the potential of sentiment and linguistic analyses in speech for training ML models to aid in early detection and monitoring of mental health conditions.",

author = "Kho, {Jordon Junyang} and Shangzheng Song and Tan, {Samuel Ming Xuan} and Fitriyah, {Nur Hikmah} and Lokadjaja, {Matheus Calvin} and Yee, {Jie Yin} and Zixu Yang and Chen, {Eric Yu Hai} and Jimmy Lee and Goh, {Wilson Wen Bin}",

note = "Publisher Copyright: {\textcopyright} The Author(s) 2025.",

year = "2025",

month = dec,

doi = "10.1038/s41537-025-00649-3",

language = "English",

volume = "11",

journal = "Schizophrenia",

issn = "2334-265X",

publisher = "Nature Publishing Group",

number = "1",

}

TY - JOUR

T1 - Leveraging computational linguistics and machine learning for detection of ultra-high risk of mental health disorders in youths

AU - Kho, Jordon Junyang

AU - Song, Shangzheng

AU - Tan, Samuel Ming Xuan

AU - Fitriyah, Nur Hikmah

AU - Lokadjaja, Matheus Calvin

AU - Yee, Jie Yin

AU - Yang, Zixu

AU - Chen, Eric Yu Hai

AU - Lee, Jimmy

AU - Goh, Wilson Wen Bin

PY - 2025/12

Y1 - 2025/12

N2 - Mental illnesses often manifest through behavioral changes, with speech serving as a key medium for expressing thoughts and emotions. The use of computational linguistics on speech data in mental illnesses is a promising approach to uncover objective biomarkers for the early detection of mental illnesses. This study analyzed speech transcripts from 80 youths at ultra-high risk of psychosis (UHR) and 329 healthy controls, examining text features such as sentiment variability, cohesion, lexical sophistication, morphology, syntactic sophistication, and lexical diversity. Factor analysis revealed five key linguistic themes: Sentiment Intensity and Variability, Linguistic Register Alignment, Phonographic Uniqueness and Recognizability, Morphological Complexity and Imageability, and Lexical Richness and Typicalness. Regression analysis indicated UHR speech is characterized by diminished sentiment variability (β = –0.07), deviation from linguistic registers (β = –0.16), fewer phonographic neighbors (β = –0.11), lower morphological complexity (β = –0.36), and more predictable lexical structures (β = 0.05). Optimized machine learning (ML) models trained on Boruta-selected features achieved a mean AUC of 0.70. Our findings highlight the potential of sentiment and linguistic analyses in speech for training ML models to aid in early detection and monitoring of mental health conditions.

UR - http://www.scopus.com/inward/record.url?scp=105010707304&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=105010707304&partnerID=8YFLogxK

U2 - 10.1038/s41537-025-00649-3

DO - 10.1038/s41537-025-00649-3

M3 - Article

AN - SCOPUS:105010707304

SN - 2334-265X

VL - 11

JO - Schizophrenia

JF - Schizophrenia

IS - 1

M1 - 98

ER -
