TY - JOUR
T1 - A six-tiered framework for evaluating AI models from repeatability to replaceability
AU - Tian, Siqi
AU - Lam, Alicia Wan Yu
AU - Sung, Joseph Jao Yiu
AU - Goh, Wilson Wen Bin
N1 - Publisher Copyright: © 2025 Elsevier Ltd
PY - 2025
Y1 - 2025
N2 - Artificial intelligence (AI) is rapidly transforming biotechnology and medicine. But evaluating its safety, effectiveness, and generalizability is increasingly challenging, especially for complex generative models. Traditional evaluation metrics often fall short in high-stakes applications where reliability and adaptability are critical. We propose a six-tiered framework to guide AI evaluation across the dimensions of repeatability, reproducibility, robustness, rigidity, reusability, and replaceability. These tiers reflect increasing expectations, from basic consistency to deployment. Each is defined clearly, with actionable testing methodologies informed by literature. Designed for flexibility, the framework applies to both traditional and generative AI. Through case studies in diagnostics and medical large language models (LLMs), we demonstrate its utility in fostering trustworthy, accountable, and effective AI for biomedicine, biotechnology, and beyond.
KW - artificial intelligence
KW - data science
KW - machine learning
KW - model evaluation
KW - robustness
UR - http://www.scopus.com/inward/record.url?scp=105012526764&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105012526764&partnerID=8YFLogxK
U2 - 10.1016/j.tibtech.2025.07.015
DO - 10.1016/j.tibtech.2025.07.015
M3 - Review article
AN - SCOPUS:105012526764
SN - 0167-7799
JO - Trends in Biotechnology
JF - Trends in Biotechnology
ER -