A six-tiered framework for evaluating AI models from repeatability to replaceability

Siqi Tian; Alicia Wan Yu Lam; Joseph Jao Yiu Sung; Wilson Wen Bin Goh

doi:10.1016/j.tibtech.2025.07.015

Siqi Tian, Alicia Wan Yu Lam, Joseph Jao Yiu Sung*, Wilson Wen Bin Goh

*Corresponding author for this work

Research output: Contribution to journal › Review article › peer-review

Abstract

Artificial intelligence (AI) is rapidly transforming biotechnology and medicine. But evaluating its safety, effectiveness, and generalizability is increasingly challenging, especially for complex generative models. Traditional evaluation metrics often fall short in high-stakes applications where reliability and adaptability are critical. We propose a six-tiered framework to guide AI evaluation across the dimensions of repeatability, reproducibility, robustness, rigidity, reusability, and replaceability. These tiers reflect increasing expectations, from basic consistency to deployment. Each is defined clearly, with actionable testing methodologies informed by literature. Designed for flexibility, the framework applies to both traditional and generative AI. Through case studies in diagnostics and medical large language models (LLM), we demonstrate its utility in fostering trustworthy, accountable, and effective AI for biomedicine, biotechnology, and beyond.

Original language	English
Journal	Trends in Biotechnology
DOIs	https://doi.org/10.1016/j.tibtech.2025.07.015
Publication status	Accepted/In press - 2025
Externally published	Yes

Bibliographical note

Publisher Copyright:
© 2025 Elsevier Ltd

ASJC Scopus Subject Areas

Biotechnology
Bioengineering

Keywords

artificial intelligence
data science
machine learning
model evaluation
robustness

Cite this

@article{338cc6d974f14572b95306b0db279f0b,

title = "A six-tiered framework for evaluating AI models from repeatability to replaceability",

abstract = "Artificial intelligence (AI) is rapidly transforming biotechnology and medicine. But evaluating its safety, effectiveness, and generalizability is increasingly challenging, especially for complex generative models. Traditional evaluation metrics often fall short in high-stakes applications where reliability and adaptability are critical. We propose a six-tiered framework to guide AI evaluation across the dimensions of repeatability, reproducibility, robustness, rigidity, reusability, and replaceability. These tiers reflect increasing expectations, from basic consistency to deployment. Each is defined clearly, with actionable testing methodologies informed by literature. Designed for flexibility, the framework applies to both traditional and generative AI. Through case studies in diagnostics and medical large language models (LLM), we demonstrate its utility in fostering trustworthy, accountable, and effective AI for biomedicine, biotechnology, and beyond.",

keywords = "artificial intelligence, data science, machine learning, model evaluation, robustness",

author = "Siqi Tian and Lam, {Alicia Wan Yu} and Sung, {Joseph Jao Yiu} and Goh, {Wilson Wen Bin}",

note = "Publisher Copyright: {\textcopyright} 2025 Elsevier Ltd",

year = "2025",

doi = "10.1016/j.tibtech.2025.07.015",

language = "English",

journal = "Trends in Biotechnology",

issn = "0167-7799",

publisher = "Elsevier Limited",

}

TY - JOUR

T1 - A six-tiered framework for evaluating AI models from repeatability to replaceability

AU - Tian, Siqi

AU - Lam, Alicia Wan Yu

AU - Sung, Joseph Jao Yiu

AU - Goh, Wilson Wen Bin

PY - 2025

Y1 - 2025

N2 - Artificial intelligence (AI) is rapidly transforming biotechnology and medicine. But evaluating its safety, effectiveness, and generalizability is increasingly challenging, especially for complex generative models. Traditional evaluation metrics often fall short in high-stakes applications where reliability and adaptability are critical. We propose a six-tiered framework to guide AI evaluation across the dimensions of repeatability, reproducibility, robustness, rigidity, reusability, and replaceability. These tiers reflect increasing expectations, from basic consistency to deployment. Each is defined clearly, with actionable testing methodologies informed by literature. Designed for flexibility, the framework applies to both traditional and generative AI. Through case studies in diagnostics and medical large language models (LLM), we demonstrate its utility in fostering trustworthy, accountable, and effective AI for biomedicine, biotechnology, and beyond.

AB - Artificial intelligence (AI) is rapidly transforming biotechnology and medicine. But evaluating its safety, effectiveness, and generalizability is increasingly challenging, especially for complex generative models. Traditional evaluation metrics often fall short in high-stakes applications where reliability and adaptability are critical. We propose a six-tiered framework to guide AI evaluation across the dimensions of repeatability, reproducibility, robustness, rigidity, reusability, and replaceability. These tiers reflect increasing expectations, from basic consistency to deployment. Each is defined clearly, with actionable testing methodologies informed by literature. Designed for flexibility, the framework applies to both traditional and generative AI. Through case studies in diagnostics and medical large language models (LLM), we demonstrate its utility in fostering trustworthy, accountable, and effective AI for biomedicine, biotechnology, and beyond.

KW - artificial intelligence

KW - data science

KW - machine learning

KW - model evaluation

KW - robustness

UR - http://www.scopus.com/inward/record.url?scp=105012526764&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=105012526764&partnerID=8YFLogxK

U2 - 10.1016/j.tibtech.2025.07.015

DO - 10.1016/j.tibtech.2025.07.015

M3 - Review article

AN - SCOPUS:105012526764

SN - 0167-7799

JO - Trends in Biotechnology

JF - Trends in Biotechnology

ER -
