TY - JOUR
T1 - ETM-F
T2 - an enriched topic modeling and filtration framework integrating ontologies and deep learning for biomedical trend analysis
AU - Altarawneh, Ahmad
AU - Arzoky, Mahir
AU - Swift, Stephen
AU - Fayez, Reem Qadan Al
AU - Al-Zoubi, Moh’d Belal
AU - Sowan, Bilal
AU - Zhang, Li
PY - 2026
Y1 - 2026
N2 - Selecting research topics from author keywords helps identify emerging trends in medicine, but keyword-only models remain fragile because of specialized terminology and rapidly evolving vocabularies. To address this limitation, we propose Enriched Topic Modeling with Filtration (ETM-F), a methodological framework that integrates frequency-based keyword filtration with Medical Subject Headings (MeSH) enrichment. ETM-F is model-agnostic and can be applied across probabilistic and neural topic models. We evaluate ETM-F against baseline Latent Dirichlet Allocation (LDA) and extend the analysis to hierarchical LDA (hLDA), Dynamic Topic Models (DTM), Contextualized Topic Models (CTM), and BERTopic. Model performance is assessed through perplexity, Cv coherence, normalized pointwise mutual information (NPMI), topic diversity, dominant topic contribution, and temporal coherence. Statistical significance is established through bootstrap confidence intervals and Wilcoxon signed-rank tests. Experiments on 232,191 medical publications from Scopus (2020) confirm the advantages of enrichment. For LDA, Cv coherence rises from 0.37 with author keywords to 0.52 with MeSH-enriched keywords, while NPMI improves from –0.05 to 0.08. Neural topic models provide higher quality results. After filtration, BERTopic achieves the best coherence (Cv = 0.75, NPMI = 0.28), followed by hierarchical LDA (Cv = 0.66) and CTM (Cv = 0.60). Dynamic Topic Models and time-sliced BERTopic capture temporal signals, including the shift from generic vaccine themes to mRNA-specific terms and the surge of telemedicine. Embedding-based keyword expansion with BioWordVec and BioBERT further enhances coherence by 0.03 to 0.05 and identifies synonyms such as “SARS-CoV-2”. These findings confirm that semantic enrichment and dynamic models improve topic discovery in large-scale biomedical text. ETM-F establishes a reproducible benchmark for future research and offers a reliable tool for identifying thematic directions in medical literature.
U2 - 10.1007/s10115-025-02633-w
DO - 10.1007/s10115-025-02633-w
M3 - Article
SN - 0219-3116
VL - 68
JO - Knowledge and Information Systems
JF - Knowledge and Information Systems
M1 - 25
ER -