ETM-F: an enriched topic modeling and filtration framework integrating ontologies and deep learning for biomedical trend analysis

  • Ahmad Altarawneh
  • Mahir Arzoky
  • Stephen Swift
  • Reem Qadan Al Fayez
  • Moh’d Belal Al-Zoubi
  • Bilal Sowan
  • Li Zhang

Research output: Contribution to journal › Article › peer-review

Abstract

Selecting research topics from author keywords helps identify emerging trends in medicine, but keyword-only models remain fragile because of specialized terminology and rapidly evolving vocabularies. To address this limitation, we propose Enriched Topic Modeling with Filtration (ETM-F), a methodological framework that integrates frequency-based keyword filtration with Medical Subject Headings (MeSH) enrichment. ETM-F is model-agnostic and can be applied across probabilistic and neural topic models. We evaluate ETM-F against baseline Latent Dirichlet Allocation (LDA) and extend the analysis to hierarchical LDA (hLDA), Dynamic Topic Models (DTM), Contextualized Topic Models (CTM), and BERTopic. Model performance is assessed through perplexity, Cv coherence, normalized pointwise mutual information (NPMI), topic diversity, dominant topic contribution, and temporal coherence. Statistical significance is established through bootstrap confidence intervals and Wilcoxon signed-rank tests. Experiments on 232,191 medical publications from Scopus (2020) confirm the advantages of enrichment. For LDA, Cv coherence rises from 0.37 with author keywords to 0.52 with MeSH-enriched keywords, while NPMI improves from –0.05 to 0.08. Neural topic models yield higher-quality topics. After filtration, BERTopic achieves the best coherence (Cv = 0.75, NPMI = 0.28), followed by hLDA (Cv = 0.66) and CTM (Cv = 0.60). DTM and time-sliced BERTopic capture temporal signals, including the shift from generic vaccine themes to mRNA-specific terms and the surge of telemedicine. Embedding-based keyword expansion with BioWordVec and BioBERT further improves coherence by 0.03 to 0.05 and identifies synonyms such as “SARS-CoV-2”. These findings confirm that semantic enrichment and dynamic models improve topic discovery in large-scale biomedical text. ETM-F establishes a reproducible benchmark for future research and offers a reliable tool for identifying thematic directions in medical literature.
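
The two preprocessing ideas named in the abstract, frequency-based keyword filtration and MeSH enrichment, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the document-frequency thresholds (`min_df`, `max_df_ratio`), the order of the two steps, and the small `toy_mesh` lookup are all assumptions made for illustration; a real pipeline would resolve keywords against the actual MeSH vocabulary.

```python
from collections import Counter


def filter_keywords(docs, min_df=2, max_df_ratio=0.9):
    """Drop keywords that are too rare or too common across documents.

    docs is a list of keyword lists, one per publication.
    min_df and max_df_ratio are illustrative thresholds, not values from the paper.
    """
    n_docs = len(docs)
    # Document frequency: count each keyword at most once per document.
    df = Counter(kw for doc in docs for kw in set(doc))
    keep = {
        kw for kw, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    }
    return [[kw for kw in doc if kw in keep] for doc in docs]


def enrich_keywords(docs, mesh_lookup):
    """Append controlled-vocabulary headings for each keyword, skipping duplicates.

    mesh_lookup maps an author keyword to a list of headings (hypothetical here).
    """
    enriched = []
    for doc in docs:
        terms = list(doc)
        for kw in doc:
            for heading in mesh_lookup.get(kw, []):
                if heading not in terms:
                    terms.append(heading)
        enriched.append(terms)
    return enriched


# Toy data: four publications with lower-cased author keywords.
docs = [
    ["covid-19", "mrna vaccine", "telemedicine"],
    ["covid-19", "telemedicine", "rare typo kw"],
    ["covid-19", "mrna vaccine"],
    ["telemedicine", "mrna vaccine"],
]

# Hypothetical keyword-to-heading mapping standing in for a MeSH lookup.
toy_mesh = {
    "mrna vaccine": ["Vaccines"],
    "telemedicine": ["Telemedicine"],
}

filtered = filter_keywords(docs, min_df=2)
enriched = enrich_keywords(filtered, toy_mesh)
```

The resulting `enriched` keyword lists could then be passed to any topic model, in line with the abstract's description of ETM-F as model-agnostic.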
Original language: English
Article number: 25
Journal: Knowledge and Information Systems
Volume: 68
Early online date: 29 Dec 2025
Publication status: Published - 2026
