Demajh, Inc.

Topic Modeling and Link-Prediction for Material Property Discovery: what it means for business leaders

The study converts about 46,000 scientific papers on transition-metal dichalcogenides into a topic-material graph, then applies matrix-factorization link prediction to surface hidden material–property associations, turning unstructured literature into actionable discovery leads.

1. What the method is

The workflow fuses Hierarchical Non-negative Matrix Factorization with Logistic and Boolean matrix factorization to learn a three-level topic hierarchy from domain literature. It then treats topics and materials as graph nodes and uses probabilistic link-prediction to score missing edges, effectively forecasting undiscovered material–property relationships without labelled data.
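To make the topic-learning idea concrete, here is a minimal sketch of plain, single-level non-negative matrix factorization using the standard multiplicative update rules. It is not the authors' HNMFk: there is no hierarchy and no automatic choice of the topic count, and the function name `nmf`, the fixed `k`, and the toy document-term counts in `X` are all assumptions made for illustration.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(X, W, H):
    """Squared Frobenius reconstruction error ||X - WH||^2."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))

def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix X (documents x terms) into W (documents x topics)
    and H (topics x terms) with multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

# Toy counts (made up): 4 abstracts x 4 terms, with two obvious themes.
X = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 1, 4],
    [0, 0, 2, 3],
]
W, H = nmf(X, k=2)
error = frob_error(X, W, H)
```

Each row of `W` gives an abstract's weight on each topic, and each row of `H` gives a topic's weight on each term. Because both factors stay non-negative, the topics can be read as additive themes.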

2. Why the method was developed

Materials science knowledge is locked in millions of heterogeneous papers; traditional text mining struggles to expose cross-disciplinary connections. The authors needed a scalable, interpretable way to highlight promising but untested material applications and accelerate hypothesis generation while avoiding costly exhaustive experiments.

3. Who should care

4. How the method works

A targeted ontology extractor first isolates entity mentions for 73 transition-metal dichalcogenides. HNMFk, the hierarchical non-negative matrix factorization step, auto-selects the number of topics and clusters abstracts into interpretable themes. BNMFk, the Boolean factorization step, refines discrete topic–material edges, while logistic matrix factorization (LMF) overlays calibrated probabilities. Known links are removed during training, so the system must learn latent embeddings from the remaining graph; sigmoid-scored dot products of those embeddings then rank previously unseen topic–material pairs for experimental validation.
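The final scoring step can be sketched in a few lines. The example below is a simplified logistic matrix factorization trained by stochastic gradient descent on a toy topic-material matrix with one known link hidden. It assumes a small fixed embedding rank and L2 regularization. The names `train_lmf` and `score`, the hyperparameters, and the toy data are hypothetical and not taken from the paper.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_lmf(adj, mask, rank=2, lr=0.1, reg=0.01, epochs=500, seed=0):
    """Learn topic embeddings U and material embeddings V from visible entries.

    adj[i][j] is 1 if topic i is linked to material j, else 0.
    mask[i][j] is True for entries the model may see during training.
    """
    rng = random.Random(seed)
    n_topics, n_materials = len(adj), len(adj[0])
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_topics)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_materials)]
    for _ in range(epochs):
        for i in range(n_topics):
            for j in range(n_materials):
                if not mask[i][j]:
                    continue  # hidden entries never contribute to the loss
                p = sigmoid(sum(U[i][k] * V[j][k] for k in range(rank)))
                err = p - adj[i][j]  # gradient of the logistic loss w.r.t. the logit
                for k in range(rank):
                    u, v = U[i][k], V[j][k]
                    U[i][k] -= lr * (err * v + reg * u)
                    V[j][k] -= lr * (err * u + reg * v)
    return U, V

def score(U, V, i, j):
    """Sigmoid of the embedding dot product: the predicted link probability."""
    return sigmoid(sum(a * b for a, b in zip(U[i], V[j])))

# Toy graph (made up): topics 0-1 link to materials 0-2, topics 2-3 to materials 3-5.
adj = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
mask = [[True] * 6 for _ in range(4)]
mask[0][2] = False  # hide one known link so it has to be predicted

U, V = train_lmf(adj, mask)
hidden = score(U, V, 0, 2)    # the hidden true link
unlinked = score(U, V, 0, 3)  # a pair with no link
```

In this toy setting the hidden pair scores higher than the unlinked pair, because topic 0 shares its visible materials with topic 1, which is linked to material 2. Ranking all unseen pairs by this score is what produces the candidate list.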

5. How it was evaluated

Researchers masked all superconductivity papers for well-known superconductors and measured whether the model could still recover those links. Metrics included area under the precision-recall curve, top-k retrieval accuracy, and separation of positive versus negative edge probabilities across 216 test links.
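The three kinds of metric named above can be computed from a list of predicted probabilities and their true labels. The sketch below uses average precision as a common approximation of the area under the precision-recall curve. The function names and the example scores are illustrative assumptions, not the authors' evaluation code or data.

```python
def average_precision(scores, labels):
    """Mean of the precision values at each rank where a true link appears."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def top_k_recall(scores, labels, k):
    """Fraction of all true links that appear among the k highest-scored pairs."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    positives = sum(labels)
    found = sum(label for _, label in ranked[:k])
    return found / positives if positives else 0.0

def score_separation(scores, labels):
    """Mean score of true links minus mean score of non-links."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    return sum(pos) / len(pos) - sum(neg) / len(neg)

# Made-up example: five candidate pairs, three of which are true masked links.
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
labels = [1, 0, 1, 1, 0]

ap = average_precision(scores, labels)
recall_at_2 = top_k_recall(scores, labels, 2)
gap = score_separation(scores, labels)
```

A larger gap and a higher average precision both indicate that the model places masked true links above non-links, which is the behaviour the masking experiment is designed to test.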

6. How it performed

The ensemble correctly ranked 92 % of hidden superconducting links in the top quartile, while relegating 85 % of negatives to the bottom quartile. Top-10 retrieval captured 23 of 24 masked edges and identified additional high-probability candidates for energy storage and tribology. (Source: arXiv 2507.06139, 2025)
