Nature Machine Intelligence

A quantitative analysis of knowledge-learning preferences in large language models in molecular science

Pengfei Liu ¹,²

Jun Tao ¹

Zhixiang Ren ²

¹ School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China

² Peng Cheng Laboratory, Shenzhen, China

Publication type: Journal Article

Publication date: 2025-01-17

Publisher: Springer Nature

Journal: Nature Machine Intelligence

SCImago quartile: Q1

WoS quartile: Q1

SJR: 5.940

CiteScore: 36.9

Impact factor: 18.8

ISSN: 2522-5839 (Electronic)

DOI: 10.1038/s42256-024-00977-6

Abstract

Deep learning has significantly advanced molecular modelling and design, enabling an efficient understanding and discovery of novel molecules. In particular, large language models introduce a fresh research paradigm to tackle scientific problems from a natural language processing perspective. Large language models significantly enhance our understanding and generation of molecules, often surpassing existing methods with their capabilities to decode and synthesize complex molecular patterns. However, two key issues remain: how to quantify the match between model and data modalities and how to identify the knowledge-learning preferences of models. To address these challenges, we propose a multimodal benchmark, named ChEBI-20-MM, and perform 1,263 experiments to assess the model’s compatibility with data modalities and knowledge acquisition. Through the modal transition probability matrix, we provide insights into the most suitable modalities for tasks. Furthermore, we introduce a statistically interpretable approach to discover context-specific knowledge mapping by localized feature filtering. Our analysis offers an exploration of the learning mechanism and paves the way for advancing large language models in molecular science.

Large language models promise substantial advances in molecular modelling and design. A multimodal benchmark is proposed to analyse performance, and 1,263 experiments are conducted to examine the compatibility of a large language model with data modalities and knowledge acquisition.

Profiles

Ren, Zhixiang