Performance comparison of models across the benchmark task categories. P, R, and F1 denote Precision, Recall, and F1-score, respectively. FKA: Foundational Knowledge Application; LSA: Lab Safety Analysis; EMR: Experiment Mechanism Reasoning; RDA: Raw Data Extraction and Analysis; PAE: Performance & Application Exploration. The best results are in bold and the second best are underlined.
| Model | FKA P | FKA R | FKA F1 | LSA P | LSA R | LSA F1 | EMR P | EMR R | EMR F1 | RDA P | RDA R | RDA F1 | PAE P | PAE R | PAE F1 | Overall P | Overall R | Overall F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Closed-source Models* | | | | | | | | | | | | | | | | | | |
| O3 | 0.608 | 0.775 | 0.641 | 0.366 | 0.541 | 0.412 | 0.565 | 0.745 | 0.611 | 0.424 | 0.553 | 0.422 | 0.579 | 0.733 | 0.609 | 0.567 | 0.733 | 0.601 |
| GPT-5 | 0.544 | 0.799 | 0.594 | 0.317 | 0.563 | 0.373 | 0.543 | 0.801 | 0.608 | 0.418 | 0.523 | 0.426 | 0.581 | 0.742 | 0.623 | 0.529 | 0.760 | 0.578 |
| Claude-Sonnet-4.5-Thinking | 0.519 | 0.752 | 0.580 | 0.282 | 0.411 | 0.317 | 0.539 | 0.700 | 0.578 | 0.354 | 0.489 | 0.356 | 0.526 | 0.663 | 0.539 | 0.503 | 0.692 | 0.546 |
| Gemini-2.5-Pro | 0.505 | 0.780 | 0.570 | 0.338 | 0.479 | 0.369 | 0.480 | 0.726 | 0.535 | 0.389 | 0.592 | 0.408 | 0.521 | 0.677 | 0.555 | 0.483 | 0.726 | 0.536 |
| Grok-4 | 0.633 | 0.614 | 0.570 | 0.321 | 0.267 | 0.224 | 0.665 | 0.588 | 0.577 | 0.326 | 0.392 | 0.300 | 0.598 | 0.549 | 0.540 | 0.600 | 0.568 | 0.533 |
| Gemini-2.5-Flash | 0.439 | 0.816 | 0.530 | 0.294 | 0.550 | 0.366 | 0.424 | 0.831 | 0.529 | 0.410 | 0.617 | 0.436 | 0.431 | 0.705 | 0.495 | 0.427 | 0.783 | 0.512 |
| Gemini-2.0-Flash-Thinking | 0.646 | 0.495 | 0.509 | 0.342 | 0.260 | 0.267 | 0.667 | 0.437 | 0.466 | 0.448 | 0.338 | 0.332 | 0.656 | 0.440 | 0.468 | 0.625 | 0.450 | 0.468 |
| GPT-4o | 0.616 | 0.346 | 0.386 | 0.308 | 0.095 | 0.117 | 0.619 | 0.318 | 0.374 | 0.429 | 0.311 | 0.323 | 0.592 | 0.323 | 0.370 | 0.587 | 0.323 | 0.367 |
| GPT-4o-mini | 0.550 | 0.312 | 0.346 | 0.266 | 0.063 | 0.078 | 0.489 | 0.222 | 0.258 | 0.302 | 0.212 | 0.211 | 0.572 | 0.309 | 0.351 | 0.501 | 0.268 | 0.299 |
| *Open-source Models* | | | | | | | | | | | | | | | | | | |
| Qwen3-VL-235B-A22B-Thinking | 0.594 | 0.641 | 0.567 | 0.369 | 0.406 | 0.347 | 0.585 | 0.655 | 0.578 | 0.368 | 0.484 | 0.355 | 0.549 | 0.553 | 0.518 | 0.558 | 0.615 | 0.538 |
| Qwen3-VL-32B-Thinking | 0.544 | 0.658 | 0.554 | 0.334 | 0.237 | 0.230 | 0.544 | 0.619 | 0.547 | 0.392 | 0.505 | 0.390 | 0.561 | 0.577 | 0.538 | 0.525 | 0.612 | 0.525 |
| DeepSeek-R1 | 0.517 | 0.694 | 0.551 | 0.197 | 0.152 | 0.136 | 0.449 | 0.607 | 0.480 | 0.335 | 0.377 | 0.329 | 0.483 | 0.476 | 0.427 | 0.466 | 0.600 | 0.484 |
| Intern-S1 | 0.563 | 0.539 | 0.468 | 0.311 | 0.264 | 0.249 | 0.589 | 0.416 | 0.403 | 0.389 | 0.368 | 0.304 | 0.580 | 0.502 | 0.476 | 0.548 | 0.473 | 0.427 |
| Qwen2.5-VL-72B-Instruct | 0.607 | 0.321 | 0.370 | 0.298 | 0.107 | 0.144 | 0.567 | 0.264 | 0.312 | 0.323 | 0.245 | 0.245 | 0.612 | 0.373 | 0.402 | 0.559 | 0.295 | 0.337 |
| DeepSeek-VL2 | 0.373 | 0.129 | 0.158 | 0.310 | 0.024 | 0.036 | 0.350 | 0.080 | 0.101 | 0.216 | 0.124 | 0.138 | 0.388 | 0.102 | 0.122 | 0.350 | 0.108 | 0.132 |
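For reference, F1 is the harmonic mean of precision and recall. Note that the table's F1 values are generally not recoverable by plugging the reported aggregate P and R into this formula (e.g., O3 on FKA: the harmonic mean of 0.608 and 0.775 is about 0.681, not the listed 0.641), which suggests F1 is computed per instance and then averaged. A minimal sketch of both the per-pair formula and such a macro average (the `macro_f1` helper here is illustrative, not the benchmark's published scoring code):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(pr_pairs: list[tuple[float, float]]) -> float:
    """Average of per-instance F1 scores (assumed aggregation, for illustration)."""
    return sum(f1_score(p, r) for p, r in pr_pairs) / len(pr_pairs)

# Example: P = 0.6, R = 0.75 gives F1 = 2*0.6*0.75 / (0.6 + 0.75) ≈ 0.667
print(round(f1_score(0.6, 0.75), 3))
```

Averaging per-instance F1 rewards models that are balanced on each example, so a model can trail on aggregate P and R yet lead on F1, and vice versa.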
[Interactive dashboard: Overall F1 ranking, comparing model performance across the workflow-aligned PolyReal metrics.]
Performance comparison across key polymer science sub-fields, highlighting the differing strengths and weaknesses of closed-source and open-source MLLMs.