PolyReal

A Benchmark for Real-World Polymer Science Workflows

Wanhao Liu*,1,2, Weida Wang*,2,3, Jiaqing Xie2,3, Suorong Yang7, Jue Wang1, Benteng Chen2,6,
Guangtao Mei1, Zonglin Yang8, Shufei Zhang2, Yuchun Mo4, Lang Cheng3, Jin Zeng5, Houqiang Li1, Wanli Ouyang2, Yuqiang Li†,2

1University of Science and Technology of China, 2Shanghai Artificial Intelligence Laboratory,
3Fudan University, 4Northwestern Polytechnical University,
5Tongji University, 6The University of Hong Kong,
7National University of Singapore, 8Nanyang Technological University

*Equal contribution
†Corresponding author: Yuqiang Li
liuwanhao@pjlab.org.cn, wangweida@pjlab.org.cn, liyuqiang@pjlab.org.cn

CVPR 2026

PolyReal overview (Figure 1)

Overview of PolyReal: a multimodal benchmark grounded in real-world polymer science workflows.

🔔News

🔥[2026-04-09]: We released our code! Stay tuned!

🔥[2026-04-03]: We released our paper! Stay tuned!

Introduction

We introduce PolyReal, a multimodal benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on real-world polymer science workflows. Unlike prior chemistry and materials benchmarks, which mainly focus on isolated tasks or closed-form questions, PolyReal is grounded in authentic scientific practice and covers the full lifecycle of polymer experimentation: foundational knowledge application, lab safety analysis, experiment mechanism reasoning, raw data extraction and analysis, and performance & application exploration. The benchmark contains 545 high-quality question-answer pairs built from real experimental scenarios, including lab images, spectra, mechanism diagrams, and raw CSV data. Our evaluation of leading MLLMs reveals a clear capability imbalance: current models perform relatively well on knowledge-intensive reasoning but still struggle substantially with practice-oriented, context-dependent scientific tasks. We hope PolyReal can serve as a practical testbed for assessing and advancing AI systems toward real-world scientific workflows.

PolyReal

Overview

PolyReal overview statistics

Overview of the PolyReal benchmark's composition, comprising 545 high-quality question-answer pairs: (a) Distribution of questions across the five core workflow modules covering the full research lifecycle; and (b) Comprehensive topic coverage, detailing the distribution across eight key sub-fields of polymer science.

Comparisons with Existing Benchmarks

Comparison with representative chemistry and materials benchmarks. MCQ: multiple-choice; Num.: numeric; EM: exact match; OQ: open question; T/F: true/false; Rank: ranking.

benchmark comparison

Existing chemistry and materials benchmarks mainly focus on isolated subtasks, smaller-scale evaluation, or closed-form question formats. In contrast, PolyReal emphasizes workflow-oriented evaluation in polymer science, covers the full research lifecycle, and includes open questions as well as ranking tasks, providing a more practice-grounded setting for assessing MLLMs in real scientific workflows.

Experiment Results

Leaderboard (main)

Performance comparison of different models across the five PolyReal workflow modules. P, R, and F1 represent Precision, Recall, and F1-score, respectively. FKA: Foundational Knowledge Application, LSA: Lab Safety Analysis, EMR: Experiment Mechanism Reasoning, RDA: Raw Data Extraction and Analysis, PAE: Performance & Application Exploration. The best results are in bold and the second best are underlined.

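Closed-source Models

| Model | FKA P | FKA R | FKA F1 | LSA P | LSA R | LSA F1 | EMR P | EMR R | EMR F1 | RDA P | RDA R | RDA F1 | PAE P | PAE R | PAE F1 | Overall P | Overall R | Overall F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| O3 | 0.608 | 0.775 | 0.641 | 0.366 | 0.541 | 0.412 | 0.565 | 0.745 | 0.611 | 0.424 | 0.553 | 0.422 | 0.579 | 0.733 | 0.609 | 0.567 | 0.733 | 0.601 |
| GPT-5 | 0.544 | 0.799 | 0.594 | 0.317 | 0.563 | 0.373 | 0.543 | 0.801 | 0.608 | 0.418 | 0.523 | 0.426 | 0.581 | 0.742 | 0.623 | 0.529 | 0.760 | 0.578 |
| Claude-Sonnet-4.5-Thinking | 0.519 | 0.752 | 0.580 | 0.282 | 0.411 | 0.317 | 0.539 | 0.700 | 0.578 | 0.354 | 0.489 | 0.356 | 0.526 | 0.663 | 0.539 | 0.503 | 0.692 | 0.546 |
| Gemini-2.5-Pro | 0.505 | 0.780 | 0.570 | 0.338 | 0.479 | 0.369 | 0.480 | 0.726 | 0.535 | 0.389 | 0.592 | 0.408 | 0.521 | 0.677 | 0.555 | 0.483 | 0.726 | 0.536 |
| Grok-4 | 0.633 | 0.614 | 0.570 | 0.321 | 0.267 | 0.224 | 0.665 | 0.588 | 0.577 | 0.326 | 0.392 | 0.300 | 0.598 | 0.549 | 0.540 | 0.600 | 0.568 | 0.533 |
| Gemini-2.5-Flash | 0.439 | 0.816 | 0.530 | 0.294 | 0.550 | 0.366 | 0.424 | 0.831 | 0.529 | 0.410 | 0.617 | 0.436 | 0.431 | 0.705 | 0.495 | 0.427 | 0.783 | 0.512 |
| Gemini-2.0-Flash-Thinking | 0.646 | 0.495 | 0.509 | 0.342 | 0.260 | 0.267 | 0.667 | 0.437 | 0.466 | 0.448 | 0.338 | 0.332 | 0.656 | 0.440 | 0.468 | 0.625 | 0.450 | 0.468 |
| GPT-4o | 0.616 | 0.346 | 0.386 | 0.308 | 0.095 | 0.117 | 0.619 | 0.318 | 0.374 | 0.429 | 0.311 | 0.323 | 0.592 | 0.323 | 0.370 | 0.587 | 0.323 | 0.367 |
| GPT-4o-mini | 0.550 | 0.312 | 0.346 | 0.266 | 0.063 | 0.078 | 0.489 | 0.222 | 0.258 | 0.302 | 0.212 | 0.211 | 0.572 | 0.309 | 0.351 | 0.501 | 0.268 | 0.299 |

Open-source Models

| Model | FKA P | FKA R | FKA F1 | LSA P | LSA R | LSA F1 | EMR P | EMR R | EMR F1 | RDA P | RDA R | RDA F1 | PAE P | PAE R | PAE F1 | Overall P | Overall R | Overall F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-235B-A22B-Thinking | 0.594 | 0.641 | 0.567 | 0.369 | 0.406 | 0.347 | 0.585 | 0.655 | 0.578 | 0.368 | 0.484 | 0.355 | 0.549 | 0.553 | 0.518 | 0.558 | 0.615 | 0.538 |
| Qwen3-VL-32B-Thinking | 0.544 | 0.658 | 0.554 | 0.334 | 0.237 | 0.230 | 0.544 | 0.619 | 0.547 | 0.392 | 0.505 | 0.390 | 0.561 | 0.577 | 0.538 | 0.525 | 0.612 | 0.525 |
| DeepSeek-R1 | 0.517 | 0.694 | 0.551 | 0.197 | 0.152 | 0.136 | 0.449 | 0.607 | 0.480 | 0.335 | 0.377 | 0.329 | 0.483 | 0.476 | 0.427 | 0.466 | 0.600 | 0.484 |
| Intern-S1 | 0.563 | 0.539 | 0.468 | 0.311 | 0.264 | 0.249 | 0.589 | 0.416 | 0.403 | 0.389 | 0.368 | 0.304 | 0.580 | 0.502 | 0.476 | 0.548 | 0.473 | 0.427 |
| Qwen2.5-VL-72B-Instruct | 0.607 | 0.321 | 0.370 | 0.298 | 0.107 | 0.144 | 0.567 | 0.264 | 0.312 | 0.323 | 0.245 | 0.245 | 0.612 | 0.373 | 0.402 | 0.559 | 0.295 | 0.337 |
| DeepSeek-VL2 | 0.373 | 0.129 | 0.158 | 0.310 | 0.024 | 0.036 | 0.350 | 0.080 | 0.101 | 0.216 | 0.124 | 0.138 | 0.388 | 0.102 | 0.122 | 0.350 | 0.108 | 0.132 |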
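
Note that a reported F1 is generally not the harmonic mean of the P and R printed beside it. For O3 on FKA, the harmonic mean of 0.608 and 0.775 is about 0.681, while the leaderboard reports 0.641. One reading consistent with this is that P, R, and F1 are computed per question and then averaged; this is an assumption, since the aggregation rule is not stated here. The sketch below uses made-up per-question scores (not PolyReal data) to show how such a gap can arise.

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


def mean(xs):
    """Arithmetic mean of any iterable of numbers."""
    xs = list(xs)
    return sum(xs) / len(xs)


# One leaderboard entry: O3 on FKA (values copied from the table).
o3_fka_p, o3_fka_r, o3_fka_f1 = 0.608, 0.775, 0.641
f1_of_means = f1(o3_fka_p, o3_fka_r)  # F1 recomputed from the averaged P and R

# Hypothetical per-question (precision, recall) pairs, NOT taken from PolyReal.
toy_scores = [(0.9, 0.3), (0.3, 0.9), (0.6, 0.6)]
macro_p = mean(p for p, _ in toy_scores)
macro_r = mean(r for _, r in toy_scores)
macro_f1 = mean(f1(p, r) for p, r in toy_scores)  # average of per-question F1

print(f"O3/FKA: F1 from averaged P and R = {f1_of_means:.3f}, reported = {o3_fka_f1:.3f}")
print(f"toy: mean per-question F1 = {macro_f1:.3f}, F1 of mean P and R = {f1(macro_p, macro_r):.3f}")
```

In the toy data, questions with lopsided precision and recall pull the averaged F1 below the F1 of the averaged P and R, which is the same direction as the gap seen in the leaderboard.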

Interactive Overall F1 Ranking

Interactive dashboard for comparing model performance across workflow-aligned PolyReal metrics.
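
The summary statistics such a dashboard surfaces (top model, best open-source model, closed vs. open average) can be recomputed from the Overall F1 column of the leaderboard. The following is a minimal sketch; the variable names are illustrative, the values are copied from the leaderboard, and the unweighted group average is an assumption about how such a figure would be derived.

```python
# Overall F1 per model, copied from the leaderboard; True marks closed-source.
overall_f1 = {
    "O3": (0.601, True),
    "GPT-5": (0.578, True),
    "Claude-Sonnet-4.5-Thinking": (0.546, True),
    "Gemini-2.5-Pro": (0.536, True),
    "Grok-4": (0.533, True),
    "Gemini-2.5-Flash": (0.512, True),
    "Gemini-2.0-Flash-Thinking": (0.468, True),
    "GPT-4o": (0.367, True),
    "GPT-4o-mini": (0.299, True),
    "Qwen3-VL-235B-A22B-Thinking": (0.538, False),
    "Qwen3-VL-32B-Thinking": (0.525, False),
    "DeepSeek-R1": (0.484, False),
    "Intern-S1": (0.427, False),
    "Qwen2.5-VL-72B-Instruct": (0.337, False),
    "DeepSeek-VL2": (0.132, False),
}

# Rank all models by Overall F1, highest first.
ranking = sorted(overall_f1.items(), key=lambda kv: kv[1][0], reverse=True)
top_model = ranking[0][0]
best_open = next(name for name, (_, closed) in ranking if not closed)

# Unweighted average Overall F1 within each group.
closed_scores = [f1 for f1, closed in overall_f1.values() if closed]
open_scores = [f1 for f1, closed in overall_f1.values() if not closed]
closed_avg = sum(closed_scores) / len(closed_scores)
open_avg = sum(open_scores) / len(open_scores)

for rank, (name, (f1, closed)) in enumerate(ranking, start=1):
    group = "closed" if closed else "open"
    print(f"{rank:2d}. {name:<28} {f1:.3f} ({group})")
print(f"Top model: {top_model}; best open-source: {best_open}")
print(f"Closed avg: {closed_avg:.3f}; open avg: {open_avg:.3f}")
```

On these values O3 ranks first overall and Qwen3-VL-235B-A22B-Thinking is the strongest open-source model.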

PolyReal sub-field performance comparison

Performance comparison across key polymer science sub-fields, highlighting different strengths and weaknesses of closed-source and open-source MLLMs.

Error Analysis

Our qualitative analysis reveals three recurring failure modes on PolyReal. First, models often know the relevant scientific principle but fail to apply it correctly in authentic experimental settings, exposing a clear gap between abstract knowledge and practical judgment. Second, when facing raw and unstructured instrument data such as NMR, IR, XRD, TGA, or CSV-based measurements, models are prone to broken reasoning chains and factual hallucinations. Third, many models still lack a genuine understanding of polymer-specific macromolecular properties, and tend to reduce polymer reasoning to small-molecule chemistry. These cases highlight that real-world scientific workflows remain substantially harder than knowledge-heavy reasoning alone.
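
The leaderboard gives a rough quantitative counterpart to these observations. As a sketch, the snippet below averages the F1 column of each module over the 15 leaderboard models and sorts the modules from hardest to easiest. The unweighted mean is chosen here for illustration and is not a statistic reported by the authors.

```python
# F1 per module for the 15 leaderboard models, in leaderboard order
# (O3 first, DeepSeek-VL2 last). Values are copied from the leaderboard.
module_f1 = {
    "FKA": [0.641, 0.594, 0.580, 0.570, 0.570, 0.530, 0.509, 0.386, 0.346,
            0.567, 0.554, 0.551, 0.468, 0.370, 0.158],
    "LSA": [0.412, 0.373, 0.317, 0.369, 0.224, 0.366, 0.267, 0.117, 0.078,
            0.347, 0.230, 0.136, 0.249, 0.144, 0.036],
    "EMR": [0.611, 0.608, 0.578, 0.535, 0.577, 0.529, 0.466, 0.374, 0.258,
            0.578, 0.547, 0.480, 0.403, 0.312, 0.101],
    "RDA": [0.422, 0.426, 0.356, 0.408, 0.300, 0.436, 0.332, 0.323, 0.211,
            0.355, 0.390, 0.329, 0.304, 0.245, 0.138],
    "PAE": [0.609, 0.623, 0.539, 0.555, 0.540, 0.495, 0.468, 0.370, 0.351,
            0.518, 0.538, 0.427, 0.476, 0.402, 0.122],
}

# Unweighted mean F1 per module across all models.
mean_f1 = {module: sum(vals) / len(vals) for module, vals in module_f1.items()}

# Modules sorted from lowest to highest mean F1.
hardest_first = sorted(mean_f1, key=mean_f1.get)

for module in hardest_first:
    print(f"{module}: mean F1 = {mean_f1[module]:.3f}")
```

On these numbers, Lab Safety Analysis and Raw Data Extraction and Analysis have the lowest average F1 and Foundational Knowledge Application the highest, in line with the gap between practice-oriented and knowledge-heavy tasks described above.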

PolyReal error analysis cases

Representative PolyReal failure cases illustrating practical misapplication, instrument-data hallucination, and limited polymer-specific understanding.

Case Study

PolyReal safety case

Question

You are operating on a "polymerization completion processing" experimental bench. This bench contains large quantities of solvents commonly used in polymer synthesis and presents severe safety hazards. Please identify the hazards and analyze their causes.

Ground-Truth Answer

Large-capacity ethyl acetate (highly flammable) and N,N-dimethylformamide (DMF) (flammable and associated with reproductive toxicity/hepatotoxicity) are directly piled on the workbench. The vapors of ethyl acetate are highly susceptible to ignition, while the toxic vapors of DMF may continuously volatilize into the working environment. There is also a high risk of incompatible chemicals being stored together, potentially leading to exothermic reactions, fires, or the release of toxic gases.

Key Scoring Points

  • Accurately identify large-capacity ethyl acetate on the bench.
  • Accurately identify large-capacity N,N-dimethylformamide (DMF).
  • Point out the core mistake: large-capacity solvents should be stored in safety cabinets, not on the bench.
  • Analyze the severe fire and explosion risk of exposed flammable solvent.
  • Analyze the acute or chronic poisoning risk of exposed toxic solvent.
  • Identify the danger of possible incompatible chemical storage.

O3 Response

"Major catastrophic-level safety hazards... Several 5 L jerrycans of ethyl acetate are kept open or loosely capped on the open bench... Ignition sources are present (possible heating mantle, metal tools, static discharge from plastic)... If over-pressurised, mechanically damaged or its relief valve vents, it can violently eject hot monomer/solvent aerosol."

Error Diagnosis

  • Visual distortion: the model identifies ethyl acetate correctly, but exaggerates the quantity and misreads the state of the containers to support a worst-case safety narrative.
  • Associative hallucination: the model invents absent objects such as a heating mantle or relief valve, likely due to probabilistic semantic associations from chemical safety literature.
  • Core issue: the model over-interprets visual cues to produce a plausible but visually unsupported hazard chain, prioritizing cautionary storytelling over faithful visual reporting.
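
The key scoring points and the error diagnosis above suggest a rubric-style evaluation. How PolyReal's judge actually assigns credit is not described in this section, so the following is only a sketch under stated assumptions: recall is taken as the fraction of key scoring points a response covers, precision as the fraction of the response's claims that are supported, and the per-claim judgments are hypothetical labels loosely modeled on the O3 diagnosis, not the benchmark's actual grading.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Short labels for the six key scoring points of the safety case.
KEY_POINTS = [
    "identify large-capacity ethyl acetate",
    "identify large-capacity DMF",
    "solvents belong in safety cabinets, not on the bench",
    "fire and explosion risk of flammable solvent",
    "poisoning risk of toxic solvent",
    "possible incompatible chemical storage",
]


@dataclass
class Judgment:
    claim: str                 # one claim extracted from the model response
    supported: bool            # backed by the image or the ground truth?
    key_point: Optional[int]   # index into KEY_POINTS it covers, if any


def score(judgments: List[Judgment], n_key_points: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one response under the assumed rubric."""
    covered = {j.key_point for j in judgments
               if j.supported and j.key_point is not None}
    recall = len(covered) / n_key_points
    precision = (sum(j.supported for j in judgments) / len(judgments)
                 if judgments else 0.0)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical judgments for illustration only.
judgments = [
    Judgment("Large containers of ethyl acetate sit on the bench", True, 0),
    Judgment("Flammable vapors create a fire risk", True, 3),
    Judgment("A heating mantle acts as an ignition source", False, None),
    Judgment("A relief valve may vent hot aerosol", False, None),
]

precision, recall, f1 = score(judgments, len(KEY_POINTS))
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under this scheme, hallucinated objects lower precision without affecting recall, while missed key points (for example, never mentioning DMF) lower recall. This is one way a verbose, worst-case safety narrative could be penalized on both axes.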

BibTeX


@misc{liu2026polyreal,
  title={PolyReal: A Benchmark for Real-World Polymer Science Workflows},
  author={Wanhao Liu and Weida Wang and Jiaqing Xie and Suorong Yang and Jue Wang and Benteng Chen and Guangtao Mei and Zonglin Yang and Shufei Zhang and Yuchun Mo and Lang Cheng and Jin Zeng and Houqiang Li and Wanli Ouyang and Yuqiang Li},
  year={2026},
  eprint={2604.02934},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}