Performance comparison of models across the benchmark task categories. P, R, and F1 denote Precision, Recall, and F1-score, respectively. FKA: Foundational Knowledge Application; LSA: Lab Safety Analysis; EMR: Experiment Mechanism Reasoning; RDA: Raw Data Extraction and Analysis; PAE: Performance & Application Exploration. The best results are in bold and the second best are underlined.
| Model | FKA P | FKA R | FKA F1 | LSA P | LSA R | LSA F1 | EMR P | EMR R | EMR F1 | RDA P | RDA R | RDA F1 | PAE P | PAE R | PAE F1 | Overall P | Overall R | Overall F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Closed-source Models* | | | | | | | | | | | | | | | | | | |
| O3 | 0.608 | 0.775 | 0.641 | 0.366 | 0.541 | 0.412 | 0.565 | 0.745 | 0.611 | 0.424 | 0.553 | 0.422 | 0.579 | 0.733 | 0.609 | 0.567 | 0.733 | 0.601 |
| GPT-5 | 0.544 | 0.799 | 0.594 | 0.317 | 0.563 | 0.373 | 0.543 | 0.801 | 0.608 | 0.418 | 0.523 | 0.426 | 0.581 | 0.742 | 0.623 | 0.529 | 0.760 | 0.578 |
| Claude-Sonnet-4.5-Thinking | 0.519 | 0.752 | 0.580 | 0.282 | 0.411 | 0.317 | 0.539 | 0.700 | 0.578 | 0.354 | 0.489 | 0.356 | 0.526 | 0.663 | 0.539 | 0.503 | 0.692 | 0.546 |
| Gemini-2.5-Pro | 0.505 | 0.780 | 0.570 | 0.338 | 0.479 | 0.369 | 0.480 | 0.726 | 0.535 | 0.389 | 0.592 | 0.408 | 0.521 | 0.677 | 0.555 | 0.483 | 0.726 | 0.536 |
| Grok-4 | 0.633 | 0.614 | 0.570 | 0.321 | 0.267 | 0.224 | 0.665 | 0.588 | 0.577 | 0.326 | 0.392 | 0.300 | 0.598 | 0.549 | 0.540 | 0.600 | 0.568 | 0.533 |
| Gemini-2.5-Flash | 0.439 | 0.816 | 0.530 | 0.294 | 0.550 | 0.366 | 0.424 | 0.831 | 0.529 | 0.410 | 0.617 | 0.436 | 0.431 | 0.705 | 0.495 | 0.427 | 0.783 | 0.512 |
| Gemini-2.0-Flash-Thinking | 0.646 | 0.495 | 0.509 | 0.342 | 0.260 | 0.267 | 0.667 | 0.437 | 0.466 | 0.448 | 0.338 | 0.332 | 0.656 | 0.440 | 0.468 | 0.625 | 0.450 | 0.468 |
| GPT-4o | 0.616 | 0.346 | 0.386 | 0.308 | 0.095 | 0.117 | 0.619 | 0.318 | 0.374 | 0.429 | 0.311 | 0.323 | 0.592 | 0.323 | 0.370 | 0.587 | 0.323 | 0.367 |
| GPT-4o-mini | 0.550 | 0.312 | 0.346 | 0.266 | 0.063 | 0.078 | 0.489 | 0.222 | 0.258 | 0.302 | 0.212 | 0.211 | 0.572 | 0.309 | 0.351 | 0.501 | 0.268 | 0.299 |
| *Open-source Models* | | | | | | | | | | | | | | | | | | |
| Qwen3-VL-235B-A22B-Thinking | 0.594 | 0.641 | 0.567 | 0.369 | 0.406 | 0.347 | 0.585 | 0.655 | 0.578 | 0.368 | 0.484 | 0.355 | 0.549 | 0.553 | 0.518 | 0.558 | 0.615 | 0.538 |
| Qwen3-VL-32B-Thinking | 0.544 | 0.658 | 0.554 | 0.334 | 0.237 | 0.230 | 0.544 | 0.619 | 0.547 | 0.392 | 0.505 | 0.390 | 0.561 | 0.577 | 0.538 | 0.525 | 0.612 | 0.525 |
| DeepSeek-R1 | 0.517 | 0.694 | 0.551 | 0.197 | 0.152 | 0.136 | 0.449 | 0.607 | 0.480 | 0.335 | 0.377 | 0.329 | 0.483 | 0.476 | 0.427 | 0.466 | 0.600 | 0.484 |
| Intern-S1 | 0.563 | 0.539 | 0.468 | 0.311 | 0.264 | 0.249 | 0.589 | 0.416 | 0.403 | 0.389 | 0.368 | 0.304 | 0.580 | 0.502 | 0.476 | 0.548 | 0.473 | 0.427 |
| Qwen2.5-VL-72B-Instruct | 0.607 | 0.321 | 0.370 | 0.298 | 0.107 | 0.144 | 0.567 | 0.264 | 0.312 | 0.323 | 0.245 | 0.245 | 0.612 | 0.373 | 0.402 | 0.559 | 0.295 | 0.337 |
| DeepSeek-VL2 | 0.373 | 0.129 | 0.158 | 0.310 | 0.024 | 0.036 | 0.350 | 0.080 | 0.101 | 0.216 | 0.124 | 0.138 | 0.388 | 0.102 | 0.122 | 0.350 | 0.108 | 0.132 |
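For reference, F1 is the harmonic mean of precision and recall. Note that the table's F1 values are generally not recoverable by plugging the reported aggregate P and R into this formula (e.g., O3 on FKA: the harmonic mean of 0.608 and 0.775 is about 0.681, not the listed 0.641), which suggests F1 is computed per instance and then averaged. A minimal sketch of both the per-pair formula and such a macro average (the `macro_f1` helper here is illustrative, not the benchmark's published scoring code):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(pr_pairs: list[tuple[float, float]]) -> float:
    """Average of per-instance F1 scores (assumed aggregation, for illustration)."""
    return sum(f1_score(p, r) for p, r in pr_pairs) / len(pr_pairs)

# Example: P = 0.6, R = 0.75 gives F1 = 2*0.6*0.75 / (0.6 + 0.75) ≈ 0.667
print(round(f1_score(0.6, 0.75), 3))
```

Averaging per-instance F1 rewards models that are balanced on each example, so a model can trail on aggregate P and R yet lead on F1, and vice versa.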
[Interactive dashboard: Overall F1 ranking, comparing model performance across the workflow-aligned PolyReal metrics.]
Performance comparison across key polymer science sub-fields, highlighting the differing strengths and weaknesses of closed-source and open-source MLLMs.