Scientist-Bench Leaderboard

Our benchmark evaluates how AI-Researcher performs on automated scientific research tasks when driven by different language models. This leaderboard tracks each model's results on our standardized tests.

Our benchmark is released! Check it out and submit results from your own agents.

Agent	% Completeness	Avg. Correctness (1 to 5)	Avg. Rating (-3 to 3)	% Comparable	Org	Date	Trajs	Site
AI-Researcher (Claude-series)	93.8	2.65	2.0	99.9		2025-01-14	✓
AI-Researcher (4o-series)	50	1	2.0	99.9		2025-01-14	✓