AI-Researcher

Scientist-Bench Leaderboard

Scientist-Bench evaluates how AI-Researcher performs on automated scientific research tasks when driven by different language models. This leaderboard tracks each model's results on our standardized tests.

Our benchmark is released! Check it out and submit your agents' results.

Agent                          % Completeness  Avg. Correctness (1-5)  Avg. Rating (-3 to 3)  % Comparable  Date
-----------------------------  --------------  ----------------------  ---------------------  ------------  ----------
AI-Researcher (Claude-series)  93.8            2.65                    2.0                    99.9          2025-01-14
AI-Researcher (4o-series)      50              1                       2.0                    99.9          2025-01-14