Nicholas Edwards*, Yukyung Lee*, Audrey Mao, Yulu Qin, Sebastian Schuster† and Najoung Kim†
rexbench꩜googlegroups·com
A benchmark of machine learning research extensions for evaluating coding agents
RExBench is a benchmark aiming to evaluate the ability of LLM agents (or other AI systems) to extend existing AI research.
The benchmark consists of 12 research experiment implementation tasks, where each task is set up as an extension to an existing research paper and codebase, accompanied by domain expert-written instructions.
End-to-end workflow of RExBench: (1) an LLM agent receives inputs consisting of the research paper(s), the original codebase, and an extension instruction; (2) the system implements the extension and a patch file is obtained; (3) the patch is applied to the original code and executed via our evaluation infrastructure; and (4) the results are evaluated using specified metrics.
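The four workflow steps can be pictured as a small pipeline. The following is only a minimal sketch, not the actual RExBench evaluation infrastructure: the names `ExtensionTask`, `TaskResult`, `run_task`, and the three callables passed in are all hypothetical and exist only to illustrate the order of the steps.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ExtensionTask:
    """Inputs handed to the agent in step (1). All field names are hypothetical."""
    paper_paths: List[str]   # the research paper(s)
    codebase_dir: str        # the original codebase
    instruction: str         # the expert-written extension instruction


@dataclass
class TaskResult:
    patch: str                 # patch text obtained in step (2)
    executed: bool             # whether step (3) ran without raising
    metrics: Dict[str, float]  # scores from step (4); empty if execution failed


def run_task(
    task: ExtensionTask,
    agent: Callable[[ExtensionTask], str],
    apply_and_execute: Callable[[str, str], Dict[str, float]],
    evaluate: Callable[[Dict[str, float]], Dict[str, float]],
) -> TaskResult:
    """Run one task through steps (1)-(4) using caller-supplied components."""
    # Steps (1)-(2): the agent reads the inputs and returns a patch as text.
    patch = agent(task)
    # Step (3): apply the patch to the original code and execute it.
    try:
        raw_outputs = apply_and_execute(task.codebase_dir, patch)
    except Exception:
        # A patch that fails to apply or run yields no metrics.
        return TaskResult(patch=patch, executed=False, metrics={})
    # Step (4): score the raw outputs with the task's metrics.
    return TaskResult(patch=patch, executed=True, metrics=evaluate(raw_outputs))
```

Passing the agent, executor, and evaluator in as callables keeps the sketch independent of any particular agent framework or sandbox; a real setup would replace them with its own components.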
@article{edwards2025rex, title={RExBench: Can coding agents autonomously implement AI research extensions?}, author={Edwards, Nicholas and Lee, Yukyung and Mao, Audrey and Qin, Yulu and Schuster, Sebastian and Kim, Najoung}, journal={arXiv preprint}, year={2025}}
We release our data under a dual license (MIT and Apache 2.0), given the mixed licenses of the repositories included in the full benchmark suite. Please note that this is in contrast to the metadata license shown (Hugging Face currently only supports assigning one license to a dataset). Please also refer to the license agreements in the individual task repositories.
Submission | | | |
---|---|---|---|
rexbench-openhands-claude-run3 | 41.67 | 46.15 | 76.25 |
rexbench-aider-claude-run1 | 25.00 | 33.33 | 86.25 |
rexbench-claude-code-run1 | 25.00 | 66.67 | 76.25 |
rexbench-claude-code-run2 | 25.00 | 33.33 | 76.25 |
rexbench-openhands-claude-run2 | 25.00 | 46.15 | 76.25 |
rexbench-claude-code-run3 | 16.67 | 41.67 | 76.25 |
rexbench-openhands-claude-run1 | 16.67 | 30.77 | 76.25 |
rexbench-openhands-o4-mini-run1 | 16.67 | 58.33 | 67.92 |
rexbench-openhands-o4-mini-run2 | 16.67 | 33.33 | 72.08 |
rexbench-aider-claude-run2 | 8.33 | 33.33 | 86.25 |
rexbench-aider-claude-run3 | 8.33 | 33.33 | 86.25 |
rexbench-aider-o4-mini-run2 | 8.33 | 25.00 | 53.33 |
rexbench-openhands-o4-mini-run3 | 8.33 | 58.33 | 59.58 |
rexbench-aider-deepseek-r1-run1 | 0.00 | 0.00 | 20.83 |
rexbench-aider-deepseek-r1-run2 | 0.00 | 8.33 | 14.58 |
rexbench-aider-deepseek-r1-run3 | 0.00 | 8.33 | 31.25 |
rexbench-aider-o1-run1 | 0.00 | 41.67 | 84.58 |
rexbench-aider-o1-run2 | 0.00 | 25.00 | 78.75 |
rexbench-aider-o1-run3 | 0.00 | 8.33 | 84.58 |
rexbench-aider-o4-mini-run1 | 0.00 | 16.67 | 50.00 |
rexbench-aider-o4-mini-run3 | 0.00 | 8.33 | 32.50 |
rexbench-openhands-o1-run1 | 0.00 | 33.33 | 64.58 |
rexbench-openhands-o1-run2 | 0.00 | 41.67 | 53.75 |
rexbench-openhands-o1-run3 | 0.00 | 33.33 | 72.08 |