🦖 RExBench

Nicholas Edwards*, Yukyung Lee*, Audrey Mao, Yulu Qin, Sebastian Schuster† and Najoung Kim†
rexbench@googlegroups.com

A benchmark of machine learning research extensions for evaluating coding agents


RExBench is a benchmark that evaluates the ability of LLM agents (and other AI systems) to extend existing AI research.

The benchmark consists of 12 research experiment implementation tasks. Each task is set up as an extension to an existing research paper and codebase, and is accompanied by instructions written by domain experts.

End-to-end workflow of RExBench: (1) An LLM agent receives inputs consisting of the research paper(s), the original codebase, and an extension instruction; (2) the agent implements the extension, producing a patch file; (3) the patch is applied to the original code and executed via our evaluation infrastructure; and (4) the results are evaluated using task-specific metrics.
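The patch-apply-and-execute loop in steps (3) and (4) can be sketched with a toy stand-in task. The file names, repo layout, and diff/patch commands below are illustrative assumptions, not the actual RExBench evaluation harness:

```shell
# Toy sketch of steps (3)-(4); all paths here are hypothetical.
set -e
workdir=$(mktemp -d) && cd "$workdir"

# A stand-in "original codebase": a single experiment script.
mkdir repo
printf 'print("baseline result")\n' > repo/experiment.py

# A stand-in agent patch (the output of step 2), in unified diff format.
cp -r repo repo-extended
printf 'print("baseline result")\nprint("extension result")\n' > repo-extended/experiment.py
diff -u repo/experiment.py repo-extended/experiment.py > agent.patch || true

# Step (3): apply the patch to the original code and run the experiment.
patch repo/experiment.py < agent.patch
python3 repo/experiment.py
# Step (4) would then score the experiment's outputs against expected metrics.
```

In the real benchmark, the agent-produced patch is applied and executed inside the evaluation infrastructure rather than locally as shown here.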

Dataset

Download from 🤗 Hugging Face

Citation

Please cite the following work if you are using RExBench in your research:

    @article{edwards2025rex,
        title={RExBench: Can coding agents autonomously implement AI research extensions?},
        author={Edwards, Nicholas and Lee, Yukyung and Mao, Audrey and Qin, Yulu and Schuster, Sebastian and Kim, Najoung},
        journal={arXiv preprint},
        year={2025}
    }

License

We release our data under a dual license (MIT and Apache 2.0), given the mixed licenses of the repositories included in the full benchmark suite. Please note that this is in contrast to the metadata license shown on Hugging Face, which currently supports assigning only one license to a dataset. Please also refer to the license agreements in the individual task repositories.

Leaderboard

Submission
rexbench-openhands-claude-run3     41.67  46.15  76.25
rexbench-aider-claude-run1         25.00  33.33  86.25
rexbench-claude-code-run1          25.00  66.67  76.25
rexbench-claude-code-run2          25.00  33.33  76.25
rexbench-openhands-claude-run2     25.00  46.15  76.25
rexbench-claude-code-run3          16.67  41.67  76.25
rexbench-openhands-claude-run1     16.67  30.77  76.25
rexbench-openhands-o4-mini-run1    16.67  58.33  67.92
rexbench-openhands-o4-mini-run2    16.67  33.33  72.08
rexbench-aider-claude-run2          8.33  33.33  86.25
rexbench-aider-claude-run3          8.33  33.33  86.25
rexbench-aider-o4-mini-run2         8.33  25.00  53.33
rexbench-openhands-o4-mini-run3     8.33  58.33  59.58
rexbench-aider-deepseek-r1-run1     0.00   0.00  20.83
rexbench-aider-deepseek-r1-run2     0.00   8.33  14.58
rexbench-aider-deepseek-r1-run3     0.00   8.33  31.25
rexbench-aider-o1-run1              0.00  41.67  84.58
rexbench-aider-o1-run2              0.00  25.00  78.75
rexbench-aider-o1-run3              0.00   8.33  84.58
rexbench-aider-o4-mini-run1         0.00  16.67  50.00
rexbench-aider-o4-mini-run3         0.00   8.33  32.50
rexbench-openhands-o1-run1          0.00  33.33  64.58
rexbench-openhands-o1-run2          0.00  41.67  53.75
rexbench-openhands-o1-run3          0.00  33.33  72.08