RExBench: A benchmark of machine learning research extensions for evaluating coding agents

RExBench is a benchmark for evaluating the ability of LLM agents (or other AI systems) to extend existing AI research.

The benchmark consists of 12 research experiment implementation tasks. Each task is set up as an extension to an existing research paper and codebase and is accompanied by instructions written by domain experts.

End-to-end workflow of RExBench: (1) An LLM agent receives inputs consisting of the research paper(s), the original codebase, and an extension instruction; (2) the system implements the extension and a patch file is obtained; (3) the patch is applied to the original code and executed via our evaluation infrastructure; and (4) the results are evaluated using specified metrics.
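Step (3) of the workflow, applying the agent's patch to the original code, can be sketched as follows. This is a minimal illustration and not the RExBench evaluation infrastructure: it assumes the original codebase is a local Git checkout and that the patch is a unified diff that `git apply` can read. The helper names `git_apply_command` and `apply_patch` and the example paths are hypothetical.

```python
import subprocess
from pathlib import Path


def git_apply_command(repo_dir, patch_file, check_only=False):
    """Build the `git apply` command line for a patch (hypothetical helper).

    The patch path is made absolute because `git -C <dir>` runs Git as if it
    had been started inside <dir>, which would change how a relative patch
    path is resolved.
    """
    cmd = ["git", "-C", str(repo_dir), "apply"]
    if check_only:
        # --check only verifies that the patch applies; it changes no files.
        cmd.append("--check")
    cmd.append(str(Path(patch_file).resolve()))
    return cmd


def apply_patch(repo_dir, patch_file):
    """Dry-run the patch first, then apply it; raises CalledProcessError on failure."""
    subprocess.run(git_apply_command(repo_dir, patch_file, check_only=True), check=True)
    subprocess.run(git_apply_command(repo_dir, patch_file), check=True)


if __name__ == "__main__":
    # Assumed locations, for illustration only.
    print(" ".join(git_apply_command("original_codebase", "agent.patch", check_only=True)))
```

Running the dry run before the real application means a patch that does not apply cleanly is rejected before it modifies the working tree.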

Dataset

Download from 🤗 Hugging Face

Citation

Please cite the following work if you are using RExBench in your research:

  @inproceedings{rexbench2026,
    title = {RExBench: Can coding agents autonomously implement AI research extensions?},
    author = {Edwards, Nicholas and Lee, Yukyung and Mao, Yujun (Audrey) and Qin, Yulu and Schuster, Sebastian and Kim, Najoung},
    booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics},
    year = {2026},
    publisher = {Association for Computational Linguistics},
    url = {https://arxiv.org/abs/2506.22598}
  }

License

We release our data under a dual license (MIT and Apache 2.0), given the mixed licenses of the repositories included in the full benchmark suite. Please note that this is in contrast to the metadata license shown, because Hugging Face currently supports assigning only one license to a dataset. Please also refer to the license agreements in the individual task repositories.

Leaderboard

Submission
rexbench-openhands-claude_4.5_opus-run2	0.5	0.75	0.7486
rexbench-openhands-claude_4.5_opus-run1	0.4167	0.6667	0.7069
rexbench-openhands-claude_4.5_opus-run3	0.4167	0.75	0.7069
rexbench-openhands-claude_4.5_opus-run4	0.4167	0.6667	0.7069
rexbench-openhands-claude_4_sonnet-run4	0.4167	0.6667	0.8181
rexbench-openhands-claude_3.7_sonnet-run3	0.3333	0.5833	0.7069
rexbench-openhands-claude_3.7_sonnet-run5	0.3333	0.5833	0.7069
rexbench-openhands-claude_4.5_opus-run5	0.3333	0.6667	0.7486
rexbench-openhands-claude_4_sonnet-run1	0.3333	0.6667	0.7764
rexbench-openhands-claude_4_sonnet-run2	0.3333	0.75	0.7347
rexbench-openhands-claude_4_sonnet-run3	0.3333	0.5833	0.7764
rexbench-openhands-claude_4_sonnet-run5	0.3333	0.75	0.6931
rexbench-openhands-gpt5-run1	0.3333	0.75	0.7347
rexbench-aider-claude_3.7_sonnet-run1	0.25	0.4167	0.8069
rexbench-aider-claude_4.5_opus-run1	0.25	0.5833	0.7653
rexbench-aider-claude_4.5_opus-run3	0.25	0.4167	0.8069
rexbench-aider-claude_4.5_opus-run4	0.25	0.5	0.8069
rexbench-aider-claude_4.5_opus-run5	0.25	0.5	0.8069
rexbench-openhands-claude_3.7_sonnet-run2	0.25	0.4167	0.7069
rexbench-openhands-claude_3.7_sonnet-run4	0.25	0.4167	0.6903
rexbench-openhands-gpt5-run2	0.25	0.8333	0.7764
rexbench-openhands-gpt5-run3	0.25	0.5833	0.7764
rexbench-openhands-gpt5-run4	0.25	0.5	0.7347
rexbench-openhands-gpt5-run5	0.25	0.4167	0.7694
rexbench-aider-claude_4.5_opus-run2	0.1667	0.3333	0.7653
rexbench-aider-claude_4_sonnet-run3	0.1667	0.5	0.7903
rexbench-aider-claude_4_sonnet-run4	0.1667	0.5	0.8069
rexbench-aider-claude_4_sonnet-run5	0.1667	0.5833	0.7903
rexbench-openhands-claude_3.7_sonnet-run1	0.1667	0.25	0.7347
rexbench-openhands-o4-mini-run2	0.1667	0.25	0.6931
rexbench-openhands-o4-mini-run4	0.1667	0.25	0.6514
rexbench-aider-claude_3.7_sonnet-run2	0.0833	0.4167	0.7903
rexbench-aider-claude_3.7_sonnet-run3	0.0833	0.3333	0.8069
rexbench-aider-claude_4_sonnet-run1	0.0833	0.5833	0.8069
rexbench-aider-claude_4_sonnet-run2	0.0833	0.4167	0.8069
rexbench-aider-gpt5-run1	0.0833	0.4167	0.6792
rexbench-aider-o4-mini-run2	0.0833	0.25	0.3889
rexbench-aider-o4-mini-run4	0.0833	0.25	0.5347
rexbench-openhands-o4-mini-run1	0.0833	0.5	0.6097
rexbench-aider-claude_3.7_sonnet-run4	0	0	0.8069
rexbench-aider-claude_3.7_sonnet-run5	0	0	0.8486
rexbench-aider-deepseek-r1-run1	0	0	0.1389
rexbench-aider-deepseek-r1-run2	0	0	0.125
rexbench-aider-deepseek-r1-run3	0	0	0.0833
rexbench-aider-deepseek-r1-run4	0	0	0.125
rexbench-aider-deepseek-r1-run5	0	0	0.0833
rexbench-aider-gpt5-run2	0	0.1667	0.7347
rexbench-aider-gpt5-run3	0	0.5	0.7347
rexbench-aider-gpt5-run4	0	0.5833	0.7347
rexbench-aider-gpt5-run5	0	0.4167	0.6722
rexbench-aider-o1-run1	0	0.3333	0.7319
rexbench-aider-o1-run2	0	0.1667	0.7903
rexbench-aider-o1-run3	0	0.1667	0.7903
rexbench-aider-o1-run4	0	0.1667	0.7486
rexbench-aider-o1-run5	0	0.1667	0.7903
rexbench-aider-o4-mini-run1	0	0.1667	0.4514
rexbench-aider-o4-mini-run3	0	0.1667	0.3528
rexbench-aider-o4-mini-run5	0	0.1667	0.5056
rexbench-openhands-deepseek-r1-run1	0	0	0.5306
rexbench-openhands-deepseek-r1-run2	0	0.4167	0.6069
rexbench-openhands-deepseek-r1-run3	0	0	0.5861
rexbench-openhands-deepseek-r1-run4	0	0.4167	0.5611
rexbench-openhands-deepseek-r1-run5	0	0	0.3389
rexbench-openhands-o1-run1	0	0.1667	0.4264
rexbench-openhands-o1-run2	0	0.4167	0.5903
rexbench-openhands-o1-run3	0	0.3333	0.6653
rexbench-openhands-o1-run4	0	0.4167	0.4986
rexbench-openhands-o1-run5	0	0.3333	0.6069
rexbench-openhands-o4-mini-run3	0	0.4167	0.6931
rexbench-openhands-o4-mini-run5	0	0.4167	0.5681
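
Because each agent and model pair appears with several runs, it can help to average over runs when reading the leaderboard. The sketch below parses rows in the tab-separated layout shown above and averages the first numeric column per agent and model. It makes two assumptions: submission names follow the pattern `rexbench-<agent>-<model>-run<N>`, where the model name may itself contain hyphens (for example `o4-mini`), and the three numeric columns are treated as unnamed positions because the table does not label them. The helper names are hypothetical, and the sample rows are copied from the table.

```python
from collections import defaultdict
from statistics import mean

# A few rows copied from the leaderboard (tab-separated).
SAMPLE = "\n".join([
    "rexbench-openhands-claude_4.5_opus-run2\t0.5\t0.75\t0.7486",
    "rexbench-openhands-claude_4.5_opus-run1\t0.4167\t0.6667\t0.7069",
    "rexbench-openhands-claude_4.5_opus-run3\t0.4167\t0.75\t0.7069",
    "rexbench-openhands-claude_4.5_opus-run4\t0.4167\t0.6667\t0.7069",
    "rexbench-openhands-claude_4.5_opus-run5\t0.3333\t0.6667\t0.7486",
    "rexbench-aider-o4-mini-run2\t0.0833\t0.25\t0.3889",
    "rexbench-aider-o4-mini-run4\t0.0833\t0.25\t0.5347",
    "rexbench-aider-o4-mini-run1\t0\t0.1667\t0.4514",
    "rexbench-aider-o4-mini-run3\t0\t0.1667\t0.3528",
    "rexbench-aider-o4-mini-run5\t0\t0.1667\t0.5056",
])


def parse_row(line):
    """Split one leaderboard row into (agent, model, run, scores)."""
    name, *values = line.split("\t")
    parts = name.split("-")
    agent = parts[1]
    # Everything between the agent and the trailing run id is the model name,
    # so hyphenated models such as "o4-mini" or "deepseek-r1" stay intact.
    model = "-".join(parts[2:-1])
    run = parts[-1]
    return agent, model, run, tuple(float(v) for v in values)


def mean_first_column(text):
    """Average the first numeric column over runs for each (agent, model)."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        agent, model, _run, scores = parse_row(line)
        groups[(agent, model)].append(scores[0])
    return {key: mean(vals) for key, vals in groups.items()}


if __name__ == "__main__":
    ranked = sorted(mean_first_column(SAMPLE).items(), key=lambda kv: -kv[1])
    for (agent, model), avg in ranked:
        print(f"{agent}\t{model}\t{avg:.4f}")
```

The same grouping works for the second and third columns by changing the index used on `scores`.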