Nicholas Edwards*, Yukyung Lee*, Audrey Mao, Yulu Qin, Sebastian Schuster†, and Najoung Kim†
Contact: rexbench꩜googlegroups·com
A benchmark of machine learning research extensions for evaluating coding agents
To make a new submission, run your agent on all tasks in the benchmark and generate a patch file for each task that contains all the differences between the modified code and the base code. Place each patch in a file named agent.patch inside a separate directory for that task. In the same directory, include a file named agent.log that records all the steps the agent took, including the corresponding LLM prompts and responses. We will execute the modified code based on your patches within a week and add the results to the leaderboard.
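As an illustration only, the sketch below shows one way to assemble such a submission. It assumes each task was solved in its own git checkout and that the agent's trace was already saved as agent.log there; the task names, paths, and the use of git to produce the patch are placeholders, not requirements stated above.

```python
# Hypothetical helper for assembling a submission directory.
# Assumes each task was solved in a separate git checkout and that the agent's
# step-by-step trace (prompts and responses) was already saved as agent.log.
# Task names and paths below are placeholders, not part of the benchmark spec.

import subprocess
from pathlib import Path

TASK_CHECKOUTS = {
    # task name -> path to the git checkout the agent modified (placeholders)
    "task_1": Path("runs/task_1/repo"),
    "task_2": Path("runs/task_2/repo"),
}

SUBMISSION_DIR = Path("submission")


def build_submission() -> None:
    for task, checkout in TASK_CHECKOUTS.items():
        out_dir = SUBMISSION_DIR / task
        out_dir.mkdir(parents=True, exist_ok=True)

        # Capture the differences between the modified code and the base code.
        diff = subprocess.run(
            ["git", "diff"],
            cwd=checkout,
            capture_output=True,
            text=True,
            check=True,
        ).stdout
        (out_dir / "agent.patch").write_text(diff)

        # Copy the agent's trace alongside the patch.
        log_src = checkout / "agent.log"
        if log_src.exists():
            (out_dir / "agent.log").write_text(log_src.read_text())
        else:
            print(f"warning: no agent.log found for {task}")


if __name__ == "__main__":
    build_submission()
```

Running this would produce one directory per task, e.g. submission/task_1/agent.patch and submission/task_1/agent.log, matching the layout described above.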