Nicholas Edwards*, Yukyung Lee*, Audrey Mao, Yulu Qin, Sebastian Schuster†, and Najoung Kim†
Contact: rexbench꩜googlegroups·com
A benchmark of machine learning research extensions for evaluating coding agents
To make a new submission, run your agent on all tasks in the benchmark and generate a patch file for each task that contains all the differences between the modified code and the base code. Place each patch in a file named agent.patch inside a separate directory for that task. In the same directory, include a file named agent.log that records all the steps the agent took, including the corresponding LLM prompts and responses. We will execute the modified code based on your patches within a week and add the results to the leaderboard.
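As an illustration only, the sketch below shows one way to assemble such a submission. It assumes each task was solved in its own git checkout and that the agent's trace was already saved as agent.log there; the task names, paths, and the use of git to produce the patch are placeholders, not requirements stated above.

```python
# Hypothetical helper for assembling a submission directory.
# Assumes each task was solved in a separate git checkout and that the agent's
# step-by-step trace (prompts and responses) was already saved as agent.log.
# Task names and paths below are placeholders, not part of the benchmark spec.

import subprocess
from pathlib import Path

TASK_CHECKOUTS = {
    # task name -> path to the git checkout the agent modified (placeholders)
    "task_1": Path("runs/task_1/repo"),
    "task_2": Path("runs/task_2/repo"),
}

SUBMISSION_DIR = Path("submission")


def build_submission() -> None:
    for task, checkout in TASK_CHECKOUTS.items():
        out_dir = SUBMISSION_DIR / task
        out_dir.mkdir(parents=True, exist_ok=True)

        # Capture the differences between the modified code and the base code.
        diff = subprocess.run(
            ["git", "diff"],
            cwd=checkout,
            capture_output=True,
            text=True,
            check=True,
        ).stdout
        (out_dir / "agent.patch").write_text(diff)

        # Copy the agent's trace alongside the patch.
        log_src = checkout / "agent.log"
        if log_src.exists():
            (out_dir / "agent.log").write_text(log_src.read_text())
        else:
            print(f"warning: no agent.log found for {task}")


if __name__ == "__main__":
    build_submission()
```

Running this would produce one directory per task, e.g. submission/task_1/agent.patch and submission/task_1/agent.log, matching the layout described above.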