The Road to Ultimate Pull Request Machine - Continued

Continued Experiments

About two weeks ago, we released our first SWE-bench_Lite submission along with the article The Road to Ultimate Pull Request Machine. Around the same time, OpenAI released a new dataset, SWE-bench_Verified. We thought it was worth running our experiments on the new dataset.

| Total Instances | Patch Generated | Unresolved | Partially Resolved | Resolved | Resolved Rate |
|---|---|---|---|---|---|
| 500 | 499 | 261 | 13 | 226 | 45.2% |

In the meantime, we have made some updates to Gru, the agent described in our previous article.

Figure: SWE-bench_Verified results

RAG

In our previous implementation, we used Sourcegraph keyword search to find related files in the init stage. In this version, we switched to a very simple RAG solution: chunk the code, embed the chunks, and run a similarity search. The initial hit rate rises from 22.33% to 52.60%, which leads to better performance in the later stages.
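As a rough illustration, here is a minimal sketch of such a retrieval step in Python, assuming an OpenAI-style embedding endpoint. The chunking strategy, model name, and top-k value are illustrative placeholders, not Gru's actual parameters.

```python
# Minimal sketch of chunk -> embed -> similarity search.
# Assumptions: OpenAI embeddings API, fixed-size line chunks, cosine similarity.
from openai import OpenAI
import numpy as np

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chunk_code(text: str, max_lines: int = 40) -> list[str]:
    """Split a source file into fixed-size line chunks (hypothetical chunking strategy)."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of text chunks into vectors."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_k_chunks(issue: str, chunks: list[str], k: int = 5) -> list[str]:
    """Return the k chunks most similar to the issue description."""
    vectors = embed(chunks)
    query = embed([issue])[0]
    # Cosine similarity between the issue and every chunk.
    scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```

In practice the chunk vectors would be pre-computed once per repository and stored in a vector index, rather than re-embedded for every query as in this sketch.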

Here is the updated hit rate path in this version.

Figure: new hit rate diagram

Multi-Modal

In our previous implementation, Gru only read the text version of the issue description, which may lose information when the issue contains pictures. In this version, we tested multi-modal capability so that Gru can actually “see” the pictures. However, our experiments showed no meaningful improvement in results, so multi-modal remains disabled in this version.
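For reference, a multi-modal request can look like the following sketch, which attaches an issue screenshot alongside the issue text using the OpenAI chat API. The model name and message shape are assumptions for illustration, not Gru's actual implementation.

```python
# Minimal sketch of letting the model "see" an issue's picture.
# Assumption: a vision-capable OpenAI model that accepts image URLs in the prompt.
from openai import OpenAI

client = OpenAI()

def read_issue_with_image(issue_text: str, image_url: str) -> str:
    """Send the issue text and its attached picture to a multi-modal model."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": issue_text},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content
```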

The Challenge of Editing Multiple Files

In SWE-bench_Lite, all solutions require only a single file change, but in SWE-bench_Verified, some instances require changes across multiple files.

| Files to Edit | Total Instances | Unresolved | Partially Resolved | Resolved | Resolved Rate |
|---|---|---|---|---|---|
| 1 | 428 | 204 | 8 | 216 | 50.47% |
| 2 | 50 | 38 | 3 | 9 | 18% |
| 3 | 12 | 9 | 2 | 1 | 8.33% |
| 4 | 7 | 7 | 0 | 0 | 0% |
| 5 | 1 | 1 | 0 | 0 | 0% |
| 6 | 1 | 1 | 0 | 0 | 0% |
| 21 | 1 | 1 | 0 | 0 | 0% |

The statistics show that when multiple file changes are required, the success rate drops dramatically. It is almost impossible for the agent to change three or more files to fix an issue, which means current agent implementations are still far from solving many real-world issues.