Continued Experiments
About two weeks ago, we released our first submission of SWE-bench_Lite results along with the article The Road to Ultimate Pull Request Machine. At almost the same time, OpenAI released a new dataset, SWE-bench_Verified. We thought it was worth running experiments on the new dataset.
Total Instances | Patch Generated | Unresolved | Partially Resolved | Resolved | Resolved Rate |
---|---|---|---|---|---|
500 | 499 | 261 | 13 | 226 | 45.2% |
Meanwhile, we have made some updates to Gru, which was described in our previous article.
RAG
In our previous implementation, we used Sourcegraph keyword search to find related files in the init stage. In this version, we use a very simple RAG solution: chunk the code, embed the chunks, and run a similarity search. The initial hit rate rose from 22.33% to 52.60%, which leads to better performance in later stages.
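The chunk, embed, and search steps can be sketched roughly as follows. This is only an illustration, not Gru's actual code: the fixed 20-line chunk size, the bag-of-tokens `embed` function (a stand-in for a real embedding model), and all function names are assumptions made for this example.

```python
import math
import re
from collections import Counter


def chunk_code(path, source, lines_per_chunk=20):
    """Split a file into fixed-size line windows (hypothetical chunking rule)."""
    lines = source.splitlines()
    return [
        (path, start + 1, "\n".join(lines[start:start + lines_per_chunk]))
        for start in range(0, len(lines), lines_per_chunk)
    ]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-tokens count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(issue_text, files, top_k=3):
    """Rank every chunk of every file by similarity to the issue text."""
    query = embed(issue_text)
    scored = []
    for path, source in files.items():
        for chunk_path, first_line, text in chunk_code(path, source):
            scored.append((cosine(query, embed(text)), chunk_path, first_line))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Tiny made-up repository, used only to exercise the sketch.
files = {
    "dates.py": "def parse_date(value):\n    return value.split('-')\n",
    "colors.py": "def blend(red, green, blue):\n    return (red + green + blue) / 3\n",
}
hits = search("parse_date fails on a date value", files, top_k=1)
print(hits[0][1])  # file containing the best-matching chunk
```

In a real setup the `embed` function would call an embedding model and the chunk vectors would be precomputed and stored in an index, but the overall flow stays the same.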
Here is the updated hit rate path in this version.
Multi-Modal
In our previous implementation, Gru only read the text of the issue description, which may lose some information if the issue contains pictures. In this version, we tested multi-modal capability so that Gru can actually “see” the pictures. However, in our experiments we did not find meaningful improvements in the results, so multi-modal remains disabled in this version.
The Challenge of Editing Multiple Files
In SWE-bench_Lite, all solutions require only one file change, but in SWE-bench_Verified, there are instances that require changes to multiple files.
Files to edit | Total Instances | Unresolved | Partially Resolved | Resolved | Resolved Rate |
---|---|---|---|---|---|
1 | 428 | 204 | 8 | 216 | 50.47% |
2 | 50 | 38 | 3 | 9 | 18% |
3 | 12 | 9 | 2 | 1 | 8.33% |
4 | 7 | 7 | 0 | 0 | 0% |
5 | 1 | 1 | 0 | 0 | 0% |
6 | 1 | 1 | 0 | 0 | 0% |
21 | 1 | 1 | 0 | 0 | 0% |
The statistics show that when multiple file changes are required, the success rate drops dramatically. It is almost impossible for the agent to change 3 or more files to fix an issue: only 1 of the 22 such instances was resolved. That means the current Agent implementation is still far from solving real-world issues.
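To make the drop concrete, the per-row numbers in the table above can be aggregated by the minimum number of files to edit. This is a small sketch that only re-adds the published counts; the `resolved_rate` helper and the grouping thresholds are choices made for this example.

```python
# (files to edit, total instances, resolved), copied from the table above
rows = [
    (1, 428, 216),
    (2, 50, 9),
    (3, 12, 1),
    (4, 7, 0),
    (5, 1, 0),
    (6, 1, 0),
    (21, 1, 0),
]


def resolved_rate(rows, min_files=1):
    """Resolved percentage over all instances needing at least `min_files` files."""
    selected = [(total, resolved) for files, total, resolved in rows if files >= min_files]
    total = sum(t for t, _ in selected)
    resolved = sum(r for _, r in selected)
    return round(100 * resolved / total, 2)


overall = resolved_rate(rows)                  # all 500 instances
multi_file = resolved_rate(rows, min_files=2)  # 72 instances needing 2+ files
three_plus = resolved_rate(rows, min_files=3)  # 22 instances needing 3+ files

print(overall, multi_file, three_plus)  # 45.2 13.89 4.55
```

The overall figure matches the 45.2% reported in the first table, while the multi-file subsets fall well below the single-file rate of 50.47%.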