The Road to Ultimate Pull Request Machine - Continued

by Hailong Zhang

Aug 26, 2024

Continued Experiments

About two weeks ago, we released our first submission of SWE-bench_Lite results along with the article The Road to Ultimate Pull Request Machine. At almost the same time, OpenAI released a new dataset, SWE-bench_Verified. We thought it was worth running experiments on the new dataset.

Total Instances	Patch Generated	Unresolved	Partially Resolved	Resolved	Resolved Rate
500	499	261	13	226	45.2%

In the meantime, we have made some updates to Gru, which we introduced in our previous article.

RAG

In our previous implementation, we used Sourcegraph keyword search to find related files in the init stage. In this version, we use a very simple RAG solution: chunk the code, embed the chunks, and run a similarity search. The initial hit rate rose from 22.33% to 52.60%, which leads to better performance in later stages.
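The chunk, embed, and similarity-search steps can be sketched roughly as follows. This is a minimal illustration, not Gru's actual implementation: the function names, the chunk size and overlap, and the toy token-count "embedding" (a stand-in for a real embedding model) are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def chunk_lines(path, text, size=20, overlap=5):
    """Split one file into overlapping windows of `size` lines (sizes are assumed)."""
    lines = text.splitlines()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(lines), 1), step):
        body = "\n".join(lines[start:start + size])
        if body.strip():
            chunks.append((path, start + 1, body))
        if start + size >= len(lines):
            break
    return chunks


def embed(text):
    """Toy embedding: token counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_files(issue, repo, top_k=3):
    """Rank files by the best similarity between the issue and any of their chunks."""
    query = embed(issue)
    best = {}
    for path, text in repo.items():
        for _, _, body in chunk_lines(path, text):
            score = cosine(query, embed(body))
            best[path] = max(best.get(path, 0.0), score)
    return sorted(best, key=best.get, reverse=True)[:top_k]


# Hypothetical two-file repository and issue text, for illustration only.
repo = {
    "utils/dates.py": "def parse_date(value):\n    # parse an ISO date string\n    return value.split('-')\n",
    "utils/colors.py": "def to_hex(rgb):\n    # convert an rgb tuple to a hex color\n    return '#%02x%02x%02x' % rgb\n",
}
issue = "parse_date crashes when the date string is empty"
ranked = rank_files(issue, repo)
print(ranked)
```

Scoring each file by its best-matching chunk means a long file is not penalized for containing mostly unrelated code. Swapping `embed` for a real embedding model and `cosine` for a vector index would keep the same overall shape.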

Here is the updated hit rate path in this version.

New Hit Rate Diagram

Multi-Modal

In our previous implementation, Gru read only the text of the issue description, which may lose information if the issue contains pictures. In this version, we tested multi-modal capability so that Gru can actually “see” the pictures. However, our experiments showed no meaningful improvement in the results, so multi-modal remains disabled in this version.

Challenges of Editing Multiple Files

In SWE-bench_Lite, every solution requires changing only one file, but SWE-bench_Verified contains instances that require changes to multiple files.

Files to edit	Total Instances	Unresolved	Partially Resolved	Resolved	Resolved Rate
1	428	204	8	216	50.47%
2	50	38	3	9	18%
3	12	9	2	1	8.33%
4	7	7	0	0	0%
5	1	1	0	0	0%
6	1	1	0	0	0%
21	1	1	0	0	0%

The statistics show that when changes to multiple files are required, the success rate drops dramatically. It is almost impossible for the agent to change 3 or more files to fix an issue: only 1 of the 22 such instances was resolved. That means the current agent implementation is still far from solving real-world issues.
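As a quick check on the table above, the rows can be aggregated into single-file and multi-file groups. The snippet below is only a convenience sketch: the row values are copied from the table, while the grouping and helper name are chosen for this example.

```python
# (files_to_edit, total, unresolved, partially_resolved, resolved), copied from the table
rows = [
    (1, 428, 204, 8, 216),
    (2, 50, 38, 3, 9),
    (3, 12, 9, 2, 1),
    (4, 7, 7, 0, 0),
    (5, 1, 1, 0, 0),
    (6, 1, 1, 0, 0),
    (21, 1, 1, 0, 0),
]


def resolved_rate(selected):
    """Return (resolved, total, rate in percent) for a group of table rows."""
    total = sum(row[1] for row in selected)
    resolved = sum(row[4] for row in selected)
    return resolved, total, 100 * resolved / total


groups = {
    "1 file": resolved_rate([r for r in rows if r[0] == 1]),
    "2+ files": resolved_rate([r for r in rows if r[0] >= 2]),
    "3+ files": resolved_rate([r for r in rows if r[0] >= 3]),
    "all": resolved_rate(rows),
}

for name, (resolved, total, rate) in groups.items():
    print(f"{name}: {resolved}/{total} resolved ({rate:.2f}%)")
# 1 file: 216/428 resolved (50.47%)
# 2+ files: 10/72 resolved (13.89%)
# 3+ files: 1/22 resolved (4.55%)
# all: 226/500 resolved (45.20%)
```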

