Update: For the latest score of Gru on SWE-bench_Verified, please refer to this link.
Background
The emergence of LLMs has opened a whole new world for software engineering, especially since the release of GPT-4. GPT-4's level of intelligence makes it possible to build autonomous agents. Since mid-2023, many teams have put enormous effort into building all kinds of agents to solve software engineering problems. However, we have not yet seen a widely adopted agent that can solve general software engineering issues: Issue —> Pull Request (PR).
SWE-bench created a simulated battlefield where agents solve bugs in real-world Python projects. Although the instances in SWE-bench represent only a small fraction of software engineering issue types, it is a great starting point toward building the ultimate PR machine.
Gru for Software Engineering
Autonomous agents are exciting, but building agents that are useful in the real world is really tough. One emerging consensus is that building a general agent that solves every problem is simply impossible. At Gru.ai, we build different Grus to solve different software engineering problems, but all Grus are built on the same principles:
- Clear Problem Domain: It is essential to clearly define what kind of problem a Gru is supposed to solve, such as writing unit tests, debugging issues, or refactoring legacy code.
- Dedicated Tools: We don't believe one size fits all. To solve problems in different domains, Gru needs different tools, just as humans do. In theory, Gru could use shell commands to solve all kinds of problems, but the success rate would be too low to be practically useful.
- Direct Value Delivery: Gru is designed to plan and decompose tasks by itself. Gru can also interact with the environment and ground changes in that environment. The goal of the plan is always to deliver the final result the user desires, such as producing a patch or submitting test code.
SWE-bench
Results
Due to resource limitations and time constraints, we have so far only run experiments on SWE-bench Lite.
| Total Instances | Patch Generated | Unresolved | Partially Resolved | Resolved | Resolved Rate |
|---|---|---|---|---|---|
| 300 | 299 | 184 | 9 | 107 | 35.67% |
Agentic Workflow
The SWE-bench cases are handled by Debug Gru, which is designed to solve issues in a given repo. Debug Gru has the following setup:
Task Init
The init task sets up the Docker environment for Gru. The system sends information including the repo, the commit, and the issue description to Gru. Because this task is procedural, it is not shown in the agent plan.
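The exact init payload is internal to Gru; as a rough illustration, the information handed to Debug Gru could be modeled like this (field names and values are hypothetical):

```python
# Hypothetical sketch of the init payload the system sends to Debug Gru.
# Field names and example values are illustrative, not Gru's actual interface.
from dataclasses import dataclass

@dataclass
class InitTask:
    repo: str          # repository the issue belongs to
    base_commit: str   # commit the issue was reported against
    issue_text: str    # issue description shown to Gru

init = InitTask(
    repo="example-org/example-repo",
    base_commit="<commit-sha>",
    issue_text="<issue description from the SWE-bench instance>",
)
```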
Task 1
Find the files related to the issue. Gru decides by itself which files to read and which directories to list or search. Essentially, Gru explores the codebase, reads whatever it finds interesting, and saves its findings to the file interested-files.yml.
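We have not published the schema of interested-files.yml; a minimal sketch of what such a findings file might contain, written from Python with PyYAML, could look like this (key names and paths are made up):

```python
# Hypothetical sketch of the kind of findings Gru might save to interested-files.yml.
# The key names and file paths are invented for illustration.
import yaml

findings = {
    "interested_files": [
        {"path": "src/module/query.py", "reason": "implements the method named in the issue"},
        {"path": "src/module/utils.py", "reason": "helper used by the failing code path"},
    ],
}

with open("interested-files.yml", "w") as f:
    yaml.safe_dump(findings, f, sort_keys=False)
```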
Task 2
Decide which files to change and how to change them. Gru references the result of task 1, interested-files.yml, reads the relevant code in more detail, decides which files to change and how, and saves the change plan to file-change-plan.yml.
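Continuing the assumed schema from the previous sketch, task 2's output could be produced like this (again, key names are illustrative, not Debug Gru's real format):

```python
# Hypothetical sketch of task 2: read the task 1 findings and write a change plan.
# Both schemas are assumptions for illustration only.
import yaml

with open("interested-files.yml") as f:
    findings = yaml.safe_load(f)

change_plan = {
    "changes": [
        {
            "path": findings["interested_files"][0]["path"],
            "how": "guard against the edge case described in the issue before dereferencing",
        },
    ],
}

with open("file-change-plan.yml", "w") as f:
    yaml.safe_dump(change_plan, f, sort_keys=False)
```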
Task 3
Ground the changes according to the change plan from task 2. We have meticulously crafted a tool, editFile, that allows Gru to make micro changes to a file. Gru can also use bare bash commands to make changes. Since there are different approaches to solving an issue and there may be no single perfect solution, Gru decides by itself whether the modifications are sufficient and then finishes this task.
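editFile itself is proprietary; the sketch below shows one way such a micro-edit tool could work, assuming it replaces an inclusive, 1-based line range in a file (the signature is our invention):

```python
# Minimal sketch of a micro-edit tool in the spirit of editFile.
# This is not Gru's actual implementation; the signature and semantics are assumed.
def edit_file(path: str, start_line: int, end_line: int, new_text: str) -> None:
    """Replace lines start_line..end_line (1-based, inclusive) with new_text."""
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    replacement = new_text.splitlines(keepends=True)
    if replacement and not replacement[-1].endswith("\n"):
        replacement[-1] += "\n"      # keep the file newline-terminated
    lines[start_line - 1:end_line] = replacement
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(lines)
```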
Task 4
Generate the diff patch and review the changes. In very few cases, Gru makes additional edits in this task.
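Producing the patch can be as simple as asking git for the working-tree diff; a sketch, assuming the repo is a plain git checkout and the output file name is an arbitrary choice:

```python
# Sketch: dump the working-tree diff of the repo as the deliverable patch.
# The output file name is an arbitrary choice for this example.
import subprocess

def generate_patch(repo_dir: str, out_path: str = "model.patch") -> str:
    diff = subprocess.run(
        ["git", "-C", repo_dir, "diff"],
        capture_output=True, text=True, check=True,
    ).stdout
    with open(out_path, "w") as f:
        f.write(diff)
    return diff
```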
Task Summary
Gru summarizes what has been done during the job and uploads the deliverables (including the patch file and intermediate files) to S3, then returns a summary with links to the user (which is the evaluation harness in the SWE-bench case). Because this task is procedural, it is not shown in the agent plan.
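The delivery step could be implemented with any object store; here is a rough sketch using boto3, where the bucket name, key layout, and link expiry are all made-up choices:

```python
# Sketch: upload deliverables to S3 and return a summary with presigned links.
# Bucket name, key prefix, and expiry are illustrative, not Gru's real setup.
import boto3

def deliver(files: list[str], instance_id: str, bucket: str = "example-deliverables") -> dict:
    s3 = boto3.client("s3")
    links = {}
    for path in files:
        key = f"swe-bench/{instance_id}/{path.rsplit('/', 1)[-1]}"
        s3.upload_file(path, bucket, key)
        links[path] = s3.generate_presigned_url(
            "get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=3600
        )
    return {"instance_id": instance_id, "deliverables": links}
```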
For each of tasks 1-4, we set a hard limit of 30 steps. If the limit is exceeded, the task is ended by the system and Gru is forced to move on to the next task. Since Gru can do almost anything within a task, it is not rare to see Gru compensating for previous tasks. In very rare cases, Gru fails the whole job, which means it fails to generate a patch.
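The step limit itself is simple to picture; a minimal sketch, where `agent_step` and `task_done` stand in for Gru internals we are not showing:

```python
# Sketch of the per-task hard step limit described above.
# `agent_step` and `task_done` are placeholders for Gru internals.
MAX_STEPS = 30

def run_task(task, agent_step, task_done) -> str:
    for _ in range(MAX_STEPS):
        agent_step(task)              # one reasoning/tool-use step
        if task_done(task):
            return "finished"
    return "forced_to_next_task"      # limit hit: the system ends the task
```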
By analyzing the results of each task, we have some interesting findings.
As you can see, the outcome of the previous tasks is only a reference for Gru; Gru decides what to do in every task to achieve the result it wants. For example, we use Sourcegraph to do an initial keyword search to give Gru a starting point. The search results are not accurate (around a 22% hit rate, and sometimes no results are returned), but they are just a starting point. During Task 1, the exploration stage, Gru locates the correct files with a hit rate of over 36%. When generating the file change plan, Gru is more cautious about which files to change and explores more information; at this stage, the hit rate goes up to 73%. Even after patch generation, during the review stage, Gru still makes changes, which leads to a 74% hit rate. However, making edits to large Python files is really hard for AI, so the final patch success rate drops to around 35%. We give Gru plenty of freedom at every stage, and we believe this helps Gru solve the issue.
Additionally, Debug Gru has the following configurations that are specific to SWE-bench (a hypothetical sketch of these settings follows the list):
- No Test Running Steps: To debug an issue, it is really important to reproduce it, and one way to do that is to run tests. However, SWE-bench does not allow the agent to have prior knowledge of how to run the tests of a given repo. It is challenging for Gru to discover how to run the tests by itself, and most of the repos have pre-existing test failures, which is misleading. We therefore skipped the test-running steps for SWE-bench.
- No Test File Changes: We don't allow Gru to change any file under tests or testing directories. Because SWE-bench applies the golden test patch during evaluation, any changes to test files may lead to evaluation failure.
- No Network/External Knowledge Access: Since the SWE-bench cases are all issues that were solved on GitHub, we disable network browsing and knowledge base access to ensure the solution is actually generated by Gru.
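Put together, these SWE-bench-specific settings amount to a handful of switches. A hypothetical rendering (the key names are invented):

```python
# Hypothetical rendering of the SWE-bench-specific Debug Gru settings.
# Key names are invented for illustration.
SWE_BENCH_CONFIG = {
    "run_tests": False,                          # no test running steps
    "protected_paths": ["tests/", "testing/"],   # no test file changes
    "network_access": False,                     # no browsing
    "knowledge_base_access": False,              # no external knowledge
}
```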
As mentioned above, SWE-bench is a set of similar issue types, and we believe there are still many ways to improve Debug Gru's performance on SWE-bench. For example:
Issue Reproduction Script
Although we are not allowed to give Gru the knowledge needed to run the repo's tests accurately, it is possible to let Gru write a standalone script that reproduces or simulates the problem described in the issue. This script can then be used in later tasks to verify code changes.
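For instance, without knowing the repo's test runner, Gru could still emit a self-contained script that imports the affected code and checks the reported behavior. A toy sketch, where the imported module and the checked symptom are purely made up:

```python
# Toy sketch of a standalone issue-reproduction script Gru could write.
# The imported module and the checked behavior are purely illustrative.
import sys

def reproduce() -> int:
    from mypackage import parse_interval  # hypothetical function named in the issue
    try:
        parse_interval("1-")               # input the issue reports as crashing
    except ValueError:
        print("Issue reproduced: parse_interval('1-') raises ValueError")
        return 0
    print("Issue not reproduced")
    return 1

if __name__ == "__main__":
    sys.exit(reproduce())
```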
Repo Overview Knowledge
In the real world, developers usually have some basic understanding of a project before writing code for it. Currently, Gru has zero preset knowledge of the project (besides the LLM's internal knowledge), so building an initial knowledge base of the current project for Gru should be a great enhancement.
Multi-Modal Input
While Assistant Gru already supports multi-modal input, we haven't added this capability to Debug Gru. Since it is common to paste pictures into issue descriptions, enabling multi-modal input would help Gru understand issues more accurately.
RAG
Counterintuitively, we don't use RAG in our current approach. However, our preliminary experiments show that synthetic RAG search can provide an over 70% hit rate (around 50% higher than keyword search), which could give Gru a much better starting point.
Evaluation Process
At Gru.ai, we have an internal agent evaluation harness that facilitates daily agent development (evaluating tasks such as test writing, algorithm building, SDK usage, etc.). We leverage this existing infrastructure to run SWE-bench; a simplified sketch of the loop follows the list below.
- SWE-bench instances are pre-loaded to the evaluation harness.
- The evaluation harness queues all the instances and sends them to Debug Gru.
- The evaluation harness waits for the deliverables from Debug Gru.
- The evaluation harness invokes SWE-bench to evaluate the patches.
- The evaluation harness collects evaluation results from SWE-bench.
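A simplified sketch of that loop, with `send_to_debug_gru` and `evaluate_with_swe_bench` as placeholders for internal calls we are not showing:

```python
# Simplified sketch of the evaluation loop described above.
# The two callables are placeholders, not real Gru or SWE-bench APIs.
def run_benchmark(instances, send_to_debug_gru, evaluate_with_swe_bench) -> dict:
    results = {}
    for instance in instances:                 # instances are pre-loaded and queued
        patch = send_to_debug_gru(instance)    # wait for Debug Gru's deliverable
        results[instance["instance_id"]] = evaluate_with_swe_bench(instance, patch)
    return results
```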
We enable caching for evaluations, which means that if the case (instance_id) and the outcome (patch) are exactly the same as a previously evaluated pair, the cached evaluation result is returned directly, regardless of whether that result is pass or fail.
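A minimal sketch of such a cache, keyed by the instance_id and a hash of the patch content (the in-memory dict is an assumption; any persistent store would do):

```python
# Sketch of the evaluation cache: identical (instance_id, patch) pairs reuse the
# stored result, whether it was pass or fail. The in-memory dict is illustrative.
import hashlib

_cache: dict[tuple[str, str], dict] = {}

def evaluate_cached(instance_id: str, patch: str, evaluate) -> dict:
    key = (instance_id, hashlib.sha256(patch.encode()).hexdigest())
    if key not in _cache:
        _cache[key] = evaluate(instance_id, patch)  # run the real SWE-bench evaluation
    return _cache[key]
```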
Looking into the Future
We have to admit that software engineering agents are still at a very early stage, as their success rate is far from practically useful. But we also see fast progress in this area, and domain-focused agents are becoming useful. For example, Testing Gru is working well in our clients' projects, although we still have to apply limitations on languages and frameworks.
As agent builders, besides all the techniques mentioned above that can help improve agent performance, we see several potential improvements that may lead to a big leap in this area.
Better Intelligence
It has been over 18 months since the release of GPT-4, and we haven't seen a big breakthrough in the intelligence level of any LLM. We believe general intelligence is essential for all professional agents, and general intelligence comes from the most capable general LLMs. Let's look forward to the next level of intelligence.
Fine-Tuning
Every professional agent has a focused area and a set of tools. Currently, prompt engineering is still the main technique for teaching agents workflows and tool usage. Although general intelligence cannot be fine-tuned, we see opportunities to fine-tune a model to be more "professional" in a specific area, for example, fine-tuning a model to use the editFile tool so that its generations are more accurate in most software engineering scenarios. The main problem is that the most capable LLMs currently either don't allow fine-tuning or make it very expensive.
Trajectory Learning
As humans, we learn from experience. Every job we do produces a trajectory, and when we do similar jobs in the same area, we become more efficient and accurate. We can simulate this process with agents, but so far the results are far from effective and sometimes even negative.
Hiring
Building agents is very challenging but enjoyable. If you are interested in building the most capable Software Engineering Agents, please contact us at connect@gru.ai.