
Agentic Code Generation Papers Part 1

90 min read · Sep 14, 2025

We explore cutting-edge research in this field and how AI is evolving from a coding assistant to a collaborative teammate.


The world of AI in software development is moving far beyond simple code completion. We are entering the era of agentic code generation, where autonomous AI systems can understand requirements, write code, use tools, test their own work, and even debug it — much like a human developer. The research landscape is exploding with breakthroughs, making it nearly impossible to track the key innovations that are defining this new frontier.

To cut through the noise, I’ve gathered, curated, and summarised over 40 of the most fundamental papers in this domain from the last few years. This article serves as your one-stop guide to understanding the core concepts, ranging from self-evolving code and multi-agent frameworks, such as MetaGPT, to tool-integrated agents that can tackle real-world coding challenges. Whether you’re a researcher, an engineer, or just curious about the future of software, this curated list will give you a clear map of where we’ve been and where we’re going.

1. Teaching large language models to self-debug

Large Language Models (LLMs) have become remarkably proficient at code generation, but their ability to produce complex, correct programs in a single pass remains a significant challenge. While techniques like sampling multiple candidates and reranking have offered performance gains, they are often sample-inefficient, requiring tens or even hundreds of attempts to find a correct solution. The paper “Teaching Large Language Models to Self-Debug” by Chen et al. from Google DeepMind and UC Berkeley introduces a novel, more intuitive approach: teaching an LLM to debug its own code iteratively, without requiring extra training or direct human feedback.

The core idea is both elegant and powerful, drawing an analogy from a classic software engineering practice: rubber duck debugging. The authors demonstrate that by prompting an LLM to explain its generated code and analyse execution results, it can identify its own logical flaws and correct them, much like a human programmer who finds a bug by simply explaining their code line by line to an inanimate object. This summary will deconstruct the SELF-DEBUGGING framework, its empirical validation, and its implications for the future of code generation.

The SELF-DEBUGGING Framework: An Iterative Loop of Generation, Explanation, and Feedback

SELF-DEBUGGING operates as a closed-loop, few-shot prompting strategy that requires no model fine-tuning. A single turn of debugging consists of three conceptual steps:

  1. Generation: The LLM generates an initial candidate program based on the problem description.
  2. Explanation: The model is prompted to process its own prediction in a semantically meaningful way. This is the “rubber duck” step.
  3. Feedback: Based on the explanation and execution results, a feedback message is constructed and appended to the prompt history.

This loop repeats until the program is deemed correct or a maximum number of attempts is reached. The true innovation lies in the feedback provided to the model. The authors explore four distinct types of automated feedback (a minimal code sketch of the overall loop follows the list below):

  • Simple Feedback: A minimalistic message like “The prediction is wrong. Please fix it.” This is largely ineffective on its own, as it provides no guidance on why the code is wrong. However, when paired with code execution, its efficacy increases as the model can implicitly ground the “wrong” status in a failed test.
  • Unit Test (UT) Feedback: When unit tests are available, this is the most direct form of feedback. The prompt includes the specific failed test case, the expected output, and the actual (incorrect) output or runtime error. This gives the model concrete evidence to reason about, akin to a developer seeing a failing CI/CD pipeline report.
  • Code Explanation (Expl) Feedback: This is the cornerstone of the “rubber duck” method, especially crucial when no unit tests exist (e.g., in text-to-SQL). The LLM is prompted to provide a natural language, line-by-line explanation of what its generated code does. Imagine the model is forced to be its own code reviewer. It generates a SQL query and then writes a summary: “This query joins the customers and orders tables, then filters for orders with a status of ‘On Road’ OR ‘Shipped’.” The framework then prompts the model to compare this summary to the original request: “Which customers have both ‘On Road’ AND ‘Shipped’ status?” By contrasting its own explanation with the goal, the model can identify the logical error (using OR instead of INTERSECT or a more complex subquery) and self-correct.
  • Execution Trace (Trace) Feedback: A more granular form of explanation where the model simulates the code’s execution step-by-step for a failing input. This is the LLM equivalent of using a step-through debugger. Instead of just seeing the final wrong output, it’s prompted to trace variable states through loops and conditional branches (e.g., “1. cnt is initialized to 0. 2. In the loop, cnt becomes 0 + (9 / 5) which is 1.8 in Python. 3. The C++ code would have resulted in 1 due to integer division.”). This makes subtle, language-specific implementation errors glaringly obvious.
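
To make this concrete, here is a minimal sketch of the debugging loop with unit-test feedback, as referenced above. It assumes a generic llm(prompt) completion function, a task that asks for a Python function named solution, and tests given as (arguments, expected output) pairs; the prompts are paraphrased for illustration and are not the paper’s actual few-shot prompts.

```python
# Minimal sketch of SELF-DEBUGGING with unit-test (UT) feedback.
# `llm` is a hypothetical text-completion function (prompt in, code out).

def run_tests(code: str, tests):
    """Execute candidate code and return the first failure message, if any."""
    scope = {}
    try:
        exec(code, scope)                      # define the candidate function
        for args, expected in tests:
            actual = scope["solution"](*args)  # assumes the task asks for `solution`
            if actual != expected:
                return f"solution{args} returned {actual!r}, expected {expected!r}"
    except Exception as exc:                   # syntax or runtime error
        return f"{type(exc).__name__}: {exc}"
    return None                                # all tests passed

def self_debug(problem: str, tests, llm, max_turns: int = 5) -> str:
    history = f"Problem:\n{problem}\n\nWrite a Python function `solution`."
    code = llm(history)                        # Step 1: initial generation
    for _ in range(max_turns):
        failure = run_tests(code, tests)
        if failure is None:
            return code                        # deemed correct, stop debugging
        # Step 2: "rubber duck" explanation of the model's own code
        explanation = llm(f"{history}\n\nCode:\n{code}\n\nExplain this code line by line.")
        # Step 3: feedback message = explanation + execution result
        history += (f"\n\nCode:\n{code}\nExplanation:\n{explanation}"
                    f"\nFailed test: {failure}\nPlease fix the code.")
        code = llm(history)                    # next debugging turn
    return code                                # give up after the retry budget
```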

Empirical Validation: State-of-the-Art Performance Across Diverse Tasks

The authors evaluated SELF-DEBUGGING across three challenging benchmarks, demonstrating its robustness in varied scenarios:

1. Spider (Text-to-SQL): The No-Unit-Test Scenario

  • On Spider, where no ground-truth execution tests are provided, the +Expl feedback mechanism was essential. Simple feedback provided negligible improvement.
  • SELF-DEBUGGING with code explanation improved the baseline accuracy of code-davinci-002 (Codex) from 81.3% to 84.1%.
  • Crucially, the performance gains were most pronounced on the hardest problems, with a 9% absolute improvement on “extra hard” queries. This suggests the model’s self-reflection is most valuable when the logical complexity is high.

2. TransCoder (C++ to Python Translation): The Full-Unit-Test Scenario

  • With access to a full suite of unit tests, all feedback types provided significant gains.
  • The combination of UT + Expl feedback was most effective, boosting Codex’s accuracy from 80.4% to 92.5% and GPT-4’s from 77.3% to 90.4%.
  • A significant portion of fixed errors (over 30%) were related to subtle implementation differences between C++ and Python (e.g., integer vs. float division), which became evident through UT and Trace feedback.

3. MBPP (Text-to-Python): The Partial-Unit-Test Scenario

  • Here, only a subset of unit tests is available, so the model must infer overall correctness even when the provided tests pass.
  • SELF-DEBUGGING provided substantial improvements across all models, with GPT-4’s accuracy rising from 72.8% to 80.6%.
  • The UT + Trace feedback proved particularly effective, demonstrating the value of detailed, step-by-step reasoning.

Key Insights and Ablations for the Expert

Beyond the headline numbers, the paper’s ablation studies reveal critical insights:

  • Massive Sample Efficiency Gains: This is perhaps the most compelling result. On Spider, SELF-DEBUGGING applied to a single greedy-decoded program matched the performance of a baseline that sampled 16 candidate programs. This has profound implications for reducing inference costs and latency in production systems.
  • Execution is the Anchor for Self-Correction: An ablation where unit tests were available but not used for feedback (relying only on self-explanation) showed dramatically smaller gains. This confirms that while self-explanation is a powerful reasoning mechanism, grounding that reasoning in the concrete evidence of code execution is what unlocks the largest performance improvements. The model isn’t just “thinking harder”; it’s reasoning about empirical results.
  • Model-Specific Capabilities: The study highlights interesting model differences. code-davinci-002, an older model, surprisingly outperformed GPT-4 on the few-shot text-to-SQL task, suggesting its training data or architecture may be better suited for this domain or that GPT-4 is less responsive to complex few-shot examples. Conversely, GPT-4 excelled at Python generation on MBPP.

Conclusion and Future Directions

“Teaching Large Language Models to Self-Debug” marks a significant conceptual shift — moving from treating LLMs as one-shot generators to empowering them as iterative problem-solvers. By formalising the “rubber duck debugging” process through structured prompting, the authors have created a highly effective, sample-efficient, and training-free method for improving code generation.

The work opens up promising avenues for future research. The quality of the self-correction is intrinsically linked to the quality of the self-explanation. Future work could focus on prompting models to produce more semantically rich explanations or even to hypothesise potential bug classes, moving from simply identifying a discrepancy to forming a more targeted repair strategy. Ultimately, SELF-DEBUGGING provides a powerful blueprint for building more reliable and efficient AI-powered programming assistants.

Figure 1: SELF-DEBUGGING for iterative debugging using a large language model. At each debugging step, the model first generates new code, then the code is executed and the model explains the code. The code explanation along with the execution results constitute the feedback message, based on which the model infers the code’s correctness. The feedback message is then sent back to the model to perform more debugging steps. When unit tests are not available, the feedback can be purely based on code explanation.
Figure 2: An exemplar for text-to-SQL generation. The problem is taken from the Spider dataset (Yu et al., 2018). The problem description contains the database schema, and the model is required to predict the SQL query. The prompt includes the contents of one row from each table.
Figure 3: An example of SELF-DEBUGGING prompting for text-to-SQL generation. The SQL query, explanation and feedback are all predicted by the model. When the returned table has more than 2 rows, only the first 2 rows are included in the prompt. Database information is omitted in the figure for clarity, and we present the full prompts in Appendix E.
Figure 4: An example from the TransCoder dataset. The problem description contains the C++ program and unit tests, and the model is required to predict the Python program.
Figure 5: Examples of SELF-DEBUGGING prompts for code translation. Left-aligned blocks are model predictions, and right-aligned blocks contain the input C++ code and feedback messages based on code execution. The full prompts are in Appendix F.
Table 1: Comparing SELF-DEBUGGING to prior ranking techniques.

2. MetaGPT: Meta programming for a multi-agent collaborative framework

The promise of multi-agent LLM systems, virtual teams that collaborate to solve complex problems, has captured the imagination of the AI community. However, early attempts have often resembled chaotic brainstorming sessions rather than productive teams. When agents communicate via unstructured chat, they are prone to “cascading hallucinations” and logical drift, where initial errors are compounded through conversation, a phenomenon akin to the “Chinese whispers” or telephone game. This paper presents a powerful solution: replace the unstructured chaos with the disciplined, formalised structure of a real-world software company.

Instead of just having agents role-play, MetaGPT operationalises a full-fledged software development workflow by encoding Standardised Operating Procedures (SOPs) directly into the multi-agent framework. This summary will break down MetaGPT’s “assembly line” paradigm, its structured communication protocol, and the empirical results that show this approach is not just a novelty but a new state-of-the-art in automated programming.

The MetaGPT Framework: From Idle Chatter to an Assembly Line

MetaGPT’s core thesis is that the key to effective multi-agent collaboration isn’t just better LLMs, but better interaction protocols. The framework is built on three foundational pillars that mirror a highly efficient human organisation.

1. Standardized Operating Procedures (SOPs) and Role Specialization

MetaGPT models a virtual software company with specialised agents assigned distinct roles: Product Manager, Architect, Project Manager, Engineer, and QA Engineer. This is more than just a naming convention; it defines a strict, sequential workflow:

  • A user provides a one-line requirement (e.g., “write a snake game”).
  • The Product Manager agent takes this prompt and generates a comprehensive Product Requirements Document (PRD), complete with user stories, competitive analysis, and functional requirements.
  • The Architect agent receives the PRD and produces formal System Design documents, including API definitions, data structures, and sequence diagrams (UML-like flowcharts).
  • The Project Manager breaks the design into a task list, assigning specific classes and functions to be implemented.
  • Engineer agents receive these granular tasks and write the code, focusing solely on implementation.
  • The QA Engineer generates and runs unit tests to verify the code’s correctness.

Analogy: This is the shift from a “writers’ room” model, where everyone throws ideas around, to a manufacturing assembly line. Each agent has a specific task and produces a standardised, verifiable output that serves as the input for the next agent in the chain. This structure inherently constrains the problem space at each step, drastically reducing the potential for creative but unhelpful deviation.
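
As a rough illustration of the assembly line, here is a simplified Python sketch. It assumes a generic llm(prompt) completion function and heavily abbreviated document schemas; it mirrors the role hand-offs described above rather than MetaGPT’s actual implementation.

```python
# Sketch of MetaGPT's SOP pipeline: each role consumes the previous role's
# structured document and produces the next one. `llm` is a hypothetical
# completion function; the schemas are simplified stand-ins for the real ones.
from dataclasses import dataclass

@dataclass
class Document:
    role: str      # who produced it (e.g. "Architect")
    kind: str      # e.g. "PRD", "SystemDesign", "TaskList", "Code"
    content: str

def product_manager(requirement: str, llm) -> Document:
    prd = llm(f"Write a PRD (user stories, requirement pool) for: {requirement}")
    return Document("ProductManager", "PRD", prd)

def architect(prd: Document, llm) -> Document:
    design = llm(f"From this PRD, produce APIs, data structures, diagrams:\n{prd.content}")
    return Document("Architect", "SystemDesign", design)

def project_manager(design: Document, llm) -> Document:
    tasks = llm(f"Break this design into class/function-level tasks:\n{design.content}")
    return Document("ProjectManager", "TaskList", tasks)

def engineer(tasks: Document, design: Document, llm) -> Document:
    code = llm(f"Implement these tasks following the design:\n{tasks.content}\n{design.content}")
    return Document("Engineer", "Code", code)

def run_pipeline(requirement: str, llm) -> Document:
    prd = product_manager(requirement, llm)
    design = architect(prd, llm)
    tasks = project_manager(design, llm)
    return engineer(tasks, design, llm)   # QA/testing would follow the same pattern
```

Each hand-off is a typed document rather than free-form chat, which is exactly what the next pillar formalises.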

2. Structured Communication: Blueprints, Not a Chatroom

This is arguably MetaGPT’s most critical technical innovation. The authors identify that unconstrained natural language is a primary source of error propagation. To solve this, MetaGPT enforces structured communication. Agents do not “chat” with each other. Instead, they produce and consume formal documents and diagrams.

  • Structured Interfaces: Each agent’s output is not a free-form text but a structured document with a predefined schema (e.g., the PRD has specific sections for “User Stories,” “Requirement Pool,” etc.).
  • Publish-Subscribe Mechanism: To avoid inefficient one-to-one messaging, agents publish their outputs to a shared message pool. Other agents then subscribe to the specific documents relevant to their role. The Engineer, for example, subscribes to the Architect’s design documents, not the Product Manager’s initial competitive analysis.

Analogy: This transforms the communication model from a series of phone calls or Slack messages into a centralised project management system like Jira or a shared Confluence space. There is a single source of truth for each development phase, eliminating ambiguity and ensuring every agent works from the same formal blueprint.
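
The publish-subscribe mechanism can be sketched in a few lines; this is an illustrative data structure, not MetaGPT’s actual message pool.

```python
# Sketch of a shared message pool with role-based subscriptions.
from collections import defaultdict

class MessagePool:
    def __init__(self):
        self.messages = []                      # single source of truth
        self.subscriptions = defaultdict(set)   # role -> set of document kinds

    def subscribe(self, role: str, kinds: set):
        self.subscriptions[role] |= kinds

    def publish(self, sender: str, kind: str, content: str):
        self.messages.append({"sender": sender, "kind": kind, "content": content})

    def inbox(self, role: str):
        """Return only the documents this role has subscribed to."""
        wanted = self.subscriptions[role]
        return [m for m in self.messages if m["kind"] in wanted]

pool = MessagePool()
pool.subscribe("Engineer", {"SystemDesign", "TaskList"})   # not the PRD's market analysis
pool.publish("Architect", "SystemDesign", "class Game: ...")
print(pool.inbox("Engineer"))
```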

3. Executable Feedback: A Self-Correcting Engineer

To ground the entire process in reality, MetaGPT introduces a runtime self-correction loop. The Engineer agent doesn’t just write code and hand it off. It:

  1. Writes the implementation for a given function or class.
  2. Writes corresponding unit tests based on the design specification.
  3. Executes the tests.
  4. If the tests fail, it receives the error trace as feedback, consults its instructions (the PRD and design docs), and attempts to debug the code.
  5. This iterative process continues until the tests pass or a retry limit is reached.

This mechanism ensures that the final output is not just syntactically correct but functionally sound, catching hallucinations and logic errors before they are integrated into the final software product.

Empirical Validation: State-of-the-Art on All Fronts

MetaGPT’s performance was evaluated on standard code generation benchmarks as well as a more complex, project-level benchmark.

  • Function-Level Benchmarks (HumanEval & MBPP): MetaGPT achieves a new state-of-the-art, scoring 85.9% and 87.7% Pass@1, respectively. This demonstrates that its structured approach excels even at generating individual, correct functions. The executable feedback mechanism alone contributed a significant 5.4% absolute improvement on MBPP.
  • Project-Level Benchmark (SoftwareDev): On a custom dataset of 70 full software development tasks (e.g., “create a Flappy Bird game”), MetaGPT was compared against other agent frameworks like AutoGPT and ChatDev. The results are stark:
  • MetaGPT achieved a 100% task completion rate and an executability score of 3.75/4 (nearly flawless), whereas ChatDev scored 2.25 and others failed.
  • It required only 0.83 human revisions on average per project, compared to 2.5 for ChatDev.
  • While using more tokens overall, MetaGPT was more productive, using ~125 tokens per line of code versus ChatDev’s ~249, indicating it generates more comprehensive and less “scaffolded” code.

Key Insights and Ablations for the Expert

  • The Power of Roles: An ablation study showed that a system with just an Engineer agent produced barely functional code (1.0/4 executability). Adding a Product Manager and Architect progressively improved the code quality and dramatically reduced the need for human fixes, proving that the upfront analysis and design phases are critical for success. The structure is not overhead; it is a catalyst for quality.
  • Meta-Programming as Process Programming: The paper frames MetaGPT as a form of meta-programming, but at a higher level of abstraction. Instead of just “programming to program” (i.e., writing code that writes code), MetaGPT is programming the process of programming. It formalises the collaborative workflow itself, which in turn guides the LLMs to produce better code.
  • The Future is Structured: MetaGPT strongly suggests that the future of complex, reliable multi-agent systems lies in moving away from anthropomorphic, chat-based paradigms and towards more formalised, machine-friendly interaction protocols.

Conclusion and Future Directions

MetaGPT provides a compelling blueprint for building robust, autonomous AI systems. By embedding the time-tested wisdom of human software engineering SOPs into a multi-agent framework, it demonstrates a clear path to overcoming the coherence and reliability issues that have plagued earlier models. The work is a powerful reminder that for AI to solve complex, multi-step problems, we need to design not only the agents themselves but also the structured environments and protocols in which they collaborate. The authors’ outlook on self-improving mechanisms and multi-agent economies suggests a future where these virtual companies can learn from past projects and dynamically organise themselves, pushing the boundaries of autonomous problem-solving even further.

Figure 1: The software development SOPs between MetaGPT and real-world human teams. In software engineering, SOPs promote collaboration among various roles. MetaGPT showcases its ability to decompose complex tasks into specific actionable procedures assigned to various roles (e.g., Product Manager, Architect, Engineer, etc.).
Figure 2: An example of the communication protocol (left) and iterative programming with executable feedback (right). Left: Agents use a shared message pool to publish structured messages. They can also subscribe to relevant messages based on their profiles. Right: After generating the initial code, the Engineer agent runs and checks for errors. If errors occur, the agent checks past messages stored in memory and compares them with the PRD, system design, and code files.
Figure 3: A diagram showing the software development process in MetaGPT, emphasizing its significant dependence on SOPs. The more detailed demonstration can be found in Appendix B.
Figure 4: Pass rates on the MBPP and HumanEval with a single attempt.
Figure 5: Demo softwares developed by MetaGPT.
Table 1: The statistical analysis on SoftwareDev.
Table 2: Comparison of capabilities for MetaGPT and other approaches. ‘✓’ indicates the presence of a specific feature in the corresponding framework, ‘✕’ its absence.
Table 3: Ablation study on roles. ‘#’ denotes ‘The number of’, ‘Product’ denotes ‘Product manager’, and ‘Project’ denotes ‘Project manager’. ‘✓’ indicates the addition of a specific role. ‘Revisions’ refers to ‘Human Revision Cost’.

3. AgentCoder: Multi-agent-based code generation with iterative testing and optimisation

The Gist

AgentCoder introduces a streamlined, three-agent framework for code generation that significantly improves accuracy and reduces token costs compared to existing methods. Its key innovation is a dedicated Test Designer Agent that generates high-quality, independent test cases, enabling a more effective self-refinement loop.

The Problem
While LLMs excel at generating code, they struggle with robust self-correction. Existing approaches have two main weaknesses:

  1. Single-Agent Self-Refinement (e.g., CodeCoT): When a single agent generates both the code and the tests in the same context, the tests are often biased by the generated code. They might share the same logical flaws, leading to false positives and missing bugs (especially edge cases).
  2. Complex Multi-Agent Frameworks (e.g., MetaGPT, ChatDev): These frameworks simulate entire software development teams with many agents (5–7+). This leads to excessive communication overhead (high token cost), and their feedback mechanisms, particularly test generation, are often not effective enough to justify the complexity. For example, MetaGPT’s test accuracy is only around 80%.

The Solution: AgentCoder
AgentCoder proposes a lean and effective multi-agent system with three distinct roles:

  1. Programmer Agent: An LLM that receives a coding requirement and generates the initial code using a Chain-of-Thought process. It is also responsible for refining the code based on feedback.
  2. Test Designer Agent: The core of the framework. This LLM agent independently receives the same coding requirement as the programmer. Crucially, it does not see the generated code. It is specifically prompted to create a comprehensive suite of tests covering:
  • Basic cases: To check fundamental functionality.
  • Edge cases: To probe boundary conditions (e.g., empty lists, zero values).
  • Large-scale cases: To test for scalability and performance.
  3. Test Executor Agent: A simple Python script that acts as the coordinator. It takes the code from the Programmer and the tests from the Test Designer, executes them in a local environment, and captures the output.
  • If all tests pass, the process terminates, and the code is returned.
  • If any test fails, the executor forwards the error message (e.g., AssertionError, SyntaxError) back to the Programmer Agent for the next refinement iteration.

This creates a tight feedback loop where the code is iteratively improved against an objective and rigorous set of test cases.
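
A compact sketch of that loop, assuming a generic llm(prompt) function and assert-based tests; the prompts are paraphrases of the roles described above and the retry budget is arbitrary.

```python
# Sketch of AgentCoder's three roles. The Test Designer never sees the code,
# which is the point: its tests cannot inherit the Programmer's mistakes.

def programmer(requirement: str, feedback: str, llm) -> str:
    prompt = f"Requirement:\n{requirement}\nThink step by step, then write the code."
    if feedback:
        prompt += f"\nYour previous attempt failed with:\n{feedback}\nPlease fix it."
    return llm(prompt)

def test_designer(requirement: str, llm) -> str:
    # Independently prompted for basic, edge, and large-scale cases.
    return llm(f"Requirement:\n{requirement}\nWrite assert-based tests covering "
               "basic cases, edge cases, and large-scale inputs.")

def test_executor(code: str, tests: str):
    """Coordinator: run code + tests locally, return the error message or None."""
    try:
        exec(code + "\n" + tests, {})
        return None
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"

def agentcoder(requirement: str, llm, max_iters: int = 5) -> str:
    tests = test_designer(requirement, llm)     # generated once, independently of the code
    code, feedback = "", ""
    for _ in range(max_iters):
        code = programmer(requirement, feedback, llm)
        feedback = test_executor(code, tests)
        if feedback is None:                    # all tests pass -> done
            return code
    return code
```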

Why It’s Better (Key Innovations)

  • Separation of Concerns: The core insight is that generating tests independently of the code breaks the cycle of bias. The Test Designer is not influenced by the programmer’s mistakes and can create more objective and challenging tests.
  • High-Quality Test Generation: Deliberately prompting for basic, edge, and large-scale cases yields higher test accuracy and better code coverage than other methods.
  • Efficiency: With only three agents, the communication overhead is drastically lower than in frameworks like MetaGPT or ChatDev, making it more practical and cost-effective.

Key Results & Evidence

AgentCoder significantly outperforms all 16 baselines, including single-agent and other multi-agent approaches.

  • State-of-the-Art Accuracy: With GPT-4, AgentCoder achieves 96.3% pass@1 on HumanEval and 91.8% on MBPP. This is a substantial improvement over the previous SOTA (90.2% and 78.9%, respectively).
  • Superior Efficiency: It achieves these results with a fraction of the token overhead. For HumanEval/MBPP:
  • AgentCoder: 56.9K / 66.3K tokens
  • MetaGPT: 138.2K / 206.5K tokens
  • ChatDev: 183.7K / 259.3K tokens
  • Higher Quality Tests: The paper validates its core claim with direct metrics on test quality (using GPT-4):
  • Test Accuracy: AgentCoder’s tests are 89.6% accurate on HumanEval, compared to MetaGPT’s 79.3%.
  • Code Coverage: AgentCoder achieves 91.7% line coverage, while MetaGPT achieves 81.7%.
  • Ablation studies confirm that the multi-agent setup with separate programmer and tester roles is superior to a single agent performing both tasks, validating their central design choice.

The Bottom Line
AgentCoder demonstrates that for code generation, a “smarter, not bigger” multi-agent architecture is key. By focusing on a lean design with a highly effective, independent testing agent, it sets a new state-of-the-art in both code generation accuracy and operational efficiency. It provides a strong argument that the quality of the feedback loop is more important than the number of agents involved.

Figure 1: Pipeline of AgentCoder with a code generation example from HumanEval
Table 1: End-to-end results of AgentCoder and baseline approaches for HumanEval, HumanEval-ET, MBPP, and MBPP-ET datasets. The best approach is highlighted in bold. The baseline results are obtained from its paper report. We use “-” to indicate the cases where the results are absent. The percentages in brackets are the improvement rate over the base LLMs (zero-shot prompting). For the last three rows, no baseline optimisation approaches report effectiveness on these LLMs, therefore, we report the results of AgentCoder only
Table 2: Contribution of different agents in AgentCoder
Table 3: Pass@1 of AgentCoder with different number of iterations on GPT-3.5-turbo.
Table 4: Accuracy of the test cases.
Table 5: Line coverage of the tests
Table 6: Accuracy of the tests generated by single- and multi-agents.
Table 7: Pass@1 for a single agent and multiple agents.
Table 8: Code line coverage of tests generated by single agent and multi-agent setup.

4. SelfEvolve: A code evolution framework via large language models

Figure 1: The SELFEVOLVE pipeline. LLMs first generate corresponding knowledge for the related problem, then generate the trial answer conditioned on the knowledge. The iterative refinement step uses test cases and generated code snippets to form executable programs and then prompts LLM to refine the answer code based on the feedback thrown by the interpreter.
Table 1: Pass@1 results on the DS-1000 dataset. † denotes that the results are referred from [26]. Other baselines are implemented with the same prompt and hyperparameter setting.
Figure 2: (a) Performance-iteration curves of SELFEVOLVE on DS-1000, HumanEval and TransCoder datasets. (b) Precision and recall comparisons between generated knowledge and retrieved one.
Table 4: Comparison between SELFEVOLVE using ChatGPT and GPT-4 baselines. We bind SELFEVOLVE with ChatGPT and GPT-4 to test its generalization.
Figure 3: Two examples to show the efficacy of our proposed SELFEVOLVE methods, where red codes are wrong codes. (a) Comparison between with and without generated documentation. (b) Comparison between with and without self-refinement module.

The Gist
SELFEVOLVE is a two-stage, fully LLM-driven framework that enhances code generation by making the LLM act as its own knowledge provider and self-reflective debugger. It first prompts the model to generate its own contextual knowledge (like relevant APIs or algorithms) and then uses interpreter feedback to iteratively refine the code, eliminating the need for external retrievers or pre-written test suites.

The Problem
Existing methods for improving LLM code generation have critical limitations:

  1. Retrieval-Augmented Generation (RAG) is Bottlenecked: Methods like DocPrompting rely on an external retriever to find relevant documentation or code snippets. These retrievers are often the weakest link; they can fail to find the correct information (low recall), return irrelevant noise (low precision), and require domain-specific fine-tuning to work well.
  2. Existing Self-Correction Methods are Unrealistic: Methods like Self-Debugging often “cheat” by using ground-truth test cases from the evaluation set to provide feedback. This is not generalizable to real-world scenarios where the solution and its corresponding tests are unknown.

The Solution: SELFEVOLVE
SELFEVOLVE proposes a two-stage pipeline that internalises the entire process within the LLM, mimicking how a human programmer works.

Stage 1: Self-Knowledge Generation
Instead of retrieving knowledge, SELFEVOLVE prompts the LLM to generate it. This leverages the vast knowledge already encoded in the model’s parameters. The process is adaptive:

  • For complex problems (e.g., data science tasks on DS-1000): The LLM first generates a “trial” code solution. It then analyses this trial solution to extract the specific knowledge needed (e.g., which APIs were used), and finally, it generates detailed documentation for those APIs. This bootstraps the creation of highly relevant context.
  • For simpler problems (e.g., algorithmic tasks on HumanEval): The LLM is prompted to directly generate a high-level algorithm or step-by-step plan based on the problem description.

This self-generated knowledge is then prepended to the prompt for the next stage.

Stage 2: Iterative Self-Refinement
This stage mimics a realistic debug loop (a minimal sketch of the full two-stage pipeline follows the steps below):

  1. Initial Generation: The LLM generates code conditioned on the original prompt and the self-generated knowledge from Stage 1.
  2. Execution & Feedback: The generated code is combined with any example test cases found within the problem description and executed in a standard interpreter (e.g., Python).
  3. Refinement: If the execution fails (e.g., SyntaxError, AssertionError), the error message is fed back to the LLM. The model is then prompted to analyse the error and fix its own code.
  4. Iteration: This process repeats until the code runs without error or a fixed iteration limit is reached.
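
Here is the minimal sketch of the two-stage pipeline referenced above, assuming a generic llm(prompt) completion function and problems whose descriptions already contain example tests; the prompts are paraphrased, not the paper’s.

```python
# Sketch of SELFEVOLVE: Stage 1 asks the model for its own background knowledge,
# Stage 2 refines the code using interpreter feedback only.

def generate_knowledge(problem: str, llm) -> str:
    trial = llm(f"{problem}\nWrite a first attempt at the solution.")
    apis = llm(f"List the key APIs/algorithms used in this attempt:\n{trial}")
    return llm(f"Write concise documentation for: {apis}")

def execute(code: str, example_tests: str):
    """Run the candidate with the example tests from the prompt; return the error or None."""
    try:
        exec(code + "\n" + example_tests, {})
        return None
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"

def selfevolve(problem: str, example_tests: str, llm, max_iters: int = 3) -> str:
    knowledge = generate_knowledge(problem, llm)          # Stage 1
    code = llm(f"Relevant knowledge:\n{knowledge}\n\n{problem}\nWrite the solution.")
    for _ in range(max_iters):                            # Stage 2
        error = execute(code, example_tests)
        if error is None:
            return code
        code = llm(f"{problem}\nYour code:\n{code}\nIt failed with:\n{error}\nFix it.")
    return code
```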

Why It’s Better (Key Innovations)

  • LLM as a Knowledge Source: The core innovation is replacing the fallible external retriever with the LLM itself. The paper provides strong evidence (via human evaluation) that this self-generated knowledge is significantly more precise and comprehensive than what typical retrievers can provide, especially for complex queries.
  • Realistic Debugging Loop: The refinement process is practical and generalizable. It doesn’t require access to a hidden set of ground-truth test cases, relying only on standard interpreter feedback and examples already present in the prompt — exactly what a human developer would have.
  • Synergistic Design: The two stages reinforce each other. Better knowledge from Stage 1 leads to better initial code, making the refinement in Stage 2 easier and more effective. The refinement in Stage 2 helps correct any misuse of the generated knowledge.

Key Results & Evidence

SELFEVOLVE shows significant gains over strong baselines across multiple benchmarks.

  • Broad Superiority: It consistently outperforms retrieval-based methods (DocPrompting) and other self-correction methods (Self-Debugging) on DS-1000 (data science), HumanEval (general programming), and TransCoder (code translation).
  • Impressive Gains: On HumanEval, using ChatGPT, SELFEVOLVE achieves 78.05% pass@1, a substantial improvement over the baseline ChatGPT (66.46%) and Self-Debugging (73.78%).
  • Scalability: The framework is model-agnostic and scales effectively. When applied to GPT-4 on HumanEval, it boosts the pass@1 score from 82.00% (baseline GPT-4) to 89.02%, demonstrating that the methodology provides value even for state-of-the-art models.
  • Proven Knowledge Quality: A human evaluation on the DS-1000 dataset confirmed that the knowledge generated by SELFEVOLVE had dramatically higher precision and recall compared to knowledge fetched by a CodeT5 retriever.

The Bottom Line
SELFEVOLVE presents a paradigm shift from retrieval-augmented to self-augmented code generation. By treating the LLM as a self-sufficient agent capable of both articulating necessary knowledge and debugging its own output, it creates a more robust, realistic, and effective pipeline for generating high-quality code.

5. ClarifyGPT: Empowering LLM-based code generation with intention clarification

If you’re an AI expert working with Large Language Models for code generation, you’ve felt the frustration. You give a model like GPT-4 a prompt, and it confidently churns out code that is syntactically perfect but functionally wrong. Why? Because your prompt had a tiny, implicit ambiguity, and the LLM chose to guess instead of ask. It’s like a junior developer who’s too afraid to admit they don’t understand the spec.

The researchers from the Chinese Academy of Sciences, Beihang University, and York University have developed a framework that doesn’t just improve code generation, it makes the process more intelligent, interactive, and robust.

The Core Problem: LLMs are Confident Guessers

The paper starts with a simple, powerful example: a user requirement like “Write a function to sort a list of elements.”

An LLM sees this and makes an assumption. It might generate code to sort in ascending order. It might sort in descending order. It has no mechanism to stop and ask, “Hey, which order did you want?” This leads to a cycle of generating, testing, finding errors, and re-prompting — a tedious and inefficient process.

Human developers, in contrast, thrive on clarification. They would immediately ask about the sort order, data types, or handling of duplicates. ClarifyGPT aims to give this crucial skill to LLMs.

How ClarifyGPT Works: A Four-Stage Masterclass in Clarification

ClarifyGPT is a framework that wraps around an existing LLM (like GPT-4 or ChatGPT). It works in four distinct stages to intelligently decide when to ask a question and what question to ask.

Stage 1: Generating Test Inputs

Before anything else, ClarifyGPT needs a way to test the code that will eventually be generated. It does this in a clever two-step process:

  1. Seed Initialisation: It prompts the LLM to generate a handful of complex, difficult, and corner-case test inputs for the given requirement.
  2. Type-Aware Mutation: To generate a large volume of tests without the cost and latency of constant LLM calls, it takes these seed inputs and “mutates” them. For an integer, it might add or subtract 1. For a list, it might add, remove, or duplicate an element.

This gives ClarifyGPT a robust suite of tests to work with in the next stage.
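
A sketch of type-aware mutation over a few seed inputs; the specific mutation rules below are illustrative examples in the spirit of those mentioned above, not the paper’s complete rule set.

```python
# Sketch of type-aware mutation: cheaply expand a few LLM-generated seed
# inputs into a large test-input suite without further LLM calls.
import random

def mutate(value):
    if isinstance(value, bool):
        return not value
    if isinstance(value, int):
        return value + random.choice([-1, 1])          # nudge integers
    if isinstance(value, float):
        return value * random.choice([0.5, 2.0])
    if isinstance(value, str):
        return value + random.choice(["", " ", "a"])   # append a character
    if isinstance(value, list):
        copy = list(value)
        if copy and random.random() < 0.5:
            copy.pop(random.randrange(len(copy)))      # drop an element
        else:
            copy.append(mutate(copy[-1]) if copy else 0)  # add a (mutated) element
        return copy
    return value

def expand_seeds(seeds, n_mutants_per_seed=20):
    suite = list(seeds)
    for seed in seeds:
        suite += [tuple(mutate(v) for v in seed) for _ in range(n_mutants_per_seed)]
    return suite

# e.g. seeds produced by the LLM for a sorting task:
print(expand_seeds([([3, 1, 2],), ([],)])[:5])
```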

Stage 2: Detecting Ambiguity with a Consistency Check

This is the heart of ClarifyGPT’s novelty. How does it know if a requirement is ambiguous? By checking whether the LLM is consistent with itself.

Imagine you give a vague instruction, “Draw me a picture of a pet,” to ten different artists. If they all drew a golden retriever, your instruction was probably clear enough in context. But if you get back five dogs, three cats, a parrot, and a hamster, your instruction was undeniably ambiguous.

ClarifyGPT does exactly this with code:

  1. It takes the user’s requirement and asks the LLM to generate n different code solutions for it (by using a high temperature to encourage diversity).
  2. It then runs all n solutions against the test inputs generated in Stage 1.
  3. It compares the outputs.
  • If all solutions produce the same outputs, the requirement is deemed unambiguous. ClarifyGPT simply returns one of the generated solutions and stops. No unnecessary questions asked!
  • If the solutions produce different outputs, this is a massive red flag. The framework flags the requirement as ambiguous and proceeds to the next stage.

This consistency check is brilliant because it doesn’t require any pre-trained classifier for ambiguity. It uses the LLM’s own diverse interpretations as the signal.
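
In code, the consistency check might look roughly like this, assuming an llm(requirement, temperature=...) sampler and candidate solutions exposed as a function named solution (both assumptions for illustration only).

```python
# Sketch of the consistency check: sample n diverse solutions, run them on the
# mutated test inputs, and flag the requirement as ambiguous if their outputs disagree.

def behaviour(code: str, test_inputs):
    """Run one candidate on every test input; exceptions count as an outcome."""
    scope = {}
    try:
        exec(code, scope)
    except Exception:
        return ("<invalid code>",)
    outputs = []
    for args in test_inputs:
        try:
            outputs.append(repr(scope["solution"](*args)))
        except Exception as exc:
            outputs.append(type(exc).__name__)
    return tuple(outputs)

def is_ambiguous(requirement: str, test_inputs, llm, n: int = 5) -> bool:
    candidates = [llm(requirement, temperature=0.8) for _ in range(n)]
    behaviours = {behaviour(code, test_inputs) for code in candidates}
    return len(behaviours) > 1   # disagreement => the requirement is ambiguous
```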

Stage 3: Generating Targeted Clarifying Questions

Once an ambiguity is detected, ClarifyGPT needs to ask a good question. A vague question like “Can you clarify?” is useless. It needs to be targeted.

To do this, it uses a reasoning-based prompting technique similar to Chain-of-Thought (CoT):

  1. It takes at least two of the inconsistent code solutions from the previous stage (e.g., one that sorts ascending and one that sorts descending).
  2. It feeds these, along with the original requirement, into the LLM with a special prompt. The prompt essentially says:
  • “Here is a user’s requirement. I generated two different code solutions for it that produce different results. Please analyse the functional differences between these two pieces of code and formulate a targeted question to the user that would resolve this difference.”

The LLM then notices that one solution uses sorted(list) while the other uses sorted(list, reverse=True), identifies the core difference, and generates the perfect question: “Should the sorting be in ascending or descending order?”

This is far more effective than just asking the LLM to guess what might be ambiguous about the text alone. By comparing concrete, functional implementations, it can pinpoint the exact source of ambiguity.

Stage 4: Refining the Prompt and Generating the Final Code

This final stage is straightforward.

  1. ClarifyGPT presents the generated questions to the user.
  2. The user provides answers (e.g., “Ascending order, please”).
  3. ClarifyGPT creates a new, refined prompt by appending the clarifications to the original requirement.
  4. It feeds this crystal-clear prompt back to the LLM to generate the final, correct code solution.

The Evaluation: Does It Actually Work?

The researchers conducted two rigorous evaluations to prove ClarifyGPT’s effectiveness.

1. Human Evaluation:
Ten participants used ClarifyGPT (with GPT-4 as the base model) on the MBPP-sanitised and MBPP-ET benchmarks. The results were striking:

  • On MBPP-sanitised, GPT-4’s Pass@1 score jumped from 70.96% to 80.80%. That’s a nearly 14% relative improvement.
  • The system was shown to be effective in a real-world, human-in-the-loop scenario.

2. Automated Evaluation with a User Simulator:
To run large-scale, reproducible tests, the team created a high-fidelity user simulator. This is a crucial contribution for anyone working on interactive AI.

  • How the Simulator Works: Since the benchmarks have ground-truth test cases, the simulator can accurately answer clarifying questions. It prompts an LLM with the question and the ground-truth test cases, essentially asking, “Given that this is the expected final behaviour, how would you answer this question?” This enables it to generate realistic, high-fidelity answers without human intervention.

The automated results confirmed the human study across four benchmarks and two models:

  • With GPT-4, the average Pass@1 score across four benchmarks improved from 68.02% to 75.75%.
  • With ChatGPT (gpt-3.5-turbo), the average score improved from 58.55% to 67.22%.

Crucially, ClarifyGPT significantly outperformed standard prompting, Chain-of-Thought, and another tool, GPT-Engineer, which tends to ask questions for all requirements, leading to unnecessary interaction and sometimes even confusing the LLM.

Why This is a Big Deal for AI Experts

ClarifyGPT represents a significant step forward in human-AI collaboration for software development.

  1. It Moves Beyond Guesswork: It introduces a principled, reliable method for LLMs to handle ambiguity, moving them from brittle guessers to collaborative partners.
  2. The “When” and “What” is Key: The paper’s core contribution isn’t just “asking questions,” but the intelligent mechanisms for deciding when to ask (consistency check) and what to ask (reasoning-based generation from inconsistent examples).
  3. Improved Efficiency and User Experience: By only asking questions when necessary and making them highly specific, it reduces the back-and-forth and respects the developer’s time.
  4. A Framework for Future Work: The user simulation method provides a valuable tool for evaluating interactive AI systems without the bottleneck of constant human studies.

In conclusion, ClarifyGPT is more than just a clever prompting trick. It’s a well-designed framework that endows LLMs with a fundamental component of intelligence: knowing what you don’t know and having the ability to find out. For anyone building tools on top of LLMs, this paper is a must-read.

Fig. 1. The Overview of ClarifyGPT
Fig. 2. List of basic type-aware mutations over input 𝑥 [29]
Fig. 3. The details of the prompts used in ClarifyGPT
Table 1. Statistics of benchmarks: the total number of problems in the benchmark (Problem Nums), the average number of test cases per problem (AVG.Tests per Problem), and the average/maximum/minimum number of prompt words in the benchmark (AVG/MAX/MIN.Words in Prompt).
Table 2. The Pass@1(%) of ClarifyGPT (Human Feedback) and baselines on two code generation benchmarks. Numbers in red denote ClarifyGPT’s relative improvements compared to the Default.
Table 3. The Pass@1(%) of ClarifyGPT (Simulated Feedback) and baselines on four code generation benchmarks. Numbers in red denote ClarifyGPT (Simulated Feedback)’s relative improvements compared to the Default.
Table 4. Experimental results of ClarifyGPT with different number of demonstrations. Numbers in red denote the relative improvement of ClarifyGPT with different number of demonstrations compared to the Default.
Fig. 4. Two real cases from HumanEval and MBPP generated by two baselines and our ClarifyGPT.

6. CodeGen: A conversational paradigm for program synthesis

In the fast-evolving world of AI, Large Language Models (LLMs) that can write code have shifted from a “holy grail” concept to a practical reality. While models like OpenAI’s Codex have demonstrated incredible capabilities, many of the most powerful ones remain closed-source, limiting research and accessibility.

Enter CODEGEN, a family of open-source LLMs for code from Salesforce Research. In their ICLR 2023 paper, the authors not only introduce a powerful, competitive model but also explore a novel and more intuitive paradigm for human-AI collaboration: Multi-Turn Program Synthesis.

What is CODEGEN?

  • An Open-Source Family of Models: CODEGEN is a series of autoregressive transformer models for code, with sizes up to a hefty 16.1 billion parameters. The authors have open-sourced the model checkpoints, the MTPB benchmark, and their custom JAX-based training library, JAXFORMER, democratizing access to state-of-the-art code generation.
  • A Specialised Training Journey: The models weren’t just trained on code. They underwent a sequential, three-stage training process that progressively specialised their skills.
  • A New Programming Paradigm: The paper’s core contribution is the study and validation of multi-turn program synthesis — breaking down a complex coding task into a series of smaller, conversational steps.
  • Competitive Performance: On standard benchmarks, CODEGEN proves to be a strong contender, performing on par with or even outperforming models of similar size, including versions of Codex.

The Training Journey: From Generalist to Python Specialist

The effectiveness of CODEGEN stems from its curated training pipeline. The models were trained sequentially on three distinct datasets:

  1. CODEGEN-NL (Natural Language): The journey begins with The Pile, a massive 825GB dataset of diverse English text. This initial training gives the model a broad understanding of natural language, grammar, and reasoning. The “-NL” models are generalists.
  2. CODEGEN-MULTI (Multi-Lingual): Next, the NL model is fine-tuned on BigQuery, a large dataset containing code from six popular languages (C, C++, Go, Java, JavaScript, and Python). This turns the generalist into a polyglot programmer, capable of understanding the syntax and patterns of multiple languages.
  3. CODEGEN-MONO (Mono-Lingual): Finally, the MULTI model is further specialised on BigPython, a huge, Python-only dataset. This hones its abilities to make it an expert in a single language.

As the results show, this specialisation pays off. The MONO models significantly outperform the MULTI models on Python tasks, which in turn trounce the NL models.

How Does CODEGEN Stack Up?

Before introducing their new paradigm, the researchers first validated CODEGEN on the standard HumanEval benchmark. This is a “single-turn” benchmark where the model is given a single, complete prompt (docstring, function signature, examples) and must generate a correct Python function.

The results are impressive:

  • Scaling Works: Performance consistently improves with model size and data specialisation (NL < MULTI < MONO).
  • Competitive with SOTA: The largest model, CODEGEN-MONO 16.1B, is competitive with OpenAI’s powerful 12B Codex model, achieving a pass@100 score of 75% (meaning that for 75% of problems, at least one of 100 sampled solutions is correct; see the estimator sketch below).
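
For reference, pass@k here is the unbiased estimator from Chen et al. (2021) that the HumanEval evaluation follows; a minimal implementation:

```python
# Unbiased pass@k estimator (Chen et al., 2021): given n samples per problem,
# of which c are correct, estimate the probability that at least one of k
# randomly drawn samples is correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 100 samples, 12 of them correct: pass@1 and pass@10
print(pass_at_k(100, 12, 1), pass_at_k(100, 12, 10))
```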

An interesting insight from this phase was the link between prompt perplexity and success. Perplexity measures how “surprised” a model is by a sequence of text. The authors found that for problems the model successfully solved, the prompts had significantly lower perplexity.

Analogy: Imagine you’re giving instructions to an assistant. If your instructions are clear and align with their past experiences (low perplexity), they’re likely to succeed. If your instructions are confusing or unusual (high perplexity), they’re more likely to fail. This finding suggests that making the user’s intent easier for the model to understand is key to better code generation.

This is the perfect segue to their main idea.

The Big Idea: Multi-Turn Program Synthesis

The core innovation of this paper is moving beyond single, monolithic prompts to a conversational, step-by-step process.

Single-Turn Analogy: Giving a chef a single, complex, multi-page recipe and saying, “Make this.” They have to parse everything at once: the ingredients, the prep, the multiple cooking stages, the plating, and track all dependencies in their head.

Multi-Turn Analogy: Guiding that same chef step-by-step.

  • You: “First, finely dice one onion and two carrots.”
  • Chef: Dices the vegetables.
  • You: “Great. Now, sauté them in olive oil until translucent.”
  • Chef: Sautés the vegetables.

This multi-turn approach breaks a complex task into manageable sub-problems. The researchers hypothesised this would work better for two reasons:

  1. Reduced Cognitive Load: Each step is simpler for the model to understand and execute, reducing the search space for a correct solution.
  2. Mimics Real-World Data: Code on GitHub often has an interleaved pattern of natural language comments followed by a block of code. The model learns this pattern during pre-training, an “emergent capability” that can be leveraged for multi-turn synthesis.

The MTPB: A New Benchmark for a New Paradigm

To test this hypothesis, the team created the Multi-Turn Programming Benchmark (MTPB). It consists of 115 programming problems, each broken down into 3+ conversational turns. The model must not only solve the current turn’s prompt but also remember the context (variables, functions) from all previous turns.
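
Conceptually, a multi-turn evaluation run looks like the following sketch, assuming a generic llm(context) completion function; the three example turns are an illustrative paraphrase of the email-address task shown in Figure 1 below, not the benchmark’s exact prompts.

```python
# Sketch of multi-turn synthesis on an MTPB-style problem: feed the turns one
# at a time, conditioning each completion on all earlier prompts and code.

def multi_turn_synthesis(turn_prompts, llm):
    context, program_parts = "", []
    for prompt in turn_prompts:
        context += f"# {prompt}\n"          # natural-language step as a comment
        completion = llm(context)           # model writes code for this step only
        context += completion + "\n"
        program_parts.append(completion)
    return "\n".join(program_parts)         # concatenated program, then executed

turns = [
    "Define the string s as the email address 'abc.xyz@example.com'.",
    "Extract the part of s before the '@' sign.",
    "Replace '.' with a space and print the result.",
]
# program = multi_turn_synthesis(turns, llm); exec(program)  -> expected "abc xyz"
```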

Figure 1: An illustrative example for the Multi-Turn Programming Benchmark, performing the task of extracting the user name of an email address. (1) Each problem consists of prompts p_i and unit tests, where some prompts include templates (i.e. {input}) that are filled with test case inputs before being fed to the model. In the displayed example, the input is a string containing abc.xyz@example.com, which replaces {input} in p_2, and the expected output is abc xyz. (2) Our model conditions on the concatenation of interleaved past prompts and generated responses. (3) Generated responses from each turn are concatenated and executed, where the output is compared to the answer.

The Results Are In: Multi-Turn Wins Big

When comparing the multi-turn approach against a single-turn version (where all prompts were concatenated into one large prompt), the results were clear and compelling:

Table 4: Comparison between multi- and concatenated single-turn specifications on perplexity (PPL) and program synthesis performance (as measured by pass rate) under CODEGEN-MONO models.

Key Findings from the Multi-Turn vs. Single-Turn comparison.

  • Higher Pass Rates: The multi-turn method dramatically improved the success rate across all model sizes.
  • Lower Perplexity: Just as they predicted, the models were less “confused” by the step-by-step prompts, exhibiting lower perplexity.
  • Greatest Impact on Hard Problems: The improvement was most significant for problems rated as “medium” or “hard.” For very easy problems, the largest models were smart enough to solve them in one go anyway, but for anything complex, breaking it down was a huge benefit.

Key Takeaways

  1. Open Source Matters: By releasing CODEGEN, JAXFORMER, and MTPB, Salesforce Research has provided invaluable tools for the entire AI community to build upon.
  2. Specialisation is Key: The sequential training pipeline (NL -> MULTI -> MONO) demonstrates a powerful method for creating expert models.
  3. Interaction is the Future: The success of multi-turn synthesis suggests that the future of AI-assisted programming isn’t just about giving a single command, but about having a collaborative dialogue with the AI, guiding it step-by-step to build complex software.
  4. Clarity is King: The link between lower prompt perplexity and higher success is a crucial lesson. The easier we can make our intent for the model to understand, the better the results will be. Breaking down problems is a fantastic way to achieve that clarity.

This work not only gives us a new set of powerful, open tools but also points toward a more intuitive and effective way to collaborate with our AI coding partners.

Table 1: Evaluation results on the HumanEval benchmark. Each pass@k (where k ∈ {1, 10, 100}) for each model is computed with three sampling temperatures (t ∈ {0.2, 0.6, 0.8}) and the highest one among the three are displayed, which follows the evaluation procedure in Chen et al. (2021). Results for the model marked with ∗ are from Chen et al. (2022).

7. Toolformer: Language models can teach themselves to use tools

Large Language Models (LLMs) live a life of paradox. They can write beautiful poetry, explain quantum physics, and generate flawless code, yet they can stumble on simple arithmetic (7 * 8 = ?), confidently invent (“hallucinate”) facts, and remain blissfully unaware of events that happened yesterday. They are, in a sense, brilliant but book-smart geniuses locked in a library with no connection to the outside world.

This “Toolformer: Language Models Can Teach Themselves to Use Tools” paper from Meta AI Research presents an elegant and powerful solution to this problem. Instead of just making the model bigger, they’ve given it a toolbox and, more importantly, taught it how to use those tools all by itself.

Let’s dive into how this groundbreaking approach works and why it’s a significant step toward more capable and reliable AI.

What is Toolformer?

  • A Self-Taught Handyman: Toolformer is a language model (based on a 6.7B parameter GPT-J) that has learned to use external tools like a calculator, a Q&A system, and a search engine by calling simple APIs.
  • The Secret Sauce is Self-Supervision: The model learns when and how to use these tools without massive human annotation. It generates a massive dataset of potential tool uses and then uses its own judgment to filter for only the ones that actually help it with its primary task: predicting the next word.
  • A Versatile Toolbox: The researchers equipped the model with five distinct tools:
  1. Question Answering System: For direct fact lookups.
  2. Wikipedia Search Engine: For retrieving broader information snippets.
  3. Calculator: For precise mathematical calculations.
  4. Machine Translation System: To handle multilingual text.
  5. Calendar: To provide awareness of the current date.
  • Stunning Results: With its tools, the relatively small Toolformer (6.7B) substantially outperforms its base model and often surpasses the much, much larger GPT-3 (175B) on tasks that require factual accuracy, math, and up-to-date information, all while maintaining its core language abilities.

The Big Idea: Learning by Helping Yourself

Previous attempts to give LLMs tools either required enormous amounts of expensive human-labelled data or were limited to very specific, pre-defined tasks. The genius of Toolformer is that it flips the script: the model teaches itself by discovering what helps it.

The core intuition is this: An API call is “good” or “useful” if the information it returns makes the model’s job of predicting the rest of a sentence easier.

Analogy: Imagine you’re an LLM trying to complete the sentence: “The capital of France is ___.” You could just guess based on your training data. But what if you could pause and ask an expert? You might think to yourself, “I’ll make an API call: [QA(“What is the capital of France?”)].” An external tool then provides the answer: → Paris. With this new information, “Paris” becomes an extremely easy word to predict next. The API call was clearly helpful!

Toolformer formalises this intuition into a simple, three-step, self-supervised process to create its own training data.

The Three-Step Training Process

The researchers started with a standard language modelling dataset (a subset of CCNet) and a pre-trained GPT-J model. They then augmented this dataset using the following loop:

Step 1: Sample API Calls (Brainstorming)
For a given text, the model is prompted with a few examples of how to use an API. It then reads through the text and “brainstorms” potential API calls it could make. For the text “…the Spanish word for turtle,” the model might generate a call like [MT(“tortuga”)]. For “…400 out of 1400 participants,” it might generate [Calculator(400 / 1400)].

Step 2: Execute API Calls (Getting Answers)
All these potential API calls are executed. The Machine Translation tool returns “turtle”, and the Calculator returns “0.29”.

Step 3: Filter API Calls (Self-Correction)
This is the most critical step. For each API call, the model asks itself a crucial question: “Does seeing the result of this API call make me better at predicting the next few words in the original text?”

This is measured by calculating the model’s loss (a measure of prediction error). If the loss is lower with the API result than without it, the API call is deemed “useful” and is permanently inserted into the text. If it doesn’t help or makes things worse, it’s discarded.

After running this process over a huge dataset for all five tools, the result is a new, augmented dataset C* filled with examples of the model intelligently calling APIs to gather information. The original GPT-J model is then fine-tuned on this new dataset, effectively learning from its own curated examples.
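
The filtering criterion can be sketched as a simple loss comparison, assuming a loss(prefix, continuation) helper that returns the model’s (weighted) negative log-likelihood of the continuation given the prefix; the helper and the threshold value are assumptions for illustration.

```python
# Sketch of Toolformer's filtering step: keep an API call only if prepending
# its result makes the continuation easier to predict (lower loss), by at
# least a margin tau, compared to both no call and a call without its result.

def keep_api_call(prefix, continuation, call, result, loss, tau=1.0):
    with_result    = loss(prefix + f"[{call} -> {result}] ", continuation)
    without_result = loss(prefix + f"[{call}] ", continuation)
    no_call        = loss(prefix, continuation)
    return min(without_result, no_call) - with_result >= tau

# Example: prefix = "The capital of France is", continuation = " Paris.",
# call = 'QA("What is the capital of France?")', result = "Paris".
```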

Putting Toolformer to the Test

The results are striking, demonstrating how a smaller model with the right tools can punch far above its weight.

  • Factual Probing (LAMA Benchmark): When asked to complete factual statements like “The New England Journal of Medicine is a registered trademark of ___,” Toolformer learned to use the QA tool almost 98% of the time. This allowed it to outperform both OPT (66B) and GPT-3 (175B), models that are 10x and 25x its size.
  • Mathematical Reasoning (Math Datasets): On math word problems, Toolformer used the calculator tool in nearly all cases. Its performance skyrocketed, more than doubling the score of its tool-less version and again clearly beating GPT-3. This directly fixes one of the most well-known weaknesses of LLMs.
  • Question Answering (WebQS, NQ, etc.): Using the Wikipedia Search tool, Toolformer significantly improved its QA capabilities, outperforming all same-sized baselines. It still trailed the massive GPT-3, which the authors suggest might be due to the simplicity of their search tool and the model’s inability to interact with it (e.g., refine a search query).
  • Temporal Awareness (DATESET): On a custom dataset requiring knowledge of the current date (e.g., “What day of the week was it 30 days ago?”), Toolformer learned to query the Calendar API, boosting its accuracy from 5.9% to 27.3%, while giant models like GPT-3 scored less than 1%.

Crucially, the fine-tuning did not degrade the model’s core language skills. When the tools were disabled, Toolformer performed just as well on language modelling benchmarks as a model fine-tuned on the same data without any API calls.

Key Takeaways

  1. The Future is Hybrid: Toolformer demonstrates a powerful hybrid approach, combining the generative, pattern-matching strengths of LLMs with the reliable, factual, and precise nature of external tools.
  2. Self-Supervision Unlocks New Skills: The most significant contribution is showing that models can learn complex behaviours like tool use without needing a massive, human-curated dataset. The “is this helpful for prediction?” signal is surprisingly effective.
  3. Efficiency Over Scale: A 6.7B parameter model with a calculator is better at math than a 175B parameter model without one. This suggests that for many real-world applications, equipping smaller, more efficient models with the right tools is a better path than simply scaling up model size indefinitely.
  4. Fixing Core Flaws: This approach directly addresses some of the most criticised flaws in LLMs: factual hallucination, mathematical errors, and a lack of real-time knowledge.

While the authors note limitations — like the inability to “chain” tools (use the output of one as the input to another) — Toolformer represents a fundamental shift in how we can build and enhance language models, moving them from isolated “brains in a vat” to capable agents that can interact with and leverage the digital world to become vastly more useful and reliable.

Figure 1: Exemplary predictions of Toolformer. The model autonomously decides to call different APIs (from top to bottom: a question answering system, a calculator, a machine translation system, and a Wikipedia search engine) to obtain information that is useful for completing a piece of text.
Figure 2: Key steps in our approach, illustrated for a question answering tool: Given an input text x, we first sample a position i and corresponding API call candidates c¹ᵢ, c²ᵢ, …, cᵏᵢ. We then execute these API calls and filter out all calls which do not reduce the loss Lᵢ over the next tokens. All remaining API calls are interleaved with the original text, resulting in a new text x*.
Figure 3: An exemplary prompt P(x) used to generate API calls for the question answering tool.
Table 1: Examples of inputs and outputs for all APIs used.
Figure 4: Average performance on LAMA, our math benchmarks and our QA benchmarks for GPT-2 models of different sizes and GPT-J finetuned with our approach, both with and without API calls. While API calls are not helpful to the smallest models, larger models learn how to make good use of them. Even for bigger models, the gap between model predictions with and without API calls remains high.
Table 10: Examples of API calls for different tools, sorted by the value of Lᵢ⁻ − Lᵢ⁺ that is used as the filtering criterion.

8. ToolCoder: Teach code generation models to use API search tools

In the rapidly evolving field of AI-driven code generation, Large Language Models (LLMs) like CodeGen and GPT-3.5 have demonstrated remarkable capabilities. However, a persistent and critical challenge remains: their struggle with Application Programming Interface (API) selection. These models often hallucinate non-existent APIs, misuse existing ones, or fail when dealing with lesser-known or private libraries not present in their training data.

The paper, titled "ToolCoder: Teach Code Generation Models to use API Search Tools" and written by researchers from Peking University, introduces a novel and highly effective approach to this problem. Instead of relying solely on the model's parametric knowledge, ToolCoder teaches the model to behave like a human developer: when in doubt, use a search tool.

This summary will break down the core concepts, methodology, and striking results of ToolCoder, providing an in-depth look for AI practitioners.

1. The Core Problem: LLMs are Poor API Librarians

The authors identify three primary failure modes for LLMs in API-heavy code generation:

  1. Generating Non-Existent APIs: For popular libraries like NumPy, a model might generate a plausible-sounding but non-existent method call, like a.count(2) on a NumPy array (the correct approach might involve np.count_nonzero(a == 2)).
  2. Generating Unqualified APIs: A model might select a real API that doesn’t meet the specific requirements. For instance, when asked to sum all values in a Pandas DataFrame into a single numeric value, a model might generate df.sum(), which returns a Pandas Series, not a scalar numeric value.
  3. Lack of Knowledge for Private/New Libraries: This is the most severe issue. For internal company libraries or newly released open-source ones (like TorchData), which are absent from pre-training corpora, models are effectively guessing. The paper shows that for private libraries, the API-related error rate can exceed 90%.

These issues highlight a fundamental limitation: a model’s knowledge is static and bounded by its training data. The world of software, with its ever-expanding universe of libraries and APIs, is dynamic.

2. The Solution: The ToolCoder Pipeline

Inspired by the human developer workflow (summarise need -> search Google/docs -> select API), ToolCoder integrates external API search tools directly into the code generation process. The methodology is a clever three-stage pipeline:

Stage 1: Automatic Dataset Annotation with ChatGPT (Creating the “Textbook”)

To teach a model to use a tool, it needs to see examples of tool usage. Manually creating such a dataset would be prohibitively expensive. The authors’ key innovation here is an automated annotation process using ChatGPT.

  • Process: They take a large, existing code dataset (CodeSearchNet) and use ChatGPT with a few-shot prompt to “annotate” it. The prompt instructs ChatGPT to identify where an API is used and insert a special “tool call” block right before it.
  • Format: The tool call is standardised into a specific format:
    <API>APISearch(query)→answer</API>
  • <API> & </API>: Special tokens that demarcate the tool interaction.
  • Query: A natural language description of the API’s function that the model learns to generate. Example: “Gives a new shape to an array without changing its data.”
  • Answer: The actual API call that should be used. Example: np.reshape.

After annotation and cleaning, they created a dataset of 53,000 code samples enriched with these “chain-of-thought” style tool calls.
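
For illustration, a single annotated sample might look like the sketch below; the surrounding NumPy code is invented here, and only the <API>APISearch(query)→answer</API> format and the example query/answer come from the paper. In the training text the tool-call block sits inline, right before the API it suggests, so it is shown inside a Python string because the annotated text itself is not executable code.

```python
# A hypothetical ToolCoder-style annotated training sample (illustrative only).
annotated_sample = '''\
import numpy as np

a = np.arange(6)
b = <API>APISearch("Gives a new shape to an array without changing its data.")→np.reshape</API>np.reshape(a, (2, 3))
'''
print(annotated_sample)
```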

Stage 2: Parameter-Efficient Fine-Tuning (The “Study Session”)

With the annotated dataset, the authors don’t train a new model from scratch. Instead, they fine-tune existing pre-trained code models (CodeGen-350M and CodeGen-2B). To make this process highly efficient, they employ Low-Rank Adaptation (LoRA).

By freezing most of the model’s weights and only training small, low-rank matrices injected into the attention layers, they reduce the number of trainable parameters by over 99% (e.g., only 0.09% for CodeGen-2B). This allows the powerful CodeGen models to be fine-tuned on the new task using a single consumer-grade GPU (like an RTX 2080), making the approach accessible and computationally inexpensive.
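
As a rough sketch, a comparable LoRA setup with the Hugging Face PEFT library might look like this; the rank, alpha, dropout, and target module name are assumptions rather than the authors' exact configuration.

```python
# Sketch of a LoRA setup with Hugging Face PEFT, assuming the hyperparameters
# below; "qkv_proj" is CodeGen's fused attention projection (assumed target).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["qkv_proj"],
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights is trainable
```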

Stage 3: Inference with an External Tool (The “Open-Book Exam”)

This is where the magic happens. During inference, the fine-tuned ToolCoder model generates code token-by-token. The process is augmented as follows:

  1. The model generates code until it outputs the special <API> token. This signals that it needs external help.
  2. The generation process is paused. The model then continues generating the search query (e.g., “Selects a single row of data from a DataFrame.”).
  3. This query is extracted, and a real, external API search tool is called.
  • For public libraries (NumPy, Pandas), the tool is an actual web search engine (DuckDuckGo) configured to search trusted sites like StackOverflow and official documentation.
  • For private libraries, the tool is a documentation searcher using a BM25 retrieval algorithm over the library’s local documentation.
  4. The top result (the answer) from the tool is retrieved (e.g., pandas.DataFrame.iloc).
  5. This answer is formatted back into the →answer</API> string and injected into the model’s context.
  6. The model then resumes generation, now equipped with the correct API suggestion, and generates the final code (e.g., df.iloc[n][column_name]).

Analogy: Imagine an AI student taking a programming test. It writes the solution until it encounters a function it’s unsure about. It then stops, writes down a specific question on a piece of paper (query), hands it to a librarian (the APISearch tool), gets a specific book and page number (answer), and then uses that information to confidently complete the problem.
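
A minimal sketch of this pause-search-resume loop is given below, with `generate_until` and `api_search` as hypothetical helpers standing in for the model's decoding loop and the DuckDuckGo/BM25 searchers; the stop-marker handling is simplified.

```python
# Sketch of the ToolCoder inference loop under the assumptions above.
import re

def toolcoder_generate(prompt, generate_until, api_search, max_tool_calls=5):
    context = prompt
    for _ in range(max_tool_calls):
        # Decode until the "→" marker that separates the query from the answer,
        # or until the snippet is complete.
        chunk, stopped_on_marker = generate_until(context, stop="→")
        context += chunk
        if not stopped_on_marker:
            return context  # no (further) tool call was requested

        # Extract the natural-language query the model just wrote.
        match = re.search(r'<API>APISearch\("([^"]*)"\)$', context)
        if match is None:
            return context
        answer = api_search(match.group(1))   # e.g. "pandas.DataFrame.iloc"

        # Inject the answer so the model can generate the final code with it.
        context += f"→{answer}</API>"
    return context
```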

3. Experimental Results: A Clear Win

ToolCoder was evaluated against a suite of strong baselines, including CodeGen, GPT-3.5, and specialised API-oriented models like CERT and CodeGenAPI, on five benchmarks. The metric used was pass@k, the standard for functional correctness in code generation.

Key Finding 1: Dominance on Public Libraries
On benchmarks for NumPy, Pandas, and the “unseen” TorchData library, ToolCoder significantly outperformed all baselines.

  • It achieved an average improvement of at least 6.21% on pass@1 across all five benchmarks compared to the best SOTA method.
  • Remarkably, the relatively small ToolCoder-2B model achieved performance comparable to GPT-3.5, demonstrating the immense value of tool integration over sheer model size.

Key Finding 2: Generalisation to Private Libraries
This is where ToolCoder truly shines. On the private library benchmarks (MonkeyEval, BeatNumEval), where models have no prior knowledge, standard models failed almost completely.

  • By simply swapping the search tool from the web engine to the local documentation searcher, ToolCoder adapted seamlessly.
  • It showed stable and significant improvements, proving its ability to generalise to entirely new, domain-specific coding environments.

Key Finding 3: Ablation Studies Prove Every Component Matters
The authors conducted rigorous ablation studies to validate their design choices:

  • Dataset is Crucial: Training on the original code without the <API> annotations yielded no significant improvement. This confirms that the model must be explicitly taught how to use tools. Furthermore, removing the query from the annotation (i.e., just learning to call a generic search) caused a drastic performance drop, proving that learning to formulate the right question is a critical part of the process.
  • The External Tool is Essential: They ran an experiment where the model generated the entire tool-call block but didn’t actually call the external tool (the “NoTool” setting). While this “chain-of-thought” process alone provided a slight boost to public libraries, performance plummeted in private ones. This proves that while the fine-tuning teaches the model a reasoning process, the actual real-time information retrieval from the tool is indispensable for handling out-of-domain knowledge.
  • LoRA is Effective and Efficient: The parameter-efficient fine-tuning with LoRA achieved results nearly identical to full-model fine-tuning while reducing training time from 29 hours to 6 hours and shrinking the trainable parameter count from 350M to 0.65M.

4. Conclusion and Implications

ToolCoder presents a paradigm shift for code generation. It moves away from the idea of creating monolithic LLMs that must memorise the entire programming universe towards a more practical, scalable, and human-like model of synergising internal knowledge with external information retrieval.

Key Contributions:

  • A Novel Framework: The first successful integration of programming-specific search tools into the code generation loop.
  • An Automated Data Pipeline: A low-cost, effective method for creating tool-use datasets using an existing LLM (ChatGPT), a technique applicable to other domains.
  • State-of-the-Art Performance: Demonstrably superior results, especially in the challenging and practical scenarios of unseen and private library usage.

The success of ToolCoder opens up exciting future research directions. One can imagine models augmented with a whole suite of developer tools: linters, debuggers, profilers, and even version control systems, leading to more robust, reliable, and powerful AI programming assistants.

Fig. 1. An illustrative example of the process of human programmers selecting the proper API during coding. Programmers summarize their demands into a query (remove single-dimensional entries) and use the search engine tool or documentation search tool to get the proper API suggestion (np.squeeze).
Fig. 2. Failure Cases of CodeGen-2B model in selecting APIs, including generating non-existing APIs on public libraries (Case 1), generating unqualified APIs (Case 2), and lack of API-related knowledge on private libraries (Case 3).
TABLE I COMPARISONS OF TWO TYPES OF SEARCH TOOLS FOR API SELECTION.
TABLE II STATISTICS OF THE ANNOTATION DATASET.
Fig. 3. The pipeline of our approach ToolCoder. The pipeline has three main parts: (1) Automatically Annotate Tool-augmented Dataset with ChatGPT, (2) Parameter-efficient Fine-tune existing pre-trained code generation model with the annotated dataset, and (3) Inference of the fine-tuned model enhanced with API search tools.
Fig. 4. An exemplary prompt used to generate API-augmented datasets for the API search tool. In our setting, we selected a total of three human-written input-output pairs as part of the prompt, using three libraries: numpy, pandas, and matplotlib.
TABLE III PASS RATE OF MODELS ON PUBLIC LIBRARY BENCHMARKS
TABLE IV PASS RATE OF MODELS ON PRIVATE LIBRARY BENCHMARKS
TABLE V ABLATION STUDIES ON DATASET SETTINGS. WE CONDUCT EXPERIMENTS ON TOOLCODER-350M.

9. RepairAgent: An autonomous, LLM-based agent for program repair

The field of Automated Program Repair (APR) has rapidly been dominated by Large Language Models. We’ve seen a swift evolution from single-shot prompts (“here’s buggy code, fix it”) to more sophisticated iterative approaches that feed back compilation errors and test failures to the LLM. However, these methods, while effective, treat the LLM as a powerful but passive component in a rigid, hard-coded loop.

A groundbreaking paper from the University of Stuttgart and UC Davis, “RepairAgent: An Autonomous, LLM-Based Agent for Program Repair,” argues for a fundamental paradigm shift. Instead of just prompting an LLM, they treat it as an autonomous agent capable of planning, reasoning, and using tools to actively debug code in a process that strikingly mirrors a human developer’s workflow.

This summary will deconstruct RepairAgent’s architecture, its novel approach to agent-driven repair, and the impressive results that establish a new state-of-the-art in the field.

1. The Core Limitation of Existing LLM-Based Repair

The authors identify a key weakness in current SOTA techniques (like ChatRepair and ITER): their feedback loops are fixed. The process is typically:

  1. Prompt LLM with buggy code.
  2. LLM generates a patch.
  3. Apply the patch and run tests.
  4. If tests fail, feed the error messages back into the prompt.
  5. Repeat.

This is a reactive process. The LLM can’t decide on its own to gather more context. It can’t say, “This error is confusing, I need to read the calling function,” or “I wonder how this API is used elsewhere in the codebase.” A human developer, in contrast, constantly interleaves understanding, searching, and experimenting. RepairAgent is the first system designed to replicate this intelligent, autonomous exploration.

2. The RepairAgent Architecture: A Thinking and Acting Agent

RepairAgent is composed of three core components that work in a continuous cycle: an LLM Agent, a set of Tools, and a Middleware orchestrator.

  1. The LLM Agent (The “Brain”): This is a general-purpose LLM (GPT-3.5 in the paper) that, instead of just generating code, plans its next action.
  2. The Tools (The “Hands and Eyes”): A suite of 14 specialised APIs the agent can call to interact with the code environment.
  3. The Middleware (The “Orchestrator”): This component manages the interaction, parsing the agent’s desired action, calling the appropriate tool, and updating the agent’s context with the result.

The process is iterative. In each cycle, the middleware sends a comprehensive “dynamic prompt” to the agent, the agent responds with a thought process and a command to execute, the middleware executes it, and the cycle repeats.

3. The Agent’s Mind: Dynamic Prompting and Guided Autonomy

The key to enabling autonomy is the dynamic prompt. It’s not just a piece of buggy code; it’s a structured workspace that serves as the agent’s short-term and long-term memory. It contains several sections:

  • Static Sections: Defines the agent’s role, goals (e.g., “Locate the bug,” “Suggest simple fixes”), and general guidelines.
  • Dynamic Sections (The crucial part):
  • State Description: Informs the agent of its current state in a Finite State Machine (FSM). This FSM provides gentle guidance, mimicking a developer’s workflow: Understand the bug -> Collect information -> Try to fix the bug. This prevents the agent from getting lost in aimless exploration while still allowing it to freely transition between states.
  • Available Tools: Lists the specific tools the agent can use in its current state.
  • Gathered Information: This is the agent’s memory. It’s a growing log of all previous tool outputs (e.g., code snippets it has read, search results, test failures).
  • Last Executed Command and Result: Reminds the agent what it just did and what the outcome was.

This dynamic prompt allows the agent to build a mental model of the bug over multiple cycles, accumulate knowledge, and make informed decisions about its next step.
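
A condensed sketch of one middleware cycle might look like the following, assuming an `llm` callable that returns a JSON-style action, a `tools` registry of Python functions, an `allowed` mapping of FSM states to usable tool names, and a `next_state` transition function; the section names follow the paper's description, but the control flow is simplified.

```python
# Sketch of the RepairAgent middleware cycle under the assumptions above.
def run_repair_agent(llm, tools, allowed, next_state, bug_description, guidelines,
                     max_cycles=40):
    state, memory, last = "understand the bug", [], "none yet"
    for _ in range(max_cycles):
        prompt = "\n\n".join([
            f"## Role, goals and guidelines\n{guidelines}",
            f"## Bug\n{bug_description}",
            f"## Current state\n{state}",
            "## Available tools\n" + ", ".join(allowed[state]),
            "## Gathered information\n" + "\n".join(memory),
            f"## Last executed command and result\n{last}",
        ])
        action = llm(prompt)                    # e.g. {"tool": "read_range", "args": {...}}
        if action["tool"] == "goal_accomplished":
            break
        result = tools[action["tool"]](**action["args"])
        memory.append(f"{action['tool']}({action['args']}) -> {result}")
        last = memory[-1]
        state = next_state(state, action)       # FSM may advance or stay put
    return memory
```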

4. The Agent’s Toolbox: A Developer’s IDE in API Form

The 14 tools provided to RepairAgent are the cornerstone of its capability. They can be grouped into four categories:

Reading and Extracting Code: Tools like read_range and extract_method allow the agent to read specific parts of the codebase, just as a developer would navigate files.

Searching and Generating Code: This is a superpower.

  • search_code_base: Performs a keyword search across the entire project to find relevant code.
  • find_similar_api_calls: Given an incorrect API call, it can find correct usage examples elsewhere in the code, a common human debugging strategy.
  • generate_method_body: Invokes another LLM (like a mini-Copilot) to generate a method’s body from its signature and context.

Testing and Patching:

  • run_tests and run_fault_localization: Standard tools for validation and pinpointing bug locations.
  • write_fix: The agent’s primary action tool. It applies a patch (which can be multi-line and multi-file) and runs the test suite. If the fix fails, the changes are automatically reverted.

Control and Reasoning: These are meta-tools that manage the agent’s thought process.

  • express_hypothesis: The agent formally states its theory about the bug’s root cause.
  • discard_hypothesis: If a theory is proven wrong, the agent can discard it and go back to the information-gathering stage.
  • goal_accomplished: Terminates the process upon success.

Analogy: RepairAgent works like a senior developer paired with an intern. The agent (senior dev) decides the strategy: “I need to see how cfa.createEdge is used in other files.” It writes this down as a command. The middleware and tools (the intern) execute the command, run the search, and bring back the results. The agent then analyses these results and decides the next strategic step.
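
The `allowed` mapping used in the sketch above could be a simple dictionary encoding the FSM; the grouping of tools per state here is an illustration based on the paper's description of guided autonomy, not its exact specification.

```python
# An assumed FSM encoding: each state exposes a subset of the 14 tools and the
# states it can move to.
FSM = {
    "understand the bug": {
        "tools": ["read_range", "extract_method", "run_tests", "run_fault_localization"],
        "next": ["collect information"],
    },
    "collect information": {
        "tools": ["search_code_base", "find_similar_api_calls", "read_range",
                  "express_hypothesis"],
        "next": ["try to fix the bug", "understand the bug"],
    },
    "try to fix the bug": {
        "tools": ["generate_method_body", "write_fix", "discard_hypothesis",
                  "goal_accomplished"],
        "next": ["collect information"],
    },
}

allowed = {state: spec["tools"] for state, spec in FSM.items()}
```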

5. Experimental Results: A New State-of-the-Art

RepairAgent was evaluated on the entire Defects4J benchmark (835 bugs) and compared against SOTA models like ChatRepair and ITER.

  • Overall Performance: RepairAgent correctly fixes 164 bugs, surpassing the previous SOTA (ChatRepair with 162). Crucially, it fixes 39 bugs that no prior technique could handle.
  • Strength in Complexity: Where RepairAgent truly excels is on complex bugs. It correctly fixes 46 multi-line bugs, significantly more than ChatRepair (29) and ITER (14). Its ability to actively search for “repair ingredients” and context pays off for non-trivial fixes.
  • Cost-Effectiveness: The process is remarkably cheap. The median cost is 270,000 tokens per bug, which translates to just 14 cents with GPT-3.5’s pricing. The median repair time is about 15 minutes, with 99% of that time spent running tools (mostly tests), not waiting for the LLM.
  • Generalisation: On a newer dataset (GitBug-Java) with no risk of training data leakage, RepairAgent performed well on single-line bugs, though it struggled with the more complex multi-line bugs in that specific dataset, showing that its effectiveness is still tied to bug complexity.

6. Why It Works: Lessons from the Ablation Study

The authors conducted several ablation studies that validate the agent-based design:

  • Without Search Tools: Performance was halved and costs doubled. The agent was flying blind, unable to gather external context.
  • Without the State Machine (FSM): The agent became unfocused, often jumping to suggest incorrect fixes without proper information gathering. Guided autonomy is key.
  • Without Long-Term Memory (single-cycle memory): Performance plummeted. The agent couldn’t build on previous findings, repeatedly asking for the same information.

7. Conclusion: The Dawn of Software Engineering Agents

RepairAgent is more than just an incremental improvement in automated program repair. It’s a proof of concept for a new class of autonomous software engineering agents. By equipping an LLM with memory, a structured reasoning process, and a rich set of tools, the authors have created a system that moves from passively responding to actively solving problems.

This work paves the way for future agents that could perform even more complex tasks like large-scale refactoring, performance optimisation, or even implementing new features, fundamentally changing how we leverage AI in the software development lifecycle.

Fig. 1: Overview of RepairAgent.
TABLE I: Sections of the dynamically updated prompt.
Fig. 2: State machine to guide selection of tools.
Fig. 3: JSON format of the response of the model.
Fig. 4: Example of a response of the repair agent.
TABLE II: Repair-related tools invoked by RepairAgent.
Fig. 5: Example of patch given to the write fix tool.
TABLE III: Results on Defects4J.
TABLE IV: Distribution of fixes by location type
Fig. 6: Intersection of the set of fixes with related work.
Fig. 7: Closure-14, bug fixed by RepairAgent.
Fig. 8: Time-27, bug fixed by RepairAgent.
TABLE V: Results on GitBug-Java
TABLE VI: Different configurations of RepairAgent
Fig. 9: Distribution of cost metrics per bug (time, number of tokens, and monetary costs).
Fig. 10: Frequency of tool invocations (average per bug).

10. Enhancing LLM agents for code generation with possibility and pass-rate prioritised experience replay

The authors introduce a new method, the BTP pipeline, to make Large Language Models (LLMs) like GPT more efficient and effective at writing code. The core idea is to stop throwing away failed code attempts. Instead, the model stores these “almost-correct” programs in a memory buffer and intelligently replays them for fine-tuning, learning from its mistakes to improve future performance.

1. The Problem: The High Cost of Perfection in Code Generation

When an LLM generates code, it’s an all-or-nothing game. If a program has 100 lines and just one token is wrong (a misplaced semicolon or an incorrect variable name), the entire program fails its tests. This is known as a sparse reward problem.

  • Analogy: Imagine you’re taking a 100-question exam where you only get a passing grade if you answer every single question correctly. Getting 99 questions right gives you the same score as getting zero right: a fail.

Traditional code-generating LLMs operate this way. They generate a vast number of programs, test them, and discard almost all of them because they aren’t perfect. This is incredibly inefficient and wastes a huge amount of computational resources. The paper argues that these “failed” programs, especially those that are nearly correct, are valuable learning opportunities that are being thrown away.

2. The Solution: The BTP Pipeline with Prioritised Experience Replay

To solve this, the researchers propose the BTP (Beam search, Testing, Prioritised experience replay) pipeline. This framework is designed to intelligently reuse failed attempts.

At its heart is a concept borrowed from reinforcement learning called Experience Replay (ER).

The BTP pipeline consists of three main phases:

Phase 1: Beam Search Sampling

Instead of greedily picking the single most likely next word at each step, the model uses beam search. This technique keeps track of a small number (k) of the most promising code sequences at each step of generation.

  • Analogy: A chess grandmaster doesn’t just consider the single best next move. They keep several promising lines of play (“beams”) in their mind and explore them a few moves deep. Beam search allows the LLM to explore multiple high-quality code paths simultaneously, increasing the chance of finding a correct one and providing a richer set of examples to learn from.

The top k completed programs, along with their generation probabilities, are stored in the Experience Replay buffer.

Phase 2: Testing Phase

Each program generated in Phase 1 is run against a set of unit tests. The model calculates the pass rate, the percentage of tests the code successfully passes. This pass rate is then added to the program’s entry in the Experience Replay buffer. Now, each entry contains the code, its generation probability, and its objective performance.
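
A minimal sketch of what one buffer entry holds after these two phases (the field names are illustrative, not the paper's):

```python
# Illustrative shape of one Experience Replay buffer entry after Phases 1 and 2.
from dataclasses import dataclass

@dataclass
class ReplayEntry:
    program: str        # candidate program produced by beam search
    possibility: float  # generation probability the model assigned to it
    pass_rate: float    # fraction of unit tests the program passes (0.0 to 1.0)

buffer: list[ReplayEntry] = []
buffer.append(ReplayEntry(program="def solve(n): ...", possibility=0.42, pass_rate=0.8))
```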

Phase 3: Possibility and Pass-rate Prioritised Experience Replay (PPER)

This is the most innovative part of the paper. The model is now fine-tuned using the programs stored in the buffer. However, instead of picking programs randomly, it prioritises them based on a new metric called P2Value.

The formula for P2Value is:
P2Value = α * Possibility + (1 − α) * Pass Rate

Let’s break this down:

  • Possibility (P(ti)): This is the probability the LLM assigned to generating that specific program. A high possibility means the model was “confident” in this code, based on its training. It’s code that “looks right” to the model.
  • Pass Rate: This is the objective measure of how well the code actually works. A high pass rate means the code is functionally close to being correct.
  • α (alpha): This is a hyperparameter that balances the two.
  • If α is close to 1, the model prioritises replaying code that it was very confident in, reinforcing its existing knowledge.
  • If α is close to 0, it prioritises code that performed well on tests, even if it was a surprising or low-probability solution. This helps the model learn new, effective patterns.

Programs with a higher P2Value are more likely to be selected for the fine-tuning batch. This ensures the model spends its time learning from the most valuable examples, either because they were almost correct or because they represent a pattern the model strongly believes in.

The paper also explores two ways of sampling based on P2Value: one based on the value directly and another based on the program’s rank, which is more robust to outliers.
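
Here is a rough sketch of both sampling variants, assuming buffer entries carry `possibility` and `pass_rate` fields as in the sketch above; the exact normalisation and the rank weighting are assumptions.

```python
# Sketch of P2Value-based prioritised sampling under the assumptions above.
import random

def p2value(entry, alpha=0.5):
    return alpha * entry.possibility + (1 - alpha) * entry.pass_rate

def sample_batch(buffer, batch_size, alpha=0.5, by_rank=False):
    if by_rank:
        # Rank-based priority (1 / rank), which is less sensitive to outliers
        # than weighting by the raw P2Value.
        pool = sorted(buffer, key=lambda e: p2value(e, alpha), reverse=True)
        weights = [1.0 / (rank + 1) for rank in range(len(pool))]
    else:
        pool = list(buffer)
        weights = [p2value(e, alpha) + 1e-6 for e in pool]  # small floor avoids zero weights
    return random.choices(pool, weights=weights, k=batch_size)
```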

3. Key Experiments and Findings

The researchers tested their BTP pipeline in several scenarios using models like GPT-2, GPT-Neo, and WizardCoder on standard code generation benchmarks (APPS, CodeContests, HumanEval).

  • Smarter Models Can Teach Weaker Ones:
  • They used powerful models like GPT-4 to generate code samples and then used the BTP pipeline to fine-tune weaker models like GPT-2.
  • Result: This led to a massive performance improvement. For example, the base GPT-2 had a pass rate of ~12% on the APPS Intro dataset, but after being fine-tuned on GPT-4’s samples using BTP, its pass rate soared to ~51%.
  • Models Can Teach Themselves:
  • They had models like GPT-2 and WizardCoder generate their own code samples and then used BTP to fine-tune themselves.
  • Result: The models showed modest but consistent improvement. This demonstrates that the pipeline enables a form of self-improvement, where a model can refine its abilities by learning from its own experience.
  • The Best Fine-Tuned Model is Competitive:
  • Their best-performing model (GPT-Neo fine-tuned on GPT-4 data) didn’t surpass GPT-4, but it significantly closed the performance gap compared to the original GPT-Neo and became competitive with other strong baseline models like CodeLlama.
  • Diverse Training Data is Better:
  • Fine-tuning a model on a mixture of datasets (APPS, CodeContests, HumanEval) resulted in better all-around performance compared to training on just one dataset.

4. Limitations and Future Work

The authors acknowledge a key limitation: the method relies heavily on the availability of good test cases. If a programming problem has very few tests, the “pass rate” signal is weak (often just 0% or 100%), making it difficult for the P2Value metric to identify “almost-correct” solutions. Future work could explore ways to generate or find similar test cases to strengthen this signal.

5. Why This Paper Matters

This paper presents a practical and intuitive way to make the process of training code-generating LLMs more efficient. By treating failed code not as waste but as a valuable learning resource, the BTP pipeline allows models to learn from their near-misses. This approach is not only resource-efficient but also mirrors how humans learn: by analysing mistakes and iterating. It’s a significant step toward creating more robust, efficient, and intelligent AI programming assistants.

Figure 1: The pass rate of GPT2-Wizard using BTP under different values of α
Table 1: Result of "Better models help fine-tune normal models" experiment. On the top and bottom of the table, we show the performance of GPT-2 and GPT-Neo, and how they perform after they are fine-tuned by programs sampled by better models including GPT-4-turbo, GPT-3.5-turbo, CodeLlama-34B, and WizardCoder-34B.
Table 2: Result of "Models help fine-tune themselves" experiment. We show the performance of GPT-2, GPT-Neo, and WizardCoder, and how they perform after they are fine-tuned by programs sampled by themselves.
Table 3: Comparison of fine-tuned code models and baseline models on the APPS dataset across different difficulty levels.
Table 4: Performance of GPT-4-turbo fine-tuned with BTP pipeline on different datasets compared with baseline models

11. Self-collaboration code generation via ChatGPT

The authors propose a clever framework that dramatically improves an LLM’s ability to handle complex coding tasks by making it simulate a human software development team. Instead of one AI trying to solve a problem, multiple instances of the same AI are given distinct roles (an Analyst, a Coder, and a Tester) that then collaborate to plan, write, and debug the code. This “self-collaboration” leads to significant performance gains, turning a single powerful but flawed LLM into a well-organised and more reliable coding powerhouse.

1. The Problem: The Lone Genius Can’t Build a Skyscraper

Large Language Models (LLMs) like ChatGPT are incredibly good at writing small, self-contained functions. But when faced with a complex, multi-step problem, they often fail. They might misunderstand a subtle requirement, miss edge cases, or generate code that doesn’t quite work, with no way to correct themselves.

  • Analogy: This is like asking a single, brilliant programmer to single-handedly design, build, and test a complex application. They might be a genius, but they’re bound to make mistakes, miss requirements, and get stuck. Human software development solved this problem long ago with a simple solution: teamwork. Different people with different skills (analysis, coding, testing) collaborate, review each other’s work, and catch mistakes, leading to a much higher-quality final product.

The paper asks: can we make an LLM do the same thing, but with itself?

2. The Solution: A Self-Collaboration Framework

The researchers designed a framework that gets an LLM to simulate a software development team. This is achieved through two key steps:

Step 1: Division of Labour via “Role Instructions”

The core mechanism is assigning roles to different instances of the LLM using carefully crafted prompts called role instructions. A single prompt transforms the general-purpose ChatGPT into a domain-specific “expert.”

  • Analogy: Think of a talented actor. By giving them a script and a role (“You are a grizzled detective”), they adopt a specific persona, vocabulary, and way of thinking. The LLM does the same.

The paper creates a basic but effective team based on the classic “waterfall” software development model:

  • The Analyst: Its prompt tells it to act as a requirements analyst. Its job is to read the user’s request, break it down into smaller, manageable sub-tasks, and create a high-level plan. It does not write any code.
  • The Coder: Its prompt tells it to act as a developer. It receives the plan from the Analyst, and its job is to write the actual code. Later, it will receive feedback from the Tester and its job becomes fixing the code.
  • The Tester: Its prompt tells it to act as a QA tester. It receives the code from the Coder and its job is to review it, identify potential bugs, find missing edge cases, and write a detailed “test report.” Importantly, this is a simulated test; the Tester LLM uses its vast knowledge of code to reason about potential flaws without actually running the code.

Step 2: Collaboration via a “Shared Blackboard”

The different roles need a way to communicate. The framework facilitates this by passing the output of one role as the input to the next.

  • Analogy: This works like a project management board or a shared Google Doc. The Analyst posts the plan. The Coder reads the plan and posts the code. The Tester reads the code and posts a bug report. The Coder then reads the bug report and posts the fixed code.

This process creates a feedback loop. The initial code is rarely perfect, but the Tester’s feedback allows the Coder to iteratively refine their work. This cycle can repeat several times until the Tester confirms the code is correct or a maximum number of interactions is reached.
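
A condensed sketch of this Analyst-Coder-Tester loop is shown below; `chat(role, content)` is a placeholder for one ChatGPT instance prompted with a role instruction, and both the instructions and the "no issues" termination check are paraphrased assumptions, not the paper's exact prompts.

```python
# Sketch of the self-collaboration loop under the assumptions above.
ANALYST = ("You are a requirements analyst. Decompose the requirement into "
           "sub-tasks and produce a high-level plan. Do not write code.")
CODER   = ("You are a developer. Implement the plan, or revise your code "
           "according to the test report.")
TESTER  = ("You are a QA tester. Review the code against the requirement and "
           "report bugs and missed edge cases without executing the code.")

def self_collaborate(chat, requirement, max_interactions=4):
    plan = chat(ANALYST, requirement)
    code = chat(CODER, f"Requirement:\n{requirement}\n\nPlan:\n{plan}")
    for _ in range(max_interactions):
        report = chat(TESTER, f"Requirement:\n{requirement}\n\nCode:\n{code}")
        if "no issues" in report.lower():   # assumed signal that the Tester is satisfied
            break
        code = chat(CODER, f"Plan:\n{plan}\n\nCode:\n{code}\n\nTest report:\n{report}")
    return code
```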

3. Key Experiments and Findings

The researchers put their virtual team to the test on multiple code generation benchmarks, and the results were striking.

  • Massive Performance Boost: On standard benchmarks like HumanEval and MBPP, the self-collaboration framework improved the base ChatGPT’s (GPT-3.5) success rate (Pass@1) by a relative margin of 29.9% to 47.1%. This improvement was even more pronounced on datasets with more difficult edge cases.
  • Every Role is Valuable: Through ablation studies (removing one role at a time), they proved that each member of the virtual team contributes. The full Analyst-Coder-Tester team performed the best, demonstrating the power of a complete software development lifecycle.
  • Role-Playing is Key: In a fascinating experiment, they compared their “role instruction” approach to just giving the LLM a list of tasks (“First, create a plan. Second, write code…”). The role-playing approach, where the LLM is told to act as an expert, consistently performed better. This suggests that providing a context or persona helps the LLM access and apply its knowledge more effectively.
  • It Scales with Model Power: The framework isn’t limited to one model. It showed improvements on various open-source models and worked exceptionally well on GPT-4, pushing its already high HumanEval score of 67% to an incredible 90.2%. This shows the framework’s ability to unlock the latent potential of even the most powerful models.
  • It Can Tackle Truly Complex Tasks: The most impressive demonstration was on repository-level tasks, the kind of thing a single LLM prompt could never handle. The self-collaboration team successfully built:
  • A complete, playable game using Pygame, including game logic, asset loading, and controls.
  • A functional weather forecast website with HTML, CSS, and JavaScript that calls a real-time API. In contrast, the base ChatGPT produced incomplete scripts that missed major requirements.

4. Why This Paper Matters

This work moves beyond simple prompt engineering and into the realm of multi-agent AI systems. It shows that by structuring interactions between AI agents (even if they are just instances of the same model), we can solve problems that are beyond the reach of a monolithic AI.

The key takeaways are:

  • Decomposition Works: Breaking down a complex problem into specialised sub-tasks is a powerful strategy for LLMs, just as it is for humans.
  • Feedback is Crucial: An iterative process of generation and critique (coding and testing) allows LLMs to self-correct and refine their outputs to a much higher standard.
  • Context is King: The discovery that “role-playing” enhances performance is a significant insight into how to best interact with and guide these models.

This paper provides a blueprint for creating more capable, autonomous AI systems that can handle complex, real-world software development tasks with minimal human intervention.

Fig. 1. An example of role-playing. Through role-playing, LLM transforms into an expert within a specific domain, delivering a professional-perspective response to the same requirement.
Fig. 2. Self-collaboration framework for code generation and its instance.
Fig. 3. An example of role instruction for coder in the instance of self-collaboration framework.
Table 1. Comparison of self-collaboration and baselines, where the green highlights indicate the improvements in comparison to ChatGPT (GPT-3.5).
Table 2. The performance of self-collaboration code generation on APPS, where the green highlights indicate the improvements in comparison to ChatGPT (GPT-3.5).
Table 3. The performance of self-collaboration code generation on CoderEval.
Table 4. Effectiveness of ChatGPT roles in self-collaboration code generation, where the green highlights indicate the improvements in comparison to Coder.
Fig. 4. Self-collaboration capacities of different LLMs.
Table 5. The performance of self-collaboration code generation with GPT-4. The result in brackets is reported on the GPT-4 technical report [40].
Fig. 5. Self-collaboration capacities of CodeLlama series.
Table 6. The effect of maximum interaction (MI) for self-collaboration code generation
Fig. 6. Error Analysis for Self-collaboration.
Fig. 7. Performance and Cost of Self-collaboration Compared to Baselines.
Fig. 8. Case study on HumanEval benchmark.
Fig. 9. Case study on complex tasks in real-world scenarios. Red markers are added to denote specific objects.
Fig. 10. Case study on complex tasks in real-world scenarios.

12. RLEF: Grounding code LLMs in execution feedback with reinforcement learning

This paper introduces Reinforcement Learning from Execution Feedback (RLEF), a method to train Large Language Models (LLMs) to iteratively improve their code based on feedback from a compiler or test suite. It addresses a critical weakness in modern code-generating LLMs: while they can generate impressive code in one shot, they struggle to use error messages to systematically debug and refine their own output.

The Core Problem: LLMs Are Bad Debuggers

When an LLM generates code that fails, the standard approach is often to just ask it for a completely new solution. This is called “independent sampling.” Research has shown that this brute-force method is frequently more effective than trying to feed the error message back to the model for a fix. This is inefficient and unlike how human developers work.

The authors posit that for an LLM to act as a true “agent,” it must be able to:

  • Understand the user’s intent.
  • Ground its actions in feedback from the environment to correct its mistakes and achieve the desired goal.

RLEF is designed to teach this second, crucial skill.

How RLEF Works: Turning Coding into a Reinforcement Learning Game

The authors frame the task of writing correct code as a multi-turn conversational game, which is a perfect setup for Reinforcement Learning (RL).

1. The “Game” Environment: Iterative Code Synthesis

  • Turn 1: The LLM is given a problem description (e.g., a competitive programming challenge) and generates a Python solution.
  • Execution: The code is run against a set of public tests (example test cases visible to the model).
  • Feedback:
  • If the code passes all public tests, the “episode” ends, and the solution is submitted for final evaluation against a hidden private test set.
  • If the code fails, the error message (e.g., Timeout, Wrong Answer, Exception) is formatted into a natural language response and added to the conversation history.
  • Subsequent Turns: The LLM is prompted again with the entire history — the original problem, its previous failed code, and the specific execution feedback — and asked to “Give it another try.”
  • This loop continues until the code passes the public tests or a turn limit (e.g., 3 attempts) is reached.

This mimics a developer working on a problem. They write code, run it against local examples (public tests), and if it fails, they read the error message and debug. Once it works locally, they submit it for final grading or to the CI/CD pipeline (private tests). Separating public and private tests prevents the model from “cheating” by simply hardcoding the answers to the examples it sees.
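
A rough sketch of one such episode is shown below; `llm`, the two test runners, and the validity check are placeholders, while the turn limit, the "Give it another try" feedback, and the reward values follow this summary's description of the method.

```python
# Sketch of one RLEF episode under the assumptions above.
def rlef_episode(llm, problem, run_public_tests, run_private_tests, is_valid_code,
                 max_turns=3):
    dialog = [{"role": "user", "content": problem}]
    penalty = 0.0
    for _ in range(max_turns):
        code = llm(dialog)
        dialog.append({"role": "assistant", "content": code})
        if not is_valid_code(code):
            penalty -= 0.2                       # small penalty for invalid intermediate code
        passed, error_message = run_public_tests(code)
        if passed:
            break                                # submit for private evaluation
        dialog.append({"role": "user",
                       "content": f"{error_message}\nGive it another try."})
    reward = (1.0 if run_private_tests(code) else -1.0) + penalty
    return dialog, reward
```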

2. The Training Method: Reinforcement Learning

This iterative process is formalised as a Markov Decision Process (MDP) and optimised using Proximal Policy Optimisation (PPO), a standard algorithm for fine-tuning LLMs.

  • Policy: The LLM itself.
  • Action: Generating a complete block of code in a turn.
  • Observation: The conversation history up to that point.
  • Reward: A simple, sparse reward is given only at the end of the episode:
  • +1: If the final solution passes all private tests.
  • -1: If the final solution fails the private tests.
  • -0.2 (penalty): A small penalty is given for intermediate turns that produce syntactically invalid code.
  • A KL-divergence penalty is used to ensure the RLEF-trained model doesn’t stray too far from the original, capable base model.

Key Results and Findings

The authors trained Llama 3.1 8B and 70B models on the difficult CodeContests benchmark. The results are striking.

1. State-of-the-Art Performance with Unprecedented Efficiency

  • The RLEF-trained Llama 3.1 70B model achieves a 40.1% solve rate on the test set. This is a new state-of-the-art, significantly outperforming the previous best, AlphaCodium (using GPT-4), which scored 29%.
  • Crucially, RLEF achieved this with an order of magnitude fewer samples. The RLEF model used just 1–3 attempts per problem, whereas AlphaCodium required a complex “flow engineering” pipeline that used up to 100 samples.
  • The smaller RLEF 8B model also outperformed much larger models like AlphaCode 9B while using a tiny fraction of the sample budget (3 vs. 1,000).

2. RLEF Unlocks True Iterative Improvement

The paper provides strong evidence that the model is genuinely learning to use feedback, not just generating more diverse solutions.

  • Before RLEF: For base models (including GPT-4o), attempting to fix code in multiple turns was often worse than generating independent samples.
  • After RLEF: The multi-turn, iterative approach becomes significantly better than independent sampling.
  • The Random Feedback Ablation: This is a key experiment. When the RLEF-trained model was given random, irrelevant error messages during inference, its ability to fix code plummeted. This proves the model is grounding its changes in the specific content of the feedback.
  • Behavioural Analysis: RLEF models make fewer errors on their first attempt, are far more likely to fix an error in the next turn, and make more substantial code edits compared to base models, which tend to get stuck repeating their mistakes.

3. Generalisation and Practicality

  • The debugging skill learned on CodeContests generalised to other popular benchmarks like HumanEval+ and MBPP+, even though they have different problem styles and feedback formats.
  • The method trades complex, hand-crafted prompt engineering and agentic scaffolding for a more robust, end-to-end fine-tuning process.

Why This Matters for AI Professionals

  • A Step Towards Autonomous Agents: RLEF is a concrete demonstration of how to give LLMs the crucial ability to learn from their environment’s feedback, a cornerstone of building more autonomous and reliable AI agents.
  • Efficiency is King: By making models better debuggers, RLEF dramatically reduces the number of inference calls needed to solve a problem. This makes using powerful LLMs for coding tasks more computationally and financially viable.
  • Beyond Brute-Force Sampling: This work moves the field away from simply generating a massive number of solutions and hoping one works. It shifts the paradigm towards a more intelligent, human-like process of iterative refinement.
  • A Powerful Fine-Tuning Recipe: RLEF provides a clear and effective recipe for domain-specific adaptation. For any task where automated feedback is available (e.g., passing unit tests, API call success, compiler errors), this method can be used to significantly improve model performance and reliability.
Figure 1: Solve rates of Llama 3.1 Models after RLEF training on CodeContests, compared to previously reported results across sampling budgets (log scale).
Figure 2: Left: Overview of reinforcement learning with execution feedback (RLEF). The LLM is repeatedly prompted to implement code according to a problem description. Each attempt is evaluated on a public test set; upon failure, feedback is inserted into the conversation. If public tests are passing, or a specified turn limit is reached, execution on additional, private tests determines the reward. The model is then updated with PPO. Right: Example dialog with two model responses. Execution feedback hints at an inefficient first solution, to which the model responds by utilizing a cache. The code passing the public test sets will be evaluated on the full test set.
Table 1: Results on CodeContests of our initial and RLEF-trained models compared to prior work. The sample budget k in n@k refers to the number of LLM responses, e.g., 1@3 for our results corresponds to a single rollout with up to three model responses. Best results per sample budget (up to 10, up to 100) in bold. The 70B model obtains state-of-the-art results after RLEF, and significantly outperforms AlphaCodium and MapCoder generally, and on the test set with a fraction of the samples. The RLEF-trained 8B model outperforms AlphaCodium with 100 samples and MapCoder (gpt-3.5-turbo) with 3 samples.
Table 2: 1@3 solve rates in single-turn (ST) and multi-turn (MT) setups for base and RLEF models. On CodeContests, iterative code generation yields modest gains at best and drops in performance at worst, unless RLEF training is employed. Improvements from RLEF on CodeContests in the multi-turn setting carry over to HumanEval+ and MBPP+, which require a slightly different execution feedback formatting. Solve rates estimated on 20 rollouts per problem, temperature 0.2.
Figure 3: Behavior analysis of initial and RLEF-trained models with respect to public test results, for 8B (top) and 70B (bottom) models. Within 20 rollouts per problem (5640 in total) we count errors in the initial solution (turn 1); errors turned into correct code in turn 2 and 3; code changes across successive solutions according to the chrF metric. RLEF-trained models make fewer errors initially, can fix errors more reliably and perform larger code edits; initial models frequently repeat previous solutions. With random execution feedback, error recovery is severely impaired.
Figure 4: (a) Pass@1 and pass@10 across turn limits with RLEF-trained models, providing either true or random execution feedback (temperature 0.2). With random feedback pass@1 is reduced while pass@10 suffers only slightly, indicating that programs can be repaired less consistently. (b) Impact of turn limits on 10@k solve rates per sample budget (top: 8B model, bottom: 70B model) with temperature 1.0. With RLEF, iterative code generation can leverage up to 5 turns to achieve compute-optimal performance.
Table 3: 1@3 solve rates starting from Llama 3.1 models, temperature 0.2. (a) Comparison of different methods for acquiring the iterative code synthesis capabilities. RLEF is the most effective training method, followed by supervised fine-tuning (SFT). We find few-shot prompting to be detrimental to Instruct models. (b) Conventional single-turn (ST) compared to our multi-turn (MT) training with our RL loop. MT training yields larger improvements compared to ST, and improvements carrying over to multi-turn over single-turn inference is restricted to the 70B model.

13. MapCoder: Multi-agent code generation for competitive problem solving

This paper introduces MapCoder, a sophisticated framework that uses multiple LLM “agents” to solve complex coding problems. Instead of training a new model, MapCoder orchestrates a series of prompts to a general-purpose LLM (like GPT-4), guiding it through a workflow that mimics how an expert human programmer tackles a difficult task. The core idea is to break down the complex act of programming into four distinct cognitive stages: recalling similar problems, planning a solution, writing the code, and debugging.

The Core Problem: Simple Prompting is Not Enough

While LLMs are powerful, simply giving them a complex programming problem and asking for the code (direct prompting) often fails. Methods like Chain-of-Thought improve reasoning but are still limited. More advanced techniques that involve self-correction often rely on the LLM generating its own unit tests, which is a major pitfall — if the LLM misunderstands the problem, it will generate incorrect tests and “fix” the code to be wrong in a new way.

MapCoder is designed to create a more robust, structured, and intelligent problem-solving pipeline that avoids these common failures.

How MapCoder Works: The Four-Agent Pipeline

MapCoder structures the code generation process into a sequence of four specialised LLM agents, each with a specific role.

1. Retrieval Agent: Remembering Past Solutions

  • Job: To find relevant examples that can inform the solution to the current problem.
  • Method: Instead of using an external database, this agent prompts the LLM to generate its own relevant examples. It’s asked to “recall k relevant and distinct problems” and, for each one, provide the problem description, a step-by-step plan, and the final code. This “self-retrieval” primes the LLM with useful patterns and algorithms.

2. Planning Agent: Creating a Blueprint

  • Job: To create multiple high-level plans for solving the original problem.
  • Method: Using the examples from the Retrieval Agent as a guide (few-shot prompting), this agent generates several potential plans. Crucially, it is also prompted to output a confidence score (0–100) for each plan, indicating how likely it is to succeed. This ranking is the key to the entire workflow.

3. Coding Agent: Translating the Plan to Code

  • Job: To write the actual code based on a specific plan.
  • Method: This is a straightforward translation step. The agent takes the original problem description and the highest-confidence plan from the Planning Agent and generates the corresponding code.

4. Debugging Agent: Fixing the Bugs

  • Job: To iteratively fix the generated code if it fails.
  • Method: This agent takes the faulty code and the error logs from running it against the provided sample I/O tests. A key design choice is that MapCoder does not generate new test cases, avoiding the risk of self-deception. It is also given the original plan that the code was based on, acting as a “north star” to guide the debugging process and prevent the code from straying from the intended logic.

The “Dynamic Agent Traversal” Workflow

These agents don’t just run once in a line. They operate in an intelligent loop:

  1. Start with the list of plans, sorted by confidence score.
  2. Take the highest-scoring plan and give it to the Coding Agent.
  3. Test the resulting code against the sample I/O. If it passes, the process is complete.
  4. If it fails, the code is passed to the Debugging Agent, which tries to fix it up to a set number of times (t attempts).
  5. If the Debugging Agent succeeds, the process is complete.
  6. If the Debugging Agent fails after t attempts, that entire plan is considered a dead end. The framework discards it and moves to the next-highest-scoring plan, repeating the process from step 2.

Analogy: This is exactly like a good programmer. They start with what they believe is the best approach. They write the code and try to debug it. If they get stuck and can’t fix it, they don’t keep banging their head against the wall — they backtrack, discard that approach, and try their second-best idea.
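
A minimal sketch of this dynamic traversal is shown below; `plans` holds (plan, confidence) pairs from the Planning Agent, and the agent callables and sample-I/O runner are placeholders for the prompted LLM roles.

```python
# Sketch of MapCoder's dynamic agent traversal under the assumptions above.
def mapcoder_traverse(plans, coding_agent, debugging_agent, run_sample_io, t=5):
    code = None
    for plan, _confidence in sorted(plans, key=lambda p: p[1], reverse=True):
        code = coding_agent(plan)
        passed, log = run_sample_io(code)
        attempts = 0
        while not passed and attempts < t:
            # The original plan is passed along as the "north star" for debugging.
            code = debugging_agent(plan, code, log)
            passed, log = run_sample_io(code)
            attempts += 1
        if passed:
            return code          # first plan whose code passes the sample I/O
    return code                  # fall back to the last attempt if nothing passed
```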

Key Results and Findings

MapCoder was evaluated on eight challenging benchmarks and set new state-of-the-art results on several of them.

  • Dominant Performance: MapCoder significantly outperformed all other prompting-based methods (Direct, CoT, Reflexion) across the board. On the difficult CodeContests benchmark, it achieved a 28.5% pass@1 with GPT-4.
  • Unprecedented Efficiency: This is the most impressive result. MapCoder’s pass@1 (one attempt) score of 28.5% on CodeContests is nearly identical to the previous state-of-the-art, AlphaCodium, which achieved a 29% pass@5 (five attempts). This means MapCoder is finding the correct solution on the first try as often as AlphaCodium does in five, representing a massive gain in efficiency.
  • Robust and Generalizable: The framework proved effective across different LLMs (GPT-3.5, GPT-4, Gemini), various programming languages, and all levels of problem difficulty.
  • Each Agent is Crucial: Ablation studies showed that removing any of the four agents hurt performance, with the Debugging and Planning agents being the most critical. This validates the framework’s design.

Why This Matters for AI Professionals

  • The Power of “Flow Engineering”: MapCoder is a prime example of “flow engineering” or “agentic scaffolding,” demonstrating that how you structure the interaction with an LLM is as important as the model itself. A well-designed workflow can unlock performance far beyond simple prompting.
  • More Reliable Self-Correction: By deliberately avoiding self-generated test cases and grounding the debugging process in the original plan, MapCoder creates a more robust and less error-prone self-correction loop than previous methods like Reflexion.
  • Human-Aligned Reasoning: The multi-agent approach is powerful because it mirrors a proven, effective human problem-solving process. Decomposing a complex task into specialised sub-tasks is a fundamental principle of both software engineering and AI.
  • High Cost, High Reward: The paper’s limitation section is honest about the trade-off: this method requires significantly more API calls and tokens than direct prompting. However, for high-stakes, complex problems, the dramatic increase in accuracy and reliability can easily justify the cost.
Figure 1: Overview of MapCoder (top). It starts with a retrieval agent that generates relevant examples itself, followed by planning, coding, and iterative debugging agents. The dynamic traversal (bottom) treats the confidence of the generated plans as reward scores and uses them to guide code generation.
Table 1: Features in code generation prompt techniques.
Figure 2: Prompt for the Planning Agent.
Figure 3: Prompt for the Debugging Agent.
Figure 4: Example problem and solution generation using Direct, CoT, Reflexion, and MapCoder prompts. MapCoder explores high-utility plans first and uniquely features plan-derived debugging for enhanced bug fixing.
Table 2: Pass@1 results for different approaches. The yellow and blue cells are results obtained from Jiang et al. (2023b) and Shinn et al. (2023), respectively; the Self-collaboration results are collected from Dong et al. (2023b). Green text indicates state-of-the-art results, and red text is the gain over the Direct prompting approach.
Figure 5: Number of correct answers by algorithm type (tag) and difficulty level (xCodeEval dataset).
Table 3: Pass@5 results on the CodeContests dataset. AlphaCodium results are from Ridnik et al. (2024). Green cells indicate the SoTA, and red text indicates improvement over the Direct approach.
Figure 6: Performance vs. problem type (APPS).
Table 4: Pass@1 results using Gemini Pro. Red text is the gain over the Direct prompting approach.
Table 5: Pass@1 results using Mistral-7B-Instruct. Red text is the gain over the Direct prompting approach.
Figure 7: Number of correct answers across different programming languages (xCodeEval dataset).
Table 6: Pass@1 results for different versions of MapCoder (using ChatGPT on the HumanEval dataset).
Table 7: Pass@1 results when varying k and t.
Table 8: Average number of API calls, thousands of tokens used, and time in minutes required to get the API response.

14. Self-planning code generation with large language models

Ask an LLM like GPT-4 to write a simple function, and it delivers perfection. But give it a slightly more complex task, one with multiple constraints or steps, and it can fall apart, missing key requirements or producing buggy, illogical code.

The paper from researchers at Peking University, “Self-planning Code Generation with Large Language Models,” tackles this problem head-on. Their solution is brilliantly simple and mirrors how expert human developers work: make the model create a plan before it writes a single line of code.

This “self-planning” approach doesn’t just work; it delivers a massive performance boost, improving accuracy by up to 25.4% over direct generation. Let’s break down how.

The Problem: LLMs Don’t “Think,” They “Predict”

When you give an LLM a complex prompt like, “Write a function prime_fib(n) that returns the n-th number that is both a Fibonacci number and a prime number,” it tries to generate the most probable sequence of code tokens. It might latch onto “Fibonacci” and write a recursive function, or see “prime” and start checking for divisibility, but it struggles to synthesise these two distinct concepts correctly.

The paper shows a great example where a powerful model (code-davinci-002) fails this exact task. It ends up hardcoding the first few correct answers and then writing a nonsensical recursive call, completely failing to check for primality.

Humans don’t work this way. A developer would mentally decompose the problem:

  • Okay, I need a way to check if a number is prime. I’ll write a helper function for that.
  • I need to generate Fibonacci numbers in sequence.
  • I’ll loop through the Fibonacci numbers, use my helper function to check for primality, and keep a counter.
  • Once I’ve found n prime Fibonacci numbers, I’ll return the last one.

The paper’s core idea is to force the LLM to follow this same logical, decomposed process.

The Solution: A Two-Phase Approach

The authors propose a simple and effective two-phase method:

Phase 1: Planning
First, they ask the LLM to generate a high-level plan to solve the problem. They achieve this using few-shot prompting. They show the model a handful of examples where a complex user request (the “intent”) is followed by a simple, numbered plan.

  • User Intent: “Find the n-th number that is a Fibonacci number and also prime.”
  • LLM’s Generated Plan:
  1. Create a function to check if a number is prime.
  2. Generate the Fibonacci sequence.
  3. Check whether each number in the Fibonacci sequence is prime, and if so, decrement the counter.
  4. Once the counter reaches 0, return that Fibonacci number.

The key is that the plan is concise and high-level. It focuses on what to do, not the nitty-gritty of how to do it.

Phase 2: Implementation
Next, they take the original user intent and append the newly generated plan to it. This combined text is fed back into the LLM, which is now tasked with generating the final code.

[Original Intent] + [Generated Plan] -> LLM -> [Final Code]

Guided by this step-by-step plan, the model now generates far more structured and correct code. In the prime_fib example, it successfully creates an is_prime helper function and uses it within a loop that generates Fibonacci numbers, perfectly matching the plan and solving the problem.
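
A minimal sketch of the two-phase flow is shown below, assuming a generic `llm(prompt) -> str` completion function; the planning exemplar and prompt wording are illustrative, not the paper’s exact few-shot prompts.

```python
# A minimal sketch of the two-phase self-planning flow. `llm` is a hypothetical
# completion function (prompt in, text out); the exemplar and prompt wording are
# illustrative, not the paper's exact prompts.

PLANNING_EXEMPLARS = """\
Intent: Check if in a given list of numbers, any two numbers are closer to each other than a given threshold.
Plan:
1. Loop over every pair of numbers in the list.
2. Compute the absolute difference of each pair.
3. Return True if any difference is below the threshold, otherwise False.
"""

def self_planning_codegen(intent: str, llm) -> str:
    # Phase 1 (Planning): few-shot prompt the model to produce a concise, numbered plan.
    plan = llm(f"{PLANNING_EXEMPLARS}\nIntent: {intent}\nPlan:")
    # Phase 2 (Implementation): append the plan to the original intent and ask for code.
    return llm(f"{intent}\nFollow these steps:\n{plan}\nWrite the Python function.")
```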

Why is this Better than Chain-of-Thought (CoT)?

This is a crucial point for AI experts. You might think this sounds like Chain-of-Thought prompting, but the authors draw a sharp distinction.

  • Chain-of-Thought (CoT) for code is like generating detailed code comments. It describes the solution steps as they’re being reasoned out. The paper argues that generating a good CoT is almost as hard as generating the code itself.
  • Self-Planning is about problem decomposition. It abstracts the problem into simpler, independent sub-tasks before implementation begins. It’s less about the “how” (like CoT) and more about the “what.”

Analogy: CoT is like a student showing their work on a complex calculus problem in one continuous flow. Self-planning is like the teacher first telling the student, “Okay, first solve for the derivative, then find the critical points, and finally test the boundaries.” The second approach is a more robust way to guide the reasoning process.

The Results: It Works, Especially on Hard Problems

The paper’s experiments show compelling results:

  • Massive Performance Boost: Self-planning achieves a relative improvement of up to 25.4% in Pass@1 compared to direct generation and up to 11.9% over a CoT-based approach.
  • It Thrives on Complexity: The harder the problem, the more self-planning helps. For the most complex third of the HumanEval benchmark, self-planning improved performance by a staggering 61.5% over direct generation.
  • Better, More Human-Like Code: A human evaluation found that the code generated via self-planning wasn’t just more correct — it was also significantly more readable and robust. The plans acted as a natural form of documentation, and the decomposition led to better handling of edge cases.
  • An Emergent Ability: Only the largest models (175B+ parameters) are good at generating the plans themselves. However, a plan generated by a large model can be used to significantly improve the code generation of a smaller model.

Key Takeaway

The “Self-planning” method is a powerful reminder that the path to more capable AI isn’t just about scaling up models. It’s about instilling better problem-solving processes. By forcing LLMs to mimic the structured, “decompose-then-code” strategy of human experts, we can unlock a new level of performance and reliability in AI-powered code generation. It’s a simple, elegant technique that any AI practitioner working with code generation should be aware of.

Fig. 1. An example of code and CoT for code generation. Examples of vanilla CoT are shown in Figure 6 in Appendix A.
Fig. 2. Self-planning code generation is carried out in two phases (a planning phase and an implementation phase): 1) in the planning phase, the LLM decomposes an intent into a set of easy-to-solve sub-problems and devises a plan for executing the solution steps; 2) in the implementation phase, the LLM generates code following the intent and plan, which lets self-planning code generation handle more difficult problems than direct code generation. Direct code generation feeds the intent to the LLM, and the LLM generates the code directly. The LLM used in this example is code-davinci-002.
Table 1. Comparison of self-planning approaches and various baselines; the number after ↑ denotes the relative improvement over the Direct approach.
Table 2. Pass@k (%) of self-planning and other approaches on the HumanEval benchmark.
Table 3. Performance of the self-planning approach across various base LLMs on the HumanEval benchmark.
Fig. 3. Illustrations of the variants: two-phase, one-phase, and multi-turn.
Table 4. Comparison of self-planning and its variants on the HumanEval benchmark.
Table 5. Comparison of self-planning and other approaches on multilingual datasets.
Table 6. Pass@1 of self-planning on problems of varying complexity.
Fig. 4. Violin plot for human evaluation: a combination of a box plot (showing the quartiles and median) and a kernel density plot (showing the density at any location).
Fig. 5. Two real cases from HumanEval with self-planning, Code CoT (with self-planning format), and direct code generation. The input, generated plan, and code are highlighted in green, red, and black, respectively.

15. From LLMs to LLM-based agents for software engineering: A survey of current challenges and future

Fig. 1: Number of papers collected on LLMs and LLM-based agents from 2020 to 2024.
Fig. 2: Paper distribution.
Table I: Distribution of SE tasks.
Table II: Comparison between our work and the existing work for LLM in SE.
Table III: Inclusion and exclusion criteria for paper selection.
Fig. 3: Venue distribution.
Table IV: Keywords for software engineering topics.
Fig. 4: Illustration of common data augmentation methods.
Table V: Criteria for LLM-based agents.
Fig. 5: Comparison framework between LLM-based agents and LLMs in user story refinement.
Fig. 6: Comparison framework between LLM-based agents and LLMs in code generation and software development.
Table VII: Evaluation metrics in code generation and software development.
Fig. 7: ExpeL [40] framework with Reflexion [59] in experience gathering.
Fig. 8: Comparison framework between an LLM-based agent [169] and an LLM [163] in software test generation.
Fig. 9: Comparison framework between an LLM-based agent [51] and an LLM [183] in software security and maintenance.
Fig. 13: Distribution of benchmarks.

16. CodeChain: Towards modular code generation through a chain of self-revisions with representative sub-modules

Researchers from Salesforce AI have developed an inference-time framework called CodeChain that dramatically improves an LLM’s ability to solve complex programming challenges. Instead of generating a single block of code, CodeChain forces the LLM to think modularly, identify the best building blocks from multiple attempts, and iteratively refine its solution, achieving state-of-the-art results with relative pass@1 improvements of 35% on APPS and 76% on CodeContests.

The Problem: LLMs Code Like Interns, Not Architects

Large Language Models (LLMs) like GPT-4 and WizardCoder are impressive at generating code for simple, self-contained problems (like those in HumanEval). However, when faced with complex, competitive programming tasks, their performance plummets.

The core reason, as the authors of CodeChain point out, is that LLMs tend to generate code as a monolithic block. They try to solve the entire problem in one go, much like a junior developer might write everything inside the main() function. This approach is brittle, hard to debug, and fails to manage complexity.

In contrast, an experienced software engineer instinctively breaks down a complex problem into smaller, manageable subtasks. They write modular functions, test them, and often reuse or adapt code snippets they’ve written before. This modular, iterative process is crucial for solving complex problems. CodeChain is designed to mimic this expert workflow.

The Solution: CodeChain’s Iterative Refinement Loop

CodeChain isn’t a new model; it’s a clever, multi-step inference framework that guides an existing LLM to produce better code. Here’s how it works, step-by-step:

Step 1: Force Modularity with Chain-of-Thought (CoT)

First, CodeChain modifies the initial prompt. It doesn’t just ask for a solution. Using Chain-of-Thought prompting, it instructs the LLM to:

  • First, outline the required code modules: List the function headers and docstrings that will be needed to solve the problem.
  • Then, implement each module and assemble them into a final solution.

This forces the model to break the problem down from the start, creating a modular code structure.

Step 2: Generate and Filter

The LLM generates a batch of N modular solutions (e.g., 20). These solutions are then run against any available public test cases. Any solution that fails is discarded, leaving a pool of potentially correct, modular programs.

Step 3: The “Chain of Self-Revisions” (The Secret Sauce)

This is the core innovation. Instead of just picking the best complete program, CodeChain learns from the collective wisdom of all successful attempts.

  • Extract Sub-Modules: It goes through all the programs that passed the public tests and pulls out every individual function (the sub-modules).
  • Cluster and Find Representatives: All these functions are then embedded into a vector space. Using K-means clustering, similar functions are grouped.
  • Analogy: Imagine you ask 20 chefs to bake a complex cake. You throw out the burnt ones. Then, you collect all their individual recipes for the “buttercream frosting” component. You’ll find they cluster into a few distinct types (e.g., Swiss meringue, Italian meringue, American buttercream).
  • Select Centroids: For each cluster, CodeChain identifies the “centroid” — the function that is most representative of that group.
  • Analogy: From the “Swiss meringue” cluster, you pick the one recipe that is the most “average” or quintessential example of that technique. These selected functions are your “golden snippets” or representative sub-modules.

Step 4: Revise and Repeat

In the next round, CodeChain augments the original CoT prompt with these newly found “golden snippets.” The new instruction is essentially:

“Solve this problem again, but this time, here are some highly effective helper functions that have worked well before. Try to reuse or adapt them in your new solution.”

The LLM then generates a new batch of 20 solutions, which are again filtered, clustered, and refined. This cycle repeats for several rounds (e.g., 5), with the solutions becoming progressively better.

The CodeChain framework: Generate modular code, extract functions from passing solutions, cluster them to find representatives, and feed those back in for the next revision round.
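
A minimal sketch of the cluster-and-select step is shown below. `embed` is a hypothetical stand-in for the paper’s code-embedding model; function extraction uses the standard `ast` module and clustering uses scikit-learn’s K-means.

```python
import ast
import numpy as np
from sklearn.cluster import KMeans

# A minimal sketch of CodeChain's "cluster and pick representatives" step.
# `embed` is a hypothetical embedding function, not the paper's actual model.

def extract_functions(program: str) -> list[str]:
    """Pull every top-level function definition (the sub-modules) out of a program."""
    tree = ast.parse(program)
    return [ast.unparse(node) for node in tree.body if isinstance(node, ast.FunctionDef)]

def representative_submodules(passing_programs: list[str], embed, n_clusters: int = 5) -> list[str]:
    functions = [fn for prog in passing_programs for fn in extract_functions(prog)]
    if not functions:
        return []
    vectors = np.array([embed(fn) for fn in functions])
    km = KMeans(n_clusters=min(n_clusters, len(functions)), n_init=10).fit(vectors)
    representatives = []
    for c in range(km.n_clusters):
        members = np.where(km.labels_ == c)[0]
        # The member closest to the cluster centre is that cluster's "golden snippet".
        distances = np.linalg.norm(vectors[members] - km.cluster_centers_[c], axis=1)
        representatives.append(functions[members[np.argmin(distances)]])
    return representatives
```

The returned representatives would then be appended to the CoT prompt for the next self-revision round, mirroring the “reuse or adapt these helper functions” instruction described above.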

Key Findings and Results

The results are impressive and reveal some fascinating insights:

  • Massive Performance Boost: CodeChain lifts GPT-3.5’s pass@1 score on the challenging APPS benchmark from 22.3% to 30.2% (+35% relative) and on CodeContests from 5.8% to 10.3% (+76% relative). It works for open-source models like WizardCoder, too.
  • Modularity is a Double-Edged Sword: Interestingly, just using the initial CoT prompt to force modularity slightly hurts performance. The models aren’t pretrained to be perfect modular programmers. However, this initial structure is what enables the powerful self-revision process, leading to huge net gains.
  • Collective Wisdom Wins: The key is selecting representative sub-modules from all generated samples, rather than just trying to debug a single incorrect program. This leverages the LLM’s diversity to find robust components.
  • From Exploration to Exploitation: The authors found the best strategy was to start with a higher number of clusters (e.g., 5) and gradually decrease it in later rounds. This is analogous to an exploration phase (finding diverse ideas) followed by an exploitation phase (converging on the best ones).
  • Qualitatively Better Code: Using GPT-4 as an evaluator, the authors confirmed that code generated by CodeChain was rated as significantly more modular and reusable than code from standard prompting.

Why This Matters for AI Experts

CodeChain demonstrates that the path to better AI-driven code generation isn’t just about building bigger models. It’s also about designing smarter processes that guide them. By mimicking the structured, iterative, and reuse-oriented workflow of expert human programmers, CodeChain unlocks a new level of capability from existing LLMs.

Figure 1: [Top] An example of a code generation task from CodeContests (Li et al., 2022), where the problem description and public test cases are provided as inputs to the model. [Bottom] A typical problem-solving process in which a developer attempts to solve the problem iteratively, revising and reusing parts of their previously developed code until satisfied.
Figure 2: An overview of CodeChain: a pretrained LLM is first instructed with chain-of-thought prompting to generate a set of modularised solutions. Generated sub-modules are then extracted from potentially correct solutions and grouped into semantic clusters. The cluster centroids are selected as representative sub-modules to condition the next self-revision round, in which the model is instructed to reuse or adapt these modules in its revised solutions.
Figure 3: An example of CoT prompting for code generation in CodeChain. The model first outlines the solution as sub-module signatures, each intended to solve a high-level sub-task, then implements these sub-modules and combines them into a final solution (see Appendix F of the paper for the full prompt).
Figure 4: An example of prompting to self-revise programs. The original CoT instruction (Fig. 3) is combined with this instruction, and the model is provided with a set of representative sub-modules selected from previously generated samples (full prompt in Appendix F of the paper).
Table 1: APPS test results; results with † are for models finetuned on APPS training data.
Table 2: Comparison with Self-repair: following Olausson et al. (2023), results are reported on the same subset of 20 APPS test samples using GPT-3.5 and GPT-4 as base models (see Table 5 of the paper for the full list of this test subset).
Table 3: APPS validation results by pass@1 (%): CodeChain+GPT-3.5 tested for one self-revision round along three aspects: prompting, filtering by public tests, and sampling methods for revision (R: random, C: centroid, P: whole programs, M: sub-modules).
Figure 5: APPS validation results with a chain of self-revisions: CodeChain+GPT-3.5 over five self-revision rounds, reporting pass@1 per difficulty level and comparing with Self-debug (with unit test (UT) feedback or explanation) (Chen et al., 2023c) and Reflexion (Shinn et al., 2023).
Figure 6: CodeChain+GPT-3.5 under different numbers of clusters, reporting the average relative pass@1 improvement over direct generation (round 0).
Figure 7: APPS validation pass@1 results of WizardCoder-1B to 34B; dotted lines are direct generation results.
Figure 8: CodeContests results by pass@1 (%) for CodeChain with WizardCoder-15B and GPT-3.5 as base models. Left: test and validation results. Right: validation results over sequential self-revision rounds; dotted lines are direct generation results.
Figure 9: Distribution of output samples (%) by code quality in the APPS test subset; qualitative scores obtained by prompting GPT-4 with specific evaluation instructions.

17. A unified debugging approach via LLM-based multi-agent synergy

As AI experts, we know that while LLMs are wizards at generating code snippets, they often stumble when faced with the messy, multi-step reality of debugging. Most tools focus on one piece of the puzzle, like locating a fault (Fault Localisation) or generating a patch (Automated Program Repair). They struggle with complex, real-world bugs that require a holistic understanding of a codebase.

A new paper from researchers at CUHK and UIUC introduces FixAgent, a multi-agent framework that tackles software debugging end-to-end. What makes it stand out is its core design philosophy: it doesn’t mimic a team of developers collaborating, but rather the internal cognitive process of a single, methodical developer.

Let’s break down what makes FixAgent so effective.

The Core Idea: A Lone Detective, Not a Noisy Committee

Previous multi-agent systems for software engineering often model a team: a product manager, a programmer, a tester, etc. They communicate in a complex, mesh-like network, much like a team in a meeting. The authors of FixAgent argue that this is inefficient for debugging, which is a more logical, step-by-step process.

Analogy: A Detective vs. a Panel of Experts

  • Old Approach (Team of Experts): Imagine a panel of experts trying to solve a crime. The forensics expert, the psychologist, and the lead detective all talk at once, sharing opinions. There’s a lot of back-and-forth, and communication can become redundant. This is how many multi-agent systems work.
  • FixAgent’s Approach (Lone Detective): Think of a single brilliant detective. They don’t argue with themselves. Instead, they follow a structured process:
  1. Analyse the crime scene (initial bug report).
  2. Form a quick, simple hypothesis.
  3. If that fails, they gather more evidence (deeper code analysis).
  4. They consult external resources (search online forums).
  5. They build a coherent case, with each step informing the next.

FixAgent is designed like a detective. Its agents work in a clear, unidirectional flow, like an assembly line of thought, passing refined information down the chain. This avoids communication overhead and keeps the process logical and efficient.

The Three Levels of Debugging: From Quick Fix to Deep Dive

FixAgent operates on a hierarchical, three-level architecture, inspired by cognitive models of how developers actually debug. It starts simple and only escalates the complexity (and cost) when necessary.

Level 1 (L1): The “Duh!” Fix
This is for simple, obvious bugs. It only uses two agents:

  • Locator: Pinpoints the suspicious line of code.
  • Fixer: Generates a patch for that line.
  • Analogy: This is the “turn it off and on again” of debugging. It’s the first, quickest thing you try, like checking for a typo or an obvious off-by-one error.

Level 2 (L2): The Focused Investigation
If L1 fails, the bug is likely more complex but probably contained within a single file. FixAgent activates more specialised agents:

  • Summarizer: Reads the entire buggy file and creates a natural language summary of what each function does. This helps the LLM understand the context without getting lost in the weeds.
  • Slicer: Narrows down the scope from the whole file to a smaller, suspicious snippet of code.
  • Locator, Fixer, and FixerPro: These agents then work on the sliced and summarised context. FixerPro acts as a code reviewer, analysing the patch from Fixer and either refining it or generating a new one if the first attempt failed.
  • Analogy: The quick fix didn’t work, so now you’re pulling out the user manual for the specific device that’s malfunctioning. You’re not reading the entire house’s wiring diagram yet — just focusing on the problematic component.

Level 3 (L3): The Full-Blown Investigation
If L2 fails, it’s a nasty bug that could involve cross-file dependencies or require external knowledge. L3 activates all seven agents:

  • RepoFocus: Analyses the entire repository to identify a list of all potentially related files.
  • Helper: Takes the bug description, forms a query, and uses a search engine to find relevant solutions on the web (like Stack Overflow), synthesising the results into a debugging guide.
  • The rest of the agents (Summarizer, Slicer, etc.) then work with this rich, cross-file context and the external knowledge from Helper.
  • Analogy: This is the “all hands on deck” phase. You’re now pulling out the full architectural blueprints of the house, checking the city’s power grid plans, and searching online forums for others who have had a similar, mysterious electrical issue.
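
A minimal sketch of this escalation logic is shown below. `l1_repair`, `l2_repair`, and `l3_repair` are hypothetical stand-ins for the agent chains at each level, and `run_tests` decides whether a candidate patch is plausible (i.e., passes the bug-triggering tests).

```python
# A minimal sketch of hierarchical escalation in the spirit of FixAgent.
# The three repair callables are hypothetical stand-ins for the L1/L2/L3 agent chains.

def fix_agent(bug_report: str, repo: dict, run_tests, l1_repair, l2_repair, l3_repair,
              attempts_per_level: int = 3):
    # Escalate to a more expensive level only when the cheaper one fails.
    for repair_level in (l1_repair, l2_repair, l3_repair):
        for _ in range(attempts_per_level):
            patch = repair_level(bug_report, repo)
            if run_tests(patch, repo):
                return patch                      # plausible patch found
    return None                                   # no plausible patch at any level
```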

The Results: Smashing the Benchmarks

This hierarchical, cognitive approach pays off. On the challenging Defects4J benchmark (a collection of real-world bugs from Java projects), FixAgent achieves new state-of-the-art results:

  • It correctly fixes 197 bugs, outperforming the previous best (ChatRepair) by over 25%.
  • Crucially, it achieves this without needing the ground-truth location of the bug, a major advantage over many baselines that are “cheating” by being told where to look.
  • It fixes all 40 bugs in the QuixBugs benchmark.
  • The framework is robust, significantly boosting the debugging performance of various underlying LLMs, from GPT-4o and Claude 3.5 Sonnet to open-source models like DeepSeekCoder.

Why This Matters for AI Experts

FixAgent offers a new paradigm for applying multi-agent systems to complex, logical tasks. The key takeaway is the shift in thinking: from modelling messy human collaboration to modelling streamlined individual cognition.

By breaking down the debugging process into a hierarchy of cognitive steps and equipping specialised agents with the right tools (static analysis, web search, etc.) at the right time, FixAgent demonstrates a more efficient and powerful way to automate one of software engineering’s most painful chores. It’s a compelling blueprint for how we can design AI systems that don’t just “talk” about a problem, but “think” their way through it.

Figure 1: An example of a “diff” patch, where “@@ -142,5 +142,5 @@” indicates line 142 is modified.
Figure 2: Overview of FixAgent. It starts with the simple L1 repair; if no plausible patch is generated, the L2 repair is triggered, and then L3. Agents on the same level can communicate with each other.
Figure 3: A one-shot system prompt defines the role, skills, actions, objective, and constraints of the agent.
Figure 4: After retrieving similar solutions, Helper eventually responds with an executable debugging guide.
Figure 5: L3 triggers all seven agents to generate plausible patches.
Table 1: Comparison with baselines. #Corr and #Plau represent the number of bugs correctly and plausibly patched, respectively. Green cells indicate the best results; blue cells indicate results obtained from sampled data, while other results are obtained on the whole dataset.
Figure 6: Bug-fix Venn diagram on Defects4J.
Table 2: Ablation study on agents. #Plau represents the number of plausibly fixed bugs; a check mark indicates the addition of a specific agent and a cross denotes its absence. Expense denotes the average cost of running once for a bug.
Table 3: Performance gains of FixAgent-Lite over different LLMs on 600 samples from ConDefects.
Table 4: Ablation study on external interactions.

18. LEDEX: Training LLMs to better self-debug and explain code

You ask an LLM to write some code, and it spits out something that looks right but fails on a subtle edge case. Getting it to fix its own mistake, or “self-debug,” is the next frontier. While massive models like GPT-4 can do this reasonably well with clever prompting, smaller, open-source models often get stuck, repeating their mistakes or offering useless fixes.

A new paper from researchers at Purdue and AWS AI Labs, titled “LEDEX,” tackles this problem head-on. Instead of just relying on prompting, they’ve developed a full-fledged training framework to teach LLMs the crucial skills of self-debugging and, importantly, explaining their reasoning.

The Core Problem: Interns Need Training, Not Just Instructions

Think of an open-source code LLM as a talented but inexperienced intern. You can give it a task, and it will produce a first draft. If it’s broken, you can hand it back with the error message and say, “Fix this.” This is prompting, and like with an intern, the results can be hit or miss. They might not understand why it’s broken and just make another mistake.

What that intern really needs is a structured training program where they review examples of buggy code, study explanations of why the bug occurs, and learn the correct pattern for the fix. This is exactly what LEDEX provides for LLMs.

The LEDEX Framework: A Two-Phase Training Program

LEDEX is a scalable pipeline that turns a mediocre self-debugger into a competent one. It consists of two main stages: creating a high-quality curriculum and then using it to train the model.

Phase 1: Creating the Ultimate “Bug-Fixing” Textbook (Data Collection)

The biggest hurdle in training for self-debugging is the lack of data. You can’t just scrape GitHub for this; you need structured examples of (buggy code + error) -> (explanation + fix). LEDEX creates this data automatically:

  1. Generate Wrong Answers: It takes a large dataset of coding problems (like MBPP, APPS) and prompts an LLM to generate multiple solutions. Many of these will naturally be wrong.
  2. Generate Explanations and Fixes: For each incorrect solution, it prompts a more capable “teacher” model (like GPT-3.5 or a larger CodeLlama) with a specific instruction: “Please explain why the above code is wrong, and then fix it.”
  3. Verify the Fix: This is the crucial step. It runs the generated “fix” against unit tests. Only the explanation/fix pairs where the new code actually passes all the tests are kept. This filtering ensures the dataset is high-quality.

Analogy: Creating a Study Guide for Pilots
Imagine creating a training manual for flight simulator failures. You wouldn’t just record pilots crashing. You’d have a senior instructor (the teacher model) analyse each crash, write a detailed report on why the failure occurred (the explanation), and then demonstrate the correct recovery procedure (the fix). You’d then verify that this procedure actually works. LEDEX does this at scale for code.
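
A minimal sketch of this data-collection filter is shown below, assuming hypothetical `student` and `teacher` completion functions and a `run_tests` oracle; the “FIXED CODE:” output convention is an illustrative assumption, not LEDEX’s actual format.

```python
# A minimal sketch of the LEDEX data-collection filter. `student`, `teacher`, and
# `run_tests` are hypothetical callables; the "FIXED CODE:" marker is an illustrative
# output convention, not the paper's format.

def split_explanation_and_fix(response: str):
    # Assume the teacher puts the repaired code after a "FIXED CODE:" marker.
    explanation, _, fix = response.partition("FIXED CODE:")
    return explanation.strip(), fix.strip()

def collect_debugging_data(problems, student, teacher, run_tests, n_samples: int = 10):
    dataset = []
    for problem in problems:
        for _ in range(n_samples):
            candidate = student(problem["prompt"])               # step 1: sample solutions
            if run_tests(candidate, problem["tests"]):
                continue                                          # only wrong solutions are useful here
            response = teacher(                                   # step 2: explanation + fix
                f"{problem['prompt']}\n{candidate}\n"
                "Please explain why the above code is wrong, and then fix it. "
                "Put the repaired code after a line that says FIXED CODE:"
            )
            explanation, fix = split_explanation_and_fix(response)
            if run_tests(fix, problem["tests"]):                  # step 3: keep only verified pairs
                dataset.append({"buggy": candidate, "explanation": explanation, "fix": fix})
    return dataset
```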

Phase 2: The Training Regimen (SFT + RL)

Once the high-quality dataset is ready, LEDEX uses a two-step training process:

1. Supervised Fine-Tuning (SFT): The Classroom Learning
The model is first fine-tuned on the verified dataset. It learns the basic pattern: given a bug and an error, produce a coherent explanation followed by a correct patch. This provides a massive performance boost, giving the model a strong foundational ability to self-debug. In the intern analogy, this is like making them study the textbook of bug-fix examples until they’ve mastered the patterns.

2. Reinforcement Learning (RL): The Live Simulator
SFT is good, but RL can add a layer of nuance. LEDEX applies RL to further refine the model, but with a clever twist in its reward function. Instead of just a simple “pass/fail” reward, it gives separate, more granular feedback for both the explanation and the code.

Analogy: The Mentor’s Feedback
An SFT model is like an intern who has memorised the textbook. An RL model is like that intern getting live feedback from a mentor:

  • “Your code passed all the tests! That’s great. +5 points for the code.”
  • “But your explanation of the bug was a bit shallow. It was technically correct but not very insightful. +2 points for the explanation.”

LEDEX’s reward function does exactly this. It scores the refinement based on unit test success and similarity to a known good solution (CodeBLEU). Separately, it scores the explanation based on its semantic similarity to a high-quality, human-like explanation. This dual-reward system teaches the model not just to find a solution, but to produce robust solutions and helpful explanations.
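
A minimal sketch of such a dual reward is shown below. The scoring functions are passed in as callables, and the additive combination is an assumption in the spirit of the paper, not its exact formulation.

```python
# A minimal sketch of a dual reward in the spirit of LEDEX. The exact scoring
# functions and their combination are assumptions, not the paper's implementation.

def dual_reward(refined_code, explanation, tests, canonical_code, reference_explanation,
                pass_rate, codebleu, semantic_similarity):
    # Code reward: did the refinement pass the unit tests, and how close is it to a
    # known good solution?
    code_reward = pass_rate(refined_code, tests) + codebleu(refined_code, canonical_code)
    # Explanation reward: how semantically close is the explanation to a high-quality
    # reference explanation?
    explanation_reward = semantic_similarity(explanation, reference_explanation)
    return code_reward, explanation_reward   # fed back as separate signals during RL
```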

The Results: From Failing Intern to Competent Developer

The results are impressive.

  • Massive SFT Boost: Supervised fine-tuning alone improved the pass@1 score (getting it right on the first try) by up to 15.92% on some benchmarks. The success rate of refining a bad solution shot up by over 10%.
  • RL Adds the Polish: Reinforcement learning added another 3.54% on pass@1 on top of SFT, proving particularly effective on harder benchmarks with more rigorous tests.
  • Model-Agnostic: The framework isn’t tied to OpenAI. They showed that a CodeLlama-7B model could be effectively trained using data generated by a larger CodeLlama-34B, or even data generated by itself (self-bootstrapping).
  • Humans Agree: In a human evaluation, developers rated the explanations from the LEDEX-trained models as significantly more correct and helpful than those from the base models.

Why This Matters for AI Experts

LEDEX provides a clear and scalable blueprint for a critical, yet underdeveloped, LLM capability: self-correction. It shows that we can move beyond mere prompt engineering and fundamentally enhance open-source models through a smart, automated data pipeline combined with a sophisticated SFT+RL training strategy. The novel, dual-reward function in the RL phase is a particularly insightful contribution, pushing models to excel at both coding and reasoning. This work paves the way for more reliable and transparent AI coding assistants that don’t just write code, but can help us understand why it fails.

Figure 1: Pipeline of letting an LLM generate code and self-debug.
Figure 2: Overview of LEDEX.
Figure 3: The CodeBLEU scores, unit test pass rate, sentiment similarity of wrong-code explanations, final refinement code reward, and explanation reward of the training data.
Table 2: Pass@k of initial and refined solutions on four benchmarks. Each backbone’s best performance on every benchmark is bolded.
Table 3: Overall pass@k on MBPP & HumanEval and MBPP+ & HumanEval+. Blue or red numbers show the improvement or deterioration: SFT is compared to prompting, and RL is compared to SFT.
Table 4: Successful refinement rates over four benchmarks. Blue numbers show the improvement.
Figure 4: Pass@k of prompting, SFT, and RL CodeLlama-7B after three iterations of refinement.
Table 5: Pass@k of CodeLlama-7B trained with CodeLlama-34B’s data.
Table 6: Overall pass@k on MBPP & HumanEval and MBPP+ & HumanEval+, trained with CodeLlama-34B’s data. Blue numbers show the improvement.
Table 7: Pass@k of CodeLlama-7B trained with self-bootstrapped data.
Table 8: Overall pass@k on MBPP & HumanEval and MBPP+ & HumanEval+, trained with self-bootstrapped data. Blue or red numbers show the improvement or deterioration.
Table 9: Average scores of code explanations rated by GPT-4 and developers. SC stands for StarCoder and CL for CodeLlama; “-” means not applicable.

19. AgileCoder: A multi-agent framework for software development with natural language

This paper introduces AGILECODER, a multi-agent framework that makes large language models (LLMs) build software not like a rigid, top-down factory assembly line, but like a modern, collaborative Agile development team. By simulating Agile roles (Product Manager, Developer, Tester) and organising work into iterative “sprints,” the system can build more complex and functional software than previous methods. Its secret weapon is a “Dynamic Code Graph Generator” that gives agents a high-level map of the codebase, preventing them from getting lost in the details.

The Problem
Current multi-agent coding systems like ChatDev and MetaGPT follow a rigid “waterfall” model: they design everything at once, build everything at once, and then test at the end. This is unrealistic for complex software, which often requires changes and refinements. Furthermore, these systems struggle with “repository-level understanding.” As a project grows, they can’t fit the entire codebase into the LLM’s context window, making it hard to make intelligent changes or fix bugs that involve multiple files.

The Solution: An AI Agile Team
AGILECODER mimics the professional Agile software development lifecycle with a team of specialised agents:

  • The Team: Agents are assigned roles like Product Manager (breaks down requirements), Developer (writes code), Senior Developer (reviews code), and Tester (creates and runs tests).
  • The Workflow (Sprints): Instead of one massive effort, development is broken into short cycles called sprints. Each sprint has four phases:
  1. Planning: The team decides which small set of features to build next.
  2. Development: The Developer agent writes the code, and the Senior Developer reviews it for errors.
  3. Testing: The Tester agent writes and runs tests to find bugs in the new code.
  4. Review: The team assesses what was accomplished. If the project isn’t finished, they use the lessons learned to plan the next sprint.

This iterative loop allows the system to build incrementally, fix its own mistakes from previous sprints, and adapt to complexities as they arise.
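
A minimal sketch of this sprint loop is shown below, with each role passed in as a hypothetical callable; the task-splitting and carry-over logic are simplifying assumptions rather than AGILECODER’s actual scheduling.

```python
from typing import Callable, Dict, List, Tuple

# A minimal sketch of an AgileCoder-style sprint loop; each agent is a hypothetical
# stand-in for the corresponding role's LLM prompt.

def agile_coder(
    requirements: str,
    product_manager: Callable[[str], List[str]],                     # requirements -> task backlog
    developer: Callable[[str, Dict[str, str]], str],                 # task, codebase -> code
    senior_developer: Callable[[str, str], str],                     # task, code -> reviewed code
    tester: Callable[[str, str, Dict[str, str]], Tuple[bool, str]],  # -> (passed, feedback)
    max_sprints: int = 5,
    tasks_per_sprint: int = 2,
) -> Dict[str, str]:
    codebase: Dict[str, str] = {}
    backlog = product_manager(requirements)                          # initial planning
    for _ in range(max_sprints):                                     # each iteration is one sprint
        if not backlog:
            break                                                    # review: nothing left to build
        sprint_tasks, backlog = backlog[:tasks_per_sprint], backlog[tasks_per_sprint:]
        for task in sprint_tasks:                                    # development phase
            code = developer(task, codebase)
            code = senior_developer(task, code)                      # code review
            passed, feedback = tester(task, code, codebase)          # testing phase
            if passed:
                codebase[task] = code
            else:
                backlog.insert(0, f"{task} (address: {feedback})")   # carried into the next sprint
    return codebase
```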

Key Innovations & Analogies

1. The Agile Workflow: From Assembly Line to Custom Workshop

Previous systems worked like a classic car assembly line (the waterfall model). Each step must be completed perfectly before the next begins. If you find a flaw in the engine design after the car’s body is painted, it’s too late.

  • Analogy: AGILECODER works like a high-end custom auto shop building a prototype.
  • Sprint 1: The team builds the engine and chassis. They put it on a test rig to make sure it works perfectly.
  • Sprint 2: They add the body panels and electronics. They discover an issue where the wiring conflicts with the engine mount. Because they’re working iteratively, they can easily adjust the mount design before moving on.
  • Sprint 3: They add the interior and paint.
    This iterative “build-test-refine” cycle ensures the final car is robust and well-integrated, which is exactly how AGILECODER builds software.

2. Dynamic Code Graph Generator (DCGG): The GPS for Your Codebase

The biggest challenge in large software projects is understanding how all the files connect. If you change one file, what else might break? Loading the entire codebase into an LLM is like trying to read every book in a library just to find one sentence.

  • Analogy: The DCGG is like a dynamic subway map for the codebase.
  • Previous systems are like a tourist trying to navigate a city with a phonebook — they have all the information (every file), but no structure. To get from one point to another, they have to read through irrelevant pages.
  • AGILECODER’s DCGG builds a map where each file is a “station” and an import statement is a “subway line” connecting them. When a bug occurs in user_manager.py (the “Grand Central Station”), the agent doesn’t need to read the entire codebase. It just looks at its map and instantly sees which stations are directly connected (e.g., user.py, utils.py). This allows it to retrieve only the relevant context to fix the bug, saving tokens and improving accuracy.
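
A minimal sketch of how such a code graph could be built from import statements with Python’s `ast` module is shown below; the flat module-name mapping is a simplifying assumption compared with the paper’s full generator.

```python
import ast

# A minimal sketch of a code graph: nodes are files, edges are intra-repo imports.
# Assumes flat module names (a simplification of the paper's Dynamic Code Graph Generator).

def build_code_graph(files: dict[str, str]) -> dict[str, set[str]]:
    graph = {name: set() for name in files}
    modules = {name.removesuffix(".py"): name for name in files}
    for name, source in files.items():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module]
            else:
                continue
            for target in targets:
                if target in modules:                       # only keep intra-repo edges
                    graph[name].add(modules[target])
    return graph

def relevant_context(graph: dict[str, set[str]], buggy_file: str) -> set[str]:
    # Retrieve only the files directly connected to the buggy one.
    neighbours = set(graph.get(buggy_file, set()))
    neighbours |= {f for f, deps in graph.items() if buggy_file in deps}
    return neighbours
```

Only the returned neighbours need to be placed in the prompt, which is how the graph keeps context small while still covering cross-file effects.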

The Results

  • SOTA Performance: AGILECODER significantly outperforms ChatDev and MetaGPT on standard benchmarks (HumanEval, MBPP), achieving a 7–8% higher pass@1 rate with GPT-3.5.
  • It Actually Works: On their new, more realistic ProjectDev benchmark, AGILECODER achieved an executability rate of 57.8%, far surpassing ChatDev (32.8%) and MetaGPT (a mere 7.7%). This means the software it produces is far more likely to run without errors.
  • Quality Over Speed: The Agile process requires more steps (and thus more tokens and time), but the ablation studies prove every part is crucial. Removing sprints, code review, testing, or the Code Graph all caused significant drops in performance.

Why It Matters

This work marks a significant step toward making autonomous AI developers behave more like their human counterparts. It demonstrates that the process by which agents collaborate is as important as the agents themselves. The Agile framework provides inherent error-correction and refinement capabilities. More critically, the Dynamic Code Graph Generator offers an elegant and effective solution to the pressing problem of context retrieval in repository-scale code generation, a key bottleneck for applying LLMs to real-world software engineering.

Figure 1: An overview of AGILECODER.
Figure 2: Illustration of how the Dynamic Code Graph Generator (DCGG) contributes to AGILECODER during the generation of a Python application.
Table 1: Comparative results on the HumanEval and MBPP datasets for various LLMs and LLM-based agents, highlighting performance enhancements achieved through the application of AGILECODER.
Table 2: Results on ProjectDev.
Table 3: Ablation study on incremental development, code review, and writing test suites.
Table 4: Ablation study on the impact of the code graph G. #ExceedingCL is the total number of Exceeding Context Length errors; in the case without G, only tasks that do not encounter the Exceeding Context Length issue are considered.

20. ChatDev: Communicative agents for software development

ChatDev creates a virtual software development company staffed entirely by LLM-powered agents. These agents (like CEO, Programmer, Tester) collaborate by “chatting” to take a single software idea (e.g., “make a Gomoku game”) and turn it into a working program. The paper’s core innovations are a structured workflow called the Chat Chain that mimics a waterfall development process, and a communication protocol called Communicative Dehallucination to prevent the agents from generating buggy or incomplete code. Essentially, it’s a framework for making specialised AI agents work together effectively on a complex, multi-stage task.

The Problem They’re Solving

Traditional AI approaches to software development are fragmented. You might have one model for turning requirements into design documents, another for code generation (like Copilot), and a third for finding bugs. These tools don’t talk to each other, leading to “technical inconsistencies” — like an architect, a builder, and an inspector who all use different blueprints and speak different languages. This results in an inefficient process that still requires significant human glue to hold it together.

ChatDev’s goal is to create a unified, autonomous process where language (both natural and programming) acts as the “unifying bridge” across all development phases.

How It Works: The Core Concepts

ChatDev is built on two main pillars that manage what agents should talk about and how they should talk.

1. The Chat Chain: An Assembly Line for Software

The Chat Chain is a structured, sequential workflow that guides the agents’ collaboration. It breaks down the complex task of “building software” into distinct phases, just like a traditional waterfall model.

  • Phases: Design -> Coding -> Testing.
  • Subtasks: Each phase has smaller steps. For example, Coding is split into Code Writing and Code Complete, and Testing is split into Code Review (static) and System Testing (dynamic).

In each subtask, exactly two agents collaborate: an Instructor and an Assistant. For instance:

  • In Design, the CEO (Instructor) tells the CTO (Assistant) the high-level requirements.
  • In Coding, the CTO (Instructor) gives the Programmer (Assistant) the design specifications to code.
  • In Code Review, the Reviewer (Instructor) points out potential bugs to the Programmer (Assistant).

Analogy: Think of it as a highly structured manufacturing assembly line. The initial idea (a block of raw metal) goes to the design station. Once approved, the design document moves to the coding station. The resulting code then moves to the review station, and so on. The output of one station becomes the input for the next, ensuring a smooth and orderly progression. This structure prevents agents from talking over each other or getting stuck in unproductive loops.

A key detail is their memory management. To handle LLM context limits, they use:

  • Short-term memory: The full conversation history within a single phase.
  • Long-term memory: Only the final solution from a phase (e.g., the finalised design document or the completed code) is passed to the next phase. This is like sending a meeting summary to the next team instead of the entire meeting transcript.
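
A minimal sketch of a chat chain is shown below, assuming a hypothetical `chat(instructor, assistant, subtask, context)` helper that runs one instructor-assistant dialogue and returns its list of utterances; the role pairings follow the description above.

```python
# A minimal sketch of a chat chain: each subtask pairs an instructor with an assistant,
# and only the phase's final solution (long-term memory) is passed forward.
# `chat` is a hypothetical helper that runs one multi-turn dialogue and returns its transcript.

CHAT_CHAIN = [
    ("CEO", "CTO", "Design"),
    ("CTO", "Programmer", "Code Writing"),
    ("CTO", "Programmer", "Code Complete"),
    ("Reviewer", "Programmer", "Code Review"),
    ("Tester", "Programmer", "System Testing"),
]

def run_chat_chain(requirement: str, chat) -> str:
    solution = requirement
    for instructor, assistant, subtask in CHAT_CHAIN:
        transcript = chat(instructor, assistant, subtask, solution)  # short-term memory
        solution = transcript[-1]        # pass only the phase's final artefact forward
    return solution
```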

2. Communicative Dehallucination: “Are you sure you want me to do that?”

This is their clever mechanism for tackling “coding hallucinations” — the tendency for LLMs to generate incomplete, non-executable, or logically flawed code. Instead of the Assistant agent blindly trying to follow a potentially vague instruction, it’s encouraged to ask for clarification first.

A vanilla communication pattern is:

  1. Instructor: “Implement the login function.”
  2. Assistant: [Writes a block of code, possibly with bugs or missing imports]

The Communicative Dehallucination pattern is:

  1. Instructor: “Implement the login function.”
  2. Assistant (proactively seeking details): “To do that, I’ll need some specifics. Should I use a simple username/password system or OAuth? Which database table should store the user credentials?”
  3. Instructor: “Use a simple username/password system and store credentials in the users table.”
  4. Assistant: [Writes the correct, more specific code]

Analogy: This is the difference between a junior developer who just guesses what the boss wants versus one who asks clarifying questions before writing a single line of code. This back-and-forth dialogue loop, where the assistant temporarily takes on an instructor-like role to request information, significantly reduces errors and ensures the final code is more robust and aligned with the requirements.
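
A minimal sketch of this pattern is shown below, assuming a generic `respond(role, history)` helper; the trailing question mark is a deliberately crude way to detect a clarifying question.

```python
# A minimal sketch of the communicative dehallucination loop. `respond` is a
# hypothetical LLM helper that produces the next utterance for a given role.

def dehallucinated_turn(instruction: str, respond, max_clarifications: int = 3) -> str:
    history = [("Instructor", instruction)]
    for _ in range(max_clarifications):
        reply = respond("Assistant", history)            # either a question or a solution
        history.append(("Assistant", reply))
        if not reply.rstrip().endswith("?"):
            return reply                                  # assistant delivered the solution
        answer = respond("Instructor", history)           # instructor answers the clarifying question
        history.append(("Instructor", answer))
    # Stop clarifying and ask for a best-effort solution.
    history.append(("Instructor", "Proceed with your best assumption."))
    return respond("Assistant", history)
```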

How They Proved It Works

Since standard code-gen metrics like pass@k don’t work for evaluating entire software projects, the authors created their own:

  • Completeness: Is the code free of “TODO” or placeholder snippets?
  • Executability: Does the software compile and run without errors?
  • Consistency: How semantically similar is the generated code to the initial text requirement? (Measured with cosine distance of embeddings).
  • Quality: A combined score (Completeness * Executability * Consistency).
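
A minimal sketch of these signals is shown below, with `embed` as a hypothetical text-embedding function; compiling the code is used here as a rough proxy for executability, whereas the paper actually runs the generated software.

```python
import numpy as np

# A minimal sketch of the three evaluation signals and their product. `embed` is a
# hypothetical embedding function; compilation is only a proxy for true executability.

def completeness(code: str) -> float:
    return 0.0 if "TODO" in code else 1.0        # no placeholder snippets left behind

def executability(code: str) -> float:
    try:
        compile(code, "<generated>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0

def consistency(code: str, requirement: str, embed) -> float:
    a, b = np.asarray(embed(code)), np.asarray(embed(requirement))
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))   # cosine similarity

def quality(code: str, requirement: str, embed) -> float:
    return completeness(code) * executability(code) * consistency(code, requirement, embed)
```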

Key Findings:

  1. ChatDev outperforms baselines: It significantly beats GPT-Engineer (a single-agent approach) and MetaGPT (a multi-agent framework with more rigid, human-defined instructions). This shows that breaking the problem down and allowing for dynamic, cooperative dialogue is highly effective.
  2. The framework’s components are crucial (Ablation Study):
  • Removing agent roles caused the biggest performance drop. Without a “GUI-focused programmer” role, the agent just defaults to a simple command-line interface. This proves that role-playing is essential for directing LLM behaviour.
  • Removing Communicative Dehallucination also significantly hurt performance, leading to more buggy and incomplete code.

3. Communication analysis reveals interesting patterns:

  • Natural language dominates the early Design phase, where agents discuss high-level concepts like UI/UX and target users.
  • Programming language (code snippets, error logs) dominates the Testing phase, where agents collaboratively debug.
  • The error logs (Figures 4 & 5) visually demonstrate the dehallucination process: over multiple turns of conversation, the number of errors like ModuleNotFoundError and Method Not Implemented consistently decreases until the code successfully compiles.

Why This is Important for AI Experts

  • Paradigm Shift for Multi-Agent Systems: ChatDev provides a blueprint for effective multi-agent collaboration. It shows that success isn’t just about having multiple agents, but about giving them clear roles, a structured workflow (Chat Chain), and a protocol for resolving ambiguity (Communicative Dehallucination).
  • Language as a Universal Interface: The work strongly supports the idea of language as a unifying medium for complex problem-solving. The same fundamental mechanism (dialogue) is used for high-level design in English and low-level debugging in Python.
  • Beyond Code Generation: While applied to software development, this framework of structured, role-based, dehallucinated communication could be adapted for other complex, multi-stage tasks like scientific research, business plan creation, or engineering design.

Limitations (The Reality Check)

The authors are realistic about the current state:

  • Prototype-level: The system is best for generating relatively simple applications or prototypes. It struggles with complex logic or requirements that aren’t highly detailed.
  • Costly: Multi-agent systems are token- and time-intensive compared to single-agent approaches.
  • Evaluation is hard: Automating the evaluation of entire software systems remains a major challenge. Their metrics are a good start, but don’t cover everything (e.g., UI quality, robustness, security).
Figure 1: ChatDev, a chat-powered software development framework, integrates LLM agents with various social roles, working autonomously to develop comprehensive solutions via multi-agent collaboration.
Figure 2: Upon receiving a preliminary task requirement (e.g., “develop a Gomoku game”), the software agents engage in multi-turn communication and perform instruction-following along a chain-structured workflow, collaborating to execute a series of subtasks autonomously and craft a comprehensive solution.
Table 1: Overall performance of LLM-powered software development methods, encompassing both single-agent and multi-agent paradigms. Performance metrics are averaged over all tasks; top scores are in bold, with the second-highest underlined. † indicates statistically significant differences (p ≤ 0.05) between a baseline and ours.
Table 2: Pairwise evaluation results.
Table 3: Software statistics, including Duration (time consumed), #Tokens (number of tokens used), #Files (number of code files generated), and #Lines (total lines of code across all files) in the software generation process.
Table 4: Ablation study on main components and mechanisms. ≤ x denotes halting the chat chain after the completion of phase x, and ⧹ denotes removal. CDH denotes the communicative dehallucination mechanism.
Figure 3: The utterance distribution of agent communications throughout the development process.
Figure 4: Distribution of suggestions made by a reviewer agent during a multi-round reviewing process; each sector represents a different category of suggestion.
Figure 5: Progression of iterations in a multi-round testing process; each coloured column represents a dialogue round, showing the evolution of the solution through successive stages of testing.

References

2023

[1] Chen, T., Liu, X., Wang, Z., Gu, J., Lou, J. G., & Chen, W. (2023). Teaching large language models to self-debug. arXiv.

[2] Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Zhang, C., … Schmidhuber, J. (2023). MetaGPT: Meta programming for a multi-agent collaborative framework. arXiv. https://doi.org/10.48550/arXiv.2308.00352

[3] Huang, J., Liu, S., Li, G., Zhang, S., Liu, C., & Zhang, T. (2023). AgentCoder: Multi-agent-based code generation with iterative testing and optimisation. arXiv.

[4] Jiang, S. (2023). SelfEvolve: A code evolution framework via large language models. arXiv.

[5] Luo, Z., Liu, C., & Sun, J. (2023). InstructCoder: Instruction tuning large language models for code editing. arXiv.

[6] Mu, F., Shi, L., Wang, S., Yu, Z., Zhang, B., Wang, C., … Wang, Q. (2023). ClarifyGPT: Empowering LLM-based code generation with intention clarification. arXiv. https://doi.org/10.48550/arXiv.2310.10996

[7] Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., … & Xiong, C. (2023). CodeGen: A conversational paradigm for program synthesis. ICLR.

[8] Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., … & Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. arXiv.

[9] Zhang, K., Wang, Y., Zhang, Y., Yu, G., & Wang, G. (2023). ToolCoder: Teach code generation models to use API search tools. arXiv.

2024

[10] Bouzenia, D., Filali, I. S., & Gallinucci, E. (2024). RepairAgent: An autonomous, LLM-based agent for program repair. arXiv.

[11] Chen, Y., Zhao, K., Wang, Y., Yang, M., Zhang, J., & Niu, X. (2024). Enhancing LLM agents for code generation with possibility and pass-rate prioritised experience replay. arXiv. https://doi.org/10.48550/arXiv.2410.12236

[12] Dong, Y., Zhang, Y., Wang, Y., Liu, X., Shi, Y., Zhou, Y., & Zhang, T. (2024). Self-collaboration code generation via ChatGPT. ACM TSE.

[13] Gehring, J., Zheng, K., Copet, J., Mella, V., Carbonneaux, Q., Cohen, T., & Synnaeve, G. (2024). RLEF: Grounding code LLMs in execution feedback with reinforcement learning. arXiv. https://doi.org/10.48550/arXiv.2410.02089

[14] Islam, M. A., Ali, M. E., & Parvez, M. R. (2024). MapCoder: Multi-agent code generation for competitive problem solving. arXiv. https://doi.org/10.48550/arXiv.2405.11403

[15] Jiang, N., Liu, J., Li, Y., Wang, Z., & Wang, Q. (2024). Self-planning code generation with large language models. ACM TOSEM.

[16] Jin, H., Huang, L., Cai, H., Yan, J., Li, B., & Chen, H. (2025). From LLMs to LLM-based agents for software engineering: A survey of current challenges and future. arXiv. https://doi.org/10.48550/arXiv.2408.02479

[17] Le, H., Nguyen, V. C., Van Nguyen, B., & Nguyen, A. (2024). CodeChain: Towards modular code generation through a chain of self-revisions with representative sub-modules. ICLR.

[18] Lee, C., Xia, C. S., Yang, L., Huang, J., Zhu, Z., Zhang, L., & Lyu, M. R. (2024). A unified debugging approach via LLM-based multi-agent synergy. arXiv. https://doi.org/10.48550/arXiv.2404.17153

[19] Li, K., Tang, J., Yu, H., & Zhou, Y. (2024). LEDEX: Training LLMs to better self-debug and explain code. arXiv.

[20] Nguyen, D. Q., Titov, D., Deriu, J., & Cieliebak, M. (2024). AgileCoder: A multi-agent framework for software development with natural language. arXiv.

[21] Qian, C., Liu, W., Liu, H., Chen, N., Dang, Y., Li, J., … Sun, M. (2024). ChatDev: Communicative agents for software development. ACL.

Written by Cahit Barkin Ozer

I learn about innovations in technology, especially generative AI, and share them with you. My YouTube Channel: https://www.youtube.com/@cbarkinozer