Agentic Code Generation Papers Part 2 (Last)
Continuing our tour of the cutting-edge research landscape, and how AI is moving from coding assistant to collaborative teammate.
1. CodePori: Large-scale system for autonomous software development using multi-agent technology
If you’ve followed the world of autonomous agents, you’ve seen systems like AutoGPT or MetaGPT attempt to automate tasks. While impressive, they often struggle to move beyond simple scripts or well-defined, small-scale problems. A new paper from researchers at Tampere and Jyväskylä Universities introduces CodePori, a multi-agent system designed to tackle a much bigger challenge: autonomously developing large, complex software projects from a single natural language prompt.
The results are compelling: they claim 89% accuracy on the HumanEval benchmark and an 85% success rate in manually building and running 20 different real-world applications.
The main problem with using a single LLM for a large software project is complexity. An LLM might forget earlier requirements, lose context, or generate messy, monolithic code. CodePori’s solution is to mimic a human software development team by creating an “assembly line” of specialised AI agents.
Think of it like building a car. You don’t have one person doing everything. You have specialists for the engine, the chassis, the electronics, and quality control. CodePori does the same for software.
Meet the CodePori Team: An Analogy
The system consists of six distinct agents, each with a specific role. Here’s a simple analogy to a traditional software team:
- The Project Manager (Manager Agent): You provide this agent with a high-level project description, such as “Create a facial recognition system for attendance tracking.” Its job is to act as the architect, breaking down the project into smaller, logical modules (e.g., “User Interface Module,” “Face Detection Module,” “Data Logging Module”). Crucially, it enforces a constraint: each module should be a maximum of 200 lines of code, forcing modularity and preventing context overload for the other agents.
- The Pair Programmers (Dev-1 & Dev-2 Agents): These two agents work together on one module at a time. Dev-1 writes the initial code, and Dev-2 reviews, refines, and optimises it. They go back and forth in a few iterative rounds, just like two developers pair programming, to produce a robust piece of code for that specific module.
- The Code Review & QA Team (Finalised-1 & Finalised-2 Agents): Once the developers are done, the code goes to this pair of agents. Their sole focus is quality assurance. They review the code for bugs, ensure it meets production standards, and check for any placeholders or TODOs. They also iterate among themselves to polish the code until it’s ready for integration.
- The Senior Tech Lead (Verification Agent): This agent performs the final sign-off. It takes the polished code from the QA team, reviews it in the context of the entire project, and gives a final report on what, if anything, needs to be changed to ensure seamless integration and functionality.
This structured, hierarchical workflow is CodePori’s key innovation. It prevents the chaos of having multiple agents trying to do everything at once and mirrors the proven processes of human software engineering.
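To make the assembly-line idea concrete, here is a minimal sketch of how such a manager/developer/finaliser/verifier pipeline could be wired around a generic chat-completion call. The function names, prompts and two-round iteration counts are illustrative assumptions, not CodePori's actual implementation.

```python
# Minimal sketch of a CodePori-style agent pipeline (illustrative, not the paper's code).
# `call_llm` stands in for any chat-completion API; prompts and round counts are assumptions.

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for an LLM call; swap in your provider's client."""
    raise NotImplementedError

def manager(project_description: str) -> list[str]:
    # Manager Agent: split the project into modules of at most ~200 lines each.
    plan = call_llm(
        "You are a software architect. Split the project into modules of at most "
        "200 lines of code. Return one module description per line.",
        project_description,
    )
    return [line for line in plan.splitlines() if line.strip()]

def develop_module(module_spec: str, rounds: int = 2) -> str:
    # Dev-1 drafts the code; Dev-2 reviews and refines it over a few rounds.
    code = call_llm("You are Dev-1. Write the code for this module.", module_spec)
    for _ in range(rounds):
        code = call_llm("You are Dev-2. Review and improve this code.",
                        f"Module spec:\n{module_spec}\n\nCode:\n{code}")
    return code

def finalise_module(module_spec: str, code: str, rounds: int = 2) -> str:
    # Finalised-1 / Finalised-2: polish until production-ready, no TODOs or placeholders.
    for _ in range(rounds):
        code = call_llm("You are a QA reviewer. Fix bugs and remove TODOs or placeholders.",
                        f"Module spec:\n{module_spec}\n\nCode:\n{code}")
    return code

def build_project(project_description: str) -> dict[str, str]:
    project = {}
    for spec in manager(project_description):
        draft = develop_module(spec)
        project[spec] = finalise_module(spec, draft)
    # Verification Agent: final sign-off over the whole project.
    project["_verification_report"] = call_llm(
        "You are the verification agent. Report any integration issues.",
        "\n\n".join(project.values()),
    )
    return project
```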
The Results: How Well Does It Actually Work?
The researchers tested CodePori in two main ways:
1. The Standard Benchmark (HumanEval):
On the popular HumanEval benchmark, which tests the ability to generate correct code for small, function-level problems, CodePori achieved an impressive 89% pass@1 score. This significantly outperforms other agent-based systems like MetaGPT (85.9%) and ChatDev (86.6%), and it blows past older models like Codex (48%) and AlphaCode (18%).
2. The “Real-World” Manual Test:
This is where it gets interesting. HumanEval is great, but it doesn’t test the ability to build a full application. The team gave CodePori 20 diverse project descriptions, ranging from simple games like Tic-Tac-Toe to complex applications like:
- A facial recognition attendance system (477 lines of code).
- A chat-based “PekkaGPT” application (580 lines).
- A presidential debate tool supporting nine candidates in Finnish (1180 lines!).
Out of 20 projects, 17 of them (85%) produced functional, running code. The paper is refreshingly honest that this wasn’t always a one-click process. For many projects, the researchers had to make minor manual adjustments, such as:
- Installing required libraries (pip install opencv-python).
- Updating file paths for datasets or pre-trained models.
- Creating empty CSV files for the program to write to.
This is a crucial detail: CodePori gets you 95% of the way there, generating the entire logic and structure, but a human is still needed for final setup and environment configuration.
The performance is also remarkable. A complex project like the 1180-line debate system was generated in 40 minutes for a cost of just $2.56 in API calls.
My Takeaway for AI Experts
- Structure is King: CodePori’s main contribution isn’t a better model, but a better process. By enforcing modularity and a structured workflow (Design -> Develop -> Review -> Verify), it effectively manages the complexity that derails other agent systems. This is a powerful lesson in meta-programming for multi-agent collaboration.
- Beyond Function Generation: This work is a clear step towards automating the creation of entire software systems, not just isolated functions. The success with multi-module applications over 1000+ lines of code is a notable achievement.
- The Human-in-the-Loop Isn’t Gone (Yet): The need for manual setup shows we’re not at full, hands-off autonomy. The current state is more like an incredibly powerful scaffolding tool or a junior development team that generates a near-complete project for a senior dev to finalise and deploy.
- Prompting as Architecture: The prompts provided to each agent (viewable in their GitHub repo) are highly detailed, defining roles, tasks, constraints, and communication protocols. It reinforces the idea that in the age of generative AI, high-level software architecture is increasingly expressed through natural language instructions.
CodePori provides a compelling glimpse into a future where AI agents don’t just write code snippets, but orchestrate the entire development lifecycle for large-scale applications. It’s a pragmatic and effective approach that makes autonomous software development feel a lot closer to reality.
2. SpecRover: Code intent extraction via LLMs
In the fast-moving world of autonomous software engineering, we’ve seen a parade of agents that can tackle GitHub issues. Tools like AutoCodeRover and SWE-agent have shown that LLMs can, with the right scaffolding, fix real-world bugs. But they often operate like a black box: you give them an issue, and they produce a patch. Whether that patch is correct, brittle, or just overfitting to the test case is often left for a human to figure out.
A new paper from the National University of Singapore, “SpecRover,” argues that the next leap forward isn’t just about making agents better at coding, but making them better at thinking. It introduces a system that focuses on extracting the developer’s intent — the “specification” — before it even tries to write a fix. The result is not only a more effective bug-fixer but one that provides evidence and confidence in its own work.
The Problem: Fixing a Bug You Don’t Understand
Imagine telling a junior developer, “The user validation is broken,” and they immediately start changing code. They might make the error message go away, but they could also accidentally disable all password checks. This is the risk of “test-driven fixing” without understanding the underlying requirements.
Many AI agents fall into this trap. They are optimised to make a test pass, which can lead to “overfitting” patches that are technically correct for the test but semantically wrong for the application. SpecRover’s core philosophy is that high-quality repair requires understanding the intent first.
The Solution: A Diagnostic Team of AI Agents
SpecRover builds on its predecessor, AutoCodeRover, but adds a sophisticated, multi-stage workflow for inferring and using specifications.
Think of it like an expert diagnostic team at a high-end garage, instead of a single mechanic with a wrench.
- The Reproducer Agent (The Test Driver): Given a GitHub issue, this agent’s first job is to write a test that reliably reproduces the bug. This is the first piece of the specification: a concrete example of the failure.
- The Context Retrieval Agent (The Senior Mechanic): This agent is the star of the show. It explores the codebase to find the buggy code. But it doesn’t just return code snippets. For each relevant function it finds, it generates a Function Summary.
- Analogy: This is like the senior mechanic looking at a faulty component and saying, “Okay, this is the fuel injector. Its job is to spray a precise amount of fuel into the cylinder. The bug report implies it’s currently spraying too much.” This summary — “the intended behaviour” — is a crucial, function-level specification that the LLM infers from the high-level issue.
- The Patching Agent (The Junior Mechanic): This agent takes the buggy code and the function summary and writes a patch. It’s no longer just guessing; it has a clear, natural-language goal for that specific function.
- The Reviewer Agent (The Shop Foreman): This is SpecRover’s most critical innovation. Once a patch is generated, it’s sent to this agent for code review. The Reviewer Agent looks at everything: the original issue, the reproducer test, the patch, and the test results on both the original and patched code. It then provides Reviewer Feedback, a natural language explanation of its verdict.
- It might say, “The patch is incorrect. You caught the exception, but the code later still tries to perform the faulty operation, causing the same error.” (This is from a real example in the paper).
- Or, it might even say, “The patch is correct, but the reproducer test you wrote is wrong.” This feedback loop allows the system to self-correct and iteratively improve the patch.
- The Selection Agent (The Final Decision-Maker): If multiple patch candidates are generated (e.g., after several retries), this agent selects the best one based on the issue description and explains its choice.
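To make the workflow concrete, here is a minimal sketch of how the patch/review loop could be wired together. The `call_llm` and `run_tests` stubs, the prompts and the acceptance check are assumptions for illustration, not SpecRover's actual code.

```python
# Sketch of a SpecRover-style patch/review loop (illustrative; prompts, stubs and the
# acceptance check are assumptions, not the paper's implementation).

from dataclasses import dataclass

@dataclass
class PatchAttempt:
    patch: str
    review: str      # natural-language verdict from the reviewer agent
    accepted: bool

def call_llm(prompt: str) -> str:
    raise NotImplementedError   # placeholder for a chat-completion call

def run_tests(patch: str, reproducer_test: str) -> str:
    raise NotImplementedError   # placeholder: apply patch, run tests, return the output

def repair_loop(issue: str, buggy_code: str, function_summary: str,
                reproducer_test: str, max_rounds: int = 3) -> list[PatchAttempt]:
    attempts, feedback = [], ""
    for _ in range(max_rounds):
        # Patching agent: the inferred function summary acts as its specification.
        patch = call_llm(
            f"Issue: {issue}\nIntended behaviour: {function_summary}\n"
            f"Buggy code:\n{buggy_code}\nPrevious reviewer feedback: {feedback}\n"
            "Write a patch."
        )
        test_output = run_tests(patch, reproducer_test)
        # Reviewer agent: sees the issue, the reproducer test, the patch and the results.
        review = call_llm(
            f"Issue: {issue}\nReproducer test:\n{reproducer_test}\n"
            f"Patch:\n{patch}\nTest output:\n{test_output}\n"
            "Is the patch correct? Explain your verdict."
        )
        accepted = review.lower().startswith("the patch is correct")  # crude check
        attempts.append(PatchAttempt(patch, review, accepted))
        if accepted:
            break
        feedback = review   # the critique feeds the next patching round
    return attempts
```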
The Results: Higher Accuracy and, More Importantly, Higher Confidence
SpecRover isn’t just a conceptual improvement; the numbers back it up.
- State-of-the-Art Efficacy: On the full SWE-bench benchmark, SpecRover resolves 19.3% of issues, a massive 50% relative improvement over its predecessor AutoCodeRover (12.4%), and tops other open-source agents. On the smaller SWE-bench Lite, it hits 31%.
- The Killer Metric: Precision. This is where SpecRover truly shines. Most agents have a precision equal to their overall success rate (e.g., ~27% for Agentless). This means that for every 100 patches they suggest, about 73 are wrong. SpecRover’s Reviewer Agent acts as a filter. When it accepts a patch, that patch has a 50% chance of being correct. This dramatically improves the signal-to-noise ratio, making it far more practical for a human developer to use. You’re reviewing fewer, higher-quality suggestions.
- Quality of Evidence: The generated “Function Summaries” were found to align with human-written Pull Request titles 71% of the time. The Reviewer Feedback is often high-quality enough to be used directly as a commit message, saving the developer even more time.
My Takeaway for AI Experts
- Specification is the New Frontier: SpecRover makes a powerful case that the next wave of progress in autonomous SE will come from better “understanding” rather than just better code generation. Inferring intent is key to avoiding shallow fixes.
- The Power of a Feedback Loop: The Reviewer Agent is a brilliant implementation of a self-critique mechanism. By creating an agent whose sole purpose is to review the work of other agents, the system becomes more robust and reliable. This is a design pattern we’ll likely see more of.
- From Tool to Collaborator: By generating evidence (summaries, reviews, reasons for selection), SpecRover moves beyond being a simple code-fixer. It acts more like a junior developer who submits a pull request with a clear description, making the human review process faster and more trustworthy.
- Precision > Raw Efficacy: For autonomous tools to be adopted, they must respect the developer’s time. A tool that is right 30% of the time is noisy. A tool that is right 50% of the time when it claims to be confident is a tool you’ll actually listen to. SpecRover’s focus on precision is a crucial step toward real-world usability.
SpecRover is a significant milestone, demonstrating that by endowing LLM agents with the ability to infer specifications and review their own work, we can build more effective, trustworthy, and collaborative AI for software engineering.
3. Code repair with LLMs gives an exploration-exploitation tradeoff
When you ask an LLM to write complex code, its first attempt is often buggy. A popular technique is “refinement”: you take the buggy code, show the LLM the errors (like failed test cases), and ask it to fix them. You can do this repeatedly. But this creates a huge, branching tree of possible programs. The big question is: how do you search this tree efficiently to find a correct program without wasting countless expensive API calls?
This paper, “Code Repair with LLMs gives an Exploration-Exploitation Tradeoff,” argues that simple search strategies like “always refine the best-looking code so far” (greedy) are suboptimal. Instead, they frame the problem as a classic exploration-exploitation tradeoff and solve it elegantly using a multi-armed bandit algorithm. Their method, called REx (REfine, Explore, Exploit), solves more problems with significantly fewer LLM calls.
The Problem: A Giant, Endless Tree of Code
Imagine you start with an initial prompt. The LLM gives you a few potential programs. None is perfect.
- Program A passes 80% of test cases.
- Program B passes 60% of test cases.
- Program C passes 0% of test cases.
Now what? You have a limited budget of LLM calls.
- Do you exploit your best option and keep trying to refine Program A? It seems close, but maybe it’s stuck in a local optimum with a fundamental flaw.
- Do you explore by trying to refine Program B? It’s not as good, but perhaps it has a more promising underlying structure. What about Program C? It’s probably a lost cause, but what if it’s just one simple fix away from being brilliant?
This is the core tradeoff. Every time you refine a program, you create a new “branch” in your search tree. Simple strategies like Breadth-First Search (try refining everything a little bit) or Greedy Search (only refine the current best) aren’t very smart about balancing this tradeoff.
The Solution: REx, the Smart Gambler
The authors frame this search as an arm-acquiring multi-armed bandit problem.
Analogy Time: The Slot Machine Casino
Imagine you’re in a casino with a bunch of slot machines (or “one-armed bandits”).
- Arm: Each unique program you could refine is a slot machine.
- Pulling an arm: Calling the LLM to refine that program is like pulling the lever on a machine.
- Reward: If the refinement produces a perfect program that passes all tests, you get a reward of 1 (you hit the jackpot!). Otherwise, the reward is 0.
- Arm-Acquiring: Every time you pull a lever, a new slot machine (a new program variant) might appear on the casino floor.
The goal is to find the jackpot in the fewest pulls. To do this, REx uses a clever algorithm called Thompson Sampling.
How Thompson Sampling Works Here
Instead of just tracking a simple score for each program, REx maintains a probabilistic belief about how likely each program is to lead to a solution. For each program p, it maintains a Beta distribution representing its belief about the “success probability” θ_p of refining that program.
This belief is shaped by two key factors:
- Heuristic Value (The Initial Bet): The belief starts biased by how good the program already looks. The paper uses a simple heuristic: the fraction of test cases passed (h(p)). A program that passes 80% of tests starts with a much more optimistic belief distribution (high expected θ_p) than one that passes only 20%. This is the exploitation part — it prioritises programs that are already doing well.
- Number of Failures (The Disappointment Factor): Every time REx refines a program and it fails to produce a solution, the belief distribution for that program is updated to be more pessimistic. The more you try to refine a program without success, the more its expected θ_p decays towards zero. This is the exploration part — it prevents the algorithm from getting stuck on a promising-but-ultimately-flawed program and encourages it to try other, less-refined options.
The REx Algorithm in a Nutshell:
At each step:
- For every program in your collection, sample a random value from its current Beta belief distribution.
- Pick the program that got the highest sampled value.
- Refine that program by calling the LLM.
- If the new program is a solution, you’re done!
- If not, add the new program to your collection (with its own initial belief) and update the belief of the program you just refined (making it slightly more pessimistic). Repeat.
This simple process naturally balances exploring programs that have been refined less often with exploiting those that have a high pass rate.
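A minimal sketch of the idea is below, assuming a Beta(1 + C·h(p), 1 + C·(1 − h(p)) + failures(p)) belief per program; the placeholder `refine_with_llm` and test-runner functions stand in for the real LLM and execution harness.

```python
# Minimal sketch of a REx-style refinement search using Thompson Sampling.
# Assumption: each program p keeps a Beta(1 + C*h(p), 1 + C*(1 - h(p)) + failures(p))
# belief, where h(p) is the fraction of tests passed and C sets how much the
# heuristic is trusted. The LLM and test harness are stubbed out.

import random
from typing import Optional

def refine_with_llm(program: str) -> str:
    raise NotImplementedError   # placeholder: ask the LLM to fix the failing tests

def fraction_of_tests_passed(program: str) -> float:
    raise NotImplementedError   # placeholder: run the test suite, return value in [0, 1]

def rex(initial_programs: list[str], budget: int, C: float = 20.0) -> Optional[str]:
    heuristic = {p: fraction_of_tests_passed(p) for p in initial_programs}
    failures = {p: 0 for p in initial_programs}

    for _ in range(budget):
        def sample(p: str) -> float:
            # Draw a plausible "success probability" for refining p from its Beta belief.
            alpha = 1.0 + C * heuristic[p]
            beta = 1.0 + C * (1.0 - heuristic[p]) + failures[p]
            return random.betavariate(alpha, beta)

        chosen = max(heuristic, key=sample)   # Thompson Sampling: highest sampled value wins
        child = refine_with_llm(chosen)
        score = fraction_of_tests_passed(child)
        if score == 1.0:
            return child                      # jackpot: all tests pass
        failures[chosen] += 1                 # refining `chosen` did not pay off this time
        heuristic[child] = score              # the new variant joins the pool of "arms"
        failures.setdefault(child, 0)
    return None
```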
The Results: Does it Actually Work?
Yes, and remarkably well. The authors tested REx on three very different and difficult domains:
- Competition Programming (APPS): Hard algorithmic problems.
- Visual Reasoning (ARC): Synthesising programs to solve visual puzzles.
- Loop Invariant Synthesis: A formal verification task in software engineering.
Across all domains, REx demonstrated significant advantages:
- Higher Success Rate: It consistently solved more problems than greedy, breadth-first, and other baseline strategies within a fixed budget of LLM calls.
- Greater Efficiency: To solve the same problems, REx often required 1.5x to 5x fewer LLM calls. This is a massive cost and time-saving.
- Solves Harder Problems: REx managed to solve several difficult problems that were out of reach for the other methods, setting new state-of-the-art results on several benchmarks.
- More Robust: The method was less sensitive to hyperparameter tuning, making it more practical to use “out of the box.”
Key Takeaway for AI Experts
The way you structure the iterative interaction with an LLM for code generation is not a trivial detail — it’s a critical component of success. This paper shows that moving beyond simple greedy or exhaustive search strategies and adopting a principled, bandit-based approach like REx can lead to dramatic improvements in efficiency and capability. It’s a prime example of how combining classic reinforcement learning concepts with modern LLMs can unlock new levels of performance.
4. MatPlotAgent: Method and evaluation for LLM-based agentic scientific data visualisation
Creating precise, publication-quality scientific visualisations is a pain. You can’t just tell DALL-E to “plot my protein consumption data with K-Means clustering” because it’s an artistic tool, not a scientific one. You need code, usually Python with libraries like Matplotlib. While LLMs are great at writing code, they often make subtle mistakes: labels overlap, scales are wrong, or the plot type doesn’t quite match the data. Fixing these issues requires a tedious cycle of running the code, looking at the plot, tweaking the code, and repeating.
This paper, “MatPlotAgent,” introduces a clever LLM-based agent that automates this entire cycle. It doesn’t just write the code; it also looks at the generated image to see if it’s correct and then provides feedback to itself to fix the code. To properly test this, the authors also had to build a new benchmark, MatPlotBench, and an automatic evaluation system using GPT-4V.
The Problem: A Brilliant Coder Who is Blind
Imagine you have a brilliant assistant who is a master of Python and Matplotlib, but they’re blind.
You can give them a complex instruction: “Create a 3D waterfall plot of this frequency data, with a logarithmic pressure scale and specific axis labels.” They will write the code perfectly based on your words. But they can’t see the final output. They have no idea if the labels are overlapping, if the colours are indistinguishable, or if the points are too small to be visible. This is the state of most code-generating LLMs today — they operate on text and code, but are blind to the visual output of that code.
The Solution: MatPlotAgent’s Three-Part Brain
MatPlotAgent solves this by creating an agentic framework that mimics how a human expert works. It has three specialised modules:
1. The Planner (Query Expansion):
First, it takes your high-level request (e.g., “make a chord diagram of brand switching”) and breaks it down into a detailed, step-by-step technical plan. This is like a senior developer writing out a clear spec before coding, identifying the exact libraries, functions, and parameters needed. This prevents the LLM from misunderstanding the initial, often ambiguous, user query.
2. The Coder (Code Agent):
This is the workhorse. It takes the detailed plan and generates the Python code. Crucially, it includes a self-debugging loop. If the code crashes with a runtime error, the agent reads the error message and attempts to fix the bug, just like a real developer would.
3. The Critic (Visual Agent):
This is the game-changer and the paper’s core innovation. After the Coder successfully generates a plot, the Visual Agent — powered by a multi-modal model like GPT-4V or Gemini Pro Vision — looks at the image. It then compares the visual output against the original request and the detailed plan. It provides feedback in natural language, such as:
- “The point sizes are too small for good visibility. You should scale them up.”
- “The text annotations in the chart are jumbled and overlapping. Adjust their positions.”
- “The user requested a logarithmic scale for the y-axis, but this plot uses a linear scale. You need to fix it.”
This feedback is then sent back to the Code Agent, which refines the code and generates a new plot. This loop continues until a satisfactory result is achieved.
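Here is a minimal sketch of that generate-perceive-refine loop. The prompts, the stubbed model calls, and the convention that the script saves `figure.png` are illustrative assumptions rather than MatPlotAgent's actual implementation.

```python
# Sketch of a MatPlotAgent-style generate-perceive-refine loop (illustrative; the
# prompts, stubbed model calls and the figure.png convention are assumptions).

import pathlib
import subprocess
import tempfile

def call_llm(prompt: str) -> str:
    raise NotImplementedError   # placeholder for the code-generation model

def call_vision_llm(prompt: str, image_path: str) -> str:
    raise NotImplementedError   # placeholder for the multi-modal critic (e.g. GPT-4V)

def run_plot_script(code: str) -> tuple[bool, str]:
    """Execute the generated script; return (succeeded, image path or error text)."""
    workdir = pathlib.Path(tempfile.mkdtemp())
    (workdir / "plot.py").write_text(code)
    result = subprocess.run(["python", "plot.py"], capture_output=True, text=True, cwd=workdir)
    if result.returncode != 0:
        return False, result.stderr
    return True, str(workdir / "figure.png")   # assumes the script saves figure.png

def matplotagent(user_query: str, max_rounds: int = 3) -> str:
    # 1. Planner: expand the vague request into a detailed technical plan.
    plan = call_llm(f"Expand this plotting request into a step-by-step plan:\n{user_query}")
    # 2. Coder: write the first version of the script.
    code = call_llm(f"Write a matplotlib script that saves figure.png.\nPlan:\n{plan}")
    for _ in range(max_rounds):
        ok, output = run_plot_script(code)
        if not ok:
            # Self-debugging: feed the traceback back to the coder.
            code = call_llm(f"The script failed with:\n{output}\nFix it:\n{code}")
            continue
        # 3. Visual agent: critique the rendered image against the request and the plan.
        feedback = call_vision_llm(
            f"User request: {user_query}\nPlan: {plan}\n"
            "Does this plot satisfy the request? If not, list concrete fixes.",
            image_path=output,
        )
        if "satisfactory" in feedback.lower():
            return code
        code = call_llm(f"Revise the script using this visual feedback:\n{feedback}\n\n{code}")
    return code
```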
Analogy Time: The Artist’s Feedback Loop
A human programmer or data scientist doesn’t just write code and hope for the best. They run the code, look at the plot, and critique their own work: “Hmm, that legend is in the way,” or “This colour scheme isn’t clear.” MatPlotAgent builds this essential feedback loop directly into the AI system. The Code Agent is the “hands,” and the Visual Agent is the “eyes.”
You Can’t Improve What You Can’t Measure: MatPlotBench
To prove their agent was any good, the researchers faced a classic AI problem: no standard test existed for this complex task. So, they built one.
- MatPlotBench: A high-quality benchmark of 100 difficult visualisation tasks, each with a user query, raw data, and a human-verified ground-truth image. The problems range from standard bar charts to complex Sankey diagrams and 3D plots.
- Automatic Scorer: Manually grading 100 plots for every model is slow and expensive. So, they created an automatic evaluator that prompts GPT-4V to compare a model-generated image to the ground-truth image and assign a score from 0 to 100. They show that this automatic score has a very strong correlation (r > 0.8) with human judgment, making it a reliable tool for future research.
The Results: Seeing is Believing
The results are impressive.
- Significant Performance Boost: MatPlotAgent substantially improved the performance of every LLM tested, including GPT-4, GPT-3.5, and several open-source models like Magicoder and Deepseek-coder. For GPT-4, the score jumped from 48.86 to 61.16.
- The Power of Visual Feedback: An ablation study showed that disabling the Visual Agent caused a significant drop in performance. This confirms that the “secret sauce” is indeed the agent’s ability to see and critique its own visual output.
- Model-Agnostic: The framework works with different code LLMs (the “hands”) and different multi-modal LLMs (the “eyes”), making it a generalizable approach.
Key Takeaway for AI Experts
MatPlotAgent is a fantastic example of the next step in AI agents: creating closed-loop systems that span multiple modalities to solve complex, real-world tasks. By giving a code-generating agent “eyes” to see the result of its work, we move from blind instruction-following to a more robust, self-correcting system that acts much more like a human expert. This principle of “generate, perceive, and refine” is a powerful paradigm that can be applied far beyond just making charts.
5. XUAT-Copilot: Multi-agent collaborative system for automated user acceptance testing with a large language model
Researchers at WeChat Pay had a semi-automated system for app testing (called XUAT), but it still required humans to manually write test scripts, a slow, tedious, and error-prone process. To solve this, they built XUAT-Copilot, an LLM-powered system that automates script generation. Instead of using one monolithic LLM, they designed a collaborative team of three specialised AI agents that work together to understand test cases, interact with the app, and validate each step. This multi-agent approach crushed a single-agent baseline (88.5% vs 22.6% pass rate) and is already deployed in their production environment, saving significant developer time.
The Problem: The “Last Mile” of Test Automation is a Grind
User Acceptance Testing (UAT) is the final check before software goes live. It’s not about finding obscure bugs; it’s about confirming the app works as a real user would expect. For an app like WeChat Pay, with billions of users, this is non-negotiable.
Their existing XUAT system could automatically generate test cases from business requirements (e.g., “User binds a new bank card”). But then, a human tester had to take that instruction and manually write a script of low-level commands (using ADB, the Android Debug Bridge) to execute it.
Analogy: Imagine you have a detailed recipe (the test case), but you still need a chef (the human tester) to painstakingly translate each step — “dice the onion,” “sauté for 5 minutes” — into the precise robotic arm movements required to actually cook the meal. This translation is the bottleneck.
Why is this translation so hard for an AI? The paper highlights four key challenges:
- Highly Condensed Instructions: A single step like “Submit Personal Information” might involve 5–10 actual actions: tap name field, enter text, tap ID field, enter number, scroll down, check a box, tap submit. An AI needs to unpack this.
- Context-Sensitive Actions: The instruction “Confirm Information” could mean clicking a single ‘OK’ button on one screen, but on another, it might mean scrolling through a terms-of-service agreement, ticking a checkbox, and then clicking ‘Confirm’. The AI has to understand the context of the current screen.
- Vast Parameter Space: WeChat Pay has hundreds of test parameters (test usernames, bank numbers, ID numbers, etc.). A human knows which one to use when, but an LLM could easily get confused.
- Step-by-Step Accuracy: UAT is not random exploration. You must follow the script exactly. If one step fails, the entire test fails. The system needs to be meticulous.
The Solution: A Multi-Agent “Pit Crew” for App Testing
Instead of building one “super-agent” to handle everything, the authors created a collaborative system of specialised agents. This is the core insight of the paper. A single agent gets overwhelmed by the long context (UI information, test case steps, parameter lists) and complex reasoning. A team can divide and conquer.
Analogy: Think of a Formula 1 pit crew. You don’t have one mechanic who does everything. You have a specialist for each tyre, one for the fuel, and a crew chief coordinating. XUAT-Copilot works the same way.
Here’s a look at the full system architecture (the paper includes a diagram showing a simplified view of the XUAT-Copilot workflow).
Let’s break down the team:
- Rewriting Module: The initial test cases are written in a clunky, templated format. This module acts as a translator, converting phrases like “System requests User to select a bank… result feedback from User is selecting bank” into a clean, actionable instruction for the LLM: Select ${bank_name}. It makes the task much clearer for the agents.
- Perception Module: This is the system’s eyes. It captures the app’s current screen in two ways: the XML-like View Hierarchy (the structured layout) and a screenshot. It filters out irrelevant information to keep the context concise for the LLM.
- The Operator (Operation Agent): This is the Crew Chief. It’s the core of the system. It looks at the rewritten instruction (e.g., “Enter payment password”) and the current screen information from the Perception Module. Its job is to decide the next action and select the right tool from a predefined “Skill Library” (e.g., click(resource_id), input_text(resource_id, text)).
- The Quartermaster (Parameter Selection Agent): This is a dedicated specialist. When the Operator decides on an action that needs data, like input_text, it calls on the Quartermaster. This agent is given the full list of hundreds of test parameters, and its only job is to pick the correct one (e.g., payment_password_123456). This offloads a difficult, memory-intensive task from the main Operator.
- The Inspector (Inspection Agent): This is the Quality Control agent. After the Operator executes an action (e.g., clicking the ‘Next’ button), the Inspector looks at the new screen and the instruction for the next step. It then decides if the previous step was completed. For example, if the action was “Submit Name and ID” and the new screen is “Set Password,” the Inspector reports “Success, goal accomplished,” and the system moves to the next step in the test case.
The system works in a loop for each step: the Operator acts, the Inspector checks. If the goal isn’t met, the Operator tries again, informed by its past actions. This includes a self-reflection mechanism, where if an action fails, the Operator is prompted to analyse the error and try a different approach — a simple but powerful form of learning.
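A rough sketch of this per-step loop is shown below; the prompts, the stubbed perception and skill-execution helpers, and the crude parameter substitution are assumptions made for illustration.

```python
# Rough sketch of XUAT-Copilot's per-step loop (illustrative; prompts, stubs and the
# parameter substitution are assumptions, not WeChat Pay's implementation).

def call_llm(prompt: str) -> str:
    raise NotImplementedError   # placeholder for the underlying LLM

def capture_screen() -> str:
    raise NotImplementedError   # placeholder: filtered view hierarchy + screenshot info

def execute_skill(action: str) -> None:
    raise NotImplementedError   # placeholder: dispatch click()/input_text() via ADB

def run_step(instruction: str, parameters: dict[str, str], max_attempts: int = 3) -> bool:
    history: list[str] = []
    for _ in range(max_attempts):
        screen = capture_screen()
        # Operation agent: pick the next action from the skill library.
        action = call_llm(
            f"Instruction: {instruction}\nScreen: {screen}\nPast actions: {history}\n"
            "Choose one skill, e.g. click(id) or input_text(id, VALUE)."
        )
        if "input_text" in action:
            # Parameter selection agent: pick the right test parameter for this field.
            value = call_llm(f"Instruction: {instruction}\nParameters: {parameters}\n"
                             "Which parameter value should be entered?")
            action = action.replace("VALUE", value)   # crude placeholder substitution
        execute_skill(action)
        history.append(action)
        # Inspection agent: has the step's goal been reached on the new screen?
        verdict = call_llm(f"Instruction: {instruction}\nNew screen: {capture_screen()}\n"
                           "Has this step been completed? Answer yes or no, with a reason.")
        if verdict.lower().startswith("yes"):
            return True
        # Self-reflection: keep the failure reason so the operator tries differently next time.
        history.append(f"Failed: {verdict}")
    return False
```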
A Clever Detail: Fusing Vision and Text for Hyperlinks
One fantastic, practical detail is how they handle clickable hyperlinks in text. The app’s View Hierarchy might register an entire paragraph as a single clickable element, but the user actually has to tap the small, blue, underlined text within it. Clicking the centre of the paragraph would fail.
To solve this, the Perception Module uses a computer vision pipeline:
- It takes a screenshot and filters for the specific colour of the hyperlink text.
- It uses an object detection model (SegLink++) to find the precise bounding box of just the hyperlink.
- It then updates the UI information given to the Operator with these correct coordinates, ensuring the click() action targets the right spot.
This is a perfect example of grounding an LLM’s reasoning with specialised tools to handle real-world messiness.
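As a rough illustration of the idea, the sketch below substitutes a much simpler technique, plain OpenCV colour thresholding and contour detection, for the SegLink++ detector used in the paper; the colour range and size filter are made-up values.

```python
# Simplified illustration: locate hyperlink text by colour thresholding and contour
# detection with OpenCV, instead of the SegLink++ detector used in the paper.
# The "link blue" BGR range and the size filter are made-up values.

import cv2
import numpy as np

def find_hyperlink_boxes(screenshot_path: str) -> list[tuple[int, int, int, int]]:
    image = cv2.imread(screenshot_path)
    # Keep only pixels whose colour falls inside an (assumed) hyperlink-blue range.
    lower = np.array([200, 100, 0])     # BGR lower bound
    upper = np.array([255, 180, 80])    # BGR upper bound
    mask = cv2.inRange(image, lower, upper)
    # Group matching pixels into connected regions and take their bounding boxes.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        if w > 20 and h > 10:           # ignore tiny noise regions
            boxes.append((x, y, w, h))
    return boxes

# The centre of one of these boxes, not the centre of the whole paragraph element,
# is what the click() skill should target.
```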
The Results: Teamwork Makes the Dream Work
The authors compared their multi-agent XUAT-Copilot against two simpler versions:
- Single Agent: One LLM trying to do everything (plan, select parameters, and inspect).
- Without Reflection: The multi-agent system, but without the ability to reflect on and correct its mistakes within a step.
The multi-agent architecture provides a staggering 4x improvement in pass rate over a single agent. This proves that task decomposition and specialised agents are crucial for tackling complex, interactive tasks. The self-reflection mechanism also provides a significant boost, adding nearly 7 percentage points to the final pass rate.
Why This Matters for AI Experts
- Multi-Agent is More than a Trend: This paper is a powerful, real-world validation of the multi-agent system paradigm. For complex tasks, breaking down the problem and assigning roles to specialised agents is far more effective than trying to build a single, monolithic “do-everything” model.
- The Power of Grounding: XUAT-Copilot isn’t just an LLM in a vacuum. It’s grounded in its environment through a robust Perception Module (its “eyes”) and a constrained Skill Library (its “hands”). This fusion of linguistic reasoning and environmental interaction is key to its success.
- From Lab to Production: This isn’t a theoretical experiment. The system has been “launched in the formal testing environment of WeChat Pay,” demonstrating that LLM-based agentic systems are ready to deliver tangible business value by automating high-skill, labour-intensive work. It’s a blueprint for applying similar agent-based solutions to other complex software workflows.
6. CYCLE: Learning to self-refine the code generation
We know code LLMs are great at writing a first draft of code. But when that draft is wrong — which it often is — they are notoriously bad at fixing it. Ask them to correct their own faulty code using test feedback, and they’ll often just spit out the same broken code again. This paper introduces CYCLE, a framework that tackles this problem head-on. Instead of just prompting, CYCLE continues training a pre-trained model on a specially curated dataset of its own mistakes, the resulting error messages, and the correct solutions. This teaches the model the crucial skill of self-refinement. The results are impressive: CYCLE boosts code generation success rates by up to 63.5% through iteration and allows smaller models to outperform models 3x their size in debugging tasks.
The Problem: The “Brilliant but Stubborn” AI Coder
Today’s code LLMs operate in what the authors call two modes:
- Acceleration Mode: You know what you want, you write a prompt, and the AI generates the code, speeding you up. This works reasonably well.
- Exploration Mode: The first attempt fails. You have an error message, but you’re not sure how to fix it. This is where you’d want the AI to act as a collaborative debugger.
Unfortunately, LLMs are terrible in exploration mode. When presented with their own faulty code and a clear error message, they often fail to understand the feedback and simply copy-paste their original, incorrect answer.
The Solution: Stop Prompting, Start Training
The core insight of CYCLE is that self-refinement isn’t just an emergent ability you can unlock with clever prompting; it’s a skill that needs to be explicitly taught. CYCLE is a three-phase framework designed to do just that.
Phase I: Collecting the Model’s Own Mistakes
You can’t teach a model to fix errors without good examples of errors to fix. Since existing datasets don’t have this structure, CYCLE creates its own.
- Initial Fine-Tuning: They start with a pre-trained model (like CodeGen or StarCoder) and first fine-tune it on a dataset of high-quality, correct code solutions. This isn’t to teach refinement yet; it’s to ensure that when the model does make mistakes later, they are “near-misses” and not just random, nonsensical code.
- Distilling Weaknesses: They then prompt this fine-tuned model with thousands of programming problems. They execute the generated code against test suites and collect every failure. Each data point becomes a triplet: (Faulty Code, Execution Feedback, Correct Solution). This becomes the training data for the next phase.
Phase II: Training for Self-Refinement
This is where the magic happens. The model is now trained on the data from Phase I. For each sample, the model is given a structured input that looks like this:
"""
<Problem Description in a docstring>
"""
# Here is a failed attempt:
<The model's own faulty code from Phase I>
# The execution of the failed attempt produced this error:
<The actual traceback and error message from the test suite>
# Here is the correct implementation:
<The model is trained to generate this part>This training forces the model to learn the connection between a specific type of error (IndexError), the code that caused it, and the pattern of the correct solution.
A Key Innovation: Past Generation Mask (PGM)
A naive model might learn a “shortcut” by just copying large chunks of the <faulty code> since it’s often very similar to the correct solution. To prevent this, CYCLE uses a Past Generation Mask (PGM). During training, it randomly masks out a small percentage (e.g., 5%) of the tokens in the faulty code.
Analogy: This is like giving a student their failed exam but redacting a few key parts of their wrong answer. It forces them to rederive the solution instead of just tweaking their original, flawed logic.
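A minimal sketch of such a mask, assuming a 5% masking rate applied token-by-token to the faulty-code segment (the exact masking scheme and mask token are assumptions):

```python
# Minimal sketch of a Past Generation Mask: hide a small fraction of tokens in the
# faulty-code segment so the model cannot simply copy its previous attempt.
# The 5% rate, the <mask> token and token-level masking are assumptions.

import random

def apply_past_generation_mask(faulty_code_tokens: list[str],
                               mask_rate: float = 0.05,
                               mask_token: str = "<mask>") -> list[str]:
    return [mask_token if random.random() < mask_rate else token
            for token in faulty_code_tokens]

# The masked sequence is what would appear after "# Here is a failed attempt:"
# in the training prompt shown above.
tokens = "def last ( items ) : return items [ len ( items ) ]".split()
print(" ".join(apply_past_generation_mask(tokens)))
```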
Phase III: Iterative Self-Refinement at Inference Time
Once trained, the CYCLE model works like a real developer.
- First Attempt: It generates code for a new problem.
- Test: The system automatically runs the code against a test suite.
- Refine (if needed): If tests fail, the system automatically formats the prompt with the failed code and error feedback (just like in training) and feeds it back to the model.
- Repeat: This loop continues until the code passes the tests or a maximum number of attempts is reached.
The Results: Learning to Debug Pays Off, Big Time
The authors evaluated CYCLE across multiple model sizes and benchmarks (HumanEval, MBPP, APPS).
- Self-Refinement is a Superpower: Vanilla code LLMs barely improve when given feedback. In contrast, CYCLE models get progressively better with each refinement step, boosting their final pass rate by up to 63.5%.
- Efficiency is King: The training makes models dramatically more capable for their size.
- CYCLE-350M consistently outperforms the much larger StarCoder-1B model.
- CYCLE-1B performs on par with StarCoder-3B.
This demonstrates that targeted training for a specific skill (self-refinement) is far more parameter-efficient than simply scaling up a generalist model.
- It’s Not Just Top-K Generation: A common way to boost pass rates is to generate many (k) samples at once (Top-K sampling). The authors show this is fundamentally different from CYCLE’s approach.
- Top-K explores in breadth: It generates diverse solutions from the same initial prompt.
- CYCLE explores in depth: It iteratively improves a single solution using new information (the feedback). The two approaches are orthogonal and can even be combined for even better performance.
Why This Matters for AI Experts
- Beyond Prompt Engineering: This paper signals a shift from trying to coax desired behaviours out of models with prompts to explicitly training them for those behaviours. For complex skills like debugging, fine-tuning on the task loop itself is far more effective.
- The Untapped Value of Execution Feedback: Automated feedback from compilers, linters, and test suites is a cheap, scalable, and powerful data source for improving code LLMs. CYCLE provides a blueprint for how to leverage it effectively.
- A Path to More Reliable AI Assistants: By learning to self-correct, models can move from being “fast but unreliable typists” to genuine collaborators. An AI that can take feedback, understand its mistakes, and offer a corrected version is infinitely more useful in a real-world development workflow. CYCLE is a significant step in that direction.
7. Agentless: Demystifying LLM-based software engineering agents
If you’ve been following AI for software engineering, you’ve seen the meteoric rise of “autonomous agents.” Inspired by systems like Devin, the goal has been to create AI agents that can think, plan, and use tools (like a terminal or file editor) to solve complex coding problems, mimicking a human developer. The consensus seemed to be that more complexity, more tools, and more autonomy were the path forward.
A new paper from the University of Illinois Urbana-Champaign, titled “AGENTLESS: Demystifying LLM-based Software Engineering Agents,” throws a wrench in that narrative. The authors ask a provocative question:
Do we really need these complex agents?
Their answer, backed by stunning results, is a resounding “maybe not.” They introduce AGENTLESS, a simple, deterministic approach that outperforms every open-source agent on the popular SWE-bench Lite benchmark, and does so at a fraction of the cost. The approach is so effective that OpenAI has already adopted it as the standard for showcasing the real-world coding abilities of GPT-4o and the new o1 models.
Let’s dive into what makes AGENTLESS tick and why it’s a wake-up call for the AI community.
The Problem with Today’s AI Agents
The paper starts by outlining why current autonomous agents, despite their promise, often fall short. They identify three key limitations:
- Complex Tool Usage: Agents need a carefully designed API (an “abstraction layer”) to interact with their environment (e.g., readFile('path/to/file.py')). If this API is poorly designed or the agent misuses it, the entire process can derail, wasting time and money on failed attempts.
- Lack of Control in Planning: Agents are given the freedom to decide their next steps based on feedback. However, current LLMs can easily get confused by ambiguous feedback or a vast action space. They can go down rabbit holes, performing dozens of sub-optimal actions that are hard to debug.
- Limited Self-Reflection: Agents struggle to filter out irrelevant or misleading information. A single incorrect step or piece of bad feedback can be amplified, poisoning all subsequent decisions. They don’t have the robust “gut-check” mechanism that human developers do.
Analogy: Think of a complex AI agent as a novice intern you’ve given keys to the whole office and a vague goal like “improve our software.” They might wander around, try using complicated machinery they don’t understand, and make a mess based on a misunderstood comment. AGENTLESS, in contrast, is like giving that intern a very specific, step-by-step checklist to follow. No autonomy, but high precision and reliability.
The AGENTLESS Approach: A Simple Three-Phase Pipeline
Instead of an autonomous loop of plan -> act -> observe, AGENTLESS is a straightforward, three-phase process. It uses LLMs as powerful but controlled tools for specific tasks, not as autonomous decision-makers.
Here’s the breakdown:
Phase 1: Localisation
The first challenge in a large codebase is finding the right place to make a change. AGENTLESS does this with a clever, funnel-like process that narrows down the search space.
- Localise to Files: First, it gives the LLM a tree-like view of the repository’s directory structure (just file and folder names, not code). It combines this with an embedding-based search (cosine similarity between the issue description and code chunks) to identify the top N most relevant files.
- Localise to Elements (Classes/Functions): Instead of feeding the entire content of these files to the LLM (which could be thousands of lines), it creates a “skeleton” of each file. This skeleton contains only class and function declarations (like a table of contents), which is much more concise. The LLM then points out the specific classes and functions within those files that are most relevant to the bug.
- Localise to Edit Locations: Finally, with a much smaller context (just the code for the identified classes/functions), it asks the LLM to pinpoint the exact lines or blocks of code that need to be edited.
Analogy: This is like a detective finding a suspect. First, they identify the right city (the files). Then, they narrow it down to a specific neighbourhood (the classes/functions). Finally, they find the exact house address (the edit location).
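As an illustration of the “skeleton” idea (not Agentless’s actual code), here is a minimal sketch that keeps only class and function signatures from a Python file using the standard `ast` module:

```python
# Illustration of the "file skeleton" idea using the standard ast module
# (not Agentless's actual code): keep only class and function signatures.

import ast

def file_skeleton(source: str) -> str:
    tree = ast.parse(source)
    lines = []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}:")
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    args = ", ".join(a.arg for a in item.args.args)
                    lines.append(f"    def {item.name}({args}): ...")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args}): ...")
    return "\n".join(lines)

example = '''
class UserValidator:
    def __init__(self, rules):
        self.rules = rules
    def validate(self, user):
        return all(rule(user) for rule in self.rules)

def load_rules(path):
    ...
'''
print(file_skeleton(example))
# class UserValidator:
#     def __init__(self, rules): ...
#     def validate(self, user): ...
# def load_rules(path): ...
```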
Phase 2: Repair
Once the locations are identified, AGENTLESS asks the LLM to generate patches. Critically, it doesn’t ask for the entire fixed file. Instead, it uses a simple Search/Replace diff format.
- The LLM outputs a <<<<<<< SEARCH block with the original code and a >>>>>>> REPLACE block with the new code.
- This is more efficient, cheaper, and less prone to hallucination than generating large blocks of code.
- To maximise its chances, AGENTLESS samples multiple potential patches (e.g., 40 candidates per issue) using a higher temperature setting in the LLM.
Analogy: This is like using “Find and Replace” in a document instead of rewriting the entire page. It’s a precise, surgical change rather than a risky, complete overhaul.
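A toy sketch of applying such an edit is shown below; parsing the LLM’s raw SEARCH/REPLACE output is simplified away, and real patches need more robust handling (whitespace tolerance, multiple edits per file, and so on).

```python
# Toy sketch of applying a Search/Replace edit (parsing of the LLM's raw diff
# format is omitted; real patches need more robust handling).

def apply_search_replace(file_text: str, search_block: str, replace_block: str) -> str:
    if search_block not in file_text:
        raise ValueError("SEARCH block not found; the patch cannot be applied.")
    # Replace only the first occurrence: a surgical, single edit.
    return file_text.replace(search_block, replace_block, 1)

original = "def area(r):\n    return 3.14 * r * r\n"
patched = apply_search_replace(
    original,
    search_block="    return 3.14 * r * r",
    replace_block="    import math\n    return math.pi * r * r",
)
print(patched)
```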
Phase 3: Patch Validation
This is perhaps the most clever part. In real-world bug fixing, you don’t have a ready-made test that fails on the buggy code. AGENTLESS solves this by:
- Generating Reproduction Tests: It prompts the LLM to write a new test case specifically designed to reproduce the original bug. The test is written to print “Issue reproduced” on the original buggy code and “Issue resolved” on the correctly patched code. It generates many candidate tests and picks the most reliable one.
- Using Regression Tests: It runs the project’s existing test suite to ensure the patch doesn’t break any existing functionality.
- Filtering and Ranking: Patches are evaluated in a strict order:
- First, they must pass the regression tests. Any patch that breaks existing features is discarded.
- From the remaining candidates, it keeps only those that pass the newly generated reproduction test (i.e., make it print “Issue resolved”).
- Finally, if multiple patches remain, it uses majority voting on the normalised code changes to select the most frequently proposed solution.
Analogy: Before fixing a leaky faucet, you first turn on the water to confirm where it’s leaking (the reproduction test). After you fix it, you check that the toilet still flushes and the shower works (the regression tests).
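A minimal sketch of that filter-then-vote logic follows. The test runners are placeholders, the whitespace-stripping normalisation is a simplification, and falling back to regression-passing patches when nothing passes the reproduction test is an assumption.

```python
# Sketch of filter-then-vote patch selection. The test runners are placeholders,
# the normalisation is a simplification, and the fallback when no patch passes the
# reproduction test is an assumption.

from collections import Counter
from typing import Optional

def passes_regression_tests(patch: str) -> bool:
    raise NotImplementedError   # placeholder: run the project's existing test suite

def passes_reproduction_test(patch: str) -> bool:
    raise NotImplementedError   # placeholder: does the generated test print "Issue resolved"?

def normalise(patch: str) -> str:
    # Crude normalisation so trivially different but equivalent patches vote together.
    return "\n".join(line.strip() for line in patch.splitlines() if line.strip())

def select_patch(candidates: list[str]) -> Optional[str]:
    survivors = [p for p in candidates if passes_regression_tests(p)]
    reproduced = [p for p in survivors if passes_reproduction_test(p)]
    pool = reproduced or survivors              # fall back if nothing passes the repro test
    if not pool:
        return None
    votes = Counter(normalise(p) for p in pool)
    winner, _ = votes.most_common(1)[0]
    # Return an original (un-normalised) patch matching the winning form.
    return next(p for p in pool if normalise(p) == winner)
```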
The Shocking Results
The simplicity of AGENTLESS pays off handsomely. On the SWE-bench Lite benchmark (300 real-world GitHub issues):
- AGENTLESS solved 96 issues (32.00%), achieving the highest performance among all open-source approaches.
- It did this at an average cost of just $0.70 per issue, significantly cheaper than most agent-based systems.
- Its performance is highly competitive even with top-tier, closed-source commercial tools.
A Critical Look at the Benchmark: Introducing SWE-bench Lite-S
The authors didn’t stop there. They manually analysed all 300 problems in SWE-bench Lite and found some serious issues with the benchmark itself:
- Leaked Solutions: 4.3% of problems contained the exact ground-truth patch in the issue description. Another 9.7% gave a complete, step-by-step guide.
- Missing Information: 10% of issues were impossible to solve as described, requiring information (like a specific error message string) not provided in the prompt.
- Misleading Information: 5% of issue descriptions suggested a solution that was different from the actual ground-truth fix, potentially misleading AI tools.
To enable more rigorous and fair evaluation, they created SWE-bench Lite-S, a filtered version with these problematic issues removed. This is a crucial contribution to the community, pushing for better and more realistic benchmarks.
Why This Paper Matters: Key Takeaways
- It Resets the Baseline: AGENTLESS proves that a simple, deterministic pipeline can be more effective than a complex, autonomous agent. Any new, more complex agent must now prove it can beat this powerful and cost-effective baseline.
- Complexity is Not Always the Answer: The paper is a powerful reminder that in the rush to build AGI-like systems, we may be overlooking simpler, more robust engineering solutions that work better with the current limitations of LLMs.
- The Importance of Good Benchmarking: The creation of SWE-bench Lite-S highlights the need for rigorous, high-quality benchmarks to accurately measure progress in the field.
- Industry Validation: The fact that OpenAI quickly adopted AGENTLESS for its own high-profile model evaluations speaks volumes about the approach’s credibility and effectiveness.
In an era defined by the hype around god-like AI agents, AGENTLESS is a dose of brilliant, pragmatic engineering. It reminds us to first ask: what is the simplest thing that could work? As it turns out, the answer is “surprisingly well.”
8. SWE-agent: Agent-computer interfaces enable automated software engineering
What if the problem isn’t that agents are a bad idea, but that we’ve been forcing them to work with tools designed for humans?
That’s the provocative premise behind “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering,” a landmark paper from Princeton. The authors argue that we should treat LLMs as a completely new category of end-user. Just as humans invented Integrated Development Environments (IDEs) like VS Code to make themselves better programmers, we need to build specialised interfaces for our AI agents.
They call this concept the Agent-Computer Interface (ACI), and their implementation, SWE-agent, shows just how transformative this idea can be.
The Problem: Forcing an AI to Use a Human’s Keyboard
Think about how a human uses a command line. We use short, cryptic commands like grep, sed, and cat, and we interpret pages of output, instinctively filtering out the noise. We rely on visual cues in our text editors and a lifetime of experience.
The authors of SWE-agent argue that asking an LLM to do the same is like asking a brilliant mathematician to write a novel using a hammer and chisel. The tool is simply not suited for the user. LLMs struggle with:
- Verbose Output: Commands like cat can flood the LLM’s context window with thousands of lines of irrelevant code, causing it to “lose the plot.”
- Complex Commands: Multi-line editing with sed or awk is notoriously difficult and error-prone, even for many human developers.
- Lack of Feedback: A command might fail silently, leaving the agent confused about the state of the system.
- Inefficient Exploration: An agent might get stuck in repetitive loops, like calling ls and cd over and over, because it lacks the high-level navigation tools a human uses in an IDE.
Analogy: Giving an LLM a standard Linux shell is like asking someone to navigate a city using only a list of every single street name, with no map. An ACI is like giving them Google Maps — an interface designed to make navigation intuitive by providing the right information at the right time.
The Solution: The Agent-Computer Interface (ACI)
The core idea of SWE-agent is to build an abstraction layer between the LLM and the computer that is tailored to the LLM’s strengths and weaknesses. They designed their ACI around four key principles:
- Simple, Understandable Actions: Replace complex bash commands with a small set of simple, well-documented commands. For example, instead of a complex find command, SWE-agent has find_file and search_dir.
- Compact and Efficient Actions: Consolidate multi-step human workflows into single actions for the agent. The prime example is the edit command, which replaces a specific range of lines in one go, a task that would be a nightmare with standard shell tools.
- Informative but Concise Feedback: When a command runs, give the agent just the information it needs. If a search returns too many results, don’t show them all; instead, tell the agent to refine its query. This protects the precious context window.
- Guardrails to Prevent Errors: Actively prevent the agent from making common mistakes. This is the most brilliant part of the design.
Analogy for Guardrails: When you’re writing code in VS Code and make a syntax error, a red squiggly line appears immediately. SWE-agent’s ACI does the same for the LLM. When the agent proposes a code edit, the ACI runs a linter (a syntax checker) on it. If the edit introduces a syntax error, the change is rejected, and the agent is told exactly what went wrong. This prevents a small mistake from cascading into a series of failed steps.
How SWE-agent Works in Practice
SWE-agent provides the LLM with a small but powerful toolkit:
- File Viewer: The open command doesn’t dump the whole file. It shows a 100-line window with line numbers and context (e.g., “400 lines above,” “2684 lines below”). The agent can navigate with scroll_up/down or jump directly with goto <line_number>.
- File Editor: The edit <start_line>:<end_line> command is the star of the show. It allows precise, multi-line edits. After an edit, the File Viewer automatically refreshes to show the change, providing immediate feedback.
- Search Tools: find_file, search_file, and search_dir are designed to return summarised, clean results, preventing context overflow.
- Context Management: The system automatically “collapses” old interactions in the prompt history, keeping only a one-line summary. This saves tokens and keeps the agent focused on the most recent, relevant information.
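Here is a toy sketch of two of those ACI ideas: a windowed file viewer and an edit command guarded by a syntax check. Using Python’s built-in `compile()` as a stand-in for a real linter, and the exact window format, are simplifications rather than SWE-agent’s implementation.

```python
# Toy sketch of two ACI ideas: a windowed file viewer and an edit command that is
# rejected if it introduces a syntax error. Python's compile() stands in for a real
# linter; the window format is a simplification of SWE-agent's interface.

def view_window(lines: list[str], start: int, window: int = 100) -> str:
    end = min(start + window, len(lines))
    header = f"({start} lines above, {len(lines) - end} lines below)"
    body = "\n".join(f"{i + 1}: {line}" for i, line in enumerate(lines[start:end], start=start))
    return f"{header}\n{body}"

def edit(lines: list[str], start_line: int, end_line: int, new_text: str) -> list[str]:
    """Replace lines start_line..end_line (1-indexed, inclusive), with a syntax guardrail."""
    candidate = lines[:start_line - 1] + new_text.splitlines() + lines[end_line:]
    try:
        compile("\n".join(candidate), "<edited-file>", "exec")   # guardrail: syntax check
    except SyntaxError as err:
        raise ValueError(f"Edit rejected, it introduces a syntax error: {err}") from err
    return candidate

source = ["def add(a, b):", "    return a - b"]
print(view_window(source, 0))
print("\n".join(edit(source, 2, 2, "    return a + b")))
```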
The Results: A Resounding Success
By giving the agent a better interface, the SWE-agent achieved state-of-the-art performance:
- It solved 12.47% of the full SWE-bench dataset, a massive leap from the previous best of 3.8% from a non-interactive system.
- On SWE-bench Lite, it showed a 64% relative improvement over an agent using just a standard shell, proving that the ACI, not just the LLM, was the key to success.
- The system is portable; when they swapped GPT-4 for Claude 3 Opus, it still performed exceptionally well.
The analysis of why it works is just as interesting:
- Editing is Critical: Ablation studies showed that removing the custom edit command caused performance to plummet.
- Guardrails are Essential: Disabling the linter guardrail also significantly hurts performance, as agents would get stuck trying to recover from their own syntax errors.
- Failure is Informative: Most failures weren’t because the agent couldn’t find the right file, but because its proposed fix was logically incorrect. The ACI helps with the mechanics of programming, but the core reasoning challenge remains.
Why This Paper Matters: Key Takeaways
- It introduces a New Field: Agent-Computer Interaction (ACI): This paper formalises the idea that we need to design interfaces for AIs, just like we have a whole field (HCI) for designing interfaces for humans. This is a powerful new lens for thinking about agent development.
- It’s a Counterpoint to the “Less is More” Narrative: While AGENTLESS showed the power of simplicity, SWE-agent shows that complexity and autonomy can work, provided the agent is properly equipped. The future might not be about less powerful agents, but about agents with better tools.
- It Shifts Focus from the Model to the Environment: This work proves that you can get massive performance gains without touching the LLM’s weights. Improving the agent’s environment is just as important as improving the agent’s “brain.”
SWE-agent doesn’t just build a better coding agent; it presents a new philosophy for how we should build agents for any complex digital task. The question is no longer just “How smart can we make the LLM?” but also, “How smart can we make the world it works in?”
9. CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges
Large Language Models (LLMs) like GPT-4 and Code Llama are impressive at writing isolated functions, but they often stumble when faced with the complexity of a real-world software repository. Real code isn’t written in a vacuum; it depends on a web of existing classes, utility functions, project-specific conventions, and documentation. A new paper from Peking University introduces CODEAGENT, a framework designed to bridge this gap by giving LLMs an agentic structure and a toolkit that mimics how human developers work.
This summary will break down the core problem CODEAGENT addresses, its ingenious solution involving tools and strategies, the new benchmark created to test it, and the compelling results that show it outperforming even commercial giants like GitHub Copilot.
The Core Problem: From Isolated Snippets to Living Codebases
Most code generation benchmarks, like HumanEval, test an LLM’s ability to write a single, self-contained function. The paper calls this “function-level” generation. However, the authors point out that over 70% of functions in open-source projects are non-standalone. They introduce and formalise the concept of repo-level code generation, a much more realistic and challenging task.
Analogy: The Master Chef vs. The Line Cook
- Function-level generation is like asking a chef to execute a single recipe for a perfect chocolate lava cake in a sterile, empty kitchen. They are given the recipe (the prompt) and all the ingredients (standard libraries), and their single output is judged.
- Repo-level generation, on the other hand, is like asking that chef to create a new dessert for a busy restaurant’s menu. They must work within the existing kitchen (runtime environment), use pre-made sauces and components from the pantry (contextual dependencies), and ensure their new dessert stylistically and functionally complements the existing menu (documentation and project conventions).
To solve this, a task must provide the LLM with three key pieces of information:
- Documentation: Detailed descriptions of the target classes/functions, including parameters, return values, and explanations of domain-specific terms.
- Contextual Dependency: The full context of the existing codebase. The LLM needs to know what user-defined classes and functions it can (or must) use from other files in the repository.
- Runtime Environment: A pre-configured environment to execute and test the generated code, allowing for iterative debugging based on execution feedback.
The Solution: CODEAGENT — An LLM with a Developer’s Toolkit
CODEAGENT is an LLM-based agent framework that equips a base LLM with external tools, allowing it to interact with the code repository much like a human would.
The framework integrates five specially designed programming tools, grouped into three categories:
1. Information Retrieval:
- WebSearch: Uses DuckDuckGo to search for general programming concepts or solutions online, mimicking a developer searching on Stack Overflow or reading a tutorial.
- DocSearch: A specialised search tool (using BM25) to find relevant information within the project’s own documentation.
2. Code Implementation:
- SymbolSearch: This is arguably the most critical tool. Built with tree-sitter, it allows the agent to perform static analysis on the codebase. It can list all functions and classes in a file or retrieve the full source code for a specific function/class.
- Analogy: This tool is the equivalent of a developer using “Go to Definition” or navigating the file tree in their IDE to understand how existing code works before writing new code. (A simplified sketch of such a tool appears after this tool list.)
3. Code Testing:
- FormatCheck: Uses the Black code formatter to automatically fix syntax and style issues.
- PythonREPL: A code interpreter that provides a sandbox to execute the generated code, run tests, and get immediate feedback (including error tracebacks) for debugging.
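The paper builds SymbolSearch on tree-sitter; the sketch below is a much simpler stand-in using Python’s built-in ast module, just to illustrate the idea of listing and retrieving symbols from a repository file. The function names are mine, not the paper’s.

```python
import ast

def list_symbols(path: str) -> list[str]:
    """List the top-level functions and classes defined in a Python file."""
    tree = ast.parse(open(path, encoding="utf-8").read())
    return [node.name for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))]

def get_source(path: str, symbol: str) -> str | None:
    """Return the source of a named top-level function or class, if it exists."""
    source = open(path, encoding="utf-8").read()
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)) and node.name == symbol:
            return ast.get_source_segment(source, node)
    return None

if __name__ == "__main__":
    # Demo on this very file; in CODEAGENT the agent would point this at repository modules.
    print(list_symbols(__file__))
    print(get_source(__file__, "list_symbols"))
```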
Simply having tools isn’t enough; the agent needs a strategy to decide when and how to use them. CODEAGENT explores four different strategies to guide the LLM’s reasoning process:
- ReAct (Reasoning and Acting): The agent works in a loop. It first generates a “thought” (e.g., “I need to understand what functions are available in utils.py”) and then an “action” (e.g., SymbolSearch('utils.py')). It observes the tool’s output and uses that new information to inform its next thought and action. It’s an iterative, reactive approach. A minimal sketch of this loop appears after the list of strategies.
- Tool-Planning: The agent first creates a high-level plan (e.g., “Step 1: Search the web for the algorithm. Step 2: Find the base class in the repo. Step 3: Implement the new class. Step 4: Test the code.”). It then executes each step, calling tools as needed to complete the subtasks.
- Rule-based Tool Usage: A more rigid, predefined workflow that mimics a standard developer process: (1) Use retrieval tools, (2) use implementation tools to write code, (3) use testing tools to verify and debug. The agent cycles through tools in each stage until it’s satisfied.
- OpenAIFunc: Utilises the native function-calling capabilities of models like GPT-4, which is structurally similar to the ReAct approach.
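For concreteness, here is a minimal sketch of the ReAct-style loop described above. The `call_llm` callable and the `tools` registry are placeholders for whatever model API and tool implementations (WebSearch, DocSearch, SymbolSearch, and so on) an actual system wires in; the paper’s prompting and output parsing are more elaborate.

```python
import re

def react_loop(task: str, call_llm, tools: dict, max_steps: int = 8) -> str:
    """Alternate model 'thoughts' and tool 'actions' until a final answer is produced.
    `call_llm(prompt) -> str` and `tools[name](arg) -> str` are supplied by the caller."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        reply = call_llm(transcript)          # e.g. "Thought: ...\nAction: SymbolSearch[utils.py]"
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        match = re.search(r"Action:\s*(\w+)\[(.*?)\]", reply)
        if match is None:
            transcript += "Observation: no valid action found; emit Action: Tool[arg] or a Final Answer.\n"
            continue
        name, arg = match.groups()
        observation = tools[name](arg) if name in tools else f"unknown tool: {name}"
        transcript += f"Observation: {observation}\n"   # the result feeds the next thought
    return "No final answer within the step budget."
```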
The Proving Ground: CODEAGENTBENCH and Experimental Results
To properly evaluate their method, the researchers created a new benchmark, CODEAGENTBENCH. It consists of 101 challenging repo-level tasks sourced from real, high-quality Python projects on GitHub across five domains (Machine Learning, Data Structures, etc.).
Key Findings:
- Massive Performance Gains: On CODEAGENTBENCH, the framework provided a significant boost to all tested LLMs. The Pass@1 rate improved by 2.0 to 15.8 percentage points. For GPT-4-turbo, the pass rate jumped from 21.8% (NoAgent) to 37.6% (CODEAGENT with Rule-based strategy), a 72.7% relative improvement.
- Generality: CODEAGENT also improved performance on the simpler, function-level HumanEval benchmark, demonstrating its versatility.
- The MVP Tool: An ablation study revealed that the code symbol navigation tool (SymbolSearch) was the most crucial. Removing it caused the largest drop in performance, confirming that the ability to understand and navigate existing code is paramount for repo-level tasks.
- Outperforming Commercial Products: In a manual evaluation on a subset of the benchmark, CODEAGENT (with both GPT-3.5 and GPT-4) solved significantly more problems than GitHub Copilot, Amazon CodeWhisperer, and even the agent-based AutoGPT.
A Concrete Example: Seeing is Believing
In a case study, the task was to implement a PolynomialKernel class.
- The baseline model (NoAgent) generated a standalone class from scratch, completely missing that it was supposed to inherit from a KernelBase class defined elsewhere in the repository. The code was functionally incorrect.
- CODEAGENT, in contrast, used SymbolSearch to inspect the utils.kernels module. It discovered the KernelBase class, examined its source code to find that it had an abstract method _kernel that needed to be implemented, and correctly inherited from the base class. This led to a correct and seamlessly integrated solution.
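The paper does not reproduce the repository code, so the snippet below is a hypothetical reconstruction of the shape of the fix: the agent implements the abstract `_kernel` hook of the discovered base class instead of writing a standalone class. The class names come from the case study; everything else (constructor arguments, the polynomial formula) is illustrative.

```python
from abc import ABC, abstractmethod
import numpy as np

class KernelBase(ABC):
    """Stand-in for the base class the agent discovered via SymbolSearch in utils.kernels."""
    def __call__(self, x, y):
        return self._kernel(np.asarray(x, dtype=float), np.asarray(y, dtype=float))

    @abstractmethod
    def _kernel(self, x, y):
        """Subclasses must implement the actual kernel computation."""

class PolynomialKernel(KernelBase):
    """The correctly integrated solution: inherit from KernelBase and implement _kernel."""
    def __init__(self, degree: int = 3, coef0: float = 1.0):
        self.degree = degree
        self.coef0 = coef0

    def _kernel(self, x, y):
        return (x @ y.T + self.coef0) ** self.degree
```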
Takeaways and Future Directions
CODEAGENT represents a significant step forward in making AI code generation practical for real-world software engineering. By formalising the “repo-level” task and equipping LLMs with a developer’s toolkit, the authors have shown a clear path to overcoming the limitations of current models.
The paper is also transparent about its limitations, including the need for more advanced tools, more rigorous evaluation against commercial IDEs, and prompt optimisation. Nonetheless, it provides a powerful proof-of-concept: the future of AI programming assistants lies not in generating isolated snippets, but in creating intelligent agents that can read, understand, and contribute to complex, living codebases.
10. Debug like a human: A large language model debugger via verifying runtime execution step-by-step
When an LLM generates code that fails a unit test, the standard approach is to feed the error message back to the model and ask it to try again. This is a crude, black-box process. It’s like telling a chef their soup is too salty without letting them taste the individual ingredients to find out why. Human developers don’t work this way; they fire up a debugger, set breakpoints, and inspect the program’s internal state as it runs to find the precise point of failure.
A new paper from UCSD, “Debug like a Human,” introduces a framework called LDB (Large Language Model Debugger) that brings this powerful, human-like debugging process to LLMs, leading to significant improvements in code correctness.
This summary will unpack the core limitation of existing debugging methods, explain LDB’s ingenious step-by-step verification process, and highlight the impressive results showing its ability to fix bugs that even advanced code generation systems miss.
The Core Problem: Black-Box vs. White-Box Debugging
Current iterative refinement methods, like “Self-Debugging,” treat the generated program as an indivisible entity. They execute the entire program, look at the final output or error, and use that high-level feedback to guess at a fix. This approach struggles with complex logic because the LLM can’t see where, inside the execution, the logic went wrong. It often hallucinates an incorrect execution path or miscalculates intermediate values, leading it to “fix” the wrong part of the code or get stuck in a loop of incorrect attempts.
Analogy: The Apprentice Mechanic
- Existing Methods (Black Box): An apprentice mechanic builds an engine. You try to start the car, and it sputters and dies. You tell the apprentice, “It didn’t work, try again.” The apprentice has to guess whether the problem was the fuel line, the spark plugs, or the timing belt.
- LDB (White Box): You hook the engine up to diagnostics. You tell the apprentice, “The fuel pump is active, pressure is good. The spark plugs are firing, and timing is correct. Ah, but the sensor at the camshaft is reading an error.” Now, the apprentice knows exactly where to focus their effort.
LDB gives the LLM this diagnostic-level insight by letting it inspect the program’s runtime execution.
The Solution: The LDB Workflow
LDB emulates the process of using an interactive debugger (like GDB or PDB). It breaks down the monolithic task of “fixing the code” into a series of smaller, verifiable steps. The workflow is as follows:
1. Generate Seed Program: An LLM generates an initial version of the code.
2. Profiling (Collecting Evidence): If the seed program fails a visible test case, LDB’s profiling stage kicks in.
- Build Control Flow Graph (CFG): First, it performs a static analysis of the code to create a CFG — essentially a flowchart of all possible execution paths. This divides the code into basic blocks. A basic block is a straight-line sequence of code with no branches in or out; it’s a fundamental unit of execution.
- Trace Execution: LDB then runs the failing test case and records the actual sequence of basic blocks that were executed — the execution trace.
- Track Intermediate States: For each step in the trace, LDB records the values of all relevant variables before and after the block is executed. This is the equivalent of setting a breakpoint at the end of each block. (A minimal, line-level tracing sketch follows this workflow.)
3. Debugging (Step-by-Step Verification): This is the core innovation. LDB presents the collected evidence to an LLM. For each block in the execution trace, it asks:
- “Given the task description, here is Block #N. Before this block, the variables were {V_in}. After executing it, the variables are {V_out}. Is this block’s behaviour Correct or Incorrect? Please explain why.”
- The LLM provides a verdict and a justification for each block. This forces the model to reason about small, manageable code units and their specific effect on the program’s state, making it much easier to pinpoint the exact line or block containing the logical error. To improve efficiency, LDB uses Batch Debugging, sending the information for all blocks in a single query.
4. Regeneration: LDB collects all the “Incorrect” verdicts and their explanations. It feeds this highly specific, localised feedback to the LLM and asks it to regenerate the program. This process is repeated until the code passes all visible tests or a maximum number of iterations is reached.
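LDB derives basic blocks from the CFG and records variable states around each block; the sketch below is a far cruder, line-level approximation using Python’s built-in sys.settrace, only to show what collecting runtime states looks like. It is not the paper’s implementation, and the demo function is mine.

```python
import sys

def trace_variables(func, *args):
    """Run func(*args) and snapshot its local variables each time a new line is about to execute."""
    snapshots = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            snapshots.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer                # keep tracing new frames; the check above filters them

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)           # always detach the tracer
    return result, snapshots

def triangular(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

result, trace = trace_variables(triangular, 3)
for lineno, local_vars in trace:
    print(lineno, local_vars)        # the intermediate states an LDB-style debugger would inspect
print("returned:", result)
```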
Key Findings and Experimental Results
LDB was tested on the HumanEval, MBPP, and TransCoder benchmarks using GPT-3.5, CodeLlama, and StarCoder as backbones.
- Consistent and Significant Improvements: LDB boosted the Pass@1 accuracy by up to 9.8% over the baseline, consistently outperforming prior “Self-Debugging” methods across all models and datasets.
- Fixing Bugs in SOTA Models: In a compelling experiment, LDB (using a weaker GPT-3.5 backbone) was tasked with debugging code generated by the much more powerful GPT-4 and the advanced Reflexion agent. It still managed to find and fix bugs, improving GPT-4’s score by 2.4% and Reflexion’s by 3.6%, achieving a new state-of-the-art of 95.1% on HumanEval. This demonstrates that LDB is an orthogonal improvement that adds value even on top of the best existing methods.
- Superior Long-Term Improvement: A performance-vs-iteration analysis showed that while other methods plateau after 2–3 attempts, LDB continues to improve its performance over many iterations. This is because it receives new, factual runtime information in each cycle, preventing it from getting stuck in a loop of flawed internal reasoning.
- “Basic Blocks” are the Right Granularity: An ablation study compared breaking the code down by line, by basic block, and by function. Line-level was too granular and lost semantic context, while function-level was too coarse to pinpoint bugs effectively. Basic blocks proved to be the “Goldilocks” level of decomposition, yielding the best performance with the most efficient token usage.
A Concrete Example
In a case study, the task was to check if a list is sorted and has no more than one duplicate of any number (e.g., [1, 2, 2] is okay, but [1, 2, 2, 2] is not). The seed program mistakenly checked for lst.count(x) > 1 (any duplicates are bad).
- LDB ran a failing test case: is_sorted([1, 2, 2, 3, 3, 4]) should be True but returned False.
- It traced the execution. The initial loop checking the sort order was verified as Correct block-by-block.
- When it reached the final block containing the return not any(lst.count(x) > 1 …) statement, it presented the state to the LLM. The LLM saw that this block was causing the incorrect final output and correctly diagnosed the logical error, flagging the block as Incorrect and explaining that the condition should be > 2.
- The regeneration step used this precise feedback to produce the correct code.
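Reconstructed from the description above (the exact code in the paper may differ), the seed bug and the repaired condition look like this:

```python
def is_sorted_buggy(lst):
    # Seed program: rejects *any* duplicate, so [1, 2, 2, 3, 3, 4] wrongly returns False.
    return all(lst[i] <= lst[i + 1] for i in range(len(lst) - 1)) and \
           not any(lst.count(x) > 1 for x in lst)

def is_sorted_fixed(lst):
    # After LDB's block-level feedback: only reject values that appear more than twice.
    return all(lst[i] <= lst[i + 1] for i in range(len(lst) - 1)) and \
           not any(lst.count(x) > 2 for x in lst)

assert is_sorted_buggy([1, 2, 2, 3, 3, 4]) is False   # the failing visible test
assert is_sorted_fixed([1, 2, 2, 3, 3, 4]) is True
assert is_sorted_fixed([1, 2, 2, 2]) is False          # more than one duplicate of 2
```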
Takeaways
LDB is a paradigm shift for automated code debugging. By moving from high-level, post-execution feedback to fine-grained, step-by-step verification of runtime states, it provides LLMs with the grounded information they need to reason accurately about program logic. It effectively gives the LLM a debugger and teaches it how to use it, proving that the path to more reliable AI code generation lies in emulating the powerful, methodical processes of human developers.
11. Large Language Model Guided Self-Debugging Code Generation
This paper introduces PyCapsule, a new framework for automated Python code generation that significantly improves accuracy and computational efficiency. Unlike complex multi-agent systems, PyCapsule uses a streamlined two-agent architecture: a programmer agent (an LLM) and an executor agent (a non-LLM validator), complemented by three specialised modules. This design enables an effective self-debugging loop where the model iteratively generates, tests, and fixes its own code. PyCapsule achieves state-of-the-art results on major benchmarks like HumanEval and BigCodeBench, even allowing smaller models to outperform much larger ones. A key finding is that the effectiveness of self-debugging decreases exponentially with each attempt, suggesting that after a few tries, the remaining bugs are too complex for the model to solve with simple error feedback.
1. The Core Problem
While Large Language Models (LLMs) are increasingly proficient at generating code from natural language, they face several challenges:
- Reliability and Accuracy: LLMs often produce code with subtle bugs or syntax errors.
- Inefficiency of Current Solutions: Many state-of-the-art systems (e.g., MapCoder, AgentCoder) use multiple LLM-based agents for tasks like planning, coding, and debugging. This approach, while powerful, is computationally expensive, requiring numerous API calls and complex coordination between agents, making it slow and costly for real-world use.
- Poor Error Handling: Models often struggle to effectively use the raw, verbose error messages produced by compilers, hindering their ability to debug effectively.
PyCapsule aims to solve these problems by creating a lightweight, efficient, and robust framework that generates high-quality code with minimal overhead.
2. The Solution: The PyCapsule Framework
PyCapsule’s innovation lies in its simple yet effective architecture, which can be broken down into two main components: a two-agent pipeline and specialised supportive modules.
A. The Two-Agent Pipeline
Instead of a crowd of LLM agents, PyCapsule uses just two, each with a distinct role.
- Programmer Agent (The “Smart” Coder): This is an LLM (like GPT-4 or Qwen). It’s responsible for the “thinking” tasks. It operates in two modes:
- Generation Mode: Given a problem description, it uses Chain-of-Thought (CoT) reasoning to analyse the problem, plan a solution, and write the initial Python code.
- Fix Mode: If the code fails, this agent receives the original problem, its previous failed code, and a processed error message. It then attempts to debug the code.
- Executor Agent (The “Diligent” Tester): This is not an LLM. It is a deterministic, automated system built around a Docker container. Its job is to safely execute the code, run it against test cases, and report back.
- Analogy: Think of the Programmer Agent as a senior developer who writes and refines code. The Executor Agent is like an automated Quality Assurance (QA) pipeline. It doesn’t write code, but it meticulously sets up a clean test environment (the Docker container), runs the tests, and provides a clear “pass” or “fail” report. This division of labour is key to PyCapsule’s efficiency.
B. The Supportive Modules
To make the process more reliable and reduce the LLM’s workload, PyCapsule uses three deterministic (non-AI) modules that handle specific, repeatable tasks.
- Signature Converter: Some datasets (like MBPP) don’t provide a clean function signature. This module automatically creates one by looking at the first test case.
- Example: If a test case asserts my_function(5) == 25, the module generates a template like def my_function(arg_int: int): for the LLM to fill in. This prevents the LLM from hallucinating function names or argument structures, ensuring stability.
- Example Call Detector: LLMs sometimes include example function calls (e.g., print(my_function(5))) in their code output. These can cause infinite loops or interfere with the testing framework. This module simply scans the code and removes any such calls before execution, acting as a safety mechanism.
- Error-Handling Module: This is a crucial component for effective debugging. Raw compiler errors are often long and cryptic. This module processes them into concise, actionable feedback for the Programmer Agent.
- It identifies the error type (e.g., AssertionError, RecursionError).
- It filters out irrelevant parts of the error traceback.
- It provides a simple natural language summary, like: “Your generated solution failed a test case. Please improve the logic of your solution.”
- Analogy: This module acts like a technical lead who translates a confusing stack trace into a clear, one-sentence bug report for a junior developer. This helps the Programmer Agent focus on the actual problem.
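As a rough sketch of what such an error-handling module might do (my reconstruction, not the paper’s code), the helper below runs a candidate solution against a single test assertion and condenses any failure into the error type, the failing line number, and a short natural-language hint:

```python
import traceback

def condense_error(exc: Exception) -> str:
    """Reduce a raw exception to a short, actionable message for the programmer agent."""
    error_type = type(exc).__name__
    frames = traceback.extract_tb(exc.__traceback__)
    location = f" at line {frames[-1].lineno}" if frames else ""
    hints = {
        "AssertionError": "Your solution failed a test case. Please improve its logic.",
        "RecursionError": "The solution does not terminate. Add or fix a base case.",
    }
    hint = hints.get(error_type, "Fix the error without changing the function signature.")
    return f"{error_type}{location}. {hint}"

def run_candidate(code: str, test: str) -> str | None:
    """Execute the candidate code plus one assertion; return condensed feedback, or None on success."""
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the candidate function
        exec(test, namespace)   # run the test assertion
        return None
    except Exception as exc:
        return condense_error(exc)

print(run_candidate("def add(a, b):\n    return a - b\n", "assert add(2, 2) == 4"))
# -> "AssertionError at line 1. Your solution failed a test case. Please improve its logic."
```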
3. The Code Generation Workflow
The process is an iterative loop:
- Generate: The Programmer Agent receives a problem and generates the first version of the code.
- Process: The code is cleaned by the Example Call Detector, and any necessary libraries are noted.
- Execute: The Executor Agent runs the code in a secure Docker container against all test cases.
- Evaluate & Loop:
- If all tests pass, the code is returned as the final solution.
- If any test fails, the Error-Handling module refines the error message. The workflow loops back to step 1, but this time the Programmer Agent is in “Fix Mode” and receives the refined feedback to generate a new version.
- This loop repeats for a maximum of five attempts.
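Pulling the loop together, here is a hedged sketch of the outer workflow. `programmer_agent` stands in for the LLM call and `run_tests` for the Docker-based executor (something like the `run_candidate` helper sketched earlier could play that role); the real framework also applies the signature converter and example call detector between steps.

```python
MAX_ATTEMPTS = 5   # the paper caps self-debugging at five attempts

def solve(problem: str, tests: list[str], programmer_agent, run_tests) -> str | None:
    """Generate, execute, and repair code until the tests pass or the attempt budget runs out."""
    code, feedback = "", None
    for _ in range(MAX_ATTEMPTS):
        if feedback is None:
            code = programmer_agent(mode="generate", problem=problem)
        else:
            code = programmer_agent(mode="fix", problem=problem,
                                    previous_code=code, error=feedback)
        feedback = run_tests(code, tests)   # None means every test passed
        if feedback is None:
            return code
    return None   # unresolved after five attempts
```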
4. Key Results and Findings
PyCapsule was tested on several standard coding benchmarks (HumanEval, MBPP, BigCodeBench) and demonstrated superior performance.
- State-of-the-Art Performance: PyCapsule achieved significant success rate improvements over existing methods:
- Up to 5.7% on HumanEval
- Up to 10.3% on HumanEval-ET (with more test cases)
- Up to 24.4% on BigCodeBench
- Model Empowerment: The framework is so effective that it elevates the performance of smaller models. With PyCapsule, the Qwen-2.5-Coder-7B model achieved a 94.1% success rate on HumanEval, outperforming the standalone, much larger Qwen-32B model (92.7%).
- Finding: The Exponential Decay of Self-Debugging: The paper’s most interesting analytical insight is how the effectiveness of debugging diminishes over time.
- What it means: The first debugging attempt after an initial failure is highly likely to fix the problem. The second attempt is less likely to succeed, the third even less so, and so on. The success rate follows an exponential decay curve.
- Analogy: Imagine fixing bugs in a program. The first pass catches all the easy typos and simple logic errors. The bugs that remain are much deeper and more complex. Another debugging attempt with the same kind of feedback (the error message) is less likely to solve these harder problems.
- This finding empirically justifies the decision to limit debugging to five attempts, as the chances of success become negligible after that point.
5. Why This Paper Matters (Contributions)
- Efficiency and Cost-Effectiveness: PyCapsule proves that high performance in code generation doesn’t require a large, complex ensemble of LLM agents. By using a single LLM and smart, deterministic modules, it drastically reduces the number of API calls (e.g., averaging ~1.4–2.1 calls per problem vs. 12–17 for other systems), saving significant time and money.
- A Blueprint for Lightweight AI Systems: The two-agent architecture provides a model for building powerful but efficient AI systems. It demonstrates how to offload repeatable, structured tasks to non-AI modules, letting the LLM focus on high-level reasoning.
- Deeper Understanding of Self-Debugging: The analysis of diminishing returns provides valuable insight into the limits of current self-correction mechanisms in LLMs, suggesting that more advanced feedback or different reasoning strategies are needed to solve the hardest problems.
12. Guided code generation with LLMs: A multi-agent framework for complex code tasks
This paper introduces a “guided code generation” framework that treats complex coding tasks like an assembly line, systematically breaking them down into tiny, manageable pieces and then reassembling them. It’s designed to overcome two fundamental weaknesses of Large Language Models (LLMs): their inability to reason over long contexts and their poor compositional ability (combining simple skills to solve complex problems). The framework uses a multi-agent system where a “Generalist” agent acts as an architect, a “Code” agent as a builder, and “Critic” and “Tester” agents as quality inspectors. By solving problems from the bottom up, starting with the simplest components, it dramatically improves reliability. When tested on the HumanEval benchmark, this method improved the accuracy of a small, quantised Llama 3.1 8B model by 23.79%, and even succeeded at generating complex software that much larger models like GPT-4o and Gemini 1.5 Pro failed to produce.
1. The Core Problem: Why LLMs Fail at Complex Code
The paper argues that despite their strengths, LLMs are fundamentally flawed for complex, real-world software development due to two key limitations:
- Poor Long-Context Reasoning: While modern LLMs have massive context windows (millions of tokens), they struggle to reason with all that information simultaneously. They are excellent at “needle-in-a-haystack” tasks (finding a specific fact in a large document) but fail when asked to synthesise information from many different parts of that context to perform a complex task. Performance degrades uniformly as the context length increases.
- Analogy: An LLM is like a library assistant who can instantly find any sentence in any book in a vast library. However, if you ask them to write a new, coherent chapter that synthesises complex arguments from 100 different books, they become overwhelmed and produce a weak, disjointed summary.
- Limited Compositional Ability: This is the paper’s central focus. LLMs are good at performing single, learned skills (e.g., writing a for loop, defining a function) but struggle to combine multiple simple skills in a specific sequence to solve a novel, multi-step problem. The paper cites research showing that simply making models bigger does not solve this issue for complex reasoning tasks.
2. The Solution: A Multi-Agent Assembly Line for Code
The proposed framework tackles these issues by enforcing a highly structured, decomposition-based process. It uses multiple specialised agents in a three-phase workflow.
Step 1: The Architect (Hierarchical Problem Decomposition)
A Generalist Agent first takes the complex coding task and recursively breaks it down into a tree of smaller sub-problems. This process continues until it reaches “atomic” units: simple, individual functions that cannot be broken down further (these are the “leaf” nodes of the tree).
Step 2: The Builder (Bottom-up Code Generation)
A Code Agent then starts building the solution from the bottom of the tree upwards.
- Solve the Leaves: It first generates the code for the simplest, atomic functions at the leaf nodes. Each tiny piece of code is immediately tested and validated.
- Compose Upwards: To solve a “parent” node, the Code Agent is given the function signatures and documentation of its already-built “child” nodes, but crucially, it cannot see their implementation details. It must compose the solution using only these high-level interfaces. This process repeats until the agent reaches the root of the tree, and the final solution is assembled.
Step 3: The Inspectors (Multi-Agent Validation)
Throughout the process, two other agents ensure quality:
- Critic Agent: Reviews the code for logic, efficiency, and correctness, providing qualitative feedback.
- Tester Agent: Runs automated tests to find bugs and provides quantitative feedback.
This continuous feedback loop ensures that errors are caught early at the component level, preventing them from propagating into the final, complex solution.
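To make the bottom-up phase concrete, here is a schematic sketch under some assumptions: the decomposition tree is a plain dict and `llm_generate` is a hypothetical callable standing in for the Code Agent; the paper’s Critic and Tester agents, prompts, and validation loop are far richer. Each node is solved only after its children, and a parent sees only the children’s signatures and docstrings, never their implementations.

```python
def solve_node(node: dict, llm_generate) -> dict:
    """Recursively solve a decomposition tree bottom-up.
    Each node looks like {"spec": str, "children": [...]}; leaves have no children."""
    solved_children = [solve_node(child, llm_generate) for child in node.get("children", [])]
    # Parents are composed against interfaces only: signatures and docs, no implementations.
    child_interfaces = [{"signature": c["signature"], "doc": c["doc"]} for c in solved_children]
    result = llm_generate(spec=node["spec"], interfaces=child_interfaces)
    # In the full framework, Critic and Tester agents would validate `result` before moving up.
    return {
        "signature": result["signature"],   # e.g. "def tokenize(expr: str) -> list[str]"
        "doc": result["doc"],
        "code": result["code"],
        "children": solved_children,
    }
```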
3. Key Results and Findings
- Quantitative Success: The framework achieved a 56.2% Pass@1 score on HumanEval using a quantised Llama 3.1 8B model. This is a 23.79% relative improvement over the same model’s direct, one-shot generation score of 45.4%. This shows that a better process can dramatically elevate the performance of even smaller models.
- Qualitative Breakthrough: In a more complex task, the framework successfully built a complete mathematical function evaluator — including a lexer, parser, and evaluation engine with error handling. In contrast, frontier models like GPT-4o and Gemini 1.5 Pro either refused to attempt the task or produced overly simplistic, incorrect solutions. This highlights the framework’s ability to handle complexity that stumps even the most powerful single-shot models.
4. Why This Paper Matters (Contributions)
- Process Over Power: This work strongly suggests that the key to unlocking LLMs for complex software engineering isn’t just about building bigger models, but about implementing smarter, more structured generation processes inspired by traditional software engineering principles (like modularity and unit testing).
- A New Theoretical View of Code Generation: The paper proposes reframing code generation as a dual problem:
- Generating atomic units is an information retrieval problem (the LLM finds the closest pattern it has seen).
- Combining these units is a composition problem that should be handled at the interface level, not the implementation level.
- Practical Framework for Small Models: It provides a clear blueprint for how to get high-quality results from smaller, more accessible models by mitigating their inherent weaknesses through a clever agentic structure. This makes advanced code generation more feasible for those with limited computational resources.
13. Revisit self-debugging with self-generated tests for code generation
The paper critically re-evaluates the efficacy of self-debugging in code generation, specifically in the realistic scenario where models must rely on their own self-generated tests instead of pre-existing oracle tests. While prior works like Reflexion and AlphaCodium have shown promise, their evaluation methodologies often relied on hidden oracle tests, obscuring the true capability and inherent limitations of debugging with self-generated tests. This work aims to provide a more transparent and practical assessment by isolating the self-debugging process entirely to self-generated feedback.
The authors propose and formalise two distinct paradigms for utilising execution feedback, based on the type of information the LLM receives:
Post-Execution Self-Debugging: This is the conventional approach. The model generates code and a test suite (input-output pairs). It then executes the code on the test inputs. The feedback consists of the outcome: whether the program’s output matches the self-generated expected output, or any runtime errors.
- Analogy: This is like a programmer writing a function and a unit test for it. They run the test. If it fails (e.g., assert add(2, 2) == 5 fails because the output is 4), they are given the failed assertion (input: (2,2), expected: 5, got: 4) and told to fix the code. The critical weakness here is if the self-generated test itself is wrong (e.g., the programmer mistakenly wrote assert add(2, 2) == 5).
In-Execution Self-Debugging: This paradigm avoids relying on the final pass/fail signal. Instead, the model is given a detailed execution trace for a given test input. The program is broken down into basic blocks (sequences of operations between control flow changes), and the model sees the state of all variables before and after each block.
- Analogy: Instead of just seeing the final failed test, the programmer uses a step-by-step debugger. They don’t need to trust the expected_output of the test. They can inspect the program’s internal logic as it runs: “Okay, a is 2, b is 2. After result = a + b, the value of result is 4. This seems logically correct according to the function’s purpose.” This allows the model to reason about the correctness of the process rather than just the correctness of the outcome, thus sidestepping the issue of a faulty test.
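A minimal illustration of the post-execution paradigm (not the paper’s code): the program is run against self-generated tests, and the only feedback is which assertion failed. If the expected output in the test is itself wrong, this feedback misleads the debugger, which is exactly the self-testing bias the paper analyses.

```python
def post_execution_feedback(program: str, tests: list[tuple[str, object]]) -> list[str]:
    """Run self-generated (call_expression, expected_output) tests and report mismatches.
    The expected outputs come from the model itself and may be wrong."""
    namespace: dict = {}
    exec(program, namespace)                       # define the candidate function(s)
    feedback = []
    for call_expr, expected in tests:
        try:
            got = eval(call_expr, namespace)
        except Exception as exc:                   # runtime errors are also fed back
            feedback.append(f"{call_expr} raised {type(exc).__name__}: {exc}")
            continue
        if got != expected:
            feedback.append(f"{call_expr}: expected {expected!r}, got {got!r}")
    return feedback

program = "def add(a, b):\n    return a + b\n"
tests = [("add(2, 2)", 4), ("add(2, 2)", 5)]       # the second self-generated test is wrong
print(post_execution_feedback(program, tests))     # a correct program is reported as failing
```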
3. Experimental Setup
- Models: GPT-4o, Claude-3.5-Sonnet, Llama-3-70B-Instruct, and Qwen2.5-Coder-7B-Instruct.
- Benchmarks: A mix of basic problems (HumanEval, MBPP) and more difficult competitive programming problems (LiveCodeBench).
- Methodology: All experiments use greedy decoding (temperature=0). The model first generates a program and 10 self-generated tests. Then, it attempts to debug the program for up to two iterations using one of the two paradigms.
4. Key Findings & Analysis
Finding 1: Post-Execution Self-Debugging Struggles on Basic Problems.
Contrary to expectations, using post-execution feedback with self-generated tests often degrades performance on HumanEval and MBPP compared to the initial one-pass generation (Tables 1 vs. 2). For example, Claude-3.5-Sonnet’s HumanEval+ score drops from 89.0% to 79.3% after two iterations with detailed feedback.
Finding 2: The “Self-Testing Bias” Causes Inconsistent Performance.
This is the paper’s central insight. The performance degradation is caused by a bias introduced by unreliable self-generated tests. The authors analyse the outcomes of self-testing using a confusion matrix:
- False Positive (FP): A buggy program passes the flawed tests.
- False Negative (FN): A correct program fails the flawed tests.
The analysis (Figure 2) reveals a crucial dynamic:
- On basic problems (HumanEval/MBPP), models are often correct in their initial attempt. When this correct code is evaluated against their own imperfect tests, it frequently results in a False Negative. The model is then prompted to “fix” perfectly good code to satisfy a buggy test, leading to performance degradation.
- On competitive problems (LiveCodeBench), the initial code is more likely to be incorrect. Therefore, when a test fails, it is more likely to be a True Negative. In this case, even an imperfect test can provide a useful signal to fix a genuine bug, leading to modest performance improvements (e.g., GPT-4o on LiveCodeBench, Table 4). This explains the inconsistent results across different task difficulties.
Finding 3: In-Execution Self-Debugging is More Robust and Effective.
By focusing on intermediate execution traces rather than final outputs, the in-execution paradigm largely mitigates the self-testing bias. It allows the model to reason about the program’s logic directly, without being misled by incorrect expected outputs in the self-generated tests.
- Results: This approach shows consistent, albeit modest, improvements across both basic and competitive benchmarks for most models (Tables 5 & 6). For instance, GPT-4o’s MBPP+ score improves from 76.5% to 79.1% after two iterations, whereas the post-execution approach showed no improvement.
5. Conclusion & Implications for Experts
This work serves as a crucial reality check for the self-debugging community. The key takeaways are:
- Naïve post-execution self-debugging is unreliable. Simply feeding back pass/fail signals from self-generated tests is prone to a negative bias that can actively harm performance, especially on tasks where the model’s initial accuracy is high.
- The quality of generated tests is the bottleneck. Before leveraging execution feedback, the community must focus on improving the reliability of LLM-generated tests.
- In-execution reasoning is a promising path forward. Shifting from outcome-based feedback to process-based feedback (i.e., execution traces) offers a more robust mechanism for self-correction. This suggests future work should focus on designing better trace representations and improving the model’s ability to reason over these complex, fine-grained signals. It aligns with a broader trend of moving towards more deliberative, step-by-step reasoning in LLMs.
14. Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications
This paper provides a comprehensive, structured survey of the current landscape of Large Language Models (LLMs) in the domain of code generation. It is not a research paper presenting novel techniques but rather a consolidation of existing knowledge, organised into four main pillars. The goal is to offer a holistic overview for researchers and practitioners, mapping out the primary obstacles, the methods to overcome them, the standards for measuring success, and the practical applications in the field.
2. Pillar 1: Challenges and Limitations
The survey categorises the primary difficulties LLMs face in code generation into four areas:
- Resource Constraints: This covers the immense computational cost of both training (e.g., Llama 3.1-405B requiring 31M GPU hours) and inference (e.g., LLaMA-70B needing 140GB of VRAM). It also highlights the performance-vs-efficiency trade-offs in quantisation methods (like GPTQ, AWQ) and notes the counter-intuitive finding that smaller models can outperform larger ones in budget-constrained (“small budget regime”) scenarios.
- Syntactic and Semantic Errors: A key insight is that modern LLMs rarely produce syntactically invalid code (<10% of errors). The dominant failure mode is semantic errors (>50% on complex benchmarks), where the code is syntactically correct but logically flawed, misunderstands requirements, or hallucinates functionality.
- Analogy: A syntactic error is a typo or grammatical mistake. A semantic error is a grammatically perfect sentence that states something factually incorrect. The latter is far more difficult to detect and correct.
- Biases: The paper discusses two forms of bias:
- Multi-Lingual Bias: Models exhibit significant performance degradation when prompted in non-English languages (e.g., a 17.2% Pass@1 drop for Chinese prompts) or when generating code in less-represented programming languages.
- Social Bias: Larger, more capable models like Codex can exhibit more severe social biases (e.g., associating certain religions with derogatory terms), revealing a troubling trade-off between performance and fairness.
- Security Risks: Vulnerabilities are introduced from two primary sources:
- Training Data: Models are trained on vast amounts of unsanitized code from public repositories like GitHub, which are rife with existing vulnerabilities (e.g., 81% of codebases in one report had at least one).
- Model Weakness: LLMs often fail to handle security-critical scenarios, with studies showing tools like GitHub Copilot producing vulnerable code in ~40% of relevant cases.
3. Pillar 2: Fine-Tuning Techniques
This section surveys methods to enhance and adapt pre-trained LLMs for coding tasks:
- Domain-Specific Datasets: This covers techniques like instruction-tuning on code optimisation tasks (LLaMoCo), the use of Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and QLoRA to adapt models with minimal computational cost, and data pruning, which surprisingly can improve performance by removing redundant or low-quality data from the fine-tuning set.
- Feedback Mechanisms: This focuses on iterative refinement loops.
- ClarifyGPT: An interactive framework where the model asks clarifying questions to resolve ambiguity in the user’s prompt before generating the final code.
- RLEF (Reinforcement Learning from Execution Feedback): Fine-tuning using signals from a code executor (pass/fail on unit tests, compiler errors). This provides a more objective and scalable reward signal than standard RLHF.
- Analogy: RLEF is like getting feedback directly from a compiler, which is objective and automated, whereas standard RLHF is like getting feedback from a human code reviewer, which can be subjective and slower.
- Prompt Engineering: This explores advanced prompting strategies beyond simple zero-shot. Key examples include Chain-of-Thought to generate “solution plans” before code (CodePLAN), retrieval-augmented prompting to find relevant code examples before generation (AceCoder), and systematic investigation of different prompt styles for generating secure code.
4. Pillar 3: Evaluation (Metrics and Benchmarks)
The survey distinguishes between the “how” (metrics) and the “what” (benchmarks) of evaluation.
Metrics:
- CodeBLEU: An improvement over BLEU that incorporates code-specific features like syntactic matching (via Abstract Syntax Trees — ASTs) and semantic matching (via data-flow analysis).
- pass@k: The industry standard for functional correctness, measuring the probability that at least one of k generated samples passes all unit tests. (The standard unbiased estimator is sketched after this list.)
- pass-ratio@n: A more granular metric than pass@k that calculates the average percentage of test cases passed across n generated solutions, effectively giving partial credit.
- ICE-Score: A novel approach that uses a powerful LLM (e.g., GPT-4) as an evaluator to score the correctness and usefulness of code generated by another model.
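For reference, pass@k numbers like those quoted throughout these summaries are typically computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021): generate n samples per problem, count the c correct ones, and estimate the probability that at least one of k randomly drawn samples is correct. The example numbers below are arbitrary.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    the probability that at least one of k samples drawn from n is correct."""
    if n - c < k:
        return 1.0   # not enough incorrect samples to fill k slots, so a correct one is guaranteed
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 200 samples per problem, 37 of them correct
print(pass_at_k(200, 37, 1))    # ≈ 0.185
print(pass_at_k(200, 37, 10))   # ≈ 0.88
```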
Benchmarks:
- HumanEval: The foundational benchmark for function-level Python code synthesis.
- ClassEval: A more challenging benchmark focused on generating entire classes, testing a model’s ability to maintain state and inter-method consistency.
- SWE-bench: A highly realistic benchmark consisting of real-world GitHub issues. It highlights the vast gap between current model capabilities and real-world software engineering, with the best models solving less than 2% of the tasks.
- BigCodeBench: A recent, large-scale benchmark designed to be “contamination-free” and require more complex library usage, aiming to be a more robust successor to HumanEval.
5. Pillar 4: Applications
This section categorises real-world LLM-powered tools and models based on their primary function:
- Foundational (Completion & Generation): Includes tools like GitHub Copilot (code completion), Code Llama (a family of open models with fill-in-the-middle capabilities), and ToolGen (a framework to teach LLMs to use external autocompletion tools).
- Advanced (Search & Competitive Programming): Covers RepoRift (improves code search using RAG), CodeBERT (a bimodal model pre-trained on both natural and programming languages), and AlphaCode (a system designed for competitive programming that uses massive-scale generation followed by filtering and clustering).
- Auxiliary (Debugging & Translation): Showcases the capabilities of models like GPT-4 in debugging and code translation, Codex’s specific aptitude for bug fixing (evaluated on QuixBugs), and Flourine, a specialised tool that uses cross-language fuzzing to validate and refine LLM-based code translation.
15. MemoCoder: Automated function synthesis using LLM-supported agents
MemoCoder is a multi-agent framework for automated code generation that addresses the common failure of LLMs in tasks requiring iterative debugging and complex reasoning. Unlike standard zero-shot prompting or amnesiac self-repair loops, MemoCoder introduces a persistent memory mechanism and a supervisory agent to learn from past fixes across different tasks. The framework consists of four specialized agents (Planner, Code Writer, Test Executor, Mentor) and a central Fixing Knowledge Set. The key innovation is the Mentor Agent, which both retrieves relevant past fixes (a RAG-like function) and synthesizes high-level, reusable repair strategies from recurring error patterns. Experiments on MBPP, HumanEval, and LiveCodeBench show that MemoCoder consistently outperforms zero-shot and self-repair baselines, improving Pass@10 by up to 12.1% and Pass@50 by up to 14.5%, demonstrating the value of structured collaboration and persistent, cross-task learning.
1. Problem Statement
While LLMs excel at generating syntactically correct code for well-defined problems, they struggle with:
- Iterative Refinement: They often fail to correct their own logical errors, runtime exceptions, or infinite loops through simple self-correction prompts.
- Lack of Memory: Standard self-repair strategies are “amnesiac” — they don’t retain knowledge of how a specific type of error was fixed in a previous task. This forces the model to rediscover solutions repeatedly.
- Cost of Adaptation: Fine-tuning models to learn from new error patterns is computationally expensive and inflexible.
2. The MemoCoder Framework: Architecture and Workflow
MemoCoder decomposes the code generation process into a collaborative effort among four specialized agents (three LLM-based, plus a deterministic test executor), orchestrated around a persistent memory module.
Analogy: Imagine a software company with a junior developer, a QA engineer, a project planner, and a senior mentor.
- The junior developer (Code Writer) writes the code.
- The QA engineer (Test Executor) runs the tests.
- The planner (Planner) outlines the approach before coding starts.
- The senior mentor (Mentor Agent) has a vast “playbook” of past bugs and fixes (Fixing Knowledge Set). When the junior dev gets stuck, the mentor doesn’t just say “try again”; they consult the playbook for similar historical issues and offer concrete advice. Over time, the mentor also notices recurring mistakes and writes new, high-level guidelines for the whole team.
Core Components:
- Planner Agent: To mitigate LLM hallucination and anchor the reasoning process, this agent first generates three distinct, high-level algorithmic plans (in natural language) for solving the given problem.
- Code Writer Agent: This agent selects one of the plans and generates the initial Python function. It is also the agent responsible for iteratively fixing the code based on feedback.
- Test Executor Agent: A deterministic module that runs the generated code against a pre-defined “guiding assertion” test case. It classifies failures into four types: Compile Error, Exception, Assertion Failure, and Timeout.
- Fixing Knowledge Set (The Memory): A persistent database that stores successful repair instances as (buggy_code, error_context, fixed_code) tuples. This knowledge is accumulated from a large dataset (APPS) and grows during the evaluation phase.
- Mentor Agent (The Key Innovation): This supervisory agent has a dual role:
- Short-Term Guidance (RAG-like Retrieval): When the Test Executor reports a failure, the Mentor retrieves up to 10 semantically similar, previously successful fixes from the Fixing Knowledge Set. Similarity is determined by a Longest Sequential Matching heuristic on error messages. This provides the Code Writer with relevant examples for the current fix. (A small retrieval sketch follows this component list.)
- Long-Term Strategy Refinement (Meta-Learning): The Mentor periodically analyzes the accumulated fixes for a given error type (e.g., Assertion Failure). It summarizes recurring error patterns and distills them into high-level, reusable “fixing suggestions.” This allows the system to generalize beyond simple example retrieval and improve its strategic approach over time.
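The paper describes retrieval via a “Longest Sequential Matching” heuristic over error messages; the sketch below approximates that idea with Python’s difflib.SequenceMatcher, which scores the longest matching subsequences between two strings. The knowledge-set format follows the (buggy_code, error_context, fixed_code) tuples described above; the field names are mine.

```python
from difflib import SequenceMatcher

def retrieve_fixes(error_message: str, knowledge_set: list[dict], top_k: int = 10) -> list[dict]:
    """Return up to top_k past fixes whose stored error context best matches the new error.
    difflib's ratio is used here as a stand-in for the paper's matching heuristic."""
    scored = [
        (SequenceMatcher(None, error_message, entry["error_context"]).ratio(), entry)
        for entry in knowledge_set
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]

knowledge_set = [
    {"buggy_code": "...", "error_context": "IndexError: list index out of range", "fixed_code": "..."},
    {"buggy_code": "...", "error_context": "RecursionError: maximum recursion depth exceeded", "fixed_code": "..."},
]
print(retrieve_fixes("IndexError: list index out of range in line 4", knowledge_set, top_k=1))
```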
Workflow:
- Plan: The Planner generates multiple strategies.
- Implement: The Code Writer generates the initial code from one plan.
- Test: The Test Executor runs the code against a guiding test case.
- Loop (on failure):
- The error type and message are sent to the Mentor.
- The Mentor retrieves similar past fixes and provides a high-level suggestion.
- The Code Writer receives the buggy code, error message, and the Mentor’s guidance to generate a revised solution.
- This loop repeats for up to 50 iterations.
- Learn: Once a problem is solved, the successful fix is added to the Fixing Knowledge Set.
3. Experimental Setup & Evaluation
- Base Models: LLaMA 3.1-8B-Instruct and Qwen 2.5-32B.
- Knowledge Accumulation: The Fixing Knowledge Set was bootstrapped by running MemoCoder on the APPS dataset.
- Evaluation Benchmarks:
- MBPP & HumanEval: Standard function synthesis benchmarks.
- LiveCodeBench (LCB): A more challenging and contamination-controlled benchmark. The authors specifically use a subset of problems confirmed to be released after the models’ training cut-off dates.
- Baselines:
- Zero-Shot: Direct code generation from the problem description.
- Self-Repair: A standard iterative loop where the LLM tries to fix its own errors without external memory or specialized agents.
- Metric: Pass@k, the probability that at least one correct solution is found within k attempts.
4. Key Results and Findings
RQ1: Overall Effectiveness
MemoCoder consistently outperforms both Zero-Shot and Self-Repair baselines across all datasets, especially at higher values of k.
- Improvements in Pass@10 ranged from 3.1% to 12.1%.
- Improvements in Pass@50 ranged from 1.4% to 14.5%.
- This demonstrates that the guided, memory-augmented refinement process is significantly more effective than unguided self-correction, converging on correct solutions more reliably.
RQ2: Ablation Study (Impact of Components)
The study confirmed that each novel component plays a critical and distinct role.
- Without Planner: Pass@1 drops significantly. This shows that generating high-level plans is crucial for producing a strong initial solution.
- Without RAG (Mentor’s retrieval): Pass@10 and Pass@50 drop significantly. The example-based guidance from past fixes is vital for effective iterative repair.
- Without Error Pattern Analysis (Mentor’s strategy refinement): Pass@10 and Pass@50 also drop. This highlights the value of the Mentor’s meta-learning capability in providing generalized, high-quality fixing advice.
RQ3: Error Evolution Dynamics
Analysis of the repair process revealed interesting patterns:
- Initial Errors: Over 50% of initial generations fail to compile (Not Compiled), but many (17%) are resolved within the first two repair iterations.
- Persistent Errors: Not Compiled (78% persistence) and Test Failed (77% persistence) are the most stubborn error types, often remaining the same across multiple repair attempts.
- Error Transitions: Timeout errors have a high probability (38%) of transitioning into Test Failed errors. This suggests that the model’s attempt to fix an efficiency issue often introduces a logical bug, making the code faster but incorrect.
5. Conclusion and Implications
MemoCoder successfully demonstrates that an LLM-based agent system can achieve superior code generation performance by incorporating persistent memory and specialized, collaborative roles. The introduction of a Mentor Agent that both retrieves specific examples and synthesizes general strategies is the framework’s core contribution. This moves beyond simple self-repair to a more robust, learning-based system that accumulates and reuses knowledge, mimicking the learning process of a human development team.
16. Distilling LLM agent into small models with retrieval and code tools
This paper introduces Agent Distillation, a framework designed to transfer the task-solving capabilities of large, tool-using LLM agents to much smaller language models (sLMs). The core thesis is that standard Chain-of-Thought (CoT) distillation, which teaches sLMs to mimic static reasoning text, is fundamentally limited; sLMs trained this way often fail on tasks requiring up-to-date factual knowledge or precise computation, as they are forced to memorize reasoning patterns rather than learn a dynamic problem-solving process. Agent Distillation instead fine-tunes sLMs on interactive agent trajectories (thought, action, observation), teaching them how to use tools (retrieval for facts, code execution for math) rather than just what to say. The authors enhance this process with two key techniques: a first-thought prefix to improve the quality of teacher-generated trajectories and self-consistent action generation to increase the robustness of the student agent at inference time. Across eight factual and mathematical reasoning benchmarks, they show that this approach allows sLMs as small as 0.5B-3B to achieve performance competitive with, or even superior to, next-tier larger models (1.5B-7B) trained with CoT distillation, demonstrating a viable path toward creating efficient, capable, and practical sLM-based agents.
1. Problem Statement
The primary challenge is to bridge the performance gap between computationally expensive LLMs and more practical sLMs without losing complex reasoning abilities. The dominant approach, CoT distillation, trains an sLM to reproduce the step-by-step textual reasoning of a larger “teacher” LLM. However, this method has two critical weaknesses:
- Factual Hallucination: sLMs have limited parametric knowledge. When distilled on CoT traces that rely on the teacher’s internal knowledge, the sLM learns to “imitate knowing” but will often hallucinate when faced with new questions requiring facts it hasn’t memorized.
- Computational Inaccuracy: sLMs struggle with precise multi-step arithmetic and symbolic manipulation. Simply mimicking a CoT trace that includes calculations does not imbue the sLM with the actual ability to compute.
Essentially, CoT distillation teaches a model to talk like a reasoner, whereas Agent Distillation teaches it to act like one by offloading knowledge and computation to external tools.
2. The Agent Distillation Framework
The core idea is to shift the distillation target from a static text rationale to a dynamic, interactive trajectory.
Analogy: Imagine teaching a culinary student.
- CoT Distillation is like giving the student a recipe (a static text trace). The student memorizes the steps: “Add 2 tsp salt, then stir for 3 mins.” They can replicate this specific recipe, but they don’t learn why or what to do if an ingredient is missing.
- Agent Distillation is like an apprenticeship. The master chef says, “First, I need to check if we have salt” (thought), goes to the pantry (action: retrieve), and sees there is no salt (observation). “Okay, no salt. I’ll use soy sauce as a substitute and adjust the quantity” (new thought). The student learns the process of problem-solving: how to identify knowledge gaps, use tools to fill them, and adapt based on the results.
Workflow:
- Teacher Trajectory Generation: A large teacher model (Qwen2.5-32B) is prompted to act as an agent (using a framework like CodeAct). It generates trajectories composed of (thought, action, observation) cycles to solve a problem. The action is a code snippet that can call a retrieval tool or perform calculations, and the observation is the output from executing that code.
- Student Fine-Tuning: A smaller student model is fine-tuned on these trajectories. The loss function is autoregressive next-token prediction, but critically, it is only applied to the tokens generated by the model (thought and action). The observation tokens, which come from the environment, are masked from the loss.
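A sketch of the loss-masking step under common causal-LM fine-tuning conventions (an assumption, not the paper’s released code): tokens the environment produced (observations) get the ignore label -100, so the student is only trained to predict its own thoughts and actions.

```python
IGNORE_INDEX = -100   # the label value most causal-LM losses (e.g. PyTorch cross-entropy) skip

def build_labels(segments: list[tuple[str, list[int]]]) -> tuple[list[int], list[int]]:
    """Concatenate (role, token_ids) segments of a trajectory into model inputs and labels.
    Thought/action tokens are supervised; observation tokens are masked out of the loss."""
    input_ids: list[int] = []
    labels: list[int] = []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role in ("thought", "action"):
            labels.extend(token_ids)                        # learn to generate these
        else:  # "observation" (and any prompt text)
            labels.extend([IGNORE_INDEX] * len(token_ids))  # never penalise environment output
    return input_ids, labels

# Toy trajectory with made-up token ids
trajectory = [
    ("thought", [11, 12, 13]),
    ("action", [21, 22]),
    ("observation", [31, 32, 33, 34]),
    ("thought", [41]),
]
inputs, labels = build_labels(trajectory)
print(labels)   # [11, 12, 13, 21, 22, -100, -100, -100, -100, 41]
```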
Key Technical Contributions:
The authors introduce two novel techniques to overcome challenges in this process:
- First-Thought Prefix (FTP): They observed that standard agent prompting can lead to suboptimal, action-oriented initial reasoning steps. In contrast, CoT prompting elicits better, more structured initial plans. FTP is a prompting method for the teacher that combines the best of both worlds:
- First, generate the initial reasoning step using a CoT prompt (e.g., “To solve this, I first need to find…”).
- Then, use this text as a prefix for the agent’s first thought, guiding it to start with a high-quality plan before generating actions.
This improves the quality of the training trajectories being distilled.
- Self-Consistent Action Generation (SAG): This is a test-time decoding strategy to make the student agent more robust. Instead of greedily decoding a single action, the agent:
- Samples N diverse thought-action candidates using nucleus sampling.
- Executes each candidate’s code snippet in an interpreter, filtering out any that produce syntax or runtime errors.
- Performs majority voting on the observations (execution results) from the valid candidates and selects the action that produced the most consistent output.
This uses test-time compute to filter out invalid code and increase the probability of a correct action.
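A minimal sketch of SAG at inference time, assuming a hypothetical `sample_actions` callable that draws N thought-action candidates from the student and an `execute` function that runs a candidate’s code and returns its output (or raises on failure): invalid candidates are filtered out and the survivors vote by execution result.

```python
from collections import Counter

def self_consistent_action(sample_actions, execute, n: int = 8):
    """Sample n candidate actions, drop those whose code fails to run,
    and keep the action whose execution result is most common."""
    results = []
    for action in sample_actions(n):          # each action is a code snippet (string)
        try:
            observation = execute(action)     # run in an interpreter / sandbox
        except Exception:
            continue                          # filter out syntax and runtime errors
        results.append((action, observation))
    if not results:
        return None, None
    # Majority vote over observations; repr() makes unhashable results comparable.
    counts = Counter(repr(obs) for _, obs in results)
    winning_obs = counts.most_common(1)[0][0]
    for action, obs in results:
        if repr(obs) == winning_obs:
            return action, obs
```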
3. Experimental Setup & Evaluation
- Models: Qwen2.5-32B-Instruct as the teacher; Qwen2.5-Instruct models (0.5B, 1.5B, 3B, 7B) as students.
- Tasks: Eight benchmarks across two domains:
- Factual Reasoning: HotpotQA (in-domain), Bamboogle, MuSiQue, 2WikiMultiHopQA (out-of-domain).
- Mathematical Reasoning: MATH (in-domain), GSM-Hard, AIME, OlymMATH (out-of-domain).
- Baselines:
- CoT Prompting (zero-shot).
- CoT Distillation.
- CoT Distillation + RAG (a strong baseline where retrieval is added to the CoT process).
- Tools: A Wikipedia 2018 dump for the retrieval tool and a standard Python interpreter for the code tool.
4. Key Results and Findings
- Agent Distillation Outperforms CoT Distillation: The distilled agents consistently beat CoT-distilled models of the same size, especially on out-of-domain tasks. This validates the hypothesis that learning to use tools provides better generalization than memorizing static rationales.
- sLMs Punch Above Their Weight: Agent Distillation enables sLMs to match or exceed the performance of CoT-distilled models that are 2–4x larger. For example, the 0.5B agent matches the 1.5B CoT model, and the 3B agent surpasses the 7B CoT model on average.
- FTP and SAG Provide Significant Gains: Both proposed techniques demonstrably improve performance. Analysis shows FTP is particularly helpful for complex, multi-step problems (e.g., high-difficulty MATH questions) by encouraging better initial planning. SAG is shown to drastically reduce the rate of parsing and execution errors in the generated code, especially for the smallest models.
- Tool Usage Dynamics:
- Larger distilled agents tend to make more retrieval calls, suggesting they learn a better policy for when to seek external information.
- The first-thought prefix can sometimes reduce retrieval calls, as it encourages the model to generate facts in its reasoning. This is a double-edged sword: it can be more efficient but also increases the risk of hallucination.
- Efficiency: Agent Distillation does not lead to a significant increase in token generation. Agents generate more tokens on factual tasks (due to multiple retrieval calls) but fewer on math tasks (by offloading complex calculations to concise code loops).
5. Conclusion and Implications
Agent Distillation presents a powerful and practical alternative to CoT distillation for creating capable sLMs. By teaching models to interact with tools, it effectively outsources the weaknesses of sLMs (limited knowledge, poor calculation) to reliable external modules. This allows the distillation process to focus on transferring the core, generalizable skill: the reasoning process required to break down a problem and decide which tool to use. The work demonstrates that even sub-billion parameter models can be endowed with sophisticated agentic behaviors, paving the way for more powerful, efficient, and deployable AI on edge devices and in resource-constrained environments.
17. From provable correctness to probabilistic generation: A comparative review of program synthesis paradigms
This bachelor’s thesis provides a comprehensive and chronological review of the field of program synthesis, charting its evolution through five major paradigms. The narrative traces the field’s central tension: the trade-off between the correctness guarantees of a synthesized program and the specification burden placed on the user.
The paper begins with Logic-Based (Deductive) Synthesis, which offers provably correct code but requires complete, formal logical specifications, a significant bottleneck. It then moves to Inductive Synthesis (PBE), which dramatically lowers the specification burden by learning from simple input-output examples, but at the cost of sacrificing correctness guarantees. The third paradigm, Sketch/Schema-Based Synthesis, presents a middle ground, enabling a human-computer synergy where the programmer provides a high-level structural “sketch” and an automated solver fills in the low-level details.
The review then documents the recent paradigm shift initiated by Large Language Models (LLMs), which leverage massive code corpora to generate programs from ambiguous natural language prompts. This approach offers unprecedented accessibility but introduces fundamental challenges regarding reliability and correctness, which the paper terms the “correctness impasse.” Finally, the paper culminates in an analysis of Neuro-Symbolic Hybrids, the current research frontier. This paradigm seeks to achieve the best of all worlds by integrating the pattern-matching and generative power of neural models (especially LLMs) with the rigor and verifiability of symbolic systems, often in a “verifier-in-the-loop” architecture.
1. Logic-Based (Deductive) Program Synthesis
This paradigm is the classical, most rigorous approach to program synthesis, aiming for provable correctness.
- Core Principle: Program synthesis is framed as a theorem-proving task. The specification is a formal logical formula, typically of the form ∀x ∃z R(x, z) (for all inputs x, there exists an output z satisfying relation R). The synthesizer’s job is to find a constructive proof of this theorem.
- Mathematical Foundation: The “Proofs-as-Programs” Paradigm (Curry-Howard Correspondence): This is the central mechanism.
- Analogy: Imagine a chef trying to create a new dish (the program) that meets a complex dietary specification (the proposition ∀x ∃z R(x, z)). The Curry-Howard correspondence states that writing a detailed, step-by-step, logically sound recipe (a formal proof) is equivalent to creating the dish itself (the program). The inference rules used in the recipe (like induction or case analysis) directly map to control structures in the final program (like recursion or if-then-else statements). The final code is simply “extracted” from the structure of the completed proof.
- Key Systems:
- KIDS (Kestrel Interactive Development System): Focused on correctness-preserving transformations, allowing a user to refine a declarative spec into an efficient algorithm.
- Coq Proof Assistant: A modern tool where users write formal specs and interactively construct proofs, from which functional code (e.g., OCaml, Haskell) can be automatically extracted. The CompCert C compiler is a landmark achievement of this approach.
- Theorema System: Highlighted for advanced techniques like using multisets to formally specify the “permutation” property in sorting algorithms, a non-trivial formalization challenge.
- Strengths & Weaknesses:
- Strength: Unparalleled guarantee of correctness. The generated code is correct by construction.
- Weaknesses: The Specification Obligation. Writing a complete and correct formal specification is often as hard or harder than writing the code itself. The process is also difficult to automate for complex problems, requiring significant human guidance.
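To make the specification obligation concrete, the sorting task behind the Theorema example can be stated as one theorem of the ∀x ∃z R(x, z) form, where ms(·) denotes the multiset of a list’s elements (a standard formalization, written here only for illustration):

```latex
\forall x \;\exists z \quad \mathrm{sorted}(z) \;\wedge\; \mathrm{ms}(z) = \mathrm{ms}(x)
```

A deductive synthesizer must build a constructive proof of this statement; the sorting program is then extracted as the witness for z.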
2. Inductive Program Synthesis: Generalization from Examples (PBE)
This paradigm shifts the focus from formal proof to intuitive specification, learning general programs from concrete examples.
- Core Principle: Instead of a formal spec, the user provides a set of input-output examples (e.g., Input: “john f. kennedy”, Output: “J. F. Kennedy”). The system’s task is to search a space of possible programs to find one that is consistent with the examples and, ideally, generalizes to new inputs. This is an act of inductive reasoning — a leap from specific instances to a general rule.
- Methodological Underpinnings:
- Domain-Specific Languages (DSLs): The key to making the search tractable. Instead of searching through all possible Python programs, the search is constrained to a small language with relevant primitives (e.g., Substring, Concatenate for string manipulation).
- Analogy: A DSL is like giving a LEGO builder only the specific types of bricks needed to build a car (wheels, steering wheels, chassis pieces) instead of every LEGO brick ever made. This drastically reduces the search space for a valid car design.
- Search Algorithms: The paper details a progression from simple enumerative search to more advanced techniques.
- Version Space Algebra (VSA): A core technique in FlashFill. Instead of enumerating individual programs, VSA manipulates a compact data structure that implicitly represents the set of all programs consistent with the examples. It’s an algebraic way to compose sets of potential sub-solutions efficiently.
- Counterexample-Guided Inductive Synthesis (CEGIS): A powerful feedback loop architecture.
- Analogy: CEGIS is like a game between a “Proposer” and a “Refuter.” The Proposer (Generator) suggests a program that fits all known examples. The Refuter (Verifier/User) checks this program against a broader specification (or just common sense) and, if it’s wrong, provides a counterexample where it fails. This counterexample is added to the set of examples, and the Proposer must now find a program that works for the old examples and the new one. This loop refines the specification iteratively until a correct program is found. (A toy Python version of this loop appears at the end of this section.)
- PROSE Framework: A meta-framework that uses witness functions (inverse semantics) to perform efficient top-down deductive search.
- Analogy: A normal function is like a blender: you put in “banana” and “milk” and get a “smoothie.” Its inverse witness function is like a food scientist who looks at the “smoothie” and deduces all possible input pairs that could have produced it (e.g., (“bana”, “na milk smoothie”), (“banan”, “a milk smoothie”), etc.). This allows the search to work backwards from the desired output, drastically pruning the search space.
- Key Systems:
- FlashFill (Microsoft Excel): The canonical success story for PBE, automating string transformations for millions of end-users.
- FlashExtract: Extends these ideas to extract structured data from semi-structured text.
- Strengths & Weaknesses:
- Strengths: High accessibility for non-programmers, solves the “specification bottleneck” for common, repetitive tasks.
- Weaknesses: The Ambiguity Problem (many programs can explain a few examples), scalability challenges, and a fundamental lack of correctness guarantees. The inductive leap is logically unsound.
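The sketch below (referenced from the CEGIS item above) runs a CEGIS-style loop over a toy DSL of affine functions f(x) = a·x + b: the proposer enumerates DSL programs consistent with the known examples, and the refuter searches a small trial range for a counterexample. Real systems replace both sides with solvers and far richer DSLs, so every name here is purely illustrative.

```python
from itertools import product

def spec(x):                       # the "broader specification" the refuter can query
    return 3 * x + 1

def propose(examples, search_space):
    """Proposer: return any DSL program (a, b) consistent with all known examples."""
    for a, b in search_space:
        if all(a * x + b == y for x, y in examples):
            return (a, b)
    return None                    # the DSL cannot express the behaviour

def refute(candidate, trials=range(-10, 11)):
    """Refuter: look for an input where the candidate disagrees with the spec."""
    a, b = candidate
    for x in trials:
        if a * x + b != spec(x):
            return x               # counterexample
    return None

def cegis():
    examples = [(0, spec(0))]      # start from a single example
    search_space = list(product(range(-5, 6), repeat=2))
    while True:
        candidate = propose(examples, search_space)
        if candidate is None:
            return None
        counterexample = refute(candidate)
        if counterexample is None:
            return candidate       # no disagreement found on the trial inputs
        examples.append((counterexample, spec(counterexample)))

print(cegis())                     # -> (3, 1)
```

Each counterexample tightens the example set, so the loop quickly converges on (3, 1), the only affine function matching the specification.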
3. Program Synthesis via Sketches and Schemas
This paradigm offers a pragmatic middle path, combining human insight with automated combinatorial search.
- Core Principle: The user provides a partial program, or sketch, which captures the high-level algorithmic structure but leaves low-level details as “holes” for the synthesizer to fill. This is a form of human-computer synergy.
- Analogy: A sketch is like a game of Mad Libs for programming. The programmer writes the story’s overall structure (the loops, the function calls) but leaves blanks (??) for the synthesizer to fill in with the correct constants, expressions, or variable names to make the story coherent (i.e., pass the tests). A toy hole-filling example appears at the end of this section.
- Formalism: Syntax-Guided Synthesis (SyGuS): This framework unifies sketch- and schema-based approaches. A SyGuS problem is defined by:
- A Semantic Specification (Logic): A set of constraints or tests the final program must satisfy.
- A Syntactic Specification (Grammar): A context-free grammar that defines the search space of allowed programs. A sketch is simply a very specific instance of such a grammar.
- Key Systems:
- Sketch: The original toolchain, using a C-like language to write partial programs and a SAT-based backend to solve for the holes.
- Rosette: A solver-aided language embedded in Racket, providing a more general framework for building synthesis and verification tools.
- AUTOBAYES/AUTOFILTER: Examples of schema-based synthesis, using high-level templates (schemas) of statistical algorithms to generate specialized scientific code.
- Strengths & Weaknesses:
- Strengths: Makes synthesis tractable for complex domains (e.g., cryptography, concurrency) by allowing programmers to guide the search.
- Weaknesses: The quality of the sketch is critical. A poorly designed sketch can make the problem unsolvable or lead to timeouts. Crafting good sketches requires significant expertise.
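A minimal, purely illustrative version of the sketch idea: the human fixes the program’s structure (clamping every element of a list into a range) and leaves two constant holes, and a brute-force search plays the role of the SAT backend by trying candidate values until the tests pass.

```python
from itertools import product

def make_program(hole_a, hole_b):
    # Structure written by the human: clamp each element into [hole_a, hole_b].
    def program(xs):
        return [min(max(x, hole_a), hole_b) for x in xs]
    return program

tests = [
    ([5, -3, 12], [5, 0, 10]),     # intended behaviour: clamp to the range [0, 10]
    ([0, 11], [0, 10]),
]

def fill_holes(candidates=range(-5, 16)):
    """Search for hole values that make every test pass (a stand-in for the solver)."""
    for a, b in product(candidates, repeat=2):
        program = make_program(a, b)
        if all(program(inputs) == expected for inputs, expected in tests):
            return a, b
    return None

print(fill_holes())                # -> (0, 10)
```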
4. Large Language Models as Program Synthesizers
This paradigm represents a major shift, treating program synthesis as a probabilistic translation task from natural language to code.
- Core Principle: Based on the “naturalness of software” hypothesis, LLMs trained on vast code corpora (like GitHub) learn the statistical patterns of code. They can then generate syntactically plausible and often semantically correct code from unstructured prompts (e.g., “write a Python function to download a file from a URL”).
- Mechanics:
- Pre-training & Fine-tuning: A foundation model is pre-trained on a massive mix of text and code, then fine-tuned on a high-quality, curated dataset of code-related tasks.
- Autoregressive Generation: The model generates the program token-by-token, with each new token being predicted based on the prompt and all previously generated tokens.
- Key Methodologies:
- Large-Scale Sampling & Filtering: Since generation is probabilistic, the dominant strategy is to generate thousands or millions of candidate programs and use a simple verifier (like unit tests) to filter out incorrect ones.
- Analogy: This is like a massive brainstorming session. You ask the LLM for a solution and get a flood of ideas. You then apply a quick sanity check (the tests) to discard 99% of them, and from the remaining plausible ideas, you select the most promising one.
- Iterative Refinement (Self-Debugging): Modern approaches use a feedback loop. The LLM generates code, a tool (compiler, test runner) executes it, and the feedback (e.g., an error message) is fed back into the LLM’s prompt, asking it to fix the bug.
- Key Systems & Evaluation:
- OpenAI Codex: The model behind GitHub Copilot, bringing LLM-based synthesis to the masses.
- DeepMind AlphaCode: Demonstrated competitive-level performance in programming competitions by pioneering the massive sampling/filtering/clustering technique.
- Benchmarks: The field shifted to functional correctness metrics like pass@k on benchmarks like HumanEval and MBPP. pass@k measures the probability that at least one of k generated samples is correct.
- Strengths & Weaknesses:
- Strengths: Unprecedented flexibility in handling ambiguous, natural language specs; massive productivity gains; broad language and domain coverage.
- Weaknesses: The Correctness Impasse. The generated code is probabilistic and comes with absolutely no formal guarantees of correctness. It can contain subtle bugs, security vulnerabilities, and “hallucinated” API calls. Verification by the human developer remains critical.
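For reference, pass@k is normally computed with the unbiased estimator popularized by the Codex evaluation (Chen et al., 2021): generate n ≥ k samples per problem, count the c correct ones, and estimate the chance that a random size-k subset contains at least one correct sample.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generated samples of which c are correct,
    passes the tests."""
    if n - c < k:
        return 1.0                 # every size-k subset must contain a correct sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For instance, with n = 200 samples and c = 10 correct ones, pass_at_k(200, 10, 1) evaluates to 0.05.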
5. Neuro-Symbolic (Hybrid) Synthesis
This paradigm is the current research frontier, aiming to unify the strengths of neural generation with the rigor of symbolic reasoning.
- Core Principle: To create systems that are both robust and interpretable by combining neural networks (for perception, pattern matching, and heuristic guidance) and symbolic systems (for structured reasoning, verification, and constraints).
- A Taxonomy of Architectures:
- Symbolic[Neuro]: A primary symbolic algorithm (e.g., a search tree) uses a neural network as a subroutine to provide a learned heuristic (e.g., AlphaGo).
- Neuro|Symbolic: A pipeline where a neural front-end performs perception (image -> objects) and a symbolic back-end performs reasoning on the results.
- Neuro:Symbolic→Neuro: Symbolic knowledge (e.g., logical rules) is compiled into the training process of a neural network to constrain it.
- Foundational Methodologies:
- Neural-Guided Search (e.g., DeepCoder): An early approach where a neural net predicts which DSL components are likely to be in the solution, guiding a traditional symbolic search.
- Library Learning (e.g., DreamCoder): A system that learns its own reusable functions (abstractions).
- Analogy: DreamCoder’s “wake-sleep” cycle is like an expert programmer. During the “waking” phase, it solves problems. During the “sleep” phase, it reflects on its solutions, identifies repeated patterns, and refactors them into new, named functions in its personal library. It then retrains itself to use these new, more powerful abstractions, allowing it to tackle harder problems in the next cycle.
- The LLM Influence: The Verifier-in-the-Loop Pattern:
- This has become the dominant modern neuro-symbolic architecture. An LLM acts as a powerful but fallible Generator, proposing programs. A symbolic tool (a compiler, test suite, or formal verifier) acts as an infallible Verifier. This creates a powerful feedback loop (like CEGIS) that leverages the LLM’s generative capacity while grounding its output in symbolic correctness.
- Symbolic Scaffolding (e.g., Proof of Thought): Instead of just verifying the final output, this approach forces the LLM to externalize its intermediate reasoning steps into a structured, verifiable program, which is then checked by a symbolic logic engine.
- Open Challenges: The paper concludes by noting key challenges, including the “explainability paradox” (even if the output program is simple, the neuro-symbolic process to get there can be opaque), scalability, and the need for unified frameworks. The ultimate goal is to create systems that can learn, reason, and collaborate with humans in a more robust and trustworthy manner.
18. Agentic program repair from test failures at scale: A neuro-symbolic approach with static analysis and test execution feedback
This industrial paper from Meta presents “Engineering Agent,” a large-scale, autonomous system for automated program repair (APR) deployed in production. The system tackles the real-world problem of fixing bugs that manifest as test failures within Meta’s massive, fast-moving monorepo. It employs a neuro-symbolic, agentic architecture where a Llama-based model, operating within a ReAct-style loop, acts as the neural “reasoning” engine. This agent interacts with a suite of 15 symbolic tools including static analyzers and test execution runners to diagnose failures, propose code patches, and iteratively refine its solutions based on deterministic feedback.
Key research findings from their offline benchmarks demonstrate the superiority of this hybrid approach: the agent’s solve rate jumped from 28.5% (neural-only) to 42.3% when provided with symbolic feedback from static analysis and test execution. The paper also offers practical insights for the field, such as the finding that a “search-and-replace” patch format is significantly more effective for LLMs than the standard unified diff format.
In a three-month production deployment, the Engineering Agent achieved a 25.5% end-to-end land rate for its generated fixes, demonstrating the viability of agentic APR at industrial scale. The work provides a robust blueprint for building and evaluating such systems, emphasizing the critical role of symbolic feedback, quality control via an “LLM-as-a-Judge,” and the value of human-in-the-loop collaboration.
1. System Architecture: An Orchestrated Neuro-Symbolic Loop
The paper frames the problem as finding a Patch that satisfies an Oracle (the failing test suite) given a Specification (the test failure report). The core of their solution is an agentic system that iteratively refines its understanding and solution.
- The Case File (Specification): The process begins when a rule-based bot (TFMB) files a “case” — a detailed report of a test failure, complete with a stack trace and the likely culprit code change.
- The Detective (ReAct Agent): A Llama model acts as the detective. It uses a ReAct (Reason + Act) framework to formulate a hypothesis (thought) and then choose a tool to test it (action). For example, Thought: The stack trace points to xyz.py. I should examine the contents of that file. -> Action: read_file(file_path='xyz.py').
- The Detective’s Toolkit (Symbolic Tools): The agent has access to 15 deterministic, symbolic tools. These range from simple file system operations (read_file, find_file) to complex code intelligence tools (search_method_in_class) and, most importantly, validators.
- The Forensics Lab (Feedback Oracle): The crucial neuro-symbolic connection happens here. After the agent proposes a fix with the edit tool, the “forensics lab” runs static analysis and test execution. The deterministic output — lint errors, compiler warnings, or new stack traces — is fed back to the agent as a structured observation. This grounds the LLM’s probabilistic reasoning in factual, symbolic feedback, enabling it to course-correct in a tight loop. The balanced model takes an average of 11.8 feedback iterations per solution.
- The District Attorney (LLM-as-a-Judge): Before a proposed fix is sent to a human, it’s vetted by another LLM calibrated to act as a quality judge. Its job is to filter out patches that, while technically passing the tests, are suboptimal, use legacy patterns, or violate engineering best practices. This step is optimized for high precision in rejecting “unacceptable” patches (0.867 precision) to avoid wasting human reviewers’ time and eroding trust.
- The Grand Jury (Human Reviewer): The final step is human review. The engineer is presented with the patch, the passing test results, and the agent’s full “trajectory” (the log of thoughts and actions) to understand its reasoning process before landing the change.
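The detective loop above reduces to a standard ReAct cycle. The sketch below shows only the control flow; llm (returning a thought, an action name, and arguments), the tool registry, and the finish convention are placeholders rather than Meta’s actual interfaces, which are not public.

```python
def react_repair_loop(llm, tools, case_report, max_steps=30):
    """Minimal ReAct-style loop: think, act with a symbolic tool, observe the
    deterministic feedback, and repeat until the model proposes a patch."""
    history = [f"Case: {case_report}"]
    for _ in range(max_steps):
        step = llm("\n".join(history))     # e.g. {"thought": ..., "action": ..., "args": {...}}
        history.append(f"Thought: {step['thought']}")
        if step["action"] == "finish":
            return step["args"].get("patch"), history   # proposed fix plus the full trajectory
        observation = tools[step["action"]](**step["args"])  # read_file, edit, run_tests, ...
        history.append(f"Observation: {observation}")
    return None, history                   # step budget exhausted without a fix
```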
2. Key Methodological Contributions & Offline Findings
The paper details several crucial findings from extensive offline benchmarking, which guided the production design.
- Patch Format Matters Significantly: The standard unified diff format confuses LLMs, often causing them to hallucinate line numbers. A simpler “search-and-replace” format proved far more natural and effective, boosting the solve rate of the Llama-405B model from 30% to 53% on the PatchGen benchmark. This is a highly practical finding for any AI code generation research; an illustrative comparison of the two formats appears after this list.
- Specialized Models Outperform General Giants: A smaller, internally fine-tuned 70B iCodeLlama model was highly competitive with (and in some cases, superior to) the much larger, general-purpose 405B Llama model. This highlights the immense value of domain-specific fine-tuning for code-related tasks, offering a path to better performance with lower inference costs.
- Quantifying the Value of Neuro-Symbolic Integration (Ablation Study): This is the paper’s central research contribution.
- The pure neural agent (ReAct only) had a solve rate of 28.5%.
- Adding symbolic feedback from static analysis tools boosted it to 34.1%.
- Adding symbolic feedback from test execution had the largest impact, raising the solve rate to 43.9%.
- Combining both yielded a balanced 42.3% solve rate with a lower error rate, demonstrating that the two forms of symbolic feedback are complementary. This provides strong quantitative evidence that grounding LLM agents with deterministic tool feedback is critical for complex SE tasks.
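To illustrate the patch-format finding from the first bullet above: a unified diff makes the model emit hunk headers with exact line numbers, whereas a search-and-replace edit only asks it to reproduce the code being changed. The markers and helper below are hypothetical; the paper’s exact format is not reproduced here.

```python
# Unified diff: the @@ header must carry exact line numbers, which LLMs often hallucinate.
unified_diff = """\
@@ -42,1 +42,1 @@
-    return sum(i.price for i in items)
+    return sum(i.price * i.qty for i in items)
"""

# Search-and-replace: only verbatim code, no positional bookkeeping (markers are made up).
search_replace = """\
<<<< SEARCH
    return sum(i.price for i in items)
==== REPLACE
    return sum(i.price * i.qty for i in items)
>>>>
"""

def apply_search_replace(source: str, patch: str) -> str:
    """Apply one hypothetical search-and-replace block via plain substring match."""
    _, rest = patch.split("<<<< SEARCH\n", 1)
    search, rest = rest.split("\n==== REPLACE\n", 1)
    replace = rest.split("\n>>>>", 1)[0]
    return source.replace(search, replace)
```

Because applying such an edit is just substring substitution, a slightly wrong line count can never corrupt the patch; the edit either matches or it doesn’t.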
3. Production Deployment: Performance and Human Feedback
The system was deployed to handle live test failures at Meta.
- Production Metrics (3-Month Period):
- Published Diffs: 1,589 automated fixes were proposed.
- Review Rate: A high 80% of these were reviewed by engineers, indicating the generated fixes were of sufficient quality to warrant human attention.
- Land Rate: 31.5% of the reviewed diffs were accepted and landed, corresponding to 25.5% of the total generated diffs. This is a strong result for a fully autonomous APR system tackling complex, non-trivial bugs.
- Qualitative Feedback from Engineers: Open coding of engineer comments revealed crucial nuances that quantitative metrics miss:
- Positive: Engineers expressed gratitude for time saved, surprise at the agent’s capability, and noted that even partially correct solutions were highly valuable as starting points for a manual fix. This challenges the binary “pass/fail” evaluation common in APR research.
- Challenges & System Improvements: Negative feedback was instrumental in hardening the system. Complaints about test flakiness led to the creation of more isolated execution environments. The difficulty of finding appropriate human reviewers for cross-cutting changes prompted organizational adjustments. Feedback about missing tools led to the expansion of the agent’s action space.
4. Contribution and Significance for AI Experts
This work moves beyond benchmark-driven APR research to a real-world, large-scale deployment. Its primary contributions are:
- An Architectural Blueprint: It provides a detailed and validated architecture for building neuro-symbolic engineering agents that combine the reasoning power of LLMs with the reliability of symbolic tools.
- Strong Empirical Evidence: It delivers clear, quantitative proof of the performance gains from integrating symbolic feedback (static analysis, test execution) into an agentic loop.
- Practical Engineering Insights: It offers actionable lessons for the AI community on crucial but often-overlooked details, such as the importance of patch format design and the utility of an LLM-as-a-Judge for quality control.
- Human-AI Collaboration Model: It demonstrates that the value of AI agents is not just in full automation but also in their ability to act as collaborators, providing partially correct solutions that significantly accelerate human workflows.
19. CodeCoR: An LLM-based self-reflective multi-agent framework for code generation
CodeCoR (Code Collaboration and Repair) is a self-reflective, multi-agent framework for LLM-based code generation that significantly improves upon existing sequential multi-agent systems. Its core innovation is tackling the problem of error propagation, where a mistake made by an early agent (e.g., in task understanding) cascades and ruins the entire generation process.
CodeCoR achieves this through a parallelized, self-evaluating workflow involving four specialized agents: Prompt, Coding, Test, and Repair. Instead of a single linear path, each agent generates a pool of potential outputs (prompts, code snippets, test cases, repair advice), and then uses an LLM-powered pruning mechanism to discard low-quality candidates. This ensures only the most promising artifacts proceed, making the system robust and resilient. The framework also incorporates an iterative repair loop where a dedicated Repair Agent provides explicit advice to fix bugs. Experiments show CodeCoR achieves a state-of-the-art average Pass@1 score of 77.8% across four benchmarks, substantially outperforming models like MapCoder and CodeCoT.
1. Problem Statement
Current LLM-based code generation, including multi-agent frameworks like CodeCoT and MapCoder, often follows a sequential workflow. This process typically involves:
- An agent analyzes the task and creates a plan (or a better prompt).
- A coding agent generates code based on that plan.
- A testing agent validates the code.
The critical weakness is error propagation. If the first agent misunderstands the natural language requirement, this misunderstanding is baked into the plan. The coding agent will then faithfully generate incorrect code, and the testing agent might even generate tests for the wrong functionality. The entire chain of effort is wasted because of an initial mistake. The system lacks the robustness to question or validate the quality of intermediate steps.
2. The CodeCoR Framework: A Self-Reflective Approach
CodeCoR is designed to be “self-reflective,” meaning it constantly evaluates the quality of its own work at each stage. It consists of four agents and a five-phase workflow.
The Four Agents:
- Prompt Agent: Takes the initial task description and generates multiple, detailed step-by-step plans using the Chain-of-Thought (CoT) technique. It creates a pool of potential plans, not just one.
- Coding Agent: Receives a high-quality prompt (plan) and generates multiple code snippets that attempt to implement it, creating a code snippet pool.
- Test Agent: Also uses the high-quality prompt to generate a pool of test cases to validate the code’s correctness.
- Repair Agent: Analyzes code that fails the generated test cases and provides explicit, natural language repair advice on how to fix the bugs.
The Five-Phase Workflow:
- Phase I: Prompt Generation: The Prompt Agent generates a CoT_pool. The prompts are then pruned to keep only the highest-quality ones.
- Phase II: Test Case Generation: Using the pruned prompts, the Test Agent generates a test_case_pool, which is also pruned to remove invalid or redundant tests.
- Phase III: Code Generation: In parallel with Phase II, the Coding Agent uses the same pruned prompts to generate a code_snippet_pool. Syntactically incorrect code (that can’t be compiled) is pruned.
- Phase IV: Result Checking: Each promising code snippet is executed against the pool of test cases in a local environment.
- Pass: The code is added to a ranked_code_set.
- Fail: The framework determines if the code is salvageable. If it fails the same tests repeatedly across repair rounds, it’s deemed un-repairable and added to the ranked set as-is. Otherwise, it’s sent for repair.
- Phase V: Code Repairing:
- The Repair Agent analyzes the failed code and test cases, generating repair advice. This advice is also pruned for quality.
- The pruned advice and the faulty code are sent back to the Coding Agent, which generates a revised code snippet.
- This repaired snippet re-enters the workflow at Phase IV for re-testing, creating an iterative feedback loop.
Finally, the code from the ranked_code_set that passes the highest number of test cases is selected as the final output.
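The whole workflow condenses into one control loop. In the sketch below, agents bundles the four LLM-backed agents together with their pruning prompts, and runner compiles and executes code against the test pool; both interfaces are placeholders, and only the phase ordering follows the paper’s description.

```python
def codecor(task, agents, runner, max_repair_rounds=3):
    """Condensed control flow of CodeCoR's five phases (see the workflow above)."""
    prompts = agents.prune(agents.prompt_pool(task))                # Phase I
    tests = agents.prune(agents.test_pool(task, prompts))           # Phase II
    snippets = [c for c in agents.code_pool(task, prompts)          # Phase III
                if runner.compiles(c)]
    ranked = []
    for code in snippets:
        for _ in range(max_repair_rounds):                          # Phase IV
            failures = runner.run(code, tests)
            if not failures:
                break
            advice = agents.prune(agents.repair_advice(code, failures))
            code = agents.revise(code, advice)                      # Phase V, then re-test
        passed = len(tests) - len(runner.run(code, tests))
        ranked.append((passed, code))
    return max(ranked, key=lambda r: r[0])[1] if ranked else None   # most tests passed wins
```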
3. Explanation of Complex Parts with Analogies
The most complex and innovative part of CodeCoR is its self-reflective pruning and repair mechanism. It’s not based on complex mathematics but on a sophisticated workflow and meta-evaluation.
Analogy: A High-Performing Software Development Team
Imagine building a complex software feature, not with one junior developer, but with a team of specialized experts who constantly review their own work.
- The Task: “Build a feature that calculates the decimal part of a number.”
- Traditional Sequential Approach (e.g., MapCoder): A single project manager writes one specification. A single developer writes code based on it. A single QA engineer writes tests. If the initial spec was ambiguous (“what about negative numbers?”), the whole product is flawed.
- The CodeCoR Team Approach:
- The Lead Architects (Prompt Agent): Instead of one spec, three senior architects each write a detailed design document (CoT prompt).
- Doc A: “1. Take input n. 2. Convert to string. 3. Find the decimal point. 4. Return the substring after it.”
- Doc B: “1. Take input n. 2. Get the integer part using int(n). 3. Subtract the integer part from n. 4. Handle negative numbers by taking the absolute value first.”
- Doc C: “Get decimal.” (This one is too vague).
- Architectural Review (Prompt Pruning): The architects review all three docs. They use a clear checklist (the “pruning prompt”): Is it Clear? Relevant? Concise? Contextual?
- Docs A and B pass. Doc C is pruned for being unclear and lacking context. This is the first layer of self-reflection.
- Parallel Development (Coding & Test Agents):
- Dev Team (Coding Agent): Two developers get the approved docs (A and B) and each write a version of the code. Now you have two distinct, promising code candidates.
- QA Team (Test Agent): Simultaneously, two QA engineers use the same approved docs to write comprehensive test suites, including edge cases like test(0.5), test(-3.14), test(10.0).
- CI/CD Pipeline (Result Checking): The code from both developers is run against the test suites.
- Developer A’s code (string-based) fails on negative numbers.
- Developer B’s code (math-based) passes all tests. It gets a high rank.
- Code Review & Debugging (Repair Agent & Loop):
- Developer A’s failing code goes to a Senior Debugger (Repair Agent). Instead of just saying “it’s broken,” the debugger provides specific advice: “Your string manipulation logic doesn’t account for the ‘-’ sign in negative numbers. You should first check if the number is negative, handle the sign, and then process the string.”
- This advice is checked for quality (Repair Pruning). If the advice was bad (“try again”), it would be replaced with the raw failed test cases.
- Developer A takes this explicit advice, fixes the code, and resubmits it to the pipeline. This is the iterative repair loop.
This team analogy illustrates CodeCoR’s strengths:
- Resilience: A bad idea (Doc C) is eliminated early.
- Diversity: Multiple valid approaches (A and B) are explored in parallel.
- Collaboration: The debugger provides constructive feedback, making the repair process intelligent rather than random.
- Self-Reflection: At every stage, agents (team members) evaluate their own output against quality criteria before passing it on.
4. Key Experimental Findings
- Overall Performance (RQ1): CodeCoR significantly outperforms all baselines on HumanEval, MBPP, and their extended versions (ET). On HumanEval, it achieves 86.6% Pass@1, compared to MapCoder’s 80.5% and CodeCoT’s 79.3%.
- Code Quality: Beyond just passing tests (functional correctness), CodeCoR’s output is structurally closer to human-written reference solutions. It achieved a lower (better) Average Edit Distance and a higher (better) Average BLEU score, indicating higher syntactic and semantic quality.
- Component Importance (RQ2 — Ablation Study): Removing any single component crippled the system’s performance. The most drastic drop occurred when removing the Test Agent (Pass@1 on HumanEval fell from 86.6% to 45.1%), proving that self-generated, high-quality tests are crucial for validation and repair. Removing the Prompt Agent or Repair Agent also caused significant performance degradation.
- Efficiency and Cost (RQ3): Despite its complex workflow, CodeCoR is more efficient than many single-agent, chain-of-thought methods. Its runtime was ~123s, compared to ~250s for methods like SCoT and Self-Planning. This is attributed to the pruning mechanism, which prevents the system from wasting computational resources on unpromising paths, and the parallel nature of the agents.
- Generalizability: The framework is model-agnostic. When tested with GPT-4 and CodeLlama, CodeCoR consistently outperformed other methods, achieving an impressive 94.5% Pass@1 on HumanEval with GPT-4.
5. Conclusion
CodeCoR presents a paradigm shift from sequential, brittle multi-agent systems to a robust, self-reflective, and collaborative framework. By generating and pruning pools of artifacts at each step and employing an intelligent repair loop, it effectively mitigates error propagation and produces higher-quality code more efficiently. Its architecture mimics the collaborative and iterative nature of expert human software development teams, setting a new standard for LLM-based code generation.
(Table caption from the paper, partially recovered: Pass@1 scores across the datasets, with the best approach highlighted in bold.)
(Table caption from the paper, partially recovered: average BLEU scores, including the mean values across both datasets.)
(Figure caption from the paper, partially recovered: the first sub-figure gives a coding task, the second shows code produced by a single agent without a Repair Agent, which contains a semantic error, and the third shows the code generated under the guidance of the Repair Agent, which is correct both syntactically and semantically.)
20. Paper2Code: Automating code generation from scientific papers in machine learning
PaperCoder is a multi-agent LLM framework designed to address the scientific reproducibility crisis by automatically generating entire machine learning code repositories directly from scientific papers. Unlike previous systems that generate code from structured prompts or require existing code snippets, PaperCoder tackles the highly complex and unstructured nature of a full research paper. It achieves this by mimicking a human developer’s top-down software engineering workflow, breaking the task into three distinct stages: Planning, Analysis, and Coding. The novel Planning stage is the most critical, where specialized agents create a high-level implementation roadmap, design the system architecture with class and sequence diagrams, establish file dependencies, and generate configuration files before a single line of code is written. This structured approach ensures the final repository is modular, coherent, and faithful to the paper’s methodology. Evaluations on their new Paper2CodeBench and the existing PaperBench show that PaperCoder substantially outperforms baselines. Notably, human experts (the papers’ original authors) rated 88% of PaperCoder’s repositories as the best, and the generated code is nearly executable, requiring modification to only ~0.81% of lines on average.
1. Problem Statement
A major bottleneck in scientific progress, particularly in machine learning, is the lack of available code for published research (the paper finds only ~19.5% of papers at top 2024 conferences provide code). This forces researchers to spend significant time reverse-engineering methods, slowing down validation and innovation.
The “naive” approach of feeding an entire research paper to a powerful LLM and asking for the code fails for several reasons:
- Complexity & Ambiguity: Scientific papers are written for human communication, not as software specifications. They are filled with narrative, motivation, and high-level concepts that are noisy from a software engineering perspective.
- Long-Context Limitations: Even with large context windows, LLMs struggle to maintain global consistency, track inter-file dependencies, and adhere to a coherent architecture across an entire repository generated in a single pass.
- Lack of Structure: A naive generation is likely to produce a monolithic script or a poorly organized set of files with circular dependencies and other structural flaws.
2. The PaperCoder Framework: A Top-Down, Multi-Agent Approach
PaperCoder tackles this challenge by imposing a structured, top-down workflow, orchestrated by specialized LLM agents across three sequential stages.
Stage 1: Planning (The Blueprint Phase)
This stage is the core innovation. Instead of diving into code, a series of agents transform the unstructured paper into a detailed software blueprint. This is broken into four sequential sub-steps:
- Overall Plan: An agent reads the entire paper and extracts the high-level components to be implemented — the model architecture, data processing pipeline, training procedure, and evaluation metrics. This creates a foundational “what to build” document.
- Architecture Design: Using the overall plan, another agent designs the repository’s structure. It generates a file list, a class diagram (modeling classes and their static relationships), and a sequence diagram (modeling the dynamic interactions and call flow between components). This defines the system’s static and dynamic architecture.
- Logic Design: This agent translates the architectural design into an actionable, dependency-aware plan. It produces an ordered file list, which dictates the sequence for code generation to prevent dependency errors (e.g., generate utils.py before model.py that imports from it). It also elaborates on the logic within each file.
- Configuration Generation: A final planning agent scans the paper and the plan to extract all key hyperparameters (learning rate, batch size, model dimensions, etc.) and synthesizes a config.yaml file. This separates configuration from code, making the repository modular and easy for researchers to tweak.
Stage 2: Analysis (The Task Breakdown Phase)
With the complete blueprint from the Planning stage, an Analysis Agent iterates through each file defined in the plan. For each file, it generates a detailed, natural-language specification of what needs to be implemented. This includes its functional goals, required inputs/outputs, dependencies on other modules, and specific algorithmic details derived from the paper. This phase acts as a bridge between the high-level architecture and the low-level code.
Stage 3: Coding (The Implementation Phase)
A Coding Agent generates the final code repository. The key here is that the process is sequential and context-aware:
- It generates files one by one, strictly following the execution order determined in the Logic Design phase.
- When generating the i-th file, the agent is provided with all contextual information: the original paper, the full plan, the file-specific analysis, and the code for all previously generated files (1 to i-1).
This iterative process ensures that imports are valid and that cross-file interactions are consistent.
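A minimal sketch of this context-aware pass, assuming llm_write_file stands in for the Coding Agent and that the plan, the per-file analyses, and the file order come from the earlier stages (the names are illustrative, not PaperCoder’s API):

```python
def generate_repository(paper, plan, analyses, ordered_files, llm_write_file):
    """Generate files one by one in dependency order, always passing the code
    written so far as context so imports and cross-file calls stay consistent."""
    repo = {}                                          # file path -> source text
    for path in ordered_files:                         # order fixed by the Logic Design step
        context = {
            "paper": paper,
            "plan": plan,
            "file_spec": analyses[path],               # the per-file analysis
            "written_so_far": dict(repo),              # files 1 .. i-1
        }
        repo[path] = llm_write_file(path, context)
    return repo
```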
3. Explanation of Complex Parts with Analogies
The complexity of PaperCoder lies in its structured workflow. We can understand it with an analogy to a professional software engineering team being tasked with building an application based on a client’s high-level idea.
- The “Client Request”: The scientific paper. It’s like a detailed vision document from a non-technical client — it explains what the system should do and why, but not how to build it in software terms.
- Naive Approach (Hiring one junior dev): Giving the paper to a single LLM is like giving the 20-page vision document to a junior developer and saying, “Go build this.” The result will be a chaotic mess. They’ll miss key requirements, hardcode important values, and create code that’s impossible to maintain or modify.
- The PaperCoder Approach (An Expert Software Team):
- Project Manager & Lead Architect (The Planning Stage):
- Overall Plan: The Project Manager reads the client’s document and creates a high-level feature list. “We need a user authentication module, a data processing pipeline, and a results dashboard.”
- Architecture Design: The Lead Architect takes this list and creates the formal blueprints: UML class diagrams to define the database schema and object models, and sequence diagrams to map out the API call flow.
- Logic Design: The architect then creates a project plan (like a Gantt chart) that defines task dependencies. “We can’t build the dashboard (main.py) until the data pipeline (dataset_loader.py) and the core algorithm (model.py) are complete and tested.”
- Configuration: The team decides to put all sensitive keys and tunable parameters into a .env or config.yaml file, following best practices.
- Team Lead (The Analysis Stage):
- The Team Lead takes the architect’s blueprints and breaks them down into specific, actionable tickets in a system like Jira. For each file, there’s a ticket with a detailed description: “Implement the DatasetLoader class. It must have a load_data() method that accepts a file path and returns a PyTorch Tensor…”
- Developers (The Coding Stage):
- The developers pick up the Jira tickets in the prioritized order. The developer working on main.py can see the completed and merged code for dataset_loader.py and model.py, allowing them to import and use those modules correctly.
This analogy highlights why PaperCoder works: it imposes the discipline and foresight of a professional software development lifecycle onto the chaotic task of interpreting a scientific paper.
4. Key Experimental Findings
- Overall Performance: On their newly created Paper2CodeBench and the existing PaperBench, PaperCoder significantly outperforms all baselines, including other multi-agent frameworks like ChatDev and MetaGPT.
- Evaluation Methodology: The paper introduces a robust evaluation framework using LLMs as judges, with two protocols:
- Reference-Based: Compares the generated code to the author’s official repository (if available).
- Reference-Free: Assesses the code’s faithfulness based solely on the text of the paper.
- Correlation with Human Judgment: The automated, model-based evaluation scores show a strong positive correlation (Pearson r = 0.79) with scores from human experts (the papers’ authors). This validates the use of LLMs-as-judges for this complex task.
- Ablation Studies: The results confirm that each component of the Planning-Analysis-Coding pipeline contributes positively to the final performance. Adding the Logic Design stage (which defines execution order) provides one of the most significant boosts, resolving inconsistencies introduced by the architectural design alone.
- Executability: The generated code is highly practical. A manual analysis found that on average, only 0.81% of code lines needed minor modifications (e.g., updating a deprecated API call) to become fully executable.
- Qualitative Feedback: Human evaluators overwhelmingly preferred PaperCoder’s output, citing its completeness, clean structure, and faithfulness to the original paper as key reasons. 92% of experts agreed the generated code would facilitate reproduction.
5. Conclusion
PaperCoder represents a significant leap forward in automated code generation, moving from well-defined, function-level tasks to the highly complex, repository-level challenge of implementing an entire scientific paper. Its core contribution is a structured, top-down, multi-agent framework that mimics the best practices of human software engineering to decompose the problem into manageable stages. By systematically planning, analyzing, and then coding, PaperCoder produces repositories that are not only conceptually correct but also structurally sound and practically useful, offering a powerful tool to accelerate scientific discovery and address the reproducibility crisis.
(Figure caption from the paper, partially recovered: (a) PaperCoder transforms papers in machine learning domains into code repositories through three sequential steps: planning, analysis, and coding. (b) Code availability, where blue bars indicate the total number of accepted papers and orange regions show those with officially released code; see Appendix B.1 of the paper for details on the availability calculation.)
21. CodeARC: Benchmarking reasoning capabilities of LLM agents for inductive program synthesis
This paper introduces CodeARC, a new and more realistic way to test how well LLM agents can write computer programs based only on examples of what the program should do. This task is called inductive program synthesis.
Instead of a simple, static test where the AI gets a few examples and one chance to write the code, CodeARC creates an interactive game. The AI agent can actively investigate a hidden, unknown program by giving it new inputs to see what it outputs, and then asking a special “oracle” to check its own proposed code for mistakes. The best AI model tested, OpenAI’s o3-mini, could only solve about half the problems (52.7%), showing this is a very difficult task. The researchers also showed that by training a smaller model on the reasoning steps of a more powerful model, they could significantly improve its performance.
1. The Core Problem: Writing Code from Examples (Inductive Program Synthesis)
Imagine you find a mysterious machine with a slot for an input and a chute for an output. You don’t have the user manual. Your goal is to build a replica of this machine. This is the core idea of inductive program synthesis.
- You are given a few examples:
- Input: [1, 2, 3] → Output: True
- Input: [1, 2, 2] → Output: False
- Input: [] → Output: True
- Your task is to write a Python function that replicates this behaviour for all possible inputs, not just the ones you’ve seen. You might guess the function checks if all elements in a list are unique.
The Analogy: You are a code detective trying to reverse-engineer a “black box” function. You can’t see the source code; you can only observe its behaviour.
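For instance, a natural hypothesis consistent with the three examples above is “all elements are distinct”:

```python
def solution(xs):
    """Candidate replica of the black box: True iff all elements are unique."""
    return len(xs) == len(set(xs))

print(solution([1, 2, 3]), solution([1, 2, 2]), solution([]))   # True False True
```

Whether this guess matches the hidden function on every possible input is exactly what the interactive protocol described below lets the agent find out.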
2. The Flaw in Existing Tests
Previous methods for testing AI on this task were like a rigid, multiple-choice exam.
- Static and Limited Examples: The AI was given a fixed set of 10 input-output examples.
- No Feedback: The AI wrote its function, and it was graded against a hidden set of test cases. If it failed, it just got a “wrong” score with no explanation and no second chance.
The paper argues this is unrealistic and doesn’t measure true reasoning. A small set of examples might not reveal the full picture. For instance, two different rounding functions can produce the same output for many numbers but behave differently for numbers ending in “.5”.
The Analogy: This old method is like asking a detective to solve a case based on a few initial clues, with no ability to interview new witnesses or visit the crime scene. They just have to write down their conclusion and hope it’s right.
3. The CodeARC Solution: An Interactive Investigation
CodeARC turns the test into an interactive investigation, mimicking how a real human would solve this problem. The AI agent has two powerful tools and a limited budget to use them.
- Function Invocation (Asking for More Clues): The agent can create new inputs and query the hidden “black box” function to get more input-output examples. This allows it to test specific hypotheses.
- Example: The agent suspects the function specially handles negative numbers. It can ask, “What is the output for [-1, -2, -1]?”
- Differential Testing Oracle (Getting Your Theory Checked): After gathering clues, the agent writes a candidate function. It can then submit this function to an “oracle.”
- Analogy: The oracle is like a super-intelligent Quality Assurance (QA) engineer who knows the correct answer. It compares the agent’s function to the hidden one and tries to find any input where they disagree.
- If they match: The oracle returns PASS. The agent has succeeded.
- If they differ: The oracle returns FAIL along with a counterexample — a specific input that breaks the agent’s code (e.g., “Your function fails for the input [1, [5,6], [5,6]] because the hidden function gives an error, but yours does not.”). This feedback is crucial for self-correction.
The agent operates under a budget, limiting how many new input-output examples it can request (B_io) and how many times it can consult the oracle (B_oracle).
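The investigation can be summarised as a simple episode loop. In the sketch below, agent, hidden_fn, and oracle are placeholders for the LLM under test, the black-box function, and the differential-testing check; the budget arguments mirror B_io and B_oracle.

```python
def codearc_episode(agent, hidden_fn, oracle, seed_examples, b_io=10, b_oracle=3):
    """One CodeARC-style episode: gather extra input-output clues within the
    invocation budget, propose a candidate function, and refine it from the
    oracle's counterexamples until the oracle budget runs out."""
    examples = list(seed_examples)
    for _ in range(b_oracle):
        while b_io > 0 and agent.wants_more_examples(examples):
            x = agent.next_query(examples)             # probe a specific hypothesis
            examples.append((x, hidden_fn(x)))
            b_io -= 1
        candidate = agent.synthesize(examples)         # a proposed Python function
        verdict, counterexample = oracle(candidate)
        if verdict == "PASS":
            return candidate
        examples.append((counterexample, hidden_fn(counterexample)))
    return None                                        # budgets exhausted without a match
```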
4. The Benchmark and Experiments
- Dataset: The researchers created a large benchmark of 1,114 diverse Python functions sourced from existing coding challenges (HumanEval, MBPP, APPS).
- Two Versions:
- Annotated: The function has a descriptive name (e.g., is_palindrome), which gives the AI a hint.
- Anonymised: All functions are named solution, forcing the AI to rely purely on reasoning from the examples.
- Models Tested: 18 different Large Language Models were evaluated.
5. Key Findings
- CodeARC is Very Challenging: The best model, o3-mini, only achieved a 52.7% success rate on the anonymised dataset. This shows that even top AI models struggle with true inductive reasoning.
- Reasoning-Focused Models Excel: Models known for strong reasoning capabilities (like o3-mini and DeepSeek-R1) performed the best and were more efficient, needing fewer examples and oracle checks.
- Static Examples are Not Enough: The paper proved that relying only on the initial 10 examples is insufficient. In 28.7% of cases, a function that perfectly matched the initial examples was still incorrect and failed the oracle’s more rigorous tests. This validates the need for CodeARC’s interactive approach.
- Interaction Helps: Performance consistently improved when models were given larger budgets for asking for more examples and using the oracle for feedback.
- Anonymisation Hurts, But Doesn’t Change Rankings: Models performed slightly worse on the anonymised dataset, confirming that function names provide useful clues. However, the best models remained the best, showing their strength was in reasoning, not just pattern-matching names.
6. Improving AI Performance: Fine-Tuning with a “Teacher”
The researchers improved the performance of a smaller model (LLaMA-3.1-8B-Instruct) using a clever training technique.
- The Analogy: This is like a student learning to solve a maze by watching an expert who has the map.
- A powerful “teacher” model (GPT-4o) was given the answer (the hidden function’s code).
- The teacher then generated a perfect “thought process” for solving the problem: it asked the most informative questions, explained its reasoning, and then wrote the correct code.
- The “student” model (LLaMA-3.1-8B) was trained on these perfect reasoning traces without ever seeing the answer itself. It learned the strategy of how to think and investigate.
This fine-tuning resulted in a 31% relative performance gain, showing that teaching AI models how to reason is a promising direction.
Conclusion
CodeARC provides a much-needed, more realistic benchmark for evaluating the programming and reasoning skills of AI agents. By moving from a static “exam” to an interactive “investigation,” it reveals the current limitations of LLMs in inductive synthesis and highlights the importance of abilities like strategic exploration and self-correction. It serves as a challenging testbed for developing smarter and more capable AI programmers.
22. Agentic reasoning: A streamlined framework for enhancing LLM reasoning with agentic tools
The paper introduces Agentic Reasoning, a framework designed to enhance the capabilities of Large Language Models (LLMs) in solving complex problems that require in-depth research and multi-step thinking. The core idea is to give the main reasoning LLM a small, powerful team of “specialist agents” it can delegate tasks to.
These agents act as tools: a Web-Search agent for finding up-to-date information, a Code agent for performing calculations, and a novel Mind-Map agent that acts as the LLM’s external memory and organisational tool. This Mind-Map builds a structured graph of the reasoning process, preventing the LLM from getting lost or confused in long tasks. By combining these three tools, the framework significantly enhances the performance of an existing LLM (DeepSeek-R1), achieving state-of-the-art results among public models and making it competitive with leading proprietary systems, such as OpenAI’s Deep Research.
1. The Core Problem: LLMs are Book-Smart, Not Street-Smart
Modern LLMs are incredibly knowledgeable, but their intelligence is often confined to the information they were trained on. They are like a brilliant scholar who has read every book in a 2023 library but has no access to the internet, a calculator, or a whiteboard to organise their thoughts.
This leads to several key weaknesses:
- Outdated Knowledge: They can’t answer questions about recent events or access real-time data.
- Poor at Complex Calculations: While they can handle simple math, they struggle with precise, multi-step computations.
- Losing the Plot: In a long and complex reasoning process (e.g., writing a detailed financial report), they can forget earlier points, repeat themselves, or get sidetracked, just like a person trying to juggle too many ideas at once.
Existing solutions often focus on just one tool, like adding a simple web search, but don’t fully integrate a suite of tools in a smart, streamlined way.
2. The Solution: Agentic Reasoning, An LLM with a Specialist Team
Agentic Reasoning transforms the LLM from a solitary thinker into a project manager with a team of expert assistants. When the main LLM encounters a task it can’t handle alone, it pauses its reasoning, delegates the task to the appropriate agent, and then seamlessly integrates the agent’s findings to continue its work.
The Analogy: Imagine an expert consultant hired to solve a complex business problem.
- They don’t just rely on their memory.
- If they need current market data, they ask their Research Assistant (the Web-Search agent) to compile a report.
- If they need to run financial models, they ask their Quantitative Analyst (the Code agent) to run the numbers.
- Throughout the project, they use a whiteboard (the Mind-Map agent) to track all the moving parts, draw connections between ideas, and maintain a clear overview of the strategy.
This is exactly how Agentic Reasoning works. The main LLM is the consultant, and it calls on its specialist agents as needed.
3. Meet the Agents: The Specialist Toolbox
The framework identifies three universally effective agents:
a) The Web-Search Agent (The Super-Researcher)
This isn’t just a simple Google search. It’s a sophisticated research process.
- Query Breakdown: It takes a vague request from the LLM (e.g., “Find external economic indicators”) and breaks it into specific, searchable queries (e.g., “U.S. Q4 2024 inflation rate,” “U.S. Q4 2024 Consumer Confidence Index”).
- Search and Re-rank: It fetches web pages and uses a smart re-ranking model to prioritise the most relevant and high-quality sources.
- Synthesise: It reads the best sources and creates a concise, synthesised summary to give back to the main LLM.
b) The Code Agent (The On-Call Programmer)
When the LLM needs to perform a calculation or a data analysis task, it offloads it to a specialised coding model. This agent writes the code, executes it securely, and returns the result in natural language. This keeps the main LLM focused on high-level reasoning instead of getting bogged down in coding details.
c) The Mind-Map Agent (The Intelligent Whiteboard)
This is the most innovative part of the framework. It acts as the LLM’s structured memory to ensure coherence during long and complex tasks.
- Building a Knowledge Graph: As the LLM reasons, the Mind-Map agent listens in and builds a knowledge graph — a structured network of key concepts (entities) and the logical relationships between them.
- Providing Context: It can provide a clean, summarised context to the other agents so their work is more relevant.
- Answering Questions: If the main LLM gets confused, it can directly query the Mind-Map like an external memory (e.g., “What was the conclusion from the first web search?”).
The Analogy in Action: The paper highlights its effectiveness in the social deduction game Werewolf. The Mind-Map was crucial for tracking who said what, identifying contradictions in players’ arguments, and mapping out alliances and deceptions, leading to a 72% win rate. Without it, the model was easily confused, and its win rate dropped to 36%. It also helps solve tricky logic riddles that often fool LLMs by forcing them to map out the relationships explicitly.
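A bare-bones stand-in for the Mind-Map agent is simply a growing entity-relation graph that later reasoning steps can query. In the real framework an LLM extracts the triples and answers graph queries; here both are reduced to plain data structures for illustration.

```python
from collections import defaultdict

class MindMap:
    """Toy external memory: store (subject, relation, object) triples and answer
    lookups about an entity. The real Mind-Map agent builds this with an LLM."""
    def __init__(self):
        self.edges = defaultdict(list)                 # entity -> [(relation, entity)]

    def add(self, subject, relation, obj):
        self.edges[subject].append((relation, obj))

    def query(self, entity):
        return [f"{entity} {relation} {obj}" for relation, obj in self.edges[entity]]

mm = MindMap()
mm.add("Player3", "claimed to be", "Seer")
mm.add("Player3", "contradicted", "Player5's alibi")
print(mm.query("Player3"))   # ["Player3 claimed to be Seer", "Player3 contradicted Player5's alibi"]
```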
4. Key Findings and Results
- State-of-the-Art Performance: Agentic Reasoning significantly boosted the performance of the DeepSeek-R1 model, achieving a new SOTA on several expert-level benchmarks and closing the gap with top-tier proprietary models like OpenAI’s.
- Quality Over Quantity: The researchers found that giving an LLM a massive toolbox with over 100 tools (from frameworks like LangChain) actually worsened performance. A small, highly effective, and well-integrated set of tools is far better.
- Synergistic Effects: The tools work better together. For example, combining the Web-Search and Mind-Map agents produced a greater improvement than the sum of their individual benefits. The combination of all three agents was the most powerful.
- Excels at Deep Research: In tasks requiring the generation of comprehensive, Wikipedia-style articles, the framework massively outperformed other methods (including Gemini’s Deep Research) in human evaluations across metrics like organisation, relevance, and coverage.
Conclusion
Agentic Reasoning provides a powerful and streamlined blueprint for making LLMs smarter and more capable. By equipping a central reasoning model with a small team of specialist agents — a web-researcher, a coder, and a crucial “mind-map” for memory and organisation — it allows the AI to tackle complex, knowledge-intensive tasks that were previously out of reach. The framework proves that the future of advanced AI reasoning lies not in just building bigger models, but in creating intelligent systems where models and tools work together in a structured, synergistic way.
References
[1] Rasheed, Z., Sami, M. A., Kemell, K., Waseem, M., Saari, M., Systä, K., & Abrahamsson, P. (2024). CodePori: Large-scale system for autonomous software development using multi-agent technology. arXiv. https://doi.org/10.48550/arXiv.2402.01411
[2] Ruan, S., Yang, C., & Liu, T. (2024). SpecRover: Code intent extraction via LLMs. arXiv.
[3] Tang, J., Liu, Z., Wang, S., Wu, Z., & Chen, Z. (2024). Code repair with LLMs gives an exploration-exploitation tradeoff. arXiv.
[4] Wang, S. (2024). MatPlotAgent: Method and evaluation for LLM-based agentic scientific data visualisation. arXiv.
[5] Wang, Z., Wang, W., Li, Z., Wang, L., Yi, C., Xu, X., … Zhou, J. (2024). XUAT-Copilot: Multi-agent collaborative system for automated user acceptance testing with a large language model. arXiv. https://doi.org/10.48550/arXiv.2401.02705
[6] Wu, C., Lin, J., Zhang, S., Li, G., & Zhang, T. (2024). CYCLE: Learning to self-refine the code generation. arXiv.
[7] Xia, C. S. (2024). Agentless: Demystifying LLM-based software engineering agents. arXiv.
[8] Yang, C., Wang, J., Yuan, J., Zou, H., Zhang, K., Zhang, Z., … & Duan, N. (2024). SWE-agent: Agent-computer interfaces enable automated software engineering. ICLR.
[9] Zhang, K., Du, Y., Wu, H., Wang, W., & Zhang, Y. (2024). CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. arXiv.
[10] Zhong, L., Wang, Z., & Shang, J. (2024). Debug like a human: A large language model debugger via verifying runtime execution step-by-step. arXiv. https://doi.org/10.48550/arXiv.2402.16906
2025
[11] Adnan, M., Xu, Z., & Kuhn, C. C. N. (2025). Large language model guided self-debugging code generation. arXiv. https://doi.org/10.48550/arXiv.2502.02928
[12] Almorsi, A., Ahmed, M., & Gomaa, W. (2025). Guided code generation with LLMs: A multi-agent framework for complex code tasks. IEEE JAC-ECC.
[13] Chen, X., Tao, Z., Zhang, K., Zhou, C., Gu, W., He, Y., … Jin, Z. (2025). Revisit self-debugging with self-generated tests for code generation. ACL.
[14] Huynh, N., & Lin, B. (2025). Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications. arXiv. https://doi.org/10.48550/arXiv.2503.01245
[15] Jia, Y., et al. (2025). MemoCoder: Automated function synthesis using LLM-supported agents. arXiv.
[16] Kang, M., et al. (2025). Distilling LLM agent into small models with retrieval and code tools. arXiv.
[17] Kobaladze, Z., Arnania, A., & Sanikidze, T. (2025). From provable correctness to probabilistic generation: A comparative review of program synthesis paradigms. arXiv. https://doi.org/10.48550/arXiv.2508.00013
[18] Maddila, C., et al. (2025). Agentic program repair from test failures at scale: A neuro-symbolic approach with static analysis and test execution feedback. arXiv.
[19] Pan, R., Zhang, H., & Liu, C. (2025). CodeCoR: An LLM-based self-reflective multi-agent framework for code generation. arXiv. https://doi.org/10.48550/arXiv.2501.07811
[20] Seo, M., Baek, J., Lee, S., & Hwang, S. J. (2025). Paper2Code: Automating code generation from scientific papers in machine learning. arXiv. https://doi.org/10.48550/arXiv.2504.17192
[21] Wei, A. (2025). CodeARC: Benchmarking reasoning capabilities of LLM agents for inductive program synthesis. arXiv.
[22] Wu, J. (2025). Agentic reasoning: A streamlined framework for enhancing LLM reasoning with agentic tools. arXiv.
[23] Robeyns, M., Szummer, M., & Aitchison, L. (2025). A self-improving coding agent. arXiv. https://arxiv.org/abs/2504.15228
