It seems like every site I go to with DevOps or Software Engineering news is inundated with — and I mean positively drowning in — articles about LLMs doing the job of engineers.
I was away for the weekend, and this is a smattering of the headlines I came back to, just from the first couple of sites I regularly read:
- AI in Software Development: Productivity at the Cost of Code Quality?
- Gemini Code Assist Goes Free: What DevOps Teams Need to Know About Google’s AI Coding Tool
- AI in Dev Tools: Accelerating but Learning the Limits
- AI Coding: New Research Shows Even the Best Models Struggle With Real-World Software Engineering
To be fair, the tone has generally shifted away from the early Chicken Little sky-is-falling doomsday analysis; three of those four articles take a much more limited view of AI’s ability to code (and the fourth isn’t testing the possibility of replacing SWEs).
There have been benchmarks done on the coding ability of AI, but the last article above — “AI Coding: New Research Shows…” — talks about a new academic paper proposing a more in-depth benchmark that perhaps better captures the work done by a SWE or SWE Manager.
The study, “SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?”, describes the new benchmark. I’ll let the abstract speak for itself:
> We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks — ranging from $50 bug fixes to $32,000 feature implementations — and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks… By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.
Some interesting things from the article:
- They let the real payouts act as a proxy for the difficulty of the task (see the scoring sketch after this list).
- They use the real-world Expensify open-source repository, and the tasks sometimes require context from different parts of the code base to solve.
- They grade by end-to-end tests instead of unit tests. This is much less susceptible to manipulation, and it gives better insight into how well the solution actually works within the product.
- They used a variety of tasks, categorized as either “IC SWE” for implementation tasks or “SWE Manager” for choosing between competing proposals.
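To make the two headline metrics concrete, here is a minimal sketch of how a payout-weighted score could be computed. The `TaskResult` structure and `score` function are my own illustration, not code from the paper; only the idea (fraction of tasks passed versus fraction of total payout earned) comes from the benchmark.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    payout_usd: float  # the real Upwork payout attached to the task
    passed: bool       # did the model's patch pass the end-to-end tests?

def score(results: list[TaskResult]) -> tuple[float, float]:
    """Return (fraction of tasks passed, fraction of total payout earned)."""
    tasks_passed = sum(r.passed for r in results) / len(results)
    money_earned = (
        sum(r.payout_usd for r in results if r.passed)
        / sum(r.payout_usd for r in results)
    )
    return tasks_passed, money_earned

# Toy example: passing only the cheap task yields a much higher task rate
# than payout rate, which mirrors the IC SWE results below.
print(score([TaskResult(50, True), TaskResult(32_000, False)]))
# -> (0.5, 0.00156...)
```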
Results
On the IC SWE tasks, the article compared OpenAI’s o1 model against Anthropic’s Claude 3.5 Sonnet. (They also included GPT-4o, which performed worse.)
| Model | Tasks Passed | Money Earned |
|---|---|---|
| o1 | 16.5% | 12.1% |
| Claude 3.5 Sonnet | 26.2% | 24.5% |
Notice that the percentage of tasks passed is higher than the percentage of money earned. This tells me that, on average, the models were more successful at the easier (lower-payout) tasks, as one might expect.
The rates of success were much higher on the SWE Manager tasks:
| Model | Tasks Passed | Money Earned |
|---|---|---|
| o1 | 41.5% | 51.8% |
| Claude 3.5 Sonnet | 44.9% | 56.8% |
Interestingly, the relationship flips here: the share of money earned exceeds the share of tasks passed, which suggests the models did relatively well on the higher-value (presumably harder) tasks.
I’d also like to point out that since the SWE Manager tasks involved choosing between 4-5 proposed solutions, random guessing would have passed 20-25% of tasks. I don’t say this to minimize the results, only to note that the numbers from the two task types aren’t directly comparable.
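For the record, that 20-25% figure is just the expected pass rate of a uniform random guess over the listed proposals; a two-line check:

```python
# Expected pass rate of a uniform random guess among k proposals.
for k in (4, 5):
    print(f"{k} proposals -> {1 / k:.0%}")
# 4 proposals -> 25%
# 5 proposals -> 20%
```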
So what does this mean? Am I out of a job yet? Probably not, but it’s very useful to have better benchmarks.
I’d really like to see a proper analysis of how well the AIs do based on the difficulty of the task: are they competent at junior-type problems yet?
And for comparison, I’d like to see how actual human engineers do on these benchmarks.
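On the first point, if per-task results (payout plus pass/fail) were published, that difficulty breakdown would be straightforward to compute. Here is a hypothetical sketch; the numbers and tier cutoffs are made up purely for illustration and are not from the paper:

```python
import pandas as pd

# Hypothetical per-task results: payout and whether the model's patch passed.
# These values are invented just to show the shape of the analysis.
df = pd.DataFrame({
    "payout_usd": [50, 250, 1_000, 4_000, 32_000],
    "passed":     [True, True, False, False, False],
})

# Bucket tasks by payout (the benchmark's proxy for difficulty) and report
# the pass rate within each bucket.
df["tier"] = pd.cut(
    df["payout_usd"],
    bins=[0, 500, 5_000, float("inf")],
    labels=["junior-ish", "mid-level", "senior-ish"],
)
print(df.groupby("tier", observed=True)["passed"].mean())
```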


