Sunday, April 20, 2025

OpenAI Researchers Find That Even the Best AI Is “Unable To Solve the Majority” of Coding Problems



OpenAI researchers have admitted that even the most advanced AI models are still no match for human coders, even though CEO Sam Altman insists they will be able to beat “low-level” software engineers by the end of this year.

In a new paper, the company’s researchers found that even frontier models, or the most advanced and boundary-pushing AI systems, “are still unable to solve the majority” of coding tasks.

The researchers used a newly developed benchmark called SWE-Lancer, built on more than 1,400 software engineering tasks from the freelance platform Upwork. Using the benchmark, OpenAI put three large language models (LLMs) to the test: its own o1 reasoning model, its flagship GPT-4o, and Anthropic’s Claude 3.5 Sonnet.

Specifically, the new benchmark evaluated how well the LLMs handled two types of tasks from Upwork: individual tasks, which involved diagnosing bugs and implementing fixes, and management tasks, in which the models had to zoom out and make higher-level decisions. (The models weren’t allowed to access the internet, so they couldn’t simply crib similar answers that had been posted online.)
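Because each SWE-Lancer task is tied to a real Upwork payout, the benchmark’s earnings-based framing can be pictured as a simple scoring loop: a model “earns” a task’s dollar value only if its submission passes that task’s tests. The sketch below is purely illustrative; the field names, task names, and `score` function are assumptions for this example, not the paper’s actual code or data.

```python
# Illustrative sketch of SWE-Lancer-style scoring (hypothetical names/values,
# not the paper's real API): each task has a dollar payout, and a model earns
# the payout only if its submission passes the task's grading tests.

from dataclasses import dataclass

@dataclass
class Task:
    name: str
    kind: str      # "individual" (bug fix) or "management" (higher-level decision)
    payout: float  # real-dollar value the task was worth on Upwork
    passed: bool   # did the model's submission pass the grading tests?

def score(tasks):
    """Return (dollars earned, total dollars at stake, fraction of tasks solved)."""
    earned = sum(t.payout for t in tasks if t.passed)
    total = sum(t.payout for t in tasks)
    rate = sum(t.passed for t in tasks) / len(tasks)
    return earned, total, rate

# Made-up example run:
tasks = [
    Task("fix-login-bug", "individual", 250.0, True),
    Task("root-cause-crash", "individual", 1000.0, False),
    Task("pick-best-proposal", "management", 500.0, True),
]
earned, total, rate = score(tasks)
print(f"${earned:.0f} earned of ${total:.0f} at stake ({rate:.0%} of tasks solved)")
```

Under this framing, a model can solve most of the cheap, surface-level tasks and still earn only a fraction of the total dollar value, which is roughly the pattern the paper reports.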

The models took on tasks cumulatively worth hundreds of thousands of dollars on Upwork, but they were only able to fix surface-level software issues; they could not locate bugs in larger projects or identify their root causes. These shoddy, half-baked “solutions” will be familiar to anyone who has worked with AI, which is great at spitting out confident-sounding information that often falls apart on closer inspection.

Though all three LLMs were often able to operate “far faster than a human would,” the paper notes, they also failed to grasp how widespread bugs were or to understand their context, “leading to solutions that are incorrect or insufficiently comprehensive.”

As the researchers explained, Claude 3.5 Sonnet outperformed both OpenAI models, earning more money than either o1 or GPT-4o. Still, the majority of its answers were wrong, and according to the researchers, any model would need “higher reliability” before it could be trusted with real-life coding tasks.

Put more plainly, the paper seems to demonstrate that although these frontier models can work quickly and solve narrowly scoped tasks, they’re nowhere near as skilled at handling them as human engineers.

Though these LLMs have advanced rapidly over the past few years and will likely continue to do so, they’re not skilled enough at software engineering to replace real-life people quite yet — not that that’s stopping CEOs from firing their human coders in favor of immature AI models.

More on AI and coding: Zuckerberg Announces Plans to Automate Facebook Coding Jobs With AI



