I retested Copilot's AI coding skills after last year's strikeout and now it's a slugger



There’s been a ton of buzz about how AI can help with programming, but in the first year or two of generative AI, much of that was hype. Microsoft ran huge events celebrating how Copilot could help you code, but when I put it to the test in April 2024, it failed all four of my standardized tests. It completely struck out. Crashed and burned. Fell off the cliff. It performed the worst of any AI I tested.

Mixed metaphors aside, let’s stick with baseball. Copilot traded its cleats for a bus pass. It was not worthy.

Also: The best AI for coding in 2025 (and what not to use)

But time spent in the bullpen of life seems to have helped Copilot. This time, when it showed up for tryouts, it was warmed up and ready to step into the box. It had been taking its cuts in the cage. When it was time to play, it had its eye on the ball and its swing dialed in. Clearly, it was game-ready and looking for a pitch to drive.

But could it withstand my tests? With a squint in my eye, I stepped onto the pitcher’s mound and started off with an easy lob. Back in 2024, you could feel the wind as Copilot swung and missed. But now, in April 2025, Copilot connected squarely with the ball and hit it straight and true.

Also: How I test an AI chatbot’s coding ability – and you can, too

We had to send Copilot down, but it fought its way back to the show. Here’s the play-by-play.

1. Writing a WordPress plugin

Well, Copilot has certainly improved since the first run of this test in April 2024. Back then, it didn’t provide code to actually display the randomized lines. It stored them in a variable, but it never retrieved or displayed them. In other words, it swung and missed. It produced no output at all.

This is the result of the latest run:

The line-shuffler plugin’s output. Screenshot by David Gewirtz/ZDNET

This time, the code worked. It did leave a random extra blank line at the end, but since it fulfilled the programming assignment, we’ll call it good.
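For context, here’s roughly what the assignment asks for. This is a minimal sketch of my own, not Copilot’s actual output, and the plugin name, shortcode, and option key are placeholders; the part that matters is the final step, where the shuffled lines are actually returned for display rather than just stored:

```php
<?php
/**
 * Plugin Name: Line Shuffler (illustrative sketch, not Copilot's code)
 * Description: Displays a stored list of lines in random order via a shortcode.
 */

// Hypothetical shortcode: [line_shuffler]. In a real plugin the lines would
// come from a settings page; here they're read from a WordPress option.
function line_shuffler_shortcode() {
    $raw   = get_option( 'line_shuffler_lines', "first line\nsecond line\nthird line" );
    $lines = array_filter( array_map( 'trim', explode( "\n", $raw ) ) );

    shuffle( $lines );

    // The step Copilot skipped in 2024: returning the shuffled lines so
    // WordPress renders them, instead of leaving them sitting in a variable.
    return implode( '<br>', array_map( 'esc_html', $lines ) );
}
add_shortcode( 'line_shuffler', 'line_shuffler_shortcode' );
```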

Also: How to use ChatGPT to write code – and my favorite trick to debug what it generates

Copilot’s streak of absolutely unmitigated programming failures has been broken. Let’s see how it does in the rest of the tests.

2. Rewriting a string function

This test is designed around dollars-and-cents conversions. In my first test back in April 2024, the Copilot-generated code properly flagged an error if a value containing a letter or more than one decimal point was sent to it, but it didn’t perform a complete validation. It allowed values through that could have caused subsequent routines to fail.

Also: How I used ChatGPT to write a custom JavaScript bookmarklet

This run, however, went much better. The generated code handles most of the checks properly. It returns false for numbers with more than two digits to the right of the decimal point, like 1.234 and 1.230. It also returns false for numbers with extra leading zeros. So 0.01 is allowed, but 00.01 is not.

Technically, these values could be converted to usable currency values, but it’s never bad for a validation routine to be strict in its tests. The main goal is that the validation routine doesn’t let a value through that could cause a subsequent routine to crash. Copilot did well here.
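To make “strict” concrete, here’s a rough sketch of the kind of validation this test rewards. It’s my own illustration, not Copilot’s generated code; the function name and regular expression are assumptions, but they enforce the same rules described above:

```php
<?php
// Illustrative sketch only -- not Copilot's output. Accepts plain
// dollars-and-cents strings: digits only, at most one decimal point,
// at most two digits after it, and no extra leading zeros
// (so "0.01" passes but "00.01" does not).
function is_valid_currency_string( $value ) {
    return (bool) preg_match( '/^(0|[1-9]\d*)(\.\d{1,2})?$/', (string) $value );
}

// Quick checks against the cases mentioned above.
var_dump( is_valid_currency_string( '0.01' ) );  // bool(true)
var_dump( is_valid_currency_string( '00.01' ) ); // bool(false)
var_dump( is_valid_currency_string( '1.234' ) ); // bool(false)
var_dump( is_valid_currency_string( '1.230' ) ); // bool(false)
```

Anything that fails a check like this gets rejected before it can reach the conversion code, which is the whole point of the exercise.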

We’re now at two for two, a huge improvement over last year’s run.

3. Finding an annoying bug

I gotta tell you how Copilot first answered this back in April 2024, because it’s just too good.

Also: Why I just added Gemini 2.5 Pro to the very short list of AI tools I pay for

This tests the AI’s ability to think a few chess moves ahead. The answer that seems obvious isn’t the right answer. I got caught by that when I was originally debugging the issue that eventually became this test.

On Copilot’s first run, it suggested I check the spelling of my function name and the WordPress hook name. The WordPress hook is a published thing, so Copilot should have been able to confirm spelling. And my function is my function, so I can spell it however I want. If I had misspelled it somewhere in the code, the IDE would have very visibly pointed it out.

And it got better. Back then, Copilot also quite happily repeated the problem statement to me, suggesting I solve the problem myself. Yeah, its entire recommendation was that I debug it. Well, duh. Then, it ended with “consider seeking support from the plugin developer or community forums. 😊” — and yeah, that emoji was part of the AI’s response.

It was a spectacular, enthusiastic, emojic failure. See what I mean? Early AI answers, no matter how useless, should be immortalized.

Especially since Copilot wasn’t nearly as much fun this time. It just solved the problem. Quickly, cleanly, clearly. Done and done. Solved.

Screenshot by David Gewirtz/ZDNET

That puts Copilot at three-for-three and decisively moves it out of the “don’t use this tool” category. Bases are loaded. Let’s see if Copilot can hit a home run.

4. Writing a script

This test asks about a fairly obscure Mac scripting tool called Keyboard Maestro, as well as Apple’s scripting language, AppleScript, and Chrome scripting behavior. For the record, Keyboard Maestro is one of the biggest reasons I use Macs over Windows for my daily productivity, because it allows the entire OS and the various applications to be reprogrammed to suit my needs. It’s that powerful.

In any case, to pass the test, the AI has to properly describe how to solve the problem using a mix of Keyboard Maestro code, AppleScript code, and Chrome API functionality. 

Also: AI has grown beyond human knowledge, says Google’s DeepMind unit

Back in the day, Copilot didn’t do it right. It completely ignored Keyboard Maestro (at the time, it probably wasn’t in its knowledge base). In the generated AppleScript, where I asked it to just scan the current window, Copilot repeated the process for all windows, returning results for the wrong window (the last one in the chain).

But not now. This time, Copilot did it right. It did exactly what was asked, got the right window and tab, properly talked to Keyboard Maestro and Chrome, and used actual AppleScript syntax for the AppleScript.

Bases loaded. Home run.

Overall results

Last year, I said I wasn’t impressed. In fact, I found the results a little demoralizing. But I also said this:

Ah well, Microsoft does improve its products over time. Maybe by next year.

In the past year, Copilot went from strikeouts to scoreboard shaker. It went from flailing in the basement to chasing a pennant under the lights.

What about you? Have you taken Copilot or another AI coding assistant out to the field lately? Do you think it’s finally ready for the big leagues, or is it still riding the bench? Have you had any strikeouts or home runs using AI for development? And what would it take for one of these tools to earn a spot in your starting lineup? Let us know in the comments below.


You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.




