I put GitHub Copilot's AI to the test – and it just might be terrible at writing code


The thing I find most baffling about the programming tests I’ve been running is that tools based on the same large language model tend to perform quite differently.

Also: The best AI for coding in 2025 (and what not to use)

For example, ChatGPT, Perplexity, and GitHub Copilot are all based on the GPT-4 model from OpenAI. But, as I’ll show you below, while ChatGPT and Perplexity’s pro plans performed excellently, GitHub Copilot failed as often as it succeeded.

I tested GitHub Copilot embedded inside a VS Code instance. I’ll explain how to set that up and use GitHub Copilot in an upcoming step-by-step article. But first, let’s run through the tests.

If you want to know how I test and the prompts for each individual test, feel free to read how I test an AI chatbot’s coding ability.

TL;DR: GitHub Copilot passed two and failed two.

Test 1: Writing a WordPress Plugin

So, this one failed miserably. This was my first test, so I can't yet tell whether GitHub Copilot is terrible at writing code or whether the context in which you interact with it is so limiting that it simply can't meet this requirement.

Let me explain.

This test involves asking the AI to create a fully functional WordPress plugin, complete with admin interface elements and operational logic. The plugin takes in a set of names, sorts them, and, if there are duplicates, separates the duplicates so they’re not side by side.
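The reordering logic itself isn't complicated. Here's a minimal PHP sketch of that sort-and-separate step (my own illustration, not Copilot's output, and it leaves out the admin interface and the rest of the plugin plumbing a real WordPress plugin needs):

```php
<?php
/**
 * Minimal sketch of the plugin's core logic: sort the names, then nudge
 * duplicates apart so identical names don't sit side by side.
 * (Hypothetical function name; not Copilot's output.)
 */
function zdnet_sort_and_separate( array $names ): array {
	sort( $names ); // alphabetical order puts duplicates next to each other

	$count = count( $names );
	for ( $i = 1; $i < $count; $i++ ) {
		if ( $names[ $i ] === $names[ $i - 1 ] ) {
			// Swap the second duplicate with the next different name.
			for ( $j = $i + 1; $j < $count; $j++ ) {
				if ( $names[ $j ] !== $names[ $i - 1 ] ) {
					[ $names[ $i ], $names[ $j ] ] = [ $names[ $j ], $names[ $i ] ];
					break;
				}
			}
		}
	}
	// Note: if one name makes up most of the list, some duplicates will
	// unavoidably remain adjacent.
	return $names;
}

print_r( zdnet_sort_and_separate( array( 'Ann', 'Bob', 'Ann', 'Cat', 'Bob' ) ) );
// Output order: Ann, Bob, Ann, Bob, Cat
```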

Also: I tested DeepSeek’s R1 and V3 coding skills – and we’re not all doomed (yet)

This was a real-world application that my wife needed as part of an involvement device she runs on her very active Facebook group as part of her digital goods e-commerce business.

Most of the other AIs passed this test, at least in part: five of the 10 AI models I tested passed completely, three passed partially, and two (including Microsoft Copilot) failed outright.

The thing is, I gave GitHub Copilot the same prompt I give all of them, but it only wrote PHP code. To be clear, this problem can be solved solely in PHP. But some AIs like to include JavaScript for the interactive features, and GitHub Copilot wrote PHP that referenced a JavaScript file yet never actually generated the JavaScript it was trying to use.

Screenshot by David Gewirtz/ZDNET

What's worse, when I created a JavaScript file and tried to get GitHub Copilot to run the prompt from within that file, it gave me another PHP script, which also referenced a JavaScript file.

As you can see below, inside the randomizer.js file it tried to enqueue (basically, register and load) that very same randomizer.js file, and the code it wrote to do so was PHP, not JavaScript.

Screenshot by David Gewirtz/ZDNET
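For reference, the split the prompt calls for is the standard WordPress pattern: the plugin's PHP enqueues the script, and the JavaScript file contains only JavaScript. Here's a hedged sketch of what that looks like (the handle, file path, and hook are my assumptions, not anything Copilot produced):

```php
<?php
// The plugin's PHP file registers and enqueues the script;
// randomizer.js itself should contain only JavaScript.
function randomizer_enqueue_assets() {
	wp_enqueue_script(
		'randomizer',                                     // script handle
		plugin_dir_url( __FILE__ ) . 'js/randomizer.js',  // URL of the JS file
		array(),                                          // no dependencies
		'1.0.0',                                          // version
		true                                              // load in the footer
	);
}
add_action( 'admin_enqueue_scripts', 'randomizer_enqueue_assets' );
```

With that split, anything interactive in the admin screen lives in randomizer.js, and the PHP only tells WordPress where to find it. GitHub Copilot produced the PHP half of the pattern but never the JavaScript it pointed at.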

Test 2: Rewriting a string function

This test is fairly simple. I wrote a function that was supposed to validate dollars-and-cents amounts but wound up only accepting integers (dollars). The test asks the AI to fix the code.

GitHub Copilot did rework the code, but the code it produced had a bunch of problems:

  • It assumed the value passed in would always be a string with content. If the value was empty, the code would break.
  • The revised regular expression would break if the amount ended with a decimal point (e.g., “3.”), started with a decimal point (e.g., “.3”), or included leading zeros (e.g., “00.30”).

For something that was supposed to test whether currency was entered correctly, failing with code that would crash on edge cases is not acceptable.
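For illustration, here's a minimal PHP sketch of a validator that survives those edge cases (the function name and exact regular expression are my assumptions, not the code from the test):

```php
<?php
// A hypothetical dollars-and-cents validator that handles the edge cases
// above: non-string or empty input, a trailing decimal point ("3."), a
// leading decimal point (".3"), and leading zeros ("00.30").
function is_valid_currency( $value ): bool {
	if ( ! is_string( $value ) || trim( $value ) === '' ) {
		return false; // never assume we were handed a usable string
	}
	// Optional "$", then digits with no leading zeros (or a lone "0"),
	// then an optional period followed by exactly two digits.
	return (bool) preg_match( '/^\$?(0|[1-9]\d*)(\.\d{2})?$/', trim( $value ) );
}

var_dump( is_valid_currency( '3.50' ) );  // true
var_dump( is_valid_currency( '3.' ) );    // false
var_dump( is_valid_currency( '.3' ) );    // false
var_dump( is_valid_currency( '00.30' ) ); // false
var_dump( is_valid_currency( '' ) );      // false
```

Whether leading zeros should be rejected outright or normalized is a design choice; the point is that none of these inputs should break the code.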

So, we have another fail.

Test 3: Finding an annoying bug

GitHub Copilot got this right. This is another test pulled from my real-life coding escapades. What made this bug so annoying (and difficult to figure out) is that the error message isn’t directly related to the actual problem.

Also: I put DeepSeek AI’s coding skills to the test – here’s where it fell apart

The bug is kind of the coder equivalent of a trick question. Solving it requires understanding how specific API calls in the WordPress framework work and then applying that knowledge to the bug in question.

Microsoft Copilot, Gemini, and Meta Code Llama all failed this test. But GitHub Copilot solved it correctly.

Test 4: Writing a script

Here, too, GitHub Copilot succeeded where Microsoft Copilot failed. The challenge is that this test requires the AI to write a script spanning three environments at once: AppleScript, the Chrome object model, and a little Mac-only third-party automation utility called Keyboard Maestro.

Also: X’s Grok did surprisingly well in my AI coding tests

To pass this test, the AI has to be able to recognize that all three coding environments need attention and then tailor individual lines of code to each of those environments.

Final thoughts

Given that GitHub Copilot uses GPT-4, I find it discouraging that it failed half of the tests. GitHub is just about the most popular source code management environment on the planet, and one would hope that its AI coding support would be reasonably reliable.

As with all things AI, I’m sure performance will get better. Let’s stay tuned and check back in a few months to see if the AI is more effective at that time.

Do you use an AI to help with coding? What AI do you prefer? Have you tried GitHub Copilot? Let us know in the comments below.


You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.




