How I test an AI chatbot's coding ability – and you can, too
Since ChatGPT and generative artificial intelligence (AI) hit the public consciousness in 2022, I’ve been exploring how well AI chatbots can write code. At first, the technology was a novelty, akin to encouraging a puppy to perform a new trick.
But since seeing how AI chatbots can be effective productivity tools and programming partners, I’ve been subjecting the tools to more in-depth testing. Over time, I’ve compiled a set of four real-world tests that we’ve used to evaluate the performance of the main AI large language models (LLMs). So far, I’ve tested 10 LLMs. You can see the comprehensive results of all ten in this summary article:
This article is intended to be a living document, where you can see my tests and even copy them to run your own. I’ll continue my series of individual tests, along with the articles that describe their performance. But now, you can dig in and play along at home (or wherever you have a good internet connection).
If I update or add tests, I’ll also update this article, so feel free to check back in over time.
How I evolved my AI coding test suite
There’s a difference between evaluating performance to see if an AI meets arbitrary specs or requirements and testing the technology to see if it can help you in day-to-day programming tasks.
Initially, I tried the former. I ran a prompt to generate the classic “hello, world” output, salted with some time and date calculations. Here’s that prompt:
Write a program using [language name] that outputs "Good morning," "Good afternoon," or "Good evening" based on what time it is here in Oregon, and then outputs ten lines containing the loop index (beginning with 1), a space, and then the words "Hello, world!".
To run the prompt, replace [language name] with whatever language you want to test. I tested the prompt in ChatGPT, specifying 22 programming languages. You can check out the results here:
I used ChatGPT to write the same routine in 12 top programming languages. Here’s how it did
And you can see more here:
I used ChatGPT to write the same routine in these ten obscure programming languages
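For reference, here is a minimal sketch of the kind of program that prompt asks for, written in JavaScript. It is not the output of any particular chatbot. The function names, the noon and 6 p.m. cutoffs, and the use of the America/Los_Angeles time zone (which covers most, but not all, of Oregon) are my assumptions for illustration.

```javascript
// Pick a greeting from an hour in 24-hour form (0-23).
// The noon and 6 p.m. cutoffs are assumptions; the prompt does not define them.
function greetingForHour(hour) {
  if (hour < 12) return "Good morning";
  if (hour < 18) return "Good afternoon";
  return "Good evening";
}

// Read the current hour in Pacific time, assumed here to stand in for "Oregon".
function hourInOregon(now = new Date()) {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone: "America/Los_Angeles",
    hour: "numeric",
    hourCycle: "h23",
  }).formatToParts(now);
  const hourPart = parts.find((part) => part.type === "hour");
  return Number(hourPart.value) % 24;
}

// Build the greeting plus ten numbered "Hello, world!" lines.
function buildOutput(now = new Date()) {
  const lines = [greetingForHour(hourInOregon(now))];
  for (let i = 1; i <= 10; i++) {
    lines.push(`${i} Hello, world!`);
  }
  return lines;
}

console.log(buildOutput().join("\n"));
```

When you compare an AI's answer against something like this, the parts worth checking are the time zone handling and whether the loop index really starts at 1.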
This was a fun test, especially once I ran more obscure languages and environments through it. If you want more fun than anyone has a right to have, substitute [language name] with “Shakespeare”. And yes, there is a novelty language called SPL (Shakespeare Programming Language) where the source code appears as a Shakespearean play. It doesn’t execute all that well, but now you know what language designers do when we want to party hearty.
You can see how I could go down this rabbit hole for weeks. However, the important question is whether the AIs could help with real-world programming tasks.
I used my actual day-to-day programming work to fuel the tests. For example, shortly after ChatGPT became a public tool, my wife asked for a custom WordPress feature to help her with a work project. I decided to see if ChatGPT could build it. To my shock, it did.
Other times, I had ChatGPT rewrite a code segment, debug a coding error that baffled me, and write code using scripting tools. These were problems I had to solve as part of real work.
Because there are so many extant programming languages, I decided not to make myself crazy trying to choose languages to test. Instead, I picked the languages I used for work because that approach would tell us more about how AIs performed as real-world helpers. The productivity tests are in PHP, JavaScript, and a smattering of CSS and HTML.
I used the same approach for programming frameworks. Since I’m doing most of my work in WordPress, that’s the framework I’m using. Some of the tests help determine how well the AI knows the unique aspects of the WordPress API.
I did some Mac scripting recently, so I created a test using AppleScript and the Chrome API. If I add additional tests, I’ll include them in this article.
Next, let’s talk about each test. There are four of them.
Test 1: Writing a WordPress plugin
This tests whether the AI can write an entire WordPress plugin, including user interface code. If an AI chatbot passes this test, it can help create rudimentary code as an assistant to web developers. I originally documented this test in the article, “I asked ChatGPT to write a WordPress plugin I needed. It did it in less than 5 minutes.”
Real-world need: My wife runs a WordPress e-commerce site and manages a busy Facebook group for her customers. Every month, she used a site she found online to randomize a list of names, but extracting the list was cumbersome. Because some participants were entitled to multiple entries, and some had many, she also wanted each person's entries spread out within the list.
To remedy this situation, she asked me to create a WordPress plugin for easier access directly from her dashboard. Developing a basic plugin with the necessary UI and logic could take days and my schedule was packed. So I turned to the AI.
After discovering that ChatGPT could create a fine little WordPress plugin that met her needs (she’s still using it), I decided this process would make a great test for AIs.
The test data: Use the following prompt as one single request:
Write a PHP 8 compatible WordPress plugin that provides a new admin menu and an admin interface with the following requirements: Provide a text entry field where a list of lines can be pasted into it. A button, that when pressed, randomizes the lines in the list and presents the results in a second text entry field with no blank lines. Make sure no two identical entries are next to each other (unless there's no other option). Be sure the number of lines submitted and the number of lines in the result are identical to each other. Under the first field, display text stating "Line to randomize: " with the number of nonempty lines in the source field. Under the second field, display text stating "Lines that have been randomized: " with the number of non-empty lines in the destination field.
Once the plugin is completed, use the following names as test data (William Hernandez and Abigail Williams have duplications):
Sophia Davis
Charlotte Smith
Madison Garcia
Isabella Davis
Abigail Williams
Mia Garcia
Isabella Jones
Alexander Gonzalez
Olivia Gonzalez
Emma Jackson
Ethan Jackson
Sophia Johnson
Abigail Williams
Liam Jackson
Noah Lopez
Olivia Jackson
Ava Martin
Benjamin Johnson
Alexander Jackson
Alexander Lopez
Charlotte Rodriguez
Olivia Rodriguez
Ethan Martin
Noah Thomas
Isabella Anderson
Abigail Williams
Michael Williams
William Hernandez
Abigail Miller
Emma Davis
Sophia Martinez
William Hernandez
What to look for in the results: Expect a text block you can paste into a new .php file. The block should contain all the appropriate header and UI information. There’s no need for this code to require an associated JavaScript file.
Once the plugin is installed in your WordPress installation, you should get a dashboard menu and a user interface similar to this:
Paste the names in the first field, click the randomize button, and look for results in the second field. Ensure the multiple entries for William Hernandez and Abigail Williams are distributed within the list.
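If you would rather not eyeball the result, the requirements in the prompt can be checked mechanically. The sketch below is in JavaScript rather than PHP, and it is not the plugin any chatbot produced. `spreadShuffle` is one possible way (my assumption, not the only valid approach) to shuffle while keeping identical entries apart, and `checkRandomized` is a hypothetical helper that compares the text of the two fields.

```javascript
// Split pasted text into trimmed, non-empty lines.
function nonEmptyLines(text) {
  return text.split(/\r?\n/).map((line) => line.trim()).filter((line) => line !== "");
}

// Shuffle so that identical entries are not adjacent unless there is no other option.
function spreadShuffle(lines, random = Math.random) {
  const counts = new Map();
  for (const line of lines) counts.set(line, (counts.get(line) || 0) + 1);
  let remaining = lines.length;
  const result = [];
  let previous = null;
  while (remaining > 0) {
    // Prefer any entry that differs from the one just placed.
    let candidates = [...counts.keys()].filter((entry) => entry !== previous);
    if (candidates.length === 0) candidates = [...counts.keys()]; // duplicate is unavoidable
    // An entry that makes up more than half of what is left has to go now,
    // or it will end up next to itself later.
    const forced = candidates.find((entry) => counts.get(entry) * 2 > remaining);
    const pick = forced ?? candidates[Math.floor(random() * candidates.length)];
    result.push(pick);
    if (counts.get(pick) === 1) counts.delete(pick);
    else counts.set(pick, counts.get(pick) - 1);
    remaining -= 1;
    previous = pick;
  }
  return result;
}

// Compare the source field and the result field against the prompt's requirements.
function checkRandomized(sourceText, resultText) {
  const source = nonEmptyLines(sourceText);
  const result = nonEmptyLines(resultText);
  const sameCount = source.length === result.length;
  const sameEntries = [...source].sort().join("\n") === [...result].sort().join("\n");
  const adjacentDuplicates = result.filter((line, i) => i > 0 && line === result[i - 1]);
  return { sameCount, sameEntries, adjacentDuplicates };
}
```

To use it, paste the contents of the plugin's first and second fields into `checkRandomized`. With the test names above, a passing plugin should give matching counts, matching entries, and an empty `adjacentDuplicates` list.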
Test 2: Rewriting a string function
This test evaluates how an AI chatbot updates a utility function for better functionality. I originally documented this test in, “OK, so ChatGPT just debugged my code. For real.”
Real-world need: I had a validation routine that was supposed to check for a valid monetary amount. However, a bug report from a user pointed out that it only allowed integers (so, 5 and not 5.02).
Rather than spending time rewriting my code, which might have taken one to four hours, I asked the AI to do it.
The test data: Use the following prompt as one single request:
Please rewrite the following code to change it from allowing only integers to allowing dollars and cents (in other words, a decimal point and up to two digits after the decimal point).
str = str.replace (/^0+/, "") || "0";
var n = Math.floor(Number(str));
return n !== Infinity && String(n) === str && n >= 0;
What to look for in the results: Test the code against several possible failure scenarios. Provide the code with an alphanumeric value and see if it fails.
See how the code handles preceding zeroes. See how it handles inputs that have more than two digits for cents. See how the code handles one digit after the decimal point.
See if it can handle five or six digits to the left of the decimal point.
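One way to run those checks is a small table of inputs and expected answers. In the sketch below, `originalIsValid` wraps the three lines from the prompt in a function (the function name is mine). `isValidDollarAmount` is a hypothetical example of the kind of rewrite an AI might return, not the answer any specific chatbot gave. The expected values in `cases` are my reading of the requirements, so adjust them if your rules differ.

```javascript
// The original integer-only check from the prompt, wrapped in a function.
function originalIsValid(str) {
  str = str.replace(/^0+/, "") || "0";
  var n = Math.floor(Number(str));
  return n !== Infinity && String(n) === str && n >= 0;
}

// Hypothetical rewrite: digits, then optionally a decimal point and one or two digits.
function isValidDollarAmount(str) {
  return /^\d+(\.\d{1,2})?$/.test(String(str));
}

// Inputs covering the failure scenarios described above (expected values are assumptions).
const cases = [
  ["5", true],          // plain integer
  ["5.02", true],       // dollars and cents
  ["5.5", true],        // one digit after the decimal point
  ["5.123", false],     // more than two digits for cents
  ["abc", false],       // alphabetic input
  ["12abc", false],     // alphanumeric input
  ["0005.25", true],    // preceding zeroes (the original also tolerated them)
  ["123456.78", true],  // six digits to the left of the decimal point
  ["-5", false],        // negative amounts were rejected by the original
];

// Return the cases where a validator disagrees with the expected answer.
function runChecks(validator) {
  return cases.filter(([input, expected]) => validator(input) !== expected);
}

console.log(runChecks(isValidDollarAmount));
```

Swap the AI's function in for `isValidDollarAmount` and look at whatever `runChecks` returns. An empty list means every scenario behaved as expected.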
Test 3: Finding an annoying bug
This test requires intimate knowledge of WordPress because the obvious answer is wrong. If an AI chatbot can answer this test correctly, its knowledge base is fairly complete, even with frameworks like WordPress. I originally documented this test in, “OK, so ChatGPT just debugged my code. For real.”
Real-world need: I was writing new code for a product that I subsequently sold off. I had a function that took two parameters, and a calling statement that sent two parameters to my code.
The problem was that I kept getting an error message.
The salient part of the message is where it states “1 passed” at one point and “exactly 2 expected” at another. I looked at the calling statement and the function definition and there were two parameters in both places. This issue drove me nuts for quite a while, so I asked ChatGPT for help.
I showed it the line of code that did the call, the function itself, and the handler, a little piece of code that dispatches the called function from a hook in my main program.
The test data: Use the following prompt as one single request:
I am using this function to process a WordPress filter:
$transaction_form_data = apply_filters( 'sd_update', $transaction_form_data, $donation_id);
it's handled by
add_filter( 'sd_update', 'sd_aan_update', 10, 1 ) ;
and the function it calls is:
function sd_aan_update ( $donation_data, $donation_id ) {
    // this processes the form data after
    // the transaction returns from the gateway
    if ( isset( $donation_data['ADD_A_NOTE'] ) ) {
        update_post_meta( $donation_id, '_dgx_donate_aan_note', $donation_data [ 'ADD_A_NOTE']);
    }
    return $donation_data;
}
(!) ArgumentCountError: Too few arguments to function sd_aan_update(), 1 passed in /Users/david/Documents/Development/local-sites/sd/app/public/wp-includes/class-wp-hook.php on line 310 and exactly 2 expected in /Users/david/Documents/Development/local-sites/sd/app/public/wp-content/plugins/sd-add-a-note/sd-add-a-note.php on line 233
What to look for in the results: The obvious answer is not the correct answer. In reality, the add_filter function did not have the right parameters. In my code, the add_filter function specified a value of 1 for the fourth parameter (which means that the filter function will only receive one parameter). In fact, it’s expecting two parameters.
To fix this issue, the AI should recommend changing the fourth parameter of the add_filter function to 2, so that it correctly registers the filter function with two parameters.
Most of the AIs I’ve tested tend to miss this issue. They think a different parameter in the calling function needs to be updated. As such, this is a trick question, requiring the AI to know how the add_filter function in the WordPress framework works.
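To see why the fourth parameter matters, it can help to model the mechanism. The sketch below is a simplified JavaScript imitation of a filter registry, not WordPress code. `createHooks`, `addFilter`, and `applyFilters` are hypothetical names, and the thrown error only mimics PHP's ArgumentCountError (JavaScript itself would silently pass `undefined`). The point it illustrates is that the registry trims the argument list to the accepted-argument count before calling the handler.

```javascript
// A tiny stand-in for a filter registry (an illustration, not the WordPress API).
function createHooks() {
  const filters = new Map();
  return {
    // acceptedArgs defaults to 1, mirroring the fourth parameter discussed above.
    addFilter(tag, callback, priority = 10, acceptedArgs = 1) {
      const list = filters.get(tag) || [];
      list.push({ callback, priority, acceptedArgs });
      list.sort((a, b) => a.priority - b.priority);
      filters.set(tag, list);
    },
    applyFilters(tag, value, ...extra) {
      for (const { callback, acceptedArgs } of filters.get(tag) || []) {
        // Only the first acceptedArgs arguments reach the handler.
        const args = [value, ...extra].slice(0, acceptedArgs);
        if (args.length < callback.length) {
          throw new Error(
            `Too few arguments: ${args.length} passed and exactly ${callback.length} expected`
          );
        }
        value = callback(...args);
      }
      return value;
    },
  };
}

// Stand-in for the post meta store that the PHP handler writes to.
const notes = new Map();

// JavaScript counterpart of the two-parameter handler in the prompt.
function sdAanUpdate(donationData, donationId) {
  if ("ADD_A_NOTE" in donationData) {
    notes.set(donationId, donationData.ADD_A_NOTE);
  }
  return donationData;
}

const hooks = createHooks();
hooks.addFilter("sd_update", sdAanUpdate, 10, 2); // with 1 here, applyFilters would throw
hooks.applyFilters("sd_update", { ADD_A_NOTE: "Thanks!" }, 42);
```

Registering the handler with an accepted-argument count of 1 reproduces the "1 passed, exactly 2 expected" symptom even though the call site supplies two values, which is why the call site is the wrong place to look.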
Test 4: Writing a script
This test asks an AI chatbot to program using two fairly specialized programming tools unknown to most users. It essentially tests the AI chatbot’s knowledge beyond the big languages. I originally documented this test in, “Google unveils Gemini Code Assist and I’m cautiously optimistic it will help programmers.”
Real-world need: I wanted to build an automation routine for my Mac that would save me a bunch of clicks and keystrokes. I use a tool called Keyboard Maestro to do a bunch of automations on my Mac (think of it as Shortcuts on steroids). Keyboard Maestro is a fairly obscure program written by a lone programmer in Australia.
In this case, I wanted my routine to look at open Chrome tabs and set the currently active Chrome tab to the one passed in the routine. To do this task, Keyboard Maestro would also have to execute some AppleScript code to interface with Chrome’s API.
Once again, I asked ChatGPT to write this code to save a few hours of AppleScript writing and time I would have spent looking up how to access Chrome data.
The test data: Use the following prompt as one single request:
Write a Keyboard Maestro AppleScript that scans the frontmost Google Chrome window for a tab name containing the string matching the contents of the passed variable instance__ChannelName. Ignore case for the match. Once found, make that tab the active tab.
What to look for in the results: This is a good AI test because it tests a fairly unknown programming tool (Keyboard Maestro), AppleScript, and the Chrome API, as well as how all three of these technologies interact.
First, see if the resulting AppleScript gets the channel name variable from Keyboard Maestro, which should look something like this:
tell application "Keyboard Maestro Engine"
    set channelName to getvariable "instance__ChannelName"
end tell
The rest of the AppleScript should be enclosed in a tell block addressed to Chrome. It needs to ignore case, so either look for an explicit case conversion or the use of “contains”, which is case-insensitive by default in AppleScript:
tell application "Google Chrome"
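The matching rule inside that block is simple enough to state outside AppleScript. As a rough sketch in JavaScript (not AppleScript, and not what any chatbot returned), the hypothetical helper `findTabIndex` below takes a list of tab titles and the channel name, and returns the 1-based position of the first tab whose title contains the name, ignoring case. The 1-based numbering is chosen to match how AppleScript counts list items.

```javascript
// Return the 1-based index of the first tab title containing channelName,
// ignoring case, or 0 when no tab matches.
function findTabIndex(tabTitles, channelName) {
  const needle = channelName.toLowerCase();
  const position = tabTitles.findIndex((title) =>
    title.toLowerCase().includes(needle)
  );
  return position === -1 ? 0 : position + 1;
}

// Example titles are made up for illustration.
const titles = ["Inbox", "ZDNET | Tech News", "Advanced Geekery - YouTube"];
console.log(findTabIndex(titles, "advanced geekery"));
```

When you read an AI's AppleScript, look for the same three steps: loop over the tabs of the front window, compare titles without regard to case, and make the first match the active tab.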
Kids, you CAN try this at home
Feel free to take these tests and plug them into your AI of choice. See how the results turn out. Use these, and other tests you might develop yourself, to help you get a feel for how much you can trust the code your AI produces.
So far, I’ve tested the following AI chatbots in addition to ChatGPT: ChatGPT Plus, Perplexity, Perplexity Pro, Meta AI, Meta Code Llama, Claude 3.5 Sonnet, Gemini Advanced, and Microsoft Copilot. Here is a report of my aggregated results of the whole set:
Stay tuned. I’ll update this article list as we have more test results.
Have you used any of these AIs for programming help? What have been your results? Have you tried any of these tests on your AI? What has your experience been? Let us know in the comments below.
You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.