X's Grok did surprisingly well in my AI coding tests


ZDNET

When X first came out with its chatbot, it was behind a paywall. But TANSTAAFL notwithstanding, X recently opened up Grok to the world. So I decided to throw my programming tests at it.

Also: How I test an AI chatbot’s coding ability – and you can, too

I’ve always been a bit intrigued by Grok because of the name. Grok was coined by Robert Heinlein, one of my very favorite science fiction writers. I fully credit Heinlein with twisting my young brain.

My parents tightly controlled the media I had access to based on what they considered wholesome and healthy. But they gave me free rein to read whatever limited science fiction I could find in the local library — because the word science meant it had to be educational.

Without geeking out on Heinlein too much, let’s just say that he had a very open mind when it came to societal norms. He wrote powerful stories, included wonderful science-related themes in his narratives, and often injected deep social commentary into his books.

Also: The best AI for coding in 2025 (and what not to use)

He also coined the term “grok” as a Martian word with many broad meanings. First appearing in Stranger in a Strange Land, it can be interpreted as meaning “I understand,” with that understanding existing at a deep, fundamental level. As such, it’s a perfect name for an AI chatbot.

Except…


Screenshot by David Gewirtz/ZDNET

When I asked Grok about what LLM (large language model) it uses, it decided to also tell me that it was inspired by the wit and rebelliousness of Hitchhiker’s Guide to the Galaxy. While Hitchhiker’s does have wit and it does have rebelliousness, it does not include the word “grok.”

And with that, let’s dive into my programming tests.

1. Writing a WordPress plugin

This is a coding test that requires the AI to know PHP programming and how to construct a WordPress plugin. It was actually born from a real-world request from my wife, who needed a tool to randomize and sort names, but with a twist.

Also: What WordPress users need to know about the Automattic and WP Engine conflict

Every month, she runs an involvement device on her e-commerce site that chooses a bunch of names at random. The gotcha is that some of her users get multiple entries if they submit multiple projects. So, the randomizer has to manage multiple names but also separate them so they’re not side by side in the results.

Finally, the code had to provide a good, clear user interface so that she could simply paste in the names, click a button, and get her list back out.
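For illustration, here's a minimal sketch of one way to meet that separation requirement: repeatedly place the most frequent remaining name, skipping whichever name was just placed. This is Python rather than the PHP a WordPress plugin would actually use, and the function name is mine, not from the plugin Grok generated.

```python
import random
import heapq
from collections import Counter

def randomized_spread(names):
    """Shuffle names so repeated entries are never adjacent
    (whenever such an arrangement is possible)."""
    counts = Counter(names)
    # Max-heap keyed on negative count; a random value breaks ties
    # so equally frequent names come out in random order.
    heap = [(-c, random.random(), n) for n, c in counts.items()]
    heapq.heapify(heap)

    result = []
    prev = None
    while heap:
        c, r, n = heapq.heappop(heap)
        if n == prev and heap:
            # Top name would repeat; use the runner-up instead.
            c2, r2, n2 = heapq.heappop(heap)
            result.append(n2)
            prev = n2
            if c2 + 1 < 0:
                heapq.heappush(heap, (c2 + 1, random.random(), n2))
            heapq.heappush(heap, (c, r, n))
        else:
            result.append(n)
            prev = n
            if c + 1 < 0:
                heapq.heappush(heap, (c + 1, random.random(), n))
    return result
```

The greedy most-frequent-first strategy keeps any one entrant's multiple entries as far apart as the counts allow.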


Screenshot by David Gewirtz/ZDNET

I fed this assignment to Grok, and it succeeded. The interface was cleanly laid out and functional. And, most importantly, it did what the code was supposed to do, successfully randomizing and separating the names. I give this test a win.


Screenshot by David Gewirtz/ZDNET

2. Rewriting a string function

My second test addresses a problem first reported to me by a user. The code I had pushed out was designed to check whether a number entered by a user was a valid dollars-and-cents currency amount. My error was that the code only allowed for integers, so you could donate $5, but not $5.25.

Grok successfully rewrote the regular expression code. It’s very close to a win, but I have to give it a fail because the code it generates doesn’t allow numbers like .5, which is a valid currency amount. It does allow 0.5, but not every user would choose to prepend a zero to the cents value.

Also: Elon Musk’s X now trains Grok on your data by default – here’s how to opt out

It also uses a fairly inefficient mechanism to do double conversions and doesn’t properly handle strings that can’t be converted into a number.
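For reference, a validator that does accept bare decimals like .5 might look something like this sketch (Python rather than the plugin's PHP; the pattern is my own illustration, not Grok's output):

```python
import re

# Accepts "5", "5.25", "0.5", and bare decimals like ".5";
# rejects trailing dots ("5."), too many cent digits, and non-numbers.
CURRENCY_RE = re.compile(r'^(?:\d+(?:\.\d{1,2})?|\.\d{1,2})$')

def is_valid_currency(s: str) -> bool:
    """Return True if s looks like a valid dollars-and-cents amount."""
    return bool(CURRENCY_RE.fullmatch(s.strip()))
```

Anchoring the pattern and matching the whole string also sidesteps the failed-conversion problem: input that isn't a number simply never matches.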

So far, we’re at one win and one loss.

3. Finding an annoying bug

The third test requires knowledge of the WordPress framework and API because the bug I’m asking the AI to find is a subtle one that results from a misinterpretation of the WordPress API requirements.

A number of the LLMs I tested got the problem wrong (as I did for hours when trying to debug it). But Grok grokked the problem and gave me a functionally correct and useful answer.

This brings us to two wins and one loss, pulling Grok ahead of almost half of the other LLMs tested previously. Let’s see how it does on the fourth and final test.

4. Writing a script

This is a tough test because it requires the AI to be aware of a fairly low-volume vertical scripting tool for the Mac called Keyboard Maestro. It also requires the AI to be able to write code for three separate environments at once: Keyboard Maestro, Chrome, and AppleScript.

So far, only Google Gemini and ChatGPT running GPT-4 or later models have passed this test. Even ChatGPT 3.5 failed.

But we have a new AI that can handle this level of coding challenge: Grok. That gives Grok three wins out of four, which puts it ahead of every other AI not based on a ChatGPT LLM.

Final thoughts 

Overall, Grok held its own. If it had only allowed a currency value without a leading zero, it would have had a perfect score. I’m not sure how I feel about all the changes at X since it replaced Twitter, but Grok appears to be a fairly formidable chatbot, at least when it comes to programming prowess.

Also: How to program your iPhone’s Action Button to summon ChatGPT’s voice assistant

What do you think? Have you used Grok? Have you read Stranger in a Strange Land? What about Hitchhiker’s? Let us know in the comments below. So long, and thanks for all the fish.


You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.





