The Turing Test has a problem – and OpenAI's GPT-4.5 just exposed it

Most people know that the famous Turing Test, a thought experiment conceived by computer pioneer Alan Turing, is a popular measure of progress in artificial intelligence.
Many mistakenly assume, however, that it is proof that machines are actually thinking.
The latest Turing Test research, from scholars at the University of California, San Diego, shows that OpenAI's latest large language model, GPT-4.5, can fool people in text chats into judging it human, and that it does so more often than the actual human participants managed to.
Also: How to use ChatGPT: A beginner’s guide to the most popular AI chatbot
That’s a breakthrough in the ability of gen AI to produce compelling output in response to a prompt.
Proof of AGI?
But even the researchers recognize that beating the Turing Test doesn’t necessarily mean that “artificial general intelligence,” or AGI, has been achieved — a level of computer processing equivalent to human thought.
The AI scholar Melanie Mitchell, a professor at the Santa Fe Institute in Santa Fe, New Mexico, has written in the scholarly journal Science that the Turing Test is less a test of intelligence per se and more a test of human assumptions. Despite high scores on the test, “the ability to sound fluent in natural language, like playing chess, is not conclusive proof of general intelligence,” wrote Mitchell.
The latest convincing-sounding achievement is described by Cameron Jones and Benjamin Bergen of UC San Diego in a paper published on the arXiv pre-print server this week, titled “Large Language Models Pass the Turing Test.”
Also: OpenAI expands GPT-4.5 rollout. Here’s how to access (and what it can do for you)
The paper is the latest installment in an experiment that Jones and Bergen have been running for years with participation from UC San Diego undergraduates.
As the authors note, the problem has been debated for decades, with "more than 800 separate claims and counter-arguments" made about computers passing the test.
How the Turing Test works
Turing classically conceived the test as a round of text messages passed between a human "judge" and two "witnesses," one a human and one a computer.
Both witnesses are charged with convincing the judge, through the messages they send, that they are human. The judge knows that exactly one of the two is human, but not which, and has to guess.
That three-way form is essential: if the judge mistakenly deems the computer human, it also means the judge failed to pick up the cues of humanness that the real human should have provided.
Also: With AI models clobbering every benchmark, it’s time for human evaluation
In other words, it’s a test as much about how humans perceive and believe as it is about computer functioning.
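Turing's three-way setup can be sketched as a small simulation. Everything below is illustrative, not from the paper: the judge, the canned replies, and all function names are invented. The point it shows is the chance baseline: a judge with no usable signal converges on a roughly 50% machine "win rate," which is what the researchers' scores are measured against.

```python
import random

def run_three_way_round(judge, human_reply, machine_reply):
    """One round of the three-way test: the judge sees two anonymous
    transcripts (one human, one machine) and must name the human.
    Returns True if the judge guessed correctly."""
    # Shuffle so the judge cannot infer identity from ordering.
    witnesses = [("human", human_reply), ("machine", machine_reply)]
    random.shuffle(witnesses)
    guess_index = judge(witnesses[0][1], witnesses[1][1])  # 0 or 1
    return witnesses[guess_index][0] == "human"

def random_judge(reply_a, reply_b):
    """A baseline judge with no signal: picks at random."""
    return random.randrange(2)

# The machine "wins" a round when the judge picks it as the human,
# i.e., when the judge's guess is wrong.
rounds = 10_000
wins = sum(
    not run_three_way_round(random_judge, "hi, human here", "hi, also human")
    for _ in range(rounds)
)
print(f"machine win rate vs. a no-signal judge: {wins / rounds:.2%}")
```

Against this 50% baseline, any win rate well above chance means the judges were systematically picking the machine as the human.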
Jones and Bergen’s updated design is posted online, and you can play the game yourself, as either judge or witness.
As in prior versions, students at UC San Diego engaged in text conversations through a chat app. Some played the judge and some the human witness in randomized five-minute exchanges, while GPT-4.5 played the computer witness. Each judge held simultaneous conversations with a human and a computer witness to fit the three-way form of Turing's formulation.
To play the computer witness, several large language models were used, each fed the judge's messages as part of a prompt. GPT-4.5 was tested with two different prompts. One contained minimal information. A second, called PERSONA, included far more detail, among it "additional instructions on what kind of persona to adopt in responding to the interrogator: specifically a young person who is introverted, knowledgeable about internet culture, and uses slang," the authors write.
The results
Three other AI programs were tested for comparison: OpenAI's previous model, GPT-4o; Meta Platforms' Llama 3.1 405b; and ELIZA, a chatbot program from the 1960s. ELIZA was included because, as a far more primitive program, it served as a check on whether the large language models fooled humans merely by chance or because they were genuinely better programs.
GPT-4.5 with the PERSONA prompt, report Jones and Bergen, had a "win rate" of 73%, meaning it fooled the human judge into declaring it human 73% of the time. By contrast, GPT-4.5 with the minimalist prompt and the three other AI programs did nowhere near as well, scoring from 21% to 56%, even when the other language models were given a similarly rich PERSONA prompt.
Also: Are ChatGPT Plus or Pro worth it? Here’s how they compare to the free version
Moreover, GPT-4.5 far outperformed GPT-4 in a test Jones and Bergen ran last year, in which that model's win rate was only 54%, just above random chance.
Jones and Bergen conclude that “interrogators were not only unable to identify the real human witness, but were in fact more likely to believe this model was human than that other human participants were.”
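How far those win rates sit from the 50% chance baseline can be checked with an exact binomial tail probability. The sketch below is illustrative only: the trial count is a made-up round number, not the paper's actual sample size, so the probabilities are for intuition rather than a reanalysis of the study.

```python
from math import comb

def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing at least k
    'machine judged human' verdicts if every judge were merely guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical trial count -- the paper's actual sample size will differ.
n_trials = 100
wins_gpt45 = round(0.73 * n_trials)  # 73% win rate reported for GPT-4.5
wins_gpt4 = round(0.54 * n_trials)   # 54% reported for GPT-4 last year

print(f"GPT-4.5 at 73%: P(chance) = {binomial_tail(wins_gpt45, n_trials):.2e}")
print(f"GPT-4 at 54%:  P(chance) = {binomial_tail(wins_gpt4, n_trials):.2f}")
```

Under these assumed numbers, 73% of 100 trials would be vanishingly unlikely by chance, while 54% is entirely compatible with guessing, which is why only the PERSONA result reads as a decisive pass.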
Is the test actually a measure of intelligence?
So, you might ask what it means that humans do a poor job of telling a computer and a person apart based on chat messages.
The “most controversial question” about Turing’s problem over the decades is whether it is actually measuring intelligence, Jones and Bergen acknowledge.
One way of looking at it, they observe, is that it is the machines' ability to "adapt their behaviour to different scenarios that makes them so flexible: and apparently so capable of passing as human." The PERSONA prompt, created by humans, is something to which GPT-4.5 "adapted" itself in order to win.
Again, it’s a genuine technical breakthrough in the AI model’s capabilities.
Also: ChatGPT Plus is free for students now – how to grab this deal before finals
However, a longstanding complaint about the test is that humans might simply be bad at recognizing intelligence. The authors conclude that their experiment offers at least partial evidence of that.
They note that the older ELIZA program fooled the human judges 23% of the time. That was not because it somehow seemed smarter. "Many participants selected ELIZA because it did not meet their expectations of an AI system (e.g. 'they were sarcastic' or 'I don't think AI would be so rude')," they write.
Those guesses, they write, “suggest that interrogators’ decisions incorporate complex assumptions about how humans and AI systems might be likely to behave in these contexts, beyond simply selecting the most intelligent-seeming agent.”
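To see why ELIZA's occasional wins say more about the judges than about the program, it helps to see how little machinery ELIZA actually has. The sketch below is a toy reconstruction of the technique, keyword rules plus pronoun reflection, and not Weizenbaum's actual 1960s script, which was far larger but mechanically the same: pattern matching with no understanding.

```python
import re

# Pronoun reflection so echoed fragments read naturally ("my" -> "your").
REFLECTIONS = {"i": "you", "my": "your", "am": "are", "me": "you"}

# A few illustrative keyword rules; Weizenbaum's script had many more.
RULES = [
    (re.compile(r"i feel (.+)", re.I), "Why do you feel {0}?"),
    (re.compile(r"i am (.+)", re.I), "How long have you been {0}?"),
    (re.compile(r"my (.+)", re.I), "Tell me more about your {0}."),
]

def reflect(fragment):
    """Swap first-person words for second-person ones."""
    return " ".join(REFLECTIONS.get(w, w) for w in fragment.lower().split())

def eliza_reply(message):
    """Return the first matching rule's response, echoing the user's words."""
    for pattern, template in RULES:
        match = pattern.search(message)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please go on."  # fallback when no rule matches

print(eliza_reply("I feel nobody listens to me"))
print(eliza_reply("What is the capital of France?"))
```

The first input gets a plausible-sounding reflection; the second, a factual question, gets only the canned fallback, because there is no knowledge anywhere in the program.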
In fact, the human judges asked little about knowledge in their challenges, even though Turing thought knowledge would be the main criterion. "[O]ne of the reasons most predictive of accurate verdicts," they write, was a judge concluding "that a witness was human because they lacked knowledge."
Sociability, not intelligence
All this means humans were picking up on things such as sociability rather than intelligence, leading Jones and Bergen to conclude that “Fundamentally, the Turing test is not a direct test of intelligence, but a test of humanlikeness.”
For Turing, intelligence may have appeared to be the biggest barrier for appearing humanlike, and hence to passing the Turing test. But as machines become more similar to us, other contrasts have fallen into sharper relief, to the point where intelligence alone is not sufficient to appear convincingly human.
Left unsaid by the authors is that humans have become so used to typing into a computer — to a person or to a machine — that the Test is no longer a novel test of human-computer interaction. It’s a test of online human habits.
One implication is that the test needs to be expanded. The authors write that “intelligence is complex and multifaceted,” and “no single test of intelligence could be decisive.”
Also: Gemini Pro 2.5 is a stunningly capable coding assistant – and a big threat to ChatGPT
In fact, they suggest the test could come out very differently with different designs. AI experts, they note, could serve as a judge cohort; they might judge differently than laypeople because they have different expectations of a machine.
If a financial incentive were added to raise the stakes, human judges might scrutinize responses more closely and thoughtfully. Both possibilities indicate that attitude and expectations play a part.
“To the extent that the Turing test does index intelligence, it ought to be considered among other kinds of evidence,” they conclude.
That suggestion seems to square with an increasing trend in the AI research field to involve humans “in the loop,” assessing and evaluating what machines do.
Is human judgement enough?
Left open is the question of whether human judgment will ultimately be enough. In the movie Blade Runner, the "replicant" robots have become so convincing that humans rely on a fictional machine, the Voight-Kampff test, to determine who is human and who is a robot.
As the quest goes on to reach AGI, and humans realize just how difficult it is to say what AGI is or how they would recognize it if they stumbled upon it, perhaps humans will have to rely on machines to assess machine intelligence.
Also: 10 key reasons AI went mainstream overnight – and what happens next
Or, at the very least, they may have to ask machines what machines “think” about humans writing prompts to try to make a machine fool other humans.