Roll over, Darwin: How Google DeepMind's 'mind evolution' could enhance AI thinking
One of the big trends in artificial intelligence in the past year has been the employment of various tricks during inference — the act of making predictions — to dramatically improve the accuracy of those predictions.
For example, chain-of-thought — having a large language model (LLM) spell out the logic of an answer in a series of statements — can lead to increased accuracy on benchmark tests.
Such “thinking” has apparently led to breakthroughs in accuracy on abstract tests of problem-solving, such as the high score that OpenAI’s o3 model achieved last month on the ARC-AGI test.
It turns out, however, that LLMs still fall short on very practical tests, even one as simple as planning a trip.
Google DeepMind researchers, led by Kuang-Huei Lee, pointed out in a report last week that Google’s Gemini and OpenAI’s o1, the companies’ respective best models, fail miserably when tested on TravelPlanner, a benchmark test introduced last year by scholars at Fudan University, Penn State, and Meta AI.
Tasked with formulating a travel itinerary to meet requirements such as cities visited, time spent, and travel budget, the two AI models were successful only 5.6% and 11.7% of the time, respectively.
Given the weak results of top models, Lee and team propose an advance beyond chain-of-thought and similar approaches that they say is dramatically more accurate on tests such as TravelPlanner.
Called “mind evolution,” the new approach is a form of searching through possible answers — but with a twist.
The authors adopt a genetically inspired algorithm that induces an LLM, such as Gemini 1.5 Flash, to generate multiple answers to a prompt. Those answers are then evaluated to determine which is most “fit” to answer the question.
In the real world, evolution happens via natural selection, where entities are evaluated for “fitness” in their environment. The most fit combine to produce offspring, and occasionally there are beneficial genetic mutations. The whole process leads to progressively more “optimal” organisms.
Likewise, Lee and team’s mind evolution causes the multiple answers of the LLM to be evaluated for how well they match the prompted question. That process then forces the LLM to modify its output to be better — a kind of recombination and mutation as seen in natural selection. At the same time, low-quality output is “retired,” like bad entities being culled from the species via natural selection.
The point of such an evolutionary approach is that it’s hard to find good solutions in one stroke, but it’s relatively easy to weed out the bad ones and try again. As they write, “This approach exploits the observation that it is often easier to evaluate the quality of a candidate solution than it is to generate good solutions for a given problem.”
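To make the generate, evaluate, and refine cycle concrete, here is a minimal sketch of an evolutionary search loop in Python. It is not the authors’ implementation: the toy trip-planning task, the city costs, and the `propose` and `refine` functions are all hypothetical, and random choices stand in for the LLM calls that mind evolution would make.

```python
import random

# Hypothetical toy task: a 5-day plan must visit every required city and stay in budget.
CITIES = {"Lisbon": 120, "Porto": 90, "Madrid": 150, "Seville": 100}  # cost per day
REQUIRED = {"Lisbon", "Porto", "Seville"}
BUDGET = 550
DAYS = 5

def fitness(plan):
    """Higher is better; 0 means every constraint is satisfied."""
    missing = len(REQUIRED - set(plan))
    over_budget = max(0, sum(CITIES[city] for city in plan) - BUDGET)
    return -(100 * missing + over_budget)

def propose(rng):
    """Stand-in for asking an LLM for a fresh candidate plan."""
    return [rng.choice(list(CITIES)) for _ in range(DAYS)]

def refine(parent_a, parent_b, rng):
    """Stand-in for LLM refinement: recombine two parents, then mutate one day."""
    cut = rng.randrange(1, DAYS)
    child = parent_a[:cut] + parent_b[cut:]
    child[rng.randrange(DAYS)] = rng.choice(list(CITIES))
    return child

def evolve(generations=30, population_size=8, survivors=4, seed=0):
    rng = random.Random(seed)
    population = [propose(rng) for _ in range(population_size)]
    for _ in range(generations):
        # Evaluating is cheap, so rank every candidate and retire the weakest.
        population.sort(key=fitness, reverse=True)
        parents = population[:survivors]
        if fitness(parents[0]) == 0:
            break  # a plan that meets every constraint has been found
        children = [refine(*rng.sample(parents, 2), rng)
                    for _ in range(population_size - survivors)]
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print(best, fitness(best))
```

Because the surviving parents are carried into the next generation unchanged, the best score in the population can never get worse from one round to the next, which is the sense in which weeding out bad candidates is easier than producing a good one in a single stroke.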
The key is how best to evaluate the AI model’s multiple answers. To do so, the authors fall back on a well-established prompting strategy. Instead of just chain-of-thought, they have the model conduct a dialogue of sorts.
The LLM is prompted to portray two personas in dialogue, one of which is a critic, and the other, an author. The author proposes solutions, such as a travel plan, and the critic points out where there are flaws.
“We leverage an LLM to generate an improved solution by organizing a critical conversation between a ‘critic’ character and an ‘author’ character,” write Lee and team. “Each conversational turn is structured as a prompt-driven process, where solutions are refined based on critical feedback.”
Quite long prompts are used, showing the LLM examples of proposed solutions and where they ran into problems. The prompt gives the model instructions as to how to play the two roles, such as, “Jane, remember you’re the best in the world at analyzing flawed travel plans,” and, “John, remember that you’re the best in the world at writing budget travel plans based on Jane’s analyses.”
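As a rough illustration of what one such conversational turn might look like, the sketch below assembles a critic-then-author prompt as a plain string. Only the two reminder sentences come from the prompts quoted above; the function name `build_turn_prompt`, the section layout, and the sample task are assumptions made for illustration, not the authors’ actual template.

```python
# The two reminders are quoted from the article; everything else is hypothetical.
CRITIC_REMINDER = ("Jane, remember you're the best in the world at "
                   "analyzing flawed travel plans.")
AUTHOR_REMINDER = ("John, remember that you're the best in the world at writing "
                   "budget travel plans based on Jane's analyses.")

def build_turn_prompt(task, candidate_plan, evaluator_feedback):
    """Assemble one critic-then-author turn as a single prompt string."""
    feedback_lines = "\n".join("- " + item for item in evaluator_feedback)
    sections = [
        "Task:\n" + task,
        "Current plan:\n" + candidate_plan,
        "Problems found by the evaluator:\n" + feedback_lines,
        CRITIC_REMINDER,
        "Jane: list every flaw in the current plan.",
        AUTHOR_REMINDER,
        "John: write a revised plan that fixes each flaw Jane listed.",
    ]
    return "\n\n".join(sections)

prompt = build_turn_prompt(
    task="3 days, Lisbon and Porto, total budget $500.",
    candidate_plan="Day 1 Lisbon, Day 2 Lisbon, Day 3 Lisbon",
    evaluator_feedback=["Porto is never visited."],
)
print(prompt)
```

In a real system the returned string would be sent to the model, and the revised plan in its reply would become a new candidate for evaluation.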
Gemini 1.5 Flash is tested on multiple planning benchmarks. On TravelPlanner, Gemini with the mind evolution approach soars above its unaided 5.6% success rate to reach 95.2%, they relate. And when they use the more powerful Gemini Pro model, the result is nearly perfect: 99.9%.
The results, Lee and team write, show “a clear advantage of an evolutionary strategy” that combines a broad search over possible solutions with use of the language model to refine those solutions through the author-critic roles.
The bad news is that mind evolution requires much more computing power than the normal Gemini approach. The Flash version with mind evolution makes 167 API calls to the model, versus a single call when Flash is operating normally. Mind evolution also eats up three million tokens, because of the very long prompts, versus 9,000 for normal Gemini.
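A quick back-of-the-envelope calculation puts those figures side by side. The snippet below uses only the numbers reported above, treating “three million” as a round 3,000,000; the variable names are illustrative.

```python
# Figures as reported in the article; "three million" is treated as a round number.
baseline = {"calls": 1, "tokens": 9_000}
mind_evolution = {"calls": 167, "tokens": 3_000_000}

call_ratio = mind_evolution["calls"] / baseline["calls"]
token_ratio = mind_evolution["tokens"] / baseline["tokens"]
tokens_per_call = mind_evolution["tokens"] / mind_evolution["calls"]

print(f"API calls:       {call_ratio:.0f}x the baseline")
print(f"Tokens:          {token_ratio:.0f}x the baseline")
print(f"Tokens per call: {tokens_per_call:,.0f} on average")
```

By this arithmetic the token bill grows roughly 333-fold, about twice as fast as the number of calls, so each mind evolution call carries on average about twice as many tokens (roughly 18,000) as the single baseline call.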
The good news is that while it demands more compute, mind evolution is still more efficient than other kinds of search strategies that inspect many possible answers from the AI model.
In fact, mind evolution gets steadily better the more possible outputs it evaluates, as you’d expect from something that’s supposed to be evolving to be more fit. It seems the repeated critical dialogue is contributing in some concrete way.
“Mind evolution is consistently more effective than the baseline strategies with respect to the number of candidate solutions needed to achieve a specified level of success rate (or average task performance),” the authors note.
In a fun twist, Lee and team add to the mix their own novel benchmark, called StegPoet, which tests the ability of Gemini to perform steganography, the practice of hiding a message in a block of text. (Not to be confused with “stenography,” the practice of transcribing speech via shorthand.)
In the authors’ version of steganography, a series of two-digit numbers each have to be assigned to ordinary words, and then the words have to be composed into a poem to conceal the numeric code. The problem becomes more difficult as the string of numbers becomes longer and as each number is repeated more often.
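The task is easy to check even though it is hard to solve, which is what makes it a good fit for an evolutionary approach. The sketch below shows one plausible way to verify a candidate, assuming the hidden numbers are recovered by reading the cipher words in the order they appear in the poem. The cipher, message, and poem are made up for illustration, and the benchmark’s real scoring rules may differ.

```python
import re

def decode(poem, cipher):
    """Recover the hidden numbers by keeping only words that appear in the cipher."""
    word_to_number = {word: number for number, word in cipher.items()}
    words = re.findall(r"[a-z']+", poem.lower())
    return [word_to_number[w] for w in words if w in word_to_number]

def is_valid(poem, cipher, message):
    """A candidate passes only if decoding reproduces the message exactly."""
    return decode(poem, cipher) == message

# Hypothetical example: 17 is repeated, so "river" must appear twice, in the right places.
message = [17, 42, 17, 83]
cipher = {17: "river", 42: "lantern", 83: "stone"}
poem = """The river hums beneath a lantern glow,
another river answers, soft and slow,
and sleeps at last upon a bed of stone."""

print(decode(poem, cipher))
print(is_valid(poem, cipher, message))
```

A check like this also makes the difficulty visible: every repeated number forces the same word back into the poem, and a cipher word used anywhere outside its slot corrupts the decoded sequence.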
Interestingly, StegPoet turns out to be pretty challenging even for mind evolution. Gemini Flash using the evolution trick gets it right only 43.3% of the time, less than half of its attempts. And Gemini Pro achieves only 79%. Both, however, are vastly better than either unaided Gemini or the typical search strategies.
The most important takeaway from Lee and team’s mind evolution work is that inference is a rich field of invention, one where researchers are finding new ways to get better results beyond simply crafting better prompts.
One important omission in the authors’ work is any discussion of how to slim down the very large computing budget of mind evolution. Every new approach that builds complex prompts with millions of tokens only increases the cost of getting better answers. At some point, putting all that on a budget becomes important.