DeepSeek challenges OpenAI's o1 in chain of thought – but it's missing a few links



Consider a train leaving Chicago traveling west at 70 miles per hour, and another train leaving San Francisco traveling east at 80 miles per hour. Can you figure out when and where they’ll meet?

It’s a classic grade school math problem, and artificial intelligence (AI) programs such as OpenAI’s recently released “o1” large language model, currently in preview, will not only find the answer but also explain a little bit about how they arrived at it. 

The explanations are part of an increasingly popular approach in generative AI known as chain of thought.

Although chain of thought can be very useful, it also has the potential to be totally baffling depending on how it’s done, as I found out from a little bit of experimentation.


The idea behind chain-of-thought processing is that the AI model can detail the sequence of calculations it performs in pursuit of the final answer, ultimately achieving “explainable” AI. Such explainable AI could conceivably give humans greater confidence in AI’s predictions by disclosing the basis for an answer.

For context, an AI model is the part of an AI program that contains the neural network parameters and activation functions that determine how the program behaves.

To explore the matter, I pitted OpenAI’s o1 against R1-Lite, the newest model from China-based startup DeepSeek. R1-Lite goes further than o1 in giving verbose statements of its chain of thought, which contrasts with o1’s rather terse style.


DeepSeek claims R1-Lite can beat o1 in several benchmark tests, including MATH, a test developed at U.C. Berkeley comprising 12,500 math question-answer sets.

AI luminary Andrew Ng, founder of Landing.ai, explained that the introduction of R1-Lite is “part of an important movement” that goes beyond simply making AI models bigger to instead making them do extra work to justify their results.

But R1-Lite, I found, can also be baffling and tedious in ways o1 is not. 


I submitted the famous trains question above to both R1-Lite and o1 preview. You can try R1-Lite for free by creating an account at DeepSeek’s website, and you can access o1 preview as part of a paid ChatGPT account with OpenAI. (R1-Lite is not yet released as open-source, though a number of other DeepSeek projects are available on GitHub.)

[Screenshot: Both chatbots start out with fairly simple routes to a solution to the famous trains problem of grade-school math.]

Both models came up with similar answers, though the o1 model was noticeably faster, taking five seconds to spit out an answer, while DeepSeek’s R1-Lite took 21 seconds (the two models each tell you how long they “thought”). o1 also used a more accurate number of miles between Chicago and San Francisco in its calculation.
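For reference, the underlying arithmetic is short. The Python sketch below assumes a distance of 1,800 miles between the two cities, a round figure chosen purely for illustration and not a number taken from either model's output. The function name `meeting_point` is likewise hypothetical.

```python
def meeting_point(distance_miles, speed_from_east_mph, speed_from_west_mph):
    """Return (hours until the trains meet, miles covered by the train from the east)."""
    # The trains approach each other, so their speeds add up.
    closing_speed = speed_from_east_mph + speed_from_west_mph
    hours = distance_miles / closing_speed
    return hours, speed_from_east_mph * hours


# Assumption: 1,800 miles between Chicago and San Francisco (illustrative only).
hours, miles_from_chicago = meeting_point(1800, 70, 80)
print(f"Trains meet after {hours:.1f} hours, {miles_from_chicago:.0f} miles west of Chicago")
```

Under that assumption, the trains meet after 12 hours, with the Chicago train having covered 840 miles. A different distance figure shifts both numbers proportionally, which is why the mileage a model picks matters.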

The more interesting difference came with the next round.


When I asked both models to compute roughly where the two trains would meet, meaning what U.S. town or city, the o1 model quickly produced Cheyenne, Wyoming. In the process, o1 telegraphed its chain of thought by briefly flashing short messages such as “Analyzing the trains’ journey,” or “Mapping the journey,” or “Determining meeting point.” 

These messages weren’t really informative; they were more an indicator that something was going on.

In contrast, DeepSeek’s R1-Lite spent nearly a minute in its chain of thought, and, as in other cases, it was highly verbose, leaving a trail of “thought” descriptions totaling 2,200 words. These became increasingly convoluted as the model proceeded through the chain. The model started simply enough, positing that wherever each train was at the end of 12 hours would be roughly where the two trains came close to one another, someplace between the two origins.

But then DeepSeek’s R1-Lite went completely off the rails, so to speak. It tried many weird and wacky ways to compute the location and narrated each method in excruciating detail. 

[Screenshot: While OpenAI’s o1 model wraps up its work fairly quickly, DeepSeek’s R1-Lite, left, goes through a long and winding “thought” process that becomes increasingly complicated and distracting.]

First, it computed distances from Chicago to multiple cities on the way to San Francisco, as well as the distances between those cities, to approximate a location.


It then resorted to using longitude on the map, computing how many degrees of longitude the Chicago train had traveled. It then backed away from that approach and tried to work out the location from driving distances.
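The longitude idea can be illustrated with a rough linear interpolation. This is a sketch of the general approach, not a reconstruction of R1-Lite's actual steps. The city longitudes are approximate, the fraction of the trip covered (840 of 1,800 miles) rests on an assumed 1,800-mile route, and the function name is hypothetical. The sketch also ignores latitude and real track routes.

```python
# Approximate longitudes in degrees east (negative values are west of Greenwich).
CHICAGO_LON = -87.63
SAN_FRANCISCO_LON = -122.42


def interpolate_longitude(start_lon, end_lon, fraction):
    """Linearly interpolate between two longitudes; fraction runs from 0 to 1."""
    return start_lon + fraction * (end_lon - start_lon)


# Assumption: the Chicago train covers 840 miles of an assumed 1,800-mile route.
fraction_travelled = 840 / 1800
meeting_lon = interpolate_longitude(CHICAGO_LON, SAN_FRANCISCO_LON, fraction_travelled)
print(f"Estimated meeting longitude: {abs(meeting_lon):.1f} degrees west")
```

With these inputs the estimate lands near 104 degrees west. Longitude alone does not pin down a town, which may be part of why a model working this way would need to bring in further information.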

In the midst of all this, the model spat out the statement, “Wait, I’m getting confused,” which is probably true of the human watching all this.

[Screenshot: DeepSeek’s R1-Lite expresses confusion. The AI model is not literally confused, but the human being using it might be.]

By the time R1-Lite produced the answer, “in western Nebraska or Eastern Colorado,” which is an acceptable approximation, the reasoning was so abstruse that it was no longer “explainable” but discouraging.


Unlike the o1 model, which keeps its answer rather brief, DeepSeek’s R1-Lite explains a supposed reasoning process in laborious detail and ends up being complex and confusing.

It’s possible that with more precise prompts that include details like actual train routes, the chain of thought could be a lot cleaner. Access to external databases for map coordinates could also leave R1-Lite with fewer links in its chain of thought.

The test goes to show that in these early days of chain-of-thought reasoning, humans who work with chatbots are likely to end up confused even if they ultimately get an acceptable answer from the AI model.




