
Errors tend to arise in content generated by AI
Paul Taylor/Getty Images
The AI chatbots of technology companies such as OpenAI and Google have received so-called reasoning upgrades in recent months, ostensibly to make them better at giving us answers we can trust, but recent tests suggest they sometimes do worse than before. The errors made by chatbots, known as “hallucinations”, have been a problem from the start, and it is becoming clear we may never get rid of them.
Hallucination is a blanket term for certain kinds of errors made by the large language models (LLMs) that power systems such as OpenAI’s ChatGPT or Google’s Gemini. It is best known as a description of the way these models sometimes present false information as true. But it can also refer to an AI-generated answer that is factually accurate yet not actually relevant to the question that was asked, or that fails to follow instructions in some other way.
An OpenAI technical report evaluating its latest LLMs showed that its o3 and o4-mini models, released in April, had significantly higher hallucination rates than the company’s previous o1 model, which came out in late 2024. On a benchmark of questions about people, o3 hallucinated 33 percent of the time, while o4-mini did so 48 percent of the time. By comparison, o1 had a hallucination rate of 16 percent.
The problem is not limited to OpenAI. A popular leaderboard from the company Vectara that evaluates hallucination rates indicates that some “reasoning” models, including DeepSeek’s DeepSeek-R1, have shown double-digit increases in hallucination rates compared with earlier models from their developers. This type of model goes through multiple steps to work through a line of reasoning before answering.
OpenAI says the reasoning process itself is not to blame. “Hallucinations are not inherently more prevalent in reasoning models, though we are actively working to reduce the higher rates of hallucination we saw in o3 and o4-mini,” says an OpenAI spokesperson. “We’ll continue our research on hallucinations across all models to improve accuracy and reliability.”
Some potential applications for LLMs could be derailed by hallucination. A model that consistently states falsehoods and requires fact-checking won’t be a useful research assistant; a legal assistant bot that cites imaginary cases will get lawyers into trouble; a customer service agent that claims outdated policies are still active will create headaches for the company.
However, AI companies initially claimed that this problem would clear up over time. Indeed, after they first launched, the models tended to hallucinate less with each update. But the high hallucination rates of recent versions are complicating that narrative, whether or not reasoning is at fault.
Vectara’s leaderboard ranks models based on their factual consistency when summarising documents given to them. This showed that “hallucination rates are almost the same for reasoning versus non-reasoning models”, at least for OpenAI and Google systems, says Forrest Sheng Bao at Vectara. Google did not provide additional comment. For the leaderboard’s purposes, the specific hallucination rate figures are less important than each model’s overall ranking, says Bao.
But this kind of ranking may not be the best way to compare AI models.
For one thing, it conflates different types of hallucinations. The Vectara team noted that, although the DeepSeek-R1 model hallucinated 14.3 percent of the time, most of these hallucinations were “benign”: answers that are supported by logical reasoning or world knowledge, but absent from the original text. DeepSeek did not provide additional comment.
Another problem with this kind of ranking is that tests based on text summarisation “say nothing about the rate of incorrect outputs when [LLMs] are used for other tasks”, says Emily Bender at the University of Washington. She says the leaderboard results may not be the best way to judge this technology, because LLMs aren’t designed specifically to summarise texts.
These models work by repeatedly answering the question “what is a likely next word” to formulate responses to prompts, so they aren’t processing information in the usual sense of trying to understand what information is contained in a body of text. But many tech companies still frequently use the term “hallucinations” when describing output errors.
“‘Hallucination’ as a term is doubly problematic,” says Bender. “On one hand, it suggests that incorrect outputs are an aberration, perhaps one that can be mitigated, whereas the rest of the time the systems are grounded, reliable and trustworthy. On the other hand, it anthropomorphises the machines [and] large language models do not perceive anything.”
Arvind Narayanan at Princeton University says the problem goes beyond hallucination. Models also sometimes make other mistakes, such as drawing on unreliable sources or using outdated information. And simply throwing more training data and computing power at AI hasn’t necessarily helped.
The upshot is that we may have to live with errors. Narayanan said in a social media post that it may be best in some cases to use such models only for tasks where checking the AI’s answer would still be faster than doing the research yourself. But the best move may be to avoid relying on AI chatbots to provide factual information altogether, says Bender.

