Hallucinations are getting worse – and they are here to stay

Errors tend to crop up in AI-generated content
Paul Taylor/Getty Images
AI chatbots from tech companies such as OpenAI and Google have been getting so-called reasoning upgrades in recent months – ideally to make them better at giving us answers we can trust – but recent testing suggests they sometimes do worse than previous models. The errors made by chatbots, known as "hallucinations", have been a problem from the start, and it is becoming clear we may never get rid of them.
Hallucination is a blanket term for certain kinds of errors made by the large language models (LLMs) that power systems such as OpenAI's ChatGPT or Google's Gemini. It is best known as a description of the way these models sometimes present false information as true. But it can also refer to an AI-generated answer that is factually accurate yet not actually relevant to the question it was asked, or that fails to follow instructions in some other way.
An OpenAI technical report evaluating its latest LLMs showed that its o3 and o4-mini models, which were released in April, had significantly higher hallucination rates than the company's previous o1 model, which came out in late 2024. For example, when summarising publicly available facts about people, o3 hallucinated 33% of the time, while o4-mini did so 48% of the time. In comparison, o1 had a hallucination rate of 16%.
The problem isn't limited to OpenAI. One popular leaderboard from the company Vectara, which tracks hallucination rates, indicates that some "reasoning" models – including the DeepSeek-R1 model from developer DeepSeek – saw double-digit rises in hallucination rates compared with their developers' previous models. This type of model goes through multiple steps to demonstrate a line of reasoning before answering.
OpenAI says the reasoning process itself isn't to blame. "Hallucinations are not inherently more prevalent in reasoning models, though we are actively working to reduce the higher rates of hallucination we saw in o3 and o4-mini," says an OpenAI spokesperson. "We'll continue our research on hallucinations across all models to improve accuracy and reliability."
Some potential applications for LLMs could be derailed by hallucination. A model that consistently states falsehoods and requires fact-checking won't be a useful research assistant; a paralegal bot that cites imaginary cases will get lawyers into trouble; a customer service agent that claims outdated policies are still in effect will create headaches for the company.
However, AI companies initially claimed that this problem would clear up over time. Indeed, after they were first launched, models tended to hallucinate less with each update. But the high hallucination rates of recent versions complicate that narrative – whether or not reasoning is at fault.
Vectara's leaderboard ranks models based on their factual consistency when summarising documents given to them. This showed that "hallucination rates are almost the same for reasoning versus non-reasoning models", at least for OpenAI and Google systems, says Forrest Sheng Bao at Vectara. Google did not provide additional comment. For the leaderboard's purposes, the specific hallucination rate numbers matter less than each model's overall ranking, says Bao.
But this leaderboard may not be the best way to compare AI models.
For one thing, it lumps together different types of hallucinations. The Vectara team pointed out that although the DeepSeek-R1 model hallucinated 14.3% of the time, most of these were "benign": answers that are factually supported by logical reasoning or world knowledge, but not actually present in the original text the bot was asked to summarise. DeepSeek did not provide additional comment.
Another problem with this kind of ranking is that testing based on text summarisation "says nothing about the rate of incorrect outputs when [LLMs] are used for other tasks", says Emily Bender at the University of Washington. She says leaderboard results may not be the best way to judge this technology because LLMs aren't designed specifically to summarise texts.
These models work by repeatedly answering the question "what is a likely next word?" to formulate responses to prompts, and so they don't process information in the usual sense of trying to understand what information is available in a body of text, says Bender. But many tech companies still frequently use the term "hallucinations" when describing output errors.
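To make that point concrete, here is a toy sketch in Python of the "pick a likely next word" loop Bender describes. The word table and its probabilities are made up for illustration; a real chatbot uses a neural network over a huge vocabulary, not a hand-written lookup, but the generation loop works on the same principle.

```python
import random

# Hypothetical toy table: probability of each next word given the current word.
# A real LLM replaces this table with a learned neural network.
NEXT_WORD_PROBS = {
    "the":  {"cat": 0.5, "dog": 0.3, "moon": 0.2},
    "cat":  {"sat": 0.7, "ran": 0.3},
    "dog":  {"ran": 0.6, "barked": 0.4},
    "moon": {"rose": 1.0},
    "sat":  {"down": 1.0},
    "ran":  {"away": 1.0},
}

def generate(start_word: str, max_words: int = 6) -> str:
    """Repeatedly ask 'what is a likely next word?' and append the pick."""
    words = [start_word]
    for _ in range(max_words):
        options = NEXT_WORD_PROBS.get(words[-1])
        if not options:
            break  # no known continuation, so stop generating
        # Sample the next word in proportion to its (made-up) probability.
        next_word = random.choices(list(options), weights=list(options.values()))[0]
        words.append(next_word)
    return " ".join(words)

print(generate("the"))  # e.g. "the dog ran away"
```

Nothing in this loop checks whether the output is true, which is the point: it optimises for plausible continuations, not factual accuracy.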
"'Hallucination' as a term is doubly problematic," says Bender. "On one hand, it suggests that incorrect outputs are an aberration, perhaps one that can be mitigated, whereas the rest of the time the systems are grounded, reliable and trustworthy. On the other hand, it functions to anthropomorphise the machines – hallucination refers to perceiving something that is not there [and] large language models do not perceive anything."
Arvind Narayanan at Princeton University says the problem goes beyond hallucination. Models also sometimes make other mistakes, such as drawing on unreliable sources or using outdated information. And simply throwing more training data and computing power at AI hasn't necessarily helped.
The upshot is that we may have to live with error-prone AI. Narayanan said in a social media post that it may be best in some cases to only use such models for tasks where fact-checking the AI's answer would still be faster than doing the research yourself. But the better move may be to avoid relying on AI chatbots to provide factual information at all, says Bender.