Errors tend to occur with AI-generated content
Paul Taylor/Getty Images
AI chatbots from tech companies such as OpenAI and Google have received so-called reasoning upgrades over the past few months, which are supposed to make them better at giving reliable answers. Recent tests, however, suggest they can do worse than previous models. The errors made by chatbots, known as “hallucinations”, have been a problem from the start, and it is becoming apparent that we may never be able to get rid of them.
Hallucination is a blanket term for certain kinds of mistakes made by the large language models (LLMs) that power systems such as OpenAI’s ChatGPT and Google’s Gemini. It is best known as a description of the way they sometimes present false information as true, but it can also refer to an AI-generated answer that is factually accurate yet not actually relevant to the question that was asked.
An OpenAI technical report evaluating its latest LLMs showed that the o3 and o4-mini models, released in April, had significantly higher hallucination rates than the company’s previous o1 model, announced in late 2024. For example, on a summarisation test, o4-mini hallucinated 48% of the time and o3 did so 33% of the time. In comparison, o1 had a hallucination rate of 16%.
The problem isn’t limited to OpenAI. One popular leaderboard that evaluates hallucination rates shows several “reasoning” models, including the DeepSeek-R1 model from developer DeepSeek, with higher hallucination rates than previous models from the same developers. This type of model works through multiple steps of reasoning before responding.
OpenAI says the reasoning process itself isn’t to blame. “We are actively working to reduce the higher rates of hallucination we saw in o3 and o4-mini, but hallucinations are not inherently more common in reasoning models,” says an OpenAI spokesperson. “We will continue our hallucination research across all models to improve accuracy and reliability.”
Hallucinations could derail some of the potential applications of LLMs. A model that consistently states falsehoods and requires fact-checking won’t be a useful research assistant; a paralegal-bot that cites imaginary cases will get lawyers into trouble; a customer service agent that claims outdated policies are still active will create headaches for a company.
AI companies initially suggested, however, that this problem would clear up over time. Indeed, after they first launched, models tended to hallucinate less with each update. But the high hallucination rates of recent versions complicate that narrative.
Vectara’s leaderboard ranks models based on their factual consistency when summarising documents they are given. It showed that “hallucination rates are roughly the same for reasoning versus non-reasoning models”, at least for OpenAI’s and Google’s systems, says Forrest Sheng Bao at Vectara. Google did not provide additional comment. For the leaderboard’s purposes, the specific hallucination-rate numbers are less important than each model’s overall ranking, says Bao.
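As a rough illustration of how a summarisation-based hallucination rate of this kind could be tallied, here is a minimal sketch. It is not Vectara’s actual pipeline: the summarize and judge_consistent functions are hypothetical stand-ins for the model under test and for whatever consistency classifier a leaderboard might use.

```python
from typing import Callable

def hallucination_rate(
    documents: list[str],
    summarize: Callable[[str], str],                # hypothetical: the model under test
    judge_consistent: Callable[[str, str], bool],   # hypothetical: a consistency checker
) -> float:
    """Fraction of summaries flagged as inconsistent with their source document."""
    flagged = 0
    for doc in documents:
        summary = summarize(doc)
        if not judge_consistent(doc, summary):
            flagged += 1
    return flagged / len(documents)

# Stand-in example: a "model" that invents an unsupported detail, and a naive
# "judge" that only accepts summaries whose every word appears in the source.
docs = ["Alice visited Paris in 2021.", "The company reported record revenue."]
fake_summarize = lambda doc: doc + " Everyone was delighted."
naive_judge = lambda doc, summary: all(
    word.lower().strip(".") in doc.lower() for word in summary.split()
)

print(f"Hallucination rate: {hallucination_rate(docs, fake_summarize, naive_judge):.0%}")
# Prints 100%: both summaries add content not present in the source documents.
```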
However, this ranking may not be the best way to compare AI models.
For one thing, it conflates different types of hallucination. The Vectara team pointed out that, although the DeepSeek-R1 model hallucinated 14.3% of the time, most of these hallucinations were “benign”: answers that are supported by logical reasoning or world knowledge, but not actually present in the original text the bot was asked to summarise. DeepSeek did not provide additional comment.
Another problem with this kind of ranking is that tests based on text summarisation say “nothing about the rate of incorrect outputs when [LLMs] are used for other tasks”, says Emily Bender at the University of Washington. She says the leaderboard results may not be the best way to judge this technology, because LLMs aren’t designed specifically to summarise texts.
These models work by repeatedly answering the question “what is a likely next word” to formulate answers to prompts, so they aren’t processing information in the usual sense of trying to understand what information is available in a body of text, says Bender. Yet many tech companies still frequently use the term “hallucination” when describing output errors.
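To make the loop Bender describes concrete, here is a toy, self-contained sketch of next-word generation. It is not how any real chatbot is implemented; the hand-made frequency table stands in for a trained model, but the generation loop (score likely next words, pick one, append it, repeat) is the same basic idea, and nothing in it checks whether the resulting sentence is true.

```python
import random

# Toy "language model": for each word, a hand-made table of candidate next
# words and their weights. A real LLM learns billions of parameters instead,
# but generation still proceeds one likely next token at a time.
NEXT_WORD_WEIGHTS = {
    "the":       {"capital": 6, "answer": 2},
    "capital":   {"of": 10},
    "of":        {"australia": 5, "france": 5},
    "australia": {"is": 10},
    "france":    {"is": 10},
    "is":        {"sydney": 5, "paris": 4, "canberra": 2},  # likely is not the same as true
}

def generate(prompt_words, max_new_words=5):
    words = list(prompt_words)
    for _ in range(max_new_words):
        candidates = NEXT_WORD_WEIGHTS.get(words[-1])
        if not candidates:
            break
        # Sample the next word in proportion to its weight: the model only
        # knows which continuation is plausible, not which one is correct.
        next_word = random.choices(list(candidates), weights=list(candidates.values()))[0]
        words.append(next_word)
    return " ".join(words)

print(generate(["the", "capital", "of", "australia"]))
# May print "the capital of australia is sydney": fluent but false, the kind
# of confident error that gets labelled a hallucination.
```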
“Hallucination as a term is doubly problematic,” says Bender. “On the one hand, it suggests that incorrect outputs are an aberration, perhaps one that can be mitigated, whereas the rest of the time the systems are grounded, reliable and trustworthy; on the other hand, it functions to anthropomorphise the machines, [and] large language models do not perceive anything.”
Arvind Narayanan at Princeton University says the issue goes beyond hallucination. Models also make other kinds of mistakes, such as drawing on unreliable sources or using outdated information. And simply throwing more training data and computing power at AI won’t necessarily help.
The upshot is that we may have to live with error-prone AI. Narayanan said in a social media post that it may be best, in some cases, to use such models only for tasks where fact-checking the AI’s answer is still quicker than doing the work yourself. But the best move may be to avoid relying on AI chatbots for factual information at all, says Bender.