Anyone who’s played around with ChatGPT knows that the software tends to hallucinate: it invents people, scenarios and facts out of thin air. ChatGPT has been known to cite research papers that don’t exist, or to produce information that is completely untrue. But now a research scientist at OpenAI has revealed what causes this behaviour.
“One reason LLMs hallucinate is that humans tend not to express uncertainty in writing,” says Mark Chen, a research scientist at OpenAI. “If you don’t know the answer, why bother wasting the ink? Not surprising that when our models don’t know the answer, they emulate the data and answer confidently anyway,” he says.
What Chen says makes a lot of sense. ChatGPT is trained on vast amounts of data, including books and articles from the internet, and these aren’t usually places where people express uncertainty. People generally write a book or an article only when they know the answer — there aren’t many books or articles in which the author simply doesn’t know how something works and writes page after page about their quandary. As such, ChatGPT is trained on lots of confident answers and assertions, and it carries this confidence into its own answers, even when it doesn’t actually know the answer. That appears to involve making up completely false information to maintain the authoritative, confident tone it’s been trained on.
Steps are now being taken to make LLMs hallucinate less. Bing, for instance, provides clickable links as citations for many of its claims, which can help users figure out where a bit of information came from. There also appears to have been some Trust and Safety tinkering by OpenAI, and ChatGPT now seems to say it’s unable to answer some questions. But LLMs do still hallucinate once in a while, and as an OpenAI researcher points out, it’s mainly thanks to the sort of data they were trained on.