
LLMs Might Be Just Reciting Information Fed To Them Instead Of Actually Reasoning, Study Finds

LLMs have transfixed the world with their capabilities over the last few months. Their abilities to chat in natural language, and even solve some reasoning problems, have led many to speculate that LLMs might be the first sparks of AGI. But a paper finds that the “reasoning” power of LLMs might just be due to memorization of large amounts of data, and not a real understanding of the underlying concepts.

“The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specialized to specific tasks seen during pretraining?” the paper asks.

The paper tests this by giving different LLMs common tasks, but with subtle yet important variations. For instance, most LLMs can perform simple addition, so the paper asks them to perform addition with numbers represented in other bases, including base 8 and base 16. Another task required LLMs to write Python code assuming arrays start at index 1 instead of 0. For spatial reasoning tasks, the researchers changed the axes used to represent positions, either swapping them or randomly rotating them.
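To make the base-8 variant concrete, here is a minimal Python sketch (not the paper's actual evaluation code, just an illustration of the kind of check involved): the true sum is computed by interpreting both operands in base 8, and a model's base-8 answer is compared against it.

```python
# Sketch of checking a counterfactual addition task: the prompt asks the model
# to add two numbers written in base 8, and we verify its answer in base 8.

def check_base8_addition(a_base8: str, b_base8: str, model_answer: str) -> bool:
    """Return True if model_answer is the correct base-8 sum of two base-8 numbers."""
    expected = int(a_base8, 8) + int(b_base8, 8)   # true sum, computed in decimal
    return int(model_answer, 8) == expected        # model's answer, read as base 8

# Example: 27 + 35 in base 8 is 64 (i.e. 23 + 29 = 52 in decimal).
print(check_base8_addition("27", "35", "64"))   # True
print(check_base8_addition("27", "35", "62"))   # False -- the memorized base-10 answer
```

A model that has only memorized base-10 arithmetic patterns would tend to produce the second, familiar-looking answer.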

If LLMs truly understood the answers they were giving, they should have been able to account for these minor differences: the tasks require the same basic reasoning, just with small changes. But if the LLMs had merely memorized answers instead of truly understanding the underlying logic, they were likely to struggle.

“We see a consistent pattern where LMs perform substantially worse on the counterfactual task variants, both with and without 0-shot CoT (Chain of Thought),” the paper found. “For most cases, LMs exhibit an above-random counterfactual performance, suggesting some degree of the targeted ability. However, when the counter-factual accuracy is high, usually the case for GPT4 and in select settings for other models too, the default-counterfactual gaps demonstrate limitations in the abstract capacity to solve the target task,” it says.

The paper goes on to compare the performance of models including GPT-4, GPT-3.5, Claude and PaLM 2 on both "normal" and "modified" tasks. GPT-4 seems to have the best performance, with the smallest drop-off between normal and modified tasks. For instance, all models performed well at addition when the numbers were in base 10, but accuracy dropped by around half for GPT-4, and by as much as 90% for Claude, when the numbers were provided in base 8. Similarly, accuracy dropped by more than 50% when the LLMs were asked to write a Python program with the array starting index as 1 instead of 0.
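One simple way to read this kind of drop is as a relative difference between default-task and counterfactual-task accuracy. The sketch below uses made-up numbers purely for illustration; they are not the paper's reported figures, and the paper's exact metric may differ.

```python
# Illustrative only: how a relative accuracy drop between the familiar ("default")
# task and its counterfactual variant could be computed.

def relative_drop(default_acc: float, counterfactual_acc: float) -> float:
    """Fraction of the default-task accuracy lost on the counterfactual variant."""
    return (default_acc - counterfactual_acc) / default_acc

# Hypothetical example: 95% accuracy on base-10 addition falling to 10% on base 8.
print(f"{relative_drop(0.95, 0.10):.0%}")   # prints "89%"
```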

This indicates that many LLMs might not truly understand the problems they purport to solve, and often give recited solutions based on the massive amounts of data they've seen. But the paper also notes that human performance would likely degrade on these tasks too; a human would also take much longer, and find it much harder, to add numbers in base 8.

As such, it's unclear whether this result indicates that LLMs aren't as smart as they're made out to be, or whether they're simply more like humans in their weaknesses. But if models perform worse on slightly altered tasks, it does appear that they don't yet fully understand the underlying mechanisms of the problems they seem to solve, and it could be a while before LLMs turn into something resembling AGI. Regardless, this inability of LLMs to do well on counterfactual tasks is an interesting data point toward understanding how these fascinating bits of technology really work.