Artificial intelligence (AI), particularly large language models such as GPT-4, has shown impressive performance on reasoning tasks. But does AI really understand abstract concepts, or is it just mimicking patterns? New research from the University of Amsterdam and the Santa Fe Institute shows that while GPT models perform well on some analogy tasks, their performance drops when the problems are modified, highlighting key weaknesses in AI’s reasoning capabilities.
Analogical reasoning is the ability to draw a comparison between two different things based on a similarity in a particular respect. It is one of the most common ways humans make sense of the world and make decisions. An example of analogical reasoning: cup is to coffee as soup is to what? (The answer: bowl.)
Large language models like GPT-4 perform well on a variety of tests, including those that require analogical reasoning. But can AI models really engage in robust general reasoning, or do they over-rely on patterns from their training data? This study by language and AI experts Martha Lewis (Institute for Logic, Language and Computation at the University of Amsterdam) and Melanie Mitchell (Santa Fe Institute) examined whether GPT models are as flexible and robust as humans at making analogies. “This is extremely important, as AI is increasingly used in the real world for decision-making and problem solving,” explains Lewis.
Comparing AI models with human performance
Lewis and Mitchell compared the performance of humans and GPT models on three different types of analogy problems (a small illustrative sketch follows the list):
- Letter-string analogies – identifying the pattern in a sequence of letters and completing it correctly.
- Digit matrices – analyzing a pattern of numbers and determining the missing number.
- Story analogies – identifying which of two candidate stories best corresponds to a given example story.
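To make these task types concrete, here is a minimal sketch of what such problems can look like. The specific items below are illustrative assumptions made for this article, not items taken from Lewis and Mitchell’s test sets.

```python
# Illustrative examples of the three problem types (assumed for this article,
# not the study's actual test items).

# Letter-string analogy: "abcd" changes to "abce" (the last letter is advanced
# by one); how should "ijkl" change? Expected answer: "ijkm".
letter_string_problem = {
    "example": ("abcd", "abce"),
    "question": "ijkl",
    "expected": "ijkm",
}

# Digit matrix: infer the missing number from the pattern in the rows.
digit_matrix_problem = {
    "matrix": [[1, 2, 3],
               [4, 5, 6],
               [7, 8, None]],   # None marks the cell to fill in
    "expected": 9,
}

# Story analogy: pick the candidate story that shares the source story's
# causal structure rather than its surface details.
story_problem = {
    "source": "A short story describing a causal chain of events.",
    "candidates": [
        "A story with the same causal structure but different surface details.",
        "A story that shares surface details but not the causal structure.",
    ],
    "expected": 0,   # index of the structurally analogous story
}

if __name__ == "__main__":
    for name, problem in [("letter-string", letter_string_problem),
                          ("digit matrix", digit_matrix_problem),
                          ("story analogy", story_problem)]:
        print(f"{name}: expected answer -> {problem['expected']}")
```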
In addition to testing whether the GPT models could solve the original problems, the study examined how well they performed when the problems were subtly modified. “Systems that truly understand analogy should maintain high performance even in these variations,” the authors state in the article.
GPT models lack robustness
Humans maintained high performance on most modified versions of the problems, but the GPT models, while performing well on the standard analogy problems, struggled with the variations. “This suggests that AI models often reason less flexibly than humans, and that their reasoning is less about true abstract understanding and more about pattern matching,” explains Lewis.
On the digit matrices, the GPT models showed a significant performance drop when the position of the missing number was changed; humans had no difficulty with this. In the story analogies, GPT-4 tended to choose the first answer presented as correct more often, whereas humans were not affected by the order of the answers. Furthermore, GPT-4 had more trouble than humans when key elements of the stories were paraphrased, suggesting it relies on surface-level similarity rather than deeper causal reasoning.
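As a rough illustration of the kind of modification described above (assumed here, not taken from the paper), a digit-matrix item can be varied simply by moving the blank away from its usual final position while keeping the underlying pattern the same:

```python
# Original-style item: the blank sits in the conventional bottom-right cell.
standard_item = [[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, None]]    # answer: 9

# Modified item: the same row pattern, but the blank is moved elsewhere.
# Humans handled such variants easily; GPT models' accuracy dropped.
moved_blank_item = [[1, 2, 3],
                    [4, None, 6],  # answer: 5
                    [7, 8, 9]]
```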
On the simpler analogy tasks, the GPT models showed degraded performance when tested on modified versions, whereas humans remained consistent. On the more complex analogical reasoning tasks, however, both humans and AI struggled.
Still weaker than human cognition
This study challenges the widespread assumption that AI models like GPT-4 can reason the way humans do. “AI models show impressive capabilities, but this doesn’t mean they truly understand what they are doing,” conclude Lewis and Mitchell. “Their ability to generalize across variations is still significantly weaker than human cognition. GPT models often rely on superficial patterns rather than deep understanding.”
This is an important caveat for using AI in critical decision-making areas such as education, law, and healthcare. AI can be a powerful tool, but it is not yet a replacement for human thinking and reasoning.
Article details
Martha Lewis and Melanie Mitchell (2025). “Evaluating the Robustness of Analogical Reasoning in Large Language Models.” Transactions on Machine Learning Research.