Apple engineers show how fragile AI’s ‘reasoning’ is
Technology

Apple engineers show how fragile AI’s ‘reasoning’ is

Vantage Feed
Published October 16, 2024 (last updated 2:27 am)

For some time now, companies like OpenAI and Google have been promoting advanced "reasoning" capabilities as the next big step in their latest artificial intelligence models. Now, though, new research from six Apple engineers shows that the mathematical "reasoning" displayed by sophisticated large language models can be extremely fragile and unreliable in the face of seemingly trivial changes to common benchmark problems.

The fragility highlighted in these new results helps corroborate previous research suggesting that LLMs' use of probabilistic pattern matching lacks the formal understanding of underlying concepts needed for truly reliable mathematical reasoning. Based on these results, the researchers hypothesize that "current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data."

Mix it up

In "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models" (currently available as a preprint paper), the six Apple researchers start with GSM8K's standardized set of more than 8,000 grade-school-level math word problems, which is often used as a benchmark for the complex reasoning capabilities of modern LLMs. They then take a novel approach of modifying a portion of that test set to dynamically replace certain names and numbers with new values. So a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation.
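The templating idea described above can be sketched in a few lines. This is only an illustration of the approach, not the paper's actual code; the problem text, names, and value ranges here are invented for the example:

```python
import random

# A GSM8K-style problem turned into a template: the names and numbers
# become placeholders that are re-sampled for each GSM-Symbolic variant.
TEMPLATE = ("{name} buys {x} building blocks for a {relation}, "
            "then buys {y} more. How many blocks did {name} buy?")

def make_variant(seed):
    """Generate one symbolic variant of the problem plus its ground-truth answer."""
    rng = random.Random(seed)
    x = rng.randint(5, 50)
    y = rng.randint(5, 50)
    question = TEMPLATE.format(
        name=rng.choice(["Sophie", "Bill", "Maria"]),
        relation=rng.choice(["nephew", "brother", "cousin"]),
        x=x, y=y,
    )
    # The surface details change, but the required reasoning
    # (a single addition) is identical in every variant.
    return question, x + y

question, answer = make_variant(0)
```

Because only surface details vary, every variant tests the same underlying reasoning step, which is exactly why equal performance across variants would be expected from a model that truly "understands" the problem.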

This approach helps avoid any "data contamination" that can result from static GSM8K questions being fed directly into an AI model's training data. At the same time, these incidental changes don't alter the actual difficulty of the inherent mathematical reasoning at all, meaning models should theoretically perform just as well on GSM-Symbolic as on GSM8K.

Instead, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with performance dropping by between 0.3 percent and 9.2 percent depending on the model. The results also showed high variance across 50 separate runs of GSM-Symbolic with different names and values. Gaps of up to 15 percent in accuracy between the best and worst runs were common within a single model, and, for some reason, changing the numbers tended to hurt accuracy more than changing the names.
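The per-model spread the researchers describe can be computed directly from run-level scores. The accuracies below are made-up numbers purely for illustration, not figures from the paper:

```python
# Hypothetical accuracies for one model across five GSM-Symbolic runs,
# each run using different sampled names and numbers.
run_accuracies = [0.81, 0.88, 0.79, 0.92, 0.85]

# Average accuracy across runs, and the best-vs-worst gap
# (the paper reports gaps of up to ~15 points within a single model).
mean_acc = sum(run_accuracies) / len(run_accuracies)
spread = max(run_accuracies) - min(run_accuracies)
```

A large spread on problems that require identical reasoning steps is itself evidence that something other than formal reasoning is driving the answers.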

This kind of variance, both within different GSM-Symbolic runs and compared to GSM8K results, is more than a little surprising since, as the researchers point out, "the overall reasoning steps needed to solve a question remain the same." The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any "formal" reasoning but are instead "attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data."

Don’t get distracted

Still, the overall variance shown in the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI's ChatGPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic. That's a pretty high success rate on either benchmark, regardless of whether or not the model itself is using "formal" reasoning behind the scenes (though total accuracy for many models did drop precipitously when the researchers added just one or two additional logical steps to the problems).

The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding "seemingly relevant but ultimately inconsequential statements" to the questions. For this "GSM-NoOp" benchmark set (short for "no operation"), a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that "five of them [the kiwis] were a bit smaller than average."
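The "no operation" idea is easy to illustrate: the distractor clause introduces no new arithmetic, so the ground-truth answer is unchanged. The sketch below uses an invented kiwi problem, not the paper's exact item, and the "subtract the mentioned number" behavior is shown as a hypothetical failure mode:

```python
BASE = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "How many kiwis does he have?")
DISTRACTOR = "Five of them were a bit smaller than average. "

# Insert the irrelevant clause just before the question.
noop_question = BASE.replace("How many", DISTRACTOR + "How many")

# A correct solver ignores the size remark entirely...
correct = 44 + 58
# ...while a pattern-matcher that blindly operates on every mentioned
# number might subtract the "smaller" kiwis and get the wrong total.
pattern_matched = 44 + 58 - 5
```

The distractor changes nothing about the required operations, which is precisely what makes the reported accuracy collapse so telling.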

Adding in these red herrings led to what the researchers termed "catastrophic performance drops" in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limits of using simple "pattern matching" to "convert statements to operations without truly understanding their meaning," the researchers wrote.
