The more scaled up LLMs get, the more likely they are to fudge an answer rather than admit their ignorance.
By Anna Desmarais
Published on 01/10/2024 – 7:30 GMT+2 • Updated 9:23
According to a new study, the more advanced an AI large language model (LLM) becomes, the less likely it is to admit it can’t answer a query.

Newer large language models (LLMs) are less likely to admit they don’t know the answer to a user’s question, making them less reliable, according to a new study.

Artificial intelligence (AI) researchers from the Universitat Politècnica de València in Spain tested the latest versions of BigScience’s BLOOM, Meta’s Llama, and OpenAI’s GPT for accuracy by asking each model thousands of questions on maths, science, and geography.


Researchers compared the quality of each model’s answers and classified them as correct, incorrect, or avoidant.
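The paper does not reproduce its grading code here; purely as an illustration, the sketch below shows one way such a three-way classification could be automated. The keyword cues for avoidant answers and the naive string-match correctness check are assumptions, not the authors’ actual procedure.

```python
# Illustrative sketch only: bucket a model's answer into "correct",
# "incorrect", or "avoidant", loosely mirroring the study's categories.
# The cue list and matching rule are assumptions for demonstration.

AVOIDANT_CUES = (
    "i don't know",
    "i cannot answer",
    "i'm not sure",
    "more information",
)


def classify_answer(answer: str, reference: str) -> str:
    """Return 'avoidant', 'correct', or 'incorrect' for a model answer."""
    text = answer.strip().lower()
    # An answer that hedges or declines is counted as avoidant.
    if any(cue in text for cue in AVOIDANT_CUES):
        return "avoidant"
    # Naive correctness check: does the reference answer appear verbatim?
    return "correct" if reference.strip().lower() in text else "incorrect"


if __name__ == "__main__":
    examples = [
        ("The capital of Australia is Canberra.", "Canberra"),
        ("I'm not sure, I would need more information.", "Canberra"),
        ("The capital of Australia is Sydney.", "Canberra"),
    ]
    for answer, reference in examples:
        print(classify_answer(answer, reference), "-", answer)
```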

The study, which was published in the journal Nature, found that accuracy on more challenging problems improved with each new model. Still, newer models tended to be less transparent about whether they could answer a question correctly.

Earlier LLMs would say they could not find the answer or needed more information to reach one, but newer models were more likely to guess, producing incorrect responses even to easy questions.

‘No apparent improvement’ in solving basic problems
LLMs are deep learning models trained on large data sets to understand, predict, and generate new content.

While the newer models could solve more complex problems with greater accuracy, the LLMs in the study still made mistakes when answering basic questions.

“Full reliability is not even achieved at very low difficulty levels,” according to the research paper.

“Although the models can solve highly challenging instances, they also still fail at very simple ones”.

This was the case with OpenAI’s GPT-4, where the number of “avoidant” answers dropped significantly compared with its predecessor, GPT-3.5.

“This does not match the expectation that more recent LLMs would more successfully avoid answering outside their operating range,” the study authors said.

The researchers therefore concluded that there is “no apparent improvement” in the models’ reliability even though the technology has been scaled up.
