A classic brain test exposed AI's biggest weakness

Date: June 10, 2026
Source: PNAS Nexus
Summary: Researchers gave top AI models a classic attention test used in psychology and found a major flaw. While the models could correctly name colors in short lists, their performance deteriorated sharply as the task became longer and more complex. Some leading systems fell from over 90% accuracy to nearly complete failure.

Artificial intelligence systems can write essays, answer questions, and solve complex problems. But new research suggests they may struggle with something humans do every day: staying focused on the task at hand when distractions get in the way.

Researchers led by Suketu Patel put several leading AI models through a well-known psychology experiment called the Stroop task. The results revealed a significant difference between how AI systems process information and how the human brain manages attention.

What Is the Stroop Task?

The Stroop task is a classic psychological test that has been used for decades to study attention, concentration, and self-control. In the test, color words such as "red," "blue," or "green" are displayed in colored ink. Sometimes the word and the ink color match. For example, the word "red" might appear in red ink. Other times they conflict, such as the word "red" printed in blue ink.

Participants are asked to name the color of the ink rather than read the word itself. That sounds simple, but it creates a challenge because reading words is an automatic habit for most people. The brain must suppress the urge to read the word and instead focus on identifying the ink color.

Psychologists often use the task to measure what is known as executive control, a set of mental processes that helps people regulate attention, resist distractions, and stay focused on goals.

Testing AI Attention

The researchers wanted to see whether modern large language models (LLMs) handle this challenge in the same way humans do.
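
As a rough illustration of the task structure described above, the following Python sketch builds a list of word/ink pairs and scores responses separately for matching and conflicting items. The color palette, the function names (make_trials, accuracy, word_reader), and the representation of each item as a (word, ink) pair are assumptions made for illustration; the article does not specify how the researchers encoded the stimuli for the models or how they computed their scores.

```python
import random

# Illustrative palette only; not taken from the study.
COLORS = ["red", "blue", "green", "yellow"]


def make_trials(n, p_incongruent=0.5, seed=0):
    """Build n (word, ink) pairs; on incongruent trials the word differs from the ink."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n):
        ink = rng.choice(COLORS)
        if rng.random() < p_incongruent:
            # Conflicting item: pick any color word except the ink color.
            word = rng.choice([c for c in COLORS if c != ink])
        else:
            # Matching item: the word names its own ink color.
            word = ink
        trials.append((word, ink))
    return trials


def accuracy(trials, responses):
    """Score responses against ink colors, overall and by trial type."""
    buckets = {"overall": [], "congruent": [], "incongruent": []}
    for (word, ink), answer in zip(trials, responses):
        correct = answer == ink
        buckets["overall"].append(correct)
        buckets["congruent" if word == ink else "incongruent"].append(correct)
    # None marks a trial type that did not occur in this list.
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}


def word_reader(trials):
    """Hypothetical baseline that reads the word and ignores the ink."""
    return [word for word, _ink in trials]
```

Calling accuracy(trials, word_reader(trials)) on a mixed list shows why the two item types are worth scoring separately: a responder that simply reads the words gets every matching item right and every conflicting item wrong.
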
LLMs are the AI systems behind tools such as ChatGPT, Claude, and Gemini. They are trained on enormous amounts of text and learn patterns in language, allowing them to generate responses that often appear remarkably human.

When given short lists containing five color words, the AI systems generally performed well, even when the words and colors did not match. However, the picture changed dramatically as the lists became longer.

GPT-4o achieved 91% accuracy when working with five words. At ten words, its accuracy fell to 57%. When the list expanded to forty words, accuracy dropped to just 15%. Claude 3.5 Sonnet maintained stable performance through lists of twenty words but then experienced a sharp decline, falling to 24% accuracy with forty-word lists. The researchers observed similar patterns in GPT-5, Claude Opus 4.1, and Gemini 2.5.

When AI Loses Focus

The challenge became even more difficult when matching and mismatched color words appeared together in the same list. Under those conditions, performance deteriorated further. Accuracy for the mismatched items dropped to nearly zero in some cases.

According to the researchers, the AI models had trouble maintaining the instruction to identify ink colors. Instead, they increasingly defaulted to reading the words themselves. In other words, the systems appeared unable to consistently suppress the response they had been most heavily trained to produce.

This finding is particularly interesting because humans face a similar conflict. People are generally much better at reading words than naming ink colors. Yet despite this bias, most individuals can maintain high accuracy and stable performance even when confronted with long lists of conflicting words and colors.

Human Attention vs. Machine Attention

The study highlights an important distinction between human and artificial intelligence.
Although modern AI systems display impressive language and reasoning capabilities, their underlying mechanisms differ from the attention processes found in biological brains. Humans can often sustain focus on a specific goal while filtering out competing information. The results suggest that current AI models may struggle with this type of cognitive control when tasks become increasingly demanding.

The researchers argue that the performance collapse seen in these experiments points to fundamental limitations in today's large language models. While AI can sometimes mimic human behavior, the way it maintains attention appears to differ substantially from the way people do.

The findings offer a reminder that even the most advanced AI systems still have weaknesses, particularly when tasks require them to resist distractions and stay focused over extended sequences of information.

Story Source: Materials provided by PNAS Nexus. Note: Content may be edited for style and length.

Journal Reference: Suketu Chandrakant Patel, Hongbin Wang, Jin Fan. Deficient executive control in transformer attention. PNAS Nexus, 2026; 5 (6). DOI: 10.1093/pnasnexus/pgag149
