Yesterday, OpenAI CEO Sam Altman issued an internal memo announcing that the company had entered a “Code Red” state of emergency.
On the surface, this was OpenAI’s emergency response to its two formidable competitors, Google and Anthropic.
However, the deeper issue is that OpenAI faces a technical dilemma the entire industry cannot avoid: training costs are soaring and model sizes keep growing, yet the performance gains are becoming increasingly marginal.
According to Stanford University’s “2025 AI Index Report,” between 2019 and 2022, for every 10-fold increase in training costs, model performance on mainstream benchmark tests improved by an average of 25% to 35%. But after 2023, with the same 10-fold cost increase, performance improvements dropped to only 10% to 15%.
What’s worse is that since 2024, even when training costs have doubled again, performance improvements have often been less than 5%. The return on investment is plummeting.
The performance of leading models from different companies is starting to converge, as if they had collectively hit an invisible ceiling.
This has sparked a heated debate in the AI academic and industrial circles: Have large language models hit a dead end?
01
First, judging from user data, OpenAI’s leading position has begun to waver.
Google’s Gemini 3 model has surpassed OpenAI’s in benchmark tests, and Gemini’s monthly active users have soared as a result: Google’s third-quarter earnings report showed them rising from 450 million in July to 650 million in October.
At the same time, Anthropic’s Claude is also gaining popularity among enterprise customers. According to OpenRouter’s data, as of the end of November 2025, Claude’s weekly visits reached 41 million, a 17.1% increase from six weeks ago.
But the more alarming news is yet to come.
According to a report by the semiconductor industry analysis firm SemiAnalysis, OpenAI’s top researchers have not successfully completed a large-scale, full pre-training run since the release of GPT-4o in May 2024.
This means there has been no real generational leap from GPT-4o to GPT-5: the latter looks more like fine-tuning and optimization on top of GPT-4o than a freshly pre-trained model.
SemiAnalysis further criticized OpenAI in its analysis: “Pre-training a cutting-edge model is the most difficult and resource-intensive challenge in the entire AI R&D process. Google’s TPU platform has decisively passed this test, but OpenAI has not.”
Pre-training is the first and most crucial step in training large language models. During this stage, the model learns the basic rules of language, such as grammar, semantics, and factual knowledge, from massive amounts of text data.
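To make “predict the next word” concrete, here is a deliberately tiny sketch in Python. Real pre-training optimizes a neural network with a cross-entropy objective over trillions of tokens; this toy bigram counter, built on a made-up corpus, only illustrates the shape of the task.

```python
from collections import Counter, defaultdict

# Toy illustration of the pre-training objective: learn to predict the next
# token from raw text. Real systems use neural networks and trillions of
# tokens; this bigram counter only shows the shape of the task.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each preceding word.
next_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_counts[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the token most frequently seen after `word` in the corpus."""
    return next_counts[word].most_common(1)[0][0]

print(predict_next("sat"))  # 'on' -- learned purely from co-occurrence counts
```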
Being unable to complete large-scale pre-training means being unable to deliver a true next-generation model, which is fatal for a company like OpenAI that must maintain its technological lead.
The MMLU (Massive Multitask Language Understanding) scores further support SemiAnalysis’s viewpoint. MMLU is a core authoritative benchmark test for measuring the comprehensive knowledge and reasoning abilities of large models.
Judging from the results, GPT-5’s MMLU score is only about 10% to 20% higher than GPT-4’s.
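For reference, an MMLU-style score is essentially multiple-choice accuracy aggregated over many subjects (the official benchmark averages per-subject accuracy). The sketch below uses invented questions purely to show how such a number is computed.

```python
# Hypothetical scoring sketch: the model picks one of four options (A-D) per
# question, and the reported score is the fraction it gets right. The items
# below are made up for illustration only.
items = [
    {"subject": "physics",  "model_answer": "B", "correct": "B"},
    {"subject": "law",      "model_answer": "C", "correct": "A"},
    {"subject": "medicine", "model_answer": "D", "correct": "D"},
    {"subject": "history",  "model_answer": "A", "correct": "A"},
]

accuracy = sum(it["model_answer"] == it["correct"] for it in items) / len(items)
print(f"MMLU-style accuracy: {accuracy:.1%}")  # 75.0%
```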
For context, Anthropic CEO Dario Amodei has publicly stated that training a frontier model in 2024-2025 costs between 1 billion and 2 billion US dollars, roughly 10 times what it cost a year earlier, and GPT-5 is estimated to have cost about 20 to 30 times as much as GPT-4 (which cost roughly 60 million to 100 million US dollars).
Faced with this double squeeze of competition and cost, Altman had to adjust strategy and shift the focus to optimizing existing products.
In the memo, Altman stated that the company needs to improve the personalization features of ChatGPT, increase its speed and reliability, and expand the range of questions it can answer.
To this end, OpenAI decided to postpone other projects, including advertising, health and shopping AI agents, and a personal assistant named Pulse. It encouraged employees to transfer temporarily to the ChatGPT effort and to hold daily dedicated meetings on improving ChatGPT.
Prior to this, OpenAI had sounded a “Code Orange” alarm in October 2025.
OpenAI’s internal alarms come in three levels: yellow, orange, and red, in increasing order of severity. Whether an alarm is sounded depends on the competitive pressure and product crises OpenAI faces at the time.
An orange alarm corresponds to a clear competitive threat or product crisis that has put the core business on the back foot, such as market-share erosion or user churn, and it requires OpenAI to shift resources toward the affected areas.
At that time, OpenAI’s response was to set up an “emergency optimization team,” led by senior leaders in product, technology, and algorithms, and to allocate more than 50% of R&D resources to the core products.
02
However, OpenAI is not the only company stuck in a bottleneck. The entire industry is facing the same dilemma.
From the end of 2024 to the beginning of 2025, the performance improvement curve of top large models has shown a significant flattening. According to blind test data from LMSYS Chatbot Arena, in June 2024, the Elo score gap between the top-ranked and the tenth-ranked models exceeded 150 points.
But by November 2025, this gap had narrowed to less than 50 points. More notably, the scores of almost all mainstream models on key benchmark tests have started to concentrate within a narrow range. This trend means that even though companies invest vastly different amounts of resources (ranging from tens of millions to billions of US dollars), the performance of the resulting models is becoming increasingly similar.
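To get a feel for what those Elo gaps mean in practice, a rating gap can be converted into an expected head-to-head win rate under the standard Elo formula that Arena-style leaderboards are built on; the conversion below is a rough interpretation of the numbers above, not Arena’s own analysis.

```python
def elo_win_prob(gap: float) -> float:
    """Expected win rate of the higher-rated model, given its rating
    advantage `gap` in Elo points (standard Elo formula)."""
    return 1.0 / (1.0 + 10 ** (-gap / 400.0))

print(f"{elo_win_prob(150):.0%}")  # ~70%: a clearly better model (mid-2024 gap)
print(f"{elo_win_prob(50):.0%}")   # ~57%: barely distinguishable (late-2025 gap)
```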
In March 2023, when OpenAI had just released GPT-4, its MMLU score of 86.4% was far ahead of the field; mainstream competitors mostly scored between 60% and 75%. Claude v1, for example, scored only 75.6% on the same test, and LLaMA-65B only 63.4%.
By September 2025, however, on MMLU-Pro (a harder, more advanced version of MMLU), all the leading models scored between 85% and 90%, with almost no separation between them.
Release cadence tells the same story: the gap between Meta’s Llama 2 and Llama 3 was about 9 months, while the gap from Llama 3 to the planned Llama 4 has stretched past 15 months. Anthropic’s gap between Claude 3 and Claude 4 was also as long as 11 months.
All signs indicate that the Scaling Law, which was once regarded as the golden rule for large language models, is losing its effectiveness.
The reason for this result actually lies within the large models themselves.
The core task of large model training is to “predict the next word.”
Through repeated training on massive amounts of text, the model gradually acquires grammar, common sense, reasoning ability, and so on. Once the model is strong enough that grammar and common sense are no longer the bottleneck, the uncertainty inherent in language itself becomes the factor limiting its output.
For example: “He put the apple on the table, and then it disappeared.” Here, “it” could refer to the apple or the table. Grammatically, both interpretations are valid. To figure out what “it” refers to, what is needed is not better grammar knowledge but common sense judgment about the real world.
But change the sentence to: “He put the phone on the table, and then it fell over.” Here, “it” could be the phone or the table. If the table is a cheap folding table, it might indeed tip over under the phone; if the phone’s folding case is open, the phone itself might be what fell. Without more context, even humans would struggle to make an accurate judgment.
This kind of error caused by the ambiguity and uncertainty of language itself is called “irreducible error” (or “Bayes error rate”) in statistics.
Even if you have a perfect algorithm, infinite data, and computing power, this error cannot be eliminated. It is an inherent characteristic of the problem itself.
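In language-modeling terms, this floor can be stated precisely. The training loss is the cross-entropy between the true next-token distribution p and the model’s distribution q, and a standard identity splits it into an irreducible part and a reducible part:

```latex
H(p, q) \;=\;
\underbrace{H(p)}_{\text{entropy of language itself (irreducible)}}
\;+\;
\underbrace{D_{\mathrm{KL}}(p \,\|\, q)}_{\text{model's mismatch with reality (reducible by training)}}
```

No amount of data or compute can push the loss below H(p); scaling only shrinks the KL term, which is consistent with the diminishing returns described above.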
Human language is full of such uncertainty. When we speak, a lot of information is conveyed through context, body language, tone, and shared background knowledge. When all these are removed and only pure text remains, there is a huge loss of information.
Large language models are trained on this pure text, so they inherently face the limitation of irreducible error.
When the model is still weak, it makes many low-level mistakes, such as grammatical errors, factual errors, and logical errors. These can be solved by increasing data, enlarging the model, and improving algorithms. But when the model is already strong enough and no longer makes these low-level mistakes, the remaining errors are mainly these irreducible errors caused by the inherent characteristics of language.
At this stage, no matter how much money and resources are thrown at it, the improvement will be limited.
The second problem is data exhaustion. By the time GPT-4 was trained, OpenAI had already consumed nearly all of the high-quality text on the public Internet: encyclopedias, digital libraries, GitHub code, Reddit discussions, and professional papers and documentation of every kind.
Almost all of the high-quality data has been used up. What remains is mostly low-quality content: advertorials, spam posts, duplicated material, and machine-generated junk.
To get around the data shortage, some vendors have started training AI on AI-generated data. But this creates a serious problem called “model collapse”: if a model keeps consuming data produced by itself or by other models, the diversity of what it learns shrinks and its own errors and biases get amplified, until the model grows steadily worse and its output increasingly monotonous.
This process is similar to inbreeding in biology. In the biological world, if a population engages in inbreeding for a long time, its genetic diversity will gradually decrease, and genetic defects will be amplified, eventually leading to population degradation. Model collapse follows the same principle.
A paper published in Nature in 2024, “AI models collapse when trained on recursively generated data,” studied this issue systematically. The researchers found that in the early stages of model collapse, the model first loses information from the tails of the data distribution; in the later stages, the entire distribution converges to a very narrow range that bears almost no resemblance to the original data.
The researchers ran the experiment directly: they used a pre-trained language model to generate a batch of text, trained a new model on that text, used the new model to generate more text, trained an even newer model on it… After several generations, the model’s output became increasingly monotonous and repetitive, and information that appeared rarely but mattered (specialist knowledge, niche but correct viewpoints) gradually disappeared.
Every time a model generates data, it tends to generate content that is most common and “safe” in the training data. Information that appears with low frequency and is on the periphery has an even lower probability of appearing in the generated data. After several generations of iteration, this information is completely lost.
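A toy numerical analogue makes the mechanism visible (this is not the Nature paper’s experiment; the distribution and sample sizes here are invented). Each “generation” is a token-frequency table re-estimated from a finite sample of the previous generation’s output, so any rare token that happens to draw zero samples disappears permanently:

```python
import random
from collections import Counter

random.seed(42)

# Generation 0: one very common token plus 50 rare ones (the "tails").
probs = {"common": 0.5, **{f"rare_{i}": 0.5 / 50 for i in range(50)}}

for gen in range(1, 11):
    tokens, weights = zip(*probs.items())
    # The "model" generates a finite corpus, and the next model is fitted to it.
    sample = random.choices(tokens, weights=weights, k=300)
    counts = Counter(sample)
    total = sum(counts.values())
    probs = {tok: c / total for tok, c in counts.items()}
    print(f"generation {gen}: {len(probs)} distinct tokens survive")
```

Run for enough generations, only the most common tokens remain, mirroring how low-frequency but important information is the first thing to vanish.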
What’s more troublesome is that the Internet is now flooded with a large amount of AI-generated content. After the release of ChatGPT, articles, social media posts, and even academic papers on the Internet have increasingly shown signs of being generated by AI.
If future models obtain training data by crawling the Internet, they will inevitably include this AI-generated content. This means that model collapse is no longer just a theoretical problem in the laboratory but an actual threat that the entire AI industry will face.
03
The question of whether large language models have hit a dead end has always been controversial.
The reformists, represented by AI pioneer Fei-Fei Li, believe that large language models are not omnipotent; they are just one component of AI systems. To achieve true artificial intelligence, different types of tasks need to be assigned to different types of models.
Fei-Fei Li has bluntly stated that AGI (Artificial General Intelligence) is a marketing term, not a scientific term. What is truly lacking nowadays is not “general intelligence” but “spatial intelligence,” that is, the ability to understand and manipulate the three-dimensional physical world.
She believes that future AI systems may be “world models.” Their core ability is to understand three-dimensional space, physical laws, and causal relationships. They do not understand the world by learning text but by observing videos, images, and sensor data to build a cognition of the physical world.
In this vision, world models can also draw on strict logical rules and mathematical proof techniques, rather than relying purely on statistical patterns the way current large language models do.
Google DeepMind’s AlphaGeometry is an example in this direction. It can solve Olympic-level geometry problems, not relying on language models but on a combination of symbolic reasoning systems and neural networks.
Turing Award winner and former Meta Chief AI Scientist Yann LeCun is even more direct in his criticism of the language model path. He describes it as “feeding larger chips to a parrot.”
In his view, language models are only learning statistical patterns and doing pattern matching; they do not truly understand the world. To achieve true intelligence, AI must build a model of the physical world and understand basic concepts such as objects, space, time, and causality.
In this scenario, large language models will serve as “translators.” When a user makes a request in natural language, the large language model is responsible for understanding the request, translating it into instructions that machines can process, and assigning them to appropriate subsystems like world models for execution.
When the task is completed, the large language model then translates the results into natural and fluent human language and outputs them to the user.
OpenAI and Anthropic, on the other hand, are the conservatives.
Altman believes that as long as language models are further enlarged and more data and computing power are invested, intelligence will “emerge automatically.”
He believes that when the model size reaches a certain critical point, there will be a sudden qualitative leap, and the model will acquire true understanding and reasoning abilities. This viewpoint is known as the “scaling hypothesis” in the industry.
OpenAI co-founder and former Chief Scientist Ilya Sutskever holds that compression is understanding.
In his words: “If you can compress all the data in the world without loss into the neural network of a large language model, then this model has built a true model of the world internally.”
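The technical intuition behind this claim is standard information theory rather than anything specific to OpenAI: a predictive model can drive an entropy coder, so the number of bits needed to store a text is roughly the model’s summed negative log-probability over its tokens, and better prediction therefore means literally better compression.

```latex
\text{bits}(x_{1:T}) \;\approx\; \sum_{t=1}^{T} -\log_2 q(x_t \mid x_{<t})
```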
Anthropic co-founder Jared Kaplan believes that language models themselves may not be intelligence, but they can serve as the foundation for intelligence. He thinks that by improving training methods, enhancing safety alignment, and combining with other technologies, the language model path still has the potential to achieve AGI.
MIT cognitive scientist Evelina Fedorenko and several colleagues from MIT and Berkeley published an article in Nature arguing that language is not thought: human thinking is independent of language. Babies understand the physical world and grasp causality before they learn to speak, and blind or deaf people’s thinking is unimpaired despite the lack of certain sensory channels.
Since language is primarily a tool for communication rather than for thinking, they argue, language models on their own cannot amount to true artificial intelligence.