For many years, discussions of the biggest hurdles to AI revolved around two factors: computing power and the energy required to run AI systems. However, at CES 2026, NVIDIA CEO Jensen Huang offered a different perspective, calling "context" the new bottleneck in the AI industry. According to him, AI research labs and cloud service providers are increasingly struggling to supply the amount of memory needed to run modern AI models.
At first glance, this might seem counterintuitive. Today's AI models can process millions of tokens, perform complex tasks, and even operate as autonomous AI agents. So why is memory such a major issue?
The answer lies in a change in how AI is used.
From AI training problems to AI operational problems
In the early stages of the AI race, much of the attention was focused on training large language models. This process required a massive number of GPUs and a very large upfront investment. However, training is a one-time cost incurred with each model iteration.
Today, as businesses begin to integrate AI into their operations, the more important challenge is inference, the process by which models generate responses for users.
Unlike training, demand for inference grows with the number of users and the number of tasks AI handles. The more people use AI, the more resources are required.
This is particularly evident as AI transitions from simple chatbots to agent systems capable of working continuously in the background. These agents need to save their work state, track interaction history, and maintain context over extended periods.
In other words, today's AI needs not only to "think," but also to "remember."
Why is context so important?
An AI model cannot generate accurate responses without context. Each time a user asks a question, the entire conversation history, system instructions, and related data must be retrieved and fed back to the GPU for processing. Without this context, the AI would be unable to sustain the conversation or complete multi-step tasks.
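To make the cost of resending the full history concrete, here is a minimal sketch that counts how many tokens a model has to process over a conversation, with and without reusing earlier work. The function name and the turn sizes are hypothetical and chosen only for illustration; real systems differ in many details.

```python
def tokens_processed(turn_lengths, reuse_cache):
    """Total tokens the model must process across a whole conversation.

    turn_lengths: tokens added at each turn (user message plus reply).
    reuse_cache:  if True, only the new tokens are processed each turn;
                  if False, the full history is reprocessed every turn.
    """
    total = 0
    history = 0
    for new_tokens in turn_lengths:
        if reuse_cache:
            total += new_tokens
        else:
            total += history + new_tokens
        history += new_tokens
    return total


# Hypothetical conversation: 20 turns of 500 tokens each.
turns = [500] * 20
print(tokens_processed(turns, reuse_cache=False))  # 105000
print(tokens_processed(turns, reuse_cache=True))   # 10000
```

In this toy case, reprocessing the history on every turn costs about ten times more than reusing it, and the gap widens as the conversation gets longer.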
The problem is exacerbated by the fact that modern AI agents can run continuously for hours or even days. A market research agent might need to keep dozens of different documents in context. A programming agent must maintain the state of an entire project. A business assistant might need to track hundreds of emails and documents related to the same task.
All that information needs to be stored somewhere and readily accessible. That's where memory becomes a critical factor in system performance.
KV Cache: The "working memory" of AI
The core of the problem lies in a component called the KV Cache (Key-Value Cache). It can be thought of as the model's working memory during inference: it holds the key and value tensors computed for every token already processed, so they do not have to be recomputed for each new token. The KV Cache is stored in the GPU's High Bandwidth Memory (HBM), a type of very fast memory that lets the model read this data with minimal delay.
However, HBM has a major drawback: cost. According to AI infrastructure experts, the cost of HBM can currently reach around $10,000 per terabyte of capacity. Because HBM is packaged with the GPU, expanding it means adding more GPUs, which is an extremely expensive option.
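A rough size estimate shows why this working memory fills up quickly. The sketch below uses the commonly cited formula for a transformer's KV cache for a single sequence; the model dimensions are hypothetical and do not describe any specific product.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    The leading factor of 2 accounts for storing one key vector and one
    value vector per token, per layer, per KV head.
    bytes_per_value=2 assumes 16-bit values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
million_tokens = kv_cache_bytes(32, 8, 128, seq_len=1_000_000)

print(per_token)             # 131072 bytes, i.e. 128 KiB per token
print(million_tokens / 1e9)  # 131.072 (GB) for a single 1M-token context
```

Under these assumptions, a single one-million-token context needs about 131 GB of cache, before counting the model weights or any other concurrent users.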
This is why many companies are looking to build new memory architectures instead of continuing to rely entirely on HBM.
As the context window grows larger and larger
While a few years ago the context window of AI models was typically only a few tens of thousands of tokens, many models have now surpassed one million tokens. This allows AI to process much larger amounts of information, but it also causes a surge in memory requirements.
A serious problem arises when the context no longer fits in the working memory that holds the KV Cache. In that case, part of the cached state is discarded, and the system may have to recalculate it from scratch to restore it.
According to industry studies, processing times can increase by a factor of 20 to 40 in such situations. If this happens across millions of agent loops, or while thousands of programmers wait for AI to generate source code, the resulting costs become enormous.
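The effect of such recomputation on average response time can be estimated with simple arithmetic. The sketch below takes the 20x to 40x slowdown quoted above as given; the baseline latency and the share of requests that lose their cached context are hypothetical numbers chosen for illustration.

```python
def expected_latency(base_seconds, miss_rate, slowdown):
    """Average latency when a fraction of requests must rebuild their context.

    base_seconds: latency when the cached context is available.
    miss_rate:    fraction of requests whose context has to be recomputed.
    slowdown:     how many times slower a recomputed request is.
    """
    hit_part = (1 - miss_rate) * base_seconds
    miss_part = miss_rate * base_seconds * slowdown
    return hit_part + miss_part


# Hypothetical workload: 2-second baseline, 5% of requests lose their context.
for slowdown in (20, 40):
    average = expected_latency(2.0, 0.05, slowdown)
    print(f"slowdown {slowdown}x -> average latency {average:.2f} s")
```

With these assumed numbers, losing the context on only 5% of requests roughly doubles or triples the average latency, which is why avoiding recomputation matters even when it is rare.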
Not only is GPU usage wasted, but employee waiting time also becomes a significant cost for businesses.
SSDs are becoming part of AI memory systems
To address this problem, many companies are changing their perspective on storage.
Previously, SSDs were primarily seen as a place for long-term data storage. But in the new generation of AI infrastructure, SSDs are being used as an extended memory layer for GPUs.
Although SSDs are slower than HBM, the cost per terabyte is significantly lower. This allows businesses to store larger amounts of context without having to purchase a huge number of additional GPUs.
The idea is to keep only the parts of the context that need immediate access in HBM, while the rest is stored on high-speed SSD systems and loaded when needed.
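The tiering idea can be illustrated with a toy two-level cache. This is a minimal sketch, not how any vendor implements it: the class name is hypothetical, the "fast" tier stands in for HBM, the "slow" tier stands in for SSD, and the least recently used entries are the ones moved out of the fast tier.

```python
from collections import OrderedDict


class TieredContextCache:
    """Toy two-tier cache: a small fast tier backed by a large slow tier."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # stands in for HBM; most recently used last
        self.slow = {}             # stands in for SSD; assumed unbounded here

    def _evict(self):
        # Move least recently used blocks to the slow tier when over capacity.
        while len(self.fast) > self.fast_capacity:
            old_key, old_block = self.fast.popitem(last=False)
            self.slow[old_key] = old_block

    def put(self, key, block):
        self.slow.pop(key, None)   # avoid keeping a stale copy in the slow tier
        self.fast[key] = block
        self.fast.move_to_end(key)
        self._evict()

    def get(self, key):
        """Return (block, tier), where tier is 'fast', 'slow', or 'miss'."""
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key], "fast"
        if key in self.slow:
            block = self.slow.pop(key)  # promote back into the fast tier
            self.fast[key] = block
            self._evict()
            return block, "slow"
        return None, "miss"             # caller would have to recompute


cache = TieredContextCache(fast_capacity=2)
for session in ("a", "b", "c"):
    cache.put(session, f"context for {session}")

print(cache.get("a")[1])  # slow: "a" was pushed out of the fast tier
print(cache.get("c")[1])  # fast
print(cache.get("z")[1])  # miss
```

A "slow" hit still costs a load from storage, but it avoids the full recomputation that a "miss" would require.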
This is a trade-off between speed and cost, but it's becoming increasingly reasonable as the demand for AI operations surges.
One of the most notable announcements at CES 2026 was NVIDIA's CMX Context Memory Platform. Instead of treating memory as a resource tied to each individual GPU, NVIDIA is building a system where entire GPU clusters can access a single, massive context repository.
This platform combines high-density storage with BlueField-4 processors and allows a GPU cluster to access up to 18 petabytes of cached contextual data. This means any GPU in the cluster can retrieve a previous conversation or working state without recalculating from scratch.
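The benefit of a shared context layer can be sketched with a toy model. The classes below are hypothetical and are not NVIDIA's API: each worker stands in for a GPU, and the shared store simply records how many tokens of a session have already been processed, assuming a session's history only ever grows.

```python
class SharedContextStore:
    """Toy stand-in for a cluster-wide context repository."""

    def __init__(self):
        self._tokens_done = {}

    def load(self, session_id):
        # Number of tokens already processed for this session (0 if unknown).
        return self._tokens_done.get(session_id, 0)

    def save(self, session_id, tokens_done):
        self._tokens_done[session_id] = tokens_done


class Worker:
    """Toy stand-in for one GPU serving requests."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, session_id, tokens):
        """Return how many tokens this worker actually had to process."""
        already_done = self.store.load(session_id)
        work = len(tokens) - already_done
        self.store.save(session_id, len(tokens))
        return work


store = SharedContextStore()
gpu_a = Worker("gpu-a", store)
gpu_b = Worker("gpu-b", store)

history = list(range(1000))
print(gpu_a.handle("session-1", history))                      # 1000
print(gpu_b.handle("session-1", history + list(range(200))))   # 200
```

Even though the second request lands on a different worker, only the 200 new tokens are processed, because the earlier state is found in the shared store.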
Essentially, this represents a shift from the mindset of "each GPU managing its own memory" to "the entire system sharing a common memory layer."
The AI race is no longer just about GPUs
For years, GPUs have been the focus of every discussion about AI. Companies that own more GPUs are often seen as having a greater advantage. But reality is showing that GPUs are only part of the equation.
As AI models grow larger and the number of agents increases exponentially, the efficiency of memory systems will become just as important as computing power.
A GPU costing tens of thousands of dollars delivers little value if it sits idle waiting for data to load. Similarly, an expensive team of engineers will struggle to be effective if the AI tools they use keep stalling because of insufficient memory.
That's why leading companies today are not only focusing on developing more powerful chips but are also redesigning entire storage and memory architectures for AI.
As AI enters the large-scale, practical deployment phase, the most important challenge is no longer simply training models or purchasing more GPUs. The new challenge lies in how AI can effectively remember, retrieve, and maintain context.
The explosion of AI agents, million-token context windows, and the ever-increasing demand for inference are transforming memory into one of the most strategic resources for the AI industry.
In the coming years, the AI race may no longer be decided solely by which model is smarter, but also by which company builds a more efficient memory system that lets AI operate at real-world scale.