How AI Inference Is Creating New Memory Demand
KV cache offloading and agentic AI as key drivers
“The memory system of AIs is going to cause the storage system to be completely revolutionized.” At GTC Taipei in June 2026, Nvidia founder and CEO Jensen Huang pointed to the memory system as one of the hardest parts of AI infrastructure. The challenge spans managing the KV cache that serves as an agent’s working memory, retrieving structured and unstructured data, and establishing a data ontology.
To address the surging KV cache storage demands of the AI inference era, Nvidia introduced the CMX Context Memory Storage Platform in January 2026, managed by the BlueField-4 DPU, which adds a pod‑level context tier between local SSD and shared storage.
Meanwhile, the rise of agentic AI is reshaping CPU architecture. Huang noted that agents live in a world of nanoseconds, where every moment of waiting prevents them from advancing to the next step, making ultra-low latency the primary requirement. With Nvidia and Arm both launching CPU rack solutions purpose-built for agents, the industry is shifting from throughput-oriented to latency-oriented architectures, opening up an incremental market for CPU RAM.
Related report: Server DRAM Industry Analysis-2Q26
Test-time Scaling “Thinking”: >5X Tokens Per Year
According to Nvidia’s public data, the average output token count per question has surged at a rate exceeding 5x per year since the second half of 2024, reaching approximately 30,000 to 40,000 tokens, indicating that the industry has entered the Test-time Scaling “Thinking” stage of Nvidia’s Three Scaling Laws. This explosion in per-question token output translates directly into greater demands on memory and compute resources.
In the AI inference era, hardware requirements for AI chips and overall systems differ fundamentally from those of AI training. Inference places three key demands on hardware: (1) higher queries per second (QPS), (2) longer context windows, and (3) more inference steps and agentic AI loops. Each of these drives structural changes in memory demand. We will examine this across three dimensions: model weights, KV cache, and agentic AI.
Related report: AI Servers Absorbing LPDRAM Capacity, Signaling Tight Supply as the New Norm
Model weights
Model weights are the numerical parameters stored within an AI model. When a model is loaded, its weights occupy a fixed portion of memory; this is a static allocation that does not change with the workload. The more parameters a model has, the more memory its weights consume. The formula for calculating model weight memory is:
Total Size of Model Weights = Parameters × bytes per parameter
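As a rough illustration of the formula above, the sketch below computes weight memory for a hypothetical 70-billion-parameter model at several numeric precisions. The model size and the function name weight_bytes are assumptions chosen for the example; the byte widths are the standard sizes of each format.

```python
# Standard storage width, in bytes, of one parameter in each format.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_bytes(num_params: float, dtype: str) -> float:
    """Total size of model weights = parameters x bytes per parameter."""
    return num_params * BYTES_PER_PARAM[dtype]


# Hypothetical model size, used only for illustration.
NUM_PARAMS = 70e9

for dtype in ("fp16", "fp8", "int4"):
    size_gb = weight_bytes(NUM_PARAMS, dtype) / 1e9  # decimal gigabytes
    print(f"{dtype}: {size_gb:.0f} GB")
```

Under these assumptions, the same model needs 140 GB at FP16, 70 GB at FP8, and 35 GB at INT4, which shows why lower precision is a direct lever on the static weight footprint.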
KV Cache
KV cache stores the key-value vectors generated during the inference prefill stage so that they do not have to be recomputed during the decode stage. Unlike model weights, it is a dynamic allocation: as conversation length and batch size grow in inference workloads, KV cache memory consumption grows accordingly.
The formula for calculating the total size of KV cache is:
Total Size of KV Cache (bytes)
= 2 × number of layers × number of KV heads × head dimension
× sequence length × batch size × bytes per value (precision)
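To make the scaling concrete, here is a minimal sketch that evaluates the KV cache formula for a hypothetical model configuration. The layer count, KV head count, head dimension, and sequence length below are assumptions chosen for illustration and do not describe any specific product; the leading factor of 2 accounts for storing both keys and values.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int) -> int:
    """KV cache size in bytes; the factor 2 covers keys plus values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GIB = 1024 ** 3

# Hypothetical configuration: 80 layers, 8 KV heads, head dimension 128, FP16.
config = dict(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_value=2)

for batch_size in (1, 16):
    size = kv_cache_bytes(seq_len=32_768, batch_size=batch_size, **config)
    print(f"batch={batch_size}: {size / GIB:.0f} GiB")
```

With these assumed values, a single 32,768-token sequence occupies 10 GiB, and a batch of 16 such sequences occupies 160 GiB. Because the size is linear in both sequence length and batch size, long-context, high-batch workloads can outgrow a single GPU's HBM quickly.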
SSD POD Demand Driven by KV Cache Offloading
As KV cache memory footprint expands dramatically with conversation length and batch size, its management and placement have become critical in AI inference applications. In long-context, high-batch workloads, when the GPU’s HBM capacity is insufficient, the system must discard KV cache and rerun prefill computation, increasing latency and raising total cost of ownership (TCO).
To address this, Nvidia released Dynamo, software that supports KV cache offloading, in March 2025. Dynamo moves infrequently accessed KV cache to lower-bandwidth but higher-capacity and more cost-effective tiers such as CPU RAM and SSD, so that the data remains reusable during the inference decode stage without requiring prefill recomputation.
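The general idea of offloading instead of discarding can be sketched with a toy two-tier cache. This is not Dynamo's API or implementation; the class TieredKVCache, its least-recently-used eviction policy, and the prefill callback are all hypothetical, chosen only to show how a slower tier avoids recomputation.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier store: a small fast tier backed by a larger slow tier."""

    def __init__(self, fast_capacity: int):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # stands in for GPU HBM, kept in LRU order
        self.slow = {}             # stands in for CPU RAM or SSD
        self.recomputes = 0        # number of prefill recomputations

    def put(self, session_id, kv_block):
        self.fast[session_id] = kv_block
        self.fast.move_to_end(session_id)
        # When the fast tier overflows, offload the least recently used
        # entry to the slow tier instead of discarding it.
        while len(self.fast) > self.fast_capacity:
            victim, block = self.fast.popitem(last=False)
            self.slow[victim] = block

    def get(self, session_id, prefill):
        if session_id in self.fast:
            self.fast.move_to_end(session_id)
            return self.fast[session_id]
        if session_id in self.slow:
            # Slow-tier hit: promote the block back, no recomputation needed.
            block = self.slow.pop(session_id)
            self.put(session_id, block)
            return block
        # Miss in both tiers: the KV cache must be rebuilt by prefill.
        self.recomputes += 1
        block = prefill(session_id)
        self.put(session_id, block)
        return block


cache = TieredKVCache(fast_capacity=2)
for session in ["a", "b", "c", "a"]:
    cache.get(session, prefill=lambda s: f"kv({s})")
print(cache.recomputes)
```

In this run, session "a" is evicted from the fast tier when "c" arrives, but its return is served from the slow tier, so only the three first-time requests trigger prefill. Without the slow tier, the fourth request would have required a fourth prefill pass.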
Dynamo can be paired with the CMX Context Memory Storage Platform, which Nvidia introduced in January 2026 to store and manage the massive KV cache generated by long-context workloads. Built around the BlueField-4 STX Rack, CMX uses 64 BlueField-4 DPUs (4 DPUs × 16 compute trays) to manage approximately 9,600 TB of capacity per rack, adding a pod‑level context tier between local SSD and shared storage.
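The rack-level figures above can be checked with simple arithmetic. The sketch below assumes capacity is spread evenly across DPUs, which the source does not state, and uses a hypothetical 10 GiB KV cache per session purely to give a sense of scale.

```python
# Figures quoted in the text.
DPUS_PER_TRAY = 4
TRAYS_PER_RACK = 16
RACK_CAPACITY_TB = 9_600  # approximate, decimal terabytes

dpus_per_rack = DPUS_PER_TRAY * TRAYS_PER_RACK

# Assumption: capacity is divided evenly across DPUs.
tb_per_dpu = RACK_CAPACITY_TB / dpus_per_rack

# Assumption: each session holds a hypothetical 10 GiB of KV cache.
SESSION_KV_BYTES = 10 * 1024 ** 3
sessions_per_rack = (RACK_CAPACITY_TB * 10 ** 12) // SESSION_KV_BYTES

print(f"{dpus_per_rack} DPUs per rack, about {tb_per_dpu:.0f} TB per DPU")
print(f"roughly {sessions_per_rack:,} sessions of 10 GiB each per rack")
```

Under the even-split assumption, each DPU would manage about 150 TB. The session count is only an order-of-magnitude indication, since real KV cache sizes vary with model configuration and context length.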

Note: An SSD POD is a standalone storage unit comprising multiple SSD racks, dedicated to storing offloaded KV cache. Positioned as the G3.5 tier between local SSD (G3) and shared storage (G4), it provides more capacity for KV cache than local SSD while offering faster access than shared storage.
Agentic AI: More CPUs, More RAM
The proliferation of AI inference is also accelerating the deployment of agentic AI applications. In AI agent workflows, models must actively plan, call tools, make decisions, and execute actions on behalf of users, all of which require CPUs to handle orchestration, tool calls, data routing, and sub-agent evaluation. As a result, the CPU-to-GPU workload ratio in agentic AI deployments is expected to shift from the traditional 1:4 or 1:8 to approximately 1:1, creating significant growth potential for the CPU market and driving new demand for CPU RAM.
Related report: 2026 Agentic AI Wave: CPU Shortage and GPU Ratio Structural Changes
In 2026, Nvidia introduced the Vera CPU, purpose-built for agentic AI workloads. Based on the original specification, Vera supports up to 1.5 TB of LPDDR5X memory capacity, three times the capacity of the previous-generation Grace CPU.
Note: Nvidia has decided to halve the SOCAMM memory capacity of its next-generation Vera Rubin Superchip modules, according to TrendForce’s latest findings. This adjustment does not reflect a reduction in Nvidia’s overall memory demand. Rather, it is a response to insufficient LPDRAM capacity allocated to Nvidia under its suppliers’ preliminary 2027 production plans.

Beyond Nvidia, the broader CPU landscape is also accelerating. Traditional x86 vendors Intel and AMD launched new products in 2026: Intel’s Xeon 6+ (Clearwater Forest) and AMD’s EPYC Venice. On the Arm side, Arm launched the Arm AGI CPU, while Ampere’s AmpereOne MX is expected to enter mass production later this year. 2026 is shaping up to be the year of a full-spectrum CPU refresh for agentic AI.

In sum, we are seeing two new drivers of memory demand generated by AI inference. On one front, inference workloads are driving KV cache consumption to expand rapidly, and KV cache offloading technology enables large volumes of KV cache to be moved to CPU RAM and SSD PODs. As Nvidia, Google, and others introduce new SSD POD platforms, demand in this segment is expected to continue rising.
On the other front, agentic AI is shifting the CPU-to-GPU workload ratio toward 1:1, creating significant growth potential for CPU demand and driving a corresponding increase in CPU RAM demand.








