SPOTLIGHT

Google Is Building a Chip That Puts the Processor Inside the Memory — And It Could Make NVIDIA’s Architecture Obsolete

Elena Volkov · Apr 21, 2026 · 8 min read
Engine Score 9/10 — Critical

This story details a potentially revolutionary chip architecture from Google that could significantly disrupt the AI hardware market, including NVIDIA's dominance. Its high novelty and immense industry impact make it a critical development for the future of AI computing.


Google (Alphabet Inc.) is in active talks with Marvell Technology Group to co-develop a memory processing unit (MPU) — a chip architecture that performs computation directly inside memory, rather than moving data to a separate processor — to pair with its Tensor Processing Units, The Information reported on April 21, 2026. A parallel effort involves a new TPU variant designed specifically for AI model inference. Together, these two initiatives represent the most architecturally ambitious challenge yet to the hardware paradigm that NVIDIA has dominated for a decade — and they target the single bottleneck that scaling a bigger GPU cannot solve.

The Memory Wall That Has Constrained Every AI Chip

The memory wall is not a new concept. Computer architects named it in 1995, when processor clock speeds began outpacing memory bandwidth by a widening margin. What has changed is the scale of the penalty: today’s large language models require moving hundreds of gigabytes of parameters between memory and compute on every forward pass, a data-shuffling tax that consumes energy, adds latency, and caps throughput regardless of how fast the compute die itself runs.

NVIDIA’s H100 GPU delivers 3.35 terabytes per second of HBM3 memory bandwidth — roughly double that of the A100 it replaced. That sounds like progress. The problem is that a 70-billion-parameter model like Meta’s Llama 3 requires approximately 140GB of memory at FP16 precision, meaning even a fully saturated H100 must cycle through its entire memory pool multiple times per second to sustain inference throughput. At this scale, memory bandwidth — not raw compute — is the binding constraint. Adding more FLOPS does not fix a bandwidth problem.
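The arithmetic behind that claim is simple enough to sketch. The back-of-envelope below uses only the figures quoted above (140GB of FP16 weights, 3.35 TB/s of HBM3 bandwidth) and assumes single-stream decoding where every weight must be streamed through compute once per generated token; real-world throughput also depends on batching, KV-cache traffic, and kernel efficiency.

```python
# Bandwidth-bound ceiling for single-stream decode of a dense 70B model
# on an H100-class accelerator. Figures from the text; illustrative only.
PARAMS = 70e9
BYTES_PER_PARAM = 2          # FP16
HBM_BANDWIDTH = 3.35e12      # bytes/s, H100 HBM3

model_bytes = PARAMS * BYTES_PER_PARAM            # ~140 GB of weights

# Each generated token streams every weight through compute once,
# so memory bandwidth, not FLOPS, caps decode speed.
max_tokens_per_s = HBM_BANDWIDTH / model_bytes

print(f"model size: {model_bytes / 1e9:.0f} GB")
print(f"bandwidth-bound ceiling: ~{max_tokens_per_s:.0f} tokens/s")
```

Batching amortizes the weight traffic across many concurrent requests, which is why serving systems chase large batch sizes — but per-stream latency still runs into this wall.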

This is the regime Google’s MPU targets: execute mathematical operations where data already lives, rather than shuttling tensors across a memory bus to a separate processing die.

What a Memory Processing Unit Actually Does

Processing-in-memory (PIM) has existed as a research concept since the 1990s. The engineering obstacle has always been consistent: memory cells are optimized for density and read/write speed, not arithmetic. Adding compute logic to a DRAM die either reduces memory density, increases heat, or both — tradeoffs that made PIM impractical at scale for conventional computing workloads.

What makes this moment different is the maturation of High Bandwidth Memory stacking technology. HBM already places memory dies in a vertical stack connected by thousands of through-silicon vias. Adding a logic layer at the base of that stack — where an MPU would live — is architecturally feasible in a way it was not for standard DRAM. Samsung demonstrated this with its HBM-PIM architecture, disclosed in 2021, which showed a 2.67x throughput improvement and 70% reduction in energy consumption for specific AI inference operations compared to standard HBM.
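Taken together, Samsung's two published figures compound into a larger efficiency gain than either suggests alone. The arithmetic below simply combines the 2.67x throughput and 70% energy-reduction numbers quoted above into a performance-per-joule ratio; it is illustrative arithmetic on reported figures, not a benchmark.

```python
# Combining Samsung's published HBM-PIM figures into one efficiency metric.
speedup = 2.67               # throughput improvement vs. standard HBM
energy_ratio = 1 - 0.70      # PIM consumes 30% of baseline energy

# Work done per joule improves by both factors at once.
perf_per_joule = speedup / energy_ratio

print(f"perf/joule improvement: ~{perf_per_joule:.1f}x")
```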

Google’s proposed MPU, as described by The Information, goes further than a simple arithmetic overlay. The architecture is designed to understand tensor operations natively — making the memory itself a programmable compute substrate tightly integrated with the TPU’s dataflow model, rather than a dumb storage layer with arithmetic bolted on top. The difference between those two approaches is the difference between a parlor trick and an architectural shift.

Why Marvell, and What the Partnership Structure Reveals

Marvell Technology Group Ltd. (NASDAQ: MRVL) is not a household name outside semiconductor circles, but it has built a substantial position in custom silicon for hyperscalers. The company manufactures custom networking ASICs for Amazon, Google, and Microsoft, and its electro-optics chiplets are embedded in major AI switching infrastructure. Marvell’s stock has surged consistently on AI infrastructure demand through 2025 and into 2026, a signal of how central custom silicon has become to hyperscaler competitive strategy.

The partnership structure reported by The Information positions Marvell as the manufacturing and advanced packaging partner while Google contributes the microarchitecture and TPU integration design. This mirrors Google’s existing TPU program structure, where Google handles compute logic and partners externally for fabrication — a division of labor that has produced six TPU generations since 2016.

Marvell’s specific expertise is in chiplet integration and HBM interface design. Building a functional MPU requires not just a compute layer embedded in memory, but a high-bandwidth, low-latency interconnect between that compute layer and the host TPU — precisely the class of problem Marvell’s engineering organization exists to solve. Google is not trying to build this alone, and the choice of partner is not arbitrary.

The Inference TPU: Google’s Second Prong

The MPU initiative runs in parallel with a separate effort: a new TPU variant designed specifically for inference rather than training. The architectural distinction matters. Training demands enormous volumes of high-precision matrix multiply-accumulate operations and can tolerate latency. Inference is latency-sensitive, runs increasingly at lower numerical precision (INT8 or FP8 rather than BF16), and is dominated by memory-access patterns rather than raw compute throughput.
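The precision difference feeds directly back into the memory wall. The sketch below shows the weight footprint of a 70-billion-parameter model at the numeric formats named above; halving bytes-per-parameter also halves the bytes streamed per generated token, which is exactly why inference gravitates toward low precision on bandwidth-bound hardware.

```python
# Weight footprint of a 70B-parameter model at common serving precisions.
PARAMS = 70e9
for fmt, bytes_per_param in [("BF16", 2), ("FP8", 1), ("INT8", 1)]:
    gb = PARAMS * bytes_per_param / 1e9
    print(f"{fmt}: {gb:.0f} GB of weights")
```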

NVIDIA addressed this split by releasing the H100 for training while repositioning the L40S and L4 for inference at scale. Google’s custom approach allows something NVIDIA cannot easily replicate: co-designing the inference TPU and the MPU as a single architectural system, optimizing both chips for the specific data-flow patterns of transformer inference rather than adapting a general-purpose GPU to a task it was not built for.

The infrastructure investment context matters. Nebius is committing $10 billion to AI data center construction in Finland, illustrating how compute infrastructure spending is accelerating globally even as efficiency architectures like Google’s MPU threaten to restructure the underlying hardware economics before those data centers are finished.

Why This Is Architecturally More Important Than Bigger GPUs

NVIDIA’s product cadence follows a consistent pattern: each generation delivers more HBM capacity, more HBM bandwidth, and more FP8 FLOPS than the last. Blackwell increased per-GPU HBM capacity to 192GB and introduced NVLink 5.0 interconnects capable of 1.8 terabytes per second between chips. These are genuine advances — and they are advances within the same architectural paradigm, pushing data faster across the same fundamental memory-processor gap.

Google’s MPU concept is a paradigm break rather than a refinement. When computation moves to where data already lives, you eliminate the data-movement energy budget. In modern AI inference, that budget accounts for an estimated 40 to 60% of total system power consumption, according to research from MIT’s Computer Science and Artificial Intelligence Laboratory. At datacenter scale, a 40% reduction in per-token energy cost translates directly to lower inference pricing — and potentially undermines the economics of every NVIDIA-based inference cluster simultaneously.
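The scale of that claim can be bounded with simple arithmetic. The sketch below applies the CSAIL estimate quoted above (data movement as 40 to 60% of system power) under two assumed elimination factors; the 80% and 100% figures are illustrative assumptions, not measurements of any real MPU.

```python
# Bounding the system-level energy saving if in-memory compute removes
# most data-movement energy. Both axes are assumptions for illustration.
for movement_share in (0.40, 0.60):          # CSAIL-estimated range
    for eliminated in (0.8, 1.0):            # assumed fraction removed
        saving = movement_share * eliminated
        print(f"movement={movement_share:.0%}, eliminated={eliminated:.0%}"
              f" -> system energy -{saving:.0%}")
```

Even the most conservative corner of that grid (40% share, 80% eliminated) implies roughly a one-third cut in per-token energy, which is the figure the pricing argument above turns on.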

The secondary effect is equally important. Lower effective memory-bandwidth requirements mean smaller, cheaper systems can serve larger models. An MPU-equipped inference cluster could plausibly run a 70-billion-parameter model on hardware configurations that today require significantly larger GPU installations — a shift that changes the capital expenditure calculus for every enterprise deploying AI at scale.

The Competitive Moat — and What It Means for Inference Pricing

Google’s TPU program has already demonstrated that custom silicon builds defensible competitive position. TPU v4 pods delivered approximately 1.1 exaflops of compute, enabling Google to run Gemini training at a cost structure external cloud customers cannot replicate with rented GPU capacity. The MPU extends that logic from training into inference — the workload that increasingly dominates total AI compute spending.

The commercial implication is direct. Deals like OpenAI’s content partnerships with Disney illustrate how inference delivery — not model training — has become the commercial chokepoint in AI. Access to cheap, low-latency inference is the product now being sold. Every percentage point reduction in inference cost improves Google’s margin on Cloud TPU access and Gemini API delivery while putting structural pressure on competitors whose infrastructure costs are higher by design.

MegaOne AI tracks 139+ AI tools across 17 categories. The clearest pattern emerging across both foundation model providers and infrastructure vendors is a shift from training-centric competition to inference-efficiency competition. Google’s MPU bet is a direct play for that terrain — and if it ships, it arrives precisely when inference economics have become the primary battleground.

The NVIDIA Question — and the Realistic Timeline

NVIDIA is not standing still. The company acquired Mellanox for networking expertise, has deep partnerships with SK Hynix and Micron on HBM development, and its Blackwell Ultra roadmap includes continued HBM capacity scaling. More importantly, NVIDIA’s CUDA ecosystem represents roughly 15 years of accumulated developer tooling, library optimization, and institutional knowledge — a moat that a new chip architecture cannot dissolve immediately regardless of the hardware specifications.

NVIDIA’s data center revenue reached $47.5 billion in fiscal 2024. That revenue depends on a hardware paradigm where compute and memory are separate, connected by a bandwidth-limited bus, and where moving data between them is unavoidable. Google’s MPU challenges that assumption at the architectural level — not just the performance level.

The timeline question is real. Custom silicon programs routinely require three to five years from initial design to volume deployment. Google’s original TPU v1 was deployed internally in 2015 and only became publicly available on Google Cloud in 2018. An MPU entering serious design discussions in 2026 is unlikely to reach production scale before 2029. The AI application layer is already evolving faster than infrastructure planning cycles — which is precisely why architectural bets made now determine competitive position years ahead.

What has already changed is the signal itself. Google is not trying to win the GPU race by building a better GPU. It is attempting to make the GPU paradigm less relevant by attacking the architectural assumption — memory and compute as separate, bandwidth-connected entities — that every GPU on the market is built around. That ambition, backed by Google’s silicon track record and Marvell’s packaging expertise, is the most structurally important hardware development in AI today. Not because the chip exists yet. Because the direction is now clear.

For enterprises planning AI infrastructure: treat GPU commitments as term-limited rather than indefinite. The hardware assumptions underlying current AI economics are less stable than the prevailing GPU supercycle suggests. The firms that lock in flexible, inference-optimized infrastructure postures now will have the most room to maneuver when Google’s architectural bet starts landing.
