Ditching HBM: Inside the Monolithic 3D AI ASIC

PhantaField’s Sophon PFG-1 puts 330 GB of DRAM on-die, bypassing the GPU memory wall for low-batch LLMs.

Ji-ho Choi

Security & Cloud Editor · Jun 29, 2026 · 5 min read

The memory wall is the defining constraint of modern large language model inference and training. While high-bandwidth memory (HBM) roadmaps chase increasingly expensive stacking technologies, the physical separation between logic and memory remains a fundamental bottleneck. HBM4 subsystems offer massively parallel buses, but signals must still cross a silicon interposer, which costs significant energy and limits low-batch latency.

PhantaField's PFG-1, codenamed "Sophon," represents a radical departure from this architecture. By abandoning off-die HBM entirely, the 750 mm² monolithic 3D (M3D) ASIC integrates 330 GB of on-die DRAM directly into the compute stack. Built on a 28 nm silicon Complementary Metal-Oxide-Semiconductor (CMOS) base tier with a 32-tier 2D Transition-Metal Dichalcogenide (TMD) CMOS MAC stack, Sophon bypasses the physical and economic constraints of traditional GPU memory architectures.

The Physics of 2T0C Gain-Cell DRAM

To understand how Sophon fits 330 GB of writable memory onto a single die without melting the silicon, we have to look at the memory cell itself. Traditional DRAM requires a capacitor to store charge, which is difficult to scale vertically and suffers from high leakage rates, demanding frequent refresh cycles.

Sophon uses a 2-transistor, 0-capacitor (2T0C) gain-cell DRAM configuration. This design exploits the exceptionally low off-current density of 2D-TMD transistors, which sits at approximately 1 fA/µm at 28 nm (roughly 0.5 fA per cell). Because the leakage is so low, the cell can retain its charge for several seconds without an explicit storage capacitor.
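The retention claim follows from the usual leakage relation t = C·ΔV / I. The sketch below works through it with the article's 0.5 fA per-cell leakage. The storage-node capacitance (5 fF) and tolerable voltage droop (0.3 V) are illustrative assumptions, not PhantaField figures, chosen only to show what "several seconds" would require.

```python
# Leakage per cell as quoted in the article.
LEAK_A_PER_CELL = 0.5e-15  # 0.5 fA


def retention_s(c_farads, delta_v, leak_a=LEAK_A_PER_CELL):
    """Time for leakage to drain a node by delta_v: t = C * dV / I."""
    return c_farads * delta_v / leak_a


def charge_budget_c(target_retention_s, leak_a=LEAK_A_PER_CELL):
    """Charge the node can afford to lose over the target retention time."""
    return leak_a * target_retention_s


# ASSUMED values (not from the source): 5 fF storage node, 0.3 V droop.
assumed_c = 5e-15
assumed_dv = 0.3
t = retention_s(assumed_c, assumed_dv)
print(f"retention with assumed node: {t:.2f} s")
print(f"charge budget for 3 s: {charge_budget_c(3.0) * 1e15:.2f} fC")
```

Under those assumptions, the node only has to tolerate losing about 1.5 fC over three seconds, which is why no dedicated capacitor is needed.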

This capacitor-less design allows the memory to be embedded directly at the Back-End-Of-Line (BEOL) Metal-3 layer of each memory tier. The consequences show up in write endurance, idle power, and write energy:

Endurance: Unlike non-volatile alternatives such as Resistive RAM (RRAM), which degrade after roughly 10⁶ write cycles, the 2T0C TMD cells have unlimited write endurance. This makes the memory fully read-write symmetric and able to absorb the 10¹⁰ write cycles per parameter required during deep learning training.
Power: The multi-second retention time reduces the refresh overhead to just 0.08 W. When the chip is idle, the entire 330 GB memory array can remain resident on a power budget of only 3 W.
Write Energy: In-place gradient writes consume only 20 fJ/bit, allowing the chip to execute forward and backward training passes within a manageable thermal envelope.
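To put the write-energy and refresh figures in perspective, here is a minimal back-of-the-envelope sketch using only the numbers quoted above. It assumes, for illustration, that one training step rewrites every bit of every BF16 weight exactly once and ignores optimizer-state writes.

```python
# Figures quoted in the article.
PARAMS = 80e9                    # 80B-parameter model
BITS_PER_PARAM = 16              # BF16
WRITE_ENERGY_J_PER_BIT = 20e-15  # 20 fJ/bit
REFRESH_W = 0.08                 # refresh overhead
IDLE_W = 3.0                     # idle budget for the full array


def full_update_energy_j(params=PARAMS, bits=BITS_PER_PARAM,
                         e_bit=WRITE_ENERGY_J_PER_BIT):
    """Energy to rewrite every weight bit once (ASSUMED: one pass per step)."""
    return params * bits * e_bit


energy_j = full_update_energy_j()
refresh_share = REFRESH_W / IDLE_W

print(f"full weight rewrite: {energy_j * 1e3:.1f} mJ")
print(f"refresh share of idle power: {refresh_share:.1%}")
```

On these assumptions a complete in-place weight update costs about 25.6 mJ, and refresh accounts for under 3% of the idle budget.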

Pure Digital Compute-In-Memory

Sophon does not use traditional von Neumann execution units. Instead, it relies on pure digital Compute-In-Memory (CIM). The die contains 131,072 individual CIM tiles. Each tile pairs a 256×256 DRAM subarray with a binary sense amplifier and an 8-level adder tree.

xychart-beta
  title "LLM Decode Efficiency (80B Model, Tokens/Sec per Watt)"
  x-axis ["NVIDIA Rubin / AMD MI455X", "Sophon PFG-1"]
  y-axis "Tokens/s per Watt" 0 --> 40
  bar [0.22, 38.7]

Rather than fetching weights to a central execution engine, activations are broadcast across the tiles via a 500 MHz bit-serial network. The binary sense amplifiers and adder trees execute the multiplication and accumulation directly inside the memory array.
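The bit-serial multiply-accumulate described above can be sketched in a few lines. This is a simplified model, not PhantaField's implementation: it uses unsigned integers instead of BF16 or FP8, and the function names are hypothetical. It does show why 256 inputs pair naturally with an 8-level adder tree (2⁸ = 256).

```python
def adder_tree(values):
    """Pairwise reduction; returns (sum, number of levels used)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of two")
    levels = 0
    while len(values) > 1:
        # Each level adds neighbouring pairs, halving the list.
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        levels += 1
    return values[0], levels


def bit_serial_dot(weights, activations, act_bits=8):
    """Dot product with activations streamed one bit-plane at a time, LSB first."""
    acc = 0
    for b in range(act_bits):
        # Bit-plane b: one bit per activation, broadcast to the row of weights.
        plane = [(a >> b) & 1 for a in activations]
        # A 1-bit multiply is just a gate: pass the weight or pass zero.
        partial = [w if bit else 0 for w, bit in zip(weights, plane)]
        total, _ = adder_tree(partial)
        # Weight the plane's sum by its bit position and accumulate.
        acc += total << b
    return acc


weights = list(range(256))
activations = [(7 * i + 3) % 256 for i in range(256)]
result = bit_serial_dot(weights, activations)
reference = sum(w * a for w, a in zip(weights, activations))
print(result == reference)
```

Each activation bit costs one pass through the tree, so precision trades directly against cycles on the 500 MHz broadcast network.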

This architecture yields 2,100 TFLOPS of BF16 compute and 4,200 TFLOPS of FP8 compute. While peak dense FLOPS favor next-generation GPUs, Sophon's real-world advantage emerges at low batch sizes. Because the weights are physically adjacent to the execution logic, Sophon delivers between 191x and 214x the weight bandwidth of an HBM4 package.

At low batch sizes, where execution is entirely memory-bandwidth bound, Sophon serves an 80B parameter model at 7,219 tokens/s in native BF16, or 14,438 tokens/s in FP8. When utilizing INT4 quantization and speculative decoding, the effective throughput rises to 72,188 tokens/s.
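The throughput figures can be cross-checked with a simple memory-bound model. The sketch below assumes that each decoded token reads every weight exactly once (batch size 1, no KV-cache traffic), which is a simplification introduced here rather than a statement from the vendor.

```python
PARAMS = 80e9             # 80B-parameter model
TOKENS_PER_S_BF16 = 7219  # quoted BF16 decode rate
TOKENS_PER_S_FP8 = 14438  # quoted FP8 decode rate


def implied_bandwidth_tb_s(tokens_per_s, bytes_per_param, params=PARAMS):
    """Weight bandwidth needed if every token reads all weights once (ASSUMED)."""
    return tokens_per_s * params * bytes_per_param / 1e12


bw_bf16 = implied_bandwidth_tb_s(TOKENS_PER_S_BF16, 2)
bw_fp8 = implied_bandwidth_tb_s(TOKENS_PER_S_FP8, 1)

# Baseline implied by the quoted 191x to 214x advantage.
baseline_low = bw_bf16 / 214
baseline_high = bw_bf16 / 191

print(f"BF16: {bw_bf16:.2f} TB/s, FP8: {bw_fp8:.2f} TB/s")
print(f"implied baseline: {baseline_low:.2f} to {baseline_high:.2f} TB/s")
```

Both precisions imply the same weight bandwidth of roughly 1,155 TB/s, which is consistent with FP8 doubling throughput simply by halving the bytes read per token.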

The Developer Angle: Compilation and the 330 GB Ceiling

For systems engineers and compiler writers, Sophon requires a complete shift in how models are prepared and executed. You cannot simply compile a standard PyTorch model and run it via a CUDA-like runtime.

First, the bit-serial activation broadcast means that activations must be sliced into bitplanes before they are sent to the CIM tiles. The compiler is responsible for scheduling these bit-serial broadcasts and managing the 8-level adder trees. This requires specialized compilation toolchains that map tensor operations directly to spatial coordinates on the 131,072-tile grid.
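As a rough illustration of spatial mapping, the sketch below splits a weight matrix into 256×256 blocks and assigns each element a tile coordinate. The row-major layout, the function names, and the treatment of each tile as holding one block of weight elements are all assumptions made for illustration; the article does not describe the actual toolchain or its placement strategy.

```python
from math import ceil

TILE = 256             # subarray edge, from the article
TOTAL_TILES = 131_072  # tiles per die, from the article


def tiles_needed(rows, cols, tile=TILE):
    """Number of tile-sized blocks needed to cover a rows x cols matrix."""
    return ceil(rows / tile) * ceil(cols / tile)


def locate(r, c, tile=TILE):
    """Map a matrix element to (tile_row, tile_col, local_row, local_col)."""
    return (r // tile, c // tile, r % tile, c % tile)


# Hypothetical example: one 8192 x 8192 projection matrix.
n_tiles = tiles_needed(8192, 8192)
print(f"{n_tiles} tiles, {n_tiles / TOTAL_TILES:.2%} of the die")
print(locate(300, 511))
```

Because placement is fixed before execution, a mapping like this would be resolved entirely at compile time rather than by a runtime allocator.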

Second, the 330 GB memory capacity is a hard physical ceiling. In a traditional GPU cluster, if a model exceeds the local VRAM of a single card, you can rely on unified memory architectures or page to system memory over PCIe, albeit with a massive performance penalty. With Sophon, exceeding the 330 GB limit breaks the CIM execution model entirely.

When partitioning an 80B model for training, developers must fit the weights, the first-order optimizer state, and the activation cache within this 330 GB envelope. In BF16, an 80B model's weights consume 160 GB. First-order optimizer states require another 160 GB, leaving exactly 10 GB of headroom for gradient-checkpointed micro-batches.

+---------------------------------------------------------+
|                 Sophon PFG-1 330 GB RAM                 |
+----------------------------+----------------------------+
|  Weights (BF16): 160 GB    |  Optimizer State: 160 GB   |
+----------------------------+----------------------------+
|  Activation Headroom: 10 GB                             |
+---------------------------------------------------------+
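The budget above is easy to check for other model sizes. The following sketch reproduces the article's arithmetic, assuming two bytes per weight and two bytes per parameter of first-order optimizer state as in the 80B example. The function name and return format are illustrative.

```python
CAPACITY_GB = 330  # on-die DRAM, from the article


def training_budget_gb(params_billions, bytes_per_weight=2,
                       bytes_per_opt_state=2, capacity_gb=CAPACITY_GB):
    """Split the on-die capacity into weights, optimizer state, and headroom."""
    weights = params_billions * bytes_per_weight
    optimizer = params_billions * bytes_per_opt_state
    headroom = capacity_gb - weights - optimizer
    return {
        "weights_gb": weights,
        "optimizer_gb": optimizer,
        "headroom_gb": headroom,
        "fits": headroom >= 0,
    }


budget_80b = training_budget_gb(80)
budget_90b = training_budget_gb(90)
print(budget_80b)
print(budget_90b)
```

An 80B model leaves 10 GB for activations, while a 90B model at the same precision overshoots the ceiling by 30 GB before any activations are stored.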

If your model or context window requires more than 10 GB of activation space during training, you must partition the model across multiple Sophon dies using pipeline or tensor parallelism. This requires static scheduling at compile time, as dynamic memory allocation is virtually non-existent in this architecture.

The Economics of the Silicon Stack

From a procurement and infrastructure perspective, Sophon targets the most expensive component of modern AI hardware: HBM. Financial analyses estimate the cost of a single NVIDIA Rubin NVL72 rack at around $7.8 million, with HBM alone accounting for approximately $2.0 million of that total.

By eliminating off-die HBM and the complex packaging (like CoWoS) required to link memory to logic, Sophon reduces the bill of materials (BOM) to $8,358 per die. This represents a 9.9x lower hardware BOM compared to a Rubin GPU, and an 11.6x reduction compared to AMD Instinct MI455X hardware.

For cloud providers and enterprise datacenters, the power savings are equally significant. Serving an 80B model in FP8 mode draws 373 W, yielding 38.7 tokens/s per watt. Under low-batch conditions, HBM4-bound GPUs are far less efficient, drawing substantial power to keep the HBM subsystem active while the compute units wait on weight fetches.
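The efficiency and cost ratios can be reproduced from the quoted figures. This sketch takes the 0.22 tokens/s per watt GPU value from the article's efficiency chart and back-solves the competitor BOMs from the stated multiples. Those back-solved dollar amounts are derived here and are not figures published by any vendor.

```python
TOKENS_PER_S_FP8 = 14438   # quoted FP8 decode rate
POWER_W = 373              # quoted FP8 serving power
GPU_TOKENS_PER_S_PER_W = 0.22  # from the article's efficiency chart
SOPHON_BOM_USD = 8358      # quoted per-die BOM

efficiency = TOKENS_PER_S_FP8 / POWER_W
efficiency_ratio = efficiency / GPU_TOKENS_PER_S_PER_W

# Back-solved from the quoted 9.9x and 11.6x multiples (derived, not published).
implied_rubin_bom = SOPHON_BOM_USD * 9.9
implied_mi455x_bom = SOPHON_BOM_USD * 11.6

print(f"{efficiency:.1f} tokens/s per watt")
print(f"{efficiency_ratio:.0f}x the charted GPU efficiency")
print(f"implied Rubin BOM: ${implied_rubin_bom:,.0f}")
print(f"implied MI455X BOM: ${implied_mi455x_bom:,.0f}")
```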

A Specialized Architecture for the LLM Era

Sophon is not a general-purpose processor. It will not replace GPUs for workloads that require massive, high-batch dense compute or highly dynamic memory allocation. The rigid 330 GB memory limit and the specialized bit-serial compilation model present real hurdles for rapid deployment.

However, for organizations running dedicated LLM inference pipelines or continuous fine-tuning on medium-sized models, the architecture offers an elegant escape from the memory wall. By merging memory and compute into a single, high-density 3D stack, it makes the case that the key to scaling AI performance is not wider memory buses, but shorter physical distances.

Sources & further reading

Sophon PFG-1: a monolithic-3D AI ASIC with 330 GB of on-die DRAM and no HBM — phantafield.com

#Hardware #Llm Inference #Asic #Semiconductors #Compute In Memory

Written by

Ji-ho Choi · Security & Cloud Editor

Ji-ho covers the increasingly tangled overlap between cloud architecture and security, drawing on a background as a penetration tester to keep his reporting grounded in real-world attack paths. He never lets a vendor claim go unquestioned and insists that every buzzword come with a proof of concept.
