GateGPT: Running Transformers in Pure Digital Logic on FPGAs

By synthesizing a GPT model directly into hardware, GateGPT achieves massive throughput at a fraction of the clock speed.

Mariana Souza
Senior Editor · Jun 20, 2026 · 5 min read

Edge AI is currently dominated by a software-on-hardware paradigm: we take large programmable processors (GPUs, TPUs, or NPUs), load a software runtime, and stream model weights through memory. But what if the neural network is the hardware?

In June 2026, independent hardware engineer Fabio Guzman introduced GateGPT, an open-source Register Transfer Level (RTL) implementation of Andrej Karpathy's microGPT. Synthesized onto a 2008-era Xilinx Virtex-5 FPGA (XC5VLX110T) running at a modest 80 MHz, GateGPT generates up to 69,200 tokens per second, with a sustained average of approximately 60,600 tokens per second.

This is not just an impressive retro-hardware hack. GateGPT offers a credible, GPU-free inference baseline for edge deployments. By implementing a transformer directly in digital logic, it avoids the per-instruction fetch and decode overhead of CPUs and the large power envelopes of GPUs, showing that application-specific logic can deliver very high token rates for a small model on minimal power and clock budgets.

The Architecture of GateGPT: Microcode ROM and Datapath Actuators

Instead of building a monolithic, rigid state machine to handle the transformer's operations, Guzman opted for a hybrid approach: a microcode-ROM sequencer architecture. This design is conceptually closer to a classic CPU than a traditional hardwired neural network accelerator.

At its core, a small program ROM holds macro-instructions. A micro-program counter fetches one macro-op at a time, triggers the corresponding datapath actuator, and stalls until that actuator raises a "done" signal. The instruction schedule is stored in the ROM image (ucode.hex), which is generated by a custom assembler (tools/ucode_asm.py).
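The fetch, issue, and stall loop described above can be sketched as a small cycle-counting simulation. This is an illustrative model only: the `Actuator` class, the `run_sequencer` function, the op names in the ROM, and every latency value are assumptions made for the example, not details taken from GateGPT's RTL or from ucode.hex.

```python
from dataclasses import dataclass


@dataclass
class Actuator:
    """Toy datapath block that needs a fixed number of cycles to finish."""
    name: str
    latency: int        # cycles until 'done' is raised (made-up values)
    remaining: int = 0

    def start(self) -> None:
        self.remaining = self.latency

    def tick(self) -> bool:
        """Advance one clock cycle; return True once 'done' is asserted."""
        self.remaining -= 1
        return self.remaining <= 0


def run_sequencer(rom, actuators):
    """Walk the ROM, stalling on each macro-op until its actuator is done."""
    upc = 0             # micro-program counter
    cycles = 0
    trace = []
    while upc < len(rom):
        op = rom[upc]               # fetch the macro-op
        unit = actuators[op]        # decode: the op name selects an actuator
        unit.start()                # issue
        while True:                 # stall until the actuator reports done
            cycles += 1
            if unit.tick():
                break
        trace.append(op)
        upc += 1                    # only now move to the next macro-op
    return cycles, trace


# Hypothetical schedule and latencies, chosen only to exercise the loop.
actuators = {
    "embed": Actuator("embed", 2),
    "norm": Actuator("norm", 8),
    "matvec": Actuator("matvec", 16),
    "attn": Actuator("attn", 24),
    "sampler": Actuator("sampler", 4),
}
rom = ["embed", "norm", "matvec", "attn", "matvec", "sampler"]
total_cycles, executed = run_sequencer(rom, actuators)
```

The point of the structure is that the sequencer stays trivial: all timing variation lives inside the actuators, and the control logic only needs a counter and a wait condition.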

The heavy lifting is distributed across specialized, modular hardware blocks (actuators) that share a true dual-port Block RAM (BRAM) scratchpad called vmem. This scratchpad stores both active activations and the persistent Key-Value (KV) cache.

The actuators include:

  • matvec: A parallel multiply-accumulate tile designed for linear projections, capable of processing 24 lanes by 2 columns per cycle.
  • norm: An RMSNorm unit utilizing hardware-based unsigned division (udiv) and inverse square root (isqrt) primitives, processing 2 elements per cycle.
  • attn: A single-position multi-head causal attention block equipped with per-head parallel dividers.
  • exp_unit: A fixed-point exponential calculator using a 17-entry lookup table combined with linear interpolation.
  • sampler: A module that handles temperature-scaled softmax and Linear Congruential Generator (LCG) categorical sampling, or falls back to greedy argmax.
  • embed and vecop: Handle embedding lookups, residual additions, and ReLU activations.
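To make the exp_unit idea concrete, here is a minimal sketch of a 17-entry lookup table with linear interpolation in Q5.11. The input range of [-8, 0] (natural for softmax after subtracting the maximum score), the 0.5-wide segments, and the function name `exp_q511` are assumptions for illustration; the article does not state the range or table contents that GateGPT uses.

```python
import math

FRAC_BITS = 11
ONE = 1 << FRAC_BITS          # 1.0 in Q5.11
X_MIN = -8 * ONE              # assumed lower bound of the input range
SEG_SHIFT = 10                # each segment spans 0.5, i.e. 2**10 LSBs

# 17 breakpoints give 16 segments across [-8, 0].
EXP_LUT = [round(math.exp(-8 + 0.5 * i) * ONE) for i in range(17)]


def exp_q511(x: int) -> int:
    """Approximate exp(x) for a Q5.11 input x <= 0; the result is Q5.11."""
    x = max(X_MIN, min(0, x))                 # clamp into the table range
    offset = x - X_MIN                        # 0 .. 16 * 2**10
    idx = offset >> SEG_SHIFT                 # which segment
    frac = offset & ((1 << SEG_SHIFT) - 1)    # position inside the segment
    if idx == 16:                             # x == 0 hits the last breakpoint
        return EXP_LUT[16]
    lo, hi = EXP_LUT[idx], EXP_LUT[idx + 1]
    # Linear interpolation using one multiply and one shift.
    return lo + (((hi - lo) * frac) >> SEG_SHIFT)
```

Because the segment width is a power of two, the index and the interpolation fraction fall out of plain bit slicing, which is the kind of property that keeps such a unit cheap in logic. With these assumed segments the absolute error stays within a few hundredths near zero and shrinks rapidly for more negative inputs.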

The Hardware KV Cache and Q5.11 Fixed-Point Math

To fit a transformer into the limited logic of a 2008-era FPGA—occupying just 8% of the Virtex-5's resources—GateGPT employs aggressive optimization and strict numerical constraints.

The model uses signed Q5.11 fixed-point arithmetic. This 16-bit format allocates 5 bits to the integer part (including the sign) and 11 bits to the fractional part, so it represents values in the range [-16, 16) with a resolution of 2^-11 (about 0.0005). Fixed-point math removes the need for complex, area-heavy floating-point units (FPUs), allowing the arithmetic logic to be synthesized into simple, fast adder and multiplier trees.
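A short sketch of what Q5.11 arithmetic looks like in practice, using plain Python integers to stand in for 16-bit registers. The helper names are hypothetical, and saturating on overflow is an assumption: the article does not say whether GateGPT saturates or wraps.

```python
FRAC_BITS = 11
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1   # 16-bit two's-complement limits


def saturate(v: int) -> int:
    """Clamp to the representable range (assumed overflow policy)."""
    return max(Q_MIN, min(Q_MAX, v))


def to_q511(x: float) -> int:
    """Quantize a real number to Q5.11."""
    return saturate(round(x * (1 << FRAC_BITS)))


def from_q511(q: int) -> float:
    """Convert a Q5.11 integer back to a real number."""
    return q / (1 << FRAC_BITS)


def q_mul(a: int, b: int) -> int:
    """Multiply two Q5.11 values: the raw product is Q10.22, so shift back."""
    return saturate((a * b) >> FRAC_BITS)


def q_dot(xs, ws) -> int:
    """Dot product with a wide accumulator and a single final shift."""
    acc = sum(x * w for x, w in zip(xs, ws))
    return saturate(acc >> FRAC_BITS)
```

For example, `from_q511(q_mul(to_q511(1.5), to_q511(-2.0)))` evaluates to `-3.0`, and `to_q511(20.0)` clamps to `32767` because 20 lies outside the format's range. Keeping the accumulator wide and shifting once at the end, as in `q_dot`, is a common way to limit rounding error in multiply-accumulate chains.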

The architectural crown jewel of GateGPT's performance is its hardware-native KV cache. In software-based inference, managing the KV cache involves complex memory pointer manipulation and dynamic allocation. In GateGPT, the KV cache is baked directly into the vmem BRAM. Instead of recomputing the entire context window (up to 16 tokens) for every newly generated token, the attn actuator calculates only the K and V projections for the current token and appends them to the pre-allocated cache lines in vmem.
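The append-then-attend pattern can be illustrated with a small software model. It uses floating point and a single head for readability, whereas the hardware works in Q5.11 across several heads; the class and function names and the head dimension are invented for the example, and only the 16-token context length comes from the article.

```python
import math

BLOCK_SIZE = 16    # context length mentioned in the article
HEAD_DIM = 4       # made-up head dimension for illustration


class KVCache:
    """Pre-allocated key/value storage, filled one position at a time."""

    def __init__(self):
        self.k = [[0.0] * HEAD_DIM for _ in range(BLOCK_SIZE)]
        self.v = [[0.0] * HEAD_DIM for _ in range(BLOCK_SIZE)]
        self.length = 0

    def append(self, k_vec, v_vec) -> None:
        if self.length >= BLOCK_SIZE:
            raise IndexError("context window full")
        self.k[self.length] = list(k_vec)
        self.v[self.length] = list(v_vec)
        self.length += 1


def attend(q, cache: KVCache):
    """Attention for the newest position over all cached positions."""
    scale = 1.0 / math.sqrt(HEAD_DIM)
    scores = [
        scale * sum(qi * ki for qi, ki in zip(q, cache.k[t]))
        for t in range(cache.length)
    ]
    m = max(scores)                            # subtract the max for stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [
        sum(w * cache.v[t][d] for t, w in enumerate(weights)) / total
        for d in range(HEAD_DIM)
    ]
```

Each new token costs one `append` plus one pass over the filled part of the cache, instead of recomputing keys and values for the whole window. Causality comes for free, because the cache only ever contains past and current positions.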

Through nine distinct optimization stages, Guzman increased the design's throughput by 28x—climbing from an initial 2,433 tokens/sec to the peak 69,200 tokens/sec. This massive speedup highlights the raw efficiency of hardware-level pipelining and memory-bandwidth matching.
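The reported token rates translate directly into a clock-cycle budget per token, which is a useful way to read the optimization history. The calculation below assumes every figure was measured at the same 80 MHz clock, which the article states only for the final design.

```python
CLOCK_HZ = 80_000_000   # 80 MHz, as reported


def cycles_per_token(tokens_per_second: float) -> float:
    """Average number of clock cycles spent on each generated token."""
    return CLOCK_HZ / tokens_per_second


peak = cycles_per_token(69_200)        # about 1,156 cycles
sustained = cycles_per_token(60_600)   # about 1,320 cycles
baseline = cycles_per_token(2_433)     # about 32,881 cycles
speedup = 69_200 / 2_433               # about 28.4x
```

In other words, under that assumption the nine optimization stages cut the work per token from roughly 33,000 cycles to just over 1,100.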

The Developer Angle: Compiling to Silicon vs. Edge NPUs

For developers building edge AI applications—such as robotics, IoT sensors, or embedded medical devices—GateGPT represents a fork in the road.

Currently, edge AI relies on microcontrollers or low-power NPUs running lightweight runtimes like TensorFlow Lite or MicroTVM. While these platforms offer flexibility, they introduce layers of abstraction: compiler toolchains, runtime interpreters, and OS scheduling.

GateGPT offers an alternative: implementing the model directly at the Register Transfer Level (RTL) in a hardware description language such as Verilog or VHDL, using a Python reference model (like Karpathy's microGPT) as the specification, and then synthesizing that RTL for the target FPGA.

| Dimension | Edge NPU / Microcontroller | RTL-Synthesized Transformer (GateGPT) |
|---|---|---|
| Latency | Milliseconds (variable due to OS/runtime overhead) | Microseconds (deterministic, clock-cycle accurate) |
| Power Consumption | Watts (typically 1W to 15W) | Milliwatts (fraction of a watt at low clock speeds) |
| Flexibility | High (swap models by loading a new binary) | Low (requires re-synthesis and FPGA flashing) |
| Hardware Cost | Medium to High (specialized silicon) | Low (can run on cheap, legacy, or radiation-hardened FPGAs) |

In practice, adopting a GateGPT-style workflow requires a shift in developer tooling. Instead of writing PyTorch code and exporting to ONNX, the workflow looks like this:

  1. Train and Quantize: Train a micro-model in PyTorch, quantizing weights to Q5.11 fixed-point.
  2. Generate Microcode: Use a tool like ucode_asm.py to compile the model's execution graph into a sequence of macro-instructions for the ROM.
  3. Synthesize and Route: Run the RTL through FPGA synthesis tools to map the actuators and memory blocks to the target silicon.
  4. Deploy: Flash the bitstream to the FPGA.
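As a rough illustration of the first step in the workflow above, trained weights can be quantized to Q5.11 and written out as 16-bit hexadecimal words, one per line, which is the plain-text layout that Verilog's `$readmemh` reads. The function names are hypothetical, and the memory-image format GateGPT actually uses is not described in the article.

```python
FRAC_BITS = 11
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1


def quantize_q511(w: float) -> int:
    """Round a float weight to Q5.11, clamping out-of-range values."""
    return max(Q_MIN, min(Q_MAX, round(w * (1 << FRAC_BITS))))


def to_hex_word(q: int) -> str:
    """Encode a signed 16-bit value as four two's-complement hex digits."""
    return f"{q & 0xFFFF:04x}"


def weights_to_hex_lines(weights):
    """One hex word per weight, in order, ready to be joined with newlines."""
    return [to_hex_word(quantize_q511(w)) for w in weights]


lines = weights_to_hex_lines([1.0, -1.0, 0.5, -1 / 2048])
# lines == ["0800", "f800", "0400", "ffff"]
```

Comparing the quantized model's outputs against the floating-point reference before synthesis is the cheapest place to catch precision problems, since every later step in the flow takes far longer to iterate.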

The obvious caveat is scale. GateGPT runs a tiny model (4,192 parameters, 27-character vocabulary). Scaling this architecture to a 1-billion-parameter model is currently bottlenecked by FPGA on-chip memory (BRAM) capacity, since weights that do not fit on chip would have to be streamed from external memory. However, for highly specialized, ultra-low-latency tasks, such as wake-word detection, real-time signal filtering, or local character-level parsing, this approach is hard to beat.
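A back-of-the-envelope calculation shows how wide the gap is. The 16-bit weight width follows from the Q5.11 format; the block RAM total used here for the XC5VLX110T (5,328 Kb) is an assumed figure that should be checked against the device datasheet, and the calculation ignores the BRAM needed for activations and the KV cache.

```python
BITS_PER_WEIGHT = 16            # Q5.11
BRAM_KBITS_ASSUMED = 5_328      # assumed XC5VLX110T block RAM total, in Kb


def weight_bits(num_params: int) -> int:
    """Storage needed for the weights alone, in bits."""
    return num_params * BITS_PER_WEIGHT


tiny_bytes = weight_bits(4_192) // 8              # 8,384 bytes
large_bytes = weight_bits(1_000_000_000) // 8     # 2,000,000,000 bytes
bram_bits = BRAM_KBITS_ASSUMED * 1024
max_params_on_chip = bram_bits // BITS_PER_WEIGHT # 340,992 under this assumption
```

Under these assumptions the current model needs about 8 KB for its weights, the chip could hold a few hundred thousand 16-bit parameters at most, and a 1-billion-parameter model would need about 2 GB.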

A New Baseline for Edge AI

GateGPT is a compelling proof of concept that challenges the assumption that AI inference requires massive, power-hungry processors. By showing that a full transformer with a KV cache can run efficiently at just 80 MHz on legacy hardware, it opens the door to a new class of deterministic, ultra-low-power edge AI devices. For developers willing to venture into RTL and hardware synthesis, the reward is a level of latency determinism and efficiency that general-purpose software runtimes struggle to match for models this small.

Sources & further reading

  1. GateGPT: 56k tokens per second Transformer (KV cache) on FPGA at 80 MHz — twitter.com
  2. GateGPT on FPGA: Achieving 56K Tokens/sec with Full Digital Logic Transformer - Sesame Disk — sesamedisk.com
