OpenAI Jalapeno and the Shift to Custom Inference Silicon

Custom ASICs are replacing general-purpose GPUs for running large language models as providers try to survive the crushing cost of scale.

Priya Nair
AI & Developer Experience Writer · Jun 27, 2026 · 5 min read

The industry has spent years treating AI compute as a training problem, throwing massive GPU clusters at optimization. But as models move from research labs to production, the economic battleground has shifted to inference. Running a model is fundamentally different from training it, and using a general-purpose GPU for both is becoming an expensive compromise.

OpenAI and Broadcom recently unveiled Jalapeno, OpenAI's first custom application-specific integrated circuit (ASIC). This is not a general-purpose processor or a training accelerator. It is a chip built for one job: large language model inference.

This development marks a clear architectural shift. For developers and system architects, understanding why this shift is happening is key to predicting where API pricing, hosting options, and model deployment strategies are heading over the next few years.

The Memory-Bandwidth Bottleneck

To understand why OpenAI built Jalapeno, you have to look at where the time and power go during model serving.

During training, compute is highly dense. You process large batches of data at once, keeping the GPU's tensor cores saturated with matrix math. The system is compute-bound, meaning the speed of the arithmetic units limits performance.

During single-token decode (generating text one token at a time), the balance flips. To produce each token at small batch sizes, the system must stream all of the model's weights out of memory and through the compute units once. Each weight takes part in only one multiply and one add per sequence in the batch, so the arithmetic performed per byte read is very low. This makes the workload memory-bandwidth-bound: the math units sit idle, waiting for data to arrive from memory.
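To put rough numbers on that bottleneck, here is a minimal back-of-the-envelope sketch. It assumes a dense model whose weights are read once per decode step, and it ignores KV-cache and activation traffic. The function names and the example figures (70 billion parameters, 2-byte weights, 3 TB/s of memory bandwidth) are illustrative assumptions, not Jalapeno specifications.

```python
def decode_arithmetic_intensity(batch_size: int, bytes_per_weight: float) -> float:
    """FLOPs performed per byte of weights read during one decode step.

    Each weight contributes one multiply and one add (2 FLOPs) for every
    sequence in the batch, and is read from memory once per step.
    """
    return 2.0 * batch_size / bytes_per_weight


def max_decode_steps_per_second(
    param_count: float,
    bytes_per_weight: float,
    memory_bandwidth_bytes_per_s: float,
) -> float:
    """Upper bound on decode steps per second when weight streaming dominates."""
    model_bytes = param_count * bytes_per_weight
    return memory_bandwidth_bytes_per_s / model_bytes


# Illustrative numbers only (assumptions, not measurements of any real chip).
intensity = decode_arithmetic_intensity(batch_size=1, bytes_per_weight=2.0)
steps = max_decode_steps_per_second(
    param_count=70e9,
    bytes_per_weight=2.0,
    memory_bandwidth_bytes_per_s=3e12,
)
print(f"{intensity:.1f} FLOP/byte, at most {steps:.1f} decode steps/s")
```

Under these assumptions the ceiling is set entirely by bandwidth divided by model size. Adding more arithmetic units would not raise it, which is the argument for spending the silicon budget on memory proximity instead.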

On a GPU, every weight has to travel from memory to the math units, and that trip costs both time and electrical power. Jalapeno addresses this by placing eight high-bandwidth memory (HBM) stacks directly on-package, surrounding a single, reticle-sized compute chiplet. By moving the memory as close to the math units as physically possible, the design cuts down on the energy wasted shuffling data back and forth.

Inside the Silicon: Systolic Arrays vs. GPU Cores

A general-purpose GPU is like a commercial kitchen designed to cook anything on the menu. It has thousands of independent, highly programmable cores with complex instruction scheduling, caches, and control logic. This flexibility is necessary for rendering graphics, running physics simulations, or training new model architectures. But for running a finished model, much of that silicon is wasted overhead.

Jalapeno is a kitchen rebuilt to cook one dish. It uses a systolic array architecture, similar in concept to Google's TPU family.

  Data Input (Weights) ---> [ Cell ] ---> [ Cell ] ---> [ Cell ]
                               |             |             |
                               v             v             v
  Data Input (Activations)-> [ Cell ] ---> [ Cell ] ---> [ Cell ]
                               |             |             |
                               v             v             v
                            [ Output ]    [ Output ]    [ Output ]

In a systolic array, processing elements are arranged in a 2D grid. Data flows through the grid in rhythmic lockstep, passing directly from cell to cell without a round trip to a shared register file or cache between steps. This design matches the dense matrix multiplications that dominate transformer inference. By hard-wiring this data flow, Jalapeno achieves high utilization of its math units while drawing far less power than a GPU running the same workload.
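The lockstep data flow is easier to see in a small simulation. The sketch below models one common variant, an output-stationary array in which each cell keeps its own running sum while operands hop to the neighbour on the right and the neighbour below. This is an assumption chosen for illustration. The source does not say which dataflow Jalapeno uses, and `systolic_matmul` is a hypothetical name.

```python
def systolic_matmul(a: list[list[int]], b: list[list[int]]) -> list[list[int]]:
    """Multiply a (n x k) by b (k x m) on a simulated n x m systolic grid."""
    n, k_dim, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]    # per-cell accumulators (stay in place)
    a_reg = [[0] * m for _ in range(n)]  # operand each cell passes to the right
    b_reg = [[0] * m for _ in range(n)]  # operand each cell passes downward

    # The last product reaches the bottom-right cell at cycle k + n + m - 3.
    for t in range(k_dim + n + m - 2):
        # Visit cells from the far corner backwards so each cell reads the
        # value its neighbour held on the previous cycle.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:
                    # Left edge: row i of a enters, delayed by i cycles.
                    k = t - i
                    a_in = a[i][k] if 0 <= k < k_dim else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    # Top edge: column j of b enters, delayed by j cycles.
                    k = t - j
                    b_in = b[k][j] if 0 <= k < k_dim else 0
                else:
                    b_in = b_reg[i - 1][j]
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
                acc[i][j] += a_in * b_in  # one multiply-accumulate per cycle
    return acc


result = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)
```

Every cell performs one multiply-accumulate per cycle and only ever talks to its immediate neighbours. Operands are fetched from outside the grid only at the left and top edges, which is where the energy saving comes from.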

The Nine-Month Sprint

Designing a custom high-performance ASIC on a leading-edge node usually takes 18 to 24 months. OpenAI and Broadcom completed the design and taped out Jalapeno in roughly nine months.

Two factors accelerated this timeline:

  1. Hardware-Software Co-design: Because OpenAI owns the software stack, its engineers could provide Broadcom with precise kernel profiles, attention patterns, and serving requirements. The silicon was designed around the software, rather than software engineers having to write complex compilers to target generic hardware.
  2. AI-Assisted Layout: OpenAI used its own models to accelerate the physical design, optimization, and verification phases of the chip development process.

Manufactured on TSMC's 3nm process, engineering samples of Jalapeno are already running production workloads in OpenAI's labs, including GPT-5.3-Codex-Spark. Early testing reports performance-per-watt metrics substantially better than current state-of-the-art GPUs, with target cost savings of roughly 50 percent per inference token.

The Developer Angle: Preparing for the Commodity Token Era

You cannot buy a Jalapeno chip to put in your local server rack. Microsoft is expected to take 40 percent of the initial production run to deploy in Azure data centers, with prototype deployments starting in late 2026 and scaling through 2027 and 2028.

However, the existence of custom inference silicon changes how you should architect your applications today.

1. Prepare for the 50% Price Drop

If inference costs drop by half, agentic workflows that were previously cost-prohibitive become viable. Multi-agent systems that require dozens of background calls, self-reflection loops, and extensive chain-of-thought processing will no longer break the budget. When designing your application's architecture, do not optimize prematurely for minimal token usage at the expense of accuracy. Assume that token volume will become cheap, while latency and reliability remain your primary constraints.
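A quick cost model makes the budgeting argument concrete. This is a sketch with made-up numbers: the per-million-token prices, call counts, and token sizes below are placeholders for illustration, not any provider's actual rates.

```python
def workflow_cost_usd(
    calls: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one agentic task that makes `calls` model calls."""
    input_cost = calls * input_tokens_per_call * input_price_per_million / 1_000_000
    output_cost = calls * output_tokens_per_call * output_price_per_million / 1_000_000
    return input_cost + output_cost


# Hypothetical task: 30 background calls, 4,000 input and 500 output tokens each.
today = workflow_cost_usd(30, 4_000, 500, input_price_per_million=2.00,
                          output_price_per_million=8.00)
# Same task if both prices fall by half.
halved = workflow_cost_usd(30, 4_000, 500, input_price_per_million=1.00,
                           output_price_per_million=4.00)
print(f"today: ${today:.2f} per task, after a 50% cut: ${halved:.2f} per task")
```

Plugging in your own call counts and prices shows how many extra reflection or verification passes fit into the same per-task budget once the price halves.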

2. Build Dual-Stack Fallbacks

As custom silicon fragments the hosting market, model availability and pricing will fluctuate based on where the hardware is deployed. To avoid vendor lock-in, build your applications with a dual-stack fallback strategy. Use abstract LLM clients that allow you to easily switch between cloud APIs and local, quantized models running on commodity hardware using tools like Ollama.

Provider   | Custom Chip            | Primary Use Case
Google     | TPU                    | Training & Inference
Amazon     | Trainium / Inferentia  | Training & Inference
Microsoft  | Maia 100               | Inference
Meta       | MTIA                   | Inference
OpenAI     | Jalapeno               | Inference
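The dual-stack fallback described in this section could be sketched roughly as follows. Everything here is hypothetical scaffolding: `Backend`, `FallbackClient`, and the two stub functions are illustrative names, and the stubs stand in for real network calls. In practice the first backend would wrap a cloud API and the second a local quantized model served by a tool like Ollama.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Backend:
    name: str
    complete: Callable[[str], str]  # takes a prompt, returns generated text


class AllBackendsFailed(RuntimeError):
    """Raised when every configured backend raised an error."""


class FallbackClient:
    """Tries backends in priority order and returns the first success."""

    def __init__(self, backends: Sequence[Backend]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)

    def complete(self, prompt: str) -> tuple[str, str]:
        """Return (backend_name, text) from the first backend that succeeds."""
        errors = []
        for backend in self._backends:
            try:
                return backend.name, backend.complete(prompt)
            except Exception as exc:  # broad on purpose: any failure triggers fallback
                errors.append(f"{backend.name}: {exc}")
        raise AllBackendsFailed("; ".join(errors))


# Stubs for illustration only; replace with real API and local-model calls.
def flaky_cloud(prompt: str) -> str:
    raise ConnectionError("simulated outage")


def local_stub(prompt: str) -> str:
    return f"[local] {prompt}"


client = FallbackClient([Backend("cloud", flaky_cloud), Backend("local", local_stub)])
used, text = client.complete("Summarize this ticket.")
print(used, text)
```

Because application code only sees `FallbackClient.complete`, swapping providers or reordering priorities becomes a configuration change rather than a rewrite.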

3. Optimize Your Kernels, Not Just Your Code

If you run self-hosted models on cloud instances, start looking at how your serving frameworks handle memory bandwidth. Frameworks like vLLM and TensorRT-LLM use paged KV-cache techniques such as PagedAttention to reduce wasted memory. As hardware becomes more specialized, the way you structure your model's context window and batching strategy will have a larger impact on your hosting bill than raw compute optimization.
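To see why context length and batch size matter so much, it helps to estimate the memory the key-value cache needs. The sketch below uses the standard sizing formula for a transformer that stores one key and one value vector per layer, per KV head, per token. The model shape in the example is a hypothetical configuration, not a specific product.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    batch_size: int,
    bytes_per_value: int = 2,
) -> int:
    """Memory needed to hold keys and values for every token in the batch."""
    values_per_token = 2 * num_layers * num_kv_heads * head_dim  # 2 = key + value
    return values_per_token * context_tokens * batch_size * bytes_per_value


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 16-bit cache.
cache = kv_cache_bytes(
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
    context_tokens=8_192,
    batch_size=16,
)
print(f"KV cache: {cache / 2**30:.1f} GiB")
```

The cache grows linearly with both context length and batch size, so doubling either doubles the memory that has to be held and read on every decode step. That is the lever paged KV-cache schemes and batching policies are tuning.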

The Bottom Line

OpenAI's move into custom silicon is a defensive play to protect its margins against crushing token delivery costs. But for the broader developer ecosystem, it signals the end of the general-purpose GPU's monopoly on AI execution.

We are entering an era of highly specialized, highly efficient inference engines. The developers who win this transition will be those who stop treating LLMs as expensive black boxes and start designing systems that assume abundant, cheap, and fast intelligence at the edge of the network.

Sources & further reading

  1. OpenAI and Broadcom's Jalapeño, a Custom Inference ASIC: Inference ASIC vs GPU — dev.to
  2. OpenAI Ships Jalapeño - Its First Custom AI Chip | Awesome Agents — awesomeagents.ai
  3. OpenAI and Broadcom reveal Jalapeno, first AI chip in partnership — cnbc.com
  4. OpenAI's First Custom AI Chip Targets 50% Cheaper Inference | MACGPU — macgpu.com
Written by Priya Nair · AI & Developer Experience Writer

Priya covers AI frameworks, developer productivity tooling, and the startup ecosystem across South and Southeast Asia, bringing a researcher's rigour and a practitioner's empathy to every story. She is deeply sceptical of benchmarks and asks hard questions so her readers don't have to.
