
Xiaomi's MiMo-V2.5-Pro-UltraSpeed Pushes a 1T Model Past 1000 Tokens/Sec on Commodity GPUs

Through FP4 quantization, block-level speculative decoding, and the TileRT system stack, Xiaomi claims trillion-parameter decode speeds normally reserved for custom silicon — on a single 8-GPU node.

DevClubHouse Curation

Jun 8, 2026 · 4 min read

Xiaomi's MiMo team, working with the TileRT system team, says it has broken 1000 tokens/second decode speed on a 1-trillion-parameter model for the first time, with real-time generation peaking at roughly 1200 tps. The release, MiMo-V2.5-Pro-UltraSpeed, is notable less for the headline number than for how the team claims to hit it: not on exotic hardware, but on a single standard 8-GPU commodity node.

For anyone benchmarking production LLM deployments, the interesting part is the engineering tradeoffs — and the asterisks attached to actually getting your hands on it.

The codesign play: FP4 + DFlash + TileRT

The stated approach is a deliberate rejection of the specialized-hardware route. Xiaomi explicitly contrasts itself with Cerebras's wafer-scale integration and Groq's on-chip SRAM architecture, arguing it reached comparable extreme speeds through model-system codesign on off-the-shelf GPUs instead.

Three pieces do the work:

FP4 (MXFP4) quantization. At 1T parameters, the team argues even 8-bit (FP8/INT8) inference imposes a prohibitive memory footprint and bandwidth pressure. Dropping to 4-bit directly cuts memory-access overhead, which is the real bottleneck for decode throughput on commodity hardware. They describe MXFP4 as a "widely validated, virtually lossless" format, though they also concede that naively applying FP4 across the entire model degrades complex reasoning, logic, and code generation. Because MiMo-V2.5-Pro is a Mixture-of-Experts model, the experts hold the vast bulk of the parameters, and that is where the aggressive quantization is targeted.
DFlash speculative decoding. A method based on block-level masked parallel prediction, designed to increase the number of accepted tokens per verification step. More accepted tokens per verify pass means fewer expensive full-model passes per generated token — the core lever in any speculative-decoding scheme.
TileRT. The system layer provides a tailored compilation engine and compute kernels optimized specifically for this novel quantization-plus-speculative-decoding pipeline. The claim is that the kernels are matched to the dynamic behavior of the algorithms rather than bolted on after the fact.
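To make the memory argument concrete, here is a back-of-the-envelope sketch of weight storage at 8-bit versus MXFP4. It assumes every weight is stored in a single format and that MXFP4 follows the usual microscaling layout of 4-bit elements with one shared 8-bit scale per 32-element block; neither detail comes from Xiaomi's post, and KV cache and activations are ignored.

```python
# Rough weight-memory footprint for a 1T-parameter model on an 8-GPU node.
# Assumptions (not from Xiaomi's post): every weight is stored in one format,
# and MXFP4 costs 4 bits per element plus one 8-bit scale per 32-element block.

N_PARAMS = 1e12   # "1-trillion-parameter model", per the article
N_GPUS = 8        # "single 8-GPU node", per the article

BITS_PER_WEIGHT = {
    "FP8/INT8": 8.0,
    "MXFP4": 4.0 + 8.0 / 32.0,  # 4.25 effective bits per weight
}


def weight_gigabytes(n_params: float, bits_per_weight: float) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8.0 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    total_gb = weight_gigabytes(N_PARAMS, bits)
    per_gpu_gb = total_gb / N_GPUS
    print(f"{name:9s} {total_gb:7.1f} GB total, {per_gpu_gb:6.1f} GB per GPU")
```

Under these assumptions the 8-bit weights alone come to about 1000 GB (125 GB per GPU), while MXFP4 lands near 531 GB (about 66 GB per GPU). Since the post says the aggressive quantization targets the experts rather than the whole model, the real footprint would sit somewhere between the two.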

The combination is what gets them to 1000+ tps on one 8-GPU node — a claim worth independent verification, but a coherent one architecturally. Bandwidth-bound decode + 4-bit weights + a speculative decoder that lands long accepted blocks is exactly the recipe you'd expect for throughput at this scale.
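The speculative-decoding lever reduces to a simple identity: decode throughput is the number of tokens accepted per pass divided by the time each pass takes. The sketch below inverts that to show how much time one draft-plus-verify pass may take while still holding 1000 tps. The acceptance counts are hypothetical placeholders, since the post publishes no DFlash acceptance figures.

```python
# Throughput identity for speculative decoding:
#   tokens/sec = accepted tokens per pass / seconds per pass
# "Pass" here means one full draft-plus-verify cycle.
# The acceptance counts below are hypothetical, not DFlash measurements.


def decode_tps(accepted_per_pass: float, pass_seconds: float) -> float:
    """Tokens per second for a given acceptance count and pass duration."""
    return accepted_per_pass / pass_seconds


def pass_budget_ms(target_tps: float, accepted_per_pass: float) -> float:
    """Longest pass duration (ms) that still sustains target_tps."""
    return 1000.0 * accepted_per_pass / target_tps


TARGET_TPS = 1000.0  # headline figure from the article

for accepted in (1, 2, 4, 8):
    budget = pass_budget_ms(TARGET_TPS, accepted)
    print(f"{accepted} accepted tokens/pass -> {budget:.1f} ms per pass "
          f"to hold {TARGET_TPS:.0f} tps")
```

With no speculation (one token per pass) the whole 1T forward pass would have to finish in 1 ms; each additional accepted token relaxes that budget by another millisecond, which is why long accepted blocks matter so much at this scale.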

Why throughput at 1T changes the calculus

Xiaomi frames the speed as a paradigm shift rather than a faster typewriter, and two of its arguments are concrete for developers.

First, inference-time compute as a quality lever. When generation is fast enough, you can run dozens of reasoning paths in parallel — Best-of-N or tree search — and verify and self-correct in the background within the same wall-clock budget. Raw speed effectively buys you depth of reasoning. If your eval harness already leans on sampling many candidates, cheaper-per-second tokens directly improve achievable quality at fixed latency.
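A minimal sketch of that tradeoff: given a wall-clock budget, how many full candidates fit? It treats throughput as a flat token budget, which glosses over how parallel sampling actually batches on the server. The 100 tps baseline is inferred from the "roughly 10×" claim, and the budget and candidate length are hypothetical.

```python
# How many Best-of-N candidates fit in a fixed wall-clock budget?
# Simplification: throughput is treated as a flat token budget.
# BUDGET_SECONDS and TOKENS_PER_CANDIDATE are hypothetical; the 100 tps
# baseline is inferred from the article's "roughly 10x" speed claim.


def candidates_within_budget(tps: float, budget_seconds: float,
                             tokens_per_candidate: int) -> int:
    """Number of complete candidates that can be generated in the budget."""
    return int(tps * budget_seconds // tokens_per_candidate)


BUDGET_SECONDS = 10.0
TOKENS_PER_CANDIDATE = 500

for label, tps in (("baseline, assumed 100 tps", 100.0),
                   ("UltraSpeed, claimed 1000 tps", 1000.0)):
    n = candidates_within_budget(tps, BUDGET_SECONDS, TOKENS_PER_CANDIDATE)
    print(f"{label}: {n} candidates in {BUDGET_SECONDS:.0f} s")
```

In this toy setup the same 10-second budget goes from 2 candidates to 20, which is the sense in which speed converts into search width at fixed latency.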

Second, coding agents. Latency is the tax every agentic loop pays on each tool call and each generation step. At 1000 tps, the per-step wait that throttles autonomous coding workflows largely collapses. The blog also pitches real-time decision loops — fraud interception, bidding, interactive dialogue — though those are framed as possibilities, not benchmarked deployments.
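The agent-loop argument can be sketched as a simple latency model: each step pays generation time plus tool time. The step count, tokens per step, and tool latency below are hypothetical, and the 100 tps baseline is again inferred from the "roughly 10×" claim rather than stated in the post.

```python
# Wall-clock time of an agentic loop: each step generates tokens, then
# waits on a tool call. All workload numbers here are hypothetical.


def loop_seconds(steps: int, tokens_per_step: int, tps: float,
                 tool_seconds_per_step: float) -> float:
    """Total seconds for a loop of identical generate-then-tool steps."""
    return steps * (tokens_per_step / tps + tool_seconds_per_step)


STEPS = 30
TOKENS_PER_STEP = 400
TOOL_SECONDS = 0.5

for label, tps in (("assumed 100 tps", 100.0), ("claimed 1000 tps", 1000.0)):
    total = loop_seconds(STEPS, TOKENS_PER_STEP, tps, TOOL_SECONDS)
    generation = STEPS * TOKENS_PER_STEP / tps
    print(f"{label}: {total:.0f} s total, {generation:.0f} s of it generating")
```

In this example the loop drops from about 135 seconds to about 27, and tool latency rather than generation becomes the larger share of each step.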

The catch: a two-week, application-gated window

This is where production planners should slow down. UltraSpeed is not generally available. It ships as an application-based trial running only June 9–23, 2026 (UTC+8), with slots limited and submission no guarantee of approval. Xiaomi says it will prioritize enterprises and professional developers with genuine business needs via platform.xiaomimimo.com/ultraspeed.

Pricing is blunt: 3× the cost of MiMo-V2.5-Pro for roughly 10× the generation speed, API-only, with no Token Plan support. Approved users also get free Chat access at ultraspeed.xiaomimimo.com, capped at 10 queue entries per account per day, 30-minute sessions, and auto-release after 5 minutes idle.
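The pricing arithmetic is worth spelling out. The sketch below assumes "3× the cost" means the per-token price, which the post does not state explicitly, and compares a fixed-size job on both tiers.

```python
# Relative cost and time for a fixed-size generation job.
# Assumption: "3x the cost" refers to per-token price.

PRICE_MULTIPLIER = 3.0    # from the article
SPEED_MULTIPLIER = 10.0   # "roughly 10x the generation speed"


def relative_job(price_mult: float, speed_mult: float) -> dict:
    """Cost, wall-clock time, and spend rate relative to the base model."""
    return {
        "cost": price_mult,                      # same tokens, higher price
        "wall_clock": 1.0 / speed_mult,          # same tokens, faster decode
        "spend_rate": price_mult * speed_mult,   # cost per second generating
    }


for key, value in relative_job(PRICE_MULTIPLIER, SPEED_MULTIPLIER).items():
    print(f"{key}: {value:.1f}x")
```

Under that reading, the same job costs 3× as much and finishes in a tenth of the time, so spend accrues about 30× faster while the model is generating. That matters for budgeting anything that keeps the endpoint saturated.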

In other words: this reads as a capability demonstration and a resource-constrained pilot, not a stable endpoint you can architect around today. For standard, durable access, Xiaomi points developers to the regular MiMo-V2.5 model series.

The takeaway for evaluators: treat the 1000-tps figure as a signal of where commodity-GPU inference is heading — FP4 plus aggressive speculative decoding squeezing trillion-parameter models into single-node throughput — rather than a number you can provision against right now. The architecture is the news; the API is a two-week window.



