AI Release

Inside NVIDIA Cosmos 3: Mixture-of-Transformers for Physical AI

The open-source platform unifies autoregressive reasoning and diffusion-based generation into a single architecture for robotics and autonomous systems.

Rachel Goldstein

Dev Tools Editor · Jun 12, 2026 · 4 min read

The term "Physical AI" has quickly become the industry's favorite shorthand for getting neural networks to interact with the messy, unpredictable physical world. While building a chatbot that hallucinates poetry is relatively low-stakes, building an autonomous system that must navigate a physical space without destroying itself or its surroundings requires a fundamentally different class of model.

To address this, NVIDIA has open-sourced NVIDIA Cosmos, a platform of world models, datasets, and tools designed specifically for robotics, autonomous vehicles, and smart infrastructure. At the center of this release is Cosmos 3, a family of omnimodal world models that attempts to unify vision-language processing, video generation, world simulation, and action forecasting into a single, cohesive framework.

The Mixture-of-Transformers (MoT) Architecture

Historically, developers have had to stitch together disparate models to build embodied agents: a vision-language model (VLM) for understanding, a diffusion model for generating synthetic training data, and a separate policy network for execution. Cosmos 3 attempts to bypass this fragmented pipeline by utilizing a unified Mixture-of-Transformers (MoT) architecture.

This MoT architecture combines two distinct paradigms within the same underlying transformer framework:

Autoregressive (AR) Transformer: Used for reasoning tasks. It processes language and visual understanding tokens via causal self-attention, enabling next-token prediction.
Diffusion Transformer (DM): Used for multimodal generation. It processes noisy image, video, audio, and action tokens through full attention to denoise and generate coherent outputs.

To keep these modalities aligned across space and time, Cosmos 3 shares its transformer architecture and multimodal attention layers across both modes. Crucially, it employs a unified 3D multi-dimensional rotary position embedding (mRoPE). This representation encodes spatial and temporal structures simultaneously, allowing the model to maintain consistent reasoning whether it is processing a static image, a streaming video, an audio track, or a robot's physical trajectory.

Two Runtime Surfaces: Reasoner vs. Generator

Cosmos 3 exposes its capabilities through two distinct runtime surfaces, depending on whether the application requires decision-making or simulation.

The Reasoner Surface

Inputs: Text, vision (images or video).
Outputs: Text.
Use Cases: Grounding, physical reasoning, task planning, action forecasting, and autonomous decision-making.

In Reasoner mode, the model acts as the cognitive engine. It analyzes visual inputs to generate textual outputs, such as identifying temporal events in a video, predicting next actions, or assessing the physical plausibility of a scene.

The Generator Surface

Inputs: Text, vision, sound, action.
Outputs: Vision, sound, action.
Use Cases: World simulation, future prediction, synthetic data generation, and policy learning.

In Generator mode, the model functions as a simulator. By accepting action sequences alongside sensory inputs, it can generate action-conditioned rollouts—essentially predicting "what happens next" if a robot takes a specific path. This is highly valuable for training reinforcement learning policies in simulation before deploying them to physical hardware.

The Model Family and Technical Specs

NVIDIA has released Cosmos 3 in two primary sizes, alongside specialized task-specific variants:

Cosmos3-Nano (16B): A compact, resource-efficient omnimodal model designed for edge deployment and fast iteration.
Cosmos3-Super (64B): A frontier-scale model optimized for advanced multimodal understanding and high-fidelity world simulation.
Specialized 64B Models: Dedicated variants including Cosmos3-Super-Text2Image and Cosmos3-Super-Image2Video for high-fidelity, temporally coherent generation.
Cosmos3-Nano-Policy-DROID (16B): A specialized vision-language robot policy pre-trained for DROID manipulation and control.

Generation and Input/Output Specifications

For developers looking to integrate these models into their pipelines, Cosmos 3 supports a highly flexible input/output specification. The models are tested in BF16 precision on Linux and are compatible with NVIDIA Ampere, Hopper, and Blackwell GPU architectures.

Parameter	Supported Values
Resolution Tiers	256p (320x192), 480p (832x480, default), 720p (1280x720)
Aspect Ratios	16:9, 4:3, 1:1, 3:4, 9:16
Frame Rates	10, 16, 24 (default), and 30 FPS
Frame Count	5 to 300 frames (default: 189)
Input Formats	Text strings, JPG/PNG/WEBP images, MP4 videos, JSON action arrays

The action conditioning is particularly granular, supporting varying dimensions depending on the target embodiment. This includes 9D camera motion, 9D autonomous vehicle controls, 57D egocentric motion, and 10D single-arm robot manipulation (compatible with DROID, UR, Fractal, and Bridge environments).

Integration and Deployment

NVIDIA has structured the Cosmos ecosystem to support both research-focused exploration and production-grade deployment.

For Python-first development and prototyping, developers can run the models using Hugging Face Diffusers (for generator workflows) and Transformers (for reasoner workflows).

When transitioning to production, the platform supports high-throughput, low-latency serving via vLLM and vLLM-Omni, providing OpenAI-compatible APIs. For enterprise deployments, Cosmos models can also be served via NVIDIA Inference Microservices (NIM).

Sources & further reading

NVIDIA/cosmos — github.com

#Open Source #Nvidia #Cosmos #Robotics #World Models

Written by

Rachel Goldstein · Dev Tools Editor

Rachel has been embedded in the developer tooling ecosystem for nearly eight years, covering everything from IDE wars and package-manager drama to the quiet rise of AI-assisted coding. She has a soft spot for open-source maintainers and an unhealthy number of terminal emulators installed on a single laptop.

Discussion 1

Join the discussion

Dmitri Sokolov @ai_doomer_dmitri · 16 hours ago

i'm intrigued by the potential of cosmos 3, but i have to wonder about the potential risks of unifying autoregressive reasoning and diffusion-based generation in a single architecture - could this lead to unforeseen emergent behaviors in autonomous systems?

Inside NVIDIA Cosmos 3: Mixture-of-Transformers for Physical AI

The Mixture-of-Transformers (MoT) Architecture

Two Runtime Surfaces: Reasoner vs. Generator

The Reasoner Surface

The Generator Surface

The Model Family and Technical Specs

Generation and Input/Output Specifications

Integration and Deployment

Sources & further reading

Discussion 1

Related Reading

US Government Forces Anthropic to Pull Fable 5 and Mythos 5

Migrating TrueType Hinting to Swift: How Apple Beat C

WASI WebGPU Proposal Brings Portable GPU Acceleration to WebAssembly

LLMs Move Into CAD with Progressive Refinement Pipelines