All 33 models.
One API.

Groq UltraFast (7) on LPU silicon, plus the full Nebius Token Factory catalog (26) on H100 — reasoning, vision, embeddings, and open-weight chat — all through a single OpenAI-compatible interface.

Models in catalog

1,800

Max TPS

262K

Max context

Catalog

White documentation cards for every model. UltraFast models on LPU silicon carry the orange accent. Groq UltraFast (7) + Nebius Token Factory catalog (26), all with 20% standard markup on the base price.

33 models · sorted by category, then release date

Groq · OpenAI

UltraFast

GPT OSS 20B 128k

OpenAI’s open-weight 20B MoE (3.6B active, 32 experts) on Groq LPU silicon. 1,800 TPS — MXFP4 quantized, adjustable CoT effort, native tool use.

TPS: 1,800
Context: 128K
Input: $0.09/M
Output: $0.36/M

Live · 99.99% SLA
Try it →

Groq · OpenAI

UltraFast

GPT OSS Safeguard 20B

Safety-tuned sibling of GPT OSS 20B on Groq LPU. 1,800 TPS — content classification, jailbreak detection, and policy enforcement at sub-second latency.

TPS: 1,800
Context: 128K
Input: $0.09/M
Output: $0.36/M

Live · 99.99% SLA
Try it →

Groq · OpenAI

UltraFast

GPT OSS 120B 128k

OpenAI’s flagship 120B open-weight MoE (5.1B active, 128 experts) on Groq LPU. 1,000 TPS — near-frontier reasoning with adjustable CoT effort, Apache 2.0.

TPS: 1,000
Context: 128K
Input: $0.18/M
Output: $0.72/M

Live · 99.99% SLA
Try it →

Groq · Meta

UltraFast

Llama 4 Scout (17Bx16E) 128k

Meta’s MoE Llama 4 Scout (17B×16E experts) on Groq LPU. 1,200 TPS — 128K context, tuned for low-latency multi-turn chat and native function calling.

TPS: 1,200
Context: 128K
Input: $0.13/M
Output: $0.41/M

Live · 99.99% SLA
Try it →

Groq · Alibaba

UltraFast

Qwen3 32B 131k

Alibaba’s Qwen3 32B dense (hybrid thinking mode) on Groq LPU. 1,300 TPS — 131K context, 100+ languages, strong tool calling.

TPS: 1,300
Context: 131K
Input: $0.35/M
Output: $0.71/M

Live · 99.99% SLA
Try it →

Groq · Meta

UltraFast

Llama 3.3 70B Versatile 128k

Meta’s 70B production workhorse on Groq LPU. 1,100 TPS — high-throughput multi-turn chat and tool-calling agents at sub-second latency.

TPS: 1,100
Context: 128K
Input: $0.71/M
Output: $0.95/M

Live · 99.99% SLA
Try it →

Groq · Meta

UltraFast

Llama 3.1 8B Instant 128k

Meta’s smallest 8B on Groq LPU. 1,800 TPS — ultra-cheap inference for high-volume chat, classification, and routing workloads.

TPS: 1,800
Context: 128K
Input: $0.06/M
Output: $0.10/M

Live · 99.99% SLA
Try it →

NVIDIA · Nebius

REASONING

Nemotron-3-Ultra-550B-a55b

A 550B hybrid MoE (55B active) from NVIDIA on Nebius. Vision-capable, optimized for demanding multi-agent AI and complex reasoning.

TPS: 59
Context: 128K
Input: $1.20/M
Output: $3.60/M

Live · 99.9% SLA
Try it →

NVIDIA · Nebius

VISION

Cosmos3-Super-Reasoner

NVIDIA’s 35B vision-reasoning model on Nebius — optimized for complex video/image understanding, agentic AI tasks, and high-throughput inference.

TPS: 30
Context: 128K
Input: $0.12/M
Output: $0.36/M

Live · 99.9% SLA
Try it →

OpenBMB · Nebius

VISION

openbmb/MiniCPM-V-4.5

OpenBMB MiniCPM-V 4.5 — compact multimodal model for image, multi-image, high-FPS/single-video, OCR/PDf understanding with strong multilingual coverage.

TPS: 49.5
Context: 32K
Input: $0.07/M
Output: $0.13/M

Live · 99.9% SLA
Try it →

Moonshot AI · Nebius

REASONING

Kimi-K2.6

Kimi K2.6 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed vision-text tokens.

TPS: 60
Context: 256K
Input: $1.14/M
Output: $4.80/M

Live · 99.9% SLA
Try it →

DeepSeek · Nebius

REASONING

DeepSeek-V4-Pro

DeepSeek-V4 is designed for advanced reasoning, coding, and long-horizon agent workflows, with strong performance across the Nebius H100 cluster.

TPS: 24
Context: 128K
Input: $2.10/M
Output: $4.20/M

Live · 99.9% SLA
Try it →

NVIDIA · Nebius

OPEN

Nemotron-3-Nano-Omni

The most open, efficient, and accurate error-modal reasoning model for agentic AI — compact 30B MoE, runs at the fastest TPS in the Nemotron lineup.

TPS: 90
Context: 256K
Input: $0.07/M
Output: $0.29/M

Live · 99.9% SLA
Try it →

Z.ai · Nebius

REASONING

GLM-5.1

Zhipu AI’s latest flagship multimodal model with strong bilingual (Chinese–English) reasoning, long-context understanding, advanced agentic tool use.

TPS: 25
Context: 200K
Input: $1.68/M
Output: $5.28/M

Live · 99.9% SLA
Try it →

MiniMax · Nebius

OPEN

MiniMax-M2.5

Open-source agentic coding model built for polyglot development and precise refactoring, using retrieval-thinking tools to ground outputs.

TPS: 36.8
Context: 192K
Input: $0.36/M
Output: $1.44/M

Live · 99.9% SLA
Try it →

NVIDIA · Nebius

OPEN

Nemotron-3-Super-120b-a12b

Nemotron 3 Super is a 120B hybrid MoE model optimized for efficient multi-agent AI and complex reasoning tasks on the Nebius H100 cluster.

TPS: 127
Context: 128K
Input: $0.36/M
Output: $1.08/M

Live · 99.9% SLA
Try it →

Alibaba · Nebius

REASONING

Qwen3.5-397B-A17B

Multimodal MoE featuring a Hybrid Mixture-of-Experts architecture, designed for state-of-the-art performance across chat, retrieval, reasoning, and tool use.

TPS: 80
Context: 262K
Input: $0.72/M
Output: $4.32/M

Live · 99.9% SLA
Try it →

Z.ai · Nebius

REASONING

GLM-5

Zhipu AI’s latest flagship multimodal model with strong bilingual (Chinese–English) reasoning, long-context understanding, advanced agentic tool use.

TPS: 47
Context: 200K
Input: $1.20/M
Output: $3.84/M

Live · 99.9% SLA
Try it →

DeepSeek · Nebius

OPEN

DeepSeek-V3.2

A model designed to harmonize high compute efficiency with strong reasoning and agentic tool-use performance, served on Nebius H100.

TPS: 71
Context: 128K
Input: $0.36/M
Output: $0.54/M

Live · 99.9% SLA
Try it →

NousResearch · Nebius

REASONING

Hermes-4-405B

Hybrid reasoning model trained on verified CoT traces for strong math, coding, and step-by-step reliability. Frontier-tier open-weight.

TPS: 20
Context: 128K
Input: $1.20/M
Output: $3.60/M

Live · 99.9% SLA
Try it →

NousResearch · Nebius

REASONING

Hermes-4-70B

Compact version of Hermes-4 delivering high-quality reasoning, coding, and tool use with lower inference cost than its 405B sibling.

TPS: 20
Context: 128K
Input: $0.16/M
Output: $0.48/M

Live · 99.9% SLA
Try it →

OpenAI · Nebius

OPEN

gpt-oss-120b

Open-weight agentic model with configurable reasoning, full CoT visibility, strong tool use, and free-deployment support on Nebius.

TPS: 40
Context: 128K
Input: $0.18/M
Output: $0.72/M

Live · 99.9% SLA
Try it →

Prime Intellect · Nebius

REASONING

INTELLECT-3

A 100B-plus parameter Mixture-of-Experts model from Prime Intellect, fine-tuned with large-scale RL to deliver top-tier math, code, science, and reasoning.

TPS: 35
Context: 128K
Input: $0.24/M
Output: $1.20/M

Live · 99.9% SLA
Try it →

Alibaba · Nebius

OPEN

Qwen3-235B-A22B-Instruct-2507

Balanced Qwen3 flagship tuned for strong general reasoning, chat quality, and tool use at mid-size active cost.

TPS: 27
Context: 262K
Input: $0.24/M
Output: $0.72/M

Live · 99.9% SLA
Try it →

Alibaba · Nebius

OPEN

Qwen3-30B-A3B-Instruct-2507

Versatile 30B instruct model optimized for high-quality chat, reasoning, and coding at low cost.

TPS: 70
Context: 32K
Input: $0.12/M
Output: $0.36/M

Live · 99.9% SLA
Try it →

Alibaba · Nebius

EMBED

Qwen3-Embedding-8B

Qwen embedding model optimized for high-precision dense retrieval with multilingual coverage (100+ languages).

Dim: 4096
Lang: 100+
Input: $0.01/M
Output: —

Live · 99.9% SLA
Try it →

Alibaba · Nebius

REASONING

Qwen3-Next-80B-A3B-Thinking

Qwen’s “thinking-optimized” 80B model designed for sustained multi-step reasoning, structured deliberation, and high-precision multi-domain reasoning.

TPS: 85
Context: 262K
Input: $0.18/M
Output: $1.44/M

Live · 99.9% SLA
Try it →

Alibaba · Nebius

OPEN

Qwen3-32B

General model offering strong multilingual reasoning, coding, and long-context performance at mid scale.

TPS: 23
Context: 32K
Input: $0.12/M
Output: $0.36/M

Live · 99.9% SLA
Try it →

Google · Nebius

OPEN

Gemma-3-27b-it

Google’s mid-size model optimized for high-quality instruction following, coding, and multilingual performance.

TPS: 20
Context: 32K
Input: $0.12/M
Output: $0.36/M

Live · 99.9% SLA
Try it →

NVIDIA · Nebius

REASONING

Llama-3.1-Nemotron-Ultra-253B-v1

NVIDIA-tuned Llama variant built for high-efficiency reasoning, safety, and enterprise-grade performance.

TPS: 25
Context: 128K
Input: $0.72/M
Output: $2.16/M

Live · 99.9% SLA
Try it →

NVIDIA · Nebius

OPEN

Nemotron-3-Nano-30B-A3B

Compact MoE model optimized for efficient reasoning, chat, and coding with strong multilingual support and long-context RAG/agent workflows.

TPS: 60
Context: 256K
Input: $0.07/M
Output: $0.29/M

Live · 99.9% SLA
Try it →

Alibaba · Nebius

VISION

Qwen2.5-VL-72B-Instruct

High-end multimodal model delivering strong vision-language reasoning with long-context support.

TPS: 20
Context: 128K
Input: $0.30/M
Output: $0.90/M

Live · 99.9% SLA
Try it →

Meta · Nebius

OPEN

Llama-3.3-70B-Instruct

Refined Llama instruct model with strong reasoning, chat quality, and broad benchmark performance.

TPS: 25
Context: 128K
Input: $0.16/M
Output: $0.48/M

Live · 99.9% SLA
Try it →

/ Compare

Flagship models, side by side.

The numbers that matter: TPS, context, and price. Find the right model for your workload.

Model	Type	Platform	Context	Max TPS	Input / M	Output / M
GPT OSS 20B 128k	UltraFast · MoE	Groq · OpenAI	128K	1,800	$0.09	$0.36
GPT OSS Safeguard 20B	UltraFast · MoE	Groq · OpenAI	128K	1,800	$0.09	$0.36
GPT OSS 120B 128k	UltraFast · MoE	Groq · OpenAI	128K	1,000	$0.18	$0.72
Llama 4 Scout (17Bx16E) 128k	UltraFast · MoE	Groq · Meta	128K	1,200	$0.13	$0.41
Qwen3 32B 131k	UltraFast	Groq · Alibaba	131K	1,300	$0.35	$0.71
Llama 3.3 70B Versatile 128k	UltraFast	Groq · Meta	128K	1,100	$0.71	$0.95
Llama 3.1 8B Instant 128k	UltraFast	Groq · Meta	128K	1,800	$0.06	$0.10
Nemotron-3-Ultra-550B-a55b	Reasoning · MoE 550B (55B active)	NVIDIA · Nebius	128K	59	$1.20	$3.60
Cosmos3-Super-Reasoner	Reasoning · Vision	NVIDIA · Nebius	128K	30	$0.12	$0.36
openbmb/MiniCPM-V-4.5	Reasoning · Vision	OpenBMB · Nebius	32K	49.5	$0.07	$0.13
Kimi-K2.6	Reasoning · multimodal agentic	Moonshot AI · Nebius	256K	60	$1.14	$4.80
DeepSeek-V4-Pro	Reasoning · long-horizon agent	DeepSeek · Nebius	128K	24	$2.10	$4.20
GLM-5.1	Reasoning · multimodal	Z.ai · Nebius	200K	25	$1.68	$5.28
Qwen3.5-397B-A17B	Reasoning · MoE 397B (17B active)	Alibaba · Nebius	262K	80	$0.72	$4.32
GLM-5	Reasoning · multimodal	Z.ai · Nebius	200K	47	$1.20	$3.84
Hermes-4-405B	Reasoning · verified-CoT	NousResearch · Nebius	128K	20	$1.20	$3.60
Hermes-4-70B	Reasoning	NousResearch · Nebius	128K	20	$0.16	$0.48
INTELLECT-3	Reasoning · MoE 100B+ RL-tuned	Prime Intellect · Nebius	128K	35	$0.24	$1.20
Qwen3-Next-80B-A3B-Thinking	Reasoning · thinking-tuned MoE	Alibaba · Nebius	262K	85	$0.18	$1.44
Llama-3.1-Nemotron-Ultra-253B-v1	Reasoning · enterprise	NVIDIA · Nebius	128K	25	$0.72	$2.16
Nemotron-3-Nano-Omni	Open · error-modal reasoning	NVIDIA · Nebius	256K	90	$0.07	$0.29
MiniMax-M2.5	Open · agentic coding	MiniMax · Nebius	192K	36.8	$0.36	$1.44
Nemotron-3-Super-120b-a12b	Open · MoE 120B (12B active)	NVIDIA · Nebius	128K	127	$0.36	$1.08
DeepSeek-V3.2	Open · efficient reasoning	DeepSeek · Nebius	128K	71	$0.36	$0.54
gpt-oss-120b	Open · agentic MoE	OpenAI · Nebius	128K	40	$0.18	$0.72
Qwen3-235B-A22B-Instruct-2507	Open · MoE 235B (22B active)	Alibaba · Nebius	262K	27	$0.24	$0.72
Qwen3-30B-A3B-Instruct-2507	Open · MoE 30B (3B active)	Alibaba · Nebius	32K	70	$0.12	$0.36
Qwen3-32B	Open · general 32B	Alibaba · Nebius	32K	23	$0.12	$0.36
Gemma-3-27b-it	Open · mid-size instruct	Google · Nebius	32K	20	$0.12	$0.36
Nemotron-3-Nano-30B-A3B	Open · MoE 30B (3B active)	NVIDIA · Nebius	256K	60	$0.07	$0.29
Llama-3.3-70B-Instruct	Open · 70B instruct	Meta · Nebius	128K	25	$0.16	$0.48
Qwen2.5-VL-72B-Instruct	Vision · 72B multimodal	Alibaba · Nebius	128K	20	$0.30	$0.90
Qwen3-Embedding-8B	Embeddings · 4096-dim	Alibaba · Nebius	32K	—	$0.01	—