A unified inference API for 60+ open-source and frontier models — from lightweight LLMs to image, audio, and embeddings. Switch the base URL in your existing OpenAI or Anthropic SDK and run on Token Factory infrastructure. No new SDK to learn, no migration to plan.
Now serving both OpenAI and Anthropic endpoints.
Use the openai SDK or the anthropic SDK against the same API key. Both translate to the same Token Factory router — your choice of client never changes pricing, performance, or model availability.
Choose your SDK
Both clients are first-class. Use whichever you already have in your stack — the rest of these docs cover both syntaxes side by side.
OpenAI SDK
Python · Node.js · Go · Rust
Drop-in compatible with /v1/chat/completions, /v1/embeddings, /v1/audio/*
Full support for tools, tool_choice, response_format, and stream
Works with LiteLLM, LangChain, LlamaIndex, Vercel AI SDK
Both openai & openai-agents libraries
pip install openai
npm i openai
Anthropic SDK
Python · Node.js · Go · Rust
Drop-in compatible with /v1/messages and /v1/messages/batches
Full support for system, tools, tool_use, thinking, and stream
Works with Claude Code, Cursor, Continue, Cline
Both anthropic & @anthropic-ai/sdk libraries
pip install anthropic
npm i @anthropic-ai/sdk
Base URLs
The same API key works for both. Pick the URL that matches your SDK:
All requests require a Bearer token. Get your key from the Dashboard. Keys start with tf-.
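As a rough sketch of how the base URL and Bearer token fit together, the snippet below builds (without sending) an authenticated request for either surface using only the Python standard library. The base URLs are the ones given under SDKs & libraries; the `build_request` helper and the `tf-example` key are hypothetical and exist only for illustration.

```python
import json
import urllib.request

# Base URLs as listed in the SDKs & libraries section of these docs.
BASE_URLS = {
    "openai": "https://api.tokenfactory.ai/openai/v1/",
    "anthropic": "https://api.tokenfactory.ai/anthropic/",
}


def build_request(surface, path, api_key, payload):
    """Build (but do not send) an authenticated POST request.

    Hypothetical helper: `surface` is "openai" or "anthropic".
    """
    if not api_key.startswith("tf-"):
        raise ValueError("Token Factory keys start with 'tf-'")
    url = BASE_URLS[surface] + path.lstrip("/")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "tf-example" is a placeholder, not a working key.
req = build_request(
    "openai",
    "chat/completions",
    "tf-example",
    {"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Hi"}]},
)
print(req.full_url)
```

Passing the result to `urllib.request.urlopen(req)` would send it; the SDKs do the equivalent once `base_url` and `api_key` are set.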
Token billing
Billed per token — input and output separately. Check the Models page for per-model rates.
UltraFast
Models with the UltraFast flag run on LPU hardware — up to 1,200 tokens/sec. Use for latency-critical apps.
Rate limits
Limits vary by plan. Check X-RateLimit-* and retry-after headers in responses.
Featured Updated 2026-05-12
OpenAI & Anthropic compatibility
Token Factory serves both the OpenAI and Anthropic API surfaces from the same router. The two are not separate products — they’re two doors into the same model pool, same pricing, same API key, same usage dashboard. Use whichever client your stack already speaks.
Endpoint comparison
| Capability | OpenAI endpoint | Anthropic endpoint | Status |
| --- | --- | --- | --- |
| Chat / Messages | /openai/v1/chat/completions | /anthropic/v1/messages | Live |
| Streaming (SSE) | stream=true | stream=true | Live |
| Function calling | tools + tool_choice | tools + tool_use blocks | Live |
| JSON mode | response_format | tool use + schema | Live |
| Vision / Images | image_url content | image content blocks | Live |
| Batch | /v1/batches | /v1/messages/batches | Live |
| Embeddings | /openai/v1/embeddings | Not available (use OpenAI) | OpenAI only |
| Audio (ASR / TTS) | /openai/v1/audio/* | Not available (use OpenAI) | OpenAI only |
| Image generation | /openai/v1/images/* | Not available (use OpenAI) | OpenAI only |
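The Vision / Images row is the one where the two request shapes differ most visibly. The sketch below shows how the same image bytes would be wrapped for each surface, using the content shapes of the upstream OpenAI and Anthropic APIs. The helper names are hypothetical, and it is an assumption that Token Factory accepts base64 data in both forms.

```python
import base64


def openai_image_part(image_bytes: bytes, media_type: str = "image/png") -> dict:
    """OpenAI-style content part: the image travels as a data URL in image_url."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{media_type};base64,{encoded}"},
    }


def anthropic_image_block(image_bytes: bytes, media_type: str = "image/png") -> dict:
    """Anthropic-style content block: base64 data plus an explicit media type."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": encoded},
    }


placeholder = b"\x89PNG"  # placeholder bytes, not a complete image
openai_part = openai_image_part(placeholder)
anthropic_block = anthropic_image_block(placeholder)
print(openai_part["type"], anthropic_block["type"])
```

Either dict goes into the `content` list of a user message next to a text part or block.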
Model mapping
Both SDKs can call any Token Factory model — you reference the same model string regardless of which client you use:
One key, two surfaces. A single tf- API key authorizes both endpoints. Usage, billing, and rate limits are aggregated across both surfaces in your dashboard.
Quickstart
First response in under 2 minutes. Pick the SDK you already have.
Prerequisites: Python 3.8+ or Node.js 18+. You’ll need an API key from your Dashboard.
# Development key
tf-dev-r1n9s4t5u6v7w8x9y0z1a2b3c4d5e6f7g
Response headers
Every API response includes rate limit information in headers (identical schema for both endpoints):
| Header | Type | Description |
| --- | --- | --- |
| X-RateLimit-Limit | integer | Maximum requests allowed per minute for your plan |
| X-RateLimit-Remaining | integer | Requests remaining in current window |
| X-RateLimit-Reset | timestamp | Unix timestamp when the rate limit window resets |
| X-Token-Budget-Remaining | integer | Tokens remaining in your current plan window |
| retry-after | integer | Seconds to wait before retrying (on 429 responses) |
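One way to consume these headers is to parse them once into a small record, as sketched below. The header names come from the table above; the `RateLimitInfo` type, the `parse_rate_limit_headers` helper, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RateLimitInfo:
    limit: Optional[int]
    remaining: Optional[int]
    reset_at: Optional[int]  # Unix timestamp
    token_budget_remaining: Optional[int]
    retry_after: Optional[int]  # seconds; only expected on 429 responses


def parse_rate_limit_headers(headers: Mapping[str, str]) -> RateLimitInfo:
    """Read the rate limit headers, treating header names as case-insensitive."""
    lowered = {name.lower(): value for name, value in headers.items()}

    def as_int(name: str) -> Optional[int]:
        value = lowered.get(name.lower())
        # int(float(...)) tolerates a timestamp sent with a fractional part.
        return int(float(value)) if value is not None else None

    return RateLimitInfo(
        limit=as_int("X-RateLimit-Limit"),
        remaining=as_int("X-RateLimit-Remaining"),
        reset_at=as_int("X-RateLimit-Reset"),
        token_budget_remaining=as_int("X-Token-Budget-Remaining"),
        retry_after=as_int("retry-after"),
    )


# Made-up sample values for illustration only.
info = parse_rate_limit_headers({
    "x-ratelimit-limit": "120",
    "x-ratelimit-remaining": "117",
    "x-ratelimit-reset": "1778580000",
    "x-token-budget-remaining": "2950000",
})
print(info)
```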
Error codes
Token Factory uses standard HTTP status codes. The response shape matches the SDK that issued the request — OpenAI SDK calls return OpenAI-style error envelopes, Anthropic SDK calls return Anthropic-style envelopes.
| HTTP Code | Error code | Description |
| --- | --- | --- |
| 200 | (none) | Success |
| 400 | bad_request | Invalid request parameters. Check error.message for details. |
| 401 | unauthorized | Invalid or missing API key. Check your Authorization header. |
| 403 | forbidden | Your plan does not have access to this model or endpoint. |
| 429 | rate_limit_exceeded | Too many requests. See X-RateLimit-Reset header for reset time. |
| 429 | token_budget_exceeded | Token window depleted. Wait for window reset or add pay-as-you-go top-up. |
| 500 | server_error | Internal server error. Retry with exponential backoff. |
| 503 | model_unavailable | The requested model is temporarily unavailable. Try another model. |
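If you call the REST endpoints directly rather than through an SDK, you see the raw envelope. The sketch below pulls an error code and message out of either style. The envelope layouts are those of the upstream OpenAI and Anthropic APIs. Where Token Factory places its own error code (`error.code` for OpenAI-style bodies, `error.type` for Anthropic-style bodies) is an assumption, not something stated here, and `extract_error` is a hypothetical helper.

```python
from typing import Any, Mapping, Optional, Tuple


def extract_error(body: Mapping[str, Any]) -> Tuple[Optional[str], Optional[str]]:
    """Return (error_code, message) from either envelope style."""
    error = body.get("error") or {}
    if body.get("type") == "error":
        # Anthropic-style: {"type": "error", "error": {"type": ..., "message": ...}}
        return error.get("type"), error.get("message")
    # OpenAI-style: {"error": {"code": ..., "type": ..., "message": ...}}
    return error.get("code") or error.get("type"), error.get("message")


openai_style = {"error": {"code": "rate_limit_exceeded", "message": "Too many requests"}}
anthropic_style = {
    "type": "error",
    "error": {"type": "rate_limit_exceeded", "message": "Too many requests"},
}
print(extract_error(openai_style))
print(extract_error(anthropic_style))
```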
Cross-SDK error handling
Because both SDKs see the same Token Factory router, the HTTP status code is identical — only the SDK’s parsed error class differs. Handle by status, not by SDK type:
Python · works for both
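import time
import anthropic

try:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022", max_tokens=1024,
        messages=[{"role": "user", "content": "What is 2+2?"}])
except anthropic.RateLimitError:  # raised on HTTP 429
    time.sleep(1)  # then retry; works the same as openai.RateLimitError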
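To handle errors by status rather than by SDK class, one option is to read the `status_code` attribute, which both `openai.APIStatusError` and `anthropic.APIStatusError` expose. The sketch below maps a status to a next step following the error table. The `next_step` helper, its return strings, and `DemoStatusError` (a stand-in for a real SDK exception) are hypothetical.

```python
from typing import Optional


def status_of(exc: BaseException) -> Optional[int]:
    """Return the HTTP status carried by an SDK error, or None (e.g. connection errors)."""
    return getattr(exc, "status_code", None)


def next_step(exc: BaseException) -> str:
    """Decide what to do with an error, based only on its HTTP status."""
    status = status_of(exc)
    if status in (429, 500):
        return "retry"  # after waiting; see the rate limit guidance
    if status == 503:
        return "switch_model"  # model_unavailable: try another model
    return "raise"  # 400/401/403 and anything unknown will not fix itself


class DemoStatusError(Exception):
    """Stand-in for an SDK status error, used only to exercise the helper."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


for code in (429, 503, 401):
    print(code, next_step(DemoStatusError(code)))
```

In real code the helper would sit inside an `except Exception as exc:` block wrapping either client's call.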
OpenAI-compatible
Chat completions
Creates a completion for a chat conversation. Compatible with OpenAI’s chat.completions.create endpoint.
Python · Anthropic SDK
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "What is 2+2?"}
    ],
)
print(message.content[0].text)
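For comparison, a minimal sketch of the same question sent through the OpenAI SDK's `chat.completions.create`. The base URL is the one listed under SDKs & libraries and the model string appears elsewhere in these docs; the `TOKEN_FACTORY_API_KEY` environment variable name and the `chat_request` helper are assumptions made for this example. The network call only runs when that variable is set.

```python
import os


def chat_request(prompt: str, model: str = "openai/gpt-oss-120b") -> dict:
    """Build keyword arguments for client.chat.completions.create."""
    return {
        "model": model,
        "max_tokens": 1024,
        "messages": [
            # In the OpenAI format the system prompt is a message, not a parameter.
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    }


request = chat_request("What is 2+2?")

# TOKEN_FACTORY_API_KEY is a hypothetical variable name; use whatever your stack uses.
if os.environ.get("TOKEN_FACTORY_API_KEY"):
    from openai import OpenAI  # pip install openai

    client = OpenAI(
        base_url="https://api.tokenfactory.ai/openai/v1/",
        api_key=os.environ["TOKEN_FACTORY_API_KEY"],
    )
    completion = client.chat.completions.create(**request)
    print(completion.choices[0].message.content)
```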
Streaming
Receive tokens as they’re generated using Server-Sent Events (SSE). Both SDKs support streaming with the same stream=true flag.
Python · OpenAI SDK
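stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # ⚡ UltraFast, 500 TPS
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)
for chunk in stream:
    # Some chunks carry no text (for example the final one), so check before printing.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)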
Python · Anthropic SDK
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Tell me a story"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
UltraFast + streaming is the ideal combination for real-time chat UIs, voice interfaces, and interactive code editors. GPT OSS 20B and Claude 3.5 Haiku can stream at 1,000+ tokens/second on the LPU tier.
Function calling
Enable models to call external functions and APIs. Both SDKs work — the tool schema is normalized on our router.
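To make the two tool shapes concrete, the sketch below converts a tool definition between the OpenAI format (`type: function` with `parameters`) and the Anthropic format (`input_schema`). It illustrates what normalizing the schema could look like and is not Token Factory's router code. The helper names and the `get_weather` tool are hypothetical.

```python
def openai_tool_to_anthropic(tool: dict) -> dict:
    """OpenAI {"type": "function", "function": {...}} -> Anthropic {"name", "input_schema"}."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }


def anthropic_tool_to_openai(tool: dict) -> dict:
    """Anthropic {"name", "input_schema"} -> OpenAI {"type": "function", "function": {...}}."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool["input_schema"],
        },
    }


# Hypothetical example tool.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

as_anthropic = openai_tool_to_anthropic(weather_tool)
print(as_anthropic["name"], list(as_anthropic["input_schema"]["properties"]))
```

With the OpenAI SDK the definition goes in `tools=[weather_tool]`; with the Anthropic SDK, `tools=[as_anthropic]`. The JSON Schema for the arguments is the same in both.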
Ready-to-run recipes for common Token Factory use cases. All recipes work with both SDKs — pick the one already in your stack.
RAG pipeline
Build a production RAG system using BAAI embeddings + DeepSeek R1 for retrieval and reasoning.
AI chat interface
Real-time streaming chat UI with GPT OSS 120B UltraFast. Under 100ms time-to-first-token.
Voice assistant
End-to-end voice pipeline: Whisper (ASR) → Claude (LLM) → Orpheus (TTS). Full conversation in <2s.
Batch processing
Process thousands of documents asynchronously using Llama 3.1 8B for classification at scale.
Claude Code on Token Factory
Point Claude Code at our Anthropic-compatible base URL and run on Token Factory infra with no setup.
Cursor + Token Factory
Configure Cursor’s OpenAI provider to use our base URL — every Cursor model is a Token Factory model.
SDKs & libraries
Token Factory is compatible with both the OpenAI and Anthropic SDKs. No custom library needed — use whichever client your stack already speaks.
OpenAI SDK — Python
pip install openai → set base_url="https://api.tokenfactory.ai/openai/v1/"
OpenAI SDK — Node.js / TypeScript
npm install openai → set baseURL: "https://api.tokenfactory.ai/openai/v1/"
Anthropic SDK — Python
pip install anthropic → set base_url="https://api.tokenfactory.ai/anthropic/"
Anthropic SDK — Node.js / TypeScript
npm install @anthropic-ai/sdk → set baseURL: "https://api.tokenfactory.ai/anthropic/"
Claude Code · Cursor · Continue · Cline
Point your editor’s Anthropic provider at https://api.tokenfactory.ai/anthropic/ — works out of the box.
Vercel AI SDK · LangChain · LlamaIndex · LiteLLM
Use Token Factory as your LLM provider in any agent framework. Pass baseURL + apiKey to your LLM connector.
REST / cURL
All endpoints follow the OpenAI or Anthropic REST spec. Any HTTP client works.
Rate limits
Rate limits are unified across both OpenAI and Anthropic endpoints — a request to either surface counts against the same per-key budget.
| Plan | RPM | TPM | Token window |
| --- | --- | --- | --- |
| Test | 30 | 10K | 24 hours · 1M total |
| Token Pro | 120 | 100K | 5 hours · 3M total |
| Token Max | 500 | 500K | 5 hours · 5M total |
| Token Ultra | 2,000 | 2M | Monthly · 20M total |
Rate limit exceeded? Handle HTTP 429 responses with exponential backoff starting at 1s. On token budget exhaustion, wait for window reset or add a pay-as-you-go top-up in your Dashboard.
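A minimal sketch of that backoff, assuming delays double from 1s and that the `retry-after` header, when present, sets a floor on the wait. The `send` callable (returning status, headers, and body), the 30-second cap, and the retry count are all hypothetical choices. The demo at the end uses simulated responses and records the waits instead of sleeping.

```python
import time


def backoff_delays(retries: int = 5, base: float = 1.0, cap: float = 30.0):
    """Yield exponential delays: 1s, 2s, 4s, ... capped at `cap` seconds."""
    for attempt in range(retries):
        yield min(cap, base * 2 ** attempt)


def retry_on_429(send, retries: int = 5, sleep=time.sleep):
    """Call send() until it returns a non-429 status or the retries run out.

    `send` is a hypothetical callable returning (status, headers, body).
    """
    delays = list(backoff_delays(retries))
    for attempt in range(retries + 1):
        status, headers, body = send()
        if status != 429 or attempt == retries:
            return status, body
        # Wait at least as long as the server asks for via retry-after.
        sleep(max(delays[attempt], float(headers.get("retry-after", 0))))


# Simulated responses: two 429s, then success. No real requests are made.
responses = iter([
    (429, {"retry-after": "3"}, None),
    (429, {}, None),
    (200, {}, "ok"),
])
slept = []
result = retry_on_429(lambda: next(responses), sleep=slept.append)
print(result, slept)
```

Short backoff suits `rate_limit_exceeded`. A 429 with `token_budget_exceeded` will not clear until the token window resets, so it is better surfaced to the caller than retried in a loop.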