v1.0 — Live
OpenAI + Anthropic compatible

Token Factory API

A unified inference API for 60+ open-source and frontier models — from lightweight LLMs to image, audio, and embeddings. Switch the base URL in your existing OpenAI or Anthropic SDK and run on Token Factory infrastructure. No new SDK to learn, no migration to plan.

Now serving both OpenAI and Anthropic endpoints.
Use the openai SDK or the anthropic SDK against the same API key. Both translate to the same Token Factory router — your choice of client never changes pricing, performance, or model availability.

Choose your SDK

Both clients are first-class. Use whichever you already have in your stack — the rest of these docs cover both syntaxes side by side.

OpenAI SDK

Python · Node.js · Go · Rust

Drop-in compatible with /v1/chat/completions, /v1/embeddings, /v1/audio/*

Full support for tools, tool_choice, response_format, and stream

Works with LiteLLM, LangChain, LlamaIndex, Vercel AI SDK

Both openai & openai-agents libraries

pip install openai
npm i openai

Anthropic SDK

Python · Node.js · Go · Rust

Drop-in compatible with /v1/messages and /v1/messages/batches

Full support for system, tools, tool_use, thinking, and stream

Works with Claude Code, Cursor, Continue, Cline

Both anthropic & @anthropic-ai/sdk libraries

pip install anthropic
npm i @anthropic-ai/sdk

Base URLs

The same API key works for both. Pick the URL that matches your SDK:

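OpenAI: https://api.tokenfactory.ai/openai/v1/

Anthropic: https://api.tokenfactory.ai/anthropic/v1/ (when configuring the Anthropic SDK, set the base URL to https://api.tokenfactory.ai/anthropic/ because the SDK appends /v1/messages itself)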

Quick integration — pick your SDK

Python · OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="https://api.tokenfactory.ai/openai/v1/",
    api_key="tf-your-api-key-here",
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528",
    messages=[
        {"role": "user", "content": "Explain quantum entanglement"}
    ],
)

print(response.choices[0].message.content)

JavaScript · OpenAI SDK

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.tokenfactory.ai/openai/v1/',
  apiKey: 'tf-your-api-key-here',
});

const response = await client.chat.completions.create({
  model: 'meta-llama/Llama-3.3-70B-Instruct',
  messages: [{ role: 'user', content: 'Hello!' }],
});

console.log(response.choices[0].message.content);

cURL · OpenAI-compatible

curl https://api.tokenfactory.ai/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tf-your-api-key-here" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Python · Anthropic SDK

import anthropic

client = anthropic.Anthropic(
    base_url="https://api.tokenfactory.ai/anthropic/",
    api_key="tf-your-api-key-here",
)

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain quantum entanglement"}
    ],
)

print(message.content[0].text)

JavaScript · Anthropic SDK

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({
  baseURL: 'https://api.tokenfactory.ai/anthropic/',
  apiKey: 'tf-your-api-key-here',
});

const message = await client.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  messages: [{ role: 'user', content: 'Hello!' }],
});

console.log(message.content[0].text);

cURL · Anthropic-compatible

curl https://api.tokenfactory.ai/anthropic/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: tf-your-api-key-here" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Key concepts

API keys

All requests require an API key, sent as a Bearer token on the OpenAI endpoint or in the x-api-key header on the Anthropic endpoint. Get your key from the Dashboard. Keys start with tf-.

Token billing

Billed per token — input and output separately. Check the Models page for per-model rates.

⚡

UltraFast

Models with the UltraFast flag run on LPU hardware — up to 1,200 tokens/sec. Use for latency-critical apps.

Rate limits

Limits vary by plan. Check X-RateLimit-* and retry-after headers in responses.
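As a rough illustration of the per-token billing described above, the sketch below computes the cost of one request from separate input and output rates. The function name, the per-million-token unit, and the rates are placeholders chosen for the example, not Token Factory prices; the Models page has the real figures.

```python
def request_cost(input_tokens, output_tokens, input_rate_per_m, output_rate_per_m):
    """Return the cost of one request, billing input and output tokens separately.

    Rates are expressed per one million tokens (an assumed unit for this sketch).
    """
    input_cost = input_tokens / 1_000_000 * input_rate_per_m
    output_cost = output_tokens / 1_000_000 * output_rate_per_m
    return input_cost + output_cost


# Hypothetical rates: 0.50 per 1M input tokens, 1.50 per 1M output tokens.
cost = request_cost(input_tokens=2_000, output_tokens=500,
                    input_rate_per_m=0.50, output_rate_per_m=1.50)
print(f"{cost:.6f}")  # 0.001750
```

Token counts for a real request are reported in the usage field of each SDK response.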

Featured
Updated 2026-05-12

OpenAI & Anthropic compatibility

Token Factory serves both the OpenAI and Anthropic API surfaces from the same router. The two are not separate products — they’re two doors into the same model pool, same pricing, same API key, same usage dashboard. Use whichever client your stack already speaks.

Endpoint comparison

| Capability | OpenAI endpoint | Anthropic endpoint | Status |
| --- | --- | --- | --- |
| Chat / Messages | /openai/v1/chat/completions | /anthropic/v1/messages | Live |
| Streaming (SSE) | stream=true | stream=true | Live |
| Function calling | tools + tool_choice | tools + tool_use blocks | Live |
| JSON mode | response_format | tool use + schema | Live |
| Vision / Images | image_url content | image content blocks | Live |
| Batch | /v1/batches | /v1/messages/batches | Live |
| Embeddings | /openai/v1/embeddings | Not available (use OpenAI) | OpenAI only |
| Audio (ASR / TTS) | /openai/v1/audio/* | Not available (use OpenAI) | OpenAI only |
| Image generation | /openai/v1/images/* | Not available (use OpenAI) | OpenAI only |

Model mapping

Both SDKs can call any Token Factory model — you reference the same model string regardless of which client you use:

| Model | Model ID (OpenAI and Anthropic alias) | Notes |
| --- | --- | --- |
| Claude 3.5 Sonnet | claude-3-5-sonnet-20241022 | Both |
| Claude 3.5 Haiku | claude-3-5-haiku-20241022 | Both |
| DeepSeek R1 | deepseek-ai/DeepSeek-R1-0528 | Both |
| GPT OSS 120B | openai/gpt-oss-120b | Both · ⚡ UltraFast |
| Llama 3.3 70B | meta-llama/Llama-3.3-70B-Instruct | Both |
| Qwen3 235B | Qwen/Qwen3-235B-A22B | Both |

Headers & auth

OpenAI

Authorization: Bearer tf-your-key-here

Anthropic

x-api-key: tf-your-key-here
anthropic-version: 2023-06-01

One key, two surfaces. A single tf- API key authorizes both endpoints. Usage, billing, and rate limits are aggregated across both surfaces in your dashboard.

Quickstart

First response in under 2 minutes. Pick the SDK you already have.

Prerequisites: Python 3.8+ or Node.js 18+. You’ll need an API key from your Dashboard.

Step 1 — Install the SDK

Bash · OpenAI SDK

# Python
pip install openai

# Node.js
npm install openai

Bash · Anthropic SDK

# Python
pip install anthropic

# Node.js
npm install @anthropic-ai/sdk

Step 2 — Set your API key

Bash

export TOKENFACTORY_API_KEY="tf-your-key-here"

Step 3 — Make your first request

Python · OpenAI SDK

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.tokenfactory.ai/openai/v1/",
    api_key=os.environ["TOKENFACTORY_API_KEY"],
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about tokens"}],
    max_tokens=100,
)

print(response.choices[0].message.content)
# Example output (will vary):
# Tokens flow like streams,
# Each word a unit of thought,
# Models breathe them in.

Python · Anthropic SDK

import os

import anthropic

client = anthropic.Anthropic(
    base_url="https://api.tokenfactory.ai/anthropic/",
    api_key=os.environ["TOKENFACTORY_API_KEY"],
)

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=100,
    messages=[{"role": "user", "content": "Write a haiku about tokens"}],
)

print(message.content[0].text)
# Example output (will vary):
# Tokens flow like streams,
# Each word a unit of thought,
# Models breathe them in.

Choose a model

Select from 60+ models. Every model ID below works with both SDKs.

deepseek-ai/DeepSeek-R1-0528

meta-llama/Llama-3.3-70B-Instruct

openai/gpt-oss-120b ⚡

Qwen/Qwen3-235B-A22B

claude-3-5-sonnet-20241022

mistralai/Mistral-Nemo-Instruct-2407

meta-llama/Llama-3.1-8B-Instruct

anthropic/claude-3-5-haiku ⚡

That’s it. You’re now running on Token Factory via either SDK. Browse all 60+ models →

Authentication

A single tf- API key authorizes both the OpenAI and Anthropic endpoints. Pass it in whichever header your SDK expects.

OpenAI

Authorization: Bearer tf-your-key-here

Anthropic

x-api-key: tf-your-key-here
anthropic-version: 2023-06-01

Never expose your API key in client-side JavaScript, public repositories, or logs. Rotate compromised keys immediately in your Dashboard.
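For clients that call the REST endpoints directly, the header choice above can be wrapped in a small helper. This is a minimal sketch: the function name `auth_headers` and the surface labels "openai" and "anthropic" are illustrative choices, not part of the Token Factory API.

```python
def auth_headers(surface, api_key):
    """Build the auth headers for a surface: "openai" or "anthropic"."""
    if surface == "openai":
        return {"Authorization": f"Bearer {api_key}"}
    if surface == "anthropic":
        return {"x-api-key": api_key, "anthropic-version": "2023-06-01"}
    raise ValueError(f"unknown surface: {surface!r}")


print(auth_headers("openai", "tf-your-key-here"))
print(auth_headers("anthropic", "tf-your-key-here"))
```

In real code, read the key from an environment variable such as TOKENFACTORY_API_KEY rather than hard-coding it.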

Key format

All Token Factory API keys follow the format tf-{environment}-{random}:

Examples

# Production key
tf-prod-xk2m4r9q1a2b3c4d5e6f7g8h9i0j1k2l3m

# Staging key
tf-stag-m7p3n8q2b5c6d7e8f9g0h1i2j3k4l5m6n

# Development key
tf-dev-r1n9s4t5u6v7w8x9y0z1a2b3c4d5e6f7g
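A client could use the key format to check which environment a key belongs to before sending traffic. The sketch below assumes, based only on the examples above, that the environment segment is lowercase letters and the random segment is lowercase letters and digits; the pattern and the helper name `key_environment` are illustrative, not an official validation rule.

```python
import re

# Assumed pattern, inferred only from the example keys above.
KEY_PATTERN = re.compile(r"^tf-(?P<environment>[a-z]+)-(?P<random>[a-z0-9]+)$")


def key_environment(api_key):
    """Return the environment segment of a key, or None if the shape doesn't match."""
    match = KEY_PATTERN.match(api_key)
    return match.group("environment") if match else None


print(key_environment("tf-prod-xk2m4r9q1a2b3c4d5e6f7g8h9i0j1k2l3m"))  # prod
print(key_environment("sk-not-a-token-factory-key"))  # None
```

A check like this can catch a staging key configured in production, but only the API itself can confirm that a key is valid.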

Response headers

Every API response includes rate limit information in headers (identical schema for both endpoints):

| Header | Type | Description |
| --- | --- | --- |
| X-RateLimit-Limit | integer | Maximum requests allowed per minute for your plan |
| X-RateLimit-Remaining | integer | Requests remaining in current window |
| X-RateLimit-Reset | timestamp | Unix timestamp when the rate limit window resets |
| X-Token-Budget-Remaining | integer | Tokens remaining in your current plan window |
| retry-after | integer | Seconds to wait before retrying (on 429 responses) |
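One way to turn these headers into a wait time is sketched below. It prefers retry-after and falls back to X-RateLimit-Reset; that preference order and the helper name `seconds_until_retry` are assumptions made for the example, and the `now` parameter exists only to make the function easy to test.

```python
import time


def seconds_until_retry(headers, now=None):
    """Return how long to wait, in seconds, based on rate limit headers.

    Prefers retry-after (seconds) and falls back to X-RateLimit-Reset
    (Unix timestamp). Header names are matched case-insensitively.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    if "retry-after" in lowered:
        return max(0.0, float(lowered["retry-after"]))
    if "x-ratelimit-reset" in lowered:
        current = time.time() if now is None else now
        return max(0.0, float(lowered["x-ratelimit-reset"]) - current)
    return 0.0


print(seconds_until_retry({"retry-after": "3"}))  # 3.0
print(seconds_until_retry({"X-RateLimit-Reset": "1700000060"}, now=1700000000))  # 60.0
```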

Error codes

Token Factory uses standard HTTP status codes. The response shape matches the SDK that issued the request — OpenAI SDK calls return OpenAI-style error envelopes, Anthropic SDK calls return Anthropic-style envelopes.

| HTTP code | Error code | Description |
| --- | --- | --- |
| 200 | (none) | Success |
| 400 | bad_request | Invalid request parameters. Check error.message for details. |
| 401 | unauthorized | Invalid or missing API key. Check your Authorization or x-api-key header. |
| 403 | forbidden | Your plan does not have access to this model or endpoint. |
| 429 | rate_limit_exceeded | Too many requests. See X-RateLimit-Reset header for reset time. |
| 429 | token_budget_exceeded | Token window depleted. Wait for window reset or add pay-as-you-go top-up. |
| 500 | server_error | Internal server error. Retry with exponential backoff. |
| 503 | model_unavailable | The requested model is temporarily unavailable. Try another model. |

Cross-SDK error handling

Because both SDKs see the same Token Factory router, the HTTP status code is identical — only the SDK’s parsed error class differs. Handle by status, not by SDK type:

Python · works for both

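import time
import anthropic

try:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022", max_tokens=1024,
        messages=[{"role": "user", "content": "Hello!"}])
except anthropic.RateLimitError:  # raised on HTTP 429
    time.sleep(1)
    # retry here; with the OpenAI SDK, catch openai.RateLimitError the same way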
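To handle errors by status rather than by SDK class, a client can read the `status_code` attribute that both SDKs attach to their status errors. The sketch below is one possible shape, not the documented approach: the retryable set follows the error table above (429 and 500), and `FakeStatusError` is a stand-in used only so the example runs without either SDK installed.

```python
# Statuses the error table above suggests retrying. A 429 caused by
# token_budget_exceeded still needs a window reset or top-up first.
RETRYABLE_STATUSES = {429, 500}


def should_retry(exc):
    """Decide whether to retry, using the HTTP status carried by the exception."""
    status = getattr(exc, "status_code", None)
    return status in RETRYABLE_STATUSES


class FakeStatusError(Exception):
    """Stand-in for an SDK status error; real ones also expose status_code."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


print(should_retry(FakeStatusError(429)))  # True
print(should_retry(FakeStatusError(401)))  # False
```

Because the function never names an SDK exception class, the same helper works whichever client issued the request.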

OpenAI-compatible

Chat completions

Creates a completion for a chat conversation. Compatible with OpenAI’s chat.completions.create endpoint.

POST /openai/v1/chat/completions

Request body

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Required | The model ID to use. See model list. |
| messages | array | Required | Array of message objects with role (system/user/assistant) and content. |
| max_tokens | integer | Optional | Maximum tokens to generate. Default: model’s max context. |
| temperature | float | Optional | Sampling temperature 0–2. Higher = more random. Default: 1. |
| stream | boolean | Optional | Whether to stream partial tokens via SSE. Default: false. |
| top_p | float | Optional | Nucleus sampling probability. Default: 1. |
| response_format | object | Optional | Set {"type": "json_object"} to force JSON output. |
| tools | array | Optional | List of tools the model may call. See function calling. |

Example request

Python · OpenAI SDK

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2+2?"}
    ],
    temperature=0.7,
    max_tokens=256,
)

Anthropic-compatible

Messages

Creates a message in a conversation. Compatible with Anthropic’s messages.create endpoint — supports system, tools, tool_use, thinking, and stream.

POST /anthropic/v1/messages

Request body

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Required | The model ID (e.g. claude-3-5-sonnet-20241022). |
| messages | array | Required | Array of messages, each {"role": "user" or "assistant", "content": ...}. |
| max_tokens | integer | Required | Maximum tokens to generate. Must be > 0. |
| system | string | Optional | System prompt, passed as a top-level field rather than as a message. |
| temperature | float | Optional | Sampling temperature 0–1. Default: 1. |
| stream | boolean | Optional | Whether to stream partial tokens via SSE. Default: false. |
| tools | array | Optional | Tools the model may call. Schema: {name, description, input_schema}. |
| thinking | object | Optional | Enable extended thinking: {"type": "enabled", "budget_tokens": N}. |

Example request

Python · Anthropic SDK

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "What is 2+2?"}
    ],
)

print(message.content[0].text)
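When calling the endpoint over raw HTTP instead of through the SDK, the request body fields listed above can be assembled with a helper like the one below. It is a sketch under stated assumptions: `build_messages_payload` and its `thinking_budget` parameter are hypothetical names, and the only validation is the documented max_tokens > 0 rule.

```python
def build_messages_payload(model, messages, max_tokens, system=None, thinking_budget=None):
    """Assemble a request body for POST /anthropic/v1/messages from the documented fields."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be > 0")
    payload = {"model": model, "max_tokens": max_tokens, "messages": messages}
    if system is not None:
        # The system prompt is a top-level field, not an entry in messages.
        payload["system"] = system
    if thinking_budget is not None:
        payload["thinking"] = {"type": "enabled", "budget_tokens": thinking_budget}
    return payload


payload = build_messages_payload(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=1024,
    system="You are a helpful assistant.",
)
print(sorted(payload))  # ['max_tokens', 'messages', 'model', 'system']
```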

Streaming

Receive tokens as they’re generated using Server-Sent Events (SSE). Both SDKs support streaming with the same stream=true flag.

Python · OpenAI SDK

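stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # ⚡ UltraFast, 500 TPS
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)

for chunk in stream:
    # Some chunks carry no text (for example, the role-only first chunk)
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)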

Python · Anthropic SDK

with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Tell me a story"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

UltraFast + streaming is the ideal combination for real-time chat UIs, voice interfaces, and interactive code editors. GPT OSS 20B and Claude 3.5 Haiku can stream at 1,000+ tokens/second on the LPU tier.
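If you consume the stream without an SDK, the SSE body has to be parsed by hand. The sketch below assumes the OpenAI endpoint emits the standard OpenAI chat-completions stream shape ("data: {json}" lines ending with "data: [DONE]"); the helper name `iter_text_deltas` and the sample lines are illustrative, not captured from Token Factory.

```python
import json


def iter_text_deltas(sse_lines):
    """Yield text fragments from OpenAI-style SSE lines ("data: {...}")."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Once upon"}}]}',
    'data: {"choices": [{"delta": {"content": " a time"}}]}',
    "data: [DONE]",
]
print("".join(iter_text_deltas(sample)))  # Once upon a time
```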

Function calling

Enable models to call external functions and APIs. Both SDKs work — the tool schema is normalized on our router.

Python · OpenAI SDK

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }
}]

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B",
    messages=[{"role": "user", "content": "What's the weather in London?"}],
    tools=tools,
    tool_choice="auto",
)

Python · Anthropic SDK

tools = [{
    "name": "get_weather",
    "description": "Get weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
    }
}]

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "What's the weather in London?"}],
    tools=tools,
)
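The two tool schemas above differ only in nesting and field names, so code that defines tools once can convert between them. The sketch below shows one way to map the OpenAI shape to the Anthropic shape on the client side. It does not describe how the router normalizes tools, and `openai_tool_to_anthropic` is a hypothetical helper.

```python
def openai_tool_to_anthropic(tool):
    """Convert an OpenAI-style function tool into the Anthropic tool shape."""
    if tool.get("type") != "function":
        raise ValueError("only function tools are handled in this sketch")
    function = tool["function"]
    return {
        "name": function["name"],
        "description": function.get("description", ""),
        # OpenAI calls the JSON Schema "parameters"; Anthropic calls it "input_schema".
        "input_schema": function.get("parameters", {"type": "object", "properties": {}}),
    }


openai_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}
print(openai_tool_to_anthropic(openai_tool)["input_schema"]["required"])  # ['city']
```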

Embeddings

Create vector embeddings from text for semantic search, RAG, clustering, and similarity tasks. OpenAI-compatible only — use /openai/v1/embeddings.

POST /openai/v1/embeddings

Python · OpenAI SDK

response = client.embeddings.create(
    model="BAAI/bge-en-icl",
    input="Token Factory provides unified AI inference",
)

vector = response.data[0].embedding  # 4096-dim float array
print(len(vector))  # 4096
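For semantic search and similarity tasks, embedding vectors are commonly compared with cosine similarity. The sketch below is a dependency-free illustration using short made-up vectors in place of real 4096-dimensional embeddings; for large collections a vector library or database would be the usual choice.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


# Toy vectors standing in for response.data[i].embedding values.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
print(round(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 6))  # 1.0
```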

Speech recognition

Transcribe audio files to text. OpenAI-compatible only — use /openai/v1/audio/transcriptions. Supports MP3, WAV, M4A, WebM, MP4. Minimum billing: 10 seconds per request.

POST /openai/v1/audio/transcriptions

Python · OpenAI SDK

with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        language="en",  # optional
    )

print(transcript.text)

Whisper v3 Turbo is 5× cheaper ($0.05/hr) and 228× real-time. Use whisper-large-v3-turbo for bulk transcription.
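The 10-second minimum matters when transcribing many short clips. The sketch below estimates cost under two assumptions that the text does not confirm: billing is linear in seconds above the minimum, and no rounding is applied. The $0.05/hr rate is the Whisper v3 Turbo figure quoted above, and `transcription_cost` is a hypothetical helper.

```python
MIN_BILLED_SECONDS = 10  # minimum billing per request, from the text above


def transcription_cost(duration_seconds, rate_per_hour):
    """Estimate the cost of one transcription request in dollars."""
    billed_seconds = max(duration_seconds, MIN_BILLED_SECONDS)
    return billed_seconds / 3600 * rate_per_hour


# A 30-minute file at the quoted Turbo rate of $0.05 per hour.
print(f"{transcription_cost(1800, 0.05):.4f}")  # 0.0250

# A 4-second clip is billed as 10 seconds under the stated minimum.
print(transcription_cost(4, 0.05) == transcription_cost(10, 0.05))  # True
```

Batching short utterances into one file, where the use case allows it, avoids paying the minimum repeatedly.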

Text-to-speech

Convert text to natural-sounding speech. Priced per million characters. OpenAI-compatible only.

POST /openai/v1/audio/speech

Python · OpenAI SDK

response = client.audio.speech.create(
    model="canopylabs/orpheus-v1-english",
    input="Hello, welcome to Token Factory.",
    voice="natural",
)

response.stream_to_file("output.mp3")

Image generation

Generate images from text prompts. Priced per image. OpenAI-compatible only.

POST /openai/v1/images/generations

Python · OpenAI SDK

response = client.images.generate(
    model="black-forest-labs/flux-schnell",
    prompt="A futuristic AI server farm at sunset",
    size="1024x1024",
    n=1,
)

image_url = response.data[0].url

List models

Returns the list of all available models, their IDs, pricing, and capabilities. Works through both endpoints.

Python · OpenAI SDK

models = client.models.list()

for model in models.data:
    print(model.id, model.created)

Python · Anthropic SDK

# If your Anthropic SDK version doesn't expose a list endpoint, use raw HTTP
import httpx

response = httpx.get(
    "https://api.tokenfactory.ai/anthropic/v1/models",
    headers={"x-api-key": "tf-your-key", "anthropic-version": "2023-06-01"},
)

print(response.json())

Usage statistics

Query your token consumption and request counts programmatically. Aggregates usage across both OpenAI and Anthropic endpoints.

GET /v1/usage

Python

import os

import requests

api_key = os.environ["TOKENFACTORY_API_KEY"]

response = requests.get(
    "https://api.tokenfactory.ai/v1/usage",
    headers={"Authorization": f"Bearer {api_key}"},
    params={"period": "month", "surface": "all"},  # "openai", "anthropic", or "all"
)

print(response.json())

Cookbook

Ready-to-run recipes for common Token Factory use cases. All recipes work with both SDKs — pick the one already in your stack.

RAG pipeline

Build a production RAG system using BAAI embeddings + DeepSeek R1 for retrieval and reasoning.

View recipe →

AI chat interface

Real-time streaming chat UI with GPT OSS 120B UltraFast. Under 100ms time-to-first-token.

View recipe →

Voice assistant

End-to-end voice pipeline: Whisper (ASR) → Claude (LLM) → Orpheus (TTS). Full conversation in <2s.

View recipe →

Batch processing

Process thousands of documents asynchronously using Llama 3.1 8B for classification at scale.

View recipe →

Claude Code on Token Factory

Point Claude Code at our Anthropic-compatible base URL and run on Token Factory infra with no setup.

View recipe →

Cursor + Token Factory

Configure Cursor’s OpenAI provider to use our base URL — every Cursor model is a Token Factory model.

View recipe →

SDKs & libraries

Token Factory is compatible with both the OpenAI and Anthropic SDKs. No custom library needed — use whichever client your stack already speaks.

OpenAI SDK — Python

pip install openai → set base_url="https://api.tokenfactory.ai/openai/v1/"

OpenAI SDK — Node.js / TypeScript

npm install openai → set baseURL: "https://api.tokenfactory.ai/openai/v1/"

Anthropic SDK — Python

pip install anthropic → set base_url="https://api.tokenfactory.ai/anthropic/"

Anthropic SDK — Node.js / TypeScript

npm install @anthropic-ai/sdk → set baseURL: "https://api.tokenfactory.ai/anthropic/"

Claude Code · Cursor · Continue · Cline

Point your editor’s Anthropic provider at https://api.tokenfactory.ai/anthropic/ — works out of the box.

Vercel AI SDK · LangChain · LlamaIndex · LiteLLM

Use Token Factory as your LLM provider in any agent framework. Pass baseURL + apiKey to your LLM connector.

REST / cURL

All endpoints follow the OpenAI or Anthropic REST spec. Any HTTP client works.

Rate limits

Rate limits are unified across both OpenAI and Anthropic endpoints — a request to either surface counts against the same per-key budget.

| Plan | RPM | TPM | Token window |
| --- | --- | --- | --- |
| Test |  | 10K | 24 hours · 1M total |
| Token Pro | 120 | 100K | 5 hours · 3M total |
| Token Max | 500 | 500K | 5 hours · 5M total |
| Token Ultra | 2,000 |  | Monthly · 20M total |

Rate limit exceeded? Handle HTTP 429 responses with exponential backoff starting at 1s. On token budget exhaustion, wait for window reset or add a pay-as-you-go top-up in your Dashboard.
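The backoff advice above can be sketched as a delay schedule plus a retry loop. The 1-second starting delay comes from the text; the doubling factor, the 30-second cap, the retry count, and both function names are assumptions for illustration. Production code often adds random jitter, which is omitted here to keep the schedule deterministic.

```python
import time


def backoff_delays(max_retries, base=1.0, cap=30.0):
    """Return the wait in seconds before each retry: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


def call_with_backoff(request, should_retry, max_retries=5, sleep=time.sleep):
    """Call request(), retrying with exponential backoff while should_retry(exc) is true."""
    for delay in backoff_delays(max_retries):
        try:
            return request()
        except Exception as exc:
            if not should_retry(exc):
                raise
            sleep(delay)
    return request()  # final attempt; any error propagates to the caller


print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Here `request` would wrap an SDK call such as client.chat.completions.create(...), and `should_retry` would return true for HTTP 429 responses.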