Now live: DeepSeek R1-0528 reasoning · UltraFast LPU inference available in all plans


v1.0 — Live
OpenAI + Anthropic compatible

Token Factory API

A unified inference API for 60+ open-source and frontier models — from lightweight LLMs to image, audio, and embeddings. Switch the base URL in your existing OpenAI or Anthropic SDK and run on Token Factory infrastructure. No new SDK to learn, no migration to plan.

Now serving both OpenAI and Anthropic endpoints.
Use the openai SDK or the anthropic SDK against the same API key. Both translate to the same Token Factory router — your choice of client never changes pricing, performance, or model availability.

Choose your SDK

Both clients are first-class. Use whichever you already have in your stack — the rest of these docs cover both syntaxes side by side.

OpenAI SDK
Python · Node.js · Go · Rust

  • Drop-in compatible with /v1/chat/completions, /v1/embeddings, /v1/audio/*
  • Full support for tools, tool_choice, response_format, and stream
  • Works with LiteLLM, LangChain, LlamaIndex, Vercel AI SDK
  • Both openai & openai-agents libraries
  • pip install openai
    npm i openai

    Anthropic SDK
    Python · Node.js · Go · Rust

  • Drop-in compatible with /v1/messages and /v1/messages/batches
  • Full support for system, tools, tool_use, thinking, and stream
  • Works with Claude Code, Cursor, Continue, Cline
  • Both anthropic & @anthropic-ai/sdk libraries
  • pip install anthropic
    npm i @anthropic-ai/sdk

    Base URLs

    The same API key works for both. Pick the URL that matches your SDK:

    OpenAI: https://api.tokenfactory.ai/openai/v1/
    Anthropic: https://api.tokenfactory.ai/anthropic/ (the Anthropic SDK appends /v1/messages itself)

    Quick integration — pick your SDK






    Python · OpenAI SDK
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.tokenfactory.ai/openai/v1/",
        api_key="tf-your-api-key-here",
    )

    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1-0528",
        messages=[
            {"role": "user", "content": "Explain quantum entanglement"}
        ],
    )

    print(response.choices[0].message.content)

    JavaScript · OpenAI SDK
    import OpenAI from 'openai';

    const client = new OpenAI({
      baseURL: 'https://api.tokenfactory.ai/openai/v1/',
      apiKey: 'tf-your-api-key-here',
    });

    const response = await client.chat.completions.create({
      model: 'meta-llama/Llama-3.3-70B-Instruct',
      messages: [{ role: 'user', content: 'Hello!' }],
    });

    console.log(response.choices[0].message.content);

    cURL · OpenAI-compatible
    curl https://api.tokenfactory.ai/openai/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer tf-your-api-key-here" \
      -d '{
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'

    Python · Anthropic SDK
    import anthropic

    client = anthropic.Anthropic(
        base_url="https://api.tokenfactory.ai/anthropic/",
        api_key="tf-your-api-key-here",
    )

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": "Explain quantum entanglement"}
        ],
    )

    print(message.content[0].text)

    JavaScript · Anthropic SDK
    import Anthropic from '@anthropic-ai/sdk';

    const client = new Anthropic({
      baseURL: 'https://api.tokenfactory.ai/anthropic/',
      apiKey: 'tf-your-api-key-here',
    });

    const message = await client.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 1024,
      messages: [{ role: 'user', content: 'Hello!' }],
    });

    console.log(message.content[0].text);

    cURL · Anthropic-compatible
    curl https://api.tokenfactory.ai/anthropic/v1/messages \
      -H "Content-Type: application/json" \
      -H "x-api-key: tf-your-api-key-here" \
      -H "anthropic-version: 2023-06-01" \
      -d '{
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": "Hello!"}]
      }'

    Key concepts

    API keys

    All requests require an API key, sent as a Bearer token on the OpenAI endpoint or in the x-api-key header on the Anthropic endpoint. Get your key from the Dashboard. Keys start with tf-.


    Token billing

    Billed per token — input and output separately. Check the Models page for per-model rates.

    UltraFast

    Models with the UltraFast flag run on LPU hardware — up to 1,200 tokens/sec. Use for latency-critical apps.

    Rate limits

    Limits vary by plan. Check X-RateLimit-* and retry-after headers in responses.
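To make the token billing rule above concrete, here is a minimal sketch of a per-request cost estimate. It assumes rates are quoted in USD per million tokens, which these docs do not state, and `request_cost_usd` is a hypothetical helper rather than part of either SDK. Real rates come from the Models page.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_m: float,
    output_rate_per_m: float,
) -> float:
    """Estimate the cost of one request, billing input and output tokens separately.

    Rates are assumed to be USD per one million tokens.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * input_rate_per_m + output_tokens * output_rate_per_m
    return total / 1_000_000
```

With the OpenAI SDK the counts are available as `response.usage.prompt_tokens` and `response.usage.completion_tokens`; with the Anthropic SDK, as `message.usage.input_tokens` and `message.usage.output_tokens`.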

    Featured
    Updated 2026-05-12

    OpenAI & Anthropic compatibility

    Token Factory serves both the OpenAI and Anthropic API surfaces from the same router. The two are not separate products — they’re two doors into the same model pool, same pricing, same API key, same usage dashboard. Use whichever client your stack already speaks.

    Endpoint comparison

    Capability | OpenAI endpoint | Anthropic endpoint | Status
    Chat / Messages | /openai/v1/chat/completions | /anthropic/v1/messages | Live
    Streaming (SSE) | stream=true | stream=true | Live
    Function calling | tools + tool_choice | tools + tool_use blocks | Live
    JSON mode | response_format | tool use + schema | Live
    Vision / Images | image_url content | image content blocks | Live
    Batch | /v1/batches | /v1/messages/batches | Live
    Embeddings | /openai/v1/embeddings | n/a (use OpenAI) | OpenAI only
    Audio (ASR / TTS) | /openai/v1/audio/* | n/a (use OpenAI) | OpenAI only
    Image generation | /openai/v1/images/* | n/a (use OpenAI) | OpenAI only

    Model mapping

    Both SDKs can call any Token Factory model — you reference the same model string regardless of which client you use:

    Model | OpenAI alias | Anthropic alias | Notes
    Claude 3.5 Sonnet | claude-3-5-sonnet-20241022 | claude-3-5-sonnet-20241022 | Both
    Claude 3.5 Haiku | claude-3-5-haiku-20241022 | claude-3-5-haiku-20241022 | Both
    DeepSeek R1 | deepseek-ai/DeepSeek-R1-0528 | deepseek-ai/DeepSeek-R1-0528 | Both
    GPT OSS 120B | openai/gpt-oss-120b | openai/gpt-oss-120b | Both · ⚡ UltraFast
    Llama 3.3 70B | meta-llama/Llama-3.3-70B-Instruct | meta-llama/Llama-3.3-70B-Instruct | Both
    Qwen3 235B | Qwen/Qwen3-235B-A22B | Qwen/Qwen3-235B-A22B | Both

    Headers & auth

    OpenAI: Authorization: Bearer tf-your-key-here
    Anthropic: x-api-key: tf-your-key-here  ·  anthropic-version: 2023-06-01
    One key, two surfaces. A single tf- API key authorizes both endpoints. Usage, billing, and rate limits are aggregated across both surfaces in your dashboard.
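For raw HTTP clients, the two header conventions above can be captured in one small function. This is a sketch only: `auth_headers` and the surface names "openai" and "anthropic" are hypothetical, chosen to mirror the URL prefixes used in these docs.

```python
from typing import Dict


def auth_headers(surface: str, api_key: str) -> Dict[str, str]:
    """Build the auth headers for one surface using the same tf- key."""
    if surface == "openai":
        # OpenAI-style endpoints expect a Bearer token.
        return {"Authorization": f"Bearer {api_key}"}
    if surface == "anthropic":
        # Anthropic-style endpoints expect x-api-key plus a version header.
        return {"x-api-key": api_key, "anthropic-version": "2023-06-01"}
    raise ValueError(f"unknown surface: {surface!r}")
```

Both official SDKs set these headers for you; the helper is only useful with clients such as `requests` or `httpx`.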

    Quickstart

    First response in under 2 minutes. Pick the SDK you already have.

    Prerequisites: Python 3.8+ or Node.js 18+. You’ll need an API key from your Dashboard.

    Step 1 — Install the SDK


    Bash · OpenAI SDK
    # Python
    pip install openai

    # Node.js
    npm install openai

    Bash · Anthropic SDK
    # Python
    pip install anthropic

    # Node.js
    npm install @anthropic-ai/sdk

    Step 2 — Set your API key

    Bash
    export TOKENFACTORY_API_KEY="tf-your-key-here"

    Step 3 — Make your first request


    Python · OpenAI SDK
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.tokenfactory.ai/openai/v1/",
        api_key=os.environ["TOKENFACTORY_API_KEY"],
    )

    response = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": "Write a haiku about tokens"}],
        max_tokens=100,
    )

    print(response.choices[0].message.content)
    # Tokens flow like streams,
    # Each word a unit of thought,
    # Models breathe them in.

    Python · Anthropic SDK
    import os
    import anthropic

    client = anthropic.Anthropic(
        base_url="https://api.tokenfactory.ai/anthropic/",
        api_key=os.environ["TOKENFACTORY_API_KEY"],
    )

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        messages=[{"role": "user", "content": "Write a haiku about tokens"}],
    )

    print(message.content[0].text)
    # Tokens flow like streams,
    # Each word a unit of thought,
    # Models breathe them in.

    Choose a model

    Select from 60+ models. Click any to copy the model ID — works with both SDKs.

    deepseek-ai/DeepSeek-R1-0528
    meta-llama/Llama-3.3-70B-Instruct
    openai/gpt-oss-120b
    Qwen/Qwen3-235B-A22B
    claude-3-5-sonnet-20241022
    mistralai/Mistral-Nemo-Instruct-2407
    meta-llama/Llama-3.1-8B-Instruct
    claude-3-5-haiku-20241022

    That’s it. You’re now running on Token Factory via either SDK. Browse all 60+ models →

    Authentication

    A single tf- API key authorizes both the OpenAI and Anthropic endpoints. Pass it in whichever header your SDK expects.

    OpenAI: Authorization: Bearer tf-your-key-here
    Anthropic: x-api-key: tf-your-key-here  ·  anthropic-version: 2023-06-01
    Never expose your API key in client-side JavaScript, public repositories, or logs. Rotate compromised keys immediately in your Dashboard.

    Key format

    All Token Factory API keys follow the format tf-{environment}-{random}:

    Examples
    # Production key
    tf-prod-xk2m4r9q1a2b3c4d5e6f7g8h9i0j1k2l3m

    # Staging key
    tf-stag-m7p3n8q2b5c6d7e8f9g0h1i2j3k4l5m6n

    # Development key
    tf-dev-r1n9s4t5u6v7w8x9y0z1a2b3c4d5e6f7g
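A quick client-side check can catch a mis-pasted key before any request is sent. The sketch below assumes the environment segment is one of prod, stag, or dev (the three shown above) and that the random segment is lowercase alphanumeric. Neither rule is confirmed by these docs, and both helper names are hypothetical.

```python
import re

# Assumed pattern, inferred only from the example keys above.
_TF_KEY = re.compile(r"tf-(prod|stag|dev)-[a-z0-9]+")


def looks_like_tf_key(key: str) -> bool:
    """Return True if the key matches the assumed tf-{environment}-{random} shape."""
    return _TF_KEY.fullmatch(key.strip()) is not None


def key_environment(key: str) -> str:
    """Return the environment segment, or raise ValueError for a malformed key."""
    match = _TF_KEY.fullmatch(key.strip())
    if match is None:
        raise ValueError("not a tf-{environment}-{random} key")
    return match.group(1)
```

A passing check says nothing about whether the key is valid or active; only the API can confirm that.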

    Response headers

    Every API response includes rate limit information in headers (identical schema for both endpoints):

    Header | Type | Description
    X-RateLimit-Limit | integer | Maximum requests allowed per minute for your plan
    X-RateLimit-Remaining | integer | Requests remaining in current window
    X-RateLimit-Reset | timestamp | Unix timestamp when the rate limit window resets
    X-Token-Budget-Remaining | integer | Tokens remaining in your current plan window
    retry-after | integer | Seconds to wait before retrying (on 429 responses)
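The headers above are enough to decide how long a client should pause. The following sketch takes a plain mapping of header names to string values and returns a wait in seconds. `seconds_to_wait` is a hypothetical helper, and it assumes the header names are spelled exactly as in the table (HTTP libraries such as `httpx` and `requests` expose case-insensitive header objects, a plain dict does not).

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Suggest how long to pause before the next request, based on response headers."""
    now = time.time() if now is None else now
    # retry-after is only sent on 429 responses; prefer it when present.
    if "retry-after" in headers:
        return max(0.0, float(headers["retry-after"]))
    # If requests remain in the current window, no pause is needed.
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    # Window exhausted: wait until the Unix timestamp in X-RateLimit-Reset.
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)
```

Passing `now` explicitly keeps the function deterministic in tests; in production code the default of `time.time()` is usually what you want.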

    Error codes

    Token Factory uses standard HTTP status codes. The response shape matches the SDK that issued the request — OpenAI SDK calls return OpenAI-style error envelopes, Anthropic SDK calls return Anthropic-style envelopes.

    HTTP Code | Error code | Description
    200 | (none) | Success
    400 | bad_request | Invalid request parameters. Check error.message for details.
    401 | unauthorized | Invalid or missing API key. Check your Authorization or x-api-key header.
    403 | forbidden | Your plan does not have access to this model or endpoint.
    429 | rate_limit_exceeded | Too many requests. See X-RateLimit-Reset header for reset time.
    429 | token_budget_exceeded | Token window depleted. Wait for window reset or add pay-as-you-go top-up.
    500 | server_error | Internal server error. Retry with exponential backoff.
    503 | model_unavailable | The requested model is temporarily unavailable. Try another model.
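The table above implies a small decision procedure: some errors are worth retrying, others need a different model or a corrected request. One way to encode it is sketched below. The action labels and the `next_action` helper are hypothetical, and treating a 429 with an unrecognized error code as retryable is an assumption.

```python
def next_action(status: int, error_code: str = "") -> str:
    """Map an HTTP status and error code to a coarse client-side action."""
    if status == 200:
        return "ok"
    if status == 429:
        # The two 429 variants call for different handling.
        if error_code == "token_budget_exceeded":
            return "wait_for_window_or_top_up"
        return "retry_with_backoff"
    if status == 500:
        return "retry_with_backoff"
    if status == 503:
        return "try_another_model"
    # 400, 401, 403: resending the same request will not help.
    return "fix_request"
```

Because the status codes are the same on both surfaces, the function does not need to know which SDK issued the request.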

    Cross-SDK error handling

    Because both SDKs see the same Token Factory router, the HTTP status code is identical — only the SDK’s parsed error class differs. Handle by status, not by SDK type:

    Python · works for both
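    import time

    try:
        response = client.messages.create(model="claude-3-5-sonnet-20241022", max_tokens=1024, messages=[{"role": "user", "content": "Hello!"}])
    except Exception as err:
        # Both SDKs expose the HTTP status on their API status errors as .status_code.
        if getattr(err, "status_code", None) != 429:
            raise
        time.sleep(1)  # back off, then retry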

    OpenAI-compatible

    Chat completions

    Creates a completion for a chat conversation. Compatible with OpenAI’s chat.completions.create endpoint.

    POST /openai/v1/chat/completions

    Request body

    Parameter | Type | Required | Description
    model | string | Required | The model ID to use. See model list.
    messages | array | Required | Array of message objects with role (system/user/assistant) and content.
    max_tokens | integer | Optional | Maximum tokens to generate. Default: model’s max context.
    temperature | float | Optional | Sampling temperature 0–2. Higher = more random. Default: 1.
    stream | boolean | Optional | Whether to stream partial tokens via SSE. Default: false.
    top_p | float | Optional | Nucleus sampling probability. Default: 1.
    response_format | object | Optional | Set {"type": "json_object"} to force JSON output.
    tools | array | Optional | List of tools the model may call. See function calling.

    Example request

    Python · OpenAI SDK
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1-0528",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is 2+2?"},
        ],
        temperature=0.7,
        max_tokens=256,
    )

    Anthropic-compatible

    Messages

    Creates a message in a conversation. Compatible with Anthropic’s messages.create endpoint — supports system, tools, tool_use, thinking, and stream.

    POST /anthropic/v1/messages

    Request body

    Parameter | Type | Required | Description
    model | string | Required | The model ID (e.g. claude-3-5-sonnet-20241022).
    messages | array | Required | Array of {"role": "user"|"assistant", "content": ...} messages.
    max_tokens | integer | Required | Maximum tokens to generate. Must be > 0.
    system | string | Optional | System prompt, passed as a top-level field, not a message.
    temperature | float | Optional | Sampling temperature 0–1. Default: 1.
    stream | boolean | Optional | Whether to stream partial tokens via SSE. Default: false.
    tools | array | Optional | Tools the model may call. Schema: {name, description, input_schema}.
    thinking | object | Optional | Enable extended thinking: {"type": "enabled", "budget_tokens": N}.

    Example request

    Python · Anthropic SDK
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=[
            {"role": "user", "content": "What is 2+2?"}
        ],
    )

    print(message.content[0].text)
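Since the Messages endpoint takes the system prompt as a top-level field while the chat completions endpoint takes it as a message, code that moves between the two surfaces needs a small conversion. A minimal sketch follows; `split_system` is a hypothetical helper and it assumes every system message has plain string content.

```python
from typing import Any, Dict, List, Tuple


def split_system(messages: List[Dict[str, Any]]) -> Tuple[str, List[Dict[str, Any]]]:
    """Separate OpenAI-style system messages from the user/assistant turns.

    Returns (system_prompt, remaining_messages). Multiple system messages
    are joined with a blank line between them.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    return "\n\n".join(system_parts), chat
```

The result plugs into the Anthropic call as `client.messages.create(model=..., max_tokens=..., system=system, messages=chat)`. When there is no system message the returned prompt is an empty string, in which case the `system` argument is better left out.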

    Streaming

    Receive tokens as they’re generated using Server-Sent Events (SSE). Both SDKs support streaming with the same stream=true flag.


    Python · OpenAI SDK
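    stream = client.chat.completions.create(
        model="openai/gpt-oss-120b",  # ⚡ UltraFast, 500 TPS
        messages=[{"role": "user", "content": "Tell me a story"}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)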

    Python · Anthropic SDK
    with client.messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Tell me a story"}],
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)

    UltraFast + streaming is the ideal combination for real-time chat UIs, voice interfaces, and interactive code editors. GPT OSS 20B and Claude 3.5 Haiku can stream at 1,000+ tokens/second on the LPU tier.
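If you consume the stream without an SDK, you read the SSE lines yourself. The sketch below assumes the OpenAI-compatible endpoint emits the same wire format as OpenAI's own API (`data: {json}` lines ending with `data: [DONE]`), which these docs imply but do not spell out. `iter_text_deltas` is a hypothetical helper and the sample lines are hand-written, not captured output.

```python
import json
from typing import Iterable, Iterator


def iter_text_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines ("data: {...}")."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Once"}}]}',
    'data: {"choices": [{"delta": {"content": " upon"}}]}',
    "data: [DONE]",
]
print("".join(iter_text_deltas(sample)))  # prints: Once upon
```

The Anthropic surface uses differently named events, so this parser applies to the OpenAI-style stream only.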

    Function calling

    Enable models to call external functions and APIs. Both SDKs work — the tool schema is normalized on our router.


    Python · OpenAI SDK
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B",
        messages=[{"role": "user", "content": "What's the weather in London?"}],
        tools=tools,
        tool_choice="auto",
    )

    Python · Anthropic SDK
    tools = [{
        "name": "get_weather",
        "description": "Get weather for a city",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }]

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": "What's the weather in London?"}],
        tools=tools,
    )
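The two tool definitions above differ only in shape: the OpenAI form nests the function under "function" with a "parameters" schema, while the Anthropic form is flat with an "input_schema". If you keep one set of definitions and target both SDKs, a client-side conversion along these lines may help. It illustrates the field mapping only and is not a description of how the Token Factory router normalizes schemas; `openai_tool_to_anthropic` is a hypothetical helper.

```python
from typing import Any, Dict


def openai_tool_to_anthropic(tool: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one OpenAI-style function tool into the Anthropic tool shape."""
    if tool.get("type") != "function":
        raise ValueError("only function tools can be converted")
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Fall back to an empty object schema when no parameters are declared.
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }
```

Applied to the `get_weather` tool from the OpenAI example, this yields the same dictionary as the Anthropic example.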

    Embeddings

    Create vector embeddings from text for semantic search, RAG, clustering, and similarity tasks. OpenAI-compatible only — use /openai/v1/embeddings.

    POST /openai/v1/embeddings
    Python · OpenAI SDK
    response = client.embeddings.create(
        model="BAAI/bge-en-icl",
        input="Token Factory provides unified AI inference",
    )

    vector = response.data[0].embedding  # 4096-dim float array
    print(len(vector))  # 4096
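For the semantic search and similarity tasks mentioned above, embeddings are usually compared with cosine similarity. The sketch below uses only the standard library and tiny hand-made vectors so it runs without an API call; `cosine_similarity` and `rank` are hypothetical helpers, and returning 0.0 for a zero vector is a design choice rather than a standard.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity is undefined for a zero vector; treat as unrelated
    return dot / (norm_a * norm_b)


def rank(query: Sequence[float], docs: Sequence[Sequence[float]]) -> List[int]:
    """Return document indices ordered from most to least similar to the query."""
    return sorted(
        range(len(docs)),
        key=lambda i: cosine_similarity(query, docs[i]),
        reverse=True,
    )
```

In practice `query` and each entry of `docs` would be `response.data[i].embedding` values from the call above. For large collections a vector index or NumPy is a better fit than pure Python.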

    Speech recognition

    Transcribe audio files to text. OpenAI-compatible only — use /openai/v1/audio/transcriptions. Supports MP3, WAV, M4A, WebM, MP4. Minimum billing: 10 seconds per request.

    POST /openai/v1/audio/transcriptions
    Python · OpenAI SDK
    with open("audio.mp3", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio_file,
            language="en",  # optional
        )
    print(transcript.text)

    Whisper v3 Turbo is 5× cheaper ($0.05/hr) and 228× real-time. Use whisper-large-v3-turbo for bulk transcription.
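The 10-second minimum matters when you transcribe many short clips, because each request is billed separately. A rough estimate could look like the sketch below. It assumes billing is exactly `max(10, duration)` seconds with no rounding to whole seconds, which these docs do not specify; both function names are hypothetical.

```python
from typing import Iterable


def billed_seconds(duration_s: float, minimum_s: float = 10.0) -> float:
    """Seconds billed for one request, applying the per-request minimum."""
    if duration_s < 0:
        raise ValueError("duration must be non-negative")
    return max(minimum_s, duration_s)


def transcription_cost_usd(durations_s: Iterable[float], usd_per_hour: float) -> float:
    """Estimated cost when each clip is sent as its own request."""
    total_seconds = sum(billed_seconds(d) for d in durations_s)
    return total_seconds / 3600.0 * usd_per_hour
```

For example, three 2-second clips sent separately are billed as 30 seconds, not 6, so batching short utterances into one file can lower the bill.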

    Text-to-speech

    Convert text to natural-sounding speech. Priced per million characters. OpenAI-compatible only.

    POST /openai/v1/audio/speech
    Python · OpenAI SDK
    response = client.audio.speech.create(
        model="canopylabs/orpheus-v1-english",
        input="Hello, welcome to Token Factory.",
        voice="natural",
    )
    response.stream_to_file("output.mp3")

    Image generation

    Generate images from text prompts. Priced per image. OpenAI-compatible only.

    POST /openai/v1/images/generations
    Python · OpenAI SDK
    response = client.images.generate(
        model="black-forest-labs/flux-schnell",
        prompt="A futuristic AI server farm at sunset",
        size="1024x1024",
        n=1,
    )
    image_url = response.data[0].url

    List models

    Returns the list of all available models, their IDs, pricing, and capabilities. Works through both endpoints.


    Python · OpenAI SDK
    models = client.models.list()
    for model in models.data:
        print(model.id, model.created)

    Python · Anthropic SDK
    # Older Anthropic SDK releases have no models.list(); raw HTTP works with any version
    import httpx

    response = httpx.get(
        "https://api.tokenfactory.ai/anthropic/v1/models",
        headers={"x-api-key": "tf-your-key", "anthropic-version": "2023-06-01"},
    )
    print(response.json())

    Usage statistics

    Query your token consumption and request counts programmatically. Aggregates usage across both OpenAI and Anthropic endpoints.

    GET /v1/usage
    Python
    import os
    import requests

    response = requests.get(
        "https://api.tokenfactory.ai/v1/usage",
        headers={"Authorization": f"Bearer {os.environ['TOKENFACTORY_API_KEY']}"},
        params={"period": "month", "surface": "all"},  # "openai", "anthropic", or "all"
    )
    print(response.json())

    Cookbook

    Ready-to-run recipes for common Token Factory use cases. All recipes work with both SDKs — pick the one already in your stack.

    RAG pipeline

    Build a production RAG system using BAAI embeddings + DeepSeek R1 for retrieval and reasoning.

    View recipe →

    AI chat interface

    Real-time streaming chat UI with GPT OSS 120B UltraFast. Under 100ms time-to-first-token.

    View recipe →

    Voice assistant

    End-to-end voice pipeline: Whisper (ASR) → Claude (LLM) → Orpheus (TTS). Full conversation in <2s.

    View recipe →

    Batch processing

    Process thousands of documents asynchronously using Llama 3.1 8B for classification at scale.

    View recipe →

    Claude Code on Token Factory

    Point Claude Code at our Anthropic-compatible base URL and run on Token Factory infra with no setup.

    View recipe →

    Cursor + Token Factory

    Configure Cursor’s OpenAI provider to use our base URL — every Cursor model is a Token Factory model.

    View recipe →

    SDKs & libraries

    Token Factory is compatible with both the OpenAI and Anthropic SDKs. No custom library needed — use whichever client your stack already speaks.


    OpenAI SDK — Python

    pip install openai → set base_url="https://api.tokenfactory.ai/openai/v1/"


    OpenAI SDK — Node.js / TypeScript

    npm install openai → set baseURL: "https://api.tokenfactory.ai/openai/v1/"


    Anthropic SDK — Python

    pip install anthropic → set base_url="https://api.tokenfactory.ai/anthropic/"


    Anthropic SDK — Node.js / TypeScript

    npm install @anthropic-ai/sdk → set baseURL: "https://api.tokenfactory.ai/anthropic/"

    Claude Code · Cursor · Continue · Cline

    Point your editor’s Anthropic provider at https://api.tokenfactory.ai/anthropic/ — works out of the box.


    Vercel AI SDK · LangChain · LlamaIndex · LiteLLM

    Use Token Factory as your LLM provider in any agent framework. Pass baseURL + apiKey to your LLM connector.


    REST / cURL

    All endpoints follow the OpenAI or Anthropic REST spec. Any HTTP client works.

    Rate limits

    Rate limits are unified across both OpenAI and Anthropic endpoints — a request to either surface counts against the same per-key budget.

    Plan | RPM | TPM | Token window
    Test | 30 | 10K | 24 hours · 1M total
    Token Pro | 120 | 100K | 5 hours · 3M total
    Token Max | 500 | 500K | 5 hours · 5M total
    Token Ultra | 2,000 | 2M | Monthly · 20M total
    Rate limit exceeded? Handle HTTP 429 responses with exponential backoff starting at 1s. On token budget exhaustion, wait for window reset or add a pay-as-you-go top-up in your Dashboard.
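The backoff advice above (exponential, starting at 1s) could be sketched as a small retry wrapper. The 30-second cap, the default of five retries, and the `is_rate_limited` predicate are assumptions made for illustration; both function names are hypothetical. The `sleep` parameter exists so the wrapper can be tested without real delays.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> Iterator[float]:
    """Yield base, 2*base, 4*base, ... (each capped at `cap`), one delay per retry."""
    for attempt in range(retries):
        yield min(cap, base * (2 ** attempt))


def call_with_backoff(
    fn: Callable[[], T],
    is_rate_limited: Callable[[Exception], bool],
    retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff while it fails with a rate-limit error."""
    for delay in backoff_delays(retries):
        try:
            return fn()
        except Exception as err:
            if not is_rate_limited(err):
                raise  # not a rate-limit problem: surface it immediately
            sleep(delay)
    return fn()  # final attempt; any error propagates to the caller
```

A typical call might be `call_with_backoff(lambda: client.chat.completions.create(...), lambda e: getattr(e, "status_code", None) == 429)`. Adding random jitter to each delay is a common refinement when many clients share one key. For token budget exhaustion, retrying will not help until the window resets.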