API Documentation
One base URL, all major LLM providers. OpenAI-compatible protocol means your existing code works instantly.
Base URL: https://api.chuizi.ai/v1

Quick Start
Get up and running with chuizi.ai in under a minute. Use any OpenAI-compatible SDK — just change the base URL and API key.
Create an account
Sign up at chuizi.ai and add credits to your balance.
Generate an API key
Go to Console > API Keys and create a new key. Copy it immediately — it is shown only once.
Make your first request
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.chuizi.ai/v1",
    api_key="ck-your-key-here",
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
Authentication
All API requests require a valid API key. Keys use the ck- prefix followed by 32 alphanumeric characters (e.g. ck-a1b2c3d4e5f6...). We store only the SHA-256 hash of each key — the full key is shown exactly once at creation time.
chuizi.ai accepts both authentication header formats. Use whichever your SDK expects:
```
# OpenAI convention (works with all protocols)
Authorization: Bearer ck-your-key-here

# Anthropic convention (also accepted on all protocols)
x-api-key: ck-your-key-here
```
Model Naming
All models use the provider/model format (e.g. anthropic/claude-sonnet-4-6). This avoids naming collisions across providers and makes routing explicit. Bare model names (e.g. claude-sonnet-4-6) are accepted as aliases for convenience.
| Full Name | Alias | Provider |
|---|---|---|
| anthropic/claude-opus-4-6 | claude-opus-4-6 | Anthropic |
| anthropic/claude-opus-4 | claude-opus-4 | Anthropic |
| anthropic/claude-sonnet-4-6 | claude-sonnet-4-6 | Anthropic |
| anthropic/claude-sonnet-4 | claude-sonnet-4 | Anthropic |
| anthropic/claude-haiku-4-5 | claude-haiku-4-5 | Anthropic |
| openai/gpt-4o | gpt-4o | OpenAI |
| openai/gpt-4.1 | gpt-4.1 | OpenAI |
| openai/gpt-4.1-mini | gpt-4.1-mini | OpenAI |
| openai/o3 | o3 | OpenAI |
| openai/o4-mini | o4-mini | OpenAI |
| google/gemini-2.5-pro | gemini-2.5-pro | Google |
| google/gemini-2.5-flash | gemini-2.5-flash | Google |
| deepseek/deepseek-chat | deepseek-chat | DeepSeek |
| deepseek/deepseek-reasoner | deepseek-reasoner | DeepSeek |
This is a representative sample. See the full model list for all available models with real-time pricing.
OpenAI-Compatible Protocol
The /v1/* endpoints implement the full OpenAI Chat Completions API. Any SDK or tool that works with OpenAI will work with chuizi.ai — just change the base URL.
| Endpoint | Description |
|---|---|
| /v1/chat/completions | Chat completions (streaming + non-streaming). Supports tools, vision, JSON mode. |
| /v1/models | List all available models with pricing and capability metadata. |
| /v1/generation?id=gen-xxx | Query billing details for a specific request (see Billing section). |
| /v1/key/info | Current key info: balance, limits, usage (see Key Management section). |
Request Schema
{ "model": "anthropic/claude-sonnet-4-6", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Explain quantum computing in one paragraph."} ], "temperature": 0.7, "max_tokens": 1024, "stream": false, "stream_options": {"include_usage": true}, "tools": [], "response_format": {"type": "text"} }
Response Schema
Responses follow the standard OpenAI format. chuizi.ai adds an x_chuizi extension field with request metadata:
{ "id": "chatcmpl-abc123", "object": "chat.completion", "created": 1700000000, "model": "anthropic/claude-sonnet-4-6", "choices": [{ "index": 0, "message": { "role": "assistant", "content": "Quantum computing leverages quantum mechanical phenomena..." }, "finish_reason": "stop" }], "usage": { "prompt_tokens": 24, "completion_tokens": 89, "total_tokens": 113 }, "x_chuizi": { "generation_id": "gen-abc123", "provider": "anthropic", "latency_ms": 1230, "cost": "0.00045600" } }
Anthropic Native Protocol
The /anthropic/* endpoints provide a near-passthrough proxy to the Anthropic Messages API. Request and response bodies are forwarded with minimal transformation — we only swap the auth key and extract usage for billing. This means full compatibility with Claude Code, Cursor, Cline, and any tool that speaks the Anthropic API natively.
| Endpoint | Description |
|---|---|
| /anthropic/v1/messages | Anthropic Messages API (streaming + non-streaming). Full feature parity. |
| /anthropic/v1/models | List Anthropic models available through chuizi.ai. |
Claude Code Setup (2 lines)
Add these to your shell profile and restart your terminal. Claude Code works instantly — no format conversion, no SDK changes.
```shell
export ANTHROPIC_BASE_URL=https://api.chuizi.ai/anthropic
export ANTHROPIC_API_KEY=ck-your-key-here
```
Native Request Example
Requests use the standard Anthropic Messages API format. Include the anthropic-version header (required by the Anthropic API). Both x-api-key and Authorization: Bearer headers are accepted.
```shell
curl https://api.chuizi.ai/anthropic/v1/messages \
  -H "x-api-key: ck-YOUR-KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Anthropic Response Format
{ "id": "gen-abc123", "type": "message", "role": "assistant", "content": [ {"type": "text", "text": "Hello! How can I help you today?"} ], "model": "claude-sonnet-4-6", "stop_reason": "end_turn", "usage": { "input_tokens": 12, "output_tokens": 14, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 0 } }
Anthropic SSE Streaming Events
When "stream": true is set, the response uses Anthropic's native SSE event types. These are passed through unmodified:
| Event Type | Description |
|---|---|
| message_start | Initial message object with model and usage metadata. |
| content_block_start | Start of a content block (text or tool_use). |
| content_block_delta | Incremental text delta within a content block. |
| content_block_stop | End of the current content block. |
| message_delta | Final usage stats (output tokens) and stop_reason. |
| message_stop | Stream is complete. No more events will follow. |
| ping | Keepalive event. Ignore in application logic. |
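If you consume the stream without an SDK, the events above can be folded with a small dispatcher keyed on the event type. A minimal sketch that collects text deltas only (tool_use deltas and usage tracking are omitted; the function name is illustrative):

```python
import json

def handle_anthropic_sse(lines):
    """Collect assistant text from raw Anthropic SSE lines
    ("event: ..." / "data: ..." pairs)."""
    text = []
    event = None
    for line in lines:
        line = line.strip()
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            data = json.loads(line.split(":", 1)[1])
            if event == "content_block_delta" and data["delta"].get("type") == "text_delta":
                text.append(data["delta"]["text"])
            elif event == "message_stop":
                break  # stream is complete
            # message_start, content_block_start/stop, message_delta, ping: ignored here
    return "".join(text)

sample = [
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, '
    '"delta": {"type": "text_delta", "text": "Hello!"}}',
    "event: message_stop",
    'data: {"type": "message_stop"}',
]
print(handle_anthropic_sse(sample))  # → Hello!
```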
Gemini Native Protocol
The /gemini/* endpoints provide a native Google Gemini API proxy. Requests and responses are forwarded directly to Google AI — we only swap the auth key and extract usage for billing. Ideal for developers using the Google AI SDK.
| Endpoint | Description |
|---|---|
| /gemini/v1beta/models/*:generateContent | Gemini content generation (non-streaming). |
| /gemini/v1beta/models/*:streamGenerateContent | Gemini streaming content generation. |
| /gemini/v1beta/models | List available Gemini models. |
Request Example
Requests use the standard Gemini API format. Authentication is provided via x-api-key or Authorization: Bearer header (using your chuizi.ai ck- key, not a Google API key).
```shell
curl https://api.chuizi.ai/gemini/v1beta/models/gemini-2.5-pro:generateContent \
  -H "x-api-key: ck-YOUR-KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {"role": "user", "parts": [{"text": "Explain quantum computing in one paragraph."}]}
    ],
    "generationConfig": {
      "temperature": 0.7,
      "maxOutputTokens": 1024
    }
  }'
```

Response Format
{ "candidates": [ { "content": { "parts": [{"text": "Quantum computing harnesses quantum mechanical phenomena..."}], "role": "model" }, "finishReason": "STOP" } ], "usageMetadata": { "promptTokenCount": 12, "candidatesTokenCount": 89, "totalTokenCount": 101 } }
Streaming Generation
Use the streamGenerateContent endpoint with the ?alt=sse query parameter to receive Server-Sent Events streaming responses.
```shell
curl -N "https://api.chuizi.ai/gemini/v1beta/models/gemini-2.5-pro:streamGenerateContent?alt=sse" \
  -H "x-api-key: ck-YOUR-KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {"role": "user", "parts": [{"text": "Write a haiku about coding."}]}
    ]
  }'
```

SDK Examples
```python
from google import genai

client = genai.Client(
    api_key="ck-your-key-here",
    http_options={"api_version": "v1beta", "base_url": "https://api.chuizi.ai/gemini"},
)

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Explain quantum computing in one paragraph.",
)
print(response.text)
```
Streaming
Streaming delivers response tokens as they are generated via Server-Sent Events (SSE). This dramatically reduces time-to-first-token and enables real-time UI updates.
OpenAI Streaming
Set "stream": true in your request body. To receive token usage in the final chunk, add "stream_options": {"include_usage": true} — this is critical for tracking costs.
{ "model": "anthropic/claude-sonnet-4-6", "messages": [{"role": "user", "content": "Count to 5"}], "stream": true, "stream_options": {"include_usage": true} }
SSE Format
Each event is a line prefixed with data: followed by a JSON object. The stream ends with data: [DONE].
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]} data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"1"},"finish_reason":null}]} data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":", 2"},"finish_reason":null}]} data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":8,"total_tokens":20}} data: [DONE]
Response Headers
Every response (streaming and non-streaming) includes these headers:
| Header | Description |
|---|---|
| X-Request-Id | Unique request ID for debugging. Include in support tickets. |
| Content-Type | text/event-stream for streaming, application/json otherwise. |
| Cache-Control | no-cache (streaming responses are never cached). |
Streaming Code Examples
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.chuizi.ai/v1",
    api_key="ck-your-key-here",
)

stream = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if hasattr(chunk, "usage") and chunk.usage:
        print(f"\n\nTokens: {chunk.usage.total_tokens}")
```
When to Use Streaming
Use streaming for interactive UIs, chatbots, and any case where the user is waiting for a response. Use non-streaming for batch processing, background jobs, and when you need the complete response before proceeding. Streaming also provides faster time-to-first-token, which improves perceived latency even when total generation time is the same.
Billing & Usage
chuizi.ai uses a prepaid credit system. Add funds to your account, then usage is deducted per request. Per-request cost transparency with no hidden fees.
How Billing Works
Pre-deduct: Before the request is sent upstream, we estimate the cost based on input tokens and freeze that amount from your balance in Redis.
Process: The request is forwarded to the upstream provider. Tokens stream back to you in real-time.
Reconcile: After the response completes, the actual token usage from the provider determines the final cost. The frozen estimate is released and the real amount is deducted. A BullMQ worker writes the transaction to PostgreSQL.
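The three steps above amount to a freeze-then-reconcile ledger. As an illustrative in-memory sketch of the logic only (the real implementation uses Redis for the frozen amounts and a BullMQ worker for persistence; the `Balance` class is hypothetical):

```python
from decimal import Decimal

class Balance:
    """Toy freeze/deduct ledger mirroring the billing flow above."""

    def __init__(self, available):
        self.available = Decimal(available)
        self.frozen = Decimal("0")

    def freeze(self, estimate):
        """Pre-deduct: reserve the estimated cost before calling upstream."""
        if self.available - self.frozen < estimate:
            raise RuntimeError("insufficient_quota")  # surfaces as HTTP 402
        self.frozen += estimate

    def reconcile(self, estimate, actual):
        """Release the frozen estimate and deduct the real cost."""
        self.frozen -= estimate
        self.available -= actual

b = Balance("1.00")
b.freeze(Decimal("0.10"))            # estimate from input tokens
b.reconcile(Decimal("0.10"), Decimal("0.04"))  # actual usage was cheaper
print(b.available)  # → 0.96
```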
Pricing Formula
```python
cost = input_tokens * input_price + output_tokens * output_price

# Example: anthropic/claude-sonnet-4-6
# $3.15 / 1M input, $15.75 / 1M output
#
# 1000 input + 500 output tokens:
# cost = (1000 * 3.15/1e6 + 500 * 15.75/1e6) = $0.011025
```
Auto-Recharge
Auto-recharge lets you maintain a minimum balance without manual top-ups. Set a balance threshold and recharge amount in your billing settings. When your balance drops below the threshold, your saved payment method is automatically charged for the configured amount.
Generation API
Every request generates a unique ID with the gen- prefix. Query full billing details for any request:
/v1/generation?id=gen-xxx
Query detailed billing and metadata for a specific request.
```shell
curl "https://api.chuizi.ai/v1/generation?id=gen-abc123" \
  -H "Authorization: Bearer ck-YOUR-KEY"
```
{ "data": { "id": "gen-abc123", "model": "anthropic/claude-sonnet-4-6", "provider": "anthropic", "input_tokens": 150, "output_tokens": 89, "native_input_tokens": 150, "native_output_tokens": 89, "cached_tokens": 0, "reasoning_tokens": 0, "cost": "0.00045600", "upstream_cost": "0.00035077", "latency_ms": 1230, "generation_time_ms": 1180, "finish_reason": "stop", "streamed": true, "created_at": "2025-01-15T10:30:00Z" } }
Generation Response Fields
| Field | Type | Description |
|---|---|---|
| id | string | Unique generation ID (gen- prefix). |
| model | string | Full provider/model name used for the request. |
| provider | string | Upstream provider (anthropic, openai, google, deepseek). |
| input_tokens | number | Number of input (prompt) tokens billed. |
| output_tokens | number | Number of output (completion) tokens billed. |
| native_input_tokens | number | Token count as reported by the upstream provider. |
| native_output_tokens | number | Output token count as reported by the upstream provider. |
| cached_tokens | number | Input tokens served from cache (reduced cost). |
| reasoning_tokens | number | Internal reasoning tokens (o3, DeepSeek R1). Billed as output. |
| cost | string | Total cost charged to your balance (Decimal string, USD). |
| upstream_cost | string | Actual upstream provider cost. |
| latency_ms | number | Total round-trip latency including network (milliseconds). |
| generation_time_ms | number | Time spent generating the response at the provider. |
| finish_reason | string | Why generation stopped: stop, length, tool_calls, content_filter. |
| streamed | boolean | Whether the request used streaming. |
| created_at | string | ISO 8601 timestamp of request creation. |
Cache & Reasoning Tokens
Cache Tokens
Some providers cache repeated prompt prefixes, dramatically reducing cost for subsequent requests with the same system prompt or conversation history. Cached input tokens are billed at a fraction of the normal input price.
| Provider | Models | Cache Discount | How to Enable |
|---|---|---|---|
| Anthropic | All Claude models | 90% off input price | Automatic for repeated prefixes. Use cache_control for explicit control. |
| OpenAI | GPT-4o, GPT-4.1, o3 | 50% off input price | Automatic for prompts > 1024 tokens with shared prefix. |
| DeepSeek | DeepSeek-Chat, DeepSeek-R1 | 90% off input price | Automatic for repeated prefixes (context caching). |
| Google | Gemini 2.5 Pro/Flash | 75% off input price | Use cachedContent API or automatic prefix caching. |
Cache tokens appear in the cached_tokens field of the Generation API response and in the Anthropic usage.cache_read_input_tokens field. Keep your system prompt stable across requests to maximize cache hits.
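For explicit Anthropic caching, mark the end of the stable prefix with a cache_control block. Shown here as a plain request body for /anthropic/v1/messages (the instruction text is a placeholder):

```python
# Explicit prompt caching: everything up to and including the block
# carrying cache_control is cached for subsequent requests.
request_body = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a helpful assistant. <long, stable instructions here>",
            "cache_control": {"type": "ephemeral"},  # cache boundary
        }
    ],
    "messages": [{"role": "user", "content": "Hello!"}],
}
```

Keep the cached prefix byte-identical across requests; any change to it invalidates the cache.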
Reasoning Tokens
Reasoning models (OpenAI o3, o4-mini, DeepSeek R1) generate internal "thinking" tokens before producing the final output. These tokens are not visible in the response but are billed as output tokens.
| Model | Reasoning Tokens | Billing |
|---|---|---|
| openai/o3 | Up to 100K tokens per request | Billed at output token rate |
| openai/o4-mini | Up to 100K tokens per request | Billed at output token rate |
| deepseek/deepseek-reasoner | Variable (shown in response) | Billed at output token rate |
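Putting cache discounts and reasoning tokens together, the complete pricing formula below can be checked client-side with a small helper (prices per million tokens; the function name is illustrative, not part of the API):

```python
from decimal import Decimal

def estimate_cost(input_tokens, cached_tokens, output_tokens, reasoning_tokens,
                  input_price, cached_price, output_price):
    """Cost in USD. Prices are Decimal USD per 1M tokens.
    Reasoning tokens are billed at the output rate."""
    per_million = Decimal(1_000_000)
    return ((input_tokens - cached_tokens) * input_price
            + cached_tokens * cached_price
            + (output_tokens + reasoning_tokens) * output_price) / per_million

# claude-sonnet-4-6 with a cache hit: 2000 input (1500 cached), 500 output
print(estimate_cost(2000, 1500, 500, 0,
                    Decimal("3.15"), Decimal("0.315"), Decimal("15.75")))
# → 0.0099225
```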
Complete Pricing Formula
```python
cost = (input_tokens - cached_tokens) * input_price \
     + cached_tokens * cached_input_price \
     + (output_tokens + reasoning_tokens) * output_price

# Example: anthropic/claude-sonnet-4-6 with cache hit
# 2000 input tokens, 1500 cached, 500 output tokens
# Input price: $3.15/1M, Cached: $0.315/1M, Output: $15.75/1M
#
# cost = 500 * 3.15/1e6     # non-cached input: $0.001575
#      + 1500 * 0.315/1e6   # cached input:     $0.0004725
#      + 500 * 15.75/1e6    # output:           $0.007875
#      = $0.009923
```

Error Handling
All errors follow the OpenAI-compatible error format, regardless of which protocol you use. This makes error handling consistent across /v1 and /anthropic paths.
{ "error": { "message": "Insufficient balance. Please top up at https://chuizi.ai/billing", "type": "insufficient_quota", "code": "402" } }
Error Codes Reference
| Code | Type | Description |
|---|---|---|
| 400 | invalid_request_error | Malformed request body, missing required fields, invalid model name, or unsupported parameters. |
| 401 | authentication_error | Missing, invalid, or expired API key. Check the Authorization or x-api-key header. |
| 402 | insufficient_quota | Account balance too low for the estimated request cost. Top up your balance. |
| 403 | permission_error | Key is inactive, IP not in whitelist, model not in allowed_models, or daily spend cap reached. |
| 403 | model_not_allowed | Your API key does not have access to this model. |
| 403 | ip_not_allowed | Request IP is not in the key's whitelist. |
| 403 | daily_limit_exceeded | Daily spending limit reached for this key. |
| 404 | not_found | Requested model or endpoint does not exist. Check model naming format. |
| 429 | rate_limit_error | RPM or TPM limit exceeded for your tier. Check X-RateLimit-Reset header for retry time. |
| 500 | internal_error | Internal chuizi.ai server error. Retry with exponential backoff. Contact support if persistent. |
| 502 | upstream_error | Upstream provider returned an error. The provider may be experiencing issues. |
| 503 | service_unavailable | Upstream provider is overloaded or temporarily unavailable. Retry after a short delay. |
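Only some of these statuses are worth retrying. A minimal classifier consistent with the table (429, 500, 502, and 503 are transient; everything else indicates a request or account problem to fix, not retry):

```python
RETRYABLE_STATUS = {429, 500, 502, 503}

def should_retry(status_code, attempt, max_retries=5):
    """Retry rate limits and server/upstream errors only,
    up to max_retries attempts (use exponential backoff between tries)."""
    return status_code in RETRYABLE_STATUS and attempt < max_retries
```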
Rate Limits
Rate limits are applied per API key. Each key inherits a default RPM (requests per minute) limit from your account tier, and the limit can be raised or lowered per key in the console.
Default Limits
The table below shows the default RPM limit for each tier. Set a custom rpm_limit in your API key settings to override the tier default.
| Type | RPM |
|---|---|
| Free | 60 |
| Starter | 120 |
| Pro | 300 |
| Enterprise | Custom |
Rate Limit Headers
Every response includes rate limit information in the headers:
| Header | Description |
|---|---|
| X-RateLimit-Limit | Maximum requests per minute for this key. |
| X-RateLimit-Remaining | Requests remaining in the current window. |
| X-RateLimit-Reset | Unix timestamp (seconds) when the window resets. |
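X-RateLimit-Reset is a Unix timestamp, so the wait time before retrying is just the difference from the current clock. A small sketch (the helper name is illustrative; the with_raw_response accessor shown in the comment is part of the official OpenAI Python SDK):

```python
import time

def seconds_until_reset(headers, now=None):
    """Seconds to wait before retrying, from the X-RateLimit-Reset header."""
    now = time.time() if now is None else now
    reset = int(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset - now)

# Usage with the OpenAI SDK (network call; replace with your real key):
# resp = client.chat.completions.with_raw_response.create(...)
# remaining = int(resp.headers["X-RateLimit-Remaining"])
# wait = seconds_until_reset(resp.headers)
```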
Per-Key Customization
Override your tier's default RPM on individual keys via the Console or API. Set rpm_limit when creating/updating a key. You can also set a daily_limit (in USD) to cap spending — the key returns 403 once the daily cap is reached and resets at midnight UTC.
Claude Code Setup
Claude Code uses the Anthropic API natively. Set two environment variables in your shell profile and you are done — no format conversion, no SDK changes, no wrappers.
```shell
# Add to ~/.zshrc or ~/.bashrc
export ANTHROPIC_BASE_URL=https://api.chuizi.ai/anthropic
export ANTHROPIC_API_KEY=ck-your-key-here

# Restart terminal, then run claude as usual
source ~/.zshrc
claude
```
Cursor Setup
Cursor supports custom API endpoints for Anthropic models. Configure the base URL in Cursor settings to route all Claude requests through chuizi.ai.
Open Cursor Settings (Cmd/Ctrl + ,)
Navigate to Models > Anthropic
Set API Base URL to https://api.chuizi.ai/anthropic
Enter your chuizi.ai API key (ck-...) as the API Key
Cline Setup
Cline (VS Code extension) supports custom API providers. Configure it to use chuizi.ai as the Anthropic API endpoint.
Open Cline extension settings in VS Code
Set API Provider to "Anthropic" with Custom Base URL: https://api.chuizi.ai/anthropic
Enter your chuizi.ai API key (ck-...) as the API Key
OpenCode Setup
OpenCode reads provider configuration from its config file. Point the Anthropic provider to chuizi.ai.
```toml
[providers.anthropic]
base_url = "https://api.chuizi.ai/anthropic"
api_key = "ck-your-key-here"
```
Other Integrations
n8n
Use chuizi.ai in n8n by adding an OpenAI node with custom credentials. All OpenAI-compatible operations (chat, completions, embeddings) work out of the box.
```
Base URL: https://api.chuizi.ai/v1
API Key: ck-your-key-here

# In n8n:
# 1. Add an "OpenAI" node
# 2. Create a new credential:
#    - API Key: ck-your-key-here
#    - Base URL: https://api.chuizi.ai/v1
# 3. Select any model (e.g. anthropic/claude-sonnet-4-6)
# 4. Works with Chat operations
```
LangChain
LangChain works with chuizi.ai through the OpenAI-compatible interface. No extra libraries or adapters needed — just use ChatOpenAI and change the base URL.
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://api.chuizi.ai/v1",
    api_key="ck-your-key-here",
    model="anthropic/claude-sonnet-4-6",
)

response = llm.invoke("Explain quantum computing in one sentence.")
print(response.content)
```
Vercel AI SDK
The Vercel AI SDK connects to chuizi.ai via @ai-sdk/openai's createOpenAI helper. All features including generateText, streamText, and tool calling work seamlessly.
```typescript
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

const chuizi = createOpenAI({
  baseURL: "https://api.chuizi.ai/v1",
  apiKey: "ck-your-key-here",
});

const { text } = await generateText({
  model: chuizi("anthropic/claude-sonnet-4-6"),
  prompt: "Explain quantum computing in one sentence.",
});
console.log(text);
```
Key Management
Each API key can be configured with fine-grained access controls. Set these when creating a key in the Console, or update them later.
| Property | Type | Description |
|---|---|---|
| name | string | Human-readable label (e.g. "Production", "CI/CD") |
| allowed_models | string[] | Restrict to specific models. Empty = all models allowed. |
| ip_whitelist | string[] | Restrict to specific IP addresses or CIDR ranges. Empty = any IP. |
| rpm_limit | number | Per-key requests per minute override. Falls back to tier default. |
| daily_limit | string | Maximum daily spend in USD (e.g. "50.00"). Resets at midnight UTC. |
| is_active | boolean | Disable a key without deleting it. Disabled keys return 403. |
Query Key Info
Check your key's current status, balance, and limits programmatically:
/v1/key/info
Returns current key metadata, balance, and usage limits.
```shell
curl https://api.chuizi.ai/v1/key/info \
  -H "Authorization: Bearer ck-YOUR-KEY"
```
{ "key_prefix": "ck-a1b2", "name": "Production", "group": "default", "tier": "pro", "balance": "142.35000000", "is_active": true, "allowed_models": ["anthropic/claude-sonnet-4-6", "openai/gpt-4o"], "ip_whitelist": [], "rpm_limit": 300, "daily_limit": "100.00", "last_used_at": "2025-06-15T14:22:00Z", "created_at": "2025-06-01T08:00:00Z" }
SDKs & Examples
chuizi.ai works with any OpenAI-compatible SDK. No custom libraries needed — just change the base URL and API key.
Python — Streaming
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.chuizi.ai/v1",
    api_key="ck-your-key-here",
)

stream = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
Node.js — Streaming
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.chuizi.ai/v1",
  apiKey: "ck-your-key-here",
});

const stream = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4-6",
  messages: [{ role: "user", content: "Write a haiku about coding" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
```
Python — Tool Use (Function Calling)
```python
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.chuizi.ai/v1",
    api_key="ck-your-key-here",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"]
        }
    }
}]

# Step 1: Initial request with tools
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
message = response.choices[0].message

# Step 2: If the model wants to call a tool
if message.tool_calls:
    tool_call = message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)

    # Your function implementation
    weather_data = {"temp": "22C", "condition": "Sunny"}

    # Step 3: Send the tool result back
    follow_up = client.chat.completions.create(
        model="openai/gpt-4o",
        messages=[
            {"role": "user", "content": "What's the weather in Tokyo?"},
            message,  # assistant message with tool_calls
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(weather_data),
            },
        ],
        tools=tools,
    )
    print(follow_up.choices[0].message.content)
```
curl — Non-Streaming
```shell
curl https://api.chuizi.ai/v1/chat/completions \
  -H "Authorization: Bearer ck-YOUR-KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain REST APIs in one sentence."}
    ],
    "temperature": 0.3,
    "max_tokens": 100
  }'
```

curl — Anthropic Native Protocol
```shell
curl https://api.chuizi.ai/anthropic/v1/messages \
  -H "x-api-key: ck-YOUR-KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "system": "You are a senior software engineer.",
    "messages": [
      {"role": "user", "content": "Review this code: function add(a,b){return a+b}"}
    ]
  }'
```

Best Practices
Error Recovery with Exponential Backoff
Always implement retry logic with exponential backoff for transient errors (429, 500, 502, 503). Here is a production-ready implementation:
```python
import time
import random
from openai import OpenAI, APIError, RateLimitError, APIConnectionError

client = OpenAI(
    base_url="https://api.chuizi.ai/v1",
    api_key="ck-your-key-here",
)

def chat_with_retry(messages, model="anthropic/claude-sonnet-4-6", max_retries=5):
    """Make a chat request with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
            )
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Use Retry-After header if available, otherwise exponential backoff
            wait = float(e.response.headers.get("retry-after", 2 ** attempt))
            wait += random.uniform(0, 1)  # jitter
            print(f"Rate limited. Retrying in {wait:.1f}s...")
            time.sleep(wait)
        except (APIError, APIConnectionError) as e:
            if attempt == max_retries - 1:
                raise
            wait = min(2 ** attempt + random.uniform(0, 1), 30)
            print(f"Error: {e}. Retrying in {wait:.1f}s...")
            time.sleep(wait)

response = chat_with_retry([{"role": "user", "content": "Hello!"}])
print(response.choices[0].message.content)
```
Cost Optimization
Maximize cache tokens
Keep your system prompt identical across requests to the same model. Anthropic and OpenAI automatically cache repeated prompt prefixes, saving up to 90% on input tokens.
Choose the right model
Use Haiku or GPT-4.1-mini for simple tasks (classification, extraction, formatting). Reserve Opus/GPT-4o/o3 for tasks that genuinely need advanced reasoning. The cost difference can be 10-50x.
Set max_tokens appropriately
Always set max_tokens to a reasonable limit for your use case. This caps output costs and prevents runaway generation. It also improves the accuracy of cost pre-deduction.
Monitor with the Generation API
Periodically query /v1/generation to understand your cost breakdown. Look for requests with high reasoning_tokens or low cache hit rates.
Streaming Best Practices
Always include stream_options
Set stream_options: {include_usage: true} to receive token counts in the final chunk. Without this, you cannot track costs client-side.
Handle partial chunks gracefully
SSE chunks can split across network packets. Always buffer and parse complete data: lines. The OpenAI SDK handles this automatically.
Use -N flag with curl
The -N (--no-buffer) flag disables output buffering in curl, so you see tokens as they arrive instead of all at once.
Key Security
Rotate keys regularly
Create new keys periodically and deactivate old ones. Use separate keys for development, staging, and production environments.
Use IP whitelisting
For production keys, restrict access to your server IP addresses or CIDR ranges. This prevents unauthorized usage even if a key is leaked.
Set daily spend caps
Configure daily_limit on each key to prevent runaway costs from bugs or compromised keys. The key returns 403 once the cap is reached.
Restrict allowed models
Use allowed_models to limit which models each key can access. A key meant for embeddings does not need access to GPT-4o or Claude Opus.
Never expose keys in client-side code
API keys should only be used server-side. Never include them in JavaScript bundles, mobile apps, or public repositories. Use environment variables exclusively.