Install and Authenticate
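The client ships with the huggingface_hub package:

```shell
pip install huggingface_hub
```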
Log in to store your token locally, or export it as an environment variable:
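Either option works; the token value below is a placeholder:

```shell
# Option 1: interactive login (token stored under ~/.cache/huggingface)
huggingface-cli login

# Option 2: environment variable, picked up automatically by the client
export HF_TOKEN=hf_xxxxxxxxxxxx
```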
Create a fine-grained token at huggingface.co/settings/tokens with the “Make calls to Inference Providers” permission. The InferenceClient picks up your stored token automatically if you used huggingface-cli login.
Run a Chat Completion
The fastest way to get a response from an open-source LLM:
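A minimal sketch; the Llama model ID is one example, and any chat-capable Hub model works in its place:

```python
from huggingface_hub import InferenceClient

client = InferenceClient()  # uses your stored HF token automatically

response = client.chat_completion(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain attention in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```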
This follows the OpenAI chat completions format exactly. If you already have code using the OpenAI Python client, switching to Hugging Face requires changing two lines:
The stream=True parameter works the same way – you iterate over chunks and pull delta.content from each one.
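A streaming sketch along those lines:

```python
from huggingface_hub import InferenceClient

client = InferenceClient()
stream = client.chat_completion(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)
for chunk in stream:
    # delta.content can be None on boundary chunks, hence the `or ""`
    print(chunk.choices[0].delta.content or "", end="")
```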
Choose an Inference Provider
Hugging Face routes your requests through 15+ backend providers: Together AI, Groq, Cerebras, Fireworks, Replicate, fal-ai, SambaNova, and others. By default, provider="auto" picks the fastest available provider for your model.
You can pin a specific provider:
Or use your own provider API key directly instead of routing through Hugging Face:
When routing through Hugging Face (the default), usage gets billed to your HF account. When passing a provider key directly, you skip the Hugging Face proxy and get billed by the provider.
You can also append a selection policy to the model name: :fastest (default, highest throughput), :cheapest (lowest cost per output token), or :preferred (follows your order in HF settings).
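Assuming the policy suffix is forwarded to the router unchanged, usage looks like:

```python
from huggingface_hub import InferenceClient

client = InferenceClient()  # provider defaults to "auto"
response = client.chat_completion(
    model="meta-llama/Meta-Llama-3-8B-Instruct:cheapest",  # lowest cost per output token
    messages=[{"role": "user", "content": "Hi"}],
)
```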
Generate Images from Text
The same client handles image generation. You just call a different method:
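A sketch using FLUX.1-dev (the model mentioned in the pricing section below):

```python
from huggingface_hub import InferenceClient

client = InferenceClient()
image = client.text_to_image(
    "An astronaut riding a horse, watercolor",
    model="black-forest-labs/FLUX.1-dev",
)
image.save("astronaut.png")  # `image` is a PIL.Image.Image
```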
The return value is a PIL.Image object. Providers that support text_to_image include fal-ai, Replicate, Together, Nebius, Nscale, HF Inference, and others.
Use Structured Outputs and Tool Calling
InferenceClient supports the same tool-calling interface as OpenAI:
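A sketch; `get_weather` is a hypothetical tool defined here purely for illustration:

```python
from huggingface_hub import InferenceClient

client = InferenceClient()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical function, not a real API
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat_completion(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)

# The model responds with a tool call instead of plain text
call = response.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)
```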
For structured JSON output, pass a response_format with a JSON schema:
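A sketch using the OpenAI-style json_schema shape; exact support varies by huggingface_hub version and provider:

```python
from huggingface_hub import InferenceClient

client = InferenceClient()

response = client.chat_completion(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Describe a fictional person as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",  # schema name is arbitrary
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        },
    },
)
print(response.choices[0].message.content)  # a JSON string matching the schema
```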
Run Async Inference
For high-throughput applications, use AsyncInferenceClient:
The async client has the exact same API surface as the sync version. Every method available on InferenceClient exists on AsyncInferenceClient.
Handle Errors and Timeouts
Four errors you will hit sooner or later:
Rate limit (HTTP 429)
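A retry sketch with exponential backoff, catching the client's HTTP error type:

```python
import time

from huggingface_hub import InferenceClient
from huggingface_hub.utils import HfHubHTTPError

client = InferenceClient()

for attempt in range(5):
    try:
        response = client.chat_completion(
            model="meta-llama/Meta-Llama-3-8B-Instruct",
            messages=[{"role": "user", "content": "Hi"}],
        )
        break
    except HfHubHTTPError as e:
        if e.response.status_code != 429:
            raise
        time.sleep(2 ** attempt)  # backoff: 1, 2, 4, 8 seconds
```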
Free-tier users get a few hundred requests per hour. Authenticate with a token to raise the limit. Upgrading to a PRO account ($9/month) gives you $2 of monthly inference credits and higher rate limits.
Model loading (HTTP 503)
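A sketch that waits out a cold start and retries once; the 20-second wait is a rough guess, not the API's estimate:

```python
import time

from huggingface_hub import InferenceClient
from huggingface_hub.utils import HfHubHTTPError

client = InferenceClient()

def ask(prompt):
    return client.chat_completion(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
    )

try:
    out = ask("Hi")
except HfHubHTTPError as e:
    if e.response.status_code != 503:
        raise
    time.sleep(20)  # cold-start wait; prefer the estimated time if the response provides one
    out = ask("Hi")
```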
Serverless models cold-start when nobody has used them recently. Retry after the estimated time, or pick a popular model that stays warm.
Timeout
Set a timeout to avoid hanging indefinitely on slow models:
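For example:

```python
from huggingface_hub import InferenceClient

# Give up if no response arrives within 30 seconds
client = InferenceClient(timeout=30)
```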
Access denied (HTTP 403)
Some models are gated: you need to accept the model’s license on its Hub page before the API will serve it. Until then, requests fail with HTTP 403.
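A sketch of detecting the gated-model case; here the Llama model stands in for any gated model:

```python
from huggingface_hub import InferenceClient
from huggingface_hub.utils import HfHubHTTPError

client = InferenceClient()
try:
    client.chat_completion(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # gated: license must be accepted first
        messages=[{"role": "user", "content": "Hi"}],
    )
except HfHubHTTPError as e:
    if e.response.status_code == 403:
        print("Gated model: accept the license on the model's Hub page, then retry.")
    else:
        raise
```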
Visit the model page, click “Agree and access,” and retry.
Pricing at a Glance
Hugging Face bills per request based on compute time multiplied by hardware cost. A text_to_image call on FLUX.1-dev that takes 10 seconds on a GPU costing $0.00012/second bills $0.0012. There is no markup on provider rates.
Free-tier users get rate-limited access with no billing. PRO users ($9/month) get $2 of monthly credits. Enterprise orgs can set billing targets per team with the bill_to parameter:
Supported Tasks Beyond Chat
InferenceClient is not just for LLMs. It covers 25+ tasks across NLP, vision, and audio:
- text_classification, token_classification, summarization, translation
- text_to_image, image_to_image, image_classification, object_detection
- automatic_speech_recognition, text_to_speech
- feature_extraction, sentence_similarity, zero_shot_classification
Each task is a method on the client. The HF Inference provider supports all of them. Third-party providers cover a subset – chat_completion has the widest provider support (15+ providers), while tasks like fill_mask or table_question_answering are HF Inference only.
Browse models with inference support at huggingface.co/models?inference=warm.
Common Pitfalls
Using model IDs from the provider instead of Hugging Face. Always pass the Hub model ID (meta-llama/Meta-Llama-3-8B-Instruct), not the provider’s internal ID. The client handles the mapping.
Forgetting to specify a model with third-party providers. When provider="auto", the client can pick a default model. When you pin a provider like together or replicate, you must specify which model to use.
Confusing api_key and token. Both work for passing your HF token. The api_key parameter is an alias added for OpenAI compatibility. Use whichever you prefer.
Not checking provider support for your task. If you try text_to_image on a provider that only supports chat_completion, you will get a routing error. Check the provider compatibility table before wiring things up.
Related Guides
- How to Run Open-Source Models with the Replicate API
- How to Use the Cerebras API for Fast LLM Inference
- How to Run Fast LLM Inference with the Groq API
- How to Build Apps with the Gemini API and Python SDK
- How to Use the Anthropic Token Counting API for Cost Estimation
- How to Use the Anthropic Claude Files API for Large Document Processing
- How to Use the Anthropic PDF Processing API for Document Analysis
- How to Use the Anthropic Multi-Turn Conversation API with Tool Use
- How to Use the Google Vertex AI Gemini API for Multimodal Tasks
- How to Use the Mistral API for Code Generation and Chat