Install and Authenticate

pip install huggingface_hub

Log in to store your token locally, or export it as an environment variable:

huggingface-cli login
# or
export HF_TOKEN="hf_your_token_here"

Create a fine-grained token at huggingface.co/settings/tokens with the “Make calls to Inference Providers” permission. The InferenceClient picks up your stored token automatically if you used huggingface-cli login.

Run a Chat Completion

The fastest way to get a response from an open-source LLM:

from huggingface_hub import InferenceClient

client = InferenceClient()

output = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the three laws of thermodynamics?"},
    ],
    max_tokens=512,
)

print(output.choices[0].message.content)

This follows the OpenAI chat completions format exactly. If you already have code using the OpenAI Python client, switching to Hugging Face requires changing two lines:

# Before (OpenAI)
from openai import OpenAI
client = OpenAI()

# After (Hugging Face)
from huggingface_hub import InferenceClient
client = InferenceClient()

# Everything else stays the same
output = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Count to 10"}],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content, end="")

The stream=True parameter works the same way – you iterate over chunks and pull delta.content from each one.

Choose an Inference Provider

Hugging Face routes your requests through 15+ backend providers: Together AI, Groq, Cerebras, Fireworks, Replicate, fal-ai, SambaNova, and others. By default, provider="auto" picks the fastest available provider for your model.

You can pin a specific provider:

client = InferenceClient(provider="together")

output = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain monads in one paragraph."}],
    max_tokens=256,
)

Or use your own provider API key directly instead of routing through Hugging Face:

client = InferenceClient(
    provider="replicate",
    api_key="r8_your_replicate_key",
)

When routing through Hugging Face (the default), usage gets billed to your HF account. When passing a provider key directly, you skip the Hugging Face proxy and get billed by the provider.

You can also append a selection policy to the model name: :fastest (default, highest throughput), :cheapest (lowest cost per output token), or :preferred (follows your order in HF settings).

output = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct:cheapest",
    messages=[{"role": "user", "content": "Hello"}],
)

Generate Images from Text

The same client handles image generation. You just call a different method:

from huggingface_hub import InferenceClient

client = InferenceClient()

image = client.text_to_image(
    "A flying car crossing a futuristic cityscape at sunset",
    model="black-forest-labs/FLUX.1-schnell",
)

image.save("flying_car.png")

The return value is a PIL.Image object. Providers that support text_to_image include fal-ai, Replicate, Together, Nebius, Nscale, HF Inference, and others.
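
Because the result is a regular Pillow image, the usual PIL operations apply afterwards. A sketch of typical post-processing (assumes Pillow is installed, and uses a locally generated image as a stand-in for the API response):

```python
from io import BytesIO
from PIL import Image

# Stand-in for the PIL.Image returned by client.text_to_image(...)
image = Image.new("RGB", (1024, 768), color="navy")

# Downscale to a thumbnail and serialize to PNG bytes,
# e.g. to return the image from a web handler.
thumbnail = image.resize((256, 192))
buffer = BytesIO()
thumbnail.save(buffer, format="PNG")
png_bytes = buffer.getvalue()
```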

Use Structured Outputs and Tool Calling

InferenceClient supports the same tool-calling interface as OpenAI:

from huggingface_hub import InferenceClient

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current temperature for a given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and country, e.g. Paris, France",
                    }
                },
                "required": ["location"],
            },
        },
    }
]

client = InferenceClient(provider="nebius")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in London?"}],
    tools=tools,
    tool_choice="auto",
)

print(response.choices[0].message.tool_calls[0].function.arguments)
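
The arguments field arrives as a JSON string, not a dict, so parse it before dispatching to your local function. A sketch with a stubbed get_weather and a hard-coded arguments string standing in for the model's response:

```python
import json

def get_weather(location: str) -> str:
    # Stub; a real implementation would call a weather API.
    return f"18°C and cloudy in {location}"

# Stand-in for response.choices[0].message.tool_calls[0].function.arguments
raw_arguments = '{"location": "London, United Kingdom"}'

args = json.loads(raw_arguments)   # JSON string -> dict
result = get_weather(**args)       # dispatch to the local function
```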

For structured JSON output, pass a response_format with a JSON schema:

client = InferenceClient(provider="cerebras")

result = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[
        {"role": "system", "content": "Extract book information."},
        {"role": "user", "content": "I just read 'Dune' by Frank Herbert."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "book",
            "schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "author": {"type": "string"},
                },
                "required": ["title", "author"],
            },
            "strict": True,
        },
    },
)

print(result.choices[0].message.content)
# {"title": "Dune", "author": "Frank Herbert"}
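
With a strict schema, the message content is a JSON string matching that schema, so a plain json.loads gets you a dict. A sketch with the content hard-coded in place of the API response:

```python
import json

# Stand-in for result.choices[0].message.content under the schema above
content = '{"title": "Dune", "author": "Frank Herbert"}'

book = json.loads(content)
assert set(book) == {"title", "author"}  # the schema's required keys
print(book["author"])
```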

Run Async Inference

For high-throughput applications, use AsyncInferenceClient:

import asyncio
from huggingface_hub import AsyncInferenceClient

client = AsyncInferenceClient()

async def main():
    stream = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": "Explain quicksort briefly."}],
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")

asyncio.run(main())

The async client has the exact same API surface as the sync version. Every method available on InferenceClient exists on AsyncInferenceClient.
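
The payoff of the async client is fanning out many requests concurrently with asyncio.gather. A sketch of the pattern, with a stub coroutine standing in for the awaited client call:

```python
import asyncio

async def fake_completion(prompt: str) -> str:
    # Stand-in for: await client.chat.completions.create(...)
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"

async def main() -> list[str]:
    prompts = ["Explain quicksort.", "Explain mergesort.", "Explain heapsort."]
    # gather() runs the requests concurrently instead of one at a time.
    return await asyncio.gather(*(fake_completion(p) for p in prompts))

results = asyncio.run(main())
```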

Handle Errors and Timeouts

Four errors you will hit sooner or later:

Rate limit (HTTP 429)

HfHubHTTPError: 429 Client Error: Too Many Requests
Rate limit reached. You reached free usage limit (reset hourly).

Free-tier users get a few hundred requests per hour. Authenticate with a token to raise the limit. Upgrading to a PRO account ($9/month) gives you $2 of monthly inference credits and higher rate limits.

Model loading (HTTP 503)

HfHubHTTPError: 503 Server Error: Service Unavailable
Model is currently loading. Estimated time: 20s.

Serverless models cold-start when nobody has used them recently. Retry after the estimated time, or pick a popular model that stays warm.
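
"Retry after the estimated time" is easy to wrap in a small helper. A generic sketch (in practice you would catch HfHubHTTPError and retry only on a 503 status; this version retries any exception to stay short):

```python
import time

def call_with_retry(fn, retries: int = 3, wait_seconds: float = 20.0):
    """Call fn(), retrying on failure with a fixed wait between attempts."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts; let the last error propagate
            time.sleep(wait_seconds)
```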

Timeout

Set a timeout to avoid hanging indefinitely on slow models:

from huggingface_hub import InferenceClient, InferenceTimeoutError

client = InferenceClient(timeout=30)

try:
    output = client.text_to_image(
        "A cat on the moon",
        model="black-forest-labs/FLUX.1-schnell",
    )
except InferenceTimeoutError:
    print("Request timed out after 30 seconds.")

Access denied (HTTP 403)

Some models are gated. You need to accept the model’s license on its Hub page before the API will serve it. You will see:

HfHubHTTPError: 403 Client Error: Forbidden

Visit the model page, click “Agree and access,” and retry.

Pricing at a Glance

Hugging Face bills per request based on compute time multiplied by hardware cost. A text_to_image call on FLUX.1-dev that takes 10 seconds on a GPU costing $0.00012/second bills $0.0012. There is no markup on provider rates.
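
The billing formula from the paragraph above, as a one-line helper checking the FLUX.1-dev arithmetic (the function name is illustrative, not part of any API):

```python
def request_cost(compute_seconds: float, hardware_rate_per_second: float) -> float:
    # Per-request price = compute time times hardware cost (no markup).
    return compute_seconds * hardware_rate_per_second

# The example from the text: 10 s on a GPU billed at $0.00012/second.
cost = request_cost(10, 0.00012)  # $0.0012
```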

Free-tier users get rate-limited access with no billing. PRO users ($9/month) get $2 of monthly credits. Enterprise orgs can set billing targets per team with the bill_to parameter:

client = InferenceClient(provider="fal-ai", bill_to="my-org")

Supported Tasks Beyond Chat

InferenceClient is not just for LLMs. It covers 25+ tasks across NLP, vision, and audio:

  • text_classification, token_classification, summarization, translation
  • text_to_image, image_to_image, image_classification, object_detection
  • automatic_speech_recognition, text_to_speech
  • feature_extraction, sentence_similarity, zero_shot_classification

Each task is a method on the client. The HF Inference provider supports all of them. Third-party providers cover a subset – chat_completion has the widest provider support (15+ providers), while tasks like fill_mask or table_question_answering are HF Inference only.

Browse models with inference support at huggingface.co/models?inference=warm.

Common Pitfalls

Using model IDs from the provider instead of Hugging Face. Always pass the Hub model ID (meta-llama/Meta-Llama-3-8B-Instruct), not the provider’s internal ID. The client handles the mapping.

Forgetting to specify a model with third-party providers. When provider="auto", the client can pick a default model. When you pin a provider like together or replicate, you must specify which model to use.

Confusing api_key and token. Both work for passing your HF token. The api_key parameter is an alias added for OpenAI compatibility. Use whichever you prefer.

Not checking provider support for your task. If you try text_to_image on a provider that only supports chat_completion, you will get a routing error. Check the provider compatibility table before wiring things up.