The Quick Setup

Install the Pinecone SDK and an embedding provider. You need both – Pinecone stores and queries vectors, but you generate them yourself.

pip install pinecone openai sentence-transformers

Create a serverless index and upsert your first vectors:

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

pc.create_index(
    name="docs-index",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

index = pc.Index("docs-index")

index.upsert(
    vectors=[
        {"id": "doc-1", "values": [0.1] * 1536, "metadata": {"source": "wiki", "topic": "ml"}},
        {"id": "doc-2", "values": [0.2] * 1536, "metadata": {"source": "blog", "topic": "nlp"}},
    ]
)

That’s the core loop. Everything else is about doing it well at scale.

Generating Embeddings

You have two main options: OpenAI’s API or local models with Sentence Transformers. OpenAI gives you better quality on general text. Sentence Transformers is free and runs locally.

OpenAI Embeddings

from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]

vectors = embed_texts(["Pinecone is a vector database", "Embeddings capture meaning"])
# Each vector has 1536 dimensions by default

text-embedding-3-small is the sweet spot. It’s cheap ($0.02 per million tokens), fast, and the quality gap vs. text-embedding-3-large is smaller than you’d expect for most retrieval tasks.
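At that rate, embedding cost is easy to estimate up front. A back-of-envelope sketch (`embedding_cost_usd` is a hypothetical helper; the $0.02-per-million figure is the rate quoted above):

```python
def embedding_cost_usd(n_tokens: int, price_per_million: float = 0.02) -> float:
    """Rough embedding cost at a given per-million-token rate."""
    return n_tokens / 1_000_000 * price_per_million

# Embedding 10M tokens with text-embedding-3-small: about $0.20
print(f"${embedding_cost_usd(10_000_000):.2f}")
```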

Sentence Transformers (Local)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = ["Pinecone is a vector database", "Embeddings capture meaning"]
vectors = model.encode(texts).tolist()
# Each vector has 384 dimensions

Pick all-MiniLM-L6-v2 for speed (384 dims), or all-mpnet-base-v2 for better accuracy (768 dims). Match your index dimension to whichever model you choose.
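Dimension mismatches are the most common upsert failure, so it can pay to validate locally before calling Pinecone. A minimal sketch (`check_dimensions` is a hypothetical helper, not part of the SDK):

```python
def check_dimensions(vectors: list[dict], expected_dim: int) -> None:
    """Fail fast locally instead of letting the upsert be rejected server-side."""
    for v in vectors:
        if len(v["values"]) != expected_dim:
            raise ValueError(
                f"{v['id']}: got {len(v['values'])} dims, index expects {expected_dim}"
            )

# Passes for a 384-dim model like all-MiniLM-L6-v2 against a 384-dim index
check_dimensions([{"id": "doc-1", "values": [0.0] * 384}], expected_dim=384)
```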

Serverless vs. Pod-Based Indexes

Serverless is what you want for most new projects. It auto-scales, you pay per query, and there’s no idle cost eating your budget.

# Serverless — recommended for most use cases
pc.create_index(
    name="serverless-index",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

# Pod-based — use when you need predictable latency or very high throughput
from pinecone import PodSpec

pc.create_index(
    name="pod-index",
    dimension=1536,
    metric="cosine",
    spec=PodSpec(environment="us-east-1-aws", pod_type="p1.x1", pods=1),
)

Pod-based indexes make sense when you have steady, high-throughput workloads where you want guaranteed latency. For everything else, serverless saves money.

Building a Semantic Search Pipeline

Here’s a real pipeline that embeds documents, upserts them in batches, and queries with metadata filtering.

from pinecone import Pinecone, ServerlessSpec
from openai import OpenAI
import itertools
import os

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
openai_client = OpenAI()

# Create the index
if "search-index" not in [idx.name for idx in pc.list_indexes()]:
    pc.create_index(
        name="search-index",
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index("search-index")

# Your documents
documents = [
    {"id": "doc-1", "text": "PyTorch supports dynamic computation graphs", "category": "frameworks"},
    {"id": "doc-2", "text": "RLHF aligns language models with human preferences", "category": "training"},
    {"id": "doc-3", "text": "LoRA reduces fine-tuning memory by 10x", "category": "training"},
    {"id": "doc-4", "text": "vLLM serves LLMs with PagedAttention", "category": "inference"},
]

# Embed all texts
texts = [doc["text"] for doc in documents]
response = openai_client.embeddings.create(model="text-embedding-3-small", input=texts)
embeddings = [item.embedding for item in response.data]

# Build vectors with metadata
vectors = [
    {
        "id": doc["id"],
        "values": emb,
        "metadata": {"text": doc["text"], "category": doc["category"]},
    }
    for doc, emb in zip(documents, embeddings)
]

# Batch upsert (100 vectors per batch is a good default)
def chunked(iterable, size):
    it = iter(iterable)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

for batch in chunked(vectors, 100):
    index.upsert(vectors=batch)

# Query with metadata filter
query_text = "How do I fine-tune with less memory?"
query_embedding = openai_client.embeddings.create(
    model="text-embedding-3-small", input=[query_text]
).data[0].embedding

results = index.query(
    vector=query_embedding,
    top_k=3,
    include_metadata=True,
    filter={"category": {"$eq": "training"}},
)

for match in results.matches:
    print(f"{match.id} ({match.score:.4f}): {match.metadata['text']}")

The metadata filter narrows results to the training category before ranking by similarity. This is faster than filtering after retrieval and gives you more relevant results.
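The filter language supports more than `$eq`. The shapes below cover the common operators (the `year` field is illustrative, not part of the pipeline above):

```python
# Single-field equality (as in the query above)
only_training = {"category": {"$eq": "training"}}

# Membership: match any of several values
training_or_inference = {"category": {"$in": ["training", "inference"]}}

# Combine conditions with $and; $gte/$lte compare numeric metadata
recent_training = {
    "$and": [
        {"category": {"$eq": "training"}},
        {"year": {"$gte": 2023}},
    ]
}
```

Any of these dicts can be passed as the `filter` argument to `index.query`.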

Namespace Management

Namespaces partition your index without creating separate indexes. Use them to isolate tenants, environments, or data types.

# Upsert into a specific namespace
index.upsert(vectors=vectors, namespace="production")

# Query within a namespace
results = index.query(
    vector=query_embedding,
    top_k=5,
    namespace="production",
    include_metadata=True,
)

# Delete all vectors in a namespace
index.delete(delete_all=True, namespace="staging")

One index with namespaces beats multiple indexes for multi-tenant apps. You pay for one index but get logical separation. Each namespace has its own vector space, so queries in one namespace never see vectors from another.
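One convention that keeps multi-tenant namespaces manageable is deriving the namespace from the tenant ID. A sketch (the naming scheme here is an assumption, not a Pinecone requirement):

```python
def tenant_namespace(tenant_id: str, env: str = "production") -> str:
    """Map a tenant to its namespace, e.g. 'production--acme-corp'."""
    safe = tenant_id.lower().replace(" ", "-")
    return f"{env}--{safe}"

# index.query(vector=query_embedding, top_k=5,
#             namespace=tenant_namespace("Acme Corp"))
```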

Index Management

# List all indexes
for idx in pc.list_indexes():
    print(f"{idx.name}: {idx.dimension}d, {idx.metric}, status: {idx.status.state}")

# Describe an index (get stats)
stats = index.describe_index_stats()
print(f"Total vectors: {stats.total_vector_count}")
print(f"Namespaces: {stats.namespaces}")

# Delete vectors by ID
index.delete(ids=["doc-1", "doc-2"])

# Delete vectors by metadata filter (pod-based indexes only;
# serverless indexes don't support delete-by-filter)
index.delete(filter={"category": {"$eq": "deprecated"}})

# Delete the entire index
pc.delete_index("search-index")

Hybrid Search with Sparse-Dense Vectors

Pure semantic search misses exact keyword matches. Hybrid search combines dense embeddings (semantic meaning) with sparse vectors (keyword matching) for better retrieval.

# Hybrid search requires a sparse-dense index
# Use dotproduct metric for hybrid queries
pc.create_index(
    name="hybrid-index",
    dimension=1536,
    metric="dotproduct",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

hybrid_index = pc.Index("hybrid-index")

# Sparse vector from BM25 or SPLADE (simplified example)
# In production, use pinecone-text or a SPLADE model
sparse_values = {"indices": [102, 304, 512], "values": [0.8, 0.6, 0.3]}

hybrid_index.upsert(
    vectors=[
        {
            "id": "doc-1",
            "values": [0.1] * 1536,  # dense embedding
            "sparse_values": sparse_values,
            "metadata": {"text": "LoRA fine-tuning guide"},
        }
    ]
)

# Query with both dense and sparse components
results = hybrid_index.query(
    vector=[0.1] * 1536,
    sparse_vector={"indices": [102, 512], "values": [0.9, 0.4]},
    top_k=10,
    include_metadata=True,
)

Use dotproduct as the metric for hybrid indexes. Cosine similarity doesn’t work correctly when combining sparse and dense scores.
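To control the dense/sparse balance at query time, the usual approach is a convex combination: scale the dense vector by alpha and the sparse values by 1 - alpha before querying. A sketch of that pattern (the `pinecone-text` package ships a similar helper, so treat this as illustrative):

```python
def hybrid_scale(dense: list[float], sparse: dict, alpha: float):
    """alpha=1.0 is pure semantic search, alpha=0.0 is pure keyword search."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse

dense, sparse = hybrid_scale(
    [0.1] * 1536, {"indices": [102, 512], "values": [0.9, 0.4]}, alpha=0.75
)
# then: hybrid_index.query(vector=dense, sparse_vector=sparse, top_k=10)
```

With dotproduct scoring, the weighted dense and sparse contributions add up directly, which is why this trick works for that metric and not cosine.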

Cost Optimization Tips

Pinecone billing adds up fast if you’re not paying attention. A few things that actually matter:

Choose smaller embeddings. text-embedding-3-small at 1536 dims costs less to store and query than text-embedding-3-large at 3072 dims. The retrieval quality difference is marginal for most use cases.

Batch your upserts. Single-vector upserts are wasteful. Always batch at 100 vectors per request. The API accepts up to 1000, but 100 is the sweet spot for reliability.

Use metadata filtering aggressively. Filters reduce the search space before the vector comparison happens. A query over 100k vectors with a filter that narrows to 10k is significantly cheaper on serverless.

Delete what you don’t need. Serverless indexes charge per vector stored. If you’re replacing embeddings (re-indexing after a model change), delete the old namespace first.

Pick serverless for bursty workloads. If your traffic is uneven – heavy during business hours, quiet at night – serverless saves 60-80% compared to always-on pods.

Common Errors

PineconeApiException: dimension mismatch – Your vectors don’t match the index dimension. If the index is 1536d, every vector must have exactly 1536 floats. Check which embedding model you’re using. all-MiniLM-L6-v2 outputs 384d, not 1536d.

PineconeApiException: index not found – The index might still be initializing. After create_index, wait for it to be ready:

import time

while not pc.describe_index("my-index").status.ready:
    time.sleep(1)

429 Too Many Requests – You’re hitting rate limits. Add exponential backoff to your upsert loop, or reduce batch size. Serverless indexes have lower burst limits than pod-based ones.
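A minimal retry wrapper along those lines (a sketch: detecting the 429 by matching the exception text is a simplification, since the exact exception type depends on your SDK version):

```python
import random
import time

def upsert_with_backoff(index, batch, max_retries=5, base_delay=1.0):
    """Retry rate-limited upserts with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return index.upsert(vectors=batch)
        except Exception as exc:
            # Re-raise anything that isn't a rate limit, or if we're out of retries
            if "429" not in str(exc) or attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt + random.random()))
```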

metadata size exceeds limit – Pinecone caps metadata at 40KB per vector. Don’t store full document text in metadata. Store a reference ID and fetch the text from your own database.
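A quick way to catch oversized metadata before the API rejects it (a sketch; serializing to JSON is an approximation of how the stored size is measured):

```python
import json

METADATA_LIMIT_BYTES = 40 * 1024  # Pinecone's per-vector cap

def metadata_size_ok(metadata: dict) -> bool:
    """Approximate the stored size by serializing to JSON."""
    return len(json.dumps(metadata).encode("utf-8")) <= METADATA_LIMIT_BYTES

print(metadata_size_ok({"source": "wiki", "ref_id": "doc-1"}))  # True
print(metadata_size_ok({"text": "x" * 50_000}))                 # False: store a reference instead
```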

Slow queries returning irrelevant results – Your embedding model and query model are probably different. Always use the same model for embedding documents and queries. Mixing text-embedding-ada-002 documents with text-embedding-3-small queries gives garbage results.