Similarity Search
The core operation: given a query, find the most similar items.
import numpy as np
from numpy.linalg import norm

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b; 1.0 means they point the same way."""
    return float(np.dot(a, b) / (norm(a) * norm(b)))

# Compare similarity
# Assumes `model` is an embedding model with an encode() method, loaded earlier.
query = model.encode("machine learning algorithms")
documents = [
    model.encode("supervised learning classification methods"),
    model.encode("best pizza delivery services"),
    model.encode("neural network training techniques"),
]
for i, doc in enumerate(documents):
    sim = cosine_similarity(query, doc)
    print(f"Document {i}: {sim:.4f}")
# Output:
# Document 0: 0.7823 (related!)
# Document 1: 0.1234 (not related)
# Document 2: 0.8156 (very related!)
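The loop above compares one document at a time, which is fine for a handful of vectors. For a larger in-memory collection, the same comparison can be done as a single matrix operation. The following is a minimal sketch, assuming the document embeddings are stacked into a 2-D NumPy array. `top_k_similar` is a hypothetical helper name, and the toy 3-dimensional vectors stand in for real model output.

```python
import numpy as np


def top_k_similar(query: np.ndarray, doc_matrix: np.ndarray, k: int = 3):
    """Return (index, cosine similarity) pairs for the k most similar rows."""
    # Normalize so that a plain dot product equals cosine similarity.
    query_unit = query / np.linalg.norm(query)
    doc_units = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = doc_units @ query_unit
    # argsort is ascending, so reverse it and keep the first k indices.
    order = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in order]


# Toy vectors standing in for real embeddings (illustrative values only).
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
])
query_vec = np.array([1.0, 0.05, 0.0])

for idx, sim in top_k_similar(query_vec, docs, k=2):
    print(f"Document {idx}: {sim:.4f}")
```

This is still an exact, brute-force search: every query touches every stored vector. That cost is what dedicated vector indexes are designed to avoid.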
Vector Databases
For production use, you need a vector database: a store that indexes embeddings so similarity queries stay fast even across millions or billions of vectors.
Pinecone Example
from pinecone import Pinecone
pc = Pinecone(api_key="your-key")
index = pc.Index("my-index")
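# Upsert vectors
# embedding1, embedding2, and query_embedding are placeholders for
# vectors produced by your embedding model.
vectors = [
    {"id": "doc1", "values": embedding1, "metadata": {"title": "ML Basics"}},
    {"id": "doc2", "values": embedding2, "metadata": {"title": "Pizza Guide"}},
]
index.upsert(vectors=vectors)
# Query for the 5 nearest vectors, returning their metadata
results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True,
)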
Supabase pgvector (Free Option)
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create a table with a vector column
-- (the dimension, 1536 here, must match your embedding model's output)
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536)
);
-- Insert with embedding
INSERT INTO documents (content, embedding)
VALUES ('Machine learning basics', '[0.1, 0.2, ...]');
-- Similarity search (<=> is pgvector's cosine distance operator,
-- so 1 - distance gives cosine similarity)
SELECT content, 1 - (embedding <=> '[0.15, 0.25, ...]') AS similarity
FROM documents
ORDER BY embedding <=> '[0.15, 0.25, ...]'
LIMIT 5;
Practical Applications
1. Semantic Search Engine
Replace keyword search with meaning-based search. Users can search “how to fix slow website” and find documents about “performance optimization” even if those words don’t appear.
2. Duplicate Detection
Find near-duplicate content by comparing embedding similarity. Useful for content moderation, plagiarism detection, and deduplication.
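As a rough illustration of the idea, the sketch below flags every pair of embeddings whose cosine similarity meets a cutoff. `find_near_duplicates` is a hypothetical helper, the 0.95 threshold is an assumption you would tune on your own data, and the 2-dimensional vectors are toy values rather than real embeddings.

```python
import numpy as np


def find_near_duplicates(embeddings: np.ndarray, threshold: float = 0.95):
    """Return (i, j, similarity) for every pair at or above the threshold."""
    units = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = units @ units.T  # all pairwise cosine similarities
    pairs = []
    n = len(embeddings)
    for i in range(n):
        # Upper triangle only: skips self-comparisons and mirrored pairs.
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                pairs.append((i, j, float(sims[i, j])))
    return pairs


# Toy vectors: rows 0 and 1 point in almost the same direction.
toy = np.array([
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
])
duplicates = find_near_duplicates(toy, threshold=0.95)
print(duplicates)
```

Comparing all pairs grows quadratically with the number of items, so this approach suits small batches. Larger collections would typically query a vector index for each item's nearest neighbors instead.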
3. Recommendation Systems
Embed user preferences and item descriptions. Recommend items whose embeddings are closest to the user’s preference vector.
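One simple way to build a preference vector, used here purely as an assumption, is to average the embeddings of items the user has liked and then rank the catalog by cosine similarity to that average. `recommend` is a hypothetical helper and the vectors are toy values.

```python
import numpy as np


def recommend(liked: np.ndarray, items: np.ndarray, k: int = 2):
    """Rank items by cosine similarity to the mean of the liked embeddings."""
    # Assumption: the user's taste is summarized by the average liked vector.
    preference = liked.mean(axis=0)
    preference = preference / np.linalg.norm(preference)
    item_units = items / np.linalg.norm(items, axis=1, keepdims=True)
    scores = item_units @ preference
    # Highest score first, keep the top k item indices.
    order = np.argsort(scores)[::-1][:k]
    return [int(i) for i in order]


liked_items = np.array([[1.0, 0.0], [0.8, 0.2]])
catalog = np.array([
    [0.0, 1.0],
    [0.9, 0.1],
    [0.5, 0.5],
])
top_items = recommend(liked_items, catalog, k=2)
print(top_items)
```

A single averaged vector can blur distinct interests together, so a real system might keep several preference vectors per user or weight recent activity more heavily.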
4. Clustering and Classification
Group similar items automatically using k-means or HDBSCAN on embeddings. No labels needed — the structure emerges from the data.
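To make the k-means option concrete, here is a minimal sketch of Lloyd's algorithm in NumPy. It is not a replacement for a library implementation: the `kmeans` function, its deterministic farthest-first initialization, and the toy 2-dimensional points are all illustrative assumptions.

```python
import numpy as np


def kmeans(points: np.ndarray, k: int, iterations: int = 20):
    """Minimal k-means (Lloyd's algorithm); returns (labels, centroids)."""
    # Deterministic farthest-first initialization: start with the first point,
    # then repeatedly add the point farthest from the centroids chosen so far.
    chosen = [0]
    while len(chosen) < k:
        gaps = np.linalg.norm(
            points[:, None, :] - points[chosen][None, :, :], axis=2
        )
        chosen.append(int(gaps.min(axis=1).argmax()))
    centroids = points[chosen].astype(float)

    for _ in range(iterations):
        # Distance from every point to every centroid, shape (n, k).
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)
        updated = centroids.copy()
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                updated[c] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break  # assignments have stabilized
        centroids = updated
    return labels, centroids


# Two well-separated toy groups standing in for embedded documents.
toy_points = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
    [5.0, 5.1], [5.1, 5.0], [4.9, 5.0],
])
labels, centroids = kmeans(toy_points, k=2)
print(labels.tolist())
```

K-means measures Euclidean distance, so if you want clusters that reflect cosine similarity, normalizing the embeddings to unit length first is a common choice.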
Choosing an Embedding Model
- OpenAI text-embedding-3-large: Best quality, $0.13/million tokens
- OpenAI text-embedding-3-small: Good quality, $0.02/million tokens
- Cohere embed-v4: Competitive quality, strong multilingual support
- all-MiniLM-L6-v2: Free, runs locally, 384 dimensions, great for prototyping
- BGE-large-en-v1.5: Free, runs locally, 1024 dimensions, production-ready quality
People Also Ask
How many dimensions should embeddings have?
More dimensions capture more nuance but cost more to store and search. For most applications, 768-1536 dimensions is the sweet spot. 384 is fine for prototyping.
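The storage side of that trade-off is simple arithmetic. The sketch below assumes each value is stored as a 32-bit float (4 bytes) and counts only the raw vectors, not index structures or metadata. `embedding_storage_bytes` is a hypothetical helper name.

```python
def embedding_storage_bytes(
    num_vectors: int, dimensions: int, bytes_per_value: int = 4
) -> int:
    """Raw size of the vectors alone, assuming float32 (4 bytes per value)."""
    return num_vectors * dimensions * bytes_per_value


for dims in (384, 768, 1536):
    size_gb = embedding_storage_bytes(1_000_000, dims) / 1e9
    print(f"{dims} dimensions: {size_gb:.2f} GB per million vectors")
```

Under these assumptions, a million vectors take roughly 1.54 GB at 384 dimensions and 6.14 GB at 1536, so quadrupling the dimension count quadruples the raw storage.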
Can I use embeddings for images?
Yes — CLIP and similar models create embeddings for both text and images in the same vector space. You can search for images using text queries and vice versa.
How much does vector storage cost?
Pinecone: ~$0.33/million vectors/month. Supabase pgvector: included in Supabase pricing. Self-hosted Qdrant or Chroma: just your server costs.