Sentence Transformers in Production: SBERT, Model Selection, Mean Pooling, and Domain Adaptation
Why SBERT exists (BERT [CLS] fails for similarity), mean pooling implementation, cosine vs. dot product equivalence after normalization, model selection table, and domain adaptation with MultipleNegativesRankingLoss.
Why Sentence-BERT Exists
Vanilla BERT has two problems for semantic similarity. First, its [CLS] token produces poor sentence embeddings unless it is fine-tuned for the task. Second, scoring a pair means feeding both sentences through the model together, so comparing every pair among N sentences takes N*(N-1)/2 forward passes; for 10,000 sentences that is about 50 million. Sentence-BERT (SBERT) addressed both problems in a single paper by training a siamese/triplet network that produces semantically meaningful sentence embeddings with one forward pass per sentence. The embeddings can then be compared with cheap vector operations.
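The arithmetic behind those numbers is easy to check. The sketch below only counts forward passes; the function names are illustrative, not part of any library.

```python
def pairwise_cross_encoder_passes(n: int) -> int:
    """Forward passes when every unordered sentence pair goes through BERT together."""
    return n * (n - 1) // 2


def bi_encoder_passes(n: int) -> int:
    """Forward passes when each sentence is embedded once and pairs are compared as vectors."""
    return n


n = 10_000
print(pairwise_cross_encoder_passes(n))  # 49995000
print(bi_encoder_passes(n))              # 10000
```

The gap grows quadratically: doubling the corpus roughly quadruples the pairwise cost but only doubles the embedding cost.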
The SBERT Training Architecture
SBERT uses a siamese network structure during training. Two copies of BERT with shared weights process the two sentences of each pair, and the outputs are pooled into fixed-size vectors. A training objective then shapes the embedding space so that semantically similar sentences end up close together: typically cross-entropy over Natural Language Inference labels (entailment=similar, contradiction=dissimilar), or a triplet margin loss over (anchor, positive, negative) triplets.
# SBERT training pseudocode (siamese NLI)
sentence_a, sentence_b, label = batch # label: 0=contradiction, 1=neutral, 2=entailment
emb_a = mean_pool(bert(sentence_a)) # [batch, 768]
emb_b = mean_pool(bert(sentence_b)) # [batch, 768]
# Concatenate with element-wise difference
combined = concat([emb_a, emb_b, abs(emb_a - emb_b)]) # [batch, 2304]
loss = cross_entropy(linear(combined), label)
# At inference: just one BERT forward pass per sentence
emb = mean_pool(bert(sentence)) # [768] — done
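The pseudocode above covers the NLI classification objective. For the triplet variant, the SBERT paper uses a margin loss on Euclidean distances. A minimal, dependency-free sketch of that formula follows; the toy 2-dimensional vectors and the helper names are made up for illustration, and a real training loop would compute this on batched tensors.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """max(d(a, p) - d(a, n) + margin, 0).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # distance 1.0 from the anchor
far_negative = [3.0, 4.0]    # distance 5.0 from the anchor
near_negative = [0.0, 1.5]   # distance 1.5 from the anchor

easy = triplet_margin_loss(anchor, positive, far_negative)   # already separated, no gradient signal
hard = triplet_margin_loss(anchor, positive, near_negative)  # inside the margin, still penalized
print(easy, hard)
```

Only triplets that violate the margin contribute to training, which is why hard negatives matter more than easy ones.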
Mean Pooling vs CLS Pooling
Mean pooling averages all token embeddings (excluding padding). CLS pooling takes only the [CLS] token embedding. For SBERT models, mean pooling generally outperforms CLS pooling. The intuition: mean pooling captures information distributed across all tokens, while CLS pooling relies on one token to aggregate everything, which doesn't generalize well unless the model was fine-tuned specifically to use that token.
# Mean pooling implementation
def mean_pool(token_embeddings, attention_mask):
    # token_embeddings: [batch, seq_len, 768]
    # attention_mask: [batch, seq_len], 1 for real tokens, 0 for padding
    mask_expanded = attention_mask.unsqueeze(-1).float()  # [batch, seq_len, 1]
    sum_embeddings = (token_embeddings * mask_expanded).sum(dim=1)  # [batch, 768]
    sum_mask = mask_expanded.sum(dim=1).clamp(min=1e-9)  # [batch, 1]
    return sum_embeddings / sum_mask  # [batch, 768]
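To see why the mask matters, here is a dependency-free worked example for a single sentence. It mirrors the masked average above using plain lists; the 2-dimensional vectors and the values at the padding positions are invented for illustration.

```python
def masked_mean(token_vectors, mask):
    """Average only the vectors whose mask entry is 1 (one sentence, plain lists)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, mask) if m == 1]
    count = max(len(kept), 1)  # avoid division by zero for an all-padding row
    return [sum(vec[i] for vec in kept) / count for i in range(dim)]


def naive_mean(token_vectors):
    """Average every position, padding included."""
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors) for i in range(dim)]


# Two real tokens followed by two padding positions.
tokens = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]

print(masked_mean(tokens, mask))  # [2.0, 4.0]
print(naive_mean(tokens))         # [5.5, 6.5]
```

Without the mask, the same sentence would get a different embedding depending on how much padding its batch happened to need.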
Cosine Similarity vs Dot Product
Cosine similarity divides by the vector magnitudes, so it measures only the angle between vectors and ignores scale. Dot product does not normalize, so longer vectors score higher. The two give identical rankings only if the embeddings are L2-normalized. That holds when the pipeline ends with a normalization step after pooling; mean pooling alone does not produce unit-length vectors. If your vector index (FAISS, Weaviate, Qdrant) is optimized for inner product, store normalized embeddings and use dot product: you get cosine rankings without computing norms at query time.
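Normalize your embeddings. After mean pooling, apply L2 normalization to each vector: emb = emb / emb.norm(dim=-1, keepdim=True), or equivalently torch.nn.functional.normalize(emb, p=2, dim=-1). A bare emb.norm() on a batch returns one norm over the whole tensor, which does not give unit-length rows. Once each row has unit length, cosine similarity equals dot product, so an inner-product index such as FAISS IndexFlatIP returns cosine rankings directly. FAISS has no separate cosine index type; normalized vectors in an inner-product index are the standard way to get cosine similarity there.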
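The equivalence can be checked on a toy case. In the sketch below (plain Python, with made-up 2-dimensional vectors and illustrative names), raw dot product and cosine disagree on the best match, and they agree once each vector is normalized separately.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a single vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


query = [1.0, 0.0]
docs = {"long_but_off_angle": [10.0, 10.0], "short_but_aligned": [1.0, 0.1]}

# Raw dot product rewards magnitude, so the long vector wins.
raw_best = max(docs, key=lambda name: dot(query, docs[name]))

# Cosine ignores magnitude, so the aligned vector wins.
cos_best = max(docs, key=lambda name: cosine(query, docs[name]))

# After normalizing each vector separately, dot product agrees with cosine.
q_unit = l2_normalize(query)
norm_best = max(docs, key=lambda name: dot(q_unit, l2_normalize(docs[name])))

print(raw_best, cos_best, norm_best)
```

The same reasoning is why normalization has to be applied per vector: dividing a whole batch by one shared norm rescales every row equally and leaves the magnitude differences in place.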
Choosing a Sentence Transformer Model
Domain Adaptation
Pre-trained sentence transformers are trained on general-domain data (Wikipedia, news, NLI datasets). For specialized domains (medical, legal, code, financial), in-domain fine-tuning significantly improves retrieval quality. You need (query, positive document) pairs — often mined from click logs, expert annotations, or synthetic generation with an LLM.
Common approach: start from all-mpnet-base-v2 and fine-tune with MultipleNegativesRankingLoss on your domain pairs. This loss needs only positive pairs: for each query, the other documents in the same batch serve as negatives. With ~10,000 pairs you can get substantial improvement; with ~100,000 pairs you approach purpose-built domain models. This is called 'domain adaptation' and is underutilized in production.
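What the loss computes can be sketched without any ML library. The version below is an illustration of the in-batch-negatives idea, not the sentence-transformers implementation: it scores every query against every document in the batch with scaled cosine similarity and applies cross-entropy with document i as the correct answer for query i. The function name, the toy vectors, and the scale value of 20 are assumptions chosen for the example.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def in_batch_negatives_loss(query_embs, doc_embs, scale=20.0):
    """Mean cross-entropy where the correct 'class' for query i is document i."""
    losses = []
    for i, q in enumerate(query_embs):
        logits = [scale * cosine(q, d) for d in doc_embs]
        # Numerically stable log-sum-exp over all documents in the batch.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9]]  # each query is closest to its own document
swapped_docs = [[0.1, 0.9], [0.9, 0.1]]  # each query is closest to the other's document

good = in_batch_negatives_loss(queries, aligned_docs)
bad = in_batch_negatives_loss(queries, swapped_docs)
print(good, bad)
```

Because every other document in the batch acts as a negative, larger batches give each query more negatives per step. That also means near-duplicate positives landing in the same batch get treated as negatives, so deduplicating pairs before training is worth the effort.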