Sentence Transformers in Production: SBERT, Model Selection, Mean Pooling, and Domain Adaptation

Why SBERT exists (BERT [CLS] fails for similarity), mean pooling implementation, cosine vs. dot product equivalence after normalization, model selection table, and domain adaptation with MultipleNegativesRankingLoss.

Why Sentence-BERT Exists

BERT's [CLS] token produces poor sentence embeddings for semantic similarity unless the model is fine-tuned for that task. The alternative, feeding every pair of sentences through vanilla BERT together, requires N*(N-1)/2 forward passes to compare N sentences: for 10,000 sentences that is roughly 50 million forward passes. Sentence-BERT (SBERT) addressed both problems in a single paper by training a siamese/triplet network structure that produces semantically meaningful sentence embeddings. Each sentence needs only one forward pass, and pairs are then compared with a cheap vector operation.
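To make the scaling argument concrete, here is a small sketch of the arithmetic. The function names are illustrative, not from any library; the only assumption is that the pairwise approach runs one forward pass per unordered sentence pair, while the embedding approach runs one per sentence.

```python
def pairwise_passes(n: int) -> int:
    """Forward passes when every unordered pair goes through the model together."""
    return n * (n - 1) // 2


def embedding_passes(n: int) -> int:
    """Forward passes when each sentence is embedded once and pairs are compared as vectors."""
    return n


n = 10_000
print(pairwise_passes(n))   # 49995000
print(embedding_passes(n))  # 10000
```

The pairwise count grows quadratically and the embedding count linearly, which is why the gap widens so quickly as the corpus grows.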

The SBERT Training Architecture

SBERT uses a siamese network structure during training. Two copies of BERT with shared weights process the two sentences of a pair, and each output is pooled into a fixed-size vector. A training objective then shapes the embedding space so that semantically similar sentences end up close together: typically softmax cross-entropy classification on Natural Language Inference labels (entailment, neutral, contradiction), or a triplet margin loss over (anchor, positive, negative) triplets.

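# SBERT training pseudocode (siamese NLI)
sentence_a, sentence_b, label = batch  # label: 0=contradiction, 1=neutral, 2=entailment
# mask_a, mask_b: attention masks of the tokenized sentences (1 = real token, 0 = padding)

# The same BERT weights encode both sentences
emb_a = mean_pool(bert(sentence_a), mask_a)  # [batch, 768]
emb_b = mean_pool(bert(sentence_b), mask_b)  # [batch, 768]

# Concatenate both embeddings with their element-wise absolute difference
combined = concat([emb_a, emb_b, abs(emb_a - emb_b)], dim=-1)  # [batch, 2304]
loss = cross_entropy(linear(combined), label)  # linear maps 2304 -> 3 class logits

# At inference: drop the classifier head; one BERT forward pass per sentence
emb = mean_pool(bert(sentence), mask)  # [768]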
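For training on (anchor, positive, negative) triplets, one widely used formulation is a margin-based triplet loss. The sketch below is a minimal pure-Python illustration of that idea, assuming Euclidean distance and a margin of 1.0. The function names and toy vectors are hypothetical and stand in for pooled sentence embeddings.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """max(d(a, p) - d(a, n) + margin, 0).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


anchor = [1.0, 0.0]
positive = [0.9, 0.1]
far_negative = [-1.0, 0.0]   # already well separated from the anchor
near_negative = [0.8, 0.0]   # almost as close to the anchor as the positive

easy = triplet_margin_loss(anchor, positive, far_negative)
hard = triplet_margin_loss(anchor, positive, near_negative)
print(easy, hard)
```

The well-separated negative contributes no loss, while the near negative produces a large one, so gradient updates concentrate on triplets the model still gets wrong.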

Mean Pooling vs CLS Pooling

Mean pooling averages all token embeddings (excluding padding). CLS pooling takes only the [CLS] token embedding. For SBERT models, mean pooling consistently outperforms CLS pooling. The intuition: mean pooling captures information distributed across all tokens; CLS pooling relies on one token to aggregate everything, which doesn't generalize well without fine-tuning specifically on that task.

# Mean pooling implementation
def mean_pool(token_embeddings, attention_mask):
    # token_embeddings: [batch, seq_len, 768]
    # attention_mask:   [batch, seq_len] — 1 for real tokens, 0 for padding
    mask_expanded = attention_mask.unsqueeze(-1).float()  # [batch, seq_len, 1]
    sum_embeddings = (token_embeddings * mask_expanded).sum(dim=1)  # [batch, 768]
    sum_mask = mask_expanded.sum(dim=1).clamp(min=1e-9)             # [batch, 1]
    return sum_embeddings / sum_mask                                  # [batch, 768]
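To see what the mask changes, the following dependency-free sketch repeats the same computation on nested Python lists and contrasts it with CLS pooling. The helper names and the toy 2-dimensional "embeddings" are made up for illustration; a real model would produce 768-dimensional tensors.

```python
def mean_pool_lists(token_embeddings, attention_mask):
    """Masked mean over the sequence axis for a batch given as nested lists."""
    pooled = []
    for tokens, mask in zip(token_embeddings, attention_mask):
        dim = len(tokens[0])
        total = [0.0] * dim
        for vec, m in zip(tokens, mask):
            for i in range(dim):
                total[i] += vec[i] * m       # padding rows are multiplied by 0
        count = max(sum(mask), 1e-9)         # same divide-by-zero guard as the clamp
        pooled.append([t / count for t in total])
    return pooled


def cls_pool_lists(token_embeddings):
    """CLS pooling: keep only the first token's vector of each sequence."""
    return [tokens[0] for tokens in token_embeddings]


batch = [
    [[1.0, 1.0], [3.0, 5.0], [9.0, 9.0]],  # third token is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],  # no padding
]
mask = [[1, 1, 0], [1, 1, 1]]

mean_vectors = mean_pool_lists(batch, mask)
cls_vectors = cls_pool_lists(batch)
print(mean_vectors)  # [[2.0, 3.0], [4.0, 2.0]]
print(cls_vectors)   # [[1.0, 1.0], [2.0, 0.0]]
```

Without the mask, the padding vector [9.0, 9.0] would pull the first sentence's mean away from its real tokens, which is why padding must be excluded.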

Cosine Similarity vs Dot Product

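Cosine similarity divides the dot product by both vector magnitudes, so it measures the angle between vectors and ignores scale. The dot product does not normalize, so longer vectors score higher. If the embeddings are L2-normalized to unit length, the two measures are numerically identical and therefore give identical rankings. Many sentence-transformer pipelines apply this normalization after mean pooling, but not all do, so check your model's configuration. With normalized embeddings, use the dot product when your vector index or database (FAISS, Weaviate, Qdrant) is optimized for it: it skips the per-comparison norm computation that explicit cosine similarity needs.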

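Normalize your embeddings. After mean pooling, apply L2 normalization to each vector: emb = emb / emb.norm(dim=-1, keepdim=True), or equivalently torch.nn.functional.normalize(emb, p=2, dim=-1). Calling emb.norm() with no arguments on a [batch, 768] tensor returns a single norm over the whole batch, which does not give unit-length rows. Once every vector has unit length, cosine similarity equals the dot product, so a FAISS IndexFlatIP (inner product) index returns cosine-similarity scores directly, with no norms computed at query time.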
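The equivalence is easy to check by hand. The sketch below uses plain Python lists and hypothetical toy vectors: it compares cosine similarity, the raw dot product, and the dot product of L2-normalized vectors for one query against two documents.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


query = [3.0, 4.0]
docs = [
    [0.6, 0.8],    # same direction as the query, small magnitude
    [40.0, 30.0],  # different direction, large magnitude
]

cos_scores = [cosine(query, d) for d in docs]
raw_dot_scores = [dot(query, d) for d in docs]
unit_dot_scores = [dot(l2_normalize(query), l2_normalize(d)) for d in docs]

print(cos_scores)       # document 0 scores highest
print(raw_dot_scores)   # document 1 scores highest, because of its length
print(unit_dot_scores)  # matches cos_scores up to floating-point error
```

On unnormalized vectors the raw dot product ranks the long vector first even though it points in a different direction. After normalization the dot product reproduces the cosine scores, so the rankings agree.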

Choosing a Sentence Transformer Model

Domain Adaptation

Pre-trained sentence transformers are trained on general-domain data (Wikipedia, news, NLI datasets). For specialized domains (medical, legal, code, financial), in-domain fine-tuning significantly improves retrieval quality. You need (query, positive document) pairs — often mined from click logs, expert annotations, or synthetic generation with an LLM.

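A common approach is to start from all-mpnet-base-v2 and fine-tune with MultipleNegativesRankingLoss on your domain pairs. This loss needs only positive pairs: for each query in a batch, the positive documents of the other queries serve as negatives. As a rough guide, around 10,000 pairs can already give a substantial improvement, and around 100,000 pairs can approach purpose-built domain models; confirm the gain on a held-out in-domain retrieval set. This kind of domain adaptation is underutilized in production.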
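The idea behind a multiple-negatives ranking objective can be sketched without any library: score every query against every document in the batch, then apply cross-entropy with document i as the correct answer for query i. The code below is an illustrative pure-Python version, not the sentence-transformers implementation. The cosine scoring and the scale of 20.0 are assumptions chosen for the example, and the function names and toy embeddings are hypothetical.

```python
import math


def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(query_embs, doc_embs, scale=20.0):
    """Mean cross-entropy where the correct 'class' for query i is document i.

    Every other document in the batch acts as a negative for that query.
    """
    losses = []
    for i, q in enumerate(query_embs):
        logits = [scale * cos_sim(q, d) for d in doc_embs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]                  # doc i matches query i
shuffled_docs = [aligned_docs[1], aligned_docs[2], aligned_docs[0]]  # pairs are mismatched

good = in_batch_negatives_loss(queries, aligned_docs)
bad = in_batch_negatives_loss(queries, shuffled_docs)
print(good, bad)
```

The loss is near zero when each query is closest to its own document and large when the pairing is wrong. This also shows why duplicate or near-duplicate positives in one batch hurt training: they get treated as negatives.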
