GenAI Systems Lab Open interactive version →
Production & LLMOps 10 min read

How Spotify Uses AI: Embeddings, DJ, and the Recommendation Stack

From music2vec embeddings to AI-generated narration. How Spotify's recommendation system evolved from collaborative filtering to multimodal embeddings, and what LLMs actually power in production.

Spotify has been doing AI at scale longer than most companies have been doing AI at all. Their recommendation system predates the LLM era by a decade. Understanding how they've layered LLM capabilities on top of a mature ML platform reveals how production AI actually evolves — not as a replacement for existing systems, but as a new layer on top of them.

The recommendation stack before LLMs

Spotify's recommendation system is a multi-stage ranking pipeline that combines three signal types:

In 2018, Spotify published word2vec-based music embeddings (music2vec) — trained not on text but on listening sessions. A track that appears next to another in many sessions gets a similar embedding. This was semantic search applied to music before the term became mainstream.

Embeddings trained on behavioral sequences (listening sessions, clicks, purchases) often outperform embeddings trained on content features alone. The sequence is the signal. This is why word2vec's architecture generalizes beyond text to product recommendations, music, and any sequential interaction data.

Where LLMs actually appear in Spotify's stack

LLMs don't run the recommendation engine. They run the narrative layer on top of it. The two main use cases are:

1. AI DJ: generated narration between tracks

Spotify's DJ feature (launched 2023) plays music interspersed with short 15-30 second voice narrations. 'You've been on a real 90s indie kick lately. Here's a deep cut from Pavement.' These feel personal. They're generated by an LLM.

The pipeline:

The LLM's job is specific: write a 2-3 sentence bridge that feels natural, references the user's history, and introduces the next track. The system prompt includes rules about tone, length, and what not to say (no artist biographies, no Wikipedia-style facts that feel robotic).

2. Podcast transcription and search

Spotify has 5+ million podcasts. Most have no searchable text. Whisper (or an equivalent ASR model) transcribes them at scale. These transcripts enable:

At Spotify's scale (new podcast episodes every minute), this is a streaming pipeline: new audio → ASR → chunk → embed → index. Latency from publication to searchability is measured in minutes.

Multimodal embeddings: combining audio, text, and behavior

Spotify's current embedding research fuses three modalities per track:

These are fused into a single embedding using a contrastive training approach similar to CLIP — but instead of image-text pairs, the pairs are audio-behavior sequences. A track's audio embedding should be close to its behavioral embedding if they describe the same musical identity.

Multi-modal fusion is most powerful when the modalities contain complementary information. Audio tells you what the music sounds like. Behavior tells you who likes it. Text tells you how people talk about it. None of these alone is as powerful as the three combined.

Lessons from Spotify's AI stack

Interactive lab:

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →