Production & LLMOps 10 min read

How Spotify Uses AI: Embeddings, DJ, and the Recommendation Stack

From music2vec embeddings to AI-generated narration. How Spotify's recommendation system evolved from collaborative filtering to multimodal embeddings, and what LLMs actually power in production.

Spotify has been doing AI at scale longer than most companies have been doing AI at all. Their recommendation system predates the LLM era by a decade. Understanding how they've layered LLM capabilities on top of a mature ML platform reveals how production AI actually evolves — not as a replacement for existing systems, but as a new layer on top of them.

The recommendation stack before LLMs

Spotify's recommendation system is a multi-stage ranking pipeline that combines three signal types:

Collaborative filtering: 'users like you also liked this.' The classic approach — find users with similar listening history and recommend what they liked. Works well for mainstream content, breaks for new artists and niche genres.
Audio features: Spotify's audio analysis models extract features from every track — tempo, energy, valence (mood positivity), acousticness, danceability. These are used for mood-based recommendations and radio.
NLP on text: Spotify crawls the web for blog posts, playlists, and reviews that mention artists and tracks. These are used to build semantic representations of artists — their genre, associations, audience.

In 2018, Spotify published word2vec-based music embeddings (music2vec) — trained not on text but on listening sessions. A track that appears next to another in many sessions gets a similar embedding. This was semantic search applied to music before the term became mainstream.

Embeddings trained on behavioral sequences (listening sessions, clicks, purchases) often outperform embeddings trained on content features alone. The sequence is the signal. This is why word2vec's architecture generalizes beyond text to product recommendations, music, and any sequential interaction data.

Where LLMs actually appear in Spotify's stack

LLMs don't run the recommendation engine. They run the narrative layer on top of it. The two main use cases are:

1. AI DJ: generated narration between tracks

Spotify's DJ feature (launched 2023) plays music interspersed with short 15-30 second voice narrations. 'You've been on a real 90s indie kick lately. Here's a deep cut from Pavement.' These feel personal. They're generated by an LLM.

The pipeline:

The recommendation model selects the next track based on listening history
A context assembly step gathers: the selected track, the artist, recent listening history, time of day, the previous narration
An LLM generates a short narration that connects the previous section to the new track
A text-to-speech model (trained on a specific voice) narrates it with Spotify's DJ voice
Audio is stitched: previous track fade-out + narration + new track intro

The LLM's job is specific: write a 2-3 sentence bridge that feels natural, references the user's history, and introduces the next track. The system prompt includes rules about tone, length, and what not to say (no artist biographies, no Wikipedia-style facts that feel robotic).

2. Podcast transcription and search

Spotify has 5+ million podcasts. Most have no searchable text. Whisper (or an equivalent ASR model) transcribes them at scale. These transcripts enable:

Full-text search within episodes — 'find the moment they talked about Python packaging'
Semantic search across episodes and shows
Chapter detection — segmenting long episodes into topics for navigation
Automated summaries for show pages

At Spotify's scale (new podcast episodes every minute), this is a streaming pipeline: new audio → ASR → chunk → embed → index. Latency from publication to searchability is measured in minutes.

Multimodal embeddings: combining audio, text, and behavior

Spotify's current embedding research fuses three modalities per track:

Audio features: from the raw audio signal — spectral analysis, tempo, dynamics
Text: lyrics + web mentions + artist metadata, processed with a text encoder
Behavioral: from listening session sequences — what gets played together, what follows what

These are fused into a single embedding using a contrastive training approach similar to CLIP — but instead of image-text pairs, the pairs are audio-behavior sequences. A track's audio embedding should be close to its behavioral embedding if they describe the same musical identity.

Multi-modal fusion is most powerful when the modalities contain complementary information. Audio tells you what the music sounds like. Behavior tells you who likes it. Text tells you how people talk about it. None of these alone is as powerful as the three combined.

Lessons from Spotify's AI stack

LLMs are a narrative layer, not the recommendation engine. At Spotify's scale, collaborative filtering + audio embeddings outperform any prompt-based approach for what to play next. LLMs add value in how to present it.
Behavioral embeddings (trained on sequences) are often more powerful than content embeddings for recommendation. The sequence is the signal.
Streaming transcription pipelines need to handle scale continuously, not in batches. Design for throughput from the start.
Multi-modal fusion requires each modality to carry genuinely complementary information — otherwise you're adding noise.

Interactive lab:

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →