GenAI Systems Lab Open interactive version →
AI Engineering 12 min read

The AI Vocabulary Cheat Sheet: 80 Terms You Need to Know Cold

From attention to zero-shot — the 80 terms that come up in AI interviews, product reviews, and engineering discussions. Definitions plus context for each.

These are the 80 terms you'll encounter in AI engineering interviews, design docs, and production conversations. Definitions are kept practical — what the term means in context, not a textbook definition.

Foundations

TermWhat it means in practice
TokenThe atomic unit an LLM processes — roughly 4 chars in English. Costs are per-token.
Context windowThe maximum number of tokens a model can 'see' at once — prompt + response combined.
EmbeddingA fixed-length float vector representing the meaning of text. Similar texts have nearby vectors.
TemperatureControls randomness at generation. 0 = deterministic. 1 = default. >1 = more random.
LogitsRaw scores output by the model before softmax. Sampling operates on these.
Top-K / Top-PSampling limits: Top-K restricts to K most likely tokens; Top-P uses probability mass threshold.
Greedy decodingAlways pick the highest-probability next token. Deterministic but prone to repetition.
Beam searchExplore K sequences in parallel, keep best. Slower than greedy, better quality.
PerplexityHow 'surprised' the model is by text. Lower perplexity = model finds text more probable.
BPEByte Pair Encoding — the tokenisation algorithm used by GPT/Claude. Merges common char pairs.

Architecture

TermWhat it means in practice
TransformerThe architecture underlying all major LLMs. Key components: attention, FFN, residuals.
Self-attentionMechanism where each token attends to all others. Core of the transformer.
Multi-head attentionRunning attention in parallel across multiple subspaces, then concatenating.
QKVQuery, Key, Value — the three learned projections in each attention head.
Positional encodingTells the model where each token is in the sequence (transformers have no inherent order).
Residual connectionSkip connection that adds input to output of a layer. Prevents vanishing gradients.
Layer normNormalises activations across the hidden dimension. Stabilises training.
FFNFeed-forward network. Applies a 2-layer MLP to each token position independently.
MoEMixture of Experts — only activates a subset of model parameters per token. Used in GPT-4.
Decoder-onlyArchitecture where each token can only attend to previous tokens. Used by GPT, Claude, Llama.

Training

TermWhat it means in practice
PretrainingInitial training on massive text corpus to predict next token. Creates the base model.
Fine-tuningContinued training on a smaller, task-specific dataset to specialise behaviour.
SFTSupervised Fine-Tuning — training on (prompt, ideal response) pairs.
RLHFReinforcement Learning from Human Feedback — ranks responses, trains a reward model, then RL.
PPOProximal Policy Optimisation — the RL algorithm typically used in RLHF.
DPODirect Preference Optimisation — trains on preference pairs directly, simpler than RLHF.
LoRALow-Rank Adaptation — fine-tunes only a small set of added weight matrices. Efficient.
QLoRAQuantised LoRA — LoRA on a quantised (4-bit) model. Fine-tune 65B model on a consumer GPU.
Constitutional AIAnthropic's technique: model critiques its own outputs against a set of principles.
RLAIFRL from AI Feedback — uses a strong LLM as the feedback model instead of humans.

RAG & Retrieval

TermWhat it means in practice
RAGRetrieval-Augmented Generation — retrieve relevant docs, include in prompt, generate.
Vector storeDatabase optimised for approximate nearest-neighbour search over embedding vectors.
Semantic searchSearch by meaning (embeddings + cosine similarity) rather than keyword match.
BM25Classic keyword-based ranking algorithm. Strong baseline, complementary to semantic search.
Hybrid searchCombining BM25 and vector search scores. Usually beats either alone.
RerankerCross-encoder model that re-scores top-K retrieved candidates. Expensive but accurate.
ChunkingSplitting documents into retrievable pieces. Strategy heavily affects RAG quality.
HyDEHypothetical Document Embeddings — embed a generated answer, not the query, for retrieval.
Contextual retrievalAnthropic technique: add context about each chunk's document before embedding.
MMRMaximal Marginal Relevance — selects diverse retrieved chunks, not just most similar.

Agents & Tools

TermWhat it means in practice
AgentLLM in a loop — takes actions, observes results, decides next step.
ReActReason + Act — model alternates between reasoning steps and tool calls.
Tool use / function callingStructured way to let an LLM invoke external functions with typed arguments.
MCPModel Context Protocol — Anthropic's standard for connecting LLMs to external tools.
Agentic loopThe observe → think → act cycle that drives agent execution.
OrchestratorA top-level agent or controller that delegates to sub-agents.
Memory (episodic)Log of what happened in previous turns or sessions. Retrieved for context.
Memory (semantic)Long-term facts about the user or world. Stored in a vector store.
ToTTree of Thoughts — explore multiple reasoning paths, backtrack on dead ends.
LATSLanguage Agent Tree Search — combines ToT with MCTS for complex planning.

Evaluation & Safety

TermWhat it means in practice
HallucinationModel confidently states false information not supported by its context or training.
FaithfulnessWhether a generated answer is grounded in the provided source material.
LLM-as-judgeUsing a strong LLM to score outputs against a rubric. Scalable alternative to human eval.
RAGASRAG evaluation framework. Metrics: faithfulness, answer relevancy, context precision/recall.
Prompt injectionAttack where instructions in data (not the system prompt) hijack model behaviour.
JailbreakSocial-engineering technique to bypass model safety guidelines.
GuardrailsInput/output filters that enforce safety policies at the application layer.
Red teamingAdversarial probing of a model system to find safety failures before users do.
AlignmentResearch area: making models behave consistently with human values and intentions.
EvalsEvaluation suite — a set of (input, expected) pairs + judges that measure system quality.

Production / LLMOps

TermWhat it means in practice
Prompt cachingReusing KV cache for repeated prompt prefixes. Saves 80–90% cost on cached tokens.
TTFTTime to First Token — latency until the first output token arrives. Key UX metric.
Speculative decodingDraft model proposes tokens; main model verifies. Speeds inference 2–3×.
QuantisationReducing model weight precision (FP16 → INT8 → INT4). Trades accuracy for speed/memory.
vLLMHigh-throughput LLM serving framework. Uses PagedAttention for efficient KV cache.
Prompt versioningTreating prompts as code: version control, staging, evals before promotion.
TraceFull record of an LLM call: inputs, outputs, latency, tokens, cost. Essential for debugging.
SpanSingle step within a trace (one LLM call, one tool call, one retrieval).
Model routingDirecting requests to different models based on complexity, cost, or latency needs.
Shadow modeRunning a new model/prompt in parallel with production, comparing outputs without serving results.

Test your AI vocabulary →: Flashcard-style fluency drills in the Fluency module.

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →