AI Engineering 12 min read

The AI Vocabulary Cheat Sheet: 80 Terms You Need to Know Cold

From attention to zero-shot — the 80 terms that come up in AI interviews, product reviews, and engineering discussions. Definitions plus context for each.

These are the 80 terms you'll encounter in AI engineering interviews, design docs, and production conversations. Definitions are kept practical — what the term means in context, not a textbook definition.

Foundations

Term	What it means in practice
Token	The atomic unit an LLM processes — roughly 4 chars in English. Costs are per-token.
Context window	The maximum number of tokens a model can 'see' at once — prompt + response combined.
Embedding	A fixed-length float vector representing the meaning of text. Similar texts have nearby vectors.
Temperature	Controls randomness at generation. 0 = deterministic. 1 = default. >1 = more random.
Logits	Raw scores output by the model before softmax. Sampling operates on these.
Top-K / Top-P	Sampling limits: Top-K restricts to K most likely tokens; Top-P uses probability mass threshold.
Greedy decoding	Always pick the highest-probability next token. Deterministic but prone to repetition.
Beam search	Explore K sequences in parallel, keep best. Slower than greedy, better quality.
Perplexity	How 'surprised' the model is by text. Lower perplexity = model finds text more probable.
BPE	Byte Pair Encoding — the tokenisation algorithm used by GPT/Claude. Merges common char pairs.

Architecture

Term	What it means in practice
Transformer	The architecture underlying all major LLMs. Key components: attention, FFN, residuals.
Self-attention	Mechanism where each token attends to all others. Core of the transformer.
Multi-head attention	Running attention in parallel across multiple subspaces, then concatenating.
QKV	Query, Key, Value — the three learned projections in each attention head.
Positional encoding	Tells the model where each token is in the sequence (transformers have no inherent order).
Residual connection	Skip connection that adds input to output of a layer. Prevents vanishing gradients.
Layer norm	Normalises activations across the hidden dimension. Stabilises training.
FFN	Feed-forward network. Applies a 2-layer MLP to each token position independently.
MoE	Mixture of Experts — only activates a subset of model parameters per token. Used in GPT-4.
Decoder-only	Architecture where each token can only attend to previous tokens. Used by GPT, Claude, Llama.

Training

Term	What it means in practice
Pretraining	Initial training on massive text corpus to predict next token. Creates the base model.
Fine-tuning	Continued training on a smaller, task-specific dataset to specialise behaviour.
SFT	Supervised Fine-Tuning — training on (prompt, ideal response) pairs.
RLHF	Reinforcement Learning from Human Feedback — ranks responses, trains a reward model, then RL.
PPO	Proximal Policy Optimisation — the RL algorithm typically used in RLHF.
DPO	Direct Preference Optimisation — trains on preference pairs directly, simpler than RLHF.
LoRA	Low-Rank Adaptation — fine-tunes only a small set of added weight matrices. Efficient.
QLoRA	Quantised LoRA — LoRA on a quantised (4-bit) model. Fine-tune 65B model on a consumer GPU.
Constitutional AI	Anthropic's technique: model critiques its own outputs against a set of principles.
RLAIF	RL from AI Feedback — uses a strong LLM as the feedback model instead of humans.

RAG & Retrieval

Term	What it means in practice
RAG	Retrieval-Augmented Generation — retrieve relevant docs, include in prompt, generate.
Vector store	Database optimised for approximate nearest-neighbour search over embedding vectors.
Semantic search	Search by meaning (embeddings + cosine similarity) rather than keyword match.
BM25	Classic keyword-based ranking algorithm. Strong baseline, complementary to semantic search.
Hybrid search	Combining BM25 and vector search scores. Usually beats either alone.
Reranker	Cross-encoder model that re-scores top-K retrieved candidates. Expensive but accurate.
Chunking	Splitting documents into retrievable pieces. Strategy heavily affects RAG quality.
HyDE	Hypothetical Document Embeddings — embed a generated answer, not the query, for retrieval.
Contextual retrieval	Anthropic technique: add context about each chunk's document before embedding.
MMR	Maximal Marginal Relevance — selects diverse retrieved chunks, not just most similar.

Agents & Tools

Term	What it means in practice
Agent	LLM in a loop — takes actions, observes results, decides next step.
ReAct	Reason + Act — model alternates between reasoning steps and tool calls.
Tool use / function calling	Structured way to let an LLM invoke external functions with typed arguments.
MCP	Model Context Protocol — Anthropic's standard for connecting LLMs to external tools.
Agentic loop	The observe → think → act cycle that drives agent execution.
Orchestrator	A top-level agent or controller that delegates to sub-agents.
Memory (episodic)	Log of what happened in previous turns or sessions. Retrieved for context.
Memory (semantic)	Long-term facts about the user or world. Stored in a vector store.
ToT	Tree of Thoughts — explore multiple reasoning paths, backtrack on dead ends.
LATS	Language Agent Tree Search — combines ToT with MCTS for complex planning.

Evaluation & Safety

Term	What it means in practice
Hallucination	Model confidently states false information not supported by its context or training.
Faithfulness	Whether a generated answer is grounded in the provided source material.
LLM-as-judge	Using a strong LLM to score outputs against a rubric. Scalable alternative to human eval.
RAGAS	RAG evaluation framework. Metrics: faithfulness, answer relevancy, context precision/recall.
Prompt injection	Attack where instructions in data (not the system prompt) hijack model behaviour.
Jailbreak	Social-engineering technique to bypass model safety guidelines.
Guardrails	Input/output filters that enforce safety policies at the application layer.
Red teaming	Adversarial probing of a model system to find safety failures before users do.
Alignment	Research area: making models behave consistently with human values and intentions.
Evals	Evaluation suite — a set of (input, expected) pairs + judges that measure system quality.

Production / LLMOps

Term	What it means in practice
Prompt caching	Reusing KV cache for repeated prompt prefixes. Saves 80–90% cost on cached tokens.
TTFT	Time to First Token — latency until the first output token arrives. Key UX metric.
Speculative decoding	Draft model proposes tokens; main model verifies. Speeds inference 2–3×.
Quantisation	Reducing model weight precision (FP16 → INT8 → INT4). Trades accuracy for speed/memory.
vLLM	High-throughput LLM serving framework. Uses PagedAttention for efficient KV cache.
Prompt versioning	Treating prompts as code: version control, staging, evals before promotion.
Trace	Full record of an LLM call: inputs, outputs, latency, tokens, cost. Essential for debugging.
Span	Single step within a trace (one LLM call, one tool call, one retrieval).
Model routing	Directing requests to different models based on complexity, cost, or latency needs.
Shadow mode	Running a new model/prompt in parallel with production, comparing outputs without serving results.

Test your AI vocabulary →: Flashcard-style fluency drills in the Fluency module.

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →