The AI Vocabulary Cheat Sheet: 80 Terms You Need to Know Cold
From attention to zero-shot — the 80 terms that come up in AI interviews, product reviews, and engineering discussions. Definitions plus context for each.
These are the 80 terms you'll encounter in AI engineering interviews, design docs, and production conversations. Definitions are kept practical — what the term means in context, not a textbook definition.
Foundations
| Term | What it means in practice |
|---|---|
| Token | The atomic unit an LLM processes — roughly 4 chars in English. Costs are per-token. |
| Context window | The maximum number of tokens a model can 'see' at once — prompt + response combined. |
| Embedding | A fixed-length float vector representing the meaning of text. Similar texts have nearby vectors. |
| Temperature | Scales logits before softmax to control randomness. 0 = greedy (effectively deterministic); 1 = the model's unscaled distribution; >1 = more random. |
| Logits | Raw scores output by the model before softmax. Sampling operates on these. |
| Top-K / Top-P | Sampling limits: Top-K restricts to the K most likely tokens; Top-P (nucleus sampling) keeps the smallest set of tokens whose cumulative probability reaches P. |
| Greedy decoding | Always pick the highest-probability next token. Deterministic but prone to repetition. |
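| Beam search | Track the K highest-scoring partial sequences in parallel and keep the best. Slower than greedy; finds higher-likelihood outputs, which helps tasks like translation more than open-ended generation. |
| Perplexity | How 'surprised' the model is by text: the exponential of the average per-token negative log-likelihood. Lower perplexity = model finds the text more probable. |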
| BPE | Byte Pair Encoding: a tokenisation algorithm used by GPT-family models and many other LLMs. Iteratively merges the most frequent adjacent symbol pairs into new tokens. |
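
The sampling terms above fit together in a few lines. What follows is a minimal sketch in plain Python, assuming logits arrive as a list of floats indexed by token id. The function names (`softmax`, `top_k_filter`, `top_p_filter`, `sample`, `perplexity`) are illustrative and not taken from any particular library.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy() for temperature 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero out the rest, renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]


def greedy(logits):
    """Greedy decoding: index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits, temperature=1.0, k=None, p=None):
    """Sample one token id after temperature, then optional Top-K and Top-P."""
    probs = softmax(logits, temperature)
    if k is not None:
        probs = top_k_filter(probs, k)
    if p is not None:
        probs = top_p_filter(probs, p)
    return random.choices(range(len(probs)), weights=probs, k=1)[0]


def perplexity(token_probs):
    """exp of the average negative log-probability the model gave each token."""
    nll = -sum(math.log(q) for q in token_probs) / len(token_probs)
    return math.exp(nll)
```

Real inference stacks apply these filters to tensors over the full vocabulary, and the order in which filters are applied varies between implementations; the order here is one reasonable choice.
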
Architecture
| Term | What it means in practice |
|---|---|
| Transformer | The architecture underlying all major LLMs. Key components: attention, FFN, residuals. |
| Self-attention | Mechanism where each token attends to all others. Core of the transformer. |
| Multi-head attention | Running attention in parallel across multiple subspaces, then concatenating. |
| QKV | Query, Key, Value — the three learned projections in each attention head. |
| Positional encoding | Tells the model where each token is in the sequence (transformers have no inherent order). |
| Residual connection | Skip connection that adds a layer's input to its output. Helps gradients flow through deep stacks, mitigating vanishing gradients. |
| Layer norm | Normalises activations across the hidden dimension. Stabilises training. |
| FFN | Feed-forward network. Applies a 2-layer MLP to each token position independently. |
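| MoE | Mixture of Experts: a router activates only a few expert sub-networks per token, so only a fraction of the parameters run. Used in Mixtral and DeepSeek-V3; reportedly in GPT-4, though OpenAI has not confirmed this. |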
| Decoder-only | Architecture where each token can only attend to previous tokens. Used by GPT, Claude, Llama. |
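
Self-attention, QKV, and the decoder-only constraint can be seen together in one small function. This is a sketch of single-head scaled dot-product attention with a causal mask, written in plain Python for readability. It assumes `Q`, `K`, and `V` are already-projected lists of vectors (in a real model they come from learned linear projections of the token embeddings), and the name `causal_attention` is illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head attention where position i attends only to positions <= i.

    Q, K, V: lists of equal length, one vector per token position.
    Returns (outputs, weights); weights[i][j] is how much token i attends to j.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Scores against the current and earlier positions only (causal mask).
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        w = softmax(scores)
        # Future positions get zero weight.
        weights.append(w + [0.0] * (len(K) - i - 1))
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(i + 1)) for c in range(len(V[0]))]
        )
    return outputs, weights
```

Multi-head attention runs several of these in parallel on different projections and concatenates the outputs. Production code does the same arithmetic as batched matrix multiplications rather than Python loops.
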
Training
| Term | What it means in practice |
|---|---|
| Pretraining | Initial training on massive text corpus to predict next token. Creates the base model. |
| Fine-tuning | Continued training on a smaller, task-specific dataset to specialise behaviour. |
| SFT | Supervised Fine-Tuning — training on (prompt, ideal response) pairs. |
| RLHF | Reinforcement Learning from Human Feedback: humans rank model responses, a reward model is trained on those rankings, then the LLM is optimised against it with RL. |
| PPO | Proximal Policy Optimisation — the RL algorithm typically used in RLHF. |
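| DPO | Direct Preference Optimisation: trains on preference pairs directly, with no separate reward model or RL loop. Simpler than PPO-based RLHF. |
| LoRA | Low-Rank Adaptation: freezes the base weights and trains small low-rank matrices added alongside them. Far fewer trainable parameters than full fine-tuning. |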
| QLoRA | Quantised LoRA: LoRA adapters trained on top of a frozen 4-bit quantised base model. The original paper fine-tuned a 65B model on a single 48 GB GPU. |
| Constitutional AI | Anthropic's technique: the model critiques and revises its own outputs against a written set of principles. |
| RLAIF | RL from AI Feedback — uses a strong LLM as the feedback model instead of humans. |
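
Two of the entries above reduce to short arithmetic. The sketch below shows why LoRA is cheap (the parameter count of a rank-r update versus the full matrix) and what the DPO loss computes for one preference pair. The function names are illustrative, and the log-probabilities passed to `dpo_loss` are assumed to be summed over the response tokens.

```python
import math


def lora_trainable_params(d_in, d_out, rank):
    """Parameters in a LoRA update: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the full weight matrix that LoRA leaves frozen."""
    return d_in * d_out


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the log-probability of a whole response under the policy
    being trained or under the frozen reference model.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    x = beta * margin
    # -log(sigmoid(x)), written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

For a 4096 × 4096 projection at rank 8, `lora_trainable_params(4096, 4096, 8)` is 65,536 against 16,777,216 for the full matrix, roughly 0.4% of the weights. The DPO loss falls as the policy raises the chosen response's probability relative to the reference more than it raises the rejected one's.
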
RAG & Retrieval
| Term | What it means in practice |
|---|---|
| RAG | Retrieval-Augmented Generation — retrieve relevant docs, include in prompt, generate. |
| Vector store | Database optimised for approximate nearest-neighbour search over embedding vectors. |
| Semantic search | Search by meaning (embeddings + cosine similarity) rather than keyword match. |
| BM25 | Classic keyword-based ranking algorithm. Strong baseline, complementary to semantic search. |
| Hybrid search | Combining BM25 and vector search scores. Usually beats either alone. |
| Reranker | Cross-encoder model that re-scores top-K retrieved candidates. Expensive but accurate. |
| Chunking | Splitting documents into retrievable pieces. Strategy heavily affects RAG quality. |
| HyDE | Hypothetical Document Embeddings — embed a generated answer, not the query, for retrieval. |
| Contextual retrieval | Anthropic technique: prepend a short LLM-generated description of where each chunk sits in its document before embedding (and BM25 indexing). |
| MMR | Maximal Marginal Relevance — selects diverse retrieved chunks, not just most similar. |
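
Semantic search and MMR both come down to cosine similarity between embedding vectors. Here is a minimal sketch, assuming embeddings are plain lists of floats and using toy 2-dimensional vectors; `cosine` and `mmr` are illustrative names, and a real system would delegate the nearest-neighbour step to a vector store.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(query_vec, doc_vecs, k, lam=0.5):
    """Maximal Marginal Relevance: pick k docs, trading relevance for diversity.

    lam = 1.0 ranks purely by similarity to the query; lower values penalise
    candidates that resemble documents already selected.
    Returns the selected document indices in selection order.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))

    def score(i):
        relevance = cosine(query_vec, doc_vecs[i])
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return lam * relevance - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: docs 0 and 1 are near-duplicates, doc 2 is less relevant but different.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.11], [0.6, 0.8]]
```

With this toy data, `mmr(query, docs, k=2, lam=1.0)` returns the two near-duplicates, while `lam=0.3` swaps the second pick for the more distinct document.
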
Agents & Tools
| Term | What it means in practice |
|---|---|
| Agent | LLM in a loop — takes actions, observes results, decides next step. |
| ReAct | Reason + Act — model alternates between reasoning steps and tool calls. |
| Tool use / function calling | Structured way to let an LLM invoke external functions with typed arguments. |
| MCP | Model Context Protocol: an open standard, introduced by Anthropic, for connecting LLM applications to external tools and data sources. |
| Agentic loop | The observe → think → act cycle that drives agent execution. |
| Orchestrator | A top-level agent or controller that delegates to sub-agents. |
| Memory (episodic) | Log of what happened in previous turns or sessions. Retrieved for context. |
| Memory (semantic) | Long-term facts about the user or world. Often stored in a vector store. |
| ToT | Tree of Thoughts — explore multiple reasoning paths, backtrack on dead ends. |
| LATS | Language Agent Tree Search — combines ToT with MCTS for complex planning. |
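
The agent, tool-use, and agentic-loop entries describe one control flow, which can be sketched without any model at all. In the example below `scripted_model` is a hypothetical stand-in for an LLM call, and the action format (`type`, `tool`, `args`, `answer`) is an assumption made for illustration rather than any provider's function-calling schema.

```python
def run_agent(model, tools, task, max_steps=5):
    """Minimal agentic loop: think (model), act (tool), observe (result)."""
    history = [("task", task)]
    for _ in range(max_steps):
        action = model(history)  # think: decide the next step
        if action["type"] == "final":
            return action["answer"], history
        tool = tools.get(action["tool"])
        if tool is None:
            observation = f"error: unknown tool {action['tool']}"
        else:
            observation = tool(**action["args"])  # act: call the tool
        history.append(("action", action))
        history.append(("observation", observation))  # observe: record result
    return None, history  # step budget exhausted without a final answer


def scripted_model(history):
    """Hypothetical stand-in for an LLM: call the adder once, then answer."""
    observations = [item for kind, item in history if kind == "observation"]
    if not observations:
        return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"type": "final", "answer": f"The sum is {observations[-1]}"}


tools = {"add": lambda a, b: a + b}
answer, history = run_agent(scripted_model, tools, "What is 2 + 3?")
print(answer)
```

The `max_steps` cap matters in practice: without it, a model that never emits a final answer loops forever. An orchestrator follows the same pattern with sub-agents registered in place of tools.
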
Evaluation & Safety
| Term | What it means in practice |
|---|---|
| Hallucination | Model confidently states false information not supported by its context or training. |
| Faithfulness | Whether a generated answer is grounded in the provided source material. |
| LLM-as-judge | Using a strong LLM to score outputs against a rubric. Scalable alternative to human eval. |
| RAGAS | RAG evaluation framework. Metrics: faithfulness, answer relevancy, context precision/recall. |
| Prompt injection | Attack where instructions in data (not the system prompt) hijack model behaviour. |
| Jailbreak | Prompting technique (role-play, obfuscation, adversarial suffixes) that gets a model to bypass its safety guidelines. |
| Guardrails | Input/output filters that enforce safety policies at the application layer. |
| Red teaming | Adversarial probing of a model system to find safety failures before users do. |
| Alignment | Research area: making models behave consistently with human values and intentions. |
| Evals | Evaluation suite — a set of (input, expected) pairs + judges that measure system quality. |
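
The "Evals" entry describes a harness small enough to write out. The sketch below assumes an exact-match judge and a toy `system` function standing in for the application under test; both are hypothetical. An LLM-as-judge setup would replace `exact_match` with a function that prompts a model against a rubric.

```python
def exact_match(output, expected):
    """Simplest possible judge: case- and whitespace-insensitive equality."""
    return output.strip().lower() == expected.strip().lower()


def run_evals(system, cases, judge=exact_match):
    """Run every (input, expected) case through the system and score it.

    Returns (pass_rate, per-case results).
    """
    results = []
    for case in cases:
        output = system(case["input"])
        results.append(
            {
                "input": case["input"],
                "output": output,
                "passed": judge(output, case["expected"]),
            }
        )
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results


# Toy system under test: a lookup table with one deliberately wrong answer.
canned = {"Capital of France?": "paris ", "2 + 2?": "5"}


def system(question):
    return canned.get(question, "")


cases = [
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "2 + 2?", "expected": "4"},
]
pass_rate, results = run_evals(system, cases)
print(f"pass rate: {pass_rate:.0%}")
```

Keeping the per-case results alongside the aggregate pass rate is what makes regressions debuggable: the failing inputs are available, not only the score.
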
Production / LLMOps
| Term | What it means in practice |
|---|---|
| Prompt caching | Reusing the KV cache for repeated prompt prefixes. Discounts are provider-specific; cached input tokens can cost up to about 90% less. |
| TTFT | Time to First Token — latency until the first output token arrives. Key UX metric. |
| Speculative decoding | A small draft model proposes several tokens; the main model verifies them in one pass. Reported speedups are typically 2–3×, depending on how often drafts are accepted. |
| Quantisation | Reducing model weight precision (FP16 → INT8 → INT4). Trades accuracy for speed/memory. |
| vLLM | High-throughput LLM serving framework. Uses PagedAttention for efficient KV cache. |
| Prompt versioning | Treating prompts as code: version control, staging, evals before promotion. |
| Trace | Full record of one request through the system, across every step: inputs, outputs, latency, tokens, cost. Essential for debugging. |
| Span | Single step within a trace (one LLM call, one tool call, one retrieval). |
| Model routing | Directing requests to different models based on complexity, cost, or latency needs. |
| Shadow mode | Running a new model/prompt in parallel with production on live traffic and comparing outputs, without serving the new results to users. |
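
The trace/span relationship is easiest to see as a data structure. This is a minimal sketch using dataclasses; the field names and the per-1k-token prices in the usage lines are assumptions for illustration, not the schema of any tracing product or the pricing of any provider.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step within a request: an LLM call, a tool call, or a retrieval."""
    name: str
    kind: str  # "llm", "tool", or "retrieval"
    latency_ms: float
    input_tokens: int = 0
    output_tokens: int = 0


@dataclass
class Trace:
    """Everything that happened while serving one request."""
    request_id: str
    spans: list = field(default_factory=list)

    def total_latency_ms(self):
        # Assumes spans ran sequentially; parallel spans would overlap.
        return sum(s.latency_ms for s in self.spans)

    def total_tokens(self):
        return sum(s.input_tokens + s.output_tokens for s in self.spans)

    def cost_usd(self, usd_per_1k_input, usd_per_1k_output):
        """Token cost across all spans at the given (hypothetical) prices."""
        input_tokens = sum(s.input_tokens for s in self.spans)
        output_tokens = sum(s.output_tokens for s in self.spans)
        return (
            input_tokens / 1000 * usd_per_1k_input
            + output_tokens / 1000 * usd_per_1k_output
        )


trace = Trace(request_id="req-1")
trace.spans.append(Span("search_docs", "retrieval", latency_ms=40.0))
trace.spans.append(
    Span("answer", "llm", latency_ms=900.0, input_tokens=1200, output_tokens=300)
)
print(trace.total_latency_ms(), trace.total_tokens())
```

Recording tokens per span rather than per trace is what lets model routing and prompt-caching decisions be evaluated step by step.
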