Multi-Hop Reasoning Failures: When Your RAG System Can't Connect the Dots

Why single-step retrieval fails on questions that require chaining two or three facts, what two-hop and three-hop failures look like, and how to architect around them.

You ask your RAG system: 'Who is the manager of the team that owns the product that generated the most revenue last quarter?' Every fact needed to answer that question exists in your knowledge base, but no single document contains all of them. Your system either fails to answer or, worse, confidently returns a plausible-sounding wrong name.

This is multi-hop reasoning failure: questions that require connecting facts across multiple retrieved documents, where naive RAG — retrieve once, answer once — breaks down.

Why naive RAG can't multi-hop

Naive RAG embeds the original question, retrieves the most similar chunks, and generates an answer. For the question above, the most similar chunks might tell you which product had the most revenue. But those chunks don't contain who manages the team that owns that product — that's a different document, not retrieved because it didn't match the original query.

The model then either answers from incomplete information (confidently wrong) or admits it doesn't know (unhelpful). Neither is what your users need.
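The retrieval gap described above can be reproduced with a toy example. This is a minimal sketch, not a real retriever: it assumes word overlap as a crude stand-in for embedding similarity, and the corpus sentences (including the two distractors) are made up for illustration.

```python
import re

QUESTION = (
    "Who is the manager of the team that owns the product "
    "that generated the most revenue last quarter?"
)

# Hypothetical corpus: three facts needed for the answer, plus two distractors.
CORPUS = [
    "ProductX generated the most revenue last quarter.",
    "ProductY generated the second most revenue last quarter.",
    "Revenue last quarter grew across most product lines.",
    "The Platform team owns ProductX.",
    "Priya Mehta manages the Platform team.",
]

def tokens(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, corpus, top_k=2):
    """Rank documents by word overlap with the query (a stand-in for cosine similarity)."""
    query_tokens = tokens(query)
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_tokens & tokens(doc)),
        reverse=True,
    )
    return ranked[:top_k]

# Single-shot retrieval: the revenue documents dominate the ranking,
# and the document naming the manager never reaches the model.
top_chunks = retrieve(QUESTION, CORPUS)
for chunk in top_chunks:
    print(chunk)
```

The question shares most of its words with the revenue documents, so they crowd out the ownership and management facts. A real embedding model ranks differently in detail, but the same crowding-out effect is what the section describes.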

The fixes

1. Iterative retrieval

Instead of one retrieve-then-answer step, decompose the question into sub-questions and retrieve for each:

Step 1: 'Which product had the most revenue last quarter?' → retrieve, get answer: 'ProductX'.
Step 2: 'Which team owns ProductX?' → retrieve, get answer: 'Platform team'.
Step 3: 'Who manages the Platform team?' → retrieve, get answer: 'Priya Mehta'.

Then compose the final answer.

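import json

def format_chunks(chunks):
    # Join retrieved chunks into one numbered context block (assumes chunks are strings)
    return "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))

def multi_hop_answer(question, vector_store, llm, max_hops=4):
    context_so_far = []
    current_question = question

    for hop in range(max_hops):
        # Retrieve for the current sub-question, skipping chunks we already hold
        chunks = vector_store.search(current_question, top_k=3)
        context_so_far.extend([c for c in chunks if c not in context_so_far])

        # Ask: do we have enough to answer the original question?
        check_prompt = f"""Original question: {question}
Context gathered so far: {format_chunks(context_so_far)}
Can you answer the original question now? If yes, answer it.
If no, what single follow-up question would get you the missing information?
Respond with JSON only: {{"can_answer": true/false, "answer": "...", "next_question": "..."}}"""

        try:
            result = json.loads(llm(check_prompt))
        except json.JSONDecodeError:
            break  # Model did not return valid JSON; fall through to the best-effort answer
        if result.get("can_answer"):
            return result.get("answer", "")
        # Use the model's follow-up question; fall back to the original if it gave none
        current_question = result.get("next_question") or question

    # Hop budget exhausted (or parsing failed): answer from whatever context we gathered
    return llm(f"Answer as best you can: {question}\nContext: {format_chunks(context_so_far)}")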

2. Knowledge graph augmentation

Extract entities and relationships from your documents and store them in a graph database (Neo4j, Neptune). For multi-hop queries, traverse the graph first to find connected entities, then use those entities to anchor retrieval. The graph gives you the connection; the vector store gives you the content.
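The traverse-then-anchor idea can be sketched without a graph database. The example below is an illustration under stated assumptions: the triples, the relation names (`owned_by`, `managed_by`), and the extra entities are hypothetical, and a plain dictionary stands in for Neo4j or Neptune.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples, as an
# entity-extraction pass over the documents might produce them.
TRIPLES = [
    ("ProductX", "owned_by", "Platform team"),
    ("Platform team", "managed_by", "Priya Mehta"),
    ("ProductY", "owned_by", "Growth team"),
    ("Growth team", "managed_by", "Sam Okoro"),
]

def build_graph(triples):
    """Index triples as graph[subject][relation] = object."""
    graph = defaultdict(dict)
    for subject, relation, obj in triples:
        graph[subject][relation] = obj
    return graph

def traverse(graph, start, relations):
    """Follow a chain of relations from start.

    Returns the list of entities visited, or None if any hop is missing.
    """
    path = [start]
    node = start
    for relation in relations:
        node = graph.get(node, {}).get(relation)
        if node is None:
            return None
        path.append(node)
    return path

graph = build_graph(TRIPLES)

# The graph supplies the connection: product -> team -> manager.
path = traverse(graph, "ProductX", ["owned_by", "managed_by"])

# The vector store supplies the content: each entity on the path
# becomes an anchor for a retrieval query.
anchored_queries = [f"Background on {entity}" for entity in path]
print(path)
print(anchored_queries)
```

In a production setup the traversal would typically be a graph query rather than a Python loop, and the anchored queries would go to the same vector store used elsewhere in the pipeline.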

3. Query decomposition upfront

Before retrieval, use an LLM to decompose the complex question into a list of atomic sub-questions. Retrieve for each sub-question in parallel. Merge the results. Generate the final answer from the complete merged context.
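A minimal sketch of the decompose, retrieve in parallel, and merge steps follows. The `llm` and `search` callables, the prompt wording, and the one-sub-question-per-line response format are all assumptions for illustration; the stubs at the bottom only exist so the sketch runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor

def decompose(question, llm):
    """Ask the LLM for atomic sub-questions, one per line (assumed format)."""
    prompt = (
        "Break the question below into atomic sub-questions, one per line.\n"
        f"Question: {question}"
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def retrieve_parallel(sub_questions, search, top_k=3):
    """Run one retrieval per sub-question concurrently, then merge without duplicates."""
    workers = max(1, len(sub_questions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as sub_questions
        results = list(pool.map(lambda q: search(q, top_k), sub_questions))

    merged, seen = [], set()
    for chunks in results:
        for chunk in chunks:
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged

# --- Stubs so the sketch is runnable; replace with a real LLM and vector store ---
def fake_llm(prompt):
    return (
        "Which product had the most revenue last quarter?\n"
        "Which team owns each product?\n"
        "Who manages each team?"
    )

FAKE_INDEX = {
    "revenue": "ProductX generated the most revenue last quarter.",
    "owns": "The Platform team owns ProductX.",
    "manages": "Priya Mehta manages the Platform team.",
}

def fake_search(query, top_k=3):
    hits = [text for key, text in FAKE_INDEX.items() if key in query.lower()]
    return hits[:top_k]

sub_questions = decompose("Who manages the team that owns the top-revenue product?", fake_llm)
merged = retrieve_parallel(sub_questions, fake_search)
print(merged)
```

One design caveat: sub-questions written upfront cannot reference answers that are not known yet, so they have to be phrased generically ("Which team owns each product?") rather than specifically ("Which team owns ProductX?"). When the later hops depend heavily on earlier answers, iterative retrieval may be the better fit.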

When to invest in multi-hop

Multi-hop adds latency (multiple LLM calls) and cost. It's worth it when: your knowledge base has deeply interconnected entities (org charts, product hierarchies, dependency graphs), users regularly ask relationship-type questions, and wrong answers have real consequences. For simple FAQ retrieval, naive RAG is sufficient.
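To make the latency trade-off concrete, here is a rough back-of-the-envelope sketch. The per-call timings are placeholder assumptions, not measurements; substitute numbers from your own system. The iterative estimate assumes the loop shape shown earlier: one retrieval plus one LLM check per hop, and one extra LLM call if the hop budget runs out.

```python
def naive_latency(retrieve_s, llm_s):
    """One retrieval, one generation."""
    return retrieve_s + llm_s

def iterative_worst_case(retrieve_s, llm_s, max_hops):
    """Each hop does a retrieval and an LLM check; a final LLM call follows if no hop answers."""
    return max_hops * (retrieve_s + llm_s) + llm_s

def decomposed_latency(retrieve_s, llm_s):
    """One decomposition call, parallel retrievals (roughly one retrieval's time), one answer call."""
    return llm_s + retrieve_s + llm_s

# Placeholder timings in seconds (assumptions, not benchmarks)
RETRIEVE_S = 0.1
LLM_S = 1.0

print("naive:", naive_latency(RETRIEVE_S, LLM_S))
print("iterative, worst case, 4 hops:", iterative_worst_case(RETRIEVE_S, LLM_S, 4))
print("upfront decomposition:", decomposed_latency(RETRIEVE_S, LLM_S))
```

Under these assumptions the LLM calls dominate, so the number of sequential LLM calls is the figure to watch when deciding whether multi-hop is worth it for a given workload.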
