Query Expansion and Rewriting: Improving RAG Retrieval Quality

The Query Mismatch Problem

Your RAG system has perfect documents. Your embeddings are state-of-the-art. Your vector database is properly indexed. Yet users ask questions and get no results.

The problem isn’t your documents or your retrieval infrastructure. It’s the gap between how users ask questions and how relevant information is written.

A user asks: “How do I fix my broken microwave?” Your documents contain: “Troubleshooting common appliance failures” and “Microwave repair procedures.”

The semantic meaning is aligned. But the exact phrasing differs. Vector similarity might miss the connection.

This is the query mismatch problem, and it’s one of the most common causes of RAG retrieval failures. Query expansion and rewriting are powerful techniques to bridge this gap.

Understanding Query Expansion

Query expansion takes a user’s original question and generates variations that might match documents better.

Example: Original query: “How do I authenticate users?”

Expanded queries:

  • “User authentication methods”
  • “Implementing OAuth and JWT”
  • “Login system implementation”
  • “Session management for web apps”
  • “Multi-factor authentication setup”

Each variation targets different document phrasings. If one doesn’t match, another might.

Strategy 1: LLM-Based Query Expansion

Use an LLM to generate query variations.

from openai import OpenAI

def expand_query_with_llm(query, num_variations=5):
    """
    Use LLM to generate query variations.
    """
    client = OpenAI()
    
    prompt = f"""Generate {num_variations} alternative phrasings of this question that would help find the same information. These variations should use different terminology and approaches while maintaining the core intent.

Original question: {query}

Return only the alternative questions, one per line, without numbering or explanation."""
    
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    
    variations = response.choices[0].message.content.strip().split('\n')
    variations = [v.strip() for v in variations if v.strip()]
    
    return [query] + variations  # Include original
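
A quick usage sketch (the variations shown are illustrative, not guaranteed model output):

variations = expand_query_with_llm("How do I authenticate users?", num_variations=3)
# Might return something like:
# ['How do I authenticate users?',
#  'User authentication methods',
#  'Implementing login for a web application',
#  'Verifying user identity']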

Advantages:

  • Captures semantic intent naturally
  • Handles complex questions well
  • Can generate domain-specific variations
  • Works with any document type

Disadvantages:

  • Requires LLM API calls (latency and cost)
  • May generate off-topic variations
  • Needs prompt engineering for quality

Cost Optimization:

def batch_expand_queries(queries, cache_size=100):
    """
    Cache expansions to reduce API calls.
    """
    cache = {}
    
    for query in queries:
        # Check cache first
        if query in cache:
            yield cache[query]
            continue
        
        # Generate expansion
        expanded = expand_query_with_llm(query)
        cache[query] = expanded
        
        # Trim cache if too large
        if len(cache) > cache_size:
            # Evict oldest entry (FIFO; dicts preserve insertion order in Python 3.7+)
            oldest = next(iter(cache))
            del cache[oldest]
        
        yield expanded
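
Since batch_expand_queries is a generator, expansions stream out one query at a time, and repeated queries are served from the cache:

user_queries = ["How do I authenticate users?", "How do I reset a password?"]
for expanded in batch_expand_queries(user_queries):
    print(expanded)  # the original query plus its cached or fresh variations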

Strategy 2: Keyword-Based Expansion

Generate variations by manipulating keywords.

import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

# One-time setup: nltk.download('punkt'); nltk.download('wordnet')

def keyword_expansion(query, max_variations=5):
    """
    Generate variations using synonyms and related terms.
    """
    tokens = word_tokenize(query)
    variations = [query]  # Include original
    
    for token in tokens:
        # Find synonyms (WordNet lookups are lowercase)
        synonyms = set()
        for synset in wordnet.synsets(token.lower()):
            for lemma in synset.lemmas():
                if lemma.name() != token.lower():
                    # WordNet joins multi-word lemmas with underscores
                    synonyms.add(lemma.name().replace('_', ' '))
        
        # Create variations with each synonym
        for synonym in list(synonyms)[:2]:  # Limit to 2 per token
            variation = query.replace(token, synonym)
            if variation not in variations:
                variations.append(variation)
            
            if len(variations) >= max_variations:
                break
        
        if len(variations) >= max_variations:
            break
    
    return variations[:max_variations]

Example:

Original: "How do I authenticate users?"

Variations (idealized; raw WordNet output can be noisier):
- "How do I verify users?"
- "How do I validate users?"
- "How do I confirm user identity?"
- "How do I establish user credentials?"

Advantages:

  • Fast and deterministic
  • No API calls needed
  • Works offline
  • Predictable behavior

Disadvantages:

  • Limited semantic understanding
  • May generate awkward variations
  • Requires quality synonym databases
  • Misses paraphrasing opportunities

Strategy 3: Query Rewriting

Rather than expanding, rewrite the query to match document phrasing better.

def rewrite_query(query, document_sample=None):
    """
    Rewrite query to better match document style.
    
    If document_sample provided, adapt to that style.
    """
    client = OpenAI()
    
    prompt = f"""Rewrite this question to be more likely to match technical documentation. 
Use imperative form, technical terminology, and concrete specifics.
Keep the core intent but change the phrasing.

Original: {query}"""
    
    if document_sample:
        prompt += f"\n\nMatch the style of this documentation:\n{document_sample[:500]}"
    
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,  # Lower temperature for consistency
    )
    
    return response.choices[0].message.content.strip()

Example Rewrites:

User: "What's the best way to store passwords?"
Rewritten: "Implement secure password storage using bcrypt hashing"

User: "How do I make my API faster?"
Rewritten: "Optimize API response time through caching and indexing"

User: "Can you explain machine learning?"
Rewritten: "Explain supervised learning algorithms and training processes"

Strategy 4: Multi-Step Query Decomposition

Break complex questions into sub-queries.

def decompose_query(query):
    """
    Break complex query into simpler sub-queries.
    """
    client = OpenAI()
    
    prompt = f"""Break this complex question into 2-4 simpler sub-questions that together answer the original question.

Question: {query}

Return only the sub-questions, one per line."""
    
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.5,
    )
    
    sub_queries = response.choices[0].message.content.strip().split('\n')
    return [q.strip() for q in sub_queries if q.strip()]

Example:

Original: "How do I build a production RAG system with security and monitoring?"

Decomposed into:
1. "What are the architecture components of a production RAG system?"
2. "How do I implement security in RAG systems?"
3. "How do I set up monitoring and logging for RAG?"
4. "What are best practices for deploying RAG to production?"

When to Use:

  • Questions with multiple parts
  • Questions requiring context from different sections
  • Complex technical questions
  • Questions combining multiple domains
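
Once decomposed, each sub-query is retrieved independently and the results are pooled before generation. A minimal sketch, assuming the decompose_query function above and a vector_db object with the same search(query, top_k) interface used later in this post:

def retrieve_decomposed(query, vector_db, top_k_per_subquery=3):
    """
    Retrieve for each sub-query and pool the unique results.
    """
    sub_queries = decompose_query(query)
    
    seen, pooled = set(), []
    for sub_query in sub_queries:
        for result in vector_db.search(sub_query, top_k=top_k_per_subquery):
            if result['id'] not in seen:
                seen.add(result['id'])
                pooled.append(result)
    
    return pooled  # pass the pooled chunks to the generation step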

Strategy 5: Context-Aware Expansion

Use conversation history to improve expansion.

def context_aware_expansion(query, conversation_history=None):
    """
    Expand query considering previous messages.
    """
    if not conversation_history:
        return expand_query_with_llm(query)
    
    client = OpenAI()
    
    # Build context from history
    context = "Previous conversation:
"
    for msg in conversation_history[-3:]:  # Last 3 messages
        context += f"- {msg}
"
    
    prompt = f"""{context}

Current question: {query}

Generate 3-5 variations of the current question that account for the conversation context. 
These should help find relevant information given what was already discussed."""
    
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,
    )
    
    variations = response.choices[0].message.content.strip().split('\n')
    return [query] + [v.strip() for v in variations if v.strip()]

Example:

User: "Tell me about authentication"
Assistant: "Here are the main authentication methods..."

User: "What about OAuth?"
Without context: Searches for "What about OAuth?"
With context: Searches for:
  - "OAuth authentication implementation"
  - "OAuth vs other authentication methods"
  - "OAuth 2.0 protocol"
  - "Implementing OAuth in production"

Combining Expansion with Retrieval

The power comes from using expansions effectively.

import numpy as np

def retrieve_with_expansion(query, vector_db, num_results=5):
    """
    Retrieve using query expansion.
    """
    # Generate expanded queries
    expanded = expand_query_with_llm(query, num_variations=3)
    
    # Retrieve for each query
    all_results = []
    result_scores = {}
    
    for expanded_query in expanded:
        results = vector_db.search(expanded_query, top_k=num_results)
        
        for result in results:
            doc_id = result['id']
            score = result['score']
            
            # Track results across queries
            if doc_id not in result_scores:
                result_scores[doc_id] = []
            result_scores[doc_id].append(score)
    
    # Rank by average score across queries
    ranked = sorted(
        result_scores.items(),
        key=lambda x: np.mean(x[1]),
        reverse=True
    )
    
    # Return top results
    return [
        vector_db.get(doc_id)
        for doc_id, scores in ranked[:num_results]
    ]

Key Insight: Results that match multiple query variations are likely more relevant than results matching only one.
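
One standard way to act on this insight is reciprocal rank fusion (RRF), which scores each document by how highly it ranks across all the variation result lists. A minimal sketch (k=60 is the conventional constant from the RRF literature):

def reciprocal_rank_fusion(results_per_query, k=60):
    """
    score(doc) = sum over result lists of 1 / (k + rank).
    Documents ranking well in several variation lists float to the top.
    """
    fused = {}
    for results in results_per_query:  # one ranked list per query variation
        for rank, result in enumerate(results, start=1):
            fused[result['id']] = fused.get(result['id'], 0.0) + 1.0 / (k + rank)
    
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)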

Measuring Expansion Quality

def evaluate_expansion_strategy(
    test_queries,          # list of (query, relevant_doc_ids) pairs
    expansion_func,
    retrieval_func,        # retrieve for a single query
    retrieval_func_multi,  # retrieve for a list of queries, merging results
):
    """
    Evaluate how well expansion improves retrieval recall.
    """
    metrics = {
        'recall_without_expansion': 0,
        'recall_with_expansion': 0,
    }
    
    for query, relevant_docs in test_queries:
        # Without expansion
        results_without = retrieval_func(query, top_k=10)
        retrieved_ids = [r['id'] for r in results_without]
        
        found_without = sum(1 for doc_id in relevant_docs if doc_id in retrieved_ids)
        metrics['recall_without_expansion'] += found_without / len(relevant_docs)
        
        # With expansion
        expanded = expansion_func(query)
        results_with = retrieval_func_multi(expanded, top_k=10)
        retrieved_ids = [r['id'] for r in results_with]
        
        found_with = sum(1 for doc_id in relevant_docs if doc_id in retrieved_ids)
        metrics['recall_with_expansion'] += found_with / len(relevant_docs)
    
    # Average across all queries
    n = len(test_queries)
    for key in metrics:
        metrics[key] /= n
    
    return metrics
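
A usage sketch with a toy test set; the document IDs and the vector_db backend are illustrative assumptions:

def search_multi(queries, top_k=10):
    # Naive merge: search each variation, dedupe by id, keep the first top_k
    seen, merged = set(), []
    for q in queries:
        for r in vector_db.search(q, top_k=top_k):
            if r['id'] not in seen:
                seen.add(r['id'])
                merged.append(r)
    return merged[:top_k]

test_queries = [
    ("How do I authenticate users?", {"doc-auth-1", "doc-auth-2"}),  # hypothetical IDs
    ("How do I store passwords?", {"doc-pw-1"}),
]

metrics = evaluate_expansion_strategy(
    test_queries,
    expansion_func=expand_query_with_llm,
    retrieval_func=lambda q, top_k=10: vector_db.search(q, top_k=top_k),
    retrieval_func_multi=search_multi,
)
print(metrics)  # compare recall with vs. without expansion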

Production Considerations

Latency:

  • LLM expansion adds 500ms-2s per query
  • Cache aggressively
  • Consider async expansion (see the sketch after this list)
  • Use faster models for expansion than generation
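
A sketch of taking expansion off the critical path with asyncio and the SDK's AsyncOpenAI client, so several expansions run concurrently (the prompt wording here is illustrative):

import asyncio
from openai import AsyncOpenAI

async def expand_queries_async(queries, num_variations=3):
    """
    Expand several queries concurrently; total latency is bounded by the
    slowest call instead of the sum of all calls.
    """
    client = AsyncOpenAI()
    
    async def expand_one(query):
        prompt = (
            f"Generate {num_variations} alternative phrasings of this "
            f"question, one per line, without numbering.\n\nQuestion: {query}"
        )
        response = await client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
        )
        lines = response.choices[0].message.content.strip().split('\n')
        return [query] + [line.strip() for line in lines if line.strip()]
    
    return await asyncio.gather(*(expand_one(q) for q in queries))

# expansions = asyncio.run(expand_queries_async(["How do I authenticate users?"]))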

Cost:

  • Each query might trigger 3-5 LLM calls
  • Implement query caching
  • Use cheaper models for expansion
  • Monitor expansion effectiveness

Quality:

  • Not all expansions are useful
  • Some may drift from original intent
  • Monitor false positives
  • Adjust number of variations based on results

Practical Implementation Strategy

class AdaptiveQueryExpander:
    """
    Expansion strategy that learns what works.
    """
    def __init__(self):
        self.expansion_cache = {}
        self.effectiveness = {}  # Track which expansions help
    
    def expand(self, query, max_variations=5):
        # Check cache
        if query in self.expansion_cache:
            return self.expansion_cache[query]
        
        # Use multiple strategies
        keyword_vars = keyword_expansion(query, max_variations=2)
        llm_vars = expand_query_with_llm(query, num_variations=3)
        
        # Combine, removing duplicates
        all_vars = list(set(keyword_vars + llm_vars))[:max_variations]
        
        # Cache
        self.expansion_cache[query] = all_vars
        return all_vars
    
    def track_effectiveness(self, query, expansion, retrieved_relevant):
        """
        Track which expansions actually helped.
        """
        if expansion not in self.effectiveness:
            self.effectiveness[expansion] = {'helped': 0, 'total': 0}
        
        self.effectiveness[expansion]['total'] += 1
        if retrieved_relevant:
            self.effectiveness[expansion]['helped'] += 1
    
    def get_best_expansions(self):
        """
        Return most effective expansions.
        """
        return sorted(
            self.effectiveness.items(),
            key=lambda x: x[1]['helped'] / x[1]['total'],
            reverse=True
        )
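
A usage sketch tying the pieces together; vector_db and known_relevant_ids stand in for your own retrieval backend and feedback signal:

expander = AdaptiveQueryExpander()

query = "How do I authenticate users?"
for variation in expander.expand(query):
    results = vector_db.search(variation, top_k=5)
    # Placeholder feedback: an expansion "helped" if it retrieved a known-relevant doc
    helped = any(r['id'] in known_relevant_ids for r in results)
    expander.track_effectiveness(query, variation, helped)

print(expander.get_best_expansions()[:3])  # the expansions that help most often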

Common Pitfalls

Over-expansion: Generating so many variations that noise overwhelms signal. Three to five variations is usually optimal.

Semantic drift: Expansions that change the original intent. “How do I authenticate?” becoming “How do I encrypt?” is too far.

Ignoring cost: LLM expansion is expensive. Cache results and measure ROI.

Not measuring: Using expansion without tracking if it actually improves results.

Recommendations

  1. Start with keyword expansion - Fast, no cost, baseline improvement
  2. Add LLM expansion - Better quality, measure if worth the cost
  3. Use decomposition for complex queries - Breaks down hard problems
  4. Cache everything - Avoid redundant expansions
  5. Measure effectiveness - Track actual improvement in retrieval quality
  6. Combine with re-ranking - Expansion finds candidates, re-ranking ranks them

In Calliope

Calliope’s AI Lab and Chat Studio support:

  • Automatic query expansion
  • Configurable expansion strategies
  • Caching for performance
  • Effectiveness tracking
  • Integration with retrieval pipeline

Query expansion is often the highest-ROI improvement you can make to RAG systems. Users ask in their language. Documents are written in theirs. Bridge the gap.

Improve retrieval in Calliope →
