
Re-ranking Strategies for RAG: Improving Retrieval Relevance Beyond Vector Search


The RAG Retrieval Problem: Quantity Over Quality

You’ve optimized your chunking strategy. Your embeddings are solid. Your vector database is fast. But when users ask questions, your RAG system still returns irrelevant results in the top positions.

The problem isn’t retrieval—it’s ranking.

Vector similarity search retrieves candidate documents, but it doesn’t always rank them correctly. A chunk might be semantically similar to the query without actually answering it. Or multiple chunks might be relevant, but the most useful one ranks third instead of first.

Re-ranking solves this by applying a second-stage ranking model after initial retrieval. While the first stage casts a wide net (find candidates), re-ranking focuses on precision (order by relevance).

This two-stage approach dramatically improves RAG quality without changing your embedding model or vector database.

Why Re-ranking Matters

The cost of bad ranking:

  • Users see irrelevant information first
  • LLM wastes context window on low-quality results
  • Answer quality degrades even with good retrieval
  • System feels unreliable

Real-world impact: A legal research company retrieved relevant documents but ranked them poorly. Lawyers had to scroll past irrelevant results to find the answer. Implementing re-ranking improved answer quality by 40% with zero changes to the vector database.

The Two-Stage Retrieval Pipeline

Modern RAG systems use two stages:

Query
  ↓
[Stage 1: Vector Search]
  → Fast, broad retrieval
  → Returns top 50-100 candidates
  ↓
[Stage 2: Re-ranking]
  → Slower, precise ranking
  → Re-orders top candidates
  → Returns top 5-10 for LLM
  ↓
Context for LLM

Why two stages?

  • Vector search is fast but imprecise (returns false positives)
  • Re-ranking is slower but accurate (precise ordering)
  • Combined: fast enough for real-time, accurate enough for quality

Strategy 1: Cross-Encoder Re-ranking

Cross-encoders are specialized models that score query-document pairs directly.

How they work:

Input: [query, document_chunk]
       ↓
[BERT/DeBERTa encoder]
       ↓
Output: Relevance score (0-1)

Unlike bi-encoders (which embed query and document separately), cross-encoders process the pair together, capturing their relationship.

When to use:

  • When precision is critical (legal, medical, finance)
  • When you have computational budget for re-ranking
  • When documents are moderately sized (< 512 tokens)
  • When latency allows 100-500ms for re-ranking

Implementation:

from sentence_transformers import CrossEncoder

# Initialize re-ranker
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')

# Initial retrieval from vector DB
candidates = vector_db.search(query_embedding, top_k=50)

# Re-rank candidates
pairs = [[query, doc['content']] for doc in candidates]
scores = reranker.predict(pairs)

# Sort by re-ranking scores
ranked = sorted(
    zip(candidates, scores),
    key=lambda x: x[1],
    reverse=True
)

# Return top 5 for LLM
context = [doc for doc, score in ranked[:5]]

Popular cross-encoder models:

  • cross-encoder/ms-marco-MiniLM-L-12-v2 - Fast, good for web search
  • cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 - Multilingual, production-grade
  • cross-encoder/qnli-distilroberta-base - Good for QA
  • cross-encoder/ms-marco-TinyBERT-L-2-v2 - Ultra-fast for mobile

Pros:

  • Highly accurate relevance scoring
  • Works with any embedding model
  • Captures query-document interactions
  • Relatively small models (100MB-400MB)

Cons:

  • Slower than vector search (100-500ms per query)
  • Requires GPU for real-time performance
  • Can’t pre-compute scores (must score at query time)
  • Latency increases with number of candidates
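
One practical mitigation for the latency issues above: the sentence-transformers CrossEncoder scores pairs in batches, so GPU throughput depends heavily on the batch size you pass. A minimal tweak to the earlier snippet (the batch size here is only a starting point to tune for your hardware):

# Larger batches amortize per-call overhead on a GPU; tune for your hardware and memory
scores = reranker.predict(pairs, batch_size=64, show_progress_bar=False)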

Strategy 2: LLM-Based Re-ranking

Use an LLM to score or rank retrieved documents.

Approach 1: Direct scoring

# Ask LLM to score each document
prompt = f"""
Given this query: {query}

Rate how relevant each document is (0-10):

Document 1: {doc1}
Score: [LLM outputs score]

Document 2: {doc2}
Score: [LLM outputs score]
"""

response = llm(prompt)
# Parse scores and re-rank

Approach 2: Pairwise comparison

# Compare documents pairwise (like chess tournament ranking)
prompt = f"""
Query: {query}

Which document is more relevant?
A) {doc_a}
B) {doc_b}

Answer: [A or B]
"""

# Run pairwise comparisons
# Build ranking from results

Approach 3: List-wise ranking

# Ask LLM to order all documents
prompt = f"""
Query: {query}

Rank these documents by relevance (most to least relevant):

1. {doc1}
2. {doc2}
3. {doc3}
...

Ranking: [LLM orders them]
"""

When to use:

  • When you have fast LLM access (cached models, local inference)
  • When documents are short (< 200 tokens)
  • When you can afford 1-5 second latency
  • When you want semantic understanding (not just pattern matching)

Pros:

  • Uses your existing LLM (no new models to deploy)
  • Can incorporate semantic understanding
  • Works with any document type
  • Can explain reasoning

Cons:

  • Much slower than cross-encoders (1-5 seconds)
  • Expensive (token cost per query)
  • Inconsistent scoring (LLMs aren’t deterministic)
  • Risk of hallucination in scoring

Strategy 3: Hybrid Re-ranking (Combining Signals)

Combine multiple scoring signals for robust ranking.

Multi-signal approach:

# Compute a combined score for each candidate
final_scores = {}
for doc in candidates:
    # Signal 1: Vector similarity
    vector_score = cosine_similarity(query_embedding, doc['embedding'])

    # Signal 2: BM25 keyword match
    bm25_score = bm25.score(query, doc['content'])

    # Signal 3: Cross-encoder relevance
    cross_encoder_score = reranker.predict([[query, doc['content']]])[0]

    # Signal 4: Document metadata
    recency_score = 1 - min(doc['days_old'] / max_days, 1)  # prefer recent documents
    authority_score = doc['metadata']['authority_score']

    # Combine signals with weights (normalize each signal to [0, 1] first,
    # so no signal dominates purely because of its scale)
    final_scores[doc['id']] = (
        0.2 * vector_score +
        0.2 * bm25_score +
        0.4 * cross_encoder_score +
        0.1 * recency_score +
        0.1 * authority_score
    )

# Rank by combined score
ranked = sorted(candidates, key=lambda d: final_scores[d['id']], reverse=True)

Signal selection:

  • Vector similarity: Semantic relevance (always include)
  • BM25: Keyword matching (important for factual queries)
  • Cross-encoder: Precise relevance (if budget allows)
  • Metadata: Freshness, authority, source quality
  • User signals: Click-through, dwell time (if available)
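
If you do not already have a keyword index, the BM25 signal can be computed in-process. A minimal sketch using the rank_bm25 package (whitespace tokenization and min-max normalization are simplifications):

from rank_bm25 import BM25Okapi

# Build a BM25 index over the candidate chunks
tokenized_docs = [doc['content'].lower().split() for doc in candidates]
bm25 = BM25Okapi(tokenized_docs)

# Score all candidates against the query in one call
bm25_scores = bm25.get_scores(query.lower().split())

# Min-max normalize to [0, 1] so the signal is comparable to cosine similarity
lo, hi = min(bm25_scores), max(bm25_scores)
bm25_scores = [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in bm25_scores]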

When to use:

  • When you need robust ranking across diverse queries
  • When documents have rich metadata
  • When combining multiple retrieval methods
  • In production systems with mature analytics

Pros:

  • Robust to individual signal failures
  • Can tune weights for your use case
  • Combines strengths of multiple approaches
  • Explainable (see which signals contributed)

Cons:

  • More complex to implement and tune
  • Requires tuning weights (data-dependent)
  • Slower than single-stage ranking
  • Requires monitoring multiple signals

Strategy 4: Learning-to-Rank (LTR)

Train a ranking model on your domain-specific data.

The approach:

1. Collect training data:
   - Queries
   - Retrieved documents
   - Human relevance judgments (0-5 scale)

2. Extract features for each (query, document) pair:
   - Vector similarity
   - BM25 score
   - Query length
   - Document length
   - Metadata features
   - Cross-encoder score

3. Train ranking model:
   - Input: Feature vectors
   - Output: Relevance score
   - Model: LambdaMART, XGBoost, neural network

4. Deploy trained model:
   - Extract features at query time
   - Run trained model
   - Use predicted scores for ranking

Example with LambdaMART (via LightGBM's LGBMRanker, which implements the LambdaMART-style lambdarank objective):

import lightgbm as lgb

# Prepare training data: one feature vector per (query, document) pair
X_train = extract_features(queries, documents)
y_train = relevance_labels          # graded relevance judgments, e.g. 0-5
group_train = docs_per_query        # number of candidate documents per query, in order

# Train model
ranker = lgb.LGBMRanker(
    objective='lambdarank',
    metric='ndcg',
    num_leaves=31,
    n_estimators=100
)
ranker.fit(X_train, y_train, group=group_train)

# Use at inference time
X_test = extract_features(query, candidates)
scores = ranker.predict(X_test)
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)

When to use:

  • When you have 1000+ labeled examples
  • When you need optimal ranking for your domain
  • When you can afford to maintain a model
  • In mature production systems

Pros:

  • Optimal ranking for your specific domain
  • Learns complex feature interactions
  • Can incorporate any feature type
  • Proven effective in information retrieval

Cons:

  • Requires significant labeled training data
  • Complex to implement and maintain
  • Needs periodic retraining
  • Slower inference than simpler methods

Practical Implementation Guide

Step 1: Choose your re-ranking strategy

Scenario | Best Strategy
---------|--------------
Need quick wins | Cross-encoder (MiniLM)
Have fast LLM | LLM-based scoring
Multiple signals | Hybrid re-ranking
Mature system | Learning-to-rank

Step 2: Measure baseline performance

Before implementing re-ranking, establish metrics:

import numpy as np
from sklearn.metrics import ndcg_score, average_precision_score

def evaluate_ranking(relevance_labels, predicted_scores, k=5):
    # relevance_labels / predicted_scores: one array per query, aligned by document
    mrr, ndcg, prec, ap = [], [], [], []
    for labels, scores in zip(relevance_labels, predicted_scores):
        labels, scores = np.asarray(labels), np.asarray(scores)
        sorted_labels = labels[np.argsort(scores)[::-1]]  # labels ordered by predicted rank

        # Mean Reciprocal Rank (assumes every query has at least one relevant document)
        mrr.append(1.0 / (np.argmax(sorted_labels > 0) + 1))

        # Normalized Discounted Cumulative Gain (NDCG@k)
        ndcg.append(ndcg_score([labels], [scores], k=k))

        # Precision@k
        prec.append(np.mean(sorted_labels[:k] > 0))

        # Average Precision (binary relevance)
        ap.append(average_precision_score(labels > 0, scores))

    return {'MRR': np.mean(mrr), f'NDCG@{k}': np.mean(ndcg),
            f'P@{k}': np.mean(prec), 'MAP': np.mean(ap)}

Step 3: Implement re-ranking

class RAGWithReranking:
    def __init__(self, vector_db, reranker, top_k_initial=50, top_k_final=5):
        self.vector_db = vector_db
        self.reranker = reranker
        self.top_k_initial = top_k_initial
        self.top_k_final = top_k_final
    
    def retrieve(self, query):
        # Stage 1: Vector search (embed() is whatever embedding function indexed your documents)
        query_embedding = embed(query)
        candidates = self.vector_db.search(query_embedding, self.top_k_initial)
        
        # Stage 2: Re-ranking with a cross-encoder over (query, document) pairs
        pairs = [[query, doc['content']] for doc in candidates]
        scores = self.reranker.predict(pairs)
        ranked = sorted(
            zip(candidates, scores),
            key=lambda x: x[1],
            reverse=True
        )
        
        # Return top results for the LLM context
        return [doc for doc, score in ranked[:self.top_k_final]]

Step 4: Measure improvement

# Compare before and after (same relevance labels, each system's predicted scores)
baseline_metrics = evaluate_ranking(relevance_labels, baseline_scores)
reranked_metrics = evaluate_ranking(relevance_labels, reranked_scores)

improvement = {
    k: (reranked_metrics[k] - baseline_metrics[k]) / baseline_metrics[k] * 100
    for k in baseline_metrics.keys()
}

print(f"NDCG@5 improvement: {improvement['NDCG@5']:.1f}%")
print(f"MRR improvement: {improvement['MRR']:.1f}%")

Re-ranking Performance Comparison

Strategy | Speed | Accuracy | Cost | Complexity
---------|-------|----------|------|-----------
Cross-encoder (MiniLM) | 100-200ms | High | Low | Low
Cross-encoder (Large) | 200-500ms | Very High | Low | Low
LLM-based | 1-5s | High | Medium | Medium
Hybrid (2-3 signals) | 200-400ms | High | Low | Medium
Learning-to-rank | 50-100ms | Very High | Medium | High

Common Re-ranking Mistakes

Mistake 1: Re-ranking everything. Re-ranking 10,000 documents is slow and expensive. Always retrieve a smaller candidate set first (50-100), then re-rank.

Mistake 2: Using the wrong cross-encoder. Different cross-encoders are trained on different tasks (MS MARCO, QNLI, etc.). Match the model to your use case.

Mistake 3: No baseline measurement. If you implement re-ranking blindly without measuring improvement, you won't know whether it actually helped.

Mistake 4: Ignoring latency. Re-ranking adds latency. If your target is <500ms total, cross-encoder re-ranking might not fit. Measure end-to-end.

Mistake 5: Static weights. In hybrid re-ranking, don't hardcode weights. Tune them on a validation set for your specific domain (see the sketch below).
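
A minimal sketch of that tuning step, assuming per-query signal matrices pre-computed on a held-out validation set and the evaluate_ranking helper from Step 2 (validation_signal_matrices and validation_labels are hypothetical names):

import itertools

import numpy as np

# Candidate weights for (vector, bm25, cross_encoder); a coarse grid is crude but transparent
grid = [
    w for w in itertools.product(np.linspace(0, 1, 11), repeat=3)
    if abs(sum(w) - 1.0) < 1e-6
]

best_weights, best_ndcg = None, -1.0
for weights in grid:
    # Each validation query has an (n_docs x 3) matrix of pre-computed signal scores
    combined = [signals @ np.array(weights) for signals in validation_signal_matrices]
    metrics = evaluate_ranking(validation_labels, combined)
    if metrics['NDCG@5'] > best_ndcg:
        best_weights, best_ndcg = tuple(round(w, 1) for w in weights), metrics['NDCG@5']

print(f"Best weights {best_weights} with NDCG@5 = {best_ndcg:.3f}")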

Re-ranking in Calliope

Calliope integrates re-ranking seamlessly:

Chat Studio:

  • Built-in cross-encoder re-ranking
  • Configurable candidate pool size
  • Automatic metric tracking
  • A/B testing support

Langflow:

  • Visual re-ranking nodes
  • Easy integration with custom models
  • Chainable ranking strategies
  • Debug scoring at each stage

AI Lab:

  • Experiment with different re-rankers
  • Evaluate on your documents
  • Compare ranking strategies
  • Iterate before production

The Bottom Line

Re-ranking is the highest-ROI improvement you can make to RAG systems. It requires minimal changes to existing infrastructure but dramatically improves result quality.

The best approach:

  1. Start with vector search baseline
  2. Add cross-encoder re-ranking (quick win)
  3. Measure improvement (should see 20-40% quality gain)
  4. Iterate to hybrid or LLM-based if needed
  5. Graduate to learning-to-rank as you scale

Re-ranking separates production RAG systems from toy prototypes. Implement it early.

Improve your RAG with re-ranking in Calliope →
