
Re-ranking Strategies for RAG: Improving Retrieval Relevance Beyond Vector Search


The RAG Retrieval Problem: Quantity Over Quality

You’ve optimized your chunking strategy. Your embeddings are solid. Your vector database is fast. But when users ask questions, your RAG system still returns irrelevant results in the top positions.

The problem isn’t retrieval—it’s ranking.

Vector similarity search retrieves candidate documents, but it doesn’t always rank them correctly. A chunk might be semantically similar to the query without actually answering it. Or multiple chunks might be relevant, but the most useful one ranks third instead of first.

Re-ranking solves this by applying a second-stage ranking model after initial retrieval. While the first stage casts a wide net (find candidates), re-ranking focuses on precision (order by relevance).

This two-stage approach dramatically improves RAG quality without changing your embedding model or vector database.

Why Re-ranking Matters

The cost of bad ranking:

  • Users see irrelevant information first
  • LLM wastes context window on low-quality results
  • Answer quality degrades even with good retrieval
  • System feels unreliable

Real-world impact: A legal research company retrieved relevant documents but ranked them poorly. Lawyers had to scroll past irrelevant results to find the answer. Implementing re-ranking improved answer quality by 40% with zero changes to the vector database.

The Two-Stage Retrieval Pipeline

Modern RAG systems use two stages:

Query
  ↓
[Stage 1: Vector Search]
  → Fast, broad retrieval
  → Returns top 50-100 candidates
  ↓
[Stage 2: Re-ranking]
  → Slower, precise ranking
  → Re-orders top candidates
  → Returns top 5-10 for LLM
  ↓
Context for LLM

Why two stages?

  • Vector search is fast but imprecise (returns false positives)
  • Re-ranking is slower but accurate (precise ordering)
  • Combined: fast enough for real-time, accurate enough for quality

Strategy 1: Cross-Encoder Re-ranking

Cross-encoders are specialized models that score query-document pairs directly.

How they work:

Input: [query, document_chunk]
       ↓
[BERT/DeBERTa encoder]
       ↓
Output: Relevance score (0-1)

Unlike bi-encoders (which embed query and document separately), cross-encoders process the pair together, capturing their relationship.

When to use:

  • When precision is critical (legal, medical, finance)
  • When you have computational budget for re-ranking
  • When documents are moderately sized (< 512 tokens)
  • When latency allows 100-500ms for re-ranking

Implementation:

from sentence_transformers import CrossEncoder

# Initialize re-ranker
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')

# Initial retrieval from vector DB
candidates = vector_db.search(query_embedding, top_k=50)

# Re-rank candidates
pairs = [[query, doc['content']] for doc in candidates]
scores = reranker.predict(pairs)

# Sort by re-ranking scores
ranked = sorted(
    zip(candidates, scores),
    key=lambda x: x[1],
    reverse=True
)

# Return top 5 for LLM
context = [doc for doc, score in ranked[:5]]

Popular cross-encoder models:

  • cross-encoder/ms-marco-MiniLM-L-12-v2 - Fast, good for web search
  • cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 - Multilingual, production-grade
  • cross-encoder/qnli-distilroberta-base - Good for QA
  • cross-encoder/ms-marco-TinyBERT-L-2-v2 - Ultra-fast for mobile

Pros:

  • Highly accurate relevance scoring
  • Works with any embedding model
  • Captures query-document interactions
  • Relatively small models (100MB-400MB)

Cons:

  • Slower than vector search (100-500ms per query)
  • Requires GPU for real-time performance
  • Can’t pre-compute scores (must score at query time)
  • Latency increases with number of candidates
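
One practical mitigation for the latency issues above: the sentence-transformers CrossEncoder scores pairs in batches, so GPU throughput depends heavily on the batch size you pass. A minimal tweak to the earlier snippet (the batch size here is only a starting point to tune for your hardware):

# Larger batches amortize per-call overhead on a GPU; tune for your hardware and memory
scores = reranker.predict(pairs, batch_size=64, show_progress_bar=False)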

Strategy 2: LLM-Based Re-ranking

Use an LLM to score or rank retrieved documents.

Approach 1: Direct scoring

# Ask LLM to score each document
prompt = f"""
Given this query: {query}

Rate how relevant each document is (0-10):

Document 1: {doc1}
Score: [LLM outputs score]

Document 2: {doc2}
Score: [LLM outputs score]
"""

response = llm(prompt)
# Parse scores and re-rank

Approach 2: Pairwise comparison

# Compare documents pairwise (like chess tournament ranking)
prompt = f"""
Query: {query}

Which document is more relevant?
A) {doc_a}
B) {doc_b}

Answer: [A or B]
"""

# Run pairwise comparisons
# Build ranking from results

Approach 3: List-wise ranking

# Ask LLM to order all documents
prompt = f"""
Query: {query}

Rank these documents by relevance (most to least relevant):

1. {doc1}
2. {doc2}
3. {doc3}
...

Ranking: [LLM orders them]
"""

When to use:

  • When you have fast LLM access (cached models, local inference)
  • When documents are short (< 200 tokens)
  • When you can afford 1-5 second latency
  • When you want semantic understanding (not just pattern matching)

Pros:

  • Uses your existing LLM (no new models to deploy)
  • Can incorporate semantic understanding
  • Works with any document type
  • Can explain reasoning

Cons:

  • Much slower than cross-encoders (1-5 seconds)
  • Expensive (token cost per query)
  • Inconsistent scoring (LLMs aren’t deterministic)
  • Risk of hallucination in scoring

Strategy 3: Hybrid Re-ranking (Combining Signals)

Combine multiple scoring signals for robust ranking.

Multi-signal approach:

# Compute a combined score for each candidate
final_scores = {}
for doc in candidates:
    # Signal 1: Vector similarity
    vector_score = cosine_similarity(query_embedding, doc['embedding'])

    # Signal 2: BM25 keyword match
    bm25_score = bm25.score(query, doc['content'])

    # Signal 3: Cross-encoder relevance
    cross_encoder_score = reranker.predict([[query, doc['content']]])[0]

    # Signal 4: Document metadata
    recency_score = 1 - min(doc['days_old'] / max_days, 1)  # prefer recent documents
    authority_score = doc['metadata']['authority_score']

    # Combine signals with weights (normalize each signal to [0, 1] first,
    # so no signal dominates purely because of its scale)
    final_scores[doc['id']] = (
        0.2 * vector_score +
        0.2 * bm25_score +
        0.4 * cross_encoder_score +
        0.1 * recency_score +
        0.1 * authority_score
    )

# Rank by combined score
ranked = sorted(candidates, key=lambda d: final_scores[d['id']], reverse=True)

Signal selection:

  • Vector similarity: Semantic relevance (always include)
  • BM25: Keyword matching (important for factual queries)
  • Cross-encoder: Precise relevance (if budget allows)
  • Metadata: Freshness, authority, source quality
  • User signals: Click-through, dwell time (if available)
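
If you do not already have a keyword index, the BM25 signal can be computed in-process. A minimal sketch using the rank_bm25 package (whitespace tokenization and min-max normalization are simplifications):

from rank_bm25 import BM25Okapi

# Build a BM25 index over the candidate chunks
tokenized_docs = [doc['content'].lower().split() for doc in candidates]
bm25 = BM25Okapi(tokenized_docs)

# Score all candidates against the query in one call
bm25_scores = bm25.get_scores(query.lower().split())

# Min-max normalize to [0, 1] so the signal is comparable to cosine similarity
lo, hi = min(bm25_scores), max(bm25_scores)
bm25_scores = [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in bm25_scores]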

When to use:

  • When you need robust ranking across diverse queries
  • When documents have rich metadata
  • When combining multiple retrieval methods
  • In production systems with mature analytics

Pros:

  • Robust to individual signal failures
  • Can tune weights for your use case
  • Combines strengths of multiple approaches
  • Explainable (see which signals contributed)

Cons:

  • More complex to implement and tune
  • Requires tuning weights (data-dependent)
  • Slower than single-stage ranking
  • Requires monitoring multiple signals

Strategy 4: Learning-to-Rank (LTR)

Train a ranking model on your domain-specific data.

The approach:

1. Collect training data:
   - Queries
   - Retrieved documents
   - Human relevance judgments (0-5 scale)

2. Extract features for each (query, document) pair:
   - Vector similarity
   - BM25 score
   - Query length
   - Document length
   - Metadata features
   - Cross-encoder score

3. Train ranking model:
   - Input: Feature vectors
   - Output: Relevance score
   - Model: LambdaMART, XGBoost, neural network

4. Deploy trained model:
   - Extract features at query time
   - Run trained model
   - Use predicted scores for ranking

Example with LambdaMART (via LightGBM's LGBMRanker, which implements the LambdaMART-style lambdarank objective):

import lightgbm as lgb

# Prepare training data: one feature vector per (query, document) pair
X_train = extract_features(queries, documents)
y_train = relevance_labels          # graded relevance judgments, e.g. 0-5
group_train = docs_per_query        # number of candidate documents per query, in order

# Train model
ranker = lgb.LGBMRanker(
    objective='lambdarank',
    metric='ndcg',
    num_leaves=31,
    n_estimators=100
)
ranker.fit(X_train, y_train, group=group_train)

# Use at inference time
X_test = extract_features(query, candidates)
scores = ranker.predict(X_test)
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)

When to use:

  • When you have 1000+ labeled examples
  • When you need optimal ranking for your domain
  • When you can afford to maintain a model
  • In mature production systems

Pros:

  • Optimal ranking for your specific domain
  • Learns complex feature interactions
  • Can incorporate any feature type
  • Proven effective in information retrieval

Cons:

  • Requires significant labeled training data
  • Complex to implement and maintain
  • Needs periodic retraining
  • Slower inference than simpler methods

Practical Implementation Guide

Step 1: Choose your re-ranking strategy

Scenario | Best Strategy
---------|--------------
Need quick wins | Cross-encoder (MiniLM)
Have fast LLM | LLM-based scoring
Multiple signals | Hybrid re-ranking
Mature system | Learning-to-rank

Step 2: Measure baseline performance

Before implementing re-ranking, establish metrics:

import numpy as np
from sklearn.metrics import ndcg_score, average_precision_score

def evaluate_ranking(relevance_labels, predicted_scores, k=5):
    # relevance_labels / predicted_scores: one array per query, aligned by document
    mrr, ndcg, prec, ap = [], [], [], []
    for labels, scores in zip(relevance_labels, predicted_scores):
        labels, scores = np.asarray(labels), np.asarray(scores)
        sorted_labels = labels[np.argsort(scores)[::-1]]  # labels ordered by predicted rank

        # Mean Reciprocal Rank (assumes every query has at least one relevant document)
        mrr.append(1.0 / (np.argmax(sorted_labels > 0) + 1))

        # Normalized Discounted Cumulative Gain (NDCG@k)
        ndcg.append(ndcg_score([labels], [scores], k=k))

        # Precision@k
        prec.append(np.mean(sorted_labels[:k] > 0))

        # Average Precision (binary relevance)
        ap.append(average_precision_score(labels > 0, scores))

    return {'MRR': np.mean(mrr), f'NDCG@{k}': np.mean(ndcg),
            f'P@{k}': np.mean(prec), 'MAP': np.mean(ap)}

Step 3: Implement re-ranking

class RAGWithReranking:
    def __init__(self, vector_db, reranker, top_k_initial=50, top_k_final=5):
        self.vector_db = vector_db
        self.reranker = reranker
        self.top_k_initial = top_k_initial
        self.top_k_final = top_k_final
    
    def retrieve(self, query):
        # Stage 1: Vector search (embed() is whatever embedding function indexed your documents)
        query_embedding = embed(query)
        candidates = self.vector_db.search(query_embedding, self.top_k_initial)
        
        # Stage 2: Re-ranking with a cross-encoder over (query, document) pairs
        pairs = [[query, doc['content']] for doc in candidates]
        scores = self.reranker.predict(pairs)
        ranked = sorted(
            zip(candidates, scores),
            key=lambda x: x[1],
            reverse=True
        )
        
        # Return top results for the LLM context
        return [doc for doc, score in ranked[:self.top_k_final]]

Step 4: Measure improvement

# Compare before and after (same relevance labels, each system's predicted scores)
baseline_metrics = evaluate_ranking(relevance_labels, baseline_scores)
reranked_metrics = evaluate_ranking(relevance_labels, reranked_scores)

improvement = {
    k: (reranked_metrics[k] - baseline_metrics[k]) / baseline_metrics[k] * 100
    for k in baseline_metrics.keys()
}

print(f"NDCG@5 improvement: {improvement['NDCG@5']:.1f}%")
print(f"MRR improvement: {improvement['MRR']:.1f}%")

Re-ranking Performance Comparison

Strategy | Speed | Accuracy | Cost | Complexity
---------|-------|----------|------|-----------
Cross-encoder (MiniLM) | 100-200ms | High | Low | Low
Cross-encoder (Large) | 200-500ms | Very High | Low | Low
LLM-based | 1-5s | High | Medium | Medium
Hybrid (2-3 signals) | 200-400ms | High | Low | Medium
Learning-to-rank | 50-100ms | Very High | Medium | High

Common Re-ranking Mistakes

Mistake 1: Re-ranking everything. Re-ranking 10,000 documents is slow and expensive. Always retrieve a smaller candidate set first (50-100), then re-rank.

Mistake 2: Using the wrong cross-encoder. Different cross-encoders are trained on different tasks (MS MARCO, QNLI, etc.). Match the model to your use case.

Mistake 3: No baseline measurement. If you implement re-ranking blindly without measuring improvement, you won't know whether it actually helped.

Mistake 4: Ignoring latency. Re-ranking adds latency. If your target is <500ms total, cross-encoder re-ranking might not fit. Measure end-to-end.

Mistake 5: Static weights. In hybrid re-ranking, don't hardcode weights. Tune them on a validation set for your specific domain (see the sketch below).
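
A minimal sketch of that tuning step, assuming per-query signal matrices pre-computed on a held-out validation set and the evaluate_ranking helper from Step 2 (validation_signal_matrices and validation_labels are hypothetical names):

import itertools

import numpy as np

# Candidate weights for (vector, bm25, cross_encoder); a coarse grid is crude but transparent
grid = [
    w for w in itertools.product(np.linspace(0, 1, 11), repeat=3)
    if abs(sum(w) - 1.0) < 1e-6
]

best_weights, best_ndcg = None, -1.0
for weights in grid:
    # Each validation query has an (n_docs x 3) matrix of pre-computed signal scores
    combined = [signals @ np.array(weights) for signals in validation_signal_matrices]
    metrics = evaluate_ranking(validation_labels, combined)
    if metrics['NDCG@5'] > best_ndcg:
        best_weights, best_ndcg = tuple(round(w, 1) for w in weights), metrics['NDCG@5']

print(f"Best weights {best_weights} with NDCG@5 = {best_ndcg:.3f}")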

Re-ranking in Calliope

Calliope integrates re-ranking seamlessly:

Chat Studio:

  • Built-in cross-encoder re-ranking
  • Configurable candidate pool size
  • Automatic metric tracking
  • A/B testing support

Langflow:

  • Visual re-ranking nodes
  • Easy integration with custom models
  • Chainable ranking strategies
  • Debug scoring at each stage

AI Lab:

  • Experiment with different re-rankers
  • Evaluate on your documents
  • Compare ranking strategies
  • Iterate before production

The Bottom Line

Re-ranking is the highest-ROI improvement you can make to RAG systems. It requires minimal changes to existing infrastructure but dramatically improves result quality.

The best approach:

  1. Start with vector search baseline
  2. Add cross-encoder re-ranking (quick win)
  3. Measure improvement (should see 20-40% quality gain)
  4. Iterate to hybrid or LLM-based if needed
  5. Graduate to learning-to-rank as you scale

Re-ranking separates production RAG systems from toy prototypes. Implement it early.

Improve your RAG with re-ranking in Calliope →
