Re-ranking Strategies for RAG: Improving Retrieval Quality with Cross-Encoders

You’re Retrieving the Right Documents—But in the Wrong Order

Your RAG system retrieves 10 documents from the vector database. The first result is tangentially related. The third result is exactly what you need. By the time the LLM processes all 10 results, it’s confused and picks the wrong information.

This is the re-ranking problem.

Standard RAG retrieval uses embedding-based similarity: your query gets embedded, compared to document embeddings, and the most similar ones are returned. But embedding similarity doesn’t always match relevance. Two documents can have similar embeddings while only one actually answers the question.

Re-ranking solves this by applying a second, more sophisticated ranking pass after initial retrieval. Instead of relying solely on embedding similarity, re-ranking uses specialized models designed to understand query-document relevance deeply.

The result: dramatically improved retrieval quality without changing your vector database or embedding model.

Why Basic Embedding Retrieval Fails

The embedding similarity problem:

Embeddings measure semantic similarity in vector space. Two documents can be semantically similar but have different relevance to your query.

Example:

Query: "How do we implement authentication in the API?"

Retrieved documents (by embedding similarity):
1. "Security is important. Always use encryption..." (semantic match, but generic)
2. "Our API uses REST endpoints..." (somewhat related)
3. "JWT tokens provide stateless authentication..." (EXACT answer, ranked 3rd)
4. "We discuss security throughout this guide..." (semantic match, not specific)
5. "Authentication methods vary by framework..." (relevant but general)

The embedding model sees “security” in result 1, a term that sits close to “authentication” in embedding space, and gives it high similarity. But result 1 is generic security advice, not API authentication implementation.

Result 3 contains the specific answer but ranks lower because it uses different phrasing.

Why this happens:

  • Embeddings capture broad semantic meaning, not specific relevance
  • Query phrasing doesn’t perfectly match document content
  • Multiple documents discuss related topics with different specificity
  • Embedding models are trained on general text, not your domain

The Re-ranking Solution

Re-ranking applies a more powerful model to re-score retrieved documents.

The architecture:

Query: "How do we implement authentication?"
         ↓
[Embedding-based retrieval]
Retrieve top-k documents (k=100 for re-ranking)
         ↓
[Re-ranking model]
Score each document for relevance to query
         ↓
[Re-order results]
Return top-n documents (n=5 for LLM)
         ↓
LLM processes re-ranked results

Why re-ranking works:

  1. Retrieves more candidates: Get top-100 from embeddings (cheap)
  2. Scores deeply: Use powerful model on top-100 (more expensive, but only 100 docs)
  3. Returns best results: Give LLM the top-5 re-ranked documents

This two-stage approach balances cost (fast embedding retrieval) with quality (accurate re-ranking).
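
Sketched as code, the pattern looks like this; `vector_db` and `reranker` are stand-ins for your own retrieval client and re-ranking model, and the field names are illustrative:

def retrieve_and_rerank(query, vector_db, reranker, top_k=100, top_n=5):
    # Stage 1: cheap, broad recall from the vector database
    candidates = vector_db.search(query, top_k=top_k)

    # Stage 2: deep relevance scoring on just those candidates
    scores = reranker.predict([[query, doc['content']] for doc in candidates])

    # Re-order by score and keep only what the LLM will actually see
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]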

Strategy 1: Cross-Encoder Re-ranking

Cross-encoders are specialized models trained to score query-document pairs.

How they work:

Traditional embedding approach:
query_embedding = embed(query)
doc_embedding = embed(document)
similarity = cosine(query_embedding, doc_embedding)

Cross-encoder approach:
relevance_score = cross_encoder(query, document)

The key difference: cross-encoders process the query and document together, allowing them to capture interactions between them.
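
For a concrete sense of the contrast, here is a minimal sketch using sentence-transformers; the model names are illustrative defaults, not requirements:

from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')

query = "How do we implement authentication in the API?"
document = "JWT tokens provide stateless authentication..."

# Bi-encoder: encode query and document independently, then compare vectors
similarity = util.cos_sim(bi_encoder.encode(query), bi_encoder.encode(document)).item()

# Cross-encoder: read query and document together and output one relevance score
relevance = cross_encoder.predict([[query, document]])[0]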

Popular cross-encoder models:

  1. cross-encoder/ms-marco-MiniLM-L-12-v2 (Recommended for most use cases)

    • Fast: ~2-5ms per document
    • Accurate: Trained on 500K MS MARCO query-document pairs
    • Lightweight: 33M parameters, runs on CPU
    • Best for: General-purpose retrieval
  2. cross-encoder/ms-marco-TinyBERT-L-2-v2 (For speed-critical systems)

    • Ultra-fast: <1ms per document
    • Smaller: 14M parameters
    • Slight accuracy trade-off
    • Best for: Real-time systems with strict latency budgets
  3. cross-encoder/qnli-distilroberta-base (For semantic understanding)

    • Balanced: ~3-8ms per document
    • Good accuracy on semantic tasks
    • Best for: Document understanding, complex queries
  4. Proprietary models (LLMs as re-rankers)

    • OpenAI GPT-4, Claude, etc.
    • Highest accuracy, highest cost
    • Best for: Critical systems where accuracy > cost

Implementation example:

from sentence_transformers import CrossEncoder

# Load cross-encoder model (the Hugging Face id includes the cross-encoder/ prefix)
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')

# Initial retrieval with embeddings
query = "How do we implement authentication?"
retrieved_docs = vector_db.search(query, top_k=100)

# Re-rank using cross-encoder: score each (query, document) pair
scores = cross_encoder.predict([
    [query, doc['content']] for doc in retrieved_docs
])

# Re-order by cross-encoder scores (sort on the score only, so tied
# scores never force a comparison between document dicts)
ranked_docs = [
    doc for _, doc in sorted(
        zip(scores, retrieved_docs),
        key=lambda pair: pair[0],
        reverse=True
    )
]

# Return top-5 to LLM
top_results = ranked_docs[:5]

When to use cross-encoders:

  • You have 50-100 documents to re-rank
  • You need fast turnaround (10-50ms acceptable)
  • You want to improve retrieval without changing embeddings
  • You need explainable scores (documents get numeric relevance)

Pros:

  • Significantly better accuracy than embeddings alone
  • Relatively fast (milliseconds per document)
  • Can run on CPU or GPU
  • Easy to integrate into existing RAG pipelines
  • Transparent scoring

Cons:

  • More expensive than embedding similarity
  • Slower than pure embedding retrieval
  • Requires additional model inference
  • Not suitable for extremely large result sets (1000+ documents)

Strategy 2: LLM-Based Re-ranking

Use a language model as a re-ranker.

How it works:

# Use an LLM to score relevance; llm() stands in for a call to your LLM client
prompt = f"""
Given the query: "{query}"

Rate the relevance of this document on a scale of 1-10:
"{document}"

Respond with only the number.
"""

relevance_score = int(llm(prompt).strip())

Advantages:

  • Highest accuracy (LLMs understand nuance)
  • Can explain why a document is relevant
  • Works well for complex, domain-specific queries
  • Can handle multi-hop reasoning

Disadvantages:

  • Very expensive (LLM API calls)
  • Slow (100+ documents = 100+ API calls)
  • Rate-limited by LLM provider
  • Overkill for most retrieval tasks

When to use:

  • Accuracy is critical and cost isn’t a constraint
  • Queries are complex and require reasoning
  • You need explanations for ranking decisions
  • Small result sets (< 20 documents)

Cost-effective approach: Use LLM re-ranking on top-10 from cross-encoder, not on initial 100 results.

# Stage 1: Fast embedding retrieval
embedding_results = vector_db.search(query, top_k=100)

# Stage 2: Cross-encoder re-ranking
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
cross_encoder_scores = cross_encoder.predict([
    [query, doc['content']] for doc in embedding_results
])
cross_encoder_ranked = sorted(
    zip(cross_encoder_scores, embedding_results),
    key=lambda pair: pair[0],
    reverse=True
)

# Stage 3: LLM re-ranking (only top-10; score_with_llm is sketched below)
top_10 = [doc for _, doc in cross_encoder_ranked[:10]]
llm_scores = [
    score_with_llm(query, doc) for doc in top_10
]
final_results = sorted(
    zip(llm_scores, top_10),
    key=lambda pair: pair[0],
    reverse=True
)

This three-stage approach moves from fast to better to best, balancing speed and quality at each step.
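
The score_with_llm helper above is a placeholder. One possible implementation, sketched here with the OpenAI Python client and an assumed gpt-4o-mini model (substitute whichever LLM client you actually use):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_with_llm(query, doc, model="gpt-4o-mini"):
    """Ask an LLM to rate the relevance of one document to the query (1-10)."""
    prompt = (
        f'Given the query: "{query}"\n\n'
        f'Rate the relevance of this document on a scale of 1-10:\n'
        f'"{doc["content"]}"\n\n'
        'Respond with only the number.'
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    text = (response.choices[0].message.content or "").strip()
    try:
        return int(text)
    except ValueError:
        # If the model returns anything other than a number, treat it as least relevant
        return 0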

Strategy 3: Hybrid Re-ranking

Combine multiple signals for smarter ranking.

Signals to consider:

  1. Relevance score (from cross-encoder or embeddings)

    • How well does the document match the query?
  2. Freshness (document metadata)

    • Is this the latest version?
    • When was it last updated?
  3. Authority (document metadata)

    • Is this from an official source?
    • How often is it referenced?
  4. Domain match (metadata)

    • Is this from the right department/system?
    • Does it match the user’s context?
  5. Specificity (document analysis)

    • How specific is the answer?
    • Does it directly address the query?

Implementation:

import math
from datetime import datetime

def hybrid_score(query, doc, cross_encoder_model, user_domain):
    # Component 1: Cross-encoder relevance, squashed to 0-1 with a sigmoid
    # (ms-marco cross-encoders output unbounded logits, not probabilities)
    logit = cross_encoder_model.predict([[query, doc['content']]])[0]
    relevance = 1 / (1 + math.exp(-logit))

    # Component 2: Freshness (0-1)
    # Decay score based on age, reaching 0 after one year
    days_old = (datetime.now() - doc['updated_at']).days
    freshness = max(0, 1 - (days_old / 365))

    # Component 3: Authority (0-1)
    # Higher for official docs, lower for user-generated
    authority = 1.0 if doc['source'] == 'official' else 0.5

    # Component 4: Domain match (0-1)
    domain_match = 1.0 if doc['domain'] == user_domain else 0.7

    # Component 5: Specificity (0-1)
    # Score based on whether doc directly answers the query
    specificity = 0.9 if doc['directly_answers_query'] else 0.6

    # Weighted combination
    final_score = (
        0.5 * relevance +      # Relevance is most important
        0.2 * freshness +      # Recent docs are better
        0.15 * authority +     # Official sources matter
        0.1 * domain_match +   # Context matters
        0.05 * specificity     # Specificity is a tiebreaker
    )

    return final_score

# Re-rank all documents
scores = [
    hybrid_score(query, doc, cross_encoder, user_domain)
    for doc in retrieved_docs
]
ranked_docs = sorted(
    zip(scores, retrieved_docs),
    key=lambda pair: pair[0],
    reverse=True
)

When to use:

  • You have rich metadata about documents
  • Different document types need different ranking
  • You want to balance multiple concerns (relevance, freshness, authority)
  • You have domain-specific ranking requirements

Benefits:

  • More sophisticated ranking
  • Can incorporate business logic
  • Handles edge cases better
  • Transparent weighting

Challenges:

  • More complex to implement
  • Requires tuning weights
  • Needs good metadata
  • Can be harder to debug

Strategy 4: Fine-tuned Re-rankers

Train your own re-ranker on your domain.

When to consider:

  • You have thousands of query-document pairs with relevance labels
  • Generic models don’t work well for your domain
  • You have specific ranking requirements
  • You can invest in training infrastructure

Data requirements:

query, document, relevance_label
"How do we handle user authentication?", "JWT tokens provide...", 5
"How do we handle user authentication?", "Security is important...", 2
"How do we handle user authentication?", "REST endpoints...", 1
...
(need 1000+ labeled examples)

Training approach:

from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Prepare training data (normalize 1-5 relevance labels to a 0-1 target)
train_examples = [
    InputExample(texts=[query, doc], label=score / 5.0)
    for query, doc, score in training_data
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Fine-tune model (CrossEncoder.fit takes the DataLoader directly;
# train_objectives and CosineSimilarityLoss belong to the bi-encoder API)
model = CrossEncoder('cross-encoder/qnli-distilroberta-base')
model.fit(
    train_dataloader=train_dataloader,
    epochs=1,
    warmup_steps=100,
    show_progress_bar=True
)

# Use fine-tuned model
scores = model.predict([[query, doc] for doc in retrieved_docs])

Benefits:

  • Best accuracy for your specific domain
  • Can capture domain-specific relevance signals
  • Competitive advantage

Costs:

  • Requires labeled training data
  • Needs ML expertise
  • Training time and compute
  • Ongoing maintenance

Re-ranking Implementation Checklist

Before implementing:

  • Measure current retrieval quality (baseline metrics)
  • Identify failure cases (where ranking is wrong)
  • Determine acceptable latency budget
  • Calculate cost of re-ranking
  • Decide: embedding-only vs. re-ranking tradeoff

Choosing a strategy:

  • Fast iteration? → Cross-encoder (ms-marco-MiniLM)
  • Maximum accuracy? → LLM re-ranking
  • Complex requirements? → Hybrid scoring
  • Have labeled data? → Fine-tuned model

Implementation:

  • Integrate re-ranker into retrieval pipeline
  • Monitor latency (should add <100ms)
  • Test on representative queries
  • Measure improvement in ranking quality (e.g. MRR, as sketched after this list)
  • Set up A/B testing for validation
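
One straightforward way to measure that improvement is mean reciprocal rank (MRR) over a set of test queries with known relevant documents. A minimal sketch; the evaluation data is assumed to come from your own labeled queries:

def mean_reciprocal_rank(ranked_ids_per_query, relevant_ids_per_query):
    """ranked_ids_per_query: ranked doc-id lists, one per test query.
    relevant_ids_per_query: sets of relevant doc ids, in the same order."""
    reciprocal_ranks = []
    for ranked_ids, relevant_ids in zip(ranked_ids_per_query, relevant_ids_per_query):
        rr = 0.0
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Compare before and after: higher MRR means relevant documents rank closer to the top
# baseline_mrr = mean_reciprocal_rank(embedding_only_rankings, relevant_docs)
# reranked_mrr = mean_reciprocal_rank(reranked_rankings, relevant_docs)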

Deployment:

  • Load model efficiently (cache, batching)
  • Handle failures gracefully (fallback to embedding-only; see the sketch after this list)
  • Monitor model performance over time
  • Plan for model updates
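
A minimal sketch of that graceful fallback: wrap the re-ranking call so a model failure degrades to the embedding order instead of an error (reranker and the document fields follow the earlier examples):

import logging

def rerank_with_fallback(query, candidates, reranker, top_n=5):
    try:
        scores = reranker.predict([[query, doc['content']] for doc in candidates])
        ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in ranked[:top_n]]
    except Exception:
        # Re-ranker unavailable or too slow: fall back to the embedding order
        logging.exception("Re-ranking failed; falling back to embedding-only ranking")
        return candidates[:top_n]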

Common Re-ranking Mistakes

Mistake 1: Re-ranking everything. Re-ranking 1,000 documents is expensive and unnecessary; re-rank only the top-100 from embeddings.

Mistake 2: Using the wrong model. LLM re-ranking is powerful but expensive; use cross-encoders for most cases and LLMs only on the top-10.

Mistake 3: Not measuring improvement. You can’t tell whether re-ranking helps without metrics; measure ranking quality before and after.

Mistake 4: Ignoring latency. Adding 500ms to retrieval breaks the user experience; monitor end-to-end latency.

Mistake 5: Forgetting to cache. Re-ranking the same documents repeatedly is wasteful; cache scores when possible.
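
A minimal sketch of score caching, keyed on the query and a stable document ID (this assumes each document carries an 'id' field):

score_cache = {}  # (query, doc_id) -> cross-encoder score

def cached_scores(query, docs, reranker):
    # Only run the model on pairs we haven't scored before
    missing = [doc for doc in docs if (query, doc['id']) not in score_cache]
    if missing:
        new_scores = reranker.predict([[query, doc['content']] for doc in missing])
        for doc, score in zip(missing, new_scores):
            score_cache[(query, doc['id'])] = float(score)
    return [score_cache[(query, doc['id'])] for doc in docs]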

Re-ranking in Calliope

AI Lab:

  • Experiment with different re-ranker models
  • Benchmark cross-encoder vs. LLM vs. embedding-only
  • Build custom re-ranking pipelines
  • Test on your actual documents

Chat Studio:

  • Automatic re-ranking with configurable models
  • Hybrid scoring with metadata
  • Performance monitoring built-in
  • Easy model switching

Langflow:

  • Visual re-ranking workflow builder
  • Chain embedding retrieval → re-ranking
  • Support for custom scoring functions
  • Integration with external models

The Bottom Line

Re-ranking is the most effective way to improve RAG retrieval quality without changing your vector database or embedding model.

The ranking hierarchy:

  1. Embedding-only: Fast, baseline quality
  2. + Cross-encoder: 30-50% quality improvement, minimal latency cost
  3. + Hybrid scoring: Domain-specific improvements, more complex
  4. + LLM re-ranking: Maximum quality, higher cost
  5. + Fine-tuned model: Best for your domain, requires training data

Start with cross-encoder re-ranking. It’s the best balance of quality, speed, and simplicity.

Improve RAG retrieval with re-ranking in Calliope →
