

You’ve optimized your chunking strategy. Your embeddings are solid. Your vector database is fast. But when users ask questions, your RAG system still returns irrelevant results in the top positions.
The problem isn’t retrieval—it’s ranking.
Vector similarity search retrieves candidate documents, but it doesn’t always rank them correctly. A chunk might be semantically similar to the query without actually answering it. Or multiple chunks might be relevant, but the most useful one ranks third instead of first.
Re-ranking solves this by applying a second-stage ranking model after initial retrieval. While the first stage casts a wide net (find candidates), re-ranking focuses on precision (order by relevance).
This two-stage approach dramatically improves RAG quality without changing your embedding model or vector database.
The cost of bad ranking:
Real-world impact: A legal research company retrieved relevant documents but ranked them poorly. Lawyers had to scroll past irrelevant results to find the answer. Implementing re-ranking improved answer quality by 40% with zero changes to the vector database.
Modern RAG systems use two stages:
```
Query
  ↓
[Stage 1: Vector Search]
  → Fast, broad retrieval
  → Returns top 50-100 candidates
  ↓
[Stage 2: Re-ranking]
  → Slower, precise ranking
  → Re-orders top candidates
  → Returns top 5-10 for LLM
  ↓
Context for LLM
```
Why two stages? Because speed and precision pull in opposite directions. Running a precise ranking model over every document in the corpus is far too slow, while vector search alone is fast but only approximately ordered. Splitting the work lets the cheap stage narrow the field and the expensive stage get the order right.
Strategy 1: Cross-encoder re-ranking
Cross-encoders are specialized models that score query-document pairs directly.
How they work:
```
Input: [query, document_chunk]
  ↓
[BERT/DeBERTa encoder]
  ↓
Output: Relevance score (0-1)
```
Unlike bi-encoders (which embed query and document separately), cross-encoders process the pair together, capturing their relationship.
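To make that difference concrete, here is a minimal contrast between the two, using the sentence-transformers package; the model names are common defaults used only for illustration:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "How do I rotate an API key?"
doc = "API keys can be rotated from the security settings page."

# Bi-encoder: embed query and document independently, then compare the vectors
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
q_emb, d_emb = bi_encoder.encode([query, doc])
similarity = util.cos_sim(q_emb, d_emb).item()

# Cross-encoder: score the (query, document) pair jointly in one forward pass
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
relevance = cross_encoder.predict([[query, doc]])[0]

print(f"bi-encoder cosine: {similarity:.3f}, cross-encoder score: {relevance:.3f}")
```

The bi-encoder score can be precomputed and indexed; the cross-encoder score cannot, which is exactly why it only runs over the small candidate set in stage 2.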
When to use:
Implementation:
```python
from sentence_transformers import CrossEncoder

# Initialize re-ranker
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')

# Initial retrieval from vector DB
candidates = vector_db.search(query_embedding, top_k=50)

# Re-rank candidates
pairs = [[query, doc['content']] for doc in candidates]
scores = reranker.predict(pairs)

# Sort by re-ranking scores
ranked = sorted(
    zip(candidates, scores),
    key=lambda x: x[1],
    reverse=True
)

# Return top 5 for LLM
context = [doc for doc, score in ranked[:5]]
```
Popular cross-encoder models:
- cross-encoder/ms-marco-MiniLM-L-12-v2 - Fast, good for web search
- cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 - Multilingual, production-grade
- cross-encoder/qnli-distilroberta-base - Good for QA
- cross-encoder/ms-marco-TinyBERT-L-2-v2 - Ultra-fast for mobile
Pros:
Cons:
Strategy 2: LLM-based re-ranking
Use an LLM to score or rank retrieved documents.
Approach 1: Direct scoring
```python
# Ask LLM to score each document
prompt = f"""
Given this query: {query}

Rate how relevant each document is (0-10):

Document 1: {doc1}
Score: [LLM outputs score]

Document 2: {doc2}
Score: [LLM outputs score]
"""
response = llm(prompt)
# Parse scores and re-rank
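```

A minimal sketch of the full direct-scoring loop, assuming an `llm()` helper that returns the raw completion text (as above) and that the model follows the requested JSON format:

```python
import json

def llm_rerank_by_score(query, docs, llm):
    # One call: ask for a JSON array of 0-10 relevance scores, one per document, in order
    numbered = "\n".join(f"Document {i + 1}: {doc}" for i, doc in enumerate(docs))
    prompt = (
        f"Query: {query}\n\n{numbered}\n\n"
        "Rate how relevant each document is to the query on a 0-10 scale. "
        "Reply with only a JSON array of numbers, one per document, in order."
    )
    scores = json.loads(llm(prompt))
    return sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
```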
Approach 2: Pairwise comparison
```python
# Compare documents pairwise (like chess tournament ranking)
prompt = f"""
Query: {query}

Which document is more relevant?
A) {doc_a}
B) {doc_b}

Answer: [A or B]
"""
# Run pairwise comparisons
# Build ranking from results
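```

Sketched as a sort with the LLM acting as the comparator (again assuming an `llm()` helper). Note that this costs O(n log n) LLM calls, so it only makes sense for small candidate sets:

```python
from functools import cmp_to_key

def llm_pairwise_rerank(query, docs, llm):
    # The LLM decides which of two documents answers the query better
    def compare(doc_a, doc_b):
        prompt = (
            f"Query: {query}\n\n"
            f"A) {doc_a}\n\nB) {doc_b}\n\n"
            "Which document is more relevant to the query? Answer with exactly A or B."
        )
        answer = llm(prompt).strip().upper()
        return -1 if answer.startswith("A") else 1

    # Sort using the LLM comparator; "better" documents come first
    return sorted(docs, key=cmp_to_key(compare))
```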
Approach 3: List-wise ranking
```python
# Ask LLM to order all documents
prompt = f"""
Query: {query}

Rank these documents by relevance (most to least relevant):
1. {doc1}
2. {doc2}
3. {doc3}
...

Ranking: [LLM orders them]
"""
When to use:
Pros:
Cons:
Strategy 3: Hybrid re-ranking
Combine multiple scoring signals for robust ranking.
Multi-signal approach:
```python
# Compute a weighted combination of signals for each candidate document
def combined_score(query, query_embedding, doc):
    # Signal 1: Vector similarity
    vector_score = cosine_similarity(query_embedding, doc['embedding'])
    # Signal 2: BM25 keyword match
    bm25_score = bm25.score(query, doc['content'])
    # Signal 3: Cross-encoder relevance
    cross_encoder_score = reranker.predict([[query, doc['content']]])[0]
    # Signal 4: Document metadata
    recency_score = 1 - doc['days_old'] / max_days  # prefer recent documents
    authority_score = doc['metadata']['authority_score']
    # Combine signals with weights
    return (
        0.2 * vector_score +
        0.2 * bm25_score +
        0.4 * cross_encoder_score +
        0.1 * recency_score +
        0.1 * authority_score
    )

# Rank by combined score
ranked = sorted(
    candidates,
    key=lambda doc: combined_score(query, query_embedding, doc),
    reverse=True
)
```
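One detail the weighted sum glosses over: the raw signals live on different scales (cosine similarity in [-1, 1], BM25 unbounded, metadata scores arbitrary), so it usually pays to normalize each signal across the candidate set before combining. A minimal min-max sketch:

```python
import numpy as np

def min_max_normalize(scores):
    # Map raw signal scores onto [0, 1] so the weights above are comparable across signals
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return np.zeros_like(scores) if span == 0 else (scores - scores.min()) / span
```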
Signal selection:
When to use:
Pros:
Cons:
Strategy 4: Learning-to-rank
Train a ranking model on your domain-specific data.
The approach:
1. Collect training data:
   - Queries
   - Retrieved documents
   - Human relevance judgments (0-5 scale)
2. Extract features for each (query, document) pair:
   - Vector similarity
   - BM25 score
   - Query length
   - Document length
   - Metadata features
   - Cross-encoder score
3. Train a ranking model:
   - Input: feature vectors
   - Output: relevance score
   - Model: LambdaMART, XGBoost, or a neural network
4. Deploy the trained model:
   - Extract features at query time
   - Run the trained model
   - Use predicted scores for ranking
Example with LambdaMART (sketched here with LightGBM's LGBMRanker, whose lambdarank objective implements the LambdaMART algorithm):
```python
from lightgbm import LGBMRanker

# Prepare training data: one feature vector per (query, document) pair
X_train = extract_features(queries, documents)
y_train = relevance_labels        # graded relevance judgments (0-5)
group_sizes = docs_per_query      # number of candidate documents for each query

# Train the model
ranker = LGBMRanker(
    objective='lambdarank',
    metric='ndcg',
    num_leaves=31,
    n_estimators=100
)
ranker.fit(X_train, y_train, group=group_sizes)

# Use at inference time
X_test = extract_features(query, candidates)
scores = ranker.predict(X_test)
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
```
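The `extract_features` helper is left abstract above. As a per-query sketch, reusing the `cosine_similarity`, `bm25`, `reranker`, and `embed` helpers that appear elsewhere in this post (the document field names are assumptions):

```python
import numpy as np

def extract_features(query, candidates):
    # One row of features per (query, document) pair, following the feature list above
    query_embedding = embed(query)
    rows = []
    for doc in candidates:
        rows.append([
            cosine_similarity(query_embedding, doc['embedding']),  # vector similarity
            bm25.score(query, doc['content']),                     # BM25 score
            len(query.split()),                                    # query length
            len(doc['content'].split()),                           # document length
            doc['metadata'].get('authority_score', 0.0),           # metadata feature
            reranker.predict([[query, doc['content']]])[0],        # cross-encoder score
        ])
    return np.asarray(rows)
```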
When to use:
Pros:
Cons:
Step 1: Choose your re-ranking strategy
| Scenario | Best Strategy |
|---|---|
| Need quick wins | Cross-encoder (MiniLM) |
| Have fast LLM | LLM-based scoring |
| Multiple signals | Hybrid re-ranking |
| Mature system | Learning-to-rank |
Step 2: Measure baseline performance
Before implementing re-ranking, establish metrics:
```python
def evaluate_ranking(queries, candidates, relevance_labels):
    # Mean Reciprocal Rank (MRR): average of 1/rank of the first relevant result
    mrr = mean([1 / rank for rank in relevant_ranks])
    # Normalized Discounted Cumulative Gain (NDCG@5)
    ndcg = ndcg_score(relevance_labels, predicted_scores, k=5)
    # Precision@5: share of relevant documents among the top 5
    precision = sum(relevant[:5]) / 5
    # Mean Average Precision (MAP)
    ap = mean_average_precision(relevance_labels, predicted_scores)
    return {'MRR': mrr, 'NDCG@5': ndcg, 'P@5': precision, 'MAP': ap}
```
Step 3: Implement re-ranking
```python
class RAGWithReranking:
    def __init__(self, vector_db, reranker, top_k_initial=50, top_k_final=5):
        self.vector_db = vector_db
        self.reranker = reranker
        self.top_k_initial = top_k_initial
        self.top_k_final = top_k_final

    def retrieve(self, query):
        # Stage 1: Vector search
        query_embedding = embed(query)
        candidates = self.vector_db.search(query_embedding, self.top_k_initial)

        # Stage 2: Re-ranking
        scores = self.reranker.score(query, candidates)
        ranked = sorted(
            zip(candidates, scores),
            key=lambda x: x[1],
            reverse=True
        )

        # Return top results
        return [doc for doc, score in ranked[:self.top_k_final]]
```
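Wiring this up with the cross-encoder from earlier takes a small adapter, since `CrossEncoder` exposes `predict` on (query, document) pairs rather than a `score(query, candidates)` method. The shim below is illustrative, not part of any library:

```python
from sentence_transformers import CrossEncoder

class CrossEncoderReranker:
    # Thin adapter so RAGWithReranking can call reranker.score(query, candidates)
    def __init__(self, model_name='cross-encoder/ms-marco-MiniLM-L-12-v2'):
        self.model = CrossEncoder(model_name)

    def score(self, query, candidates):
        pairs = [[query, doc['content']] for doc in candidates]
        return self.model.predict(pairs)

rag = RAGWithReranking(vector_db, CrossEncoderReranker(), top_k_initial=50, top_k_final=5)
context_docs = rag.retrieve("What does the refund policy say about annual plans?")
```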
Step 4: Measure improvement
```python
# Compare before and after
baseline_metrics = evaluate_ranking(test_queries, baseline_results, relevance_labels)
reranked_metrics = evaluate_ranking(test_queries, reranked_results, relevance_labels)

improvement = {
    k: (reranked_metrics[k] - baseline_metrics[k]) / baseline_metrics[k] * 100
    for k in baseline_metrics.keys()
}

print(f"NDCG@5 improvement: {improvement['NDCG@5']:.1f}%")
print(f"MRR improvement: {improvement['MRR']:.1f}%")
```
| Strategy | Speed | Accuracy | Cost | Complexity |
|---|---|---|---|---|
| Cross-encoder (MiniLM) | 100-200ms | High | Low | Low |
| Cross-encoder (Large) | 200-500ms | Very High | Low | Low |
| LLM-based | 1-5s | High | Medium | Medium |
| Hybrid (2-3 signals) | 200-400ms | High | Low | Medium |
| Learning-to-rank | 50-100ms | Very High | Medium | High |
Mistake 1: Re-ranking everything Re-ranking 10,000 documents is slow and expensive. Always retrieve a smaller candidate set first (50-100), then re-rank.
Mistake 2: Using wrong cross-encoder Different cross-encoders are trained on different tasks (MS MARCO, QNLI, etc.). Match the model to your use case.
Mistake 3: No baseline measurement If you implement re-ranking without first measuring baseline quality, you can't tell whether it actually helped.
Mistake 4: Ignoring latency Re-ranking adds latency. If your target is <500ms total, cross-encoder re-ranking might not fit. Measure end-to-end.
Mistake 5: Static weights In hybrid re-ranking, don’t hardcode weights. Tune them on a validation set for your specific domain.
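One simple way to avoid hardcoded weights is a random search over the weight simplex on a validation set, scored with NDCG@5. A sketch; the function and argument names here are illustrative:

```python
import numpy as np
from sklearn.metrics import ndcg_score

def tune_hybrid_weights(per_query_signals, per_query_labels, n_trials=500, seed=0):
    # per_query_signals: list of (n_candidates, n_signals) arrays, one per validation query
    # per_query_labels: list of graded relevance arrays aligned with those candidates
    rng = np.random.default_rng(seed)
    n_signals = per_query_signals[0].shape[1]
    best_w, best_ndcg = None, -1.0
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(n_signals))  # random weights summing to 1
        ndcg = np.mean([
            ndcg_score([labels], [signals @ w], k=5)
            for signals, labels in zip(per_query_signals, per_query_labels)
        ])
        if ndcg > best_ndcg:
            best_w, best_ndcg = w, ndcg
    return best_w, best_ndcg
```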
Calliope integrates re-ranking seamlessly:
Chat Studio:
Langflow:
AI Lab:
Re-ranking is one of the highest-ROI improvements you can make to a RAG system. It requires minimal changes to existing infrastructure but dramatically improves result quality.
The best approach:
Re-ranking separates production RAG systems from toy prototypes. Implement it early.
