
Your RAG system retrieves 10 documents from the vector database. The first result is tangentially related. The third result is exactly what you need. By the time the LLM processes all 10 results, it’s confused and picks the wrong information.
This is the re-ranking problem.
Standard RAG retrieval uses embedding-based similarity: your query gets embedded, compared to document embeddings, and the most similar ones are returned. But embedding similarity doesn’t always match relevance. Two documents can have similar embeddings while only one actually answers the question.
Re-ranking solves this by applying a second, more sophisticated ranking pass after initial retrieval. Instead of relying solely on embedding similarity, re-ranking uses specialized models designed to understand query-document relevance deeply.
The result: dramatically improved retrieval quality without changing your vector database or embedding model.
The embedding similarity problem:
Embeddings measure semantic similarity in vector space. Two documents can be semantically similar but have different relevance to your query.
Example:
Query: "How do we implement authentication in the API?"
Retrieved documents (by embedding similarity):
1. "Security is important. Always use encryption..." (semantic match, but generic)
2. "Our API uses REST endpoints..." (somewhat related)
3. "JWT tokens provide stateless authentication..." (EXACT answer, ranked 3rd)
4. "We discuss security throughout this guide..." (semantic match, not specific)
5. "Authentication methods vary by framework..." (relevant but general)
The embedding model sees result 1's talk of security and encryption as semantically close to a query about authentication, so it gets a high similarity score. But result 1 is generic security advice, not API authentication implementation.
Result 3 contains the specific answer but ranks lower because it uses different phrasing.
Why this happens:
Embeddings for the query and each document are computed independently, so similarity can only reflect topical overlap; the model never sees the query and document side by side, and it can't tell whether a passage actually answers the question or merely discusses the same topic.
Re-ranking fixes this by applying a more powerful model to re-score the retrieved documents.
The architecture:
Query: "How do we implement authentication?"
↓
[Embedding-based retrieval]
Retrieve top-k documents (k=100 for re-ranking)
↓
[Re-ranking model]
Score each document for relevance to query
↓
[Re-order results]
Return top-n documents (n=5 for LLM)
↓
LLM processes re-ranked results
Why re-ranking works:
The re-ranking model reads the query and each candidate together, so it can judge whether a document actually answers the question rather than merely resembling it. And because it only scores the top-k candidates from the first stage, its higher per-document cost stays bounded.
This two-stage approach balances cost (fast embedding retrieval) with quality (accurate re-ranking).
Cross-encoders are specialized models trained to score query-document pairs.
How they work:
Traditional embedding approach:
query_embedding = embed(query)
doc_embedding = embed(document)
similarity = cosine(query_embedding, doc_embedding)
Cross-encoder approach:
relevance_score = cross_encoder(query, document)
The key difference: cross-encoders process the query and document together, allowing them to capture interactions between them.
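To make the difference concrete, here is a minimal, self-contained sketch that scores the example documents both ways. It assumes the sentence-transformers library; the all-MiniLM-L6-v2 bi-encoder and cross-encoder/ms-marco-MiniLM-L-12-v2 cross-encoder are illustrative choices, not requirements, and the exact numbers will depend on the models you pick.

from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "How do we implement authentication in the API?"
docs = [
    "Security is important. Always use encryption...",
    "Our API uses REST endpoints...",
    "JWT tokens provide stateless authentication...",
]

# Bi-encoder: embed query and documents independently, compare with cosine
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
doc_embs = bi_encoder.encode(docs, convert_to_tensor=True)
cosine_scores = util.cos_sim(query_emb, doc_embs)[0]

# Cross-encoder: score each (query, document) pair jointly
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
pair_scores = cross_encoder.predict([[query, doc] for doc in docs])

# Print both scores side by side to compare the two orderings
for doc, cos, ce in zip(docs, cosine_scores, pair_scores):
    print(f"cosine={float(cos):.3f}  cross-encoder={float(ce):.3f}  {doc[:45]}")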
Popular cross-encoder models:
cross-encoder/ms-marco-MiniLM-L-12-v2 (Recommended for most use cases)
cross-encoder/ms-marco-TinyBERT-L-2-v2 (For speed-critical systems)
cross-encoder/qnli-distilroberta-base (For semantic understanding)
Proprietary models (LLMs as re-rankers)
Implementation example:
from sentence_transformers import CrossEncoder

# Load cross-encoder model
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')

# Initial retrieval with embeddings
query = "How do we implement authentication?"
retrieved_docs = vector_db.search(query, top_k=100)

# Re-rank using cross-encoder
scores = cross_encoder.predict([
    [query, doc['content']] for doc in retrieved_docs
])

# Re-order by cross-encoder scores
# (the sort key avoids comparing the doc dicts when scores tie)
ranked_docs = [
    doc for _, doc in sorted(
        zip(scores, retrieved_docs),
        key=lambda pair: pair[0],
        reverse=True
    )
]

# Return top-5 to LLM
top_results = ranked_docs[:5]
When to use cross-encoders:
Pros:
Significantly better relevance ranking than embedding similarity alone
No changes needed to your vector database or embedding model
Small variants (like the TinyBERT model above) keep latency manageable
Cons:
Every query-document pair must be scored at query time, so cost and latency grow with the number of candidates
Scores can't be precomputed and indexed the way document embeddings can
An alternative is to use a language model itself as the re-ranker.
How it works:
# Use LLM to score relevance
prompt = f"""
Given the query: "{query}"
Rate the relevance of this document on a scale of 1-10:
"{document}"
Respond with only the number.
"""
relevance_score = int(llm(prompt))
Advantages:
No separate re-ranking model to deploy; you reuse the LLM you already call
Powerful, nuanced relevance judgments, and the criteria are easy to adjust in the prompt
Disadvantages:
One LLM call per document, so it's slow and expensive at scale
When to use:
Only on small candidate sets, such as the top-10 survivors of a cross-encoder pass
Cost-effective approach: Use LLM re-ranking on top-10 from cross-encoder, not on initial 100 results.
# Stage 1: Fast embedding retrieval
embedding_results = vector_db.search(query, top_k=100)

# Stage 2: Cross-encoder re-ranking
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
cross_encoder_scores = cross_encoder.predict([
    [query, doc['content']] for doc in embedding_results
])
cross_encoder_ranked = sorted(
    zip(cross_encoder_scores, embedding_results),
    key=lambda pair: pair[0],
    reverse=True
)

# Stage 3: LLM re-ranking (only top-10)
top_10 = [doc for _, doc in cross_encoder_ranked[:10]]
llm_scores = [
    score_with_llm(query, doc) for doc in top_10
]
final_results = sorted(
    zip(llm_scores, top_10),
    key=lambda pair: pair[0],
    reverse=True
)
This three-stage cascade goes fast → better → best, balancing speed and quality at each step.
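The pipeline above leaves score_with_llm undefined. Here is one possible sketch, assuming the openai Python client (v1+) and the gpt-4o-mini model name purely for illustration; any chat-style LLM call can stand in.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_with_llm(query, doc, model="gpt-4o-mini"):
    # Ask the LLM for a 1-10 relevance rating, as in the prompt shown earlier
    prompt = (
        f'Given the query: "{query}"\n'
        f'Rate the relevance of this document on a scale of 1-10:\n'
        f'"{doc["content"]}"\n'
        f'Respond with only the number.'
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    reply = response.choices[0].message.content.strip()
    try:
        return int(reply)
    except ValueError:
        return 0  # treat unparseable replies as irrelevant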
Combine multiple signals for smarter ranking.
Signals to consider:
Relevance score (from cross-encoder or embeddings)
Freshness (document metadata)
Authority (document metadata)
Domain match (metadata)
Specificity (document analysis)
Implementation:
from datetime import datetime

def hybrid_score(query, doc, cross_encoder_model, user_domain):
    # Component 1: Cross-encoder relevance
    # (raw cross-encoder outputs are logits; squash or normalize them if you
    # need a strict 0-1 range before blending)
    relevance = cross_encoder_model.predict([[query, doc['content']]])[0]

    # Component 2: Freshness (0-1)
    # Decay score based on age
    days_old = (datetime.now() - doc['updated_at']).days
    freshness = max(0, 1 - (days_old / 365))  # 1 year decay

    # Component 3: Authority (0-1)
    # Higher for official docs, lower for user-generated
    authority = 1.0 if doc['source'] == 'official' else 0.5

    # Component 4: Domain match (0-1)
    domain_match = 1.0 if doc['domain'] == user_domain else 0.7

    # Component 5: Specificity (0-1)
    # Score based on whether doc directly answers query
    specificity = 0.9 if doc['directly_answers_query'] else 0.6

    # Weighted combination
    final_score = (
        0.5 * relevance +      # Relevance is most important
        0.2 * freshness +      # Recent docs are better
        0.15 * authority +     # Official sources matter
        0.1 * domain_match +   # Context matters
        0.05 * specificity     # Specificity is a tiebreaker
    )
    return final_score

# Re-rank all documents
scores = [
    hybrid_score(query, doc, cross_encoder, user_domain) for doc in retrieved_docs
]
ranked_docs = sorted(
    zip(scores, retrieved_docs),
    key=lambda pair: pair[0],
    reverse=True
)
When to use:
When relevance alone isn't enough: your corpus mixes fresh and stale content, official and user-generated sources, or multiple domains.
Benefits:
You can blend relevance with business signals like freshness and authority, with weights you control.
Challenges:
The weights need tuning and validation, and every signal depends on metadata you have to keep accurate.
Train your own re-ranker on your domain.
When to consider:
General-purpose cross-encoders miss your domain's vocabulary or notion of relevance, and you can collect at least a thousand labeled query-document pairs.
Data requirements:
query, document, relevance_label
"How do we handle user authentication?", "JWT tokens provide...", 5
"How do we handle user authentication?", "Security is important...", 2
"How do we handle user authentication?", "REST endpoints...", 1
...
(need 1000+ labeled examples)
Training approach:
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Prepare training data
# (labels are normalized to 0-1, since the default loss for a single-score
# cross-encoder expects values in that range)
train_examples = [
    InputExample(texts=[query, doc], label=score / 5.0)
    for query, doc, score in training_data
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Fine-tune model
model = CrossEncoder('cross-encoder/qnli-distilroberta-base')
model.fit(
    train_dataloader=train_dataloader,
    epochs=1,
    warmup_steps=100,
    show_progress_bar=True
)

# Use fine-tuned model
scores = model.predict([[query, doc] for doc in retrieved_docs])
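A quick way to check that fine-tuning actually helped is to score a held-out set with both the base and fine-tuned models and compare rank correlation against your labels. This sketch assumes a hypothetical heldout_data list of (query, document, relevance_label) tuples kept out of training, plus scipy.

from scipy.stats import spearmanr

pairs = [[q, d] for q, d, _ in heldout_data]
labels = [label for _, _, label in heldout_data]

# Compare the off-the-shelf model with the fine-tuned one on the same pairs
baseline = CrossEncoder('cross-encoder/qnli-distilroberta-base')
baseline_corr, _ = spearmanr(baseline.predict(pairs), labels)
finetuned_corr, _ = spearmanr(model.predict(pairs), labels)

print(f"Spearman vs. labels: baseline={baseline_corr:.3f}, fine-tuned={finetuned_corr:.3f}")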
Benefits:
A re-ranker tuned to your domain's vocabulary and relevance judgments.
Costs:
Collecting and labeling 1000+ examples, plus training, evaluating, and maintaining your own model.
Before implementing:
Choosing a strategy:
Implementation:
Deployment:
Mistake 1: Re-ranking everything
Re-ranking 1000 documents is expensive and unnecessary. Re-rank only the top-100 from embeddings.
Mistake 2: Using the wrong model
LLM re-ranking is powerful but expensive. Use cross-encoders for most cases, and LLMs only for the top-10.
Mistake 3: Not measuring improvement
You can't tell if re-ranking helps without metrics. Measure ranking quality before and after.
Mistake 4: Ignoring latency
Adding 500ms to retrieval breaks the user experience. Monitor end-to-end latency.
Mistake 5: Forgetting to cache
Re-ranking the same documents repeatedly is wasteful. Cache scores when possible.
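As a minimal sketch of such a cache, assuming each document dict carries a stable 'id' field (adapt the key to whatever identifier your store provides):

score_cache = {}

def cached_scores(query, docs, cross_encoder):
    # Only send (query, document) pairs we haven't scored before to the model
    missing = [doc for doc in docs if (query, doc['id']) not in score_cache]
    if missing:
        new_scores = cross_encoder.predict(
            [[query, doc['content']] for doc in missing]
        )
        for doc, score in zip(missing, new_scores):
            score_cache[(query, doc['id'])] = float(score)
    return [score_cache[(query, doc['id'])] for doc in docs]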
Re-ranking is the most effective way to improve RAG retrieval quality without changing your vector database or embedding model.
The ranking hierarchy:
Embedding similarity retrieves candidates fast, a cross-encoder re-ranks them accurately, and an LLM pass (if you need one) polishes the final few.
Start with cross-encoder re-ranking. It’s the best balance of quality, speed, and simplicity.
