

Your RAG system has perfect documents. Your embeddings are state-of-the-art. Your vector database is properly indexed. Yet users ask questions and get no results.
The problem isn’t your documents or your retrieval infrastructure. It’s the gap between how users ask questions and how relevant information is written.
A user asks: “How do I fix my broken microwave?” Your documents contain: “Troubleshooting common appliance failures” and “Microwave repair procedures”.
The semantic meaning is aligned. But the exact phrasing differs. Vector similarity might miss the connection.
This is the query mismatch problem, and it’s one of the most common causes of RAG retrieval failures. Query expansion and rewriting are powerful techniques to bridge this gap.
Query expansion takes a user’s original question and generates variations that might match documents better.
Example:
Original query: "How do I authenticate users?"
Expanded queries:
- "What are methods for verifying user identity?"
- "How do I implement a user login system?"
- "User authentication setup and configuration"
- "How do I manage user credentials and sessions?"
Each variation targets different document phrasings. If one doesn’t match, another might.
Use an LLM to generate query variations.
from openai import OpenAI

def expand_query_with_llm(query, num_variations=5):
    """
    Use LLM to generate query variations.
    """
    client = OpenAI()
    prompt = f"""Generate {num_variations} alternative phrasings of this question that would help find the same information. These variations should use different terminology and approaches while maintaining the core intent.
Original question: {query}
Return only the alternative questions, one per line, without numbering or explanation."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    variations = response.choices[0].message.content.strip().split('\n')
    variations = [v.strip() for v in variations if v.strip()]
    return [query] + variations  # Include the original query
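If you want to sanity-check the output, a minimal usage sketch (assuming the expand_query_with_llm function above and an OPENAI_API_KEY in the environment) looks like this:

# Minimal usage sketch: print the original query plus its generated variations
variations = expand_query_with_llm("How do I authenticate users?", num_variations=3)
for q in variations:
    print(q)  # original query first, then the variations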
Advantages:
- Captures paraphrases and domain synonyms that simple keyword substitution misses
- Works across domains without curated vocabularies
Disadvantages:
- Adds an LLM call's cost and latency to every query
- Variations can drift from the original intent if the prompt is loose
Cost Optimization:
def batch_expand_queries(queries, cache_size=100):
    """
    Cache expansions to reduce API calls.
    """
    cache = {}
    for query in queries:
        # Check cache first
        if query in cache:
            yield cache[query]
            continue
        # Generate expansion
        expanded = expand_query_with_llm(query)
        cache[query] = expanded
        # Trim cache if too large (dicts preserve insertion order,
        # so the first key is the oldest entry)
        if len(cache) > cache_size:
            oldest = next(iter(cache))
            del cache[oldest]
        yield expanded
Generate variations by manipulating keywords.
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

# First run: nltk.download('punkt') and nltk.download('wordnet')

def keyword_expansion(query, max_variations=5):
    """
    Generate variations using synonyms and related terms.
    """
    tokens = word_tokenize(query)
    variations = [query]  # Include original
    for token in tokens:
        # Find synonyms via WordNet
        synonyms = set()
        for synset in wordnet.synsets(token.lower()):
            for lemma in synset.lemmas():
                if lemma.name() != token.lower():
                    # WordNet joins multi-word lemmas with underscores
                    synonyms.add(lemma.name().replace('_', ' '))
        # Create variations with each synonym
        for synonym in list(synonyms)[:2]:  # Limit to 2 per token
            variation = query.replace(token, synonym)
            if variation not in variations:
                variations.append(variation)
            if len(variations) >= max_variations:
                break
        if len(variations) >= max_variations:
            break
    return variations[:max_variations]
Example:
Original: "How do I authenticate users?"
Variations:
- "How do I verify users?"
- "How do I validate users?"
- "How do I confirm user identity?"
- "How do I establish user credentials?"
Advantages:
- Fast and free: no API calls, runs entirely locally
- Deterministic and easy to debug
Disadvantages:
- Only catches surface-level synonyms, not rephrasings of intent
- Naive word swaps can produce awkward or off-topic variations
Rather than expanding, rewrite the query to match document phrasing better.
def rewrite_query(query, document_sample=None):
    """
    Rewrite query to better match document style.
    If document_sample provided, adapt to that style.
    """
    client = OpenAI()
    prompt = f"""Rewrite this question to be more likely to match technical documentation.
Use imperative form, technical terminology, and concrete specifics.
Keep the core intent but change the phrasing.
Original: {query}"""
    if document_sample:
        prompt += f"\nMatch the style of this documentation:\n{document_sample[:500]}"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,  # Lower temperature for consistency
    )
    return response.choices[0].message.content.strip()
Example Rewrites:
User: "What's the best way to store passwords?"
Rewritten: "Implement secure password storage using bcrypt hashing"
User: "How do I make my API faster?"
Rewritten: "Optimize API response time through caching and indexing"
User: "Can you explain machine learning?"
Rewritten: "Explain supervised learning algorithms and training processes"
Break complex questions into sub-queries.
def decompose_query(query):
    """
    Break complex query into simpler sub-queries.
    """
    client = OpenAI()
    prompt = f"""Break this complex question into 2-4 simpler sub-questions that together answer the original question.
Question: {query}
Return only the sub-questions, one per line."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.5,
    )
    sub_queries = response.choices[0].message.content.strip().split('\n')
    return [q.strip() for q in sub_queries if q.strip()]
Example:
Original: "How do I build a production RAG system with security and monitoring?"
Decomposed into:
1. "What are the architecture components of a production RAG system?"
2. "How do I implement security in RAG systems?"
3. "How do I set up monitoring and logging for RAG?"
4. "What are best practices for deploying RAG to production?"
When to Use: decomposition pays off for complex, multi-part questions like the example above, where the answer spans several documents or topics and no single query would retrieve everything.
Use conversation history to improve expansion.
def context_aware_expansion(query, conversation_history=None):
    """
    Expand query considering previous messages.
    """
    if not conversation_history:
        return expand_query_with_llm(query)
    client = OpenAI()
    # Build context from history
    context = "Previous conversation:\n"
    for msg in conversation_history[-3:]:  # Last 3 messages
        context += f"- {msg}\n"
    prompt = f"""{context}
Current question: {query}
Generate 3-5 variations of the current question that account for the conversation context.
These should help find relevant information given what was already discussed."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,
    )
    variations = response.choices[0].message.content.strip().split('\n')
    return [query] + [v.strip() for v in variations if v.strip()]
Example:
User: "Tell me about authentication"
Assistant: "Here are the main authentication methods..."
User: "What about OAuth?"
Without context: Searches for "What about OAuth?"
With context: Searches for:
- "OAuth authentication implementation"
- "OAuth vs other authentication methods"
- "OAuth 2.0 protocol"
- "Implementing OAuth in production"
The power comes from using expansions effectively.
import numpy as np

def retrieve_with_expansion(query, vector_db, num_results=5):
    """
    Retrieve using query expansion.
    """
    # Generate expanded queries
    expanded = expand_query_with_llm(query, num_variations=3)
    # Retrieve for each query, tracking scores per document
    result_scores = {}
    for expanded_query in expanded:
        results = vector_db.search(expanded_query, top_k=num_results)
        for result in results:
            doc_id = result['id']
            score = result['score']
            # Track results across queries
            if doc_id not in result_scores:
                result_scores[doc_id] = []
            result_scores[doc_id].append(score)
    # Rank by average score across queries
    ranked = sorted(
        result_scores.items(),
        key=lambda x: np.mean(x[1]),
        reverse=True
    )
    # Return top results
    return [
        vector_db.get(doc_id)
        for doc_id, scores in ranked[:num_results]
    ]
Key Insight: Results that match multiple query variations are likely more relevant than results matching only one.
def evaluate_expansion_strategy(
    test_queries,      # list of (query, relevant_doc_ids) pairs
    expansion_func,
    retrieval_func,
):
    """
    Evaluate how well expansion improves retrieval recall.
    """
    metrics = {
        'recall_without_expansion': 0,
        'recall_with_expansion': 0,
    }
    for query, relevant_docs in test_queries:
        # Without expansion
        results_without = retrieval_func(query, top_k=10)
        retrieved_ids = [r['id'] for r in results_without]
        found_without = sum(1 for doc_id in relevant_docs if doc_id in retrieved_ids)
        metrics['recall_without_expansion'] += found_without / len(relevant_docs)
        # With expansion: retrieve for every variation and pool the hits
        expanded = expansion_func(query)
        retrieved_ids = {
            r['id']
            for expanded_query in expanded
            for r in retrieval_func(expanded_query, top_k=10)
        }
        found_with = sum(1 for doc_id in relevant_docs if doc_id in retrieved_ids)
        metrics['recall_with_expansion'] += found_with / len(relevant_docs)
    # Average across all queries
    n = len(test_queries)
    for key in metrics:
        metrics[key] /= n
    return metrics
Latency: each expansion adds an LLM round trip before retrieval starts, and every variation means another vector search. Cache expansions and run the per-variation searches in parallel where you can (see the sketch below).
Cost: LLM-based expansion multiplies API spend. Cache aggressively, use cheaper models for expansion, or fall back to keyword expansion for simple queries.
Quality: measure recall with and without expansion on a representative query set, using an evaluation like the one above, before enabling it everywhere.
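To keep the extra searches from stacking up serially, the per-variation lookups can run concurrently. A sketch using a thread pool, assuming vector_db.search is thread-safe:

# Sketch: run the per-variation searches concurrently to hide latency
from concurrent.futures import ThreadPoolExecutor

def parallel_search(expanded_queries, vector_db, top_k=5, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(vector_db.search, q, top_k=top_k)
            for q in expanded_queries
        ]
        return [f.result() for f in futures]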
class AdaptiveQueryExpander:
    """
    Expansion strategy that learns what works.
    """
    def __init__(self):
        self.expansion_cache = {}
        self.effectiveness = {}  # Track which expansions help

    def expand(self, query, max_variations=5):
        # Check cache
        if query in self.expansion_cache:
            return self.expansion_cache[query]
        # Use multiple strategies
        keyword_vars = keyword_expansion(query, max_variations=2)
        llm_vars = expand_query_with_llm(query, num_variations=3)
        # Combine, removing duplicates
        all_vars = list(set(keyword_vars + llm_vars))[:max_variations]
        # Cache
        self.expansion_cache[query] = all_vars
        return all_vars

    def track_effectiveness(self, query, expansion, retrieved_relevant):
        """
        Track which expansions actually helped.
        """
        if expansion not in self.effectiveness:
            self.effectiveness[expansion] = {'helped': 0, 'total': 0}
        self.effectiveness[expansion]['total'] += 1
        if retrieved_relevant:
            self.effectiveness[expansion]['helped'] += 1

    def get_best_expansions(self):
        """
        Return most effective expansions.
        """
        return sorted(
            self.effectiveness.items(),
            key=lambda x: x[1]['helped'] / x[1]['total'],
            reverse=True
        )
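A rough usage sketch, with a stubbed search function standing in for your vector database; everything here is illustrative:

# Illustrative only: a stubbed search stands in for a real vector DB
def stub_search(query, top_k=5):
    return [{'id': 'doc-1', 'score': 0.9}]  # placeholder result

known_relevant_ids = {'doc-1'}  # hypothetical ground truth for this query

expander = AdaptiveQueryExpander()
original = "How do I authenticate users?"
for variation in expander.expand(original):
    hits = stub_search(variation)
    found_relevant = any(r['id'] in known_relevant_ids for r in hits)
    expander.track_effectiveness(original, variation, found_relevant)
print(expander.get_best_expansions())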
Over-expansion: Generating so many variations that noise overwhelms signal. Three to five variations are usually optimal.
Semantic drift: Expansions that change the original intent. “How do I authenticate?” becoming “How do I encrypt?” is too far; a lightweight similarity check (sketched after this list) can filter these out.
Ignoring cost: LLM expansion is expensive. Cache results and measure ROI.
Not measuring: Using expansion without tracking if it actually improves results.
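One way to guard against semantic drift, as a sketch: embed the original query and each variation, then drop variations whose cosine similarity to the original falls below a threshold. The embedding model and the 0.75 cutoff below are illustrative choices, not recommendations from this post:

import numpy as np
from openai import OpenAI

def filter_drifted_expansions(original, variations, threshold=0.75):
    """Keep only variations that stay semantically close to the original."""
    client = OpenAI()
    texts = [original] + variations
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vectors = [np.array(item.embedding) for item in resp.data]
    anchor = vectors[0]
    kept = []
    for variation, vec in zip(variations, vectors[1:]):
        similarity = float(np.dot(anchor, vec) /
                           (np.linalg.norm(anchor) * np.linalg.norm(vec)))
        if similarity >= threshold:
            kept.append(variation)
    return kept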
Calliope’s AI Lab and Chat Studio support these query expansion and rewriting techniques.
Query expansion is often the highest-ROI improvement you can make to RAG systems. Users ask in their language. Documents are written in theirs. Bridge the gap.
