Streaming Responses in AI Applications: Building Real-Time Conversational Interfaces

Your Users Are Waiting for the Complete Response Before Seeing Anything

A user asks your AI chatbot a question. Nothing happens for 3 seconds. Then the complete 500-word response appears all at once.

They’ve already left.

This is the non-streaming problem. Most AI applications wait for the entire LLM response to finish generating before sending anything to the user. For long responses, this creates an unbearable wait.

Streaming solves this by sending the response token-by-token as it’s generated. The user sees text appearing in real-time, creating the perception of a responsive, intelligent system.

The difference is dramatic: 3 seconds of waiting vs. 3 seconds of reading as text appears.

Why Streaming Matters

The psychology of waiting:

  • No feedback: Waiting 3 seconds with no output feels like 10 seconds
  • With feedback: Watching text stream feels interactive, even if total time is the same
  • User engagement: Streaming keeps users engaged; they’re less likely to abandon

Real-world impact: A financial services chatbot switched from non-streaming to streaming responses. User satisfaction increased 40%. Abandonment rate (users leaving before getting a response) dropped 60%.

The technical advantage: Streaming isn’t just better UX—it’s better architecture:

  • Reduced latency perception: Users see first token in 100-200ms instead of waiting 3 seconds
  • Better memory usage: Don’t buffer entire response in memory
  • Faster time-to-first-token: Critical for user experience
  • Handles long responses: long outputs reach the user incrementally instead of arriving as one oversized payload

How Streaming Works

Non-streaming approach:

User: "Explain quantum computing"
         ↓
LLM generates entire response (3 seconds)
Response: "Quantum computing is..."
         ↓
Send complete response to user
         ↓
User sees: [3 second wait] → entire response appears

Streaming approach:

User: "Explain quantum computing"
         ↓
LLM starts generating response
         ↓
As each token is generated:
  - Send token to user immediately
  - User sees text appear in real-time
         ↓
User sees: [100ms] "Quantum" [50ms] " computing" [50ms] " is" ...
         ↓
Response complete, user has read most of it already

The technical flow:

Client                    Server                    LLM
  │                         │                        │
  ├─ POST /chat ────────────>│                        │
  │                         │                        │
  │                         ├─ Start generation ────>│
  │                         │                        │
  │                         │<─ token: "Quantum" ────┤
  │<─ stream: "Quantum" ─────┤                        │
  │                         │<─ token: " computing" ─┤
  │<─ stream: " computing" ──┤                        │
  │                         │<─ token: " is" ────────┤
  │<─ stream: " is" ─────────┤                        │
  │                         │                        │
  │                         │<─ [END] ───────────────┤
  │<─ stream: [END] ─────────┤                        │
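
All of the server examples below lean on a small llm.stream() helper that yields tokens as they arrive. Here is a minimal sketch of what that helper might look like, assuming the OpenAI Python SDK (the model name is a placeholder; any provider that exposes a streaming iterator works the same way):

from openai import OpenAI

client = OpenAI()

class LLM:
    def stream(self, prompt):
        """Yield response tokens as the model generates them."""
        completion = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for chunk in completion:
            delta = chunk.choices[0].delta.content
            if delta:  # some chunks carry no text
                yield delta

llm = LLM()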

Strategy 1: Server-Sent Events (SSE)

SSE is the simplest streaming approach for HTTP-based applications.

How SSE works:

from flask import Flask, Response, request
import json

app = Flask(__name__)

# EventSource on the client can only issue GET requests,
# so the message arrives as a query parameter
@app.route('/chat', methods=['GET'])
def chat_stream():
    user_message = request.args['message']

    def generate():
        # Stream tokens from the LLM, formatting each one as an SSE event
        for token in llm.stream(user_message):
            yield f"data: {json.dumps({'token': token})}\n\n"

    return Response(
        generate(),
        mimetype='text/event-stream',
        headers={
            'Cache-Control': 'no-cache',
            'Connection': 'keep-alive'
        }
    )

Client-side JavaScript:

const eventSource = new EventSource('/chat?message=' + encodeURIComponent(message));
let response = '';

eventSource.onmessage = (event) => {
    const data = JSON.parse(event.data);
    response += data.token;
    
    // Update UI with streamed text
    document.getElementById('response').textContent = response;
};

// The browser also fires onerror when the server closes the stream;
// closing here stops EventSource's automatic reconnection
eventSource.onerror = () => {
    eventSource.close();
};

Advantages:

  • Simple to implement
  • Works with standard HTTP
  • Good browser support
  • Automatic reconnection

Disadvantages:

  • One-way communication (server → client)
  • Client can’t send data over the same connection (EventSource only issues GET requests)
  • Limited to the text/event-stream format
  • Some proxies and firewalls buffer or drop long-lived connections

When to use:

  • Simple chat interfaces
  • Server needs to send updates to client
  • Don’t need bidirectional communication

Strategy 2: WebSockets

WebSockets provide full-duplex communication for more complex applications.

How WebSockets work:

from flask import Flask
from flask_socketio import SocketIO, emit

app = Flask(__name__)
socketio = SocketIO(app, cors_allowed_origins="*")

@socketio.on('chat_message')
def handle_chat(data):
    user_message = data['message']
    
    # Stream tokens to client
    for token in llm.stream(user_message):
        emit('token', {'token': token})
    
    # Signal completion
    emit('done')

Client-side JavaScript:

const socket = io();
let response = '';

socket.on('token', (data) => {
    response += data.token;
    document.getElementById('response').textContent = response;
});

socket.on('done', () => {
    console.log('Response complete');
});

// Send message
socket.emit('chat_message', { message: userInput });
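
To exercise the endpoint without a browser, the python-socketio client mirrors the same event flow. A quick sketch (the server address is an assumption):

import socketio

sio = socketio.Client()
tokens = []

@sio.on('token')
def on_token(data):
    tokens.append(data['token'])

@sio.on('done')
def on_done():
    print(''.join(tokens))
    sio.disconnect()

sio.connect('http://localhost:5000')  # assumed dev server address
sio.emit('chat_message', {'message': 'Explain quantum computing'})
sio.wait()  # block until the server disconnects us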

Advantages:

  • Full-duplex communication
  • Lower latency than SSE
  • Can send data both ways
  • Better for real-time applications
  • Socket.IO adds automatic reconnection on top of raw WebSockets

Disadvantages:

  • More complex to implement
  • Requires WebSocket library
  • Stateful connections (harder to scale)
  • More resource-intensive

When to use:

  • Complex interactive applications
  • Need bidirectional communication
  • Real-time collaboration features
  • Multiple message types

Strategy 3: HTTP/2 Streaming

HTTP/2 multiplexes many streams over a single connection with lower per-request overhead; with an async framework such as Quart, the same SSE pattern runs over HTTP/2.

How HTTP/2 streaming works:

import asyncio
import json

from quart import Quart, Response, request

app = Quart(__name__)

@app.route('/chat', methods=['POST'])
async def chat_stream():
    data = await request.get_json()
    user_message = data['message']

    async def generate():
        # Stream tokens from the LLM as SSE events
        for token in llm.stream(user_message):
            yield f"data: {json.dumps({'token': token})}\n\n"
            await asyncio.sleep(0)  # yield control to the event loop

    return Response(
        generate(),
        mimetype='text/event-stream'
    )
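
Quart itself doesn't negotiate HTTP/2; that's the job of the ASGI server in front of it. A minimal sketch using Hypercorn, which speaks HTTP/2 when TLS is configured (the certificate paths are placeholders, since browsers only use HTTP/2 over TLS):

import asyncio

from hypercorn.asyncio import serve
from hypercorn.config import Config

config = Config()
config.bind = ["localhost:8443"]
config.certfile = "cert.pem"  # placeholder path
config.keyfile = "key.pem"    # placeholder path

asyncio.run(serve(app, config))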

Advantages:

  • Better performance than HTTP/1.1
  • Multiplexing support
  • Header compression
  • Server push capabilities

Disadvantages:

  • Requires HTTP/2 support
  • More complex to debug
  • Not all proxies support it

When to use:

  • High-performance applications
  • Multiple concurrent streams
  • Modern infrastructure

Strategy 4: gRPC Streaming

For high-performance systems and microservices.

gRPC protocol definition:

syntax = "proto3";

service ChatService {
  rpc StreamChat(ChatRequest) returns (stream ChatResponse);
}

message ChatRequest {
  string message = 1;
}

message ChatResponse {
  string token = 1;
  bool done = 2;
}

Python server:

from grpc import aio

# chat_pb2 / chat_pb2_grpc are the modules protoc generates from the .proto above,
# e.g. python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. chat.proto
import chat_pb2
import chat_pb2_grpc

class ChatServicer(chat_pb2_grpc.ChatServiceServicer):
    async def StreamChat(self, request, context):
        # One ChatResponse per generated token, then a final "done" message
        for token in llm.stream(request.message):
            yield chat_pb2.ChatResponse(token=token)
        yield chat_pb2.ChatResponse(done=True)

async def serve():
    server = aio.server()
    chat_pb2_grpc.add_ChatServiceServicer_to_server(ChatServicer(), server)
    server.add_insecure_port('[::]:50051')
    await server.start()
    await server.wait_for_termination()

Client:

import grpc
import chat_pb2
import chat_pb2_grpc

# insecure_channel matches the insecure port above; use a secure channel in production
async with grpc.aio.insecure_channel('localhost:50051') as channel:
    stub = chat_pb2_grpc.ChatServiceStub(channel)

    async for response in stub.StreamChat(chat_pb2.ChatRequest(message=message)):
        if response.done:
            break
        print(response.token, end='', flush=True)

Advantages:

  • Highest performance
  • Language-agnostic
  • Strong typing
  • Excellent for microservices

Disadvantages:

  • Steeper learning curve
  • Requires gRPC infrastructure
  • Not ideal for web browsers
  • More complex debugging

When to use:

  • High-performance systems
  • Microservices architecture
  • Internal services
  • When performance is critical

Implementing Streaming: Best Practices

1. Buffering strategy

Don’t send every single token—batch them for efficiency:

def stream_with_buffering(llm_stream, buffer_size=5):
    buffer = []
    
    for token in llm_stream:
        buffer.append(token)
        
        if len(buffer) >= buffer_size:
            yield ''.join(buffer)
            buffer = []
    
    # Flush remaining tokens
    if buffer:
        yield ''.join(buffer)

Benefits:

  • Reduces overhead (fewer network packets)
  • Smoother text appearance
  • Better performance

Trade-off:

  • Slightly higher latency (100-200ms per batch)
  • Still much better than waiting for complete response
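
One way to cap that extra latency is to flush on a time budget as well as a token count. A sketch of that variant (the 50 ms budget is an arbitrary choice):

import time

def stream_with_time_flush(llm_stream, buffer_size=5, max_wait=0.05):
    buffer = []
    last_flush = time.monotonic()

    for token in llm_stream:
        buffer.append(token)

        # Flush when the buffer fills up or the time budget is spent
        # (the check runs as each token arrives)
        if len(buffer) >= buffer_size or time.monotonic() - last_flush >= max_wait:
            yield ''.join(buffer)
            buffer = []
            last_flush = time.monotonic()

    # Flush remaining tokens
    if buffer:
        yield ''.join(buffer)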

2. Error handling

Errors can occur mid-stream. Handle gracefully:

async def stream_with_error_handling(user_message):
    try:
        for token in llm.stream(user_message):
            yield token
    except LLMError as e:
        # LLMError stands in for whatever exception your LLM client raises
        yield f"\n\n[Error: {str(e)}]"
    except Exception as e:
        yield "\n\n[Unexpected error occurred]"
        logger.error(f"Streaming error: {e}")

3. Timeout handling

Long-running requests need timeouts:

import asyncio

async def stream_with_timeout(user_message, timeout=60):
    try:
        # asyncio.timeout requires Python 3.11+; use asyncio.wait_for on older versions
        async with asyncio.timeout(timeout):
            # assumes an async variant of the streaming helper
            async for token in llm.stream(user_message):
                yield token
    except asyncio.TimeoutError:
        yield "\n\n[Response generation timed out]"

4. Rate limiting

Prevent abuse of streaming endpoints:

from flask import Flask
from ratelimit import limits, sleep_and_retry

app = Flask(__name__)

# app.route must be the outermost decorator so Flask registers the rate-limited function
@app.route('/chat', methods=['POST'])
@sleep_and_retry
@limits(calls=100, period=60)
def chat_stream():
    # Stream response
    pass
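
Note that the ratelimit decorators apply a single process-wide counter. To throttle individual clients, a per-client limiter such as Flask-Limiter is the more usual choice; a rough sketch, assuming a recent Flask-Limiter release:

from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
limiter = Limiter(get_remote_address, app=app)  # key requests by client IP

@app.route('/chat', methods=['POST'])
@limiter.limit("100 per minute")
def chat_stream():
    # Stream response
    pass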

5. Monitoring and logging

Track streaming performance:

import time

async def stream_with_monitoring(user_message):
    start_time = time.time()
    token_count = 0
    
    try:
        for token in llm.stream(user_message):
            token_count += 1
            yield token
    finally:
        elapsed = time.time() - start_time
        tokens_per_second = token_count / elapsed
        
        metrics.record('streaming_latency', elapsed)
        metrics.record('tokens_per_second', tokens_per_second)
        logger.info(f"Stream complete: {token_count} tokens in {elapsed:.2f}s")
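
Time-to-first-token is the number users actually feel, so it is worth recording on its own. A small sketch using the same assumed metrics helper:

import time

async def stream_with_ttft(user_message):
    start = time.time()
    first_token_sent = False

    for token in llm.stream(user_message):
        if not first_token_sent:
            # how long the user stared at a blank screen
            metrics.record('time_to_first_token', time.time() - start)
            first_token_sent = True
        yield token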

Frontend Implementation

React example with streaming:

import { useState } from 'react';

export function ChatInterface() {
    const [message, setMessage] = useState('');
    const [response, setResponse] = useState('');
    const [loading, setLoading] = useState(false);

    const handleSubmit = async (e) => {
        e.preventDefault();
        setLoading(true);
        setResponse('');

        try {
            const eventSource = new EventSource(
                `/api/chat?message=${encodeURIComponent(message)}`
            );

            eventSource.onmessage = (event) => {
                const data = JSON.parse(event.data);
                setResponse(prev => prev + data.token);
            };

            // Fires when the server closes the stream (or on a real error);
            // close to stop auto-reconnect and end the loading state
            eventSource.onerror = () => {
                eventSource.close();
                setLoading(false);
            };
        } catch (error) {
            console.error('Streaming error:', error);
            setLoading(false);
        }
    };

    return (
        <div>
            <form onSubmit={handleSubmit}>
                <input
                    value={message}
                    onChange={(e) => setMessage(e.target.value)}
                    placeholder="Ask a question..."
                    disabled={loading}
                />
                <button type="submit" disabled={loading}>
                    {loading ? 'Streaming...' : 'Send'}
                </button>
            </form>
            <div className="response">
                {response}
                {loading && <span className="cursor">|</span>}
            </div>
        </div>
    );
}

Common Streaming Mistakes

Mistake 1: Sending every token individually. Creates excessive network overhead. Batch tokens (5-10 per message).

Mistake 2: No timeout handling. Stuck connections consume resources. Always set timeouts.

Mistake 3: Ignoring backpressure. If the client can’t keep up, the buffer explodes. Implement proper flow control, as sketched below.
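
A bounded queue between the producer (the LLM stream) and the consumer (the client connection) is a simple way to get flow control; a sketch, where llm.stream and send are assumed helpers:

import asyncio

async def produce(queue: asyncio.Queue, prompt: str):
    # put() blocks when the queue is full, pausing generation
    # until the consumer catches up (this is the backpressure)
    for token in llm.stream(prompt):
        await queue.put(token)
    await queue.put(None)  # sentinel: generation finished

async def consume(queue: asyncio.Queue, send):
    while (token := await queue.get()) is not None:
        await send(token)  # e.g. write to the client connection

async def stream_with_backpressure(prompt, send):
    queue = asyncio.Queue(maxsize=32)  # the bound is what creates backpressure
    await asyncio.gather(produce(queue, prompt), consume(queue, send))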

Mistake 4: Poor error handling. Errors mid-stream leave users confused. Send clear error messages.

Mistake 5: Not monitoring performance. You can’t improve what you don’t measure. Track latency and token throughput.

Streaming vs. Non-Streaming: When to Use Each

Use streaming when:

  • Responses are typically long (> 200 tokens)
  • User experience is critical
  • You want to reduce perceived latency
  • Building chat interfaces

Use non-streaming when:

  • Responses are short (< 50 tokens)
  • You need the complete response before processing
  • Implementing webhooks or background jobs
  • Integrating with systems expecting complete responses

Streaming in Calliope

AI Lab:

  • Experiment with streaming implementations
  • Test different buffering strategies
  • Benchmark SSE vs. WebSocket performance
  • Build custom streaming pipelines

Chat Studio:

  • Streaming enabled by default
  • Configurable token buffering
  • Automatic error handling
  • Real-time performance monitoring

Deep Agent:

  • Streaming tool execution results
  • Real-time action feedback
  • Progressive result updates

Langflow:

  • Visual streaming workflow builder
  • Configure buffering and timeouts
  • Monitor streaming performance

The Bottom Line

Streaming transforms the user experience from “waiting for a response” to “watching an intelligent system think in real-time.”

Implementation summary:

  1. Simple chat: Use SSE (Server-Sent Events)
  2. Interactive apps: Use WebSockets
  3. High performance: Use HTTP/2 or gRPC
  4. Always buffer: Send 5-10 tokens per message
  5. Always monitor: Track latency and throughput

Start with SSE. It’s simple, effective, and works for most use cases.

Build streaming AI interfaces with Calliope →
