Production-Grade Conversational AI with Memory and Intelligent Response Modes
A full-stack AI chatbot featuring conversation memory, automatic mode detection, real-time logging, and analytics. Built to provide a personalized portfolio Q&A experience with enterprise-grade reliability.
Building an AI assistant isn't just about connecting to an API. A production-ready chatbot must handle context, provide varied responses, persist state, and offer observability—all while maintaining sub-2-second response times.
Design Goal:
Create a conversational AI that adapts its response style based on question type, remembers context across interactions, and provides production-grade logging—all while feeling instant and natural.
The AI automatically detects question intent and adapts its response style for optimal user experience.
• Deep-Dive
Trigger: Technical questions, "how does", "explain"
Returns detailed explanations with code snippets and architecture details
• Quick Answer
Trigger: Simple facts, "what is", "when"
Concise 1-2 sentence responses for fast answers
• Story Mode
Trigger: Personal questions, "why", "journey"
Engaging narrative responses about experiences and motivations
• Default
Trigger: General queries
Balanced informative responses
Implementation:
Mode detection uses Gemini's understanding of question patterns, combined with keyword analysis and conversation context. Each mode has custom system prompts optimized for that response style.
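The keyword half of that detection can be sketched as follows; the helpers reuse the detect_mode and get_system_prompt names from the endpoint code further down, but the cue lists and prompt wording are illustrative assumptions rather than the production rules.

# Sketch: keyword-based first pass for mode detection.
# Cue lists and prompt wording are illustrative assumptions.
DEEP_DIVE_CUES = ("how does", "explain", "architecture")
QUICK_ANSWER_CUES = ("what is", "when")
STORY_CUES = ("why", "journey", "story")

def detect_mode(message: str) -> str:
    """Return 'deep_dive', 'quick_answer', 'story', or 'default'."""
    text = message.lower()
    if any(cue in text for cue in DEEP_DIVE_CUES):
        return "deep_dive"
    if any(cue in text for cue in STORY_CUES):
        return "story"
    if any(cue in text for cue in QUICK_ANSWER_CUES):
        return "quick_answer"
    return "default"

def get_system_prompt(mode: str) -> str:
    """Each mode maps to a system prompt tuned for that response style."""
    prompts = {
        "deep_dive": "Answer in depth, with code snippets and architecture details.",
        "quick_answer": "Answer in one or two concise sentences.",
        "story": "Answer as an engaging first-person narrative about the journey.",
        "default": "Give a balanced, informative answer.",
    }
    return prompts[mode]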
Conversations persist across page reloads and navigation, creating a continuous dialogue experience.
• Browser Storage
Session state saved in localStorage with conversation history
• Context Injection
Previous messages sent with each API call for continuity
• Smart Summarization
Long conversations auto-summarized to stay within token limits
Implementation:
The frontend maintains the conversation array in localStorage. The backend receives the full history and uses a sliding-window approach to keep recent context while respecting Gemini's token limits.
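A minimal sketch of what that build_context helper could look like, assuming the {role, content} message shape the frontend sends; the window size and formatting are assumptions.

# Sketch: sliding-window context builder (window size and formatting are
# illustrative; message dicts match the frontend's {role, content} shape).
WINDOW_PAIRS = 5

def build_context(history: list[dict]) -> str:
    """Format the most recent messages into a prompt prefix, oldest first."""
    recent = history[-WINDOW_PAIRS * 2:]  # last N user/assistant pairs
    if not recent:
        return ""
    lines = [f"{m['role']}: {m['content']}" for m in recent]
    if len(history) > len(recent):
        lines.insert(0, "(Earlier conversation summarized separately.)")
    return "\n".join(lines) + "\n\n"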
Every interaction is logged with rich metadata for monitoring, debugging, and improvement.
• Request Logging
Timestamp, user query, detected mode, session ID
• Response Metrics
Generation time, token count, response length
• Error Tracking
API failures, timeout events, rate limits
• Analytics Dashboard
Query patterns, popular questions, performance trends
Implementation:
Structured JSON logs written to file system with rotation. FastAPI middleware captures timing metrics. Future: Integration with Grafana/Prometheus for real-time monitoring.
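A plausible shape for that layer, assuming Python's standard logging module with a RotatingFileHandler and FastAPI's http middleware hook for timing; the log path, field names, and the reuse of the app instance from the endpoint below are assumptions.

# Sketch: structured JSON logs with rotation plus a timing middleware.
# Log path and field names are illustrative; `app` is the FastAPI instance
# used by the /api/chat endpoint shown below.
import json
import logging
import time
from logging.handlers import RotatingFileHandler

logger = logging.getLogger("chatbot")
logger.setLevel(logging.INFO)
logger.addHandler(RotatingFileHandler("logs/chat.jsonl", maxBytes=5_000_000, backupCount=5))

def log_interaction(**fields) -> None:
    """Write one JSON line per interaction: query, mode, timing, tokens."""
    fields.setdefault("timestamp", time.time())
    logger.info(json.dumps(fields))

@app.middleware("http")
async def add_timing_header(request, call_next):
    """Capture per-request latency and expose it as a response header."""
    start = time.perf_counter()
    response = await call_next(request)
    response.headers["X-Response-Time"] = f"{time.perf_counter() - start:.3f}s"
    return response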
Smart prompt engineering determines optimal response style
# FastAPI endpoint with mode detection
@app.post("/api/chat")
async def chat(request: ChatRequest):
    user_message = request.message
    history = request.history

    # Detect response mode based on question patterns
    mode = detect_mode(user_message)

    # Build context-aware prompt
    system_prompt = get_system_prompt(mode)
    context = build_context(history)

    # Call Gemini API
    start_time = time.time()
    response = await gemini_client.generate_content(
        model="gemini-1.5-flash",
        contents=[
            {"role": "system", "parts": [system_prompt]},
            {"role": "user", "parts": [context + user_message]}
        ]
    )

    # Log interaction
    log_interaction(
        user_message=user_message,
        mode=mode,
        response_time=time.time() - start_time,
        tokens=response.usage_metadata.total_tokens
    )

    return {"response": response.text, "mode": mode}

Conversation state management with localStorage
// ChatWidget component with session persistence
const [messages, setMessages] = useState([]);

// Load conversation on mount
useEffect(() => {
  const savedMessages = localStorage.getItem('chat_history');
  if (savedMessages) {
    setMessages(JSON.parse(savedMessages));
  }
}, []);

// Save after each message
useEffect(() => {
  localStorage.setItem('chat_history', JSON.stringify(messages));
}, [messages]);

// Send message with full context
const sendMessage = async (userMessage) => {
  const newMessage = { role: 'user', content: userMessage };
  setMessages(prev => [...prev, newMessage]);

  // Include conversation history for context
  const response = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      message: userMessage,
      history: messages // Full context sent to backend
    })
  });

  const { response: aiResponse, mode } = await response.json();
  setMessages(prev => [...prev, {
    role: 'assistant',
    content: aiResponse,
    mode
  }]);
};

Challenge:
Gemini API calls can take 2-5 seconds for complex prompts. Users expect instant responses, and network latency and cold starts add further delay.
Solution:
Implemented an optimistic UI (a typing indicator appears immediately), switched to Gemini 1.5 Flash (the faster variant), reduced prompt length through smart summarization, added response streaming for progressive display, and cached common questions (sketched below).
Result:
Average response time dropped from 3.8s to 1.2s. User-perceived latency feels instant thanks to the optimistic UI and streaming.
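The caching piece is the simplest of those optimizations; a rough in-memory version might look like the sketch below, where the normalization rule and one-hour TTL are assumptions.

# Sketch: in-memory cache for frequently asked questions.
# Normalization rule and TTL are illustrative choices.
import time

CACHE_TTL_SECONDS = 3600
_answer_cache: dict[str, tuple[float, str]] = {}

def _normalize(question: str) -> str:
    return " ".join(question.lower().split())

def get_cached_answer(question: str) -> str | None:
    """Return a cached answer if it exists and hasn't expired."""
    entry = _answer_cache.get(_normalize(question))
    if entry and time.time() - entry[0] < CACHE_TTL_SECONDS:
        return entry[1]
    return None

def cache_answer(question: str, answer: str) -> None:
    _answer_cache[_normalize(question)] = (time.time(), answer)

The endpoint would consult get_cached_answer before calling Gemini and call cache_answer after a successful generation.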
Challenge:
Long conversations exceed Gemini's token limits, and sending the full history with each request is expensive and slow.
Solution:
Implemented a sliding-window approach: keep the last 5 message pairs in full detail, summarize older messages into a brief context block, and compress repetitive information (sketched below).
Result:
99% of conversations stay within token limits without losing critical context, and API costs dropped by 60%.
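One way the window-plus-summary split could be enforced is with a crude characters-per-token estimate; the 4-characters-per-token heuristic and the summarize_older helper below are hypothetical, not the production tokenizer.

# Sketch: keep the last 5 pairs verbatim, collapse older messages into a
# summary, and trim until a rough token budget is met.
# The heuristic and summarize_older helper are hypothetical.
MAX_CONTEXT_TOKENS = 4000
CHARS_PER_TOKEN = 4  # rough estimate, not Gemini's real tokenizer

def estimate_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN

def trim_history(history: list[dict]) -> list[dict]:
    recent = history[-10:]   # last 5 user/assistant pairs
    older = history[:-10]
    trimmed = list(recent)
    if older:
        summary = summarize_older(older)  # hypothetical summarization helper
        trimmed.insert(0, {"role": "user", "content": f"Earlier conversation summary: {summary}"})
    # Drop the oldest verbatim messages until the budget fits.
    while len(trimmed) > 1 and sum(estimate_tokens(m["content"]) for m in trimmed) > MAX_CONTEXT_TOKENS:
        trimmed.pop(1 if older else 0)
    return trimmed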
Challenge:
Real-world usage needs to be monitored, failures debugged, and features iterated based on actual user questions. Simple console.log output isn't sufficient.
Solution:
Built a structured logging system with JSON output, implemented request/response tracking with unique session IDs, added error categorization (API failures vs. user errors), and created log-analysis scripts for insights (sketched below).
Result:
Popular questions can be identified, issues debugged retroactively, and real response times measured in production. Log analysis has informed 3 major feature improvements.
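The log-analysis scripts can stay very small; a sketch that groups the JSON lines by question and mode follows, with the log path and field names matching the logging sketch earlier (both assumptions).

# Sketch: offline log analysis, most common questions and per-mode latency.
# Log path and field names match the logging sketch above (assumptions).
import json
from collections import Counter, defaultdict

questions = Counter()
latencies = defaultdict(list)

with open("logs/chat.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        questions[entry["user_message"].lower()] += 1
        latencies[entry["mode"]].append(entry["response_time"])

print("Top questions:", questions.most_common(10))
for mode, times in latencies.items():
    print(f"{mode}: avg {sum(times) / len(times):.2f}s over {len(times)} requests")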
The chat widget you see in the bottom-right corner is this exact system in action. Ask it anything about my projects!