HLD-Bench Report

Generated on 2/8/2026, 3:19:56 PM · 7 run(s) across 1 problem(s)

Total Runs: 7 · Problems: 1 · Models: 7 · Providers: 3

ChatGPT-like Conversational AI Web Application - High-Level Design

⏱ 355.8s 📅 2/8/2026, 2:32:04 PM 📁 design-chatgpt-claude-opus-4-6

Overview

This system is a large-scale conversational AI platform serving 20 million daily active users generating 500 million messages per day. The architecture follows a microservices pattern with clear separation between the real-time streaming layer, conversation management, LLM orchestration, and supporting services. The core design centers on a streaming gateway that delivers token-by-token responses with sub-500ms time-to-first-token, backed by an LLM orchestration layer that abstracts multiple model backends (OpenAI, Anthropic, self-hosted) with automatic failover. Conversations are persisted in a sharded PostgreSQL cluster for immediate consistency, with Redis caching for hot conversation context, and S3 for file/multimodal uploads. The system is designed for multi-region deployment with regional streaming gateways, global CDN for static assets, and a robust rate-limiting and billing pipeline that tracks per-request token costs. Key architectural decisions include using Server-Sent Events (SSE) over WebSocket for streaming simplicity, CQRS for separating write-heavy message ingestion from read-heavy history/search workloads, and an event-driven architecture via Kafka for decoupling billing, analytics, and audit concerns from the critical path. The admin dashboard is powered by a dedicated analytics pipeline built on ClickHouse for real-time usage monitoring and cost attribution.

Requirements

Functional

  • User registration, authentication (email, OAuth), and session management with JWT tokens
  • Create, continue, and manage multi-turn conversation threads with full context retention
  • Real-time streaming of LLM responses token-by-token to the client
  • Persistent conversation history with full-text search and folder/tag organization
  • Model selection allowing users to choose between different LLM backends per conversation
  • File upload support for images, PDFs, and documents with multimodal input to LLMs
  • Share conversations via unique public links with optional expiration
  • Admin dashboard for monitoring usage metrics, costs, active users, and system health
  • Rate limiting and tiered usage quotas (free, plus, enterprise) with enforcement
  • Markdown rendering support in responses including code blocks, tables, LaTeX, and syntax highlighting

Non-Functional

  • Time to first token must be under 500ms for streaming responses
  • Support at least 100K concurrent WebSocket/SSE connections per region
  • Conversation history must be durable with immediate consistency (no eventual consistency for user-facing reads)
  • Handle LLM backend failures with automatic failover to alternative providers within 2 seconds
  • Per-request cost tracking for accurate billing with less than 0.1% error rate
  • 99.95% availability SLA for the overall platform
  • Horizontal scalability to handle 500M messages/day (~5,800 messages/sec average, 20K+ peak)
  • P99 API response latency under 200ms for non-LLM endpoints (history, search, auth)
  • Data encryption at rest and in transit, SOC2 compliance readiness
  • Multi-region deployment with data residency compliance for EU/US users
  • Graceful degradation under load — queue overflow should return informative wait messages rather than errors

Architecture Diagram

⚠ Mermaid rendering can be flaky with LLM-generated diagrams. If the diagram fails, copy the source and paste it into mermaid.live for reliable rendering.

Components

Component | Technology | Responsibility | Justification
--- | --- | --- | ---
API Gateway | Kong Gateway (on Kubernetes) | Entry point for all client requests. Handles TLS termination, request routing, authentication verification, rate limiting enforcement, and load balancing across backend services. | Kong provides built-in rate limiting, JWT validation, request transformation, and plugin ecosystem. It handles both HTTP and WebSocket upgrade requests, supports declarative config via Kubernetes CRDs, and scales horizontally. Preferred over AWS API Gateway for lower latency and more control over WebSocket handling.
Streaming Gateway | Custom Go service with nhooyr/websocket | Manages long-lived SSE/WebSocket connections for real-time token streaming from LLM backends to clients. Handles connection lifecycle, heartbeats, backpressure, and reconnection. | Go excels at handling massive concurrent connections with minimal memory overhead (goroutines use ~4KB vs threads). A custom service allows precise control over backpressure, connection draining, and graceful failover. Each instance can handle 50K+ concurrent connections, needing only 2-3 instances per region for the 100K target.
Auth Service | Node.js with Passport.js + Redis session store | User registration, login (email/password, Google OAuth, GitHub OAuth), JWT issuance and refresh, session management, and password reset flows. | Passport.js has mature OAuth provider integrations. Node.js is well-suited for I/O-bound auth workflows. Redis stores refresh tokens and session blacklists for O(1) lookups. JWTs are short-lived (15min) with Redis-backed refresh tokens for revocation capability.
Conversation Service | Python (FastAPI) | Core business logic for creating conversations, appending messages, managing conversation metadata (titles, folders, tags), and serving conversation history with pagination. | FastAPI provides async support, automatic OpenAPI docs, and excellent Python ecosystem integration for ML/AI tooling. Python aligns with the broader AI/ML ecosystem making it easy to integrate tokenizers, prompt engineering libraries, and model-specific utilities.
LLM Orchestrator | Python (FastAPI) with LiteLLM | Abstracts multiple LLM backends, handles model routing based on user selection, manages prompt assembly with conversation context, implements retry/failover logic, and streams tokens back to the Streaming Gateway (see the failover sketch after this table). | LiteLLM provides a unified interface to 100+ LLM providers (OpenAI, Anthropic, Cohere, self-hosted vLLM). FastAPI's async streaming support enables efficient token forwarding. The orchestrator implements circuit breaker patterns per backend and automatic failover when a provider returns errors or exceeds latency thresholds.
File Processing Service | Python with Celery workers | Handles file upload, validation, virus scanning, format conversion, image resizing, OCR for documents, and preparing multimodal inputs for LLM consumption. | File processing is CPU-intensive and variable in duration — Celery workers can scale independently. Python has excellent libraries for image processing (Pillow), PDF extraction (PyMuPDF), and OCR (Tesseract). Workers pull from a Redis-backed task queue for reliable processing.
Search Service | Elasticsearch 8.x | Full-text search across conversation history, semantic search for finding relevant past conversations, and powering the organization/filtering UI. | Elasticsearch provides fast full-text search with relevance scoring, supports nested document structures ideal for conversations with messages, and offers built-in vector search (kNN) for semantic search. The inverted index is highly optimized for the search-heavy read pattern of conversation history.
Rate Limiter & Quota Service | Redis Cluster with Lua scripts | Enforces per-user, per-tier rate limits (requests/min, tokens/day), tracks usage quotas, and signals the API gateway to throttle or reject requests. | Redis provides sub-millisecond rate limit checks using sliding window counters implemented via Lua scripts for atomicity. Redis Cluster enables horizontal scaling. Token bucket and sliding window algorithms are implemented for different rate limiting needs (burst vs sustained).
Billing & Cost Tracking Service | Go service consuming from Kafka | Records per-request token usage and costs, aggregates billing data per user/organization, generates invoices, and feeds cost data to the admin dashboard. | Go provides the performance needed for high-throughput event processing. Kafka consumption decouples billing from the critical request path — if billing is slow, it doesn't affect user experience. Go's strong typing and low GC pauses ensure accurate, reliable cost aggregation at 500M messages/day.
Admin Dashboard Backend | Node.js (Express) + ClickHouse queries | Serves aggregated analytics, real-time usage metrics, cost reports, user management, system health monitoring, and model performance dashboards. | Node.js is efficient for the I/O-bound dashboard API pattern. ClickHouse provides sub-second analytical queries over billions of rows for real-time dashboards. The admin backend is a lightweight API layer that translates dashboard queries into optimized ClickHouse SQL.
CDN & Frontend | CloudFront CDN + Next.js (React) | Serves the React-based SPA, handles static assets, and provides edge caching for shared conversation pages. | Next.js provides SSR for shared conversation pages (SEO, social previews), static generation for marketing pages, and CSR for the interactive chat UI. CloudFront provides global edge caching with ~20ms latency to users worldwide. React's ecosystem has excellent Markdown rendering libraries (react-markdown, react-syntax-highlighter).
Event Bus | Apache Kafka (MSK) | Decouples services by publishing domain events (message_created, conversation_shared, tokens_consumed) for downstream consumers like billing, analytics, search indexing, and notifications. | Kafka handles the 500M+ events/day throughput with ease, provides exactly-once semantics for billing accuracy, supports multiple consumer groups (billing, analytics, search indexer), and offers configurable retention for replay capability. MSK reduces operational burden.
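
The failover behavior described for the LLM Orchestrator above can be sketched as follows. This is a minimal illustration, assuming LiteLLM's async `acompletion` API and OpenAI-style streaming chunks; the backend list and model names are placeholders, and the 2-second first-token budget comes from the non-functional requirements.

```python
# Minimal failover sketch for the LLM Orchestrator, assuming LiteLLM's
# `acompletion` API and OpenAI-style streaming chunks. Backend names and
# timeouts are illustrative.
import asyncio
from litellm import acompletion

BACKENDS = ["openai/gpt-4o", "anthropic/claude-sonnet-4-5", "hosted_vllm/llama-3-70b"]
FIRST_TOKEN_TIMEOUT_S = 2.0  # failover budget from the non-functional requirements

async def stream_completion(messages: list[dict]):
    last_err = None
    for model in BACKENDS:
        try:
            stream = await acompletion(model=model, messages=messages, stream=True)
            it = stream.__aiter__()
            # If the first chunk misses the 2s budget, move to the next backend.
            first = await asyncio.wait_for(it.__anext__(), FIRST_TOKEN_TIMEOUT_S)
            yield first.choices[0].delta.content or ""
            async for chunk in it:
                yield chunk.choices[0].delta.content or ""
            return
        except Exception as err:  # provider error, timeout, or mid-stream failure
            last_err = err
            continue  # note: tokens already emitted are not rolled back
    raise RuntimeError(f"all LLM backends failed: {last_err}")
```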

Data Flow

⚠ Mermaid rendering can be flaky with LLM-generated diagrams. If the diagram fails, copy the source and paste it into mermaid.live for reliable rendering.

Data Storage

Store | Type | Justification
--- | --- | ---
PostgreSQL (Citus) | sql | Primary data store for users, conversations, messages, and billing records. Citus extension enables horizontal sharding by user_id, distributing the 500M messages/day write load across multiple nodes while maintaining strong consistency and ACID transactions within a user's data. Immediate consistency requirement rules out eventually-consistent NoSQL options. Sharding by user_id ensures all conversation data for a user is co-located for efficient joins and queries.
Redis Cluster | cache | Multi-purpose caching layer: (1) Conversation context cache — stores the last N messages of active conversations to avoid DB reads on every LLM request, reducing P99 latency. (2) Session/JWT blacklist store for auth. (3) Rate limiting counters with atomic Lua scripts (see the sketch after this table). (4) Celery task broker for file processing. Redis Cluster provides automatic partitioning across 6+ nodes with built-in failover.
Elasticsearch 8.x | search | Powers full-text search across conversation history with BM25 relevance scoring and supports vector search (kNN) for semantic similarity. Conversations are indexed asynchronously via Kafka consumers, so search indexing doesn't block the critical write path. Supports nested documents for conversation-message hierarchy and faceted filtering by date, model, folder.
Amazon S3 | blob | Stores uploaded files (images, PDFs, documents) and conversation export archives. S3 provides 11 nines of durability, lifecycle policies for cost optimization (move old files to Glacier), and presigned URLs for secure direct client uploads. Multipart upload support handles large files efficiently.
Apache Kafka (MSK) | queue | Event streaming backbone carrying domain events (message_created, tokens_consumed, file_uploaded, conversation_shared) to downstream consumers. Kafka's partitioned log model supports parallel consumption by billing, search indexer, and analytics pipelines independently. At 500M messages/day, Kafka's throughput (millions of msgs/sec per cluster) provides massive headroom. Exactly-once semantics ensure billing accuracy.
ClickHouse | sql | Columnar OLAP database for real-time analytics powering the admin dashboard. Handles aggregation queries over billions of events (messages, token usage, costs) with sub-second response times. MergeTree engine provides efficient time-series storage with automatic data compaction. Chosen over Redshift for lower latency on interactive queries and over Druid for simpler operations.
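
The sliding-window counters that the Rate Limiter & Quota Service keeps in Redis (item 3 of the Redis Cluster entry above) can be expressed as a short Lua script invoked from Python. A minimal sketch assuming redis-py; the key layout, window, and limits are illustrative.

```python
# Sliding-window rate limit check, executed atomically in Redis via Lua.
# Key names, window length, and limits are illustrative assumptions.
import time
import uuid
import redis

r = redis.Redis(host="localhost", port=6379)

SLIDING_WINDOW = r.register_script("""
local key    = KEYS[1]
local now_ms = tonumber(ARGV[1])
local window = tonumber(ARGV[2])   -- window length in ms
local limit  = tonumber(ARGV[3])
redis.call('ZREMRANGEBYSCORE', key, 0, now_ms - window)
if redis.call('ZCARD', key) >= limit then
  return 0
end
redis.call('ZADD', key, now_ms, ARGV[4])
redis.call('PEXPIRE', key, window)
return 1
""")

def allow_request(user_id: str, limit_per_min: int) -> bool:
    now_ms = int(time.time() * 1000)
    member = f"{now_ms}-{uuid.uuid4().hex}"   # unique entry for this request
    ok = SLIDING_WINDOW(keys=[f"rl:{user_id}:1m"],
                        args=[now_ms, 60_000, limit_per_min, member])
    return ok == 1
```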

API Design

Method | Endpoint | Description
--- | --- | ---
POST | /api/v1/auth/login | Authenticate user with email/password or OAuth token. Returns short-lived JWT access token (15min) and long-lived refresh token. Sets secure httpOnly cookie for refresh token.
POST | /api/v1/auth/refresh | Exchange a valid refresh token for a new JWT access token. Implements refresh token rotation — old token is invalidated in Redis upon use.
POST | /api/v1/conversations | Create a new conversation thread. Accepts optional model selection, system prompt, and folder assignment. Returns conversation_id and initial metadata.
GET | /api/v1/conversations | List user's conversations with pagination, filtering (by folder, date range, model), and sorting. Returns conversation metadata including title, last message timestamp, message count, and model used.
POST | /api/v1/conversations/{conversation_id}/messages | Send a new user message to a conversation. Triggers LLM completion. Returns message_id and a stream_url for the client to connect to for receiving the streamed response. Accepts optional file attachments by reference (file_ids from upload).
GET | /api/v1/conversations/{conversation_id}/messages | Retrieve paginated message history for a conversation. Supports cursor-based pagination (before/after message_id). Returns messages with role, content, timestamp, token count, and model info.
GET | /api/v1/stream/{message_id} | Server-Sent Events (SSE) endpoint for streaming LLM response tokens. Client connects after sending a message. Receives token-by-token events, metadata events (model, token count), and a final done event with complete message and usage stats. A server-side sketch follows this table.
POST | /api/v1/files/upload | Upload a file (image, PDF, document) for use in conversations. Returns a presigned S3 URL for direct upload and a file_id for referencing in messages. Validates file type and size limits per user tier.
POST | /api/v1/conversations/{conversation_id}/share | Generate a public sharing link for a conversation. Accepts optional expiration time and whether to include future messages. Returns a unique share URL that can be accessed without authentication.
GET | /api/v1/search | Full-text search across user's conversation history. Accepts query string, filters (date range, model, folder), and pagination. Returns matching conversations and message snippets with highlighted matches.
PATCH | /api/v1/conversations/{conversation_id} | Update conversation metadata including title, folder assignment, tags, pinned status, and archive status. Supports partial updates.
DELETE | /api/v1/conversations/{conversation_id} | Soft-delete a conversation and all its messages. Data is retained for 30 days before permanent deletion. Triggers cleanup of associated search index entries and cached context.
GET | /api/v1/user/usage | Retrieve current user's usage statistics including tokens consumed today/this month, message count, rate limit status, and quota remaining for their tier.
GET | /api/v1/admin/dashboard/metrics | Admin-only endpoint returning aggregated platform metrics: DAU, messages/hour, token costs by model, error rates, P99 latencies, active connections, and top users by usage. Powered by ClickHouse queries.
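
The /api/v1/stream/{message_id} endpoint above maps naturally onto FastAPI's StreamingResponse. A minimal server-side sketch; the token source below is a stand-in for the real orchestrator stream, and the event names and payload shape are illustrative.

```python
# SSE streaming sketch for GET /api/v1/stream/{message_id}. The token source is
# a stand-in for the LLM Orchestrator stream; event names are illustrative.
import asyncio
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def token_source(message_id: str):
    # Stand-in for the orchestrator; yields tokens as they arrive.
    for token in ["Hello", ", ", "world", "!"]:
        await asyncio.sleep(0.05)
        yield token

@app.get("/api/v1/stream/{message_id}")
async def stream_message(message_id: str):
    async def event_source():
        async for token in token_source(message_id):
            yield f"event: token\ndata: {json.dumps({'t': token})}\n\n"
            # A periodic comment line (": keep-alive\n\n") can be interleaved to
            # defeat proxy buffering, as noted in the trade-offs below.
        yield "event: done\ndata: {}\n\n"
    return StreamingResponse(event_source(), media_type="text/event-stream")
```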

Scalability Strategy

The system employs a multi-layered horizontal scaling strategy designed to handle 20M DAU and 500M messages/day with significant headroom:

**Compute Scaling (Kubernetes):** All core services run on Kubernetes (EKS) with Horizontal Pod Autoscaler (HPA) based on CPU, memory, and custom metrics (active connections for Streaming Gateway, queue depth for File Processing). The Streaming Gateway scales based on active WebSocket connections with a target of 40K connections per pod (Go's goroutine efficiency allows this). The LLM Orchestrator scales based on in-flight requests to LLM backends.

**Database Scaling (Citus Sharded PostgreSQL):** Conversations and messages are sharded by user_id using Citus, distributing data across 32+ worker nodes. This ensures all data for a single user is co-located (avoiding cross-shard queries) while distributing the 500M daily message writes evenly. Read replicas per shard handle read-heavy workloads (conversation history browsing). Connection pooling via PgBouncer (256 connections per pool) prevents connection exhaustion.

**Caching Strategy:** Redis Cluster with 12+ nodes provides the caching layer. Active conversation contexts (last 10 messages) are cached with 1-hour TTL, eliminating ~80% of database reads for the hot path (LLM context assembly). Cache-aside pattern with write-through for conversation metadata ensures consistency (see the sketch after this section).

**Event Processing Scaling:** Kafka topics are partitioned by user_id (128 partitions per topic), allowing consumer groups to scale horizontally. Billing consumers run 32 instances processing events in parallel. Search indexer runs 16 instances with bulk indexing to Elasticsearch.

**Multi-Region Deployment:** The system deploys in US-East, US-West, and EU-West regions. Each region has its own Streaming Gateway fleet, Kong Gateway, and Redis cache. PostgreSQL uses Citus with the primary write cluster in one region and fast read replicas in others. For users requiring data residency (EU), a fully independent EU cluster is maintained. Global traffic routing via Route53 latency-based routing directs users to the nearest region.

**LLM Backend Scaling:** The LLM Orchestrator implements a weighted round-robin across multiple API keys per provider, connection pooling to self-hosted vLLM instances (which auto-scale GPU nodes based on queue depth), and circuit breakers per backend. Self-hosted vLLM runs on p4d.24xlarge instances with auto-scaling groups targeting 70% GPU utilization.

**CDN and Static Scaling:** CloudFront serves all static assets and SSR pages from 400+ edge locations. Shared conversation pages are cached at the edge with 5-minute TTL and cache invalidation on update.

**Graceful Degradation:** Under extreme load, the system implements progressive degradation: (1) reduce max context window length, (2) disable search indexing temporarily, (3) queue non-streaming requests, (4) serve cached responses for identical recent queries, (5) display wait queue UI rather than errors.
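
The cache-aside/write-through pattern in the caching strategy above can be sketched as follows. The DB helpers are hypothetical placeholders for the Citus-backed queries; the 10-message window and 1-hour TTL mirror the figures stated above.

```python
# Cache-aside sketch for hot conversation context (last 10 messages, 1h TTL).
# `fetch_last_messages_from_db` / `insert_message_into_db` are hypothetical
# placeholders for the Citus-backed queries.
import json
import redis

r = redis.Redis()
CONTEXT_TTL_S = 3600
CONTEXT_WINDOW = 10

def get_context(conversation_id: str) -> list[dict]:
    key = f"ctx:{conversation_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)                       # cache hit: no DB read
    messages = fetch_last_messages_from_db(conversation_id, CONTEXT_WINDOW)
    r.setex(key, CONTEXT_TTL_S, json.dumps(messages))
    return messages

def append_message(conversation_id: str, message: dict) -> None:
    insert_message_into_db(conversation_id, message)    # durable write first
    # Write-through: keep the cached window aligned with the DB.
    window = (get_context(conversation_id) + [message])[-CONTEXT_WINDOW:]
    r.setex(f"ctx:{conversation_id}", CONTEXT_TTL_S, json.dumps(window))
```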

Trade-offs

SSE (Server-Sent Events) for streaming instead of pure WebSocket

  • SSE works over standard HTTP/2 — no special proxy configuration needed, works through all CDNs and load balancers
  • Automatic reconnection built into the EventSource API with last-event-id support
  • Simpler server implementation — unidirectional stream matches the LLM response pattern
  • Better compatibility with HTTP-based auth (cookies, headers) without custom handshake logic
  • Easier to load balance since connections are standard HTTP
  • Unidirectional — cannot send client messages over the same connection (requires separate POST requests)
  • Limited to ~6 concurrent connections per domain in HTTP/1.1 (mitigated by HTTP/2 multiplexing)
  • No binary frame support — all data must be text-encoded (acceptable for token streaming)
  • Some older corporate proxies may buffer SSE events (mitigated by including periodic comments as keep-alive)

PostgreSQL with Citus sharding instead of NoSQL (DynamoDB/Cassandra)

  • Strong consistency guarantees satisfy the immediate consistency requirement for conversation history
  • SQL expressiveness enables complex queries for search, filtering, and admin analytics without separate ETL
  • ACID transactions ensure message ordering and conversation integrity
  • Citus provides horizontal scaling while preserving PostgreSQL's full feature set (JSONB, CTEs, window functions)
  • Existing team expertise with PostgreSQL reduces operational risk
  • Cross-shard queries (e.g., global admin analytics) are more expensive than single-shard queries
  • Schema migrations on sharded tables require careful coordination
  • Higher operational complexity compared to fully managed DynamoDB
  • Connection management requires PgBouncer pooling layer adding another component
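
For concreteness, the user_id sharding weighed above is declared in Citus with a single function call on the coordinator. A minimal sketch using psycopg; the schema and connection string are illustrative.

```python
# Citus sharding sketch: create the table on the coordinator, then distribute
# it by user_id. Schema and connection string are illustrative.
import psycopg

CREATE_MESSAGES = """
CREATE TABLE IF NOT EXISTS messages (
    user_id         bigint      NOT NULL,
    conversation_id uuid        NOT NULL,
    message_id      uuid        NOT NULL,
    role            text        NOT NULL,
    content         text        NOT NULL,
    created_at      timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (user_id, conversation_id, message_id)
)
"""

with psycopg.connect("postgresql://coordinator:5432/chatdb") as conn:
    conn.execute(CREATE_MESSAGES)
    # All rows for a user land on one worker, so per-user queries and joins
    # stay single-shard; cross-user analytics become scatter-gather queries.
    conn.execute("SELECT create_distributed_table('messages', 'user_id')")
```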

Kafka as event bus instead of simpler alternatives (RabbitMQ, SQS)

  • Supports multiple independent consumer groups — billing, search, analytics all consume the same events
  • Message replay capability enables reprocessing if a consumer fails or needs reindexing
  • Exactly-once semantics (with idempotent producers) critical for billing accuracy
  • Partitioned log model handles 500M+ events/day with low latency
  • MSK (managed) reduces operational overhead
  • Higher complexity than SQS/RabbitMQ — requires understanding of partitions, consumer groups, offsets
  • Minimum 3-broker cluster even for dev/staging environments increases infrastructure cost
  • Message ordering only guaranteed within a partition (mitigated by partitioning by user_id)
  • Consumer lag monitoring and rebalancing require operational attention

LiteLLM as the unified LLM abstraction layer

  • Single interface to 100+ LLM providers reduces integration code significantly
  • Built-in retry logic, streaming support, and token counting per provider
  • Easy to add new model backends without changing orchestration code
  • Active open-source community with frequent updates for new models
  • Additional abstraction layer adds latency (~5-10ms) to every LLM call
  • May not expose provider-specific optimizations or features immediately
  • Dependency on third-party library for critical path — must pin versions carefully
  • Custom failover logic still needed on top of LiteLLM's built-in retries

Separate Streaming Gateway service (Go) from Conversation Service (Python)

  • Go handles 50K+ concurrent connections per instance with minimal memory — dramatically reduces infrastructure cost for the connection-heavy streaming workload
  • Independent scaling — streaming connections scale differently from CRUD API operations
  • Fault isolation — a crash in conversation logic doesn't drop active streams
  • Go's deterministic low-latency GC prevents stream stuttering
  • Two services to maintain for what is conceptually one user action (send message + receive stream)
  • Coordination complexity — the Conversation Service must signal the Streaming Gateway when to start streaming
  • Different programming languages increase team skill requirements
  • Additional network hop between Conversation Service and Streaming Gateway adds ~2ms latency

ClickHouse for analytics instead of extending PostgreSQL or using a data warehouse

  • Columnar storage provides 10-100x faster analytical queries compared to row-based PostgreSQL
  • Sub-second query performance on billions of rows enables real-time admin dashboards
  • Excellent compression (10-20x) reduces storage costs for high-volume event data
  • Native support for time-series aggregations and materialized views for pre-computed metrics
  • Another database system to operate and monitor
  • Not suitable for transactional workloads — purely append-optimized
  • Limited UPDATE/DELETE capabilities make data corrections cumbersome
  • Requires separate data pipeline from Kafka to ingest events

Soft-delete conversations with 30-day retention instead of immediate hard delete

  • Users can recover accidentally deleted conversations within the retention window
  • Simplifies billing audits — all data is available for the billing period
  • Background cleanup job can batch-delete efficiently during off-peak hours
  • Compliance teams can review data before permanent deletion if needed
  • Increases storage costs — deleted data occupies space for 30 days
  • All queries must filter on is_deleted flag adding minor overhead
  • GDPR right-to-erasure may require faster hard deletion for EU users — requires a separate expedited deletion pipeline
  • Search index must also handle soft-delete filtering
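
A short sketch of the soft-delete and delayed-purge flow weighed above, matching the 30-day retention window; table and column names are illustrative, not the actual schema.

```python
# Soft-delete and delayed purge sketch (30-day retention). Table and column
# names are illustrative.
import psycopg

def soft_delete_conversation(conn: psycopg.Connection, user_id: int, conversation_id: str) -> None:
    conn.execute(
        "UPDATE conversations SET is_deleted = true, deleted_at = now() "
        "WHERE user_id = %s AND conversation_id = %s",
        (user_id, conversation_id),
    )
    # Search-index and cache cleanup would be triggered asynchronously via Kafka.

def purge_expired(conn: psycopg.Connection) -> None:
    # Off-peak background job: hard-delete rows past the retention window.
    conn.execute(
        "DELETE FROM conversations "
        "WHERE is_deleted AND deleted_at < now() - interval '30 days'"
    )
```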
⏱ 342.8s 📅 2/8/2026, 2:24:03 PM 📁 design-chatgpt-claude-sonnet-4-5

Overview

This system design outlines a globally distributed, highly scalable conversational AI platform capable of serving 20 million daily active users with 500 million messages per day. The architecture employs a microservices approach with dedicated services for authentication, conversation management, real-time streaming, and LLM orchestration. The design emphasizes low-latency streaming responses (sub-500ms time to first token), horizontal scalability to support 100K+ concurrent WebSocket connections per region, and robust fault tolerance with automatic LLM backend failover. The system leverages a multi-region deployment with geographic load balancing, employs PostgreSQL with read replicas for durable conversation storage, Redis for session management and caching, and Kafka for asynchronous event processing. A dedicated LLM Gateway service abstracts multiple LLM providers (OpenAI, Anthropic, custom models), implements intelligent routing, rate limiting, and cost tracking. Real-time bidirectional communication is handled via WebSocket connections through a scalable connection manager, while a CDN delivers static assets and cached content globally.

Requirements

Functional

  • User registration, login, and session management with JWT tokens
  • Create, read, update, and delete conversation threads
  • Multi-turn conversations with full context retention across messages
  • Real-time streaming of LLM responses token-by-token via WebSocket
  • Support for multiple LLM backends with user-selectable models
  • Conversation history search and organization (folders, tags, timestamps)
  • File upload and multimodal input processing (images, PDFs, documents)
  • Generate shareable public links for conversations with privacy controls
  • Markdown rendering support including code syntax highlighting
  • Admin dashboard for usage analytics, cost monitoring, and user management
  • Rate limiting based on user tier (free, pro, enterprise)
  • Usage quota enforcement and billing integration

Non-Functional

  • Support 20 million daily active users and 500 million messages/day
  • Time to first token (TTFT) must be under 500ms
  • Handle 100K+ concurrent WebSocket connections per region
  • 99.9% availability with automatic failover for LLM backend failures
  • Conversation data must be immediately consistent and durable
  • Horizontal scalability for all stateless services
  • Geographic distribution across multiple regions for low latency
  • Per-request cost tracking with 99.99% accuracy for billing
  • Support message throughput of 5,800 messages/second sustained
  • Data retention for at least 90 days with archival for older conversations
  • Security compliance (encryption at rest and in transit, GDPR, SOC2)

Architecture Diagram

⚠ Mermaid rendering can be flaky with LLM-generated diagrams. If the diagram fails, copy the source and paste it into mermaid.live for reliable rendering.

Components

Component | Technology | Responsibility | Justification
--- | --- | --- | ---
API Gateway | Kong Gateway with OpenResty (Nginx + Lua) | Single entry point for all client requests; handles routing, authentication validation, rate limiting, request/response transformation, and SSL termination | Kong provides high-performance reverse proxy with built-in plugins for authentication, rate limiting, logging, and circuit breaking. Handles 10K+ RPS per instance with horizontal scalability and proven in production at scale.
Authentication Service | Node.js with Passport.js + Auth0 for identity management | User registration, login, JWT token issuance and validation, OAuth integration, session management, and user profile management | Auth0 provides enterprise-grade authentication with built-in security features, MFA, social login, and scales automatically. Node.js offers fast token validation and can handle 5K+ auth requests per second per instance.
WebSocket Connection Manager | Go with Gorilla WebSocket library, deployed on Kubernetes with HPA | Maintains persistent WebSocket connections, handles connection lifecycle, message routing, presence management, and broadcasts streaming responses to clients | Go excels at concurrent connection handling with lightweight goroutines. Each instance can handle 10K+ concurrent WebSockets with minimal memory overhead. Stateless design allows horizontal scaling based on connection count.
Conversation Service | Java Spring Boot with Spring Data JPA | CRUD operations for conversation threads, message persistence, context window management, conversation search, and thread organization | Spring Boot provides mature transaction management, excellent PostgreSQL integration, and strong consistency guarantees. JPA simplifies complex queries for conversation history and search. Battle-tested at enterprise scale.
LLM Gateway Service | Python with FastAPI and LangChain for LLM orchestration | Abstracts multiple LLM providers, routes requests to appropriate backends, handles streaming, implements retry logic with exponential backoff, tracks costs per request, and provides automatic failover | Python ecosystem has best LLM library support (OpenAI SDK, Anthropic SDK, transformers). FastAPI provides async streaming support essential for token-by-token delivery. LangChain simplifies multi-provider integration and context management.
File Processing Service | Python with Celery for async processing, Tesseract for OCR, PyPDF2 for PDF parsing | Handles file uploads, validates file types and sizes, extracts text from documents (OCR, PDF parsing), processes images for vision models, and stores files in object storage | Python has rich libraries for document processing and image manipulation. Celery provides distributed task queue for async processing of large files without blocking API responses. Can scale workers independently based on queue depth.
Search Service | Elasticsearch with custom analyzers for semantic search | Indexes conversation content, provides full-text search across message history, supports filtering by date, model, and tags | Elasticsearch provides sub-second full-text search across billions of documents. Supports complex queries, filtering, and aggregations. Can be extended with vector embeddings for semantic search. Scales horizontally with sharding.
Rate Limiter Service | Redis with Lua scripts for atomic rate limiting operations | Enforces per-user and per-tier rate limits, quota management, token bucket algorithm implementation, and communicates with billing service | Redis provides in-memory performance (<1ms latency) essential for rate limit checks on every request. Lua scripts ensure atomic operations for token bucket algorithms. Redis Cluster provides high availability and scales to millions of users.
Analytics & Monitoring Service | ClickHouse for OLAP analytics with Grafana for visualization | Collects usage metrics, tracks costs per request and per user, monitors system health, generates reports for admin dashboard | ClickHouse excels at high-volume time-series analytics with billions of rows, providing sub-second query performance for dashboards. Columnar storage reduces costs. Grafana provides rich visualization for admin dashboards.
Notification Service | Node.js with SendGrid for email, Firebase Cloud Messaging for push | Sends email notifications, push notifications, and in-app alerts for quota limits, system updates, and shared conversations | SendGrid provides reliable email delivery with analytics. FCM supports cross-platform push notifications. Node.js event-driven architecture handles high-volume async notifications efficiently.
Share Service | Go with Redis for link metadata caching | Generates unique shareable links for conversations, manages privacy settings and expiration, renders public conversation views | Go provides fast link generation and validation. Redis caches share metadata to avoid database lookups on every public link access. Stateless design allows easy scaling for viral shared conversations.

Data Flow

⚠ Mermaid rendering can be flaky with LLM-generated diagrams. If the diagram fails, copy the source and paste it into mermaid.live for reliable rendering.

Data Storage

Store | Type | Justification
--- | --- | ---
PostgreSQL 15 with Citus extension for horizontal sharding | sql | Primary datastore for users, conversations, messages, and relationships. Citus enables horizontal sharding by user_id to handle billions of messages. JSONB support for flexible message metadata. Strong ACID guarantees ensure conversation consistency. Read replicas handle query load.
Redis Cluster | cache | Multi-purpose: JWT session storage, rate limiting counters, conversation context caching, WebSocket connection metadata, and hot conversation cache. Sub-millisecond latency critical for rate limiting and session validation. Redis Cluster provides automatic sharding and replication.
Elasticsearch 8.x | search | Full-text search across conversation history. Handles complex queries with filters, highlighting, and relevance scoring. Inverted indexes provide fast search across billions of messages. Can be extended with kNN for semantic search using embeddings.
Amazon S3 with CloudFront CDN | blob | Stores uploaded files (images, documents), exported conversations, and shared conversation snapshots. S3 provides 99.999999999% durability, lifecycle policies for cost optimization, and versioning. CloudFront accelerates file delivery globally.
Apache Kafka | queue | Event streaming backbone for async processing: analytics events, usage tracking, cost calculation, audit logs, and notification triggers. Kafka provides durable message storage, replay capability, and scales to millions of events per second. Decouples producers from consumers.
ClickHouse | nosql | Time-series analytics database for usage metrics, cost tracking, and admin dashboards. Optimized for OLAP queries with aggregations across billions of rows. Columnar storage provides 10-100x compression. Real-time ingestion from Kafka.

API Design

Method | Endpoint | Description
--- | --- | ---
POST | /api/v1/auth/register | Register a new user account with email and password, returns JWT access and refresh tokens
POST | /api/v1/auth/login | Authenticate user credentials and issue JWT tokens with user tier information
POST | /api/v1/conversations | Create a new conversation thread, returns conversation_id and initial metadata
GET | /api/v1/conversations/{conversation_id} | Retrieve full conversation thread with all messages, supports pagination and filtering
POST | /api/v1/conversations/{conversation_id}/messages | Send a new message in a conversation, triggers LLM processing, returns message_id for tracking
WS | /ws/v1/stream | WebSocket endpoint for real-time bidirectional communication, streams LLM responses token-by-token, handles connection lifecycle
GET | /api/v1/conversations/search | Full-text search across user's conversation history with filters for date range, model, and tags
POST | /api/v1/files/upload | Upload files for multimodal input, supports images and documents up to 50MB, returns file_id and processing status
POST | /api/v1/conversations/{conversation_id}/share | Generate a public shareable link for a conversation with configurable expiration and privacy settings
GET | /api/v1/models | List available LLM models with capabilities, pricing, and context window information
GET | /api/v1/users/me/usage | Get current user's usage statistics, quota consumption, and rate limit status
GET | /api/v1/admin/analytics/usage | Admin endpoint for aggregated usage metrics, costs by model, and active user statistics
DELETE | /api/v1/conversations/{conversation_id} | Soft delete a conversation thread, marks as deleted but retains for recovery period

Scalability Strategy

**Horizontal Scaling Approach:**

1. **Stateless Services**: All application services (API Gateway, Conversation Service, LLM Gateway, Auth Service, WebSocket Manager) are stateless and containerized with Kubernetes. Auto-scaling policies based on CPU (70% threshold) and custom metrics (concurrent connections for WS Manager, queue depth for File Processing).
2. **WebSocket Connection Distribution**: Each WebSocket Manager instance handles 10K concurrent connections. With 100K target per region, deploy 10+ instances with sticky session routing at the load balancer level using consistent hashing on user_id (see the sketch after this section). Connection metadata stored in Redis allows any instance to route messages.
3. **Database Sharding**: PostgreSQL with Citus extension shards data by user_id across 16 initial shards, expandable to 64+. Each shard handles ~1.25M users. Read replicas (3 per shard) distribute query load. Message tables partitioned by created_at (monthly) for efficient archival.
4. **LLM Gateway Scaling**: Python FastAPI instances scaled based on request queue depth in Kafka. Each instance maintains connection pools to external LLM APIs (OpenAI, Anthropic) with circuit breakers. Geographic proximity routing to LLM endpoints reduces latency.
5. **Caching Strategy**: Redis Cluster with 12 nodes (4 shards × 3 replicas) caches: conversation contexts (30min TTL), user sessions (24hr), rate limit counters (1hr sliding window), hot conversations (top 10% by access). Cache hit rate target: 85%+.
6. **Multi-Region Deployment**: Deploy across 3 regions (US-East, EU-West, Asia-Pacific) with Route53 geo-routing. Each region handles 7M DAU. Cross-region PostgreSQL replication (async) for disaster recovery. Kafka MirrorMaker 2 replicates events for analytics aggregation.

**Vertical Scaling Considerations:**

  • PostgreSQL instances: Start with r6g.4xlarge (16 vCPU, 128GB RAM), scale to r6g.8xlarge for primary. Read replicas on r6g.2xlarge.
  • Redis Cluster: r6g.xlarge nodes (4 vCPU, 32GB RAM per node).
  • LLM Gateway: CPU-optimized c6i.2xlarge for fast Python execution.
  • ClickHouse: Storage-optimized i3en.2xlarge for cost-effective analytics.

**Capacity Planning for 500M messages/day**: ~5,800 msgs/sec sustained, 12K msgs/sec peak. Each LLM Gateway instance handles 50 concurrent requests × 20 regions × 10 instances = 10K concurrent LLM requests. Over-provision by 50% for traffic spikes and failover capacity.
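
The consistent-hashing pinning of a user to a WebSocket Manager instance (item 2 above) can be sketched with rendezvous hashing in a few lines; instance names are illustrative.

```python
# Rendezvous (highest-random-weight) hashing sketch for pinning a user_id to a
# WebSocket Manager instance, per the connection-distribution strategy above.
# Instance names are illustrative.
import hashlib

INSTANCES = [f"ws-manager-{i}" for i in range(10)]

def pick_instance(user_id: str, instances: list[str] = INSTANCES) -> str:
    def weight(instance: str) -> int:
        digest = hashlib.sha256(f"{instance}:{user_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    # Highest weight wins; adding or removing an instance only remaps ~1/N users.
    return max(instances, key=weight)

# Example: pick_instance("user-12345") always returns the same instance while
# the instance list is unchanged.
```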

Trade-offs

WebSocket for real-time streaming vs Server-Sent Events (SSE)

  • Bidirectional communication allows client to cancel requests mid-stream
  • Lower latency for streaming tokens (no HTTP overhead per message)
  • Better for interactive features like typing indicators and presence
  • Single persistent connection reduces connection overhead
  • More complex infrastructure with stateful connection management
  • Requires sticky sessions and connection state tracking in Redis
  • Harder to debug and monitor compared to stateless HTTP
  • Load balancer configuration more complex (TCP vs HTTP)
  • Higher memory consumption per connection on server side

PostgreSQL with Citus sharding vs fully distributed database (Cassandra/DynamoDB)

  • Strong ACID guarantees ensure conversation consistency across multi-turn interactions
  • Complex relational queries for conversation threads, user relationships, and search
  • Mature ecosystem with excellent tooling, monitoring, and operational knowledge
  • JSONB support provides schema flexibility for message metadata without sacrificing SQL
  • Citus provides transparent sharding while maintaining PostgreSQL compatibility
  • Harder to scale writes compared to eventually consistent NoSQL databases
  • Requires careful shard key selection (user_id) to avoid hot partitions
  • Cross-shard queries (e.g., admin analytics) are more expensive
  • Higher operational complexity for managing sharding compared to managed NoSQL
  • Potential single points of failure if primary shard goes down (mitigated with replicas)

Python FastAPI for LLM Gateway vs Go/Java

  • Best ecosystem for LLM libraries (OpenAI, Anthropic, LangChain, transformers)
  • Native async/await support in FastAPI ideal for streaming responses
  • Rapid development and easy integration with ML/AI tooling
  • LangChain provides abstraction for multi-provider LLM orchestration
  • Python's expressiveness reduces code complexity for prompt engineering
  • Lower raw throughput compared to Go or Java (GIL limitations)
  • Higher memory consumption per request (~50MB vs ~5MB for Go)
  • Slower cold start times if using serverless deployment
  • Requires more instances to achieve same throughput as compiled languages
  • Dependency management more fragile (pip vs Go modules)

Kafka for event streaming vs direct database writes with triggers

  • Decouples message processing from analytics, allowing independent scaling
  • Event replay capability for backfilling analytics or debugging
  • Enables multiple consumers (analytics, search indexing, notifications) without coupling
  • Buffer for traffic spikes - prevents overwhelming downstream systems
  • Provides audit log for compliance and debugging
  • Additional infrastructure complexity and operational overhead
  • Eventual consistency - analytics may lag real-time by seconds
  • Higher storage costs for event retention (30 days = ~15TB for 500M msgs/day)
  • Requires monitoring for consumer lag and rebalancing
  • Increases overall system latency for end-to-end event processing

Multi-region active-active deployment vs active-passive

  • Lower latency for global users by serving from nearest region
  • Higher availability - no failover delay if region goes down
  • Better resource utilization - all regions handle traffic simultaneously
  • Enables geographic compliance (EU data stays in EU region)
  • Complex data consistency challenges for global user state (sessions, rate limits)
  • Higher infrastructure costs running full stack in multiple regions
  • Cross-region latency for database replication can cause eventual consistency issues
  • More complex deployment and testing (must validate across all regions)
  • Potential for split-brain scenarios if network partitions occur

Redis for rate limiting vs database-based rate limiting

  • Sub-millisecond latency critical for checking limits on every request
  • Atomic operations with Lua scripts prevent race conditions in token bucket algorithm
  • In-memory performance scales to millions of rate limit checks per second
  • TTL support automatically cleans up expired rate limit windows
  • Doesn't add load to primary transactional database
  • Data loss risk if Redis instance fails (mitigated with Redis Cluster replication)
  • Higher cost per GB compared to disk-based storage
  • Requires separate infrastructure component to maintain and monitor
  • Complex to guarantee exactly-once semantics during failover scenarios
  • Memory constraints may require eviction policies that could lose rate limit state
⏱ 39.4s 📅 2/8/2026, 1:52:44 PM 📁 design-chatgpt-gemini-3-flash-preview

Overview

This system design describes a globally distributed, high-concurrency platform similar to ChatGPT, capable of handling 20M DAU and 500M messages per day. The architecture focuses on low-latency streaming (TTFT < 500ms), immediate consistency for conversation history, and high availability across multiple LLM backends through an intelligent inference orchestration layer. It utilizes an event-driven model for background tasks like cost tracking and search indexing, while maintaining persistent connections for real-time interaction.

Requirements

Functional

  • User authentication and session management
  • Multi-turn conversation with stateful context management
  • Real-time token streaming via Server-Sent Events (SSE)
  • Global conversation search and organization
  • Support for multiple LLM providers (OpenAI, Anthropic, internal models)
  • Multimodal support (Image/Document processing)
  • Public conversation sharing via UUID-masked URLs
  • Admin monitoring for cost and model performance

Non-Functional

  • Scale: 20 million daily active users
  • Latency: Time To First Token (TTFT) under 500ms
  • Concurrency: 100k+ active connections per region
  • Durability: Immediate consistency for conversation storage
  • Reliability: Automatic failover between LLM backends
  • Scalability: Horizontal scaling for all stateless services

Architecture Diagram

⚠ Mermaid rendering can be flaky with LLM-generated diagrams. If the diagram fails, copy the source and paste it into mermaid.live for reliable rendering.

Components

Component | Technology | Responsibility | Justification
--- | --- | --- | ---
Global Load Balancer | Google Cloud Load Balancing or AWS Global Accelerator | Routes traffic to the nearest geographic region and handles SSL termination. | Provides low-latency entry points and sophisticated health-checking across global regions.
Edge Gateway / API Gateway | Kong or Envoy | Handles authentication, rate limiting (per-tier), and request routing. | High-performance proxy that supports custom plugins for quota management and JWT validation.
Chat & Context Service | Go (Golang) | Orchestrates chat logic, manages conversation state, and formats prompts. | Golang's concurrency model (goroutines) is ideal for managing thousands of simultaneous streaming connections with low memory overhead.
Inference Orchestrator | Custom microservice (Python/FastAPI or Go) | Routes requests to LLM backends, handles retries, circuit breaking, and failover. | Decouples the chat logic from specific LLM APIs, allowing for dynamic weight shifting and cost optimization.
Streaming Engine | Server-Sent Events (SSE) over HTTP/2 | Maintains persistent connections for pushing tokens to the client. | SSE is more efficient than WebSockets for unidirectional streaming from server to client and handles reconnections natively.
Usage & Billing Service | Apache Flink | Tracks token consumption and costs per user/request for real-time quota enforcement. | Required for real-time stream processing of token counts to prevent over-usage beyond quotas.

Data Flow

⚠ Mermaid rendering can be flaky with LLM-generated diagrams. If the diagram fails, copy the source and paste it into mermaid.live for reliable rendering.

Data Storage

Store | Type | Justification
--- | --- | ---
PostgreSQL (with Citus) | sql | Ensures ACID compliance and immediate consistency for chat history. Citus allows horizontal sharding to handle 500M messages/day.
Redis | cache | Used for session management and caching recent conversation context to minimize DB hits during active turns.
Elasticsearch | search | Provides full-text search capabilities over millions of conversations with complex filtering (by date, model, or folder).
Amazon S3 / Google Cloud Storage | blob | Durable storage for multimodal inputs (images, PDF documents) and exported chat logs.
Apache Kafka | queue | Decouples chat streaming from analytical/billing tasks. Ensures that slow storage or billing updates do not block the user response.

API Design

Method | Endpoint | Description
--- | --- | ---
POST | /v1/auth/login | Authenticates user and returns a JWT session token.
POST | /v1/chat/completions | Primary endpoint for sending messages. Supports 'stream: true' for SSE responses.
GET | /v1/conversations | Retrieves a paginated list of the user's conversation history.
POST | /v1/conversations/{id}/share | Generates a public, read-only URL for a specific conversation thread.
POST | /v1/files/upload | Uploads multimodal content; returns a file ID for inclusion in chat completions.
GET | /v1/models | Lists available LLM backends and their specific capabilities (e.g., vision, long context).
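
A client-side sketch of calling POST /v1/chat/completions with 'stream: true' and consuming the SSE response, using the requests library. The payload shape and event framing are assumptions modeled on the endpoint description above.

```python
# Client sketch: POST /v1/chat/completions with stream=true and read the SSE
# stream. Payload shape and event framing are assumptions.
import json
import requests

def stream_chat(base_url: str, jwt: str, prompt: str):
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        headers={"Authorization": f"Bearer {jwt}", "Accept": "text/event-stream"},
        json={"messages": [{"role": "user", "content": prompt}], "stream": True},
        stream=True,
        timeout=60,
    )
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue                    # skip blank separators and keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)

# for event in stream_chat("https://api.example.com", "eyJ...", "Hello"):
#     print(event)
```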

Scalability Strategy

The system scales horizontally at the service level using Kubernetes. The Chat Service and Inference Orchestrator are stateless, allowing auto-scaling based on CPU/Memory and concurrent connection counts. Database scalability is achieved through PostgreSQL sharding on 'user_id' to ensure data locality for a single user's history. Regional data isolation is used to meet 100k connection requirements per region, while a global Redis layer or DB replication handles shared state like public links.

Trade-offs

SSE over WebSockets

  • Lower overhead on the server
  • Better support for HTTP/2 multiplexing
  • Built-in reconnection logic
  • Unidirectional (server to client) only
  • Requires separate REST calls for client-side interruptions

Sharded PostgreSQL for History

  • Strong consistency for conversation turns
  • Relational schema fits organized folders/tags well
  • More complex to manage than NoSQL like DynamoDB
  • Requires careful shard key selection to avoid hotspots

Inference Orchestrator Abstraction

  • Seamless failover between providers
  • Centralized cost and rate limit tracking
  • Additional network hop (approx 10-20ms latency)
  • Single point of failure if not properly redundant
⏱ 85.3s 📅 2/8/2026, 1:55:11 PM 📁 design-chatgpt-gemini-3-pro-preview

Overview

A distributed, event-driven architecture designed to support 20M+ DAU for a ChatGPT-like application. The system leverages persistent WebSocket connections for low-latency streaming (TTFT < 500ms), a Model Orchestration Layer to abstract various LLM backends, and a tiered storage strategy (Redis -> DynamoDB -> S3) to handle the high write throughput of 500M messages per day. The design prioritizes interactivity and durability while ensuring strict cost governance and rate limiting.

Requirements

Functional

  • User authentication (SSO, MFA) and session management.
  • Real-time streaming of LLM responses via WebSockets.
  • Multi-turn conversation context management.
  • Model switching (e.g., GPT-4, Claude, Llama 3) per conversation.
  • Multimodal input handling (Images, PDF upload) via S3.
  • Conversation history management (Create, Rename, Delete, Archive).
  • Full-text search across conversation history.
  • Public link generation for sharing conversations.
  • Admin dashboard for cost tracking and user management.

Non-Functional

  • Latency: Time to First Token (TTFT) < 500ms.
  • Concurrency: Support 100k+ active WebSocket connections per region.
  • Availability: 99.99% uptime with multi-region failover.
  • Scalability: Horizontal scaling to handle 500M messages/day.
  • Durability: Zero data loss for conversation history.
  • Consistency: Immediate consistency for active chat, eventual consistency for search.
  • Billing Accuracy: Precise token counting for usage quotas.

Architecture Diagram

⚠ Mermaid rendering can be flaky with LLM-generated diagrams. If the diagram fails, copy the source and paste it into mermaid.live for reliable rendering.

Components

Component | Technology | Responsibility | Justification
--- | --- | --- | ---
Edge Gateway / API Gateway | AWS Application Load Balancer + Kong Gateway | SSL termination, Geo-routing, Rate limiting, Authentication verification. | Kong provides robust plugin support for rate-limiting (Token Bucket) and JWT validation before traffic hits internal services.
Connection Manager (Chat Service) | Go (Golang) on Kubernetes | Manages WebSocket connections, broadcasts stream chunks, handles user state. | Go's Goroutines are ideal for handling hundreds of thousands of concurrent WebSocket connections with low memory footprint compared to Node.js or Python.
Model Orchestrator | Python (FastAPI) with LangChain adapters | Standardizes API calls to different LLM providers, handles retry logic, and failover. | Python ecosystem has the best libraries for LLM integration. Isolating this allows independent scaling based on inference latency.
Context Assembly Service | Rust Microservice | Retrieves relevant chat history and injects system prompts/RAG context before inference. | Requires extremely low latency to fetch and tokenize text before sending to the LLM to meet the 500ms TTFT constraint.
Billing & Analytics Consumer | Apache Flink | Consumes completed message events to calculate costs and update quotas. | Stateful stream processing needed to aggregate token usage in real-time for strict quota enforcement.

Data Flow

⚠ Mermaid rendering can be flaky with LLM-generated diagrams. If the diagram fails, copy the source and paste it into mermaid.live for reliable rendering.

Data Storage

Store | Type | Justification
--- | --- | ---
Amazon DynamoDB | nosql | Primary store for Chat History. Supports massive write throughput (500M msgs/day) and efficient querying by Partition Key (ConversationID) and Sort Key (Timestamp). A query sketch follows this table.
Redis Cluster | cache | Stores active session state, recent conversation context (window), and user rate limit counters to minimize latency on the critical path.
Amazon S3 | blob | Storage for user-uploaded images/documents. Low cost, high durability, and allows offloading bandwidth via Presigned URLs.
PostgreSQL | sql | Stores structured relational data: User profiles, Organization hierarchies, Billing Invoices, and configuration settings.
Elasticsearch / OpenSearch | search | Provides full-text search capabilities over chat history, which DynamoDB cannot handle efficiently.
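
The DynamoDB access pattern above (partition key ConversationID, sort key Timestamp) can be sketched with boto3 as follows; table and attribute names are illustrative.

```python
# DynamoDB chat-history sketch: one partition per conversation, sorted by
# timestamp. Table and attribute names are illustrative.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("chat_messages")

def append_message(conversation_id: str, ts: str, role: str, content: str) -> None:
    table.put_item(Item={
        "conversation_id": conversation_id,  # partition key
        "created_at": ts,                    # sort key (ISO-8601 sorts correctly)
        "role": role,
        "content": content,
    })

def last_messages(conversation_id: str, limit: int = 20):
    # Newest-first page from a single partition, so the read stays cheap even
    # at 500M writes/day spread across many conversations.
    resp = table.query(
        KeyConditionExpression=Key("conversation_id").eq(conversation_id),
        ScanIndexForward=False,
        Limit=limit,
    )
    return resp["Items"]
```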

API Design

Method | Endpoint | Description
--- | --- | ---
WS | /ws/v1/chat | Main WebSocket endpoint for bi-directional streaming of prompts and LLM responses.
POST | /v1/conversations | Creates a new conversation thread, returns conversation_id.
GET | /v1/conversations/{id}/messages | Retrieves paginated message history for a specific conversation.
GET | /v1/models | Lists available LLM models user is authorized to use.
POST | /v1/files/upload-url | Generates a presigned S3 URL for uploading images or documents.
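
The /v1/files/upload-url flow above can be sketched with boto3: issue a presigned S3 PUT URL so the client uploads directly to S3. Bucket name, key layout, and expiry are illustrative assumptions.

```python
# Presigned-upload sketch for POST /v1/files/upload-url. Bucket name, key
# layout, and expiry are illustrative.
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "chat-user-uploads"  # assumed bucket name

def create_upload_url(user_id: str, filename: str, content_type: str) -> dict:
    file_id = str(uuid.uuid4())
    key = f"{user_id}/{file_id}/{filename}"
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": BUCKET, "Key": key, "ContentType": content_type},
        ExpiresIn=900,  # 15 minutes
    )
    # The client PUTs the file to `url`; messages then reference `file_id`.
    return {"file_id": file_id, "upload_url": url}
```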

Scalability Strategy

Horizontal scaling via Kubernetes HPA based on CPU and custom metrics (Active WebSocket Connections). Database scales via DynamoDB On-Demand capacity or provisioned capacity with auto-scaling. The system is sharded by ConversationID for data locality. Redis Cluster handles hot-path reads. A Queue-based decoupling (Kafka) allows background tasks (search indexing, analytics) to scale independently of the real-time chat service.

Trade-offs

WebSockets over Server-Sent Events (SSE)

  • Bi-directional capability allows users to interrupt generation mid-stream.
  • Better support for future features like real-time voice or collaborative editing.
  • More complex load balancing and state management on the server.
  • Issues with corporate firewalls compared to standard HTTP/SSE.

DynamoDB for History (NoSQL) vs PostgreSQL

  • Predictable low-latency performance at infinite scale (20M DAU).
  • Schema flexibility for evolving message metadata (e.g., adding citations).
  • Complex queries (e.g., full-text search) are not supported natively, requiring a secondary indexer (Elasticsearch)
  • Higher cost per GB compared to compressed cold storage in SQL/S3.

Async Token Counting (Post-generation)

  • Does not add latency to the streaming response.
  • Simplifies the hot path architecture.
  • Risk of minor quota overages if a user spams requests before the counter updates.
  • Complexity in reconciling partial streams if a connection drops.
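
The post-generation accounting path weighed above can be sketched as follows: once a stream finishes, count tokens and publish a usage event for the Flink billing consumer. kafka-python and tiktoken are one possible client/tokenizer pairing; the topic name and event schema are illustrative.

```python
# Post-generation token accounting sketch: count tokens off the hot path and
# publish a usage event. Topic name and event schema are illustrative;
# tiktoken is one possible tokenizer.
import json
import time
import tiktoken
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
enc = tiktoken.get_encoding("cl100k_base")

def record_usage(user_id: str, conversation_id: str, model: str,
                 prompt: str, completion: str) -> None:
    event = {
        "user_id": user_id,
        "conversation_id": conversation_id,
        "model": model,
        "prompt_tokens": len(enc.encode(prompt)),
        "completion_tokens": len(enc.encode(completion)),
        "ts": time.time(),
    }
    # Fire-and-forget off the streaming path; quota counters catch up
    # asynchronously, which is the overage risk noted above.
    producer.send("usage.tokens", value=event)
```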
⏱ 320.9s 📅 2/8/2026, 10:54:53 AM 📁 design-chatgpt-gpt-5

Overview

A globally distributed, real-time conversational AI platform supporting multi-turn chat with multiple LLM backends, multimodal inputs, and rich history management. The system is built for 20M DAUs and 500M messages/day with sub-500ms time-to-first-token via a high-performance WebSocket gateway, an LLM routing layer with fast failover, and region-affine, strongly consistent storage for conversation history. Analytics, cost tracking, and admin observability are first-class through an events pipeline into ClickHouse and Prometheus/Grafana. The core data plane is stateless, horizontally scalable on Kubernetes, and tolerant of provider or regional failures.

Requirements

Functional

  • User authentication (OAuth/social SSO) and session management
  • Create/read/update/delete multi-turn conversations and messages with context retention
  • Real-time streaming of assistant responses over WebSocket (token-by-token)
  • Conversation history: search, star, tag, foldering, archive, delete
  • Multiple model selection across providers and in-house inference
  • Rate limiting and quota enforcement by user/tier and per-model
  • File upload for images/documents with virus scan, OCR/text extraction, and multimodal prompt support
  • Share conversations via public links with configurable visibility
  • Admin dashboard: model/provider health, usage, cost, errors, rate limits/quota states
  • Cost tracking per request for accurate billing and cost attribution

Non-Functional

  • Time-to-first-token (TTFT) < 500ms p95
  • Support ≥100K concurrent WebSocket connections per region
  • Immediate consistency for conversation/message persistence
  • Regional fault tolerance and LLM provider failover with graceful degradation
  • Horizontal scalability to 500M messages/day (60k msgs/sec peak)
  • Data durability (multi-AZ), point-in-time recovery, backups
  • Security: WAF, DDoS protection, encryption in transit/at rest, least-privilege IAM
  • Observability: distributed tracing, metrics, logs, audit trails
  • Privacy and compliance readiness (GDPR/CCPA data subject controls)
  • Cost efficiency: autoscaling compute/GPU, cost-aware routing, storage tiering

Architecture Diagram

⚠ Mermaid rendering can be flaky with LLM-generated diagrams. If the diagram fails, copy the source and paste it into mermaid.live for reliable rendering.

Components

| Component | Technology | Responsibility | Justification |
| --- | --- | --- | --- |
| Client Web App | Next.js + React, TypeScript, WebSocket, highlight.js, Markdown-it | SPA for chat UI, Markdown rendering, code highlighting, WebSocket streaming, file uploads, search, and sharing | Mature ecosystem, SSR for SEO (shared links), excellent dev productivity and performance |
| Edge CDN/WAF & Global LB | Cloudflare CDN + Cloudflare Load Balancer + Bot Management | TLS termination, caching static assets, DDoS/WAF, geo-steering to nearest healthy region | Global footprint, anycast, robust WAF and health-based geo-routing to meet latency and availability targets |
| API & WebSocket Gateway | Go microservice on Kubernetes with NGINX Ingress (ALB) and HTTP/2; gorilla/websocket; gRPC to internal services | Single entry for REST and WebSocket; authZ/authN checks, rate limiting, session validation, request fan-out to internal services; streams tokens to client | Go delivers low-latency IO and high concurrency; stable WS handling; NGINX Ingress + ALB scale well |
| Auth Service | Auth0 (OIDC) + JWT (RS256) | User identity, OAuth/social login, JWT issuance, refresh tokens, RBAC/roles (user/admin) | Fast to integrate, enterprise SSO, adaptive MFA; offloads identity risk; standards-compliant OIDC |
| Rate Limit & Quota Service | Envoy Global Rate Limit Service + Redis Cluster; Lua in NGINX for shadow checks | Enforces per-user/tier rate limits (sliding window) and quotas; provides near-real-time counters | Envoy RLS is battle-tested; Redis offers sub-ms counters and atomicity with Lua scripts |
| Session/Cache Store | Redis Cluster (6.x) with Redis Streams for ephemeral events | JWT blacklist, session metadata, ephemeral streaming buffers, recent context window cache | In-memory speed, high availability via clustering and replication |
| Conversation Service | PostgreSQL (Citus) multi-tenant sharded by user_id; Go service using pgx | CRUD for conversations/messages, context building, sharing ACLs, foldering/tags; transactional writes | Immediate consistency and SQL semantics; Citus scales horizontally and keeps p95 low with partitioning |
| Search/Indexing Service | OpenSearch (multi-AZ) + k-NN plugin; background workers (Go) for indexing | Full-text search over titles/messages; semantic search via embeddings; indexing pipeline | Scalable search with near real-time indexing; k-NN for semantic search without extra vector DB |
| LLM Router | Go service with gobreaker, HTTP/2 keep-alive pools; provider SDKs; configuration via Consul/etcd | Model catalog, routing to providers/in-house; health checks, circuit breakers, retries, cost-aware selection; streaming token multiplexing | Low-latency, robust control plane with per-provider health and dynamic routing rules |
| Provider Connectors | Connectors for OpenAI/Anthropic/Azure OpenAI/Google Vertex; retries with exponential backoff; streaming adapters | Integrations to external LLMs and embeddings | Diversity reduces provider risk and enables cost/performance optimization |
| In-house Inference Cluster | vLLM on Kubernetes GPU nodes (NVIDIA A10/A100), Triton for embeddings; Istio for mTLS | Self-hosted models (vLLM) for failover and cost control; embeddings server | High throughput, streaming-friendly; cost-efficient for baseline models and embeddings |
| File Ingestion Service | Amazon S3 + S3 Object Lambda (virus scan with ClamAV) + AWS Textract + Apache Tika; Step Functions for orchestration | Pre-signed uploads, virus scanning, OCR/text extraction, chunking; links assets to messages | Serverless pipeline scales elastically; S3 durability and cost efficiency for blobs |
| Cost & Billing Service | Kafka consumers (Go) -> ClickHouse for analytics; Postgres for authoritative balances | Compute per-request cost (provider rates, tokens, GPU time), store usage, expose invoices and quotas | ClickHouse excels at high-ingest analytics; Postgres for transactional balances and limits |
| Event Bus | Apache Kafka (AWS MSK) | Asynchronous events: usage, costs, audit logs, indexing triggers | High-throughput, durable event streaming; ecosystem support |
| Analytics & Monitoring | Prometheus + Grafana; OpenTelemetry + Jaeger; Loki for logs; CloudWatch for infra | Dashboards, alerts, traces, logs | Proven OSS stack, vendor-neutral instrumentation |
| Admin Dashboard | Next.js + RBAC; reads from ClickHouse/Prometheus/Postgres | Operational UI: usage, costs, errors, provider health, throttles; model catalog management | Unified operational control plane with low-latency analytics queries |
| CDN Assets & Static Hosting | Cloudflare + S3 static site hosting | Serve static JS/CSS/images | Global low-latency delivery for assets |

Data Flow


Data Storage

| Store | Type | Justification |
| --- | --- | --- |
| PostgreSQL (Citus) | sql | Strong consistency and transactions for conversations/messages; Citus provides horizontal sharding by user_id with high write throughput and low-latency queries |
| Redis Cluster | cache | Sub-millisecond counters for rate limits, sessions, ephemeral streaming buffers, and hot context windows |
| Amazon S3 | blob | Durable, cost-effective storage for file uploads, images, and large attachments; lifecycle policies for tiering |
| OpenSearch | search | Full-text and semantic search with k-NN; scalable indexing and near real-time search for conversation history |
| Kafka (AWS MSK) | queue | Durable, scalable event streaming for usage, billing, indexing, and audit logs, decoupling producers/consumers |
| ClickHouse | sql | High-ingest, columnar analytics for usage and cost reporting; sub-second aggregations at scale |

API Design

| Method | Endpoint | Description |
| --- | --- | --- |
| WS | /v1/ws | Bidirectional WebSocket for sending user messages and receiving token-streaming responses and events |
| POST | /v1/conversations | Create a new conversation (title, tags, model selection, visibility) |
| GET | /v1/conversations | List conversations with filters (folder, tag, starred) and pagination |
| GET | /v1/conversations/{id} | Get a conversation with messages (server-side pagination) |
| POST | /v1/conversations/{id}/messages | Add a user message to a conversation (text, file refs, tool calls) |
| GET | /v1/messages/{id} | Get message detail and streaming status |
| GET | /v1/search | Search conversations/messages (full-text + semantic options) |
| GET | /v1/models | List available models and tiers, pricing metadata |
| POST | /v1/files | Initiate file upload and get pre-signed URL; returns file_id |
| POST | /v1/share/{conversation_id} | Create/update share link (public/unlisted/expire) |
| GET | /v1/usage | Per-user usage and remaining quota by period |
| GET | /v1/admin/metrics | Admin: provider health, error rates, throughput, cost summaries |
| PUT | /v1/admin/models | Admin: manage model catalog, routing weights, and availability |

Scalability Strategy

  • Traffic and sessions: Anycast via Cloudflare to the nearest region. Sticky sessions are not required; WebSocket connections are long-lived and evenly distributed via ALB. Gateway pods autoscale on CPU and open FDs; each Go pod targets ~4–5K concurrent WS, so 30 pods suffice for 150K WS with headroom per region.
  • Storage: Citus shards by user_id across nodes; co-locate primary and replicas in the same AZ to minimize latency. Connection pooling with PgBouncer. Hot partitions handled by rebalancing shards. PITR and logical replication to a DR region.
  • Search: OpenSearch domain scales horizontally across data nodes. Index with 1–3 primary shards per index and ILM for rollover. Async indexers consume from Kafka for sustained throughput.
  • LLM routing: Health probes and circuit breakers per provider/region; latency-aware load balancing and hedged requests before first token. In-house vLLM autoscaling on GPU metrics (queue depth, tokens/sec). Keep-alive HTTP/2 pools to reduce TTFB.
  • Rate limiting: Redis Cluster with hash tags for per-user keys ensures single-shard updates. Use a sliding window with Lua for atomicity (a sketch follows this list). Quotas aggregated periodically from ClickHouse and persisted to Postgres for authority.
  • WebSockets: Separate HPA based on concurrent connections and network IO. Use SO_REUSEPORT and pod anti-affinity. Idle pings to detect dead peers. Backpressure controls to avoid OOM.
  • Multi-region: Active-active per region; users are region-affined based on the home region stored in their profile. Cross-region failover via Cloudflare LB health checks; if the home region is down, serve reads from the last durable snapshot in DR and accept writes in a degraded mode (queued for backfill) with a user-facing notice.
  • Observability: OpenTelemetry traces propagate across gateway, LLM router, and connectors. SLO-based autoscaling and alerting for TTFT and error budgets.
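The Redis sliding-window approach above could be sketched roughly as follows with go-redis; the key format, member scheme, and hard-coded arguments are illustrative assumptions, and a real deployment would load tier-specific limits instead:

```go
package ratelimit

import (
	"context"
	"strconv"
	"time"

	"github.com/redis/go-redis/v9"
)

// slidingWindow runs atomically on the shard that owns the user's key
// (the hash tag below keeps each user's key on a single cluster slot).
var slidingWindow = redis.NewScript(`
local key    = KEYS[1]
local now    = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit  = tonumber(ARGV[3])
redis.call("ZREMRANGEBYSCORE", key, 0, now - window)
if redis.call("ZCARD", key) >= limit then return 0 end
redis.call("ZADD", key, now, ARGV[4])
redis.call("PEXPIRE", key, window)
return 1
`)

// Allow reports whether userID may send another request within the window.
func Allow(ctx context.Context, rdb *redis.Client, userID string, limit int, window time.Duration) (bool, error) {
	key := "rl:{" + userID + "}" // hash tag -> single Redis Cluster slot per user
	// Nanosecond member keeps entries unique within the same millisecond.
	member := strconv.FormatInt(time.Now().UnixNano(), 10)
	n, err := slidingWindow.Run(ctx, rdb,
		[]string{key},
		time.Now().UnixMilli(), window.Milliseconds(), limit, member,
	).Int()
	if err != nil {
		return false, err
	}
	return n == 1, nil
}
```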

Trade-offs

Use PostgreSQL (Citus) for conversations instead of a globally-distributed DB (e.g., Spanner/CockroachDB)

  • Immediate consistency and strong SQL semantics
  • Operational familiarity and cost-effective scaling
  • Shard-by-user keeps latency low
  • Cross-region strong consistency not automatic
  • Requires shard management and regional affinity policies

WebSockets for streaming vs Server-Sent Events

  • Bidirectional control channel (cancel, tool calls)
  • Better concurrency characteristics and fewer HTTP headers
  • Widely supported by CDNs/LBs
  • More complex connection lifecycle management
  • Stateful long-lived connections impact autoscaling

OpenSearch for full-text + k-NN vs separate vector DB

  • Single search stack reduces ops overhead
  • Near real-time indexing with scalable ingestion
  • k-NN recall/latency may trail specialized vector DBs
  • Index tuning and memory footprint can be heavy

Auth0 for identity instead of building in-house

  • Faster time-to-market, robust SSO/MFA
  • Security updates and compliance handled by vendor
  • Vendor cost and dependency
  • Rate limits and token introspection latency considerations

Kafka + ClickHouse for analytics and cost tracking

  • High-ingest, low-cost analytics at scale
  • Flexible schema evolution for events
  • Operational complexity (clusters to manage)
  • Eventual consistency for analytics views

In-house vLLM cluster as secondary routing target

  • Cost control for common models
  • Resilience to third-party outages
  • Requires GPU capacity management and model maintenance
  • May lag on latest proprietary model capabilities

Async indexing to OpenSearch (eventual) while DB is immediate

  • Protects write path latency and TTFT
  • Scales independently
  • Search results may lag by seconds
  • Requires reconciliation jobs for rare failures
⏱ 138.3s 📅 2/8/2026, 10:57:24 AM 📁 design-chatgpt-gpt-5-mini

Overview

A globally distributed, real-time web platform that enables multi-turn conversations with configurable LLM backends, streaming responses token-by-token, durable conversation history, multimodal inputs, per-user quotas and billing, and admin monitoring. The design uses managed cloud components where appropriate (AWS examples used for concreteness) and is built for 20M DAU and ~500M messages/day — with multi-region deployment, autoscaling WebSocket clusters, strong consistency for conversation data, semantic search, and resilient LLM backend routing with automatic failover and cost accounting.

Requirements

Functional

  • User authentication, registration, password reset and session management
  • Create and continue multi-turn conversation threads with context retention
  • Real-time streaming of LLM responses token-by-token to clients
  • Durable and immediately consistent conversation history (reads immediately reflect completed writes)
  • Search and organize conversation history (text + semantic search)
  • Support multiple LLM backends and per-conversation model selection
  • Rate limiting and per-tier usage quotas; block/soft-limit enforcement
  • Markdown rendering with safe sanitization (code blocks, tables, etc.)
  • File upload and multimodal input handling (images, documents) with safe storage and processing
  • Share conversations via public links (read-only) with optional expiry
  • Admin dashboard for usage, costs, quota management, and system health
  • Per-request cost tracking for accurate billing

Non-Functional

  • Scale to 20M daily active users and 500M messages/day
  • Support at least 100K concurrent WebSocket connections per region
  • Start streaming first token within 500ms of request
  • Conversation history must be durable and immediately consistent
  • High availability and graceful degradation on LLM backend failures with automatic failover
  • Low latency (P95 request response times within reasonable bounds) and high throughput
  • Secure file handling, sanitization, and access controls
  • Observability: request tracing, per-request cost telemetry, metrics and logs
  • Regulatory considerations: data residency and GDPR-friendly features (export/delete)

Architecture Diagram


Components

| Component | Technology | Responsibility | Justification |
| --- | --- | --- | --- |
| Edge / CDN | AWS CloudFront + AWS WAF (or Cloudflare) for global CDN & edge security | Global caching, TLS termination, hosting static assets, and routing to nearest API region. Protect against DDoS and serve prerendered content. | Low latency global content delivery and edge protections. CloudFront integrates with regional ALBs and AWS API Gateway; WAF provides DDoS/IPS rules. |
| API Gateway (REST & WebSocket) | AWS API Gateway (HTTP/APIGW v2 for WebSocket) or AWS Application Load Balancer + NLB for WebSocket if a custom stack is preferred | Ingress point for HTTP(S) REST APIs and managed WebSocket connections, authentication/authorization integration, metrics, and throttling. | Managed API Gateway handles large-scale WebSocket connections reliably and integrates with Lambda and VPC targets. Reduces operational burden to meet 100K+ connections per region. |
| Auth Service / Identity | Auth0 or Amazon Cognito (or self-hosted Keycloak for more control) | User authentication (email/password, OAuth), session issuance, token lifecycle, MFA, and account management. | Managed identity reduces time to market; Cognito/Auth0 handle scaling, OIDC/OAuth flows, social login, and integrate with API Gateway and IAM. Can fall back to Keycloak if self-hosting is required for compliance. |
| Frontend (Web & Mobile clients) | React + Next.js for Web (SSR), React Native for mobile; WebSocket & SSE clients for streaming; remark/rehype for markdown rendering and DOMPurify for sanitization. | UI for conversations, streaming UI, markdown rendering/sanitization, file uploads, sharing links, offline behaviors, and WebSocket clients. | Next.js gives a performant SSR/CSR mix and edge support; well-supported libraries for markdown and security. |
| Connection Manager / WebSocket Workers | Kubernetes (EKS) running horizontally scaled WebSocket worker pods behind API Gateway or ALB, using Envoy/ingress for routing; Redis for presence/connection metadata. | Maintain WebSocket connections, route tokens to clients, enforce per-connection rate limits, maintain ephemeral state, and connect to LLM streaming output. | Kubernetes provides autoscaling and lifecycle control. Breaking stream work into worker pods allows streaming token-by-token with low-latency writes to sockets; Redis stores connection mapping for routing in multi-pod deployments. |
| Conversation Service (API) | Stateless microservice in Kubernetes (gRPC/HTTP) with connection to Aurora PostgreSQL (primary writer) and a caching layer (Redis). | Handles conversation CRUD, multi-turn context assembly, versioning, bookmarks, shareable link creation, and immediate persistent writes. | Stateless services scale easily. Aurora PostgreSQL provides strong consistency and supports high write throughput with multi-AZ. Redis accelerates hot-path reads and rate-limit checks. |
| Message Ingest & Streaming Orchestrator (LLM Router) | Stateless microservice (Kubernetes) using gRPC to LLM backends; router built on Hystrix-like circuit-breaker libraries and per-model adapters. Persists logs & events to Kafka (MSK) for downstream processing. | Orchestrates sending prompts to selected LLM backend(s), streams tokens back to the Connection Manager, calculates per-request cost, applies circuit breakers and failover to alternate models/backends, and logs telemetry. | Centralized routing simplifies failover, cost accounting, and policy enforcement. gRPC yields low-latency backend calls; Kafka provides durable eventing for billing and analytics. |
| LLM Backends | Hybrid: external providers (OpenAI/Anthropic) + internal GPU clusters orchestrated by Kubernetes + Triton / NVIDIA TensorRT / Ray Serve for model serving. Model proxies expose gRPC or HTTP streaming. | Provide model inference and token streaming. Could be managed external APIs (OpenAI, Anthropic) and/or internal GPU clusters (private models). | Hybrid provides capacity and cost controls: external for burst/spiky loads and internal for steady-state/private models. Triton/Ray Serve are production-ready for large model serving with streaming support. |
| Cache & Rate-Limit Store | Redis (Amazon ElastiCache in clustered mode, or Redis Enterprise) | Fast token-bucket rate limits, session cache, short-lived conversation caches for hot reads, and presence store. | Redis supports very low-latency operations, atomic counters, Lua scripting for rate-limiting logic, and clustering for scale. |
| Durable Storage (Conversations / Metadata / Billing) | Amazon Aurora PostgreSQL (clustered, multi-AZ, read replicas) with partitioning/sharding by tenant or hashed conversation id. | Immediate-consistency primary store for conversations, messages, user metadata, billing records, and access controls. | Relational strong consistency and transactions for the immediate-consistency requirement; Aurora scales reads and provides high durability and automated backups. |
| Object Store (Files & Attachments) | Amazon S3 with S3 Object Lambda hooks; presigned uploads; Lambda for scanning via ClamAV or third-party virus scanning | Store uploaded files (images, docs) and serve them to model pipelines and clients via presigned URLs; lifecycle & virus-scan results. | S3 is durable, scalable, and cost-effective; presigned uploads offload bandwidth; the Lambda-based scanning pipeline can run asynchronously. |
| Search & Embeddings | Hybrid: OpenSearch (for keyword/structured search) + vector DB (Pinecone, Milvus, or Amazon OpenSearch vector plugin) for embeddings. Embeddings produced via a managed service or dedicated model instances and stored in the vector DB. | Text and semantic search across conversation history and attachments; embedding generation and vector search. | OpenSearch handles traditional search and filters; the vector DB supports semantic similarity at scale. Separating concerns lets search scale independently. |
| Event Bus / Streaming & Analytics | Apache Kafka (Amazon MSK) for high-throughput durable logs; Kafka Connect to a data warehouse (Snowflake/BigQuery) and stream processors (Flink/Kafka Streams). | Durable eventing for audit logs, billing events, metrics, and asynchronous jobs (indexing, notifications, cost aggregation). | Kafka scales to hundreds of thousands of events/sec and supports exactly-once processing patterns, enabling accurate billing and analytics. |
| Billing & Cost Accounting | Service that consumes Kafka billing events, applies per-model cost rates, stores detailed line items in PostgreSQL, and aggregates in OLAP (BigQuery/Snowflake) for reports; serverless ETL for daily aggregation. | Accurate per-request cost tracking, aggregation to user billing, tier enforcement, and exports to the billing system. | Event-driven accounting keeps near-real-time cost tracking for each request; OLAP enables fast analytics and admin dashboards. |
| Admin Dashboard & Observability | Prometheus + Grafana for metrics; Jaeger for distributed traces; ELK/OpenSearch for logs; Grafana dashboards with role-based access. Admin frontend built on React + RBAC. | System metrics, alerts, per-user/tier usage, cost dashboards, model health, and structured logs. | Standard observability stack with tracing allows operators to debug, monitor the system, and analyze cost/usage. |
| Security & Compliance | AWS KMS for secrets, IAM for infra access control, Vault (HashiCorp) for application secrets if self-hosting; S3 encryption and TLS everywhere. | Access controls, secret management, key rotation, audit logs, data deletion/export endpoints, encryption at rest/in transit, and DLP for file scanning. | Managed key stores and RBAC minimize operational overhead while meeting compliance. |

Data Flow


Data Storage

| Store | Type | Justification |
| --- | --- | --- |
| Amazon Aurora PostgreSQL | sql | Provides immediate consistency, transactions, and durability for conversation history and billing line items. Aurora supports multi-AZ, read replicas, partitioning/sharding, and scales to high throughput with proper schema design. |
| Redis (ElastiCache Clustered) | cache | Low-latency data for rate limiting, session/presence mapping, token-bucket counters, and ephemeral caching of recent conversation context for fast reads. |
| Amazon S3 | blob | Durable, cost-efficient object storage for user-uploaded files and model artifacts. Supports presigned uploads and lifecycle policies; integrates with Object Lambda for scanning/transformations. |
| Apache Kafka (Amazon MSK) | queue | Durable, high-throughput event stream for message events, billing events, and indexing streams. Enables decoupled asynchronous processing (search indexing, billing aggregation, analytics). |
| OpenSearch (Elastic) + Vector DB (Pinecone or Milvus) | search | OpenSearch for keyword/structured search and filters; vector DB for semantic similarity search on embeddings. Scales independently and supports fast retrieval of relevant conversation segments. |
| OLAP (BigQuery or Snowflake) | nosql | For cost/billing analytics and historical reporting at scale. Stores aggregated billing/usage records and enables fast analytics for admin dashboards and finance exports. |

API Design

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/v1/auth/login | Authenticate user (email/password or OAuth token exchange). Returns access token and refresh token. Initiates session and rate-limit metadata. |
| GET | /api/v1/conversations | List user conversations with pagination, sorting, and filters (by tag, model, shared). Uses read replica; consistent with write-through caching invalidation. |
| POST | /api/v1/conversations | Create a new conversation; specify model, system prompt, privacy/sharing options, and optional attachments. |
| POST | /api/v1/conversations/{conversationId}/messages | Send a new user message to a conversation. Persists the message, triggers inference via the LLM Router, and returns an inference-id. Supports multimodal references (file IDs). |
| WS | /api/v1/conversations/{conversationId}/stream | WebSocket endpoint for real-time streaming of LLM responses (token-by-token) and message events. Supports client acknowledgements, reconnect/resume semantics, and server pings. |
| POST | /api/v1/files | Request presigned URL for upload or upload metadata. After upload, the file is scanned asynchronously; returns a file ID for model input. |
| GET | /api/v1/models | List available models with capabilities, estimated cost/token, latency SLAs, and fallback rules. |
| POST | /api/v1/conversations/{conversationId}/share | Create a public, shareable link (read-only) with optional expiry and password-protection settings. |
| GET | /api/v1/admin/metrics | Admin-only metrics endpoint aggregated from Prometheus/OLAP for usage, costs, model health, and alerts. Requires admin RBAC. |
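As a rough illustration of the presigned-upload flow behind POST /api/v1/files, the sketch below uses the AWS SDK for Go v2 to mint a short-lived PUT URL that the browser uploads to directly. The bucket name, key layout, and expiry are assumptions, not the design's actual configuration:

```go
package files

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// PresignUpload returns a short-lived URL the client can PUT the file to
// directly, so upload bandwidth never touches the API tier.
func PresignUpload(ctx context.Context, userID, fileID, contentType string) (string, error) {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return "", err
	}
	presigner := s3.NewPresignClient(s3.NewFromConfig(cfg))

	req, err := presigner.PresignPutObject(ctx, &s3.PutObjectInput{
		Bucket:      aws.String("chat-uploads"), // hypothetical bucket
		Key:         aws.String("uploads/" + userID + "/" + fileID),
		ContentType: aws.String(contentType),
	}, s3.WithPresignExpires(15*time.Minute))
	if err != nil {
		return "", err
	}
	// The async scan/OCR pipeline is triggered separately (e.g., S3 event ->
	// queue) once the object lands; the file stays "pending" until it passes.
	return req.URL, nil
}
```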

Scalability Strategy

  • Multi-region deployment with region-local clusters (API Gateway + EKS + Aurora in each region, or cross-region read-only replicas depending on data residency).
  • Horizontal scaling: stateless frontends and the LLM Router scale via Kubernetes HPA/KEDA based on CPU/RPS/queue length. WebSocket workers scale horizontally; a managed API Gateway or ALB handles connection scaling.
  • Data stores: Redis scales as clustered ElastiCache with sharding; Aurora scales by sharding conversations by tenant or hashing conversationId to different writer clusters for write throughput.
  • Eventing: Kafka (MSK) partitions are sized by throughput, with consumer groups for parallel processing (a producer sketch follows this list).
  • Model serving: autoscaling GPU pools for internal models (Karpenter/Cluster Autoscaler) with spot instances to reduce cost for non-critical capacity.
  • Edge: CloudFront caching for static assets and read-heavy metadata.
  • Search and embeddings: scale vector DB clusters independently.
  • Global throughput: traffic steering to the nearest region with failover, and an active-passive or active-active DB strategy where legal/regulatory constraints permit.
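A minimal sketch of the billing/usage events feeding those Kafka consumer groups, assuming segmentio/kafka-go; the topic name, broker address, and event fields are illustrative. Keying by user ID keeps one user's events ordered within a partition, which simplifies downstream cost aggregation:

```go
package billing

import (
	"context"
	"encoding/json"

	"github.com/segmentio/kafka-go"
)

// Event is a hypothetical per-request usage record.
type Event struct {
	UserID    string `json:"user_id"`
	RequestID string `json:"request_id"`
	Model     string `json:"model"`
	TokensIn  int    `json:"tokens_in"`
	TokensOut int    `json:"tokens_out"`
	CostMicro int64  `json:"cost_micro_usd"`
}

var writer = &kafka.Writer{
	Addr:         kafka.TCP("msk-broker-1:9092"), // hypothetical broker
	Topic:        "billing.usage",
	Balancer:     &kafka.Hash{},    // same key -> same partition -> per-user ordering
	RequiredAcks: kafka.RequireAll, // durability for billing-grade events
}

// Publish emits one usage event; consumers (cost aggregation, quota sync,
// OLAP load) run in separate consumer groups and scale independently.
func Publish(ctx context.Context, ev Event) error {
	payload, err := json.Marshal(ev)
	if err != nil {
		return err
	}
	return writer.WriteMessages(ctx, kafka.Message{
		Key:   []byte(ev.UserID),
		Value: payload,
	})
}
```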

Trade-offs

Use Aurora PostgreSQL (SQL) as primary conversation store versus a NoSQL store

  • Strong consistency and transactional integrity meeting immediate-consistency requirement
  • Familiar SQL tooling for analytics and billing joins
  • ACID semantics make concurrency around multi-turn context safer
  • Higher cost at scale and more complex sharding strategy required for very high write throughput
  • Scaling writes requires sharding/partitioning; operational complexity compared to unlimited-scaled NoSQL

Managed API Gateway + managed WebSocket vs self-hosted WebSocket tier

  • Reduces operational burden and more predictable scalability to meet 100K+ connections per region
  • Integrated auth/metrics and DDoS protections
  • Potentially higher cost and less low-level customization than self-hosted approach
  • Proprietary limits and vendor lock-in

Hybrid LLM Backends (external providers + internal GPU clusters)

  • Flexibility: use cheaper internal models for baseline load and external providers for burst or advanced models
  • Avoids total dependency on third-party providers and gives control over privacy/compliance
  • Complexity in routing, capacity planning, and maintaining model infra
  • Potentially higher ops cost to run GPU clusters

Kafka (MSK) for events vs serverless queues (Kinesis/Lambda)

  • Kafka provides high throughput, ordering guarantees, and complex stream processing needed for billing accuracy
  • Good for exactly-once or at-least-once semantics required by billing and audit trails
  • Operational complexity and higher management overhead vs serverless alternatives
  • Higher cost at small scale; more moving parts

Use Redis for rate-limiting and session store

  • Extremely low latency and atomic operations (Lua) for precise rate-limiting
  • Widely adopted pattern and straightforward to implement token-bucket quotas
  • Single point of failure risk if not configured in clustered/high-availability mode
  • Operational overhead to scale and tune eviction/persistence

Vector DB (Pinecone/Milvus) for semantic search vs using OpenSearch alone

  • Vector DB optimized for nearest-neighbor semantic search at scale and supports high-dimensional vectors
  • Better latency and scalability for similarity queries
  • Adds another datastore to maintain and sync embeddings (operational overhead)
  • Extra cost; integration complexity with indexing pipelines
⏱ 221.0s 📅 2/8/2026, 1:59:52 PM 📁 design-chatgpt-gpt-5.2

Overview

The system is a globally distributed, multi-tenant conversational AI web application supporting authenticated users, multi-turn threads, token-streaming responses, file/multimodal inputs, conversation search, sharing links, and an admin cost/usage dashboard. It is designed for 20M DAU and ~500M messages/day with strict latency requirements (TTFT < 500ms) and high concurrency (>=100K concurrent WebSocket connections per region). The architecture separates the latency-critical request/streaming path (WebSocket Gateway + Orchestrator + LLM adapters) from durable storage, indexing, analytics, and billing pipelines. Conversations are stored in a strongly consistent SQL store, while search and analytics are powered by specialized systems. LLM backend failures are handled via circuit breakers, hedged requests, and provider failover with per-token streaming preserved.

Requirements

Functional

  • User authentication (SSO/email), session management, and tier entitlements
  • Create/read/update conversation threads with multi-turn context retention
  • Real-time token-by-token streaming of assistant responses to the client
  • Model selection per conversation/message across multiple LLM backends
  • Conversation history browsing, organization (folders/tags), and search
  • File upload (images/documents) and multimodal prompts
  • Share conversations via public links with optional redaction/permissions
  • Rate limiting and usage quotas per user/tier with enforcement
  • Markdown rendering support (code blocks, tables) and safe sanitization
  • Admin dashboard for monitoring usage, latency, errors, and costs

Non-Functional

  • Scale: 20M DAU, 500M messages/day, avg 10 turns/conversation
  • Latency: streaming must start within 500ms time-to-first-token
  • Concurrency: >=100K concurrent WebSocket connections per region
  • Durability + immediate consistency for conversation history
  • High availability with automatic failover across LLM providers/regions
  • Accurate per-request/per-token cost tracking for billing
  • Security: encryption in transit/at rest, least privilege, audit logs
  • Compliance readiness (PII controls, retention policies, GDPR delete)
  • Operational excellence: observability, alerting, safe deploys (canary)
  • Abuse prevention: bot detection, prompt injection/file malware scanning

Architecture Diagram


Components

| Component | Technology | Responsibility | Justification |
| --- | --- | --- | --- |
| Web Client (Browser) + Mobile | Next.js (React) + TypeScript; Markdown-it + DOMPurify; WebSocket client | UI for chat, conversation list, model selection, file upload, markdown rendering, and real-time streaming display via WebSocket/SSE fallback. | Next.js supports SSR/SPA, fast iteration, and edge-friendly deployments. Markdown-it is extensible for code blocks/tables; DOMPurify prevents XSS. WebSocket enables low-latency bidirectional streaming. |
| Global DNS + CDN/WAF | Cloudflare (DNS, CDN, WAF, Bot Management) | Global traffic steering, TLS termination at edge, DDoS protection, caching of static assets, WAF rules, and bot mitigation. | Strong global presence reduces latency and protects origin. Bot/DDoS controls are critical at 20M DAU. |
| API Gateway / Edge | Envoy Gateway (Kubernetes) + Cloudflare origin rules | Routing for REST APIs and WebSocket upgrades, auth pre-checks, request shaping, and regional failover. | Envoy provides high-performance L7 routing, retries, timeouts, and observability. Works well with WebSockets and service mesh patterns. |
| Auth & Session Service | Auth0 (OIDC) + internal Session API using JWT (short-lived) + Redis for session revocation | User signup/login, OAuth/OIDC, session issuance, refresh, MFA support, and entitlement lookup for tiers. | Auth0 reduces security risk and time-to-market. Short-lived JWT minimizes DB calls; Redis enables immediate revocation/ban. |
| WebSocket Gateway (Streaming Gateway) | Kubernetes-deployed Node.js (uWebSockets.js) or Go (fasthttp + websocket) service; Redis Cluster for ephemeral connection metadata | Manages WebSocket connections, fan-out of token streams, backpressure, connection state, and regional scaling to >=100K concurrent connections. | Specialized gateway isolates long-lived connections from general API traffic. Go/uWS handle high concurrency efficiently; Redis supports lightweight presence/state without coupling to the DB. |
| Chat Orchestrator Service | Go microservice (gRPC internally) with circuit breakers (Hystrix-like) and retries (Envoy + app-level) | Core chat workflow: validate quotas, build context, call LLM backends, stream tokens, handle tool/file references, persist messages atomically, and emit usage/cost events. | Go offers predictable latency and high throughput. Central orchestration simplifies consistency and billing correctness while keeping the streaming path tight. |
| LLM Provider Adapter Layer | Internal service/library used by Orchestrator; supports OpenAI-compatible streaming + Bedrock + Anthropic; optional self-hosted vLLM on GPU nodes | Uniform interface for multiple model providers (e.g., OpenAI, Anthropic, AWS Bedrock, self-hosted vLLM), token streaming normalization, automatic failover/hedging, and provider-specific auth. | Decouples product from provider APIs and enables rapid switching, routing, and fallback strategies to meet availability/latency constraints. |
| Conversation Service | Java/Kotlin (Spring Boot) or Go; PostgreSQL-compatible distributed SQL (YugabyteDB) | CRUD for conversations, messages, metadata (title, tags, folders), share settings, and immediate-consistency reads. | Distributed SQL provides strong consistency with horizontal scaling and multi-region resilience. A dedicated service encapsulates schema and access patterns. |
| Search/Indexing Service | Elasticsearch (managed, e.g., Elastic Cloud) + Kafka Connect for the indexing pipeline | Index conversation/message text and metadata for fast search, filtering, and ranking; supports near-real-time updates. | Elasticsearch is well-suited for full-text search and faceting at large scale. Kafka-based ingestion decouples indexing from the write path. |
| File Ingestion & Multimodal Pipeline | S3-compatible object storage (Amazon S3) + CloudFront signed URLs; ClamAV scanning; Apache Tika for parsing; optional GPU service for vision embeddings | Handle uploads, virus/malware scanning, document parsing (PDF/DOCX), image preprocessing, OCR, embedding generation, and secure storage/links. | Object storage is the standard for large binary data. Scanning and parsing protect the platform. Signed URLs reduce origin load and limit unauthorized access. |
| Rate Limiting & Quota Service | Redis Cluster (token bucket/leaky bucket) + internal Quota API; optional Envoy global rate limit service | Per-user/per-tier rate limits (RPS), token quotas, daily/monthly usage, and enforcement in the hot path. | Redis offers sub-millisecond counters suitable for the 500ms TTFT constraint. Central policy keeps enforcement consistent across gateways. |
| Usage/Cost Metering Service | Kafka + stream processing (Apache Flink) + ClickHouse for analytics + PostgreSQL ledger tables | Compute accurate costs per request (tokens in/out, model pricing, file processing costs), generate billing-grade ledgers, and expose aggregates to admin/user dashboards. | Flink enables real-time aggregation while a PostgreSQL ledger ensures correctness and auditability. ClickHouse supports high-QPS analytics for dashboards. |
| Sharing Service | Go service + PostgreSQL (YugabyteDB) + CDN caching for public read views | Create public share links, snapshot/redaction, access control, and view tracking. | Share links require durable mapping and permissions. CDN accelerates read-heavy public access. |
| Admin & Observability Stack | Prometheus + Grafana; OpenTelemetry + Tempo/Jaeger; Loki; Sentry; Argo Rollouts for canary | Monitoring, tracing, logging, incident response, and admin dashboard for usage/cost/latency/provider health. | Standard cloud-native observability with a strong ecosystem; canary reduces risk when changing critical streaming paths. |
| Message Bus / Event Backbone | Apache Kafka (managed, e.g., Confluent Cloud) | Decouple write path from indexing, analytics, notifications, and offline processing. | Kafka scales to very high throughput (500M messages/day) and enables replayable event streams for multiple consumers. |

Data Flow


Data Storage

| Store | Type | Justification |
| --- | --- | --- |
| YugabyteDB (PostgreSQL-compatible distributed SQL) | sql | Strong consistency and durability with horizontal scaling and multi-region replication; ideal for immediately consistent conversation history and share-link metadata. |
| Redis Cluster | cache | Sub-millisecond counters for rate limiting/quota enforcement; session revocation; ephemeral WebSocket connection metadata. |
| Apache Kafka | queue | High-throughput event backbone to decouple indexing, analytics, metering, and async file processing from the latency-critical chat path. |
| Amazon S3 (Object Storage) | blob | Durable, scalable storage for user uploads (images/documents) and generated artifacts; integrates with signed URLs and lifecycle policies. |
| Elasticsearch | search | Full-text search with faceting for conversation history at scale, supporting near-real-time indexing from Kafka. |
| ClickHouse | nosql | High-performance OLAP for admin/user dashboards on usage, costs, latency, and provider performance. |
| PostgreSQL (Billing Ledger) | sql | Billing-grade immutable ledger entries require strict constraints, transactions, and auditability; kept separate from high-volume chat OLTP. |

API Design

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /v1/auth/session | Exchange OIDC code for application session (JWT/refresh); return user profile and tier entitlements. |
| POST | /v1/conversations | Create a new conversation (optionally with selected model, system prompt, folder/tags). |
| GET | /v1/conversations/{conversationId} | Fetch conversation metadata and messages with strong consistency (latest turns). |
| POST | /v1/conversations/{conversationId}/messages | Send a user message (non-streaming fallback) and receive the assistant response when complete. |
| WS | /v1/ws/chat | WebSocket endpoint for streaming chat. Client sends message frames; server streams tokens/events (delta tokens, tool/file status, final). |
| POST | /v1/files | Request an upload session; returns signed upload URL(s) and fileId(s). |
| GET | /v1/files/{fileId} | Fetch file metadata and processing status (scanned/parsed/ready). |
| GET | /v1/search | Search conversations/messages by query, filters (date, model, tags), and pagination. |
| POST | /v1/share | Create a public share link for a conversation snapshot with optional redaction rules. |
| GET | /v1/share/{shareId} | Retrieve shared conversation snapshot for public viewing (read-only). |
| GET | /v1/usage | Return current usage, remaining quotas, and recent cost estimates for the authenticated user. |
| GET | /v1/admin/metrics | Admin-only: aggregated metrics (DAU, messages, token volume, costs, provider error rates/latency). |

Scalability Strategy

  • Global active-active deployment across multiple regions (at least 3) with GeoDNS steering to the nearest healthy region.
  • WebSocket Gateways scale horizontally behind Envoy with connection-aware load balancing; services stay stateless and store only ephemeral connection metadata in Redis.
  • The hot path (quota check, context fetch, LLM streaming) is optimized for TTFT by: (1) precomputing and caching conversation summaries, (2) limiting the context window with rolling summarization, (3) parallelizing context fetch and file metadata fetch, and (4) using hedged requests to LLM providers after a short delay when p95 latency rises (see the sketch below).
  • Conversation history writes are strongly consistent using distributed SQL with synchronous replication and tuned transaction boundaries (persist the user message immediately; persist the assistant message incrementally with periodic checkpoints, then finalize).
  • Kafka decouples indexing/analytics and supports replay. Elasticsearch scales by sharding by tenant/time; ClickHouse scales via distributed tables and partitioning by date/model.
  • Rate limiting uses Redis Cluster with key hashing by userId to spread load; per-tier policies are cached at gateways.
  • For 500M messages/day, partition Kafka topics by conversationId hash and use consumer groups for the Search and Metering pipelines.
  • LLM adapters implement circuit breakers, bulkheads per provider, and region-aware routing; self-hosted vLLM provides a fallback capacity pool for reliability and cost control.
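The hedged-request idea in the list above could look roughly like this in Go: the primary provider is called immediately, a fallback is raced only if the primary is slow or fails, and the loser is cancelled. Function names and the string return type are simplifications of a streaming interface, not the orchestrator's actual API:

```go
package llm

import (
	"context"
	"time"
)

// callFn stands in for a streaming call to one LLM backend.
type callFn func(ctx context.Context) (string, error)

// Hedged issues the primary call immediately and, if no result arrives within
// hedgeAfter, races a secondary provider. The first success wins and the
// remaining call is cancelled, bounding tail latency and duplicate cost.
func Hedged(ctx context.Context, primary, secondary callFn, hedgeAfter time.Duration) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels whichever call is still running

	type result struct {
		out string
		err error
	}
	results := make(chan result, 2)
	launch := func(fn callFn) {
		go func() {
			out, err := fn(ctx)
			results <- result{out, err}
		}()
	}

	launch(primary)
	timer := time.NewTimer(hedgeAfter)
	defer timer.Stop()

	secondaryStarted := false
	pending := 1
	var lastErr error
	for pending > 0 {
		select {
		case <-timer.C:
			if !secondaryStarted { // primary is slow: hedge with the fallback
				secondaryStarted = true
				pending++
				launch(secondary)
			}
		case r := <-results:
			pending--
			if r.err == nil {
				return r.out, nil // first success wins; cancel() stops the other
			}
			lastErr = r.err
			if !secondaryStarted { // primary failed fast: fail over immediately
				secondaryStarted = true
				pending++
				launch(secondary)
			}
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	return "", lastErr
}
```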

Trade-offs

Use YugabyteDB (distributed SQL) for conversation history instead of DynamoDB/Cassandra

  • Strong consistency and SQL transactions simplify immediate-consistency requirements
  • Secondary indexes and relational modeling for conversations/messages/shares
  • Multi-region replication and HA with familiar Postgres ecosystem
  • Higher operational complexity and cost than single-region Postgres
  • Write latency can increase with synchronous multi-region replication
  • Careful schema/partition design needed to avoid hotspots

WebSocket Gateway as a separate tier from REST API services

  • Optimized for long-lived connections and high concurrency (100K+ per region)
  • Isolates streaming workloads from standard API traffic
  • Simplifies backpressure handling and connection lifecycle management
  • Additional component to operate and secure
  • More complex debugging across gateway-orchestrator boundary

Redis-based quota enforcement in the hot path

  • Very low latency suitable for TTFT < 500ms
  • Supports token bucket algorithms and tier-based policies
  • Reduces load on primary databases
  • Distributed counters require careful design for correctness (race conditions)
  • Redis outages can block traffic unless graceful degradation is implemented

Kafka event-driven pipelines for search indexing and cost analytics

  • Decouples latency-critical chat from heavy indexing/analytics
  • Enables replay, backfills, and multiple consumers
  • Handles very high throughput (500M messages/day)
  • Eventual consistency for search/analytics (not for core conversation reads)
  • Requires schema governance and exactly-once/at-least-once considerations

Elasticsearch for conversation search

  • Best-in-class full-text search, faceting, and relevance tuning
  • Scales horizontally via sharding and replicas
  • Rich query DSL for product features
  • Operational overhead: shard sizing, reindexing, cluster tuning
  • Index lag (seconds) unless aggressively tuned

LLM adapter with automatic failover and hedged requests

  • Improves availability and tail latency under provider issues (see the failover sketch after this list)
  • Abstracts provider-specific streaming formats and pricing
  • Supports routing by cost/performance/tier
  • Complexity in maintaining consistent user experience across providers
  • Risk of duplicate costs with hedged requests if not carefully canceled
  • Provider output differences can affect response consistency
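A minimal sketch of per-provider circuit breaking with ordered failover, assuming sony/gobreaker; the thresholds, provider ordering, and `Completion` type are illustrative assumptions rather than the adapter layer's actual interface:

```go
package llm

import (
	"context"
	"errors"
	"time"

	"github.com/sony/gobreaker"
)

// Completion is a placeholder for the adapter layer's streaming result.
type Completion struct{ Text string }

type provider struct {
	name string
	call func(ctx context.Context, prompt string) (Completion, error)
	cb   *gobreaker.CircuitBreaker
}

func newProvider(name string, call func(ctx context.Context, prompt string) (Completion, error)) provider {
	return provider{
		name: name,
		call: call,
		cb: gobreaker.NewCircuitBreaker(gobreaker.Settings{
			Name:    name,
			Timeout: 30 * time.Second, // how long the breaker stays open
			ReadyToTrip: func(c gobreaker.Counts) bool {
				// Trip after 5 consecutive failures; tune per provider SLO.
				return c.ConsecutiveFailures >= 5
			},
		}),
	}
}

// Complete walks the provider list in priority order, skipping any whose
// breaker is open, so a failing provider is bypassed immediately instead of
// waiting out its timeout on every request.
func Complete(ctx context.Context, providers []provider, prompt string) (Completion, error) {
	var lastErr error = errors.New("no providers configured")
	for _, p := range providers {
		out, err := p.cb.Execute(func() (interface{}, error) {
			return p.call(ctx, prompt)
		})
		if err == nil {
			return out.(Completion), nil
		}
		lastErr = err // includes gobreaker.ErrOpenState when the breaker is open
	}
	return Completion{}, lastErr
}
```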

Billing-grade cost ledger in PostgreSQL separate from OLTP conversation store

  • Strong auditability and immutability patterns for billing
  • Protects core conversation store from analytics/billing query load
  • Simplifies reconciliation and dispute handling
  • Data duplication and additional ETL/stream processing
  • Requires reconciliation logic between provider usage and internal metering

S3 + signed URLs for file uploads and downloads

  • Highly scalable and cost-effective for large binary storage
  • Offloads bandwidth from application services
  • Supports lifecycle policies and encryption controls
  • Requires careful access control to prevent link leakage
  • Additional pipeline complexity for scanning/parsing and status tracking