HLD-Bench Report

Generated on 2/8/2026, 3:19:56 PM · 7 run(s) across 1 problem(s)

Total Runs: 7 · Problems: 1 · Models: 7 · Providers: 3

ChatGPT-like Conversational AI Web Application - High-Level Design

⏱ 355.8s 📅 2/8/2026, 2:32:04 PM 📁 design-chatgpt-claude-opus-4-6

Overview

This system is a large-scale conversational AI platform serving 20 million daily active users generating 500 million messages per day. The architecture follows a microservices pattern with clear separation between the real-time streaming layer, conversation management, LLM orchestration, and supporting services. The core design centers on a streaming gateway that delivers token-by-token responses with sub-500ms time-to-first-token, backed by an LLM orchestration layer that abstracts multiple model backends (OpenAI, Anthropic, self-hosted) with automatic failover. Conversations are persisted in a sharded PostgreSQL cluster for immediate consistency, with Redis caching for hot conversation context, and S3 for file/multimodal uploads. The system is designed for multi-region deployment with regional streaming gateways, global CDN for static assets, and a robust rate-limiting and billing pipeline that tracks per-request token costs. Key architectural decisions include using Server-Sent Events (SSE) over WebSocket for streaming simplicity, CQRS for separating write-heavy message ingestion from read-heavy history/search workloads, and an event-driven architecture via Kafka for decoupling billing, analytics, and audit concerns from the critical path. The admin dashboard is powered by a dedicated analytics pipeline built on ClickHouse for real-time usage monitoring and cost attribution.

Requirements

Functional

  • User registration, authentication (email, OAuth), and session management with JWT tokens
  • Create, continue, and manage multi-turn conversation threads with full context retention
  • Real-time streaming of LLM responses token-by-token to the client
  • Persistent conversation history with full-text search and folder/tag organization
  • Model selection allowing users to choose between different LLM backends per conversation
  • File upload support for images, PDFs, and documents with multimodal input to LLMs
  • Share conversations via unique public links with optional expiration
  • Admin dashboard for monitoring usage metrics, costs, active users, and system health
  • Rate limiting and tiered usage quotas (free, plus, enterprise) with enforcement
  • Markdown rendering support in responses including code blocks, tables, LaTeX, and syntax highlighting

Non-Functional

  • Time to first token must be under 500ms for streaming responses
  • Support at least 100K concurrent WebSocket/SSE connections per region
  • Conversation history must be durable with immediate consistency (no eventual consistency for user-facing reads)
  • Handle LLM backend failures with automatic failover to alternative providers within 2 seconds
  • Per-request cost tracking for accurate billing with less than 0.1% error rate
  • 99.95% availability SLA for the overall platform
  • Horizontal scalability to handle 500M messages/day (~5,800 messages/sec average, 20K+ peak)
  • P99 API response latency under 200ms for non-LLM endpoints (history, search, auth)
  • Data encryption at rest and in transit, SOC2 compliance readiness
  • Multi-region deployment with data residency compliance for EU/US users
  • Graceful degradation under load — queue overflow should return informative wait messages rather than errors

Architecture Diagram

⚠ Mermaid rendering can be flaky with LLM-generated diagrams. If the diagram fails, copy the source and paste it into mermaid.live for reliable rendering.

Components

Component | Technology | Responsibility | Justification
--- | --- | --- | ---
API Gateway | Kong Gateway (on Kubernetes) | Entry point for all client requests. Handles TLS termination, request routing, authentication verification, rate limiting enforcement, and load balancing across backend services. | Kong provides built-in rate limiting, JWT validation, request transformation, and plugin ecosystem. It handles both HTTP and WebSocket upgrade requests, supports declarative config via Kubernetes CRDs, and scales horizontally. Preferred over AWS API Gateway for lower latency and more control over WebSocket handling.
Streaming Gateway | Custom Go service with nhooyr/websocket | Manages long-lived SSE/WebSocket connections for real-time token streaming from LLM backends to clients. Handles connection lifecycle, heartbeats, backpressure, and reconnection. | Go excels at handling massive concurrent connections with minimal memory overhead (goroutines use ~4KB vs threads). A custom service allows precise control over backpressure, connection draining, and graceful failover. Each instance can handle 50K+ concurrent connections, needing only 2-3 instances per region for the 100K target.
Auth Service | Node.js with Passport.js + Redis session store | User registration, login (email/password, Google OAuth, GitHub OAuth), JWT issuance and refresh, session management, and password reset flows. | Passport.js has mature OAuth provider integrations. Node.js is well-suited for I/O-bound auth workflows. Redis stores refresh tokens and session blacklists for O(1) lookups. JWTs are short-lived (15min) with Redis-backed refresh tokens for revocation capability.
Conversation Service | Python (FastAPI) | Core business logic for creating conversations, appending messages, managing conversation metadata (titles, folders, tags), and serving conversation history with pagination. | FastAPI provides async support, automatic OpenAPI docs, and excellent Python ecosystem integration for ML/AI tooling. Python aligns with the broader AI/ML ecosystem making it easy to integrate tokenizers, prompt engineering libraries, and model-specific utilities.
LLM Orchestrator | Python (FastAPI) with LiteLLM | Abstracts multiple LLM backends, handles model routing based on user selection, manages prompt assembly with conversation context, implements retry/failover logic, and streams tokens back to the Streaming Gateway (see the failover sketch after this table). | LiteLLM provides a unified interface to 100+ LLM providers (OpenAI, Anthropic, Cohere, self-hosted vLLM). FastAPI's async streaming support enables efficient token forwarding. The orchestrator implements circuit breaker patterns per backend and automatic failover when a provider returns errors or exceeds latency thresholds.
File Processing Service | Python with Celery workers | Handles file upload, validation, virus scanning, format conversion, image resizing, OCR for documents, and preparing multimodal inputs for LLM consumption. | File processing is CPU-intensive and variable in duration — Celery workers can scale independently. Python has excellent libraries for image processing (Pillow), PDF extraction (PyMuPDF), and OCR (Tesseract). Workers pull from a Redis-backed task queue for reliable processing.
Search Service | Elasticsearch 8.x | Full-text search across conversation history, semantic search for finding relevant past conversations, and powering the organization/filtering UI. | Elasticsearch provides fast full-text search with relevance scoring, supports nested document structures ideal for conversations with messages, and offers built-in vector search (kNN) for semantic search. The inverted index is highly optimized for the search-heavy read pattern of conversation history.
Rate Limiter & Quota Service | Redis Cluster with Lua scripts | Enforces per-user, per-tier rate limits (requests/min, tokens/day), tracks usage quotas, and signals the API gateway to throttle or reject requests. | Redis provides sub-millisecond rate limit checks using sliding window counters implemented via Lua scripts for atomicity. Redis Cluster enables horizontal scaling. Token bucket and sliding window algorithms are implemented for different rate limiting needs (burst vs sustained).
Billing & Cost Tracking Service | Go service consuming from Kafka | Records per-request token usage and costs, aggregates billing data per user/organization, generates invoices, and feeds cost data to the admin dashboard. | Go provides the performance needed for high-throughput event processing. Kafka consumption decouples billing from the critical request path — if billing is slow, it doesn't affect user experience. Go's strong typing and low GC pauses ensure accurate, reliable cost aggregation at 500M messages/day.
Admin Dashboard Backend | Node.js (Express) + ClickHouse queries | Serves aggregated analytics, real-time usage metrics, cost reports, user management, system health monitoring, and model performance dashboards. | Node.js is efficient for the I/O-bound dashboard API pattern. ClickHouse provides sub-second analytical queries over billions of rows for real-time dashboards. The admin backend is a lightweight API layer that translates dashboard queries into optimized ClickHouse SQL.
CDN & Frontend | CloudFront CDN + Next.js (React) | Serves the React-based SPA, handles static assets, and provides edge caching for shared conversation pages. | Next.js provides SSR for shared conversation pages (SEO, social previews), static generation for marketing pages, and CSR for the interactive chat UI. CloudFront provides global edge caching with ~20ms latency to users worldwide. React's ecosystem has excellent Markdown rendering libraries (react-markdown, react-syntax-highlighter).
Event Bus | Apache Kafka (MSK) | Decouples services by publishing domain events (message_created, conversation_shared, tokens_consumed) for downstream consumers like billing, analytics, search indexing, and notifications. | Kafka handles the 500M+ events/day throughput with ease, provides exactly-once semantics for billing accuracy, supports multiple consumer groups (billing, analytics, search indexer), and offers configurable retention for replay capability. MSK reduces operational burden.
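
The failover behavior described for the LLM Orchestrator above can be sketched as follows. This is a minimal illustration, assuming LiteLLM's async `acompletion` API and OpenAI-style streaming chunks; the backend list and model names are placeholders, and the 2-second first-token budget comes from the non-functional requirements.

```python
# Minimal failover sketch for the LLM Orchestrator, assuming LiteLLM's
# `acompletion` API and OpenAI-style streaming chunks. Backend names and
# timeouts are illustrative.
import asyncio
from litellm import acompletion

BACKENDS = ["openai/gpt-4o", "anthropic/claude-sonnet-4-5", "hosted_vllm/llama-3-70b"]
FIRST_TOKEN_TIMEOUT_S = 2.0  # failover budget from the non-functional requirements

async def stream_completion(messages: list[dict]):
    last_err = None
    for model in BACKENDS:
        try:
            stream = await acompletion(model=model, messages=messages, stream=True)
            it = stream.__aiter__()
            # If the first chunk misses the 2s budget, move to the next backend.
            first = await asyncio.wait_for(it.__anext__(), FIRST_TOKEN_TIMEOUT_S)
            yield first.choices[0].delta.content or ""
            async for chunk in it:
                yield chunk.choices[0].delta.content or ""
            return
        except Exception as err:  # provider error, timeout, or mid-stream failure
            last_err = err
            continue  # note: tokens already emitted are not rolled back
    raise RuntimeError(f"all LLM backends failed: {last_err}")
```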

Data Flow

⚠ Mermaid rendering can be flaky with LLM-generated diagrams. If the diagram fails, copy the source and paste it into mermaid.live for reliable rendering.

Data Storage

Store | Type | Justification
--- | --- | ---
PostgreSQL (Citus) | sql | Primary data store for users, conversations, messages, and billing records. Citus extension enables horizontal sharding by user_id, distributing the 500M messages/day write load across multiple nodes while maintaining strong consistency and ACID transactions within a user's data. Immediate consistency requirement rules out eventually-consistent NoSQL options. Sharding by user_id ensures all conversation data for a user is co-located for efficient joins and queries.
Redis Cluster | cache | Multi-purpose caching layer: (1) Conversation context cache — stores the last N messages of active conversations to avoid DB reads on every LLM request, reducing P99 latency. (2) Session/JWT blacklist store for auth. (3) Rate limiting counters with atomic Lua scripts (see the sketch after this table). (4) Celery task broker for file processing. Redis Cluster provides automatic partitioning across 6+ nodes with built-in failover.
Elasticsearch 8.x | search | Powers full-text search across conversation history with BM25 relevance scoring and supports vector search (kNN) for semantic similarity. Conversations are indexed asynchronously via Kafka consumers, so search indexing doesn't block the critical write path. Supports nested documents for conversation-message hierarchy and faceted filtering by date, model, folder.
Amazon S3 | blob | Stores uploaded files (images, PDFs, documents) and conversation export archives. S3 provides 11 nines of durability, lifecycle policies for cost optimization (move old files to Glacier), and presigned URLs for secure direct client uploads. Multipart upload support handles large files efficiently.
Apache Kafka (MSK) | queue | Event streaming backbone carrying domain events (message_created, tokens_consumed, file_uploaded, conversation_shared) to downstream consumers. Kafka's partitioned log model supports parallel consumption by billing, search indexer, and analytics pipelines independently. At 500M messages/day, Kafka's throughput (millions of msgs/sec per cluster) provides massive headroom. Exactly-once semantics ensure billing accuracy.
ClickHouse | sql | Columnar OLAP database for real-time analytics powering the admin dashboard. Handles aggregation queries over billions of events (messages, token usage, costs) with sub-second response times. MergeTree engine provides efficient time-series storage with automatic data compaction. Chosen over Redshift for lower latency on interactive queries and over Druid for simpler operations.
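
The sliding-window counters that the Rate Limiter & Quota Service keeps in Redis (item 3 of the Redis Cluster entry above) can be expressed as a short Lua script invoked from Python. A minimal sketch assuming redis-py; the key layout, window, and limits are illustrative.

```python
# Sliding-window rate limit check, executed atomically in Redis via Lua.
# Key names, window length, and limits are illustrative assumptions.
import time
import uuid
import redis

r = redis.Redis(host="localhost", port=6379)

SLIDING_WINDOW = r.register_script("""
local key    = KEYS[1]
local now_ms = tonumber(ARGV[1])
local window = tonumber(ARGV[2])   -- window length in ms
local limit  = tonumber(ARGV[3])
redis.call('ZREMRANGEBYSCORE', key, 0, now_ms - window)
if redis.call('ZCARD', key) >= limit then
  return 0
end
redis.call('ZADD', key, now_ms, ARGV[4])
redis.call('PEXPIRE', key, window)
return 1
""")

def allow_request(user_id: str, limit_per_min: int) -> bool:
    now_ms = int(time.time() * 1000)
    member = f"{now_ms}-{uuid.uuid4().hex}"   # unique entry for this request
    ok = SLIDING_WINDOW(keys=[f"rl:{user_id}:1m"],
                        args=[now_ms, 60_000, limit_per_min, member])
    return ok == 1
```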

API Design

Method | Endpoint | Description
--- | --- | ---
POST | /api/v1/auth/login | Authenticate user with email/password or OAuth token. Returns short-lived JWT access token (15min) and long-lived refresh token. Sets secure httpOnly cookie for refresh token.
POST | /api/v1/auth/refresh | Exchange a valid refresh token for a new JWT access token. Implements refresh token rotation — old token is invalidated in Redis upon use.
POST | /api/v1/conversations | Create a new conversation thread. Accepts optional model selection, system prompt, and folder assignment. Returns conversation_id and initial metadata.
GET | /api/v1/conversations | List user's conversations with pagination, filtering (by folder, date range, model), and sorting. Returns conversation metadata including title, last message timestamp, message count, and model used.
POST | /api/v1/conversations/{conversation_id}/messages | Send a new user message to a conversation. Triggers LLM completion. Returns message_id and a stream_url for the client to connect to for receiving the streamed response. Accepts optional file attachments by reference (file_ids from upload).
GET | /api/v1/conversations/{conversation_id}/messages | Retrieve paginated message history for a conversation. Supports cursor-based pagination (before/after message_id). Returns messages with role, content, timestamp, token count, and model info.
GET | /api/v1/stream/{message_id} | Server-Sent Events (SSE) endpoint for streaming LLM response tokens. Client connects after sending a message. Receives token-by-token events, metadata events (model, token count), and a final done event with complete message and usage stats. A server-side sketch follows this table.
POST | /api/v1/files/upload | Upload a file (image, PDF, document) for use in conversations. Returns a presigned S3 URL for direct upload and a file_id for referencing in messages. Validates file type and size limits per user tier.
POST | /api/v1/conversations/{conversation_id}/share | Generate a public sharing link for a conversation. Accepts optional expiration time and whether to include future messages. Returns a unique share URL that can be accessed without authentication.
GET | /api/v1/search | Full-text search across user's conversation history. Accepts query string, filters (date range, model, folder), and pagination. Returns matching conversations and message snippets with highlighted matches.
PATCH | /api/v1/conversations/{conversation_id} | Update conversation metadata including title, folder assignment, tags, pinned status, and archive status. Supports partial updates.
DELETE | /api/v1/conversations/{conversation_id} | Soft-delete a conversation and all its messages. Data is retained for 30 days before permanent deletion. Triggers cleanup of associated search index entries and cached context.
GET | /api/v1/user/usage | Retrieve current user's usage statistics including tokens consumed today/this month, message count, rate limit status, and quota remaining for their tier.
GET | /api/v1/admin/dashboard/metrics | Admin-only endpoint returning aggregated platform metrics: DAU, messages/hour, token costs by model, error rates, P99 latencies, active connections, and top users by usage. Powered by ClickHouse queries.
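
The /api/v1/stream/{message_id} endpoint above maps naturally onto FastAPI's StreamingResponse. A minimal server-side sketch; the token source below is a stand-in for the real orchestrator stream, and the event names and payload shape are illustrative.

```python
# SSE streaming sketch for GET /api/v1/stream/{message_id}. The token source is
# a stand-in for the LLM Orchestrator stream; event names are illustrative.
import asyncio
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def token_source(message_id: str):
    # Stand-in for the orchestrator; yields tokens as they arrive.
    for token in ["Hello", ", ", "world", "!"]:
        await asyncio.sleep(0.05)
        yield token

@app.get("/api/v1/stream/{message_id}")
async def stream_message(message_id: str):
    async def event_source():
        async for token in token_source(message_id):
            yield f"event: token\ndata: {json.dumps({'t': token})}\n\n"
            # A periodic comment line (": keep-alive\n\n") can be interleaved to
            # defeat proxy buffering, as noted in the trade-offs below.
        yield "event: done\ndata: {}\n\n"
    return StreamingResponse(event_source(), media_type="text/event-stream")
```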

Scalability Strategy

The system employs a multi-layered horizontal scaling strategy designed to handle 20M DAU and 500M messages/day with significant headroom:

**Compute Scaling (Kubernetes):** All core services run on Kubernetes (EKS) with Horizontal Pod Autoscaler (HPA) based on CPU, memory, and custom metrics (active connections for Streaming Gateway, queue depth for File Processing). The Streaming Gateway scales based on active WebSocket connections with a target of 40K connections per pod (Go's goroutine efficiency allows this). The LLM Orchestrator scales based on in-flight requests to LLM backends.

**Database Scaling (Citus Sharded PostgreSQL):** Conversations and messages are sharded by user_id using Citus, distributing data across 32+ worker nodes. This ensures all data for a single user is co-located (avoiding cross-shard queries) while distributing the 500M daily message writes evenly. Read replicas per shard handle read-heavy workloads (conversation history browsing). Connection pooling via PgBouncer (256 connections per pool) prevents connection exhaustion.

**Caching Strategy:** Redis Cluster with 12+ nodes provides the caching layer. Active conversation contexts (last 10 messages) are cached with 1-hour TTL, eliminating ~80% of database reads for the hot path (LLM context assembly). Cache-aside pattern with write-through for conversation metadata ensures consistency (see the sketch after this section).

**Event Processing Scaling:** Kafka topics are partitioned by user_id (128 partitions per topic), allowing consumer groups to scale horizontally. Billing consumers run 32 instances processing events in parallel. Search indexer runs 16 instances with bulk indexing to Elasticsearch.

**Multi-Region Deployment:** The system deploys in US-East, US-West, and EU-West regions. Each region has its own Streaming Gateway fleet, Kong Gateway, and Redis cache. PostgreSQL uses Citus with the primary write cluster in one region and fast read replicas in others. For users requiring data residency (EU), a fully independent EU cluster is maintained. Global traffic routing via Route53 latency-based routing directs users to the nearest region.

**LLM Backend Scaling:** The LLM Orchestrator implements a weighted round-robin across multiple API keys per provider, connection pooling to self-hosted vLLM instances (which auto-scale GPU nodes based on queue depth), and circuit breakers per backend. Self-hosted vLLM runs on p4d.24xlarge instances with auto-scaling groups targeting 70% GPU utilization.

**CDN and Static Scaling:** CloudFront serves all static assets and SSR pages from 400+ edge locations. Shared conversation pages are cached at the edge with 5-minute TTL and cache invalidation on update.

**Graceful Degradation:** Under extreme load, the system implements progressive degradation: (1) reduce max context window length, (2) disable search indexing temporarily, (3) queue non-streaming requests, (4) serve cached responses for identical recent queries, (5) display wait queue UI rather than errors.
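
The cache-aside/write-through pattern in the caching strategy above can be sketched as follows. The DB helpers are hypothetical placeholders for the Citus-backed queries; the 10-message window and 1-hour TTL mirror the figures stated above.

```python
# Cache-aside sketch for hot conversation context (last 10 messages, 1h TTL).
# `fetch_last_messages_from_db` / `insert_message_into_db` are hypothetical
# placeholders for the Citus-backed queries.
import json
import redis

r = redis.Redis()
CONTEXT_TTL_S = 3600
CONTEXT_WINDOW = 10

def get_context(conversation_id: str) -> list[dict]:
    key = f"ctx:{conversation_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)                       # cache hit: no DB read
    messages = fetch_last_messages_from_db(conversation_id, CONTEXT_WINDOW)
    r.setex(key, CONTEXT_TTL_S, json.dumps(messages))
    return messages

def append_message(conversation_id: str, message: dict) -> None:
    insert_message_into_db(conversation_id, message)    # durable write first
    # Write-through: keep the cached window aligned with the DB.
    window = (get_context(conversation_id) + [message])[-CONTEXT_WINDOW:]
    r.setex(f"ctx:{conversation_id}", CONTEXT_TTL_S, json.dumps(window))
```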

Trade-offs

SSE (Server-Sent Events) for streaming instead of pure WebSocket

  • SSE works over standard HTTP/2 — no special proxy configuration needed, works through all CDNs and load balancers
  • Automatic reconnection built into the EventSource API with last-event-id support
  • Simpler server implementation — unidirectional stream matches the LLM response pattern
  • Better compatibility with HTTP-based auth (cookies, headers) without custom handshake logic
  • Easier to load balance since connections are standard HTTP
  • Unidirectional — cannot send client messages over the same connection (requires separate POST requests)
  • Limited to ~6 concurrent connections per domain in HTTP/1.1 (mitigated by HTTP/2 multiplexing)
  • No binary frame support — all data must be text-encoded (acceptable for token streaming)
  • Some older corporate proxies may buffer SSE events (mitigated by including periodic comments as keep-alive)

PostgreSQL with Citus sharding instead of NoSQL (DynamoDB/Cassandra)

  • Strong consistency guarantees satisfy the immediate consistency requirement for conversation history
  • SQL expressiveness enables complex queries for search, filtering, and admin analytics without separate ETL
  • ACID transactions ensure message ordering and conversation integrity
  • Citus provides horizontal scaling while preserving PostgreSQL's full feature set (JSONB, CTEs, window functions)
  • Existing team expertise with PostgreSQL reduces operational risk
  • Cross-shard queries (e.g., global admin analytics) are more expensive than single-shard queries
  • Schema migrations on sharded tables require careful coordination
  • Higher operational complexity compared to fully managed DynamoDB
  • Connection management requires PgBouncer pooling layer adding another component
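
For concreteness, the user_id sharding weighed above is declared in Citus with a single function call on the coordinator. A minimal sketch using psycopg; the schema and connection string are illustrative.

```python
# Citus sharding sketch: create the table on the coordinator, then distribute
# it by user_id. Schema and connection string are illustrative.
import psycopg

CREATE_MESSAGES = """
CREATE TABLE IF NOT EXISTS messages (
    user_id         bigint      NOT NULL,
    conversation_id uuid        NOT NULL,
    message_id      uuid        NOT NULL,
    role            text        NOT NULL,
    content         text        NOT NULL,
    created_at      timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (user_id, conversation_id, message_id)
)
"""

with psycopg.connect("postgresql://coordinator:5432/chatdb") as conn:
    conn.execute(CREATE_MESSAGES)
    # All rows for a user land on one worker, so per-user queries and joins
    # stay single-shard; cross-user analytics become scatter-gather queries.
    conn.execute("SELECT create_distributed_table('messages', 'user_id')")
```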

Kafka as event bus instead of simpler alternatives (RabbitMQ, SQS)

  • Supports multiple independent consumer groups — billing, search, analytics all consume the same events
  • Message replay capability enables reprocessing if a consumer fails or needs reindexing
  • Exactly-once semantics (with idempotent producers) critical for billing accuracy
  • Partitioned log model handles 500M+ events/day with low latency
  • MSK (managed) reduces operational overhead
  • Higher complexity than SQS/RabbitMQ — requires understanding of partitions, consumer groups, offsets
  • Minimum 3-broker cluster even for dev/staging environments increases infrastructure cost
  • Message ordering only guaranteed within a partition (mitigated by partitioning by user_id)
  • Consumer lag monitoring and rebalancing require operational attention

LiteLLM as the unified LLM abstraction layer

  • Single interface to 100+ LLM providers reduces integration code significantly
  • Built-in retry logic, streaming support, and token counting per provider
  • Easy to add new model backends without changing orchestration code
  • Active open-source community with frequent updates for new models
  • Additional abstraction layer adds latency (~5-10ms) to every LLM call
  • May not expose provider-specific optimizations or features immediately
  • Dependency on third-party library for critical path — must pin versions carefully
  • Custom failover logic still needed on top of LiteLLM's built-in retries

Separate Streaming Gateway service (Go) from Conversation Service (Python)

  • Go handles 50K+ concurrent connections per instance with minimal memory — dramatically reduces infrastructure cost for the connection-heavy streaming workload
  • Independent scaling — streaming connections scale differently from CRUD API operations
  • Fault isolation — a crash in conversation logic doesn't drop active streams
  • Go's deterministic low-latency GC prevents stream stuttering
  • Two services to maintain for what is conceptually one user action (send message + receive stream)
  • Coordination complexity — the Conversation Service must signal the Streaming Gateway when to start streaming
  • Different programming languages increase team skill requirements
  • Additional network hop between Conversation Service and Streaming Gateway adds ~2ms latency

ClickHouse for analytics instead of extending PostgreSQL or using a data warehouse

  • Columnar storage provides 10-100x faster analytical queries compared to row-based PostgreSQL
  • Sub-second query performance on billions of rows enables real-time admin dashboards
  • Excellent compression (10-20x) reduces storage costs for high-volume event data
  • Native support for time-series aggregations and materialized views for pre-computed metrics
  • Another database system to operate and monitor
  • Not suitable for transactional workloads — purely append-optimized
  • Limited UPDATE/DELETE capabilities make data corrections cumbersome
  • Requires separate data pipeline from Kafka to ingest events

Soft-delete conversations with 30-day retention instead of immediate hard delete

  • Users can recover accidentally deleted conversations within the retention window
  • Simplifies billing audits — all data is available for the billing period
  • Background cleanup job can batch-delete efficiently during off-peak hours
  • Compliance teams can review data before permanent deletion if needed
  • Increases storage costs — deleted data occupies space for 30 days
  • All queries must filter on is_deleted flag adding minor overhead
  • GDPR right-to-erasure may require faster hard deletion for EU users — requires a separate expedited deletion pipeline
  • Search index must also handle soft-delete filtering
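
A short sketch of the soft-delete and delayed-purge flow weighed above, matching the 30-day retention window; table and column names are illustrative, not the actual schema.

```python
# Soft-delete and delayed purge sketch (30-day retention). Table and column
# names are illustrative.
import psycopg

def soft_delete_conversation(conn: psycopg.Connection, user_id: int, conversation_id: str) -> None:
    conn.execute(
        "UPDATE conversations SET is_deleted = true, deleted_at = now() "
        "WHERE user_id = %s AND conversation_id = %s",
        (user_id, conversation_id),
    )
    # Search-index and cache cleanup would be triggered asynchronously via Kafka.

def purge_expired(conn: psycopg.Connection) -> None:
    # Off-peak background job: hard-delete rows past the retention window.
    conn.execute(
        "DELETE FROM conversations "
        "WHERE is_deleted AND deleted_at < now() - interval '30 days'"
    )
```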
⏱ 342.8s 📅 2/8/2026, 2:24:03 PM 📁 design-chatgpt-claude-sonnet-4-5

Overview

This system design outlines a globally distributed, highly scalable conversational AI platform capable of serving 20 million daily active users with 500 million messages per day. The architecture employs a microservices approach with dedicated services for authentication, conversation management, real-time streaming, and LLM orchestration. The design emphasizes low-latency streaming responses (sub-500ms time to first token), horizontal scalability to support 100K+ concurrent WebSocket connections per region, and robust fault tolerance with automatic LLM backend failover. The system leverages a multi-region deployment with geographic load balancing, employs PostgreSQL with read replicas for durable conversation storage, Redis for session management and caching, and Kafka for asynchronous event processing. A dedicated LLM Gateway service abstracts multiple LLM providers (OpenAI, Anthropic, custom models), implements intelligent routing, rate limiting, and cost tracking. Real-time bidirectional communication is handled via WebSocket connections through a scalable connection manager, while a CDN delivers static assets and cached content globally.

Requirements

Functional

  • User registration, login, and session management with JWT tokens
  • Create, read, update, and delete conversation threads
  • Multi-turn conversations with full context retention across messages
  • Real-time streaming of LLM responses token-by-token via WebSocket
  • Support for multiple LLM backends with user-selectable models
  • Conversation history search and organization (folders, tags, timestamps)
  • File upload and multimodal input processing (images, PDFs, documents)
  • Generate shareable public links for conversations with privacy controls
  • Markdown rendering support including code syntax highlighting
  • Admin dashboard for usage analytics, cost monitoring, and user management
  • Rate limiting based on user tier (free, pro, enterprise)
  • Usage quota enforcement and billing integration

Non-Functional

  • Support 20 million daily active users and 500 million messages/day
  • Time to first token (TTFT) must be under 500ms
  • Handle 100K+ concurrent WebSocket connections per region
  • 99.9% availability with automatic failover for LLM backend failures
  • Conversation data must be immediately consistent and durable
  • Horizontal scalability for all stateless services
  • Geographic distribution across multiple regions for low latency
  • Per-request cost tracking with 99.99% accuracy for billing
  • Support message throughput of 5,800 messages/second sustained
  • Data retention for at least 90 days with archival for older conversations
  • Security compliance (encryption at rest and in transit, GDPR, SOC2)

Architecture Diagram

⚠ Mermaid rendering can be flaky with LLM-generated diagrams. If the diagram fails, copy the source and paste it into mermaid.live for reliable rendering.

Components

Component | Technology | Responsibility | Justification
--- | --- | --- | ---
API Gateway | Kong Gateway with OpenResty (Nginx + Lua) | Single entry point for all client requests; handles routing, authentication validation, rate limiting, request/response transformation, and SSL termination | Kong provides high-performance reverse proxy with built-in plugins for authentication, rate limiting, logging, and circuit breaking. Handles 10K+ RPS per instance with horizontal scalability and proven in production at scale.
Authentication Service | Node.js with Passport.js + Auth0 for identity management | User registration, login, JWT token issuance and validation, OAuth integration, session management, and user profile management | Auth0 provides enterprise-grade authentication with built-in security features, MFA, social login, and scales automatically. Node.js offers fast token validation and can handle 5K+ auth requests per second per instance.
WebSocket Connection Manager | Go with Gorilla WebSocket library, deployed on Kubernetes with HPA | Maintains persistent WebSocket connections, handles connection lifecycle, message routing, presence management, and broadcasts streaming responses to clients | Go excels at concurrent connection handling with lightweight goroutines. Each instance can handle 10K+ concurrent WebSockets with minimal memory overhead. Stateless design allows horizontal scaling based on connection count.
Conversation Service | Java Spring Boot with Spring Data JPA | CRUD operations for conversation threads, message persistence, context window management, conversation search, and thread organization | Spring Boot provides mature transaction management, excellent PostgreSQL integration, and strong consistency guarantees. JPA simplifies complex queries for conversation history and search. Battle-tested at enterprise scale.
LLM Gateway Service | Python with FastAPI and LangChain for LLM orchestration | Abstracts multiple LLM providers, routes requests to appropriate backends, handles streaming, implements retry logic with exponential backoff, tracks costs per request, and provides automatic failover | Python ecosystem has best LLM library support (OpenAI SDK, Anthropic SDK, transformers). FastAPI provides async streaming support essential for token-by-token delivery. LangChain simplifies multi-provider integration and context management.
File Processing Service | Python with Celery for async processing, Tesseract for OCR, PyPDF2 for PDF parsing | Handles file uploads, validates file types and sizes, extracts text from documents (OCR, PDF parsing), processes images for vision models, and stores files in object storage | Python has rich libraries for document processing and image manipulation. Celery provides distributed task queue for async processing of large files without blocking API responses. Can scale workers independently based on queue depth.
Search Service | Elasticsearch with custom analyzers for semantic search | Indexes conversation content, provides full-text search across message history, supports filtering by date, model, and tags | Elasticsearch provides sub-second full-text search across billions of documents. Supports complex queries, filtering, and aggregations. Can be extended with vector embeddings for semantic search. Scales horizontally with sharding.
Rate Limiter Service | Redis with Lua scripts for atomic rate limiting operations | Enforces per-user and per-tier rate limits, quota management, token bucket algorithm implementation, and communicates with billing service | Redis provides in-memory performance (<1ms latency) essential for rate limit checks on every request. Lua scripts ensure atomic operations for token bucket algorithms. Redis Cluster provides high availability and scales to millions of users.
Analytics & Monitoring Service | ClickHouse for OLAP analytics with Grafana for visualization | Collects usage metrics, tracks costs per request and per user, monitors system health, generates reports for admin dashboard | ClickHouse excels at high-volume time-series analytics with billions of rows, providing sub-second query performance for dashboards. Columnar storage reduces costs. Grafana provides rich visualization for admin dashboards.
Notification Service | Node.js with SendGrid for email, Firebase Cloud Messaging for push | Sends email notifications, push notifications, and in-app alerts for quota limits, system updates, and shared conversations | SendGrid provides reliable email delivery with analytics. FCM supports cross-platform push notifications. Node.js event-driven architecture handles high-volume async notifications efficiently.
Share Service | Go with Redis for link metadata caching | Generates unique shareable links for conversations, manages privacy settings and expiration, renders public conversation views | Go provides fast link generation and validation. Redis caches share metadata to avoid database lookups on every public link access. Stateless design allows easy scaling for viral shared conversations.

Data Flow

⚠ Mermaid rendering can be flaky with LLM-generated diagrams. If the diagram fails, copy the source and paste it into mermaid.live for reliable rendering.

Data Storage

Store | Type | Justification
--- | --- | ---
PostgreSQL 15 with Citus extension for horizontal sharding | sql | Primary datastore for users, conversations, messages, and relationships. Citus enables horizontal sharding by user_id to handle billions of messages. JSONB support for flexible message metadata. Strong ACID guarantees ensure conversation consistency. Read replicas handle query load.
Redis Cluster | cache | Multi-purpose: JWT session storage, rate limiting counters, conversation context caching, WebSocket connection metadata, and hot conversation cache. Sub-millisecond latency critical for rate limiting and session validation. Redis Cluster provides automatic sharding and replication.
Elasticsearch 8.x | search | Full-text search across conversation history. Handles complex queries with filters, highlighting, and relevance scoring. Inverted indexes provide fast search across billions of messages. Can be extended with kNN for semantic search using embeddings.
Amazon S3 with CloudFront CDN | blob | Stores uploaded files (images, documents), exported conversations, and shared conversation snapshots. S3 provides 99.999999999% durability, lifecycle policies for cost optimization, and versioning. CloudFront accelerates file delivery globally.
Apache Kafka | queue | Event streaming backbone for async processing: analytics events, usage tracking, cost calculation, audit logs, and notification triggers. Kafka provides durable message storage, replay capability, and scales to millions of events per second. Decouples producers from consumers.
ClickHouse | nosql | Time-series analytics database for usage metrics, cost tracking, and admin dashboards. Optimized for OLAP queries with aggregations across billions of rows. Columnar storage provides 10-100x compression. Real-time ingestion from Kafka.

API Design

Method | Endpoint | Description
--- | --- | ---
POST | /api/v1/auth/register | Register a new user account with email and password, returns JWT access and refresh tokens
POST | /api/v1/auth/login | Authenticate user credentials and issue JWT tokens with user tier information
POST | /api/v1/conversations | Create a new conversation thread, returns conversation_id and initial metadata
GET | /api/v1/conversations/{conversation_id} | Retrieve full conversation thread with all messages, supports pagination and filtering
POST | /api/v1/conversations/{conversation_id}/messages | Send a new message in a conversation, triggers LLM processing, returns message_id for tracking
WS | /ws/v1/stream | WebSocket endpoint for real-time bidirectional communication, streams LLM responses token-by-token, handles connection lifecycle
GET | /api/v1/conversations/search | Full-text search across user's conversation history with filters for date range, model, and tags
POST | /api/v1/files/upload | Upload files for multimodal input, supports images and documents up to 50MB, returns file_id and processing status
POST | /api/v1/conversations/{conversation_id}/share | Generate a public shareable link for a conversation with configurable expiration and privacy settings
GET | /api/v1/models | List available LLM models with capabilities, pricing, and context window information
GET | /api/v1/users/me/usage | Get current user's usage statistics, quota consumption, and rate limit status
GET | /api/v1/admin/analytics/usage | Admin endpoint for aggregated usage metrics, costs by model, and active user statistics
DELETE | /api/v1/conversations/{conversation_id} | Soft delete a conversation thread, marks as deleted but retains for recovery period

Scalability Strategy

**Horizontal Scaling Approach:**

1. **Stateless Services**: All application services (API Gateway, Conversation Service, LLM Gateway, Auth Service, WebSocket Manager) are stateless and containerized with Kubernetes. Auto-scaling policies based on CPU (70% threshold) and custom metrics (concurrent connections for WS Manager, queue depth for File Processing).
2. **WebSocket Connection Distribution**: Each WebSocket Manager instance handles 10K concurrent connections. With 100K target per region, deploy 10+ instances with sticky session routing at the load balancer level using consistent hashing on user_id (see the sketch after this section). Connection metadata stored in Redis allows any instance to route messages.
3. **Database Sharding**: PostgreSQL with Citus extension shards data by user_id across 16 initial shards, expandable to 64+. Each shard handles ~1.25M users. Read replicas (3 per shard) distribute query load. Message tables partitioned by created_at (monthly) for efficient archival.
4. **LLM Gateway Scaling**: Python FastAPI instances scaled based on request queue depth in Kafka. Each instance maintains connection pools to external LLM APIs (OpenAI, Anthropic) with circuit breakers. Geographic proximity routing to LLM endpoints reduces latency.
5. **Caching Strategy**: Redis Cluster with 12 nodes (4 shards × 3 replicas) caches: conversation contexts (30min TTL), user sessions (24hr), rate limit counters (1hr sliding window), hot conversations (top 10% by access). Cache hit rate target: 85%+.
6. **Multi-Region Deployment**: Deploy across 3 regions (US-East, EU-West, Asia-Pacific) with Route53 geo-routing. Each region handles 7M DAU. Cross-region PostgreSQL replication (async) for disaster recovery. Kafka MirrorMaker 2 replicates events for analytics aggregation.

**Vertical Scaling Considerations:**

  • PostgreSQL instances: Start with r6g.4xlarge (16 vCPU, 128GB RAM), scale to r6g.8xlarge for primary. Read replicas on r6g.2xlarge.
  • Redis Cluster: r6g.xlarge nodes (4 vCPU, 32GB RAM per node).
  • LLM Gateway: CPU-optimized c6i.2xlarge for fast Python execution.
  • ClickHouse: Storage-optimized i3en.2xlarge for cost-effective analytics.

**Capacity Planning for 500M messages/day**: ~5,800 msgs/sec sustained, 12K msgs/sec peak. Each LLM Gateway instance handles 50 concurrent requests × 20 regions × 10 instances = 10K concurrent LLM requests. Over-provision by 50% for traffic spikes and failover capacity.
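
The consistent-hashing pinning of a user to a WebSocket Manager instance (item 2 above) can be sketched with rendezvous hashing in a few lines; instance names are illustrative.

```python
# Rendezvous (highest-random-weight) hashing sketch for pinning a user_id to a
# WebSocket Manager instance, per the connection-distribution strategy above.
# Instance names are illustrative.
import hashlib

INSTANCES = [f"ws-manager-{i}" for i in range(10)]

def pick_instance(user_id: str, instances: list[str] = INSTANCES) -> str:
    def weight(instance: str) -> int:
        digest = hashlib.sha256(f"{instance}:{user_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    # Highest weight wins; adding or removing an instance only remaps ~1/N users.
    return max(instances, key=weight)

# Example: pick_instance("user-12345") always returns the same instance while
# the instance list is unchanged.
```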

Trade-offs

WebSocket for real-time streaming vs Server-Sent Events (SSE)

  • Bidirectional communication allows client to cancel requests mid-stream
  • Lower latency for streaming tokens (no HTTP overhead per message)
  • Better for interactive features like typing indicators and presence
  • Single persistent connection reduces connection overhead
  • More complex infrastructure with stateful connection management
  • Requires sticky sessions and connection state tracking in Redis
  • Harder to debug and monitor compared to stateless HTTP
  • Load balancer configuration more complex (TCP vs HTTP)
  • Higher memory consumption per connection on server side

PostgreSQL with Citus sharding vs fully distributed database (Cassandra/DynamoDB)

  • Strong ACID guarantees ensure conversation consistency across multi-turn interactions
  • Complex relational queries for conversation threads, user relationships, and search
  • Mature ecosystem with excellent tooling, monitoring, and operational knowledge
  • JSONB support provides schema flexibility for message metadata without sacrificing SQL
  • Citus provides transparent sharding while maintaining PostgreSQL compatibility
  • Harder to scale writes compared to eventually consistent NoSQL databases
  • Requires careful shard key selection (user_id) to avoid hot partitions
  • Cross-shard queries (e.g., admin analytics) are more expensive
  • Higher operational complexity for managing sharding compared to managed NoSQL
  • Potential single points of failure if primary shard goes down (mitigated with replicas)

Python FastAPI for LLM Gateway vs Go/Java

  • Best ecosystem for LLM libraries (OpenAI, Anthropic, LangChain, transformers)
  • Native async/await support in FastAPI ideal for streaming responses
  • Rapid development and easy integration with ML/AI tooling
  • LangChain provides abstraction for multi-provider LLM orchestration
  • Python's expressiveness reduces code complexity for prompt engineering
  • Lower raw throughput compared to Go or Java (GIL limitations)
  • Higher memory consumption per request (~50MB vs ~5MB for Go)
  • Slower cold start times if using serverless deployment
  • Requires more instances to achieve same throughput as compiled languages
  • Dependency management more fragile (pip vs Go modules)

Kafka for event streaming vs direct database writes with triggers

  • Decouples message processing from analytics, allowing independent scaling
  • Event replay capability for backfilling analytics or debugging
  • Enables multiple consumers (analytics, search indexing, notifications) without coupling
  • Buffer for traffic spikes - prevents overwhelming downstream systems
  • Provides audit log for compliance and debugging
  • Additional infrastructure complexity and operational overhead
  • Eventual consistency - analytics may lag real-time by seconds
  • Higher storage costs for event retention (30 days = ~15TB for 500M msgs/day)
  • Requires monitoring for consumer lag and rebalancing
  • Increases overall system latency for end-to-end event processing

Multi-region active-active deployment vs active-passive

  • Lower latency for global users by serving from nearest region
  • Higher availability - no failover delay if region goes down
  • Better resource utilization - all regions handle traffic simultaneously
  • Enables geographic compliance (EU data stays in EU region)
  • Complex data consistency challenges for global user state (sessions, rate limits)
  • Higher infrastructure costs running full stack in multiple regions
  • Cross-region latency for database replication can cause eventual consistency issues
  • More complex deployment and testing (must validate across all regions)
  • Potential for split-brain scenarios if network partitions occur

Redis for rate limiting vs database-based rate limiting

  • Sub-millisecond latency critical for checking limits on every request
  • Atomic operations with Lua scripts prevent race conditions in token bucket algorithm
  • In-memory performance scales to millions of rate limit checks per second
  • TTL support automatically cleans up expired rate limit windows
  • Doesn't add load to primary transactional database
  • Data loss risk if Redis instance fails (mitigated with Redis Cluster replication)
  • Higher cost per GB compared to disk-based storage
  • Requires separate infrastructure component to maintain and monitor
  • Complex to guarantee exactly-once semantics during failover scenarios
  • Memory constraints may require eviction policies that could lose rate limit state
⏱ 39.4s 📅 2/8/2026, 1:52:44 PM 📁 design-chatgpt-gemini-3-flash-preview

Overview

This system design describes a globally distributed, high-concurrency platform similar to ChatGPT, capable of handling 20M DAU and 500M messages per day. The architecture focuses on low-latency streaming (TTFT < 500ms), immediate consistency for conversation history, and high availability across multiple LLM backends through an intelligent inference orchestration layer. It utilizes an event-driven model for background tasks like cost tracking and search indexing, while maintaining persistent connections for real-time interaction.

Requirements

Functional

  • User authentication and session management
  • Multi-turn conversation with stateful context management
  • Real-time token streaming via Server-Sent Events (SSE)
  • Global conversation search and organization
  • Support for multiple LLM providers (OpenAI, Anthropic, internal models)
  • Multimodal support (Image/Document processing)
  • Public conversation sharing via UUID-masked URLs
  • Admin monitoring for cost and model performance

Non-Functional

  • Scale: 20 million daily active users
  • Latency: Time To First Token (TTFT) under 500ms
  • Concurrency: 100k+ active connections per region
  • Durability: Immediate consistency for conversation storage
  • Reliability: Automatic failover between LLM backends
  • Scalability: Horizontal scaling for all stateless services

Architecture Diagram

⚠ Mermaid rendering can be flaky with LLM-generated diagrams. If the diagram fails, copy the source and paste it into mermaid.live for reliable rendering.

Components

Component | Technology | Responsibility | Justification
--- | --- | --- | ---
Global Load Balancer | Google Cloud Load Balancing or AWS Global Accelerator | Routes traffic to the nearest geographic region and handles SSL termination. | Provides low-latency entry points and sophisticated health-checking across global regions.
Edge Gateway / API Gateway | Kong or Envoy | Handles authentication, rate limiting (per-tier), and request routing. | High-performance proxy that supports custom plugins for quota management and JWT validation.
Chat & Context Service | Go (Golang) | Orchestrates chat logic, manages conversation state, and formats prompts. | Golang's concurrency model (goroutines) is ideal for managing thousands of simultaneous streaming connections with low memory overhead.
Inference Orchestrator | Custom microservice (Python/FastAPI or Go) | Routes requests to LLM backends, handles retries, circuit breaking, and failover. | Decouples the chat logic from specific LLM APIs, allowing for dynamic weight shifting and cost optimization.
Streaming Engine | Server-Sent Events (SSE) over HTTP/2 | Maintains persistent connections for pushing tokens to the client. | SSE is more efficient than WebSockets for unidirectional streaming from server to client and handles reconnections natively.
Usage & Billing Service | Apache Flink | Tracks token consumption and costs per user/request for real-time quota enforcement. | Required for real-time stream processing of token counts to prevent over-usage beyond quotas.

Data Flow

⚠ Mermaid rendering can be flaky with LLM-generated diagrams. If the diagram fails, copy the source and paste it into mermaid.live for reliable rendering.

Data Storage

Store | Type | Justification
--- | --- | ---
PostgreSQL (with Citus) | sql | Ensures ACID compliance and immediate consistency for chat history. Citus allows horizontal sharding to handle 500M messages/day.
Redis | cache | Used for session management and caching recent conversation context to minimize DB hits during active turns.
Elasticsearch | search | Provides full-text search capabilities over millions of conversations with complex filtering (by date, model, or folder).
Amazon S3 / Google Cloud Storage | blob | Durable storage for multimodal inputs (images, PDF documents) and exported chat logs.
Apache Kafka | queue | Decouples chat streaming from analytical/billing tasks. Ensures that slow storage or billing updates do not block the user response.

API Design

Method | Endpoint | Description
--- | --- | ---
POST | /v1/auth/login | Authenticates user and returns a JWT session token.
POST | /v1/chat/completions | Primary endpoint for sending messages. Supports 'stream: true' for SSE responses.
GET | /v1/conversations | Retrieves a paginated list of the user's conversation history.
POST | /v1/conversations/{id}/share | Generates a public, read-only URL for a specific conversation thread.
POST | /v1/files/upload | Uploads multimodal content; returns a file ID for inclusion in chat completions.
GET | /v1/models | Lists available LLM backends and their specific capabilities (e.g., vision, long context).
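
A client-side sketch of calling POST /v1/chat/completions with 'stream: true' and consuming the SSE response, using the requests library. The payload shape and event framing are assumptions modeled on the endpoint description above.

```python
# Client sketch: POST /v1/chat/completions with stream=true and read the SSE
# stream. Payload shape and event framing are assumptions.
import json
import requests

def stream_chat(base_url: str, jwt: str, prompt: str):
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        headers={"Authorization": f"Bearer {jwt}", "Accept": "text/event-stream"},
        json={"messages": [{"role": "user", "content": prompt}], "stream": True},
        stream=True,
        timeout=60,
    )
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue                    # skip blank separators and keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)

# for event in stream_chat("https://api.example.com", "eyJ...", "Hello"):
#     print(event)
```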

Scalability Strategy

The system scales horizontally at the service level using Kubernetes. The Chat Service and Inference Orchestrator are stateless, allowing auto-scaling based on CPU/Memory and concurrent connection counts. Database scalability is achieved through PostgreSQL sharding on 'user_id' to ensure data locality for a single user's history. Regional data isolation is used to meet 100k connection requirements per region, while a global Redis layer or DB replication handles shared state like public links.

Trade-offs

SSE over WebSockets

  • Lower overhead on the server
  • Better support for HTTP/2 multiplexing
  • Built-in reconnection logic
  • Unidirectional (server to client) only
  • Requires separate REST calls for client-side interruptions

Sharded PostgreSQL for History

  • Strong consistency for conversation turns
  • Relational schema fits organized folders/tags well
  • More complex to manage than NoSQL like DynamoDB
  • Requires careful shard key selection to avoid hotspots

Inference Orchestrator Abstraction

  • Seamless failover between providers
  • Centralized cost and rate limit tracking
  • Additional network hop (approx 10-20ms latency)
  • Single point of failure if not properly redundant
⏱ 85.3s 📅 2/8/2026, 1:55:11 PM 📁 design-chatgpt-gemini-3-pro-preview

Overview

A distributed, event-driven architecture designed to support 20M+ DAU for a ChatGPT-like application. The system leverages persistent WebSocket connections for low-latency streaming (TTFT < 500ms), a Model Orchestration Layer to abstract various LLM backends, and a tiered storage strategy (Redis -> DynamoDB -> S3) to handle the high write throughput of 500M messages per day. The design prioritizes interactivity and durability while ensuring strict cost governance and rate limiting.

Requirements

Functional

  • User authentication (SSO, MFA) and session management.
  • Real-time streaming of LLM responses via WebSockets.
  • Multi-turn conversation context management.
  • Model switching (e.g., GPT-4, Claude, Llama 3) per conversation.
  • Multimodal input handling (Images, PDF upload) via S3.
  • Conversation history management (Create, Rename, Delete, Archive).
  • Full-text search across conversation history.
  • Public link generation for sharing conversations.
  • Admin dashboard for cost tracking and user management.

Non-Functional

  • Latency: Time to First Token (TTFT) < 500ms.
  • Concurrency: Support 100k+ active WebSocket connections per region.
  • Availability: 99.99% uptime with multi-region failover.
  • Scalability: Horizontal scaling to handle 500M messages/day.
  • Durability: Zero data loss for conversation history.
  • Consistency: Immediate consistency for active chat, eventual consistency for search.
  • Billing Accuracy: Precise token counting for usage quotas.

Architecture Diagram

⚠ Mermaid rendering can be flaky with LLM-generated diagrams. If the diagram fails, copy the source and paste it into mermaid.live for reliable rendering.

Components

Component | Technology | Responsibility | Justification
--- | --- | --- | ---
Edge Gateway / API Gateway | AWS Application Load Balancer + Kong Gateway | SSL termination, Geo-routing, Rate limiting, Authentication verification. | Kong provides robust plugin support for rate-limiting (Token Bucket) and JWT validation before traffic hits internal services.
Connection Manager (Chat Service) | Go (Golang) on Kubernetes | Manages WebSocket connections, broadcasts stream chunks, handles user state. | Go's Goroutines are ideal for handling hundreds of thousands of concurrent WebSocket connections with low memory footprint compared to Node.js or Python.
Model Orchestrator | Python (FastAPI) with LangChain adapters | Standardizes API calls to different LLM providers, handles retry logic, and failover. | Python ecosystem has the best libraries for LLM integration. Isolating this allows independent scaling based on inference latency.
Context Assembly Service | Rust Microservice | Retrieves relevant chat history and injects system prompts/RAG context before inference. | Requires extremely low latency to fetch and tokenize text before sending to the LLM to meet the 500ms TTFT constraint.
Billing & Analytics Consumer | Apache Flink | Consumes completed message events to calculate costs and update quotas. | Stateful stream processing needed to aggregate token usage in real-time for strict quota enforcement.

Data Flow

⚠ Mermaid rendering can be flaky with LLM-generated diagrams. If the diagram fails, copy the source and paste it into mermaid.live for reliable rendering.

Data Storage

Store | Type | Justification
--- | --- | ---
Amazon DynamoDB | nosql | Primary store for Chat History. Supports massive write throughput (500M msgs/day) and efficient querying by Partition Key (ConversationID) and Sort Key (Timestamp). A query sketch follows this table.
Redis Cluster | cache | Stores active session state, recent conversation context (window), and user rate limit counters to minimize latency on the critical path.
Amazon S3 | blob | Storage for user-uploaded images/documents. Low cost, high durability, and allows offloading bandwidth via Presigned URLs.
PostgreSQL | sql | Stores structured relational data: User profiles, Organization hierarchies, Billing Invoices, and configuration settings.
Elasticsearch / OpenSearch | search | Provides full-text search capabilities over chat history, which DynamoDB cannot handle efficiently.
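
The DynamoDB access pattern above (partition key ConversationID, sort key Timestamp) can be sketched with boto3 as follows; table and attribute names are illustrative.

```python
# DynamoDB chat-history sketch: one partition per conversation, sorted by
# timestamp. Table and attribute names are illustrative.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("chat_messages")

def append_message(conversation_id: str, ts: str, role: str, content: str) -> None:
    table.put_item(Item={
        "conversation_id": conversation_id,  # partition key
        "created_at": ts,                    # sort key (ISO-8601 sorts correctly)
        "role": role,
        "content": content,
    })

def last_messages(conversation_id: str, limit: int = 20):
    # Newest-first page from a single partition, so the read stays cheap even
    # at 500M writes/day spread across many conversations.
    resp = table.query(
        KeyConditionExpression=Key("conversation_id").eq(conversation_id),
        ScanIndexForward=False,
        Limit=limit,
    )
    return resp["Items"]
```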

API Design

Method | Endpoint | Description
--- | --- | ---
WS | /ws/v1/chat | Main WebSocket endpoint for bi-directional streaming of prompts and LLM responses.
POST | /v1/conversations | Creates a new conversation thread, returns conversation_id.
GET | /v1/conversations/{id}/messages | Retrieves paginated message history for a specific conversation.
GET | /v1/models | Lists available LLM models user is authorized to use.
POST | /v1/files/upload-url | Generates a presigned S3 URL for uploading images or documents.
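
The /v1/files/upload-url flow above can be sketched with boto3: issue a presigned S3 PUT URL so the client uploads directly to S3. Bucket name, key layout, and expiry are illustrative assumptions.

```python
# Presigned-upload sketch for POST /v1/files/upload-url. Bucket name, key
# layout, and expiry are illustrative.
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "chat-user-uploads"  # assumed bucket name

def create_upload_url(user_id: str, filename: str, content_type: str) -> dict:
    file_id = str(uuid.uuid4())
    key = f"{user_id}/{file_id}/{filename}"
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": BUCKET, "Key": key, "ContentType": content_type},
        ExpiresIn=900,  # 15 minutes
    )
    # The client PUTs the file to `url`; messages then reference `file_id`.
    return {"file_id": file_id, "upload_url": url}
```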

Scalability Strategy

Horizontal scaling via Kubernetes HPA based on CPU and custom metrics (Active WebSocket Connections). Database scales via DynamoDB On-Demand capacity or provisioned capacity with auto-scaling. The system is sharded by ConversationID for data locality. Redis Cluster handles hot-path reads. A Queue-based decoupling (Kafka) allows background tasks (search indexing, analytics) to scale independently of the real-time chat service.

Trade-offs

WebSockets over Server-Sent Events (SSE)

  • Bi-directional capability allows users to interrupt generation mid-stream.
  • Better support for future features like real-time voice or collaborative editing.
  • More complex load balancing and state management on the server.
  • Issues with corporate firewalls compared to standard HTTP/SSE.

DynamoDB for History (NoSQL) vs PostgreSQL

  • Predictable low-latency performance at infinite scale (20M DAU).
  • Schema flexibility for evolving message metadata (e.g., adding citations).
  • Complex queries (e.g., full-text search) are not supported natively, requiring a secondary indexer (Elasticsearch)
  • Higher cost per GB compared to compressed cold storage in SQL/S3.

Async Token Counting (Post-generation)

  • Does not add latency to the streaming response.
  • Simplifies the hot path architecture.
  • Risk of minor quota overages if a user spams requests before the counter updates.
  • Complexity in reconciling partial streams if a connection drops.
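
The post-generation accounting path weighed above can be sketched as follows: once a stream finishes, count tokens and publish a usage event for the Flink billing consumer. kafka-python and tiktoken are one possible client/tokenizer pairing; the topic name and event schema are illustrative.

```python
# Post-generation token accounting sketch: count tokens off the hot path and
# publish a usage event. Topic name and event schema are illustrative;
# tiktoken is one possible tokenizer.
import json
import time
import tiktoken
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
enc = tiktoken.get_encoding("cl100k_base")

def record_usage(user_id: str, conversation_id: str, model: str,
                 prompt: str, completion: str) -> None:
    event = {
        "user_id": user_id,
        "conversation_id": conversation_id,
        "model": model,
        "prompt_tokens": len(enc.encode(prompt)),
        "completion_tokens": len(enc.encode(completion)),
        "ts": time.time(),
    }
    # Fire-and-forget off the streaming path; quota counters catch up
    # asynchronously, which is the overage risk noted above.
    producer.send("usage.tokens", value=event)
```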
⏱ 320.9s 📅 2/8/2026, 10:54:53 AM 📁 design-chatgpt-gpt-5

Overview

A globally distributed, real-time conversational AI platform supporting multi-turn chat with multiple LLM backends, multimodal inputs, and rich history management. The system is built for 20M DAUs and 500M messages/day with sub-500ms time-to-first-token via a high-performance WebSocket gateway, an LLM routing layer with fast failover, and region-affine, strongly consistent storage for conversation history. Analytics, cost tracking, and admin observability are first-class through an events pipeline into ClickHouse and Prometheus/Grafana. The core data plane is stateless, horizontally scalable on Kubernetes, and tolerant of provider or regional failures.

Requirements

Functional

  • User authentication (OAuth/social SSO) and session management
  • Create/read/update/delete multi-turn conversations and messages with context retention
  • Real-time streaming of assistant responses over WebSocket (token-by-token)
  • Conversation history: search, star, tag, foldering, archive, delete
  • Multiple model selection across providers and in-house inference
  • Rate limiting and quota enforcement by user/tier and per-model
  • File upload for images/documents with virus scan, OCR/text extraction, and multimodal prompt support
  • Share conversations via public links with configurable visibility
  • Admin dashboard: model/provider health, usage, cost, errors, rate limits/quota states
  • Cost tracking per request for accurate billing and cost attribution

Non-Functional

  • Time-to-first-token (TTFT) < 500ms p95
  • Support ≥100K concurrent WebSocket connections per region
  • Immediate consistency for conversation/message persistence
  • Regional fault tolerance and LLM provider failover with graceful degradation
  • Horizontal scalability to 500M messages/day (60k msgs/sec peak)
  • Data durability (multi-AZ), point-in-time recovery, backups
  • Security: WAF, DDoS protection, encryption in transit/at rest, least-privilege IAM
  • Observability: distributed tracing, metrics, logs, audit trails
  • Privacy and compliance readiness (GDPR/CCPA data subject controls)
  • Cost efficiency: autoscaling compute/GPU, cost-aware routing, storage tiering

Architecture Diagram

⚠ Mermaid rendering can be flaky with LLM-generated diagrams. If the diagram fails, copy the source and paste it into mermaid.live for reliable rendering.

Components

| Component | Technology | Responsibility | Justification |
| --- | --- | --- | --- |
| Client Web App | Next.js + React, TypeScript, WebSocket, highlight.js, Markdown-it | SPA for chat UI, Markdown rendering, code highlighting, WebSocket streaming, file uploads, search, and sharing | Mature ecosystem, SSR for SEO (shared links), excellent dev productivity and performance |
| Edge CDN/WAF & Global LB | Cloudflare CDN + Cloudflare Load Balancer + Bot Management | TLS termination, caching static assets, DDoS/WAF, geo-steering to nearest healthy region | Global footprint, anycast, robust WAF and health-based geo-routing to meet latency and availability targets |
| API & WebSocket Gateway | Go microservice on Kubernetes with NGINX Ingress (ALB) and HTTP/2; gorilla/websocket; gRPC to internal services | Single entry for REST and WebSocket; authZ/authN checks, rate limiting, session validation, request fan-out to internal services; streams tokens to client | Go delivers low-latency IO and high concurrency; stable WS handling; NGINX Ingress + ALB scale well |
| Auth Service | Auth0 (OIDC) + JWT (RS256) | User identity, OAuth/social login, JWT issuance, refresh tokens, RBAC/roles (user/admin) | Fast to integrate, enterprise SSO, adaptive MFA; offloads identity risk; standards-compliant OIDC |
| Rate Limit & Quota Service | Envoy Global Rate Limit Service + Redis Cluster; Lua in NGINX for shadow checks | Enforces per-user/tier rate limits (sliding window) and quotas; provides near-real-time counters | Envoy RLS is battle-tested; Redis offers sub-ms counters and atomicity with Lua scripts |
| Session/Cache Store | Redis Cluster (6.x) with Redis Streams for ephemeral events | JWT blacklist, session metadata, ephemeral streaming buffers, recent context window cache | In-memory speed, high availability via clustering and replication |
| Conversation Service | PostgreSQL (Citus) multi-tenant sharded by user_id; Go service using pgx | CRUD for conversations/messages, context building, sharing ACLs, foldering/tags; transactional writes | Immediate consistency and SQL semantics; Citus scales horizontally and keeps p95 low with partitioning |
| Search/Indexing Service | OpenSearch (multi-AZ) + k-NN plugin; background workers (Go) for indexing | Full-text search over titles/messages; semantic search via embeddings; indexing pipeline | Scalable search with near real-time indexing; k-NN for semantic search without extra vector DB |
| LLM Router | Go service with gobreaker, HTTP/2 keep-alive pools; provider SDKs; configuration via Consul/etcd | Model catalog, routing to providers/in-house; health checks, circuit breakers, retries, cost-aware selection; streaming token multiplexing | Low-latency, robust control plane with per-provider health and dynamic routing rules |
| Provider Connectors | Connectors for OpenAI/Anthropic/Azure OpenAI/Google Vertex; retries with exponential backoff; streaming adapters | Integrations to external LLMs and embeddings | Diversity reduces provider risk and enables cost/performance optimization |
| In-house Inference Cluster | vLLM on Kubernetes GPU nodes (NVIDIA A10/A100), Triton for embeddings; Istio for mTLS | Self-hosted models (vLLM) for failover and cost control; embeddings server | High throughput, streaming-friendly; cost-efficient for baseline models and embeddings |
| File Ingestion Service | Amazon S3 + S3 Object Lambda (virus scan with ClamAV) + AWS Textract + Apache Tika; Step Functions for orchestration | Pre-signed uploads, virus scanning, OCR/text extraction, chunking; links assets to messages | Serverless pipeline scales elastically; S3 durability and cost efficiency for blobs |
| Cost & Billing Service | Kafka consumers (Go) -> ClickHouse for analytics; Postgres for authoritative balances | Compute per-request cost (provider rates, tokens, GPU time), store usage, expose invoices and quotas | ClickHouse excels at high-ingest analytics; Postgres for transactional balances and limits |
| Event Bus | Apache Kafka (AWS MSK) | Asynchronous events: usage, costs, audit logs, indexing triggers | High-throughput, durable event streaming; ecosystem support |
| Analytics & Monitoring | Prometheus + Grafana; OpenTelemetry + Jaeger; Loki for logs; CloudWatch for infra | Dashboards, alerts, traces, logs | Proven OSS stack, vendor-neutral instrumentation |
| Admin Dashboard | Next.js + RBAC; reads from ClickHouse/Prometheus/Postgres | Operational UI: usage, costs, errors, provider health, throttles; model catalog management | Unified operational control plane with low-latency analytics queries |
| CDN Assets & Static Hosting | Cloudflare + S3 static site hosting | Serve static JS/CSS/images | Global low-latency delivery for assets |

Data Flow


Data Storage

| Store | Type | Justification |
| --- | --- | --- |
| PostgreSQL (Citus) | sql | Strong consistency and transactions for conversations/messages; Citus provides horizontal sharding by user_id with high write throughput and low-latency queries |
| Redis Cluster | cache | Sub-millisecond counters for rate limits, sessions, ephemeral streaming buffers, and hot context windows |
| Amazon S3 | blob | Durable, cost-effective storage for file uploads, images, and large attachments; lifecycle policies for tiering |
| OpenSearch | search | Full-text and semantic search with k-NN; scalable indexing and near real-time search for conversation history |
| Kafka (AWS MSK) | queue | Durable, scalable event streaming for usage, billing, indexing, and audit logs, decoupling producers/consumers |
| ClickHouse | sql | High-ingest, columnar analytics for usage and cost reporting; sub-second aggregations at scale |

API Design

| Method | Endpoint | Description |
| --- | --- | --- |
| WS | /v1/ws | Bidirectional WebSocket for sending user messages and receiving token-streaming responses and events |
| POST | /v1/conversations | Create a new conversation (title, tags, model selection, visibility) |
| GET | /v1/conversations | List conversations with filters (folder, tag, starred) and pagination |
| GET | /v1/conversations/{id} | Get a conversation with messages (server-side pagination) |
| POST | /v1/conversations/{id}/messages | Add a user message to a conversation (text, file refs, tool calls) |
| GET | /v1/messages/{id} | Get message detail and streaming status |
| GET | /v1/search | Search conversations/messages (full-text + semantic options) |
| GET | /v1/models | List available models and tiers, pricing metadata |
| POST | /v1/files | Initiate file upload and get pre-signed URL; returns file_id |
| POST | /v1/share/{conversation_id} | Create/update share link (public/unlisted/expire) |
| GET | /v1/usage | Per-user usage and remaining quota by period |
| GET | /v1/admin/metrics | Admin: provider health, error rates, throughput, cost summaries |
| PUT | /v1/admin/models | Admin: manage model catalog, routing weights, and availability |

Scalability Strategy

  • Traffic and sessions: Anycast via Cloudflare to the nearest region. Sticky sessions are not required; WebSocket connections are long-lived and evenly distributed via ALB. Gateway pods autoscale on CPU and open FDs; each Go pod targets ~4–5K concurrent WS, so 30 pods suffice for 150K WS with headroom per region.
  • Storage: Citus shards by user_id across nodes; co-locate primary and replicas in the same AZ to minimize latency. Connection pooling with PgBouncer. Hot partitions handled by rebalancing shards. PITR and logical replication to a DR region.
  • Search: OpenSearch domain scales horizontally across data nodes. Index with 1–3 primary shards per index and ILM for rollover. Async indexers consume from Kafka for sustained throughput.
  • LLM routing: Health probes and circuit breakers per provider/region; latency-aware load balancing and hedged requests before first token. In-house vLLM autoscaling on GPU metrics (queue depth, tokens/sec). Keep-alive HTTP/2 pools to reduce TTFB.
  • Rate limiting: Redis Cluster with hash tags for per-user keys ensures single-shard updates. Use a sliding window with Lua for atomicity (a sketch follows this list). Quotas aggregated periodically from ClickHouse and persisted to Postgres for authority.
  • WebSockets: Separate HPA based on concurrent connections and network IO. Use SO_REUSEPORT and pod anti-affinity. Idle pings to detect dead peers. Backpressure controls to avoid OOM.
  • Multi-region: Active-active per region; users are region-affined based on the home region stored in their profile. Cross-region failover via Cloudflare LB health checks; if the home region is down, serve reads from the last durable snapshot in DR and accept writes in a degraded mode (queued for backfill) with a user-facing notice.
  • Observability: OpenTelemetry traces propagate across gateway, LLM router, and connectors. SLO-based autoscaling and alerting for TTFT and error budgets.
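The Redis sliding-window approach above could be sketched roughly as follows with go-redis; the key format, member scheme, and hard-coded arguments are illustrative assumptions, and a real deployment would load tier-specific limits instead:

```go
package ratelimit

import (
	"context"
	"strconv"
	"time"

	"github.com/redis/go-redis/v9"
)

// slidingWindow runs atomically on the shard that owns the user's key
// (the hash tag below keeps each user's key on a single cluster slot).
var slidingWindow = redis.NewScript(`
local key    = KEYS[1]
local now    = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit  = tonumber(ARGV[3])
redis.call("ZREMRANGEBYSCORE", key, 0, now - window)
if redis.call("ZCARD", key) >= limit then return 0 end
redis.call("ZADD", key, now, ARGV[4])
redis.call("PEXPIRE", key, window)
return 1
`)

// Allow reports whether userID may send another request within the window.
func Allow(ctx context.Context, rdb *redis.Client, userID string, limit int, window time.Duration) (bool, error) {
	key := "rl:{" + userID + "}" // hash tag -> single Redis Cluster slot per user
	// Nanosecond member keeps entries unique within the same millisecond.
	member := strconv.FormatInt(time.Now().UnixNano(), 10)
	n, err := slidingWindow.Run(ctx, rdb,
		[]string{key},
		time.Now().UnixMilli(), window.Milliseconds(), limit, member,
	).Int()
	if err != nil {
		return false, err
	}
	return n == 1, nil
}
```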

Trade-offs

Use PostgreSQL (Citus) for conversations instead of a globally-distributed DB (e.g., Spanner/CockroachDB)

  • Immediate consistency and strong SQL semantics
  • Operational familiarity and cost-effective scaling
  • Shard-by-user keeps latency low
  • Cross-region strong consistency not automatic
  • Requires shard management and regional affinity policies

WebSockets for streaming vs Server-Sent Events

  • Bidirectional control channel (cancel, tool calls)
  • Better concurrency characteristics and fewer HTTP headers
  • Widely supported by CDNs/LBs
  • More complex connection lifecycle management
  • Stateful long-lived connections impact autoscaling

OpenSearch for full-text + k-NN vs separate vector DB

  • Single search stack reduces ops overhead
  • Near real-time indexing with scalable ingestion
  • k-NN recall/latency may trail specialized vector DBs
  • Index tuning and memory footprint can be heavy

Auth0 for identity instead of building in-house

  • Faster time-to-market, robust SSO/MFA
  • Security updates and compliance handled by vendor
  • Vendor cost and dependency
  • Rate limits and token introspection latency considerations

Kafka + ClickHouse for analytics and cost tracking

  • High-ingest, low-cost analytics at scale
  • Flexible schema evolution for events
  • Operational complexity (clusters to manage)
  • Eventual consistency for analytics views

In-house vLLM cluster as secondary routing target

  • Cost control for common models
  • Resilience to third-party outages
  • Requires GPU capacity management and model maintenance
  • May lag on latest proprietary model capabilities

Async indexing to OpenSearch (eventual) while DB is immediate

  • Protects write path latency and TTFT
  • Scales independently
  • Search results may lag by seconds
  • Requires reconciliation jobs for rare failures
⏱ 138.3s 📅 2/8/2026, 10:57:24 AM 📁 design-chatgpt-gpt-5-mini

Overview

A globally distributed, real-time web platform that enables multi-turn conversations with configurable LLM backends, streaming responses token-by-token, durable conversation history, multimodal inputs, per-user quotas and billing, and admin monitoring. The design uses managed cloud components where appropriate (AWS examples used for concreteness) and is built for 20M DAU and ~500M messages/day — with multi-region deployment, autoscaling WebSocket clusters, strong consistency for conversation data, semantic search, and resilient LLM backend routing with automatic failover and cost accounting.

Requirements

Functional

  • User authentication, registration, password reset and session management
  • Create and continue multi-turn conversation threads with context retention
  • Real-time streaming of LLM responses token-by-token to clients
  • Durable and immediately consistent conversation history (reads immediately reflect completed writes)
  • Search and organize conversation history (text + semantic search)
  • Support multiple LLM backends and per-conversation model selection
  • Rate limiting and per-tier usage quotas; block/soft-limit enforcement
  • Markdown rendering with safe sanitization (code blocks, tables, etc.)
  • File upload and multimodal input handling (images, documents) with safe storage and processing
  • Share conversations via public links (read-only) with optional expiry
  • Admin dashboard for usage, costs, quota management, and system health
  • Per-request cost tracking for accurate billing

Non-Functional

  • Scale to 20M daily active users and 500M messages/day
  • Support at least 100K concurrent WebSocket connections per region
  • Start streaming first token within 500ms of request
  • Conversation history must be durable and immediately consistent
  • High availability and graceful degradation on LLM backend failures with automatic failover
  • Low latency (P95 request response times within reasonable bounds) and high throughput
  • Secure file handling, sanitization, and access controls
  • Observability: request tracing, per-request cost telemetry, metrics and logs
  • Regulatory considerations: data residency and GDPR-friendly features (export/delete)

Architecture Diagram


Components

| Component | Technology | Responsibility | Justification |
| --- | --- | --- | --- |
| Edge / CDN | AWS CloudFront + AWS WAF (or Cloudflare) for global CDN & edge security | Global caching, TLS termination, hosting static assets, and routing to nearest API region. Protect against DDoS and serve prerendered content. | Low latency global content delivery and edge protections. CloudFront integrates with regional ALBs and AWS API Gateway; WAF provides DDoS/IPS rules. |
| API Gateway (REST & WebSocket) | AWS API Gateway (HTTP/APIGW v2 for WebSocket) or AWS Application Load Balancer + NLB for WebSocket if a custom stack is preferred | Ingress point for HTTP(S) REST APIs and managed WebSocket connections, authentication/authorization integration, metrics, and throttling. | Managed API Gateway handles large-scale WebSocket connections reliably and integrates with Lambda and VPC targets. Reduces operational burden to meet 100K+ connections per region. |
| Auth Service / Identity | Auth0 or Amazon Cognito (or self-hosted Keycloak for more control) | User authentication (email/password, OAuth), session issuance, token lifecycle, MFA, and account management. | Managed identity reduces time to market; Cognito/Auth0 handle scaling, OIDC/OAuth flows, social login, and integrate with API Gateway and IAM. Can fall back to Keycloak if self-hosting is required for compliance. |
| Frontend (Web & Mobile clients) | React + Next.js for Web (SSR), React Native for mobile; WebSocket & SSE clients for streaming; remark/rehype for markdown rendering and DOMPurify for sanitization. | UI for conversations, streaming UI, markdown rendering/sanitization, file uploads, sharing links, offline behaviors, and WebSocket clients. | Next.js gives a performant SSR/CSR mix and edge support; well-supported libraries for markdown and security. |
| Connection Manager / WebSocket Workers | Kubernetes (EKS) running horizontally scaled WebSocket worker pods behind API Gateway or ALB, using Envoy/ingress for routing; Redis for presence/connection metadata. | Maintain WebSocket connections, route tokens to clients, enforce per-connection rate limits, maintain ephemeral state, and connect to LLM streaming output. | Kubernetes provides autoscaling and lifecycle control. Breaking stream work into worker pods allows streaming token-by-token with low-latency writes to sockets; Redis stores connection mapping for routing in multi-pod deployments. |
| Conversation Service (API) | Stateless microservice in Kubernetes (gRPC/HTTP) with connection to Aurora PostgreSQL (primary writer) and a caching layer (Redis). | Handles conversation CRUD, multi-turn context assembly, versioning, bookmarks, shareable link creation, and immediate persistent writes. | Stateless services scale easily. Aurora PostgreSQL provides strong consistency and supports high write throughput with multi-AZ. Redis accelerates hot-path reads and rate-limit checks. |
| Message Ingest & Streaming Orchestrator (LLM Router) | Stateless microservice (Kubernetes) using gRPC to LLM backends; router built on Hystrix-like circuit-breaker libraries and per-model adapters. Persists logs & events to Kafka (MSK) for downstream processing. | Orchestrates sending prompts to selected LLM backend(s), streams tokens back to the Connection Manager, calculates per-request cost, applies circuit breakers and failover to alternate models/backends, and logs telemetry. | Centralized routing simplifies failover, cost accounting, and policy enforcement. gRPC yields low-latency backend calls; Kafka provides durable eventing for billing and analytics. |
| LLM Backends | Hybrid: external providers (OpenAI/Anthropic) + internal GPU clusters orchestrated by Kubernetes + Triton / NVIDIA TensorRT / Ray Serve for model serving. Model proxies expose gRPC or HTTP streaming. | Provide model inference and token streaming. Could be managed external APIs (OpenAI, Anthropic) and/or internal GPU clusters (private models). | Hybrid provides capacity and cost controls: external for burst/spiky loads and internal for steady-state/private models. Triton/Ray Serve are production-ready for large model serving with streaming support. |
| Cache & Rate-Limit Store | Redis (Amazon ElastiCache in clustered mode, or Redis Enterprise) | Fast token-bucket rate limits, session cache, short-lived conversation caches for hot reads, and presence store. | Redis supports very low-latency operations, atomic counters, Lua scripting for rate-limiting logic, and clustering for scale. |
| Durable Storage (Conversations / Metadata / Billing) | Amazon Aurora PostgreSQL (clustered, multi-AZ, read replicas) with partitioning/sharding by tenant or hashed conversation id. | Immediate-consistency primary store for conversations, messages, user metadata, billing records, and access controls. | Relational strong consistency and transactions for the immediate-consistency requirement; Aurora scales reads and provides high durability and automated backups. |
| Object Store (Files & Attachments) | Amazon S3 with S3 Object Lambda hooks; presigned uploads; Lambda for scanning via ClamAV or third-party virus scanning | Store uploaded files (images, docs) and serve them to model pipelines and clients via presigned URLs; lifecycle & virus-scan results. | S3 is durable, scalable, and cost-effective; presigned uploads offload bandwidth; the Lambda-based scanning pipeline can run asynchronously. |
| Search & Embeddings | Hybrid: OpenSearch (for keyword/structured search) + vector DB (Pinecone, Milvus, or Amazon OpenSearch vector plugin) for embeddings. Embeddings produced via a managed service or dedicated model instances and stored in the vector DB. | Text and semantic search across conversation history and attachments; embedding generation and vector search. | OpenSearch handles traditional search and filters; the vector DB supports semantic similarity at scale. Separating concerns lets search scale independently. |
| Event Bus / Streaming & Analytics | Apache Kafka (Amazon MSK) for high-throughput durable logs; Kafka Connect to a data warehouse (Snowflake/BigQuery) and stream processors (Flink/Kafka Streams). | Durable eventing for audit logs, billing events, metrics, and asynchronous jobs (indexing, notifications, cost aggregation). | Kafka scales to hundreds of thousands of events/sec and supports exactly-once processing patterns, enabling accurate billing and analytics. |
| Billing & Cost Accounting | Service that consumes Kafka billing events, applies per-model cost rates, stores detailed line items in PostgreSQL, and aggregates in OLAP (BigQuery/Snowflake) for reports; serverless ETL for daily aggregation. | Accurate per-request cost tracking, aggregation to user billing, tier enforcement, and exports to the billing system. | Event-driven accounting keeps near-real-time cost tracking for each request; OLAP enables fast analytics and admin dashboards. |
| Admin Dashboard & Observability | Prometheus + Grafana for metrics; Jaeger for distributed traces; ELK/OpenSearch for logs; Grafana dashboards with role-based access. Admin frontend built on React + RBAC. | System metrics, alerts, per-user/tier usage, cost dashboards, model health, and structured logs. | Standard observability stack with tracing allows operators to debug, monitor the system, and analyze cost/usage. |
| Security & Compliance | AWS KMS for secrets, IAM for infra access control, Vault (HashiCorp) for application secrets if self-hosting; S3 encryption and TLS everywhere. | Access controls, secret management, key rotation, audit logs, data deletion/export endpoints, encryption at rest/in transit, and DLP for file scanning. | Managed key stores and RBAC minimize operational overhead while meeting compliance. |

Data Flow


Data Storage

| Store | Type | Justification |
| --- | --- | --- |
| Amazon Aurora PostgreSQL | sql | Provides immediate consistency, transactions, and durability for conversation history and billing line items. Aurora supports multi-AZ, read replicas, partitioning/sharding, and scales to high throughput with proper schema design. |
| Redis (ElastiCache Clustered) | cache | Low-latency data for rate limiting, session/presence mapping, token-bucket counters, and ephemeral caching of recent conversation context for fast reads. |
| Amazon S3 | blob | Durable, cost-efficient object storage for user-uploaded files and model artifacts. Supports presigned uploads and lifecycle policies; integrates with Object Lambda for scanning/transformations. |
| Apache Kafka (Amazon MSK) | queue | Durable, high-throughput event stream for message events, billing events, and indexing streams. Enables decoupled asynchronous processing (search indexing, billing aggregation, analytics). |
| OpenSearch (Elastic) + Vector DB (Pinecone or Milvus) | search | OpenSearch for keyword/structured search and filters; vector DB for semantic similarity search on embeddings. Scales independently and supports fast retrieval of relevant conversation segments. |
| OLAP (BigQuery or Snowflake) | nosql | For cost/billing analytics and historical reporting at scale. Stores aggregated billing/usage records and enables fast analytics for admin dashboards and finance exports. |

API Design

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/v1/auth/login | Authenticate user (email/password or OAuth token exchange). Returns access token and refresh token. Initiates session and rate-limit metadata. |
| GET | /api/v1/conversations | List user conversations with pagination, sorting, and filters (by tag, model, shared). Uses read replica; consistent with write-through caching invalidation. |
| POST | /api/v1/conversations | Create a new conversation; specify model, system prompt, privacy/sharing options, and optional attachments. |
| POST | /api/v1/conversations/{conversationId}/messages | Send a new user message to a conversation. Persists the message, triggers inference via the LLM Router, and returns an inference-id. Supports multimodal references (file IDs). |
| WS | /api/v1/conversations/{conversationId}/stream | WebSocket endpoint for real-time streaming of LLM responses (token-by-token) and message events. Supports client acknowledgements, reconnect/resume semantics, and server pings. |
| POST | /api/v1/files | Request presigned URL for upload or upload metadata. After upload, the file is scanned asynchronously; returns a file ID for model input. |
| GET | /api/v1/models | List available models with capabilities, estimated cost/token, latency SLAs, and fallback rules. |
| POST | /api/v1/conversations/{conversationId}/share | Create a public, shareable link (read-only) with optional expiry and password-protection settings. |
| GET | /api/v1/admin/metrics | Admin-only metrics endpoint aggregated from Prometheus/OLAP for usage, costs, model health, and alerts. Requires admin RBAC. |
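As a rough illustration of the presigned-upload flow behind POST /api/v1/files, the sketch below uses the AWS SDK for Go v2 to mint a short-lived PUT URL that the browser uploads to directly. The bucket name, key layout, and expiry are assumptions, not the design's actual configuration:

```go
package files

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// PresignUpload returns a short-lived URL the client can PUT the file to
// directly, so upload bandwidth never touches the API tier.
func PresignUpload(ctx context.Context, userID, fileID, contentType string) (string, error) {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return "", err
	}
	presigner := s3.NewPresignClient(s3.NewFromConfig(cfg))

	req, err := presigner.PresignPutObject(ctx, &s3.PutObjectInput{
		Bucket:      aws.String("chat-uploads"), // hypothetical bucket
		Key:         aws.String("uploads/" + userID + "/" + fileID),
		ContentType: aws.String(contentType),
	}, s3.WithPresignExpires(15*time.Minute))
	if err != nil {
		return "", err
	}
	// The async scan/OCR pipeline is triggered separately (e.g., S3 event ->
	// queue) once the object lands; the file stays "pending" until it passes.
	return req.URL, nil
}
```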

Scalability Strategy

  • Multi-region deployment with region-local clusters (API Gateway + EKS + Aurora in each region, or cross-region read-only replicas depending on data residency).
  • Horizontal scaling: stateless frontends and the LLM Router scale via Kubernetes HPA/KEDA based on CPU/RPS/queue length. WebSocket workers scale horizontally; a managed API Gateway or ALB handles connection scaling.
  • Data stores: Redis scales as clustered ElastiCache with sharding; Aurora scales by sharding conversations by tenant or hashing conversationId to different writer clusters for write throughput.
  • Eventing: Kafka (MSK) partitions are sized by throughput, with consumer groups for parallel processing (a producer sketch follows this list).
  • Model serving: autoscaling GPU pools for internal models (Karpenter/Cluster Autoscaler) with spot instances to reduce cost for non-critical capacity.
  • Edge: CloudFront caching for static assets and read-heavy metadata.
  • Search and embeddings: scale vector DB clusters independently.
  • Global throughput: traffic steering to the nearest region with failover, and an active-passive or active-active DB strategy where legal/regulatory constraints permit.
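A minimal sketch of the billing/usage events feeding those Kafka consumer groups, assuming segmentio/kafka-go; the topic name, broker address, and event fields are illustrative. Keying by user ID keeps one user's events ordered within a partition, which simplifies downstream cost aggregation:

```go
package billing

import (
	"context"
	"encoding/json"

	"github.com/segmentio/kafka-go"
)

// Event is a hypothetical per-request usage record.
type Event struct {
	UserID    string `json:"user_id"`
	RequestID string `json:"request_id"`
	Model     string `json:"model"`
	TokensIn  int    `json:"tokens_in"`
	TokensOut int    `json:"tokens_out"`
	CostMicro int64  `json:"cost_micro_usd"`
}

var writer = &kafka.Writer{
	Addr:         kafka.TCP("msk-broker-1:9092"), // hypothetical broker
	Topic:        "billing.usage",
	Balancer:     &kafka.Hash{},    // same key -> same partition -> per-user ordering
	RequiredAcks: kafka.RequireAll, // durability for billing-grade events
}

// Publish emits one usage event; consumers (cost aggregation, quota sync,
// OLAP load) run in separate consumer groups and scale independently.
func Publish(ctx context.Context, ev Event) error {
	payload, err := json.Marshal(ev)
	if err != nil {
		return err
	}
	return writer.WriteMessages(ctx, kafka.Message{
		Key:   []byte(ev.UserID),
		Value: payload,
	})
}
```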

Trade-offs

Use Aurora PostgreSQL (SQL) as primary conversation store versus a NoSQL store

  • Strong consistency and transactional integrity meeting immediate-consistency requirement
  • Familiar SQL tooling for analytics and billing joins
  • ACID semantics make concurrency around multi-turn context safer
  • Higher cost at scale and more complex sharding strategy required for very high write throughput
  • Scaling writes requires sharding/partitioning; operational complexity compared to unlimited-scaled NoSQL

Managed API Gateway + managed WebSocket vs self-hosted WebSocket tier

  • Reduces operational burden and more predictable scalability to meet 100K+ connections per region
  • Integrated auth/metrics and DDoS protections
  • Potentially higher cost and less low-level customization than self-hosted approach
  • Proprietary limits and vendor lock-in

Hybrid LLM Backends (external providers + internal GPU clusters)

  • Flexibility: use cheaper internal models for baseline load and external providers for burst or advanced models
  • Avoids total dependency on third-party providers and gives control over privacy/compliance
  • Complexity in routing, capacity planning, and maintaining model infra
  • Potentially higher ops cost to run GPU clusters

Kafka (MSK) for events vs serverless queues (Kinesis/Lambda)

  • Kafka provides high throughput, ordering guarantees, and complex stream processing needed for billing accuracy
  • Good for exactly-once or at-least-once semantics required by billing and audit trails
  • Operational complexity and higher management overhead vs serverless alternatives
  • Higher cost at small scale; more moving parts

Use Redis for rate-limiting and session store

  • Extremely low latency and atomic operations (Lua) for precise rate-limiting
  • Widely adopted pattern and straightforward to implement token-bucket quotas
  • Single point of failure risk if not configured in clustered/high-availability mode
  • Operational overhead to scale and tune eviction/persistence

Vector DB (Pinecone/Milvus) for semantic search vs using OpenSearch alone

  • Vector DB optimized for nearest-neighbor semantic search at scale and supports high-dimensional vectors
  • Better latency and scalability for similarity queries
  • Adds another datastore to maintain and sync embeddings (operational overhead)
  • Extra cost; integration complexity with indexing pipelines
⏱ 221.0s 📅 2/8/2026, 1:59:52 PM 📁 design-chatgpt-gpt-5.2

Overview

The system is a globally distributed, multi-tenant conversational AI web application supporting authenticated users, multi-turn threads, token-streaming responses, file/multimodal inputs, conversation search, sharing links, and an admin cost/usage dashboard. It is designed for 20M DAU and ~500M messages/day with strict latency requirements (TTFT < 500ms) and high concurrency (>=100K concurrent WebSocket connections per region). The architecture separates the latency-critical request/streaming path (WebSocket Gateway + Orchestrator + LLM adapters) from durable storage, indexing, analytics, and billing pipelines. Conversations are stored in a strongly consistent SQL store, while search and analytics are powered by specialized systems. LLM backend failures are handled via circuit breakers, hedged requests, and provider failover with per-token streaming preserved.

Requirements

Functional

  • User authentication (SSO/email), session management, and tier entitlements
  • Create/read/update conversation threads with multi-turn context retention
  • Real-time token-by-token streaming of assistant responses to the client
  • Model selection per conversation/message across multiple LLM backends
  • Conversation history browsing, organization (folders/tags), and search
  • File upload (images/documents) and multimodal prompts
  • Share conversations via public links with optional redaction/permissions
  • Rate limiting and usage quotas per user/tier with enforcement
  • Markdown rendering support (code blocks, tables) and safe sanitization
  • Admin dashboard for monitoring usage, latency, errors, and costs

Non-Functional

  • Scale: 20M DAU, 500M messages/day, avg 10 turns/conversation
  • Latency: streaming must start within 500ms time-to-first-token
  • Concurrency: >=100K concurrent WebSocket connections per region
  • Durability + immediate consistency for conversation history
  • High availability with automatic failover across LLM providers/regions
  • Accurate per-request/per-token cost tracking for billing
  • Security: encryption in transit/at rest, least privilege, audit logs
  • Compliance readiness (PII controls, retention policies, GDPR delete)
  • Operational excellence: observability, alerting, safe deploys (canary)
  • Abuse prevention: bot detection, prompt injection/file malware scanning

Architecture Diagram


Components

| Component | Technology | Responsibility | Justification |
| --- | --- | --- | --- |
| Web Client (Browser) + Mobile | Next.js (React) + TypeScript; Markdown-it + DOMPurify; WebSocket client | UI for chat, conversation list, model selection, file upload, markdown rendering, and real-time streaming display via WebSocket/SSE fallback. | Next.js supports SSR/SPA, fast iteration, and edge-friendly deployments. Markdown-it is extensible for code blocks/tables; DOMPurify prevents XSS. WebSocket enables low-latency bidirectional streaming. |
| Global DNS + CDN/WAF | Cloudflare (DNS, CDN, WAF, Bot Management) | Global traffic steering, TLS termination at edge, DDoS protection, caching of static assets, WAF rules, and bot mitigation. | Strong global presence reduces latency and protects origin. Bot/DDoS controls are critical at 20M DAU. |
| API Gateway / Edge | Envoy Gateway (Kubernetes) + Cloudflare origin rules | Routing for REST APIs and WebSocket upgrades, auth pre-checks, request shaping, and regional failover. | Envoy provides high-performance L7 routing, retries, timeouts, and observability. Works well with WebSockets and service mesh patterns. |
| Auth & Session Service | Auth0 (OIDC) + internal Session API using JWT (short-lived) + Redis for session revocation | User signup/login, OAuth/OIDC, session issuance, refresh, MFA support, and entitlement lookup for tiers. | Auth0 reduces security risk and time-to-market. Short-lived JWT minimizes DB calls; Redis enables immediate revocation/ban. |
| WebSocket Gateway (Streaming Gateway) | Kubernetes-deployed Node.js (uWebSockets.js) or Go (fasthttp + websocket) service; Redis Cluster for ephemeral connection metadata | Manages WebSocket connections, fan-out of token streams, backpressure, connection state, and regional scaling to >=100K concurrent connections. | Specialized gateway isolates long-lived connections from general API traffic. Go/uWS handle high concurrency efficiently; Redis supports lightweight presence/state without coupling to the DB. |
| Chat Orchestrator Service | Go microservice (gRPC internally) with circuit breakers (Hystrix-like) and retries (Envoy + app-level) | Core chat workflow: validate quotas, build context, call LLM backends, stream tokens, handle tool/file references, persist messages atomically, and emit usage/cost events. | Go offers predictable latency and high throughput. Central orchestration simplifies consistency and billing correctness while keeping the streaming path tight. |
| LLM Provider Adapter Layer | Internal service/library used by Orchestrator; supports OpenAI-compatible streaming + Bedrock + Anthropic; optional self-hosted vLLM on GPU nodes | Uniform interface for multiple model providers (e.g., OpenAI, Anthropic, AWS Bedrock, self-hosted vLLM), token streaming normalization, automatic failover/hedging, and provider-specific auth. | Decouples product from provider APIs and enables rapid switching, routing, and fallback strategies to meet availability/latency constraints. |
| Conversation Service | Java/Kotlin (Spring Boot) or Go; PostgreSQL-compatible distributed SQL (YugabyteDB) | CRUD for conversations, messages, metadata (title, tags, folders), share settings, and immediate-consistency reads. | Distributed SQL provides strong consistency with horizontal scaling and multi-region resilience. A dedicated service encapsulates schema and access patterns. |
| Search/Indexing Service | Elasticsearch (managed, e.g., Elastic Cloud) + Kafka Connect for the indexing pipeline | Index conversation/message text and metadata for fast search, filtering, and ranking; supports near-real-time updates. | Elasticsearch is well-suited for full-text search and faceting at large scale. Kafka-based ingestion decouples indexing from the write path. |
| File Ingestion & Multimodal Pipeline | S3-compatible object storage (Amazon S3) + CloudFront signed URLs; ClamAV scanning; Apache Tika for parsing; optional GPU service for vision embeddings | Handle uploads, virus/malware scanning, document parsing (PDF/DOCX), image preprocessing, OCR, embedding generation, and secure storage/links. | Object storage is the standard for large binary data. Scanning and parsing protect the platform. Signed URLs reduce origin load and limit unauthorized access. |
| Rate Limiting & Quota Service | Redis Cluster (token bucket/leaky bucket) + internal Quota API; optional Envoy global rate limit service | Per-user/per-tier rate limits (RPS), token quotas, daily/monthly usage, and enforcement in the hot path. | Redis offers sub-millisecond counters suitable for the 500ms TTFT constraint. Central policy keeps enforcement consistent across gateways. |
| Usage/Cost Metering Service | Kafka + stream processing (Apache Flink) + ClickHouse for analytics + PostgreSQL ledger tables | Compute accurate costs per request (tokens in/out, model pricing, file processing costs), generate billing-grade ledgers, and expose aggregates to admin/user dashboards. | Flink enables real-time aggregation while a PostgreSQL ledger ensures correctness and auditability. ClickHouse supports high-QPS analytics for dashboards. |
| Sharing Service | Go service + PostgreSQL (YugabyteDB) + CDN caching for public read views | Create public share links, snapshot/redaction, access control, and view tracking. | Share links require durable mapping and permissions. CDN accelerates read-heavy public access. |
| Admin & Observability Stack | Prometheus + Grafana; OpenTelemetry + Tempo/Jaeger; Loki; Sentry; Argo Rollouts for canary | Monitoring, tracing, logging, incident response, and admin dashboard for usage/cost/latency/provider health. | Standard cloud-native observability with a strong ecosystem; canary reduces risk when changing critical streaming paths. |
| Message Bus / Event Backbone | Apache Kafka (managed, e.g., Confluent Cloud) | Decouple write path from indexing, analytics, notifications, and offline processing. | Kafka scales to very high throughput (500M messages/day) and enables replayable event streams for multiple consumers. |

Data Flow


Data Storage

| Store | Type | Justification |
| --- | --- | --- |
| YugabyteDB (PostgreSQL-compatible distributed SQL) | sql | Strong consistency and durability with horizontal scaling and multi-region replication; ideal for immediately consistent conversation history and share-link metadata. |
| Redis Cluster | cache | Sub-millisecond counters for rate limiting/quota enforcement; session revocation; ephemeral WebSocket connection metadata. |
| Apache Kafka | queue | High-throughput event backbone to decouple indexing, analytics, metering, and async file processing from the latency-critical chat path. |
| Amazon S3 (Object Storage) | blob | Durable, scalable storage for user uploads (images/documents) and generated artifacts; integrates with signed URLs and lifecycle policies. |
| Elasticsearch | search | Full-text search with faceting for conversation history at scale, supporting near-real-time indexing from Kafka. |
| ClickHouse | nosql | High-performance OLAP for admin/user dashboards on usage, costs, latency, and provider performance. |
| PostgreSQL (Billing Ledger) | sql | Billing-grade immutable ledger entries require strict constraints, transactions, and auditability; kept separate from high-volume chat OLTP. |

API Design

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /v1/auth/session | Exchange OIDC code for application session (JWT/refresh); return user profile and tier entitlements. |
| POST | /v1/conversations | Create a new conversation (optionally with selected model, system prompt, folder/tags). |
| GET | /v1/conversations/{conversationId} | Fetch conversation metadata and messages with strong consistency (latest turns). |
| POST | /v1/conversations/{conversationId}/messages | Send a user message (non-streaming fallback) and receive the assistant response when complete. |
| WS | /v1/ws/chat | WebSocket endpoint for streaming chat. Client sends message frames; server streams tokens/events (delta tokens, tool/file status, final). |
| POST | /v1/files | Request an upload session; returns signed upload URL(s) and fileId(s). |
| GET | /v1/files/{fileId} | Fetch file metadata and processing status (scanned/parsed/ready). |
| GET | /v1/search | Search conversations/messages by query, filters (date, model, tags), and pagination. |
| POST | /v1/share | Create a public share link for a conversation snapshot with optional redaction rules. |
| GET | /v1/share/{shareId} | Retrieve shared conversation snapshot for public viewing (read-only). |
| GET | /v1/usage | Return current usage, remaining quotas, and recent cost estimates for the authenticated user. |
| GET | /v1/admin/metrics | Admin-only: aggregated metrics (DAU, messages, token volume, costs, provider error rates/latency). |

Scalability Strategy

  • Global active-active deployment across multiple regions (at least 3) with GeoDNS steering to the nearest healthy region.
  • WebSocket Gateways scale horizontally behind Envoy with connection-aware load balancing; services stay stateless and store only ephemeral connection metadata in Redis.
  • The hot path (quota check, context fetch, LLM streaming) is optimized for TTFT by: (1) precomputing and caching conversation summaries, (2) limiting the context window with rolling summarization, (3) parallelizing context fetch and file metadata fetch, and (4) using hedged requests to LLM providers after a short delay when p95 latency rises (see the sketch below).
  • Conversation history writes are strongly consistent using distributed SQL with synchronous replication and tuned transaction boundaries (persist the user message immediately; persist the assistant message incrementally with periodic checkpoints, then finalize).
  • Kafka decouples indexing/analytics and supports replay. Elasticsearch scales by sharding by tenant/time; ClickHouse scales via distributed tables and partitioning by date/model.
  • Rate limiting uses Redis Cluster with key hashing by userId to spread load; per-tier policies are cached at gateways.
  • For 500M messages/day, partition Kafka topics by conversationId hash and use consumer groups for the Search and Metering pipelines.
  • LLM adapters implement circuit breakers, bulkheads per provider, and region-aware routing; self-hosted vLLM provides a fallback capacity pool for reliability and cost control.
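The hedged-request idea in the list above could look roughly like this in Go: the primary provider is called immediately, a fallback is raced only if the primary is slow or fails, and the loser is cancelled. Function names and the string return type are simplifications of a streaming interface, not the orchestrator's actual API:

```go
package llm

import (
	"context"
	"time"
)

// callFn stands in for a streaming call to one LLM backend.
type callFn func(ctx context.Context) (string, error)

// Hedged issues the primary call immediately and, if no result arrives within
// hedgeAfter, races a secondary provider. The first success wins and the
// remaining call is cancelled, bounding tail latency and duplicate cost.
func Hedged(ctx context.Context, primary, secondary callFn, hedgeAfter time.Duration) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels whichever call is still running

	type result struct {
		out string
		err error
	}
	results := make(chan result, 2)
	launch := func(fn callFn) {
		go func() {
			out, err := fn(ctx)
			results <- result{out, err}
		}()
	}

	launch(primary)
	timer := time.NewTimer(hedgeAfter)
	defer timer.Stop()

	secondaryStarted := false
	pending := 1
	var lastErr error
	for pending > 0 {
		select {
		case <-timer.C:
			if !secondaryStarted { // primary is slow: hedge with the fallback
				secondaryStarted = true
				pending++
				launch(secondary)
			}
		case r := <-results:
			pending--
			if r.err == nil {
				return r.out, nil // first success wins; cancel() stops the other
			}
			lastErr = r.err
			if !secondaryStarted { // primary failed fast: fail over immediately
				secondaryStarted = true
				pending++
				launch(secondary)
			}
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	return "", lastErr
}
```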

Trade-offs

Use YugabyteDB (distributed SQL) for conversation history instead of DynamoDB/Cassandra

  • Strong consistency and SQL transactions simplify immediate-consistency requirements
  • Secondary indexes and relational modeling for conversations/messages/shares
  • Multi-region replication and HA with familiar Postgres ecosystem
  • Higher operational complexity and cost than single-region Postgres
  • Write latency can increase with synchronous multi-region replication
  • Careful schema/partition design needed to avoid hotspots

WebSocket Gateway as a separate tier from REST API services

  • Optimized for long-lived connections and high concurrency (100K+ per region)
  • Isolates streaming workloads from standard API traffic
  • Simplifies backpressure handling and connection lifecycle management
  • Additional component to operate and secure
  • More complex debugging across gateway-orchestrator boundary

Redis-based quota enforcement in the hot path

  • Very low latency suitable for TTFT < 500ms
  • Supports token bucket algorithms and tier-based policies
  • Reduces load on primary databases
  • Distributed counters require careful design for correctness (race conditions)
  • Redis outages can block traffic unless graceful degradation is implemented

Kafka event-driven pipelines for search indexing and cost analytics

  • Decouples latency-critical chat from heavy indexing/analytics
  • Enables replay, backfills, and multiple consumers
  • Handles very high throughput (500M messages/day)
  • Eventual consistency for search/analytics (not for core conversation reads)
  • Requires schema governance and exactly-once/at-least-once considerations

Elasticsearch for conversation search

  • Best-in-class full-text search, faceting, and relevance tuning
  • Scales horizontally via sharding and replicas
  • Rich query DSL for product features
  • Operational overhead: shard sizing, reindexing, cluster tuning
  • Index lag (seconds) unless aggressively tuned

LLM adapter with automatic failover and hedged requests

  • Improves availability and tail latency under provider issues (see the failover sketch after this list)
  • Abstracts provider-specific streaming formats and pricing
  • Supports routing by cost/performance/tier
  • Complexity in maintaining consistent user experience across providers
  • Risk of duplicate costs with hedged requests if not carefully canceled
  • Provider output differences can affect response consistency
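A minimal sketch of per-provider circuit breaking with ordered failover, assuming sony/gobreaker; the thresholds, provider ordering, and `Completion` type are illustrative assumptions rather than the adapter layer's actual interface:

```go
package llm

import (
	"context"
	"errors"
	"time"

	"github.com/sony/gobreaker"
)

// Completion is a placeholder for the adapter layer's streaming result.
type Completion struct{ Text string }

type provider struct {
	name string
	call func(ctx context.Context, prompt string) (Completion, error)
	cb   *gobreaker.CircuitBreaker
}

func newProvider(name string, call func(ctx context.Context, prompt string) (Completion, error)) provider {
	return provider{
		name: name,
		call: call,
		cb: gobreaker.NewCircuitBreaker(gobreaker.Settings{
			Name:    name,
			Timeout: 30 * time.Second, // how long the breaker stays open
			ReadyToTrip: func(c gobreaker.Counts) bool {
				// Trip after 5 consecutive failures; tune per provider SLO.
				return c.ConsecutiveFailures >= 5
			},
		}),
	}
}

// Complete walks the provider list in priority order, skipping any whose
// breaker is open, so a failing provider is bypassed immediately instead of
// waiting out its timeout on every request.
func Complete(ctx context.Context, providers []provider, prompt string) (Completion, error) {
	var lastErr error = errors.New("no providers configured")
	for _, p := range providers {
		out, err := p.cb.Execute(func() (interface{}, error) {
			return p.call(ctx, prompt)
		})
		if err == nil {
			return out.(Completion), nil
		}
		lastErr = err // includes gobreaker.ErrOpenState when the breaker is open
	}
	return Completion{}, lastErr
}
```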

Billing-grade cost ledger in PostgreSQL separate from OLTP conversation store

  • Strong auditability and immutability patterns for billing
  • Protects core conversation store from analytics/billing query load
  • Simplifies reconciliation and dispute handling
  • Data duplication and additional ETL/stream processing
  • Requires reconciliation logic between provider usage and internal metering

S3 + signed URLs for file uploads and downloads

  • Highly scalable and cost-effective for large binary storage
  • Offloads bandwidth from application services
  • Supports lifecycle policies and encryption controls
  • Requires careful access control to prevent link leakage
  • Additional pipeline complexity for scanning/parsing and status tracking