Generated on 2/8/2026, 3:19:56 PM · 7 run(s) across 1 problem(s)
This system is a large-scale conversational AI platform serving 20 million daily active users generating 500 million messages per day. The architecture follows a microservices pattern with clear separation between the real-time streaming layer, conversation management, LLM orchestration, and supporting services. The core design centers on a streaming gateway that delivers token-by-token responses with sub-500ms time-to-first-token, backed by an LLM orchestration layer that abstracts multiple model backends (OpenAI, Anthropic, self-hosted) with automatic failover. Conversations are persisted in a sharded PostgreSQL cluster for immediate consistency, with Redis caching for hot conversation context and S3 for file/multimodal uploads. The system is designed for multi-region deployment with regional streaming gateways, a global CDN for static assets, and a rate-limiting and billing pipeline that tracks per-request token costs. Key architectural decisions include preferring Server-Sent Events (SSE) over WebSocket for streaming simplicity, CQRS for separating write-heavy message ingestion from read-heavy history/search workloads, and an event-driven architecture via Kafka that decouples billing, analytics, and audit concerns from the critical path. The admin dashboard is powered by a dedicated analytics pipeline built on ClickHouse for real-time usage monitoring and cost attribution.
| Component | Technology | Responsibility | Justification |
|---|---|---|---|
| API Gateway | Kong Gateway (on Kubernetes) | Entry point for all client requests. Handles TLS termination, request routing, authentication verification, rate limiting enforcement, and load balancing across backend services. | Kong provides built-in rate limiting, JWT validation, request transformation, and plugin ecosystem. It handles both HTTP and WebSocket upgrade requests, supports declarative config via Kubernetes CRDs, and scales horizontally. Preferred over AWS API Gateway for lower latency and more control over WebSocket handling. |
| Streaming Gateway | Custom Go service with nhooyr/websocket | Manages long-lived SSE/WebSocket connections for real-time token streaming from LLM backends to clients. Handles connection lifecycle, heartbeats, backpressure, and reconnection. | Go excels at handling massive numbers of concurrent connections with minimal memory overhead (goroutine stacks start at a few KB versus roughly 1 MB for OS threads). A custom service allows precise control over backpressure, connection draining, and graceful failover. Each instance can handle 50K+ concurrent connections, so 2-3 instances per region cover the 100K target. |
| Auth Service | Node.js with Passport.js + Redis session store | User registration, login (email/password, Google OAuth, GitHub OAuth), JWT issuance and refresh, session management, and password reset flows. | Passport.js has mature OAuth provider integrations. Node.js is well-suited for I/O-bound auth workflows. Redis stores refresh tokens and session blacklists for O(1) lookups. JWTs are short-lived (15min) with Redis-backed refresh tokens for revocation capability. |
| Conversation Service | Python (FastAPI) | Core business logic for creating conversations, appending messages, managing conversation metadata (titles, folders, tags), and serving conversation history with pagination. | FastAPI provides async support, automatic OpenAPI docs, and excellent Python ecosystem integration for ML/AI tooling. Python aligns with the broader AI/ML ecosystem making it easy to integrate tokenizers, prompt engineering libraries, and model-specific utilities. |
| LLM Orchestrator | Python (FastAPI) with LiteLLM | Abstracts multiple LLM backends, handles model routing based on user selection, manages prompt assembly with conversation context, implements retry/failover logic, and streams tokens back to the Streaming Gateway. | LiteLLM provides a unified interface to 100+ LLM providers (OpenAI, Anthropic, Cohere, self-hosted vLLM). FastAPI's async streaming support enables efficient token forwarding. The orchestrator implements circuit breaker patterns per backend and automatic failover when a provider returns errors or exceeds latency thresholds. |
| File Processing Service | Python with Celery workers | Handles file upload, validation, virus scanning, format conversion, image resizing, OCR for documents, and preparing multimodal inputs for LLM consumption. | File processing is CPU-intensive and variable in duration — Celery workers can scale independently. Python has excellent libraries for image processing (Pillow), PDF extraction (PyMuPDF), and OCR (Tesseract). Workers pull from a Redis-backed task queue for reliable processing. |
| Search Service | Elasticsearch 8.x | Full-text search across conversation history, semantic search for finding relevant past conversations, and powering the organization/filtering UI. | Elasticsearch provides fast full-text search with relevance scoring, supports nested document structures ideal for conversations with messages, and offers built-in vector search (kNN) for semantic search. The inverted index is highly optimized for the search-heavy read pattern of conversation history. |
| Rate Limiter & Quota Service | Redis Cluster with Lua scripts | Enforces per-user, per-tier rate limits (requests/min, tokens/day), tracks usage quotas, and signals the API gateway to throttle or reject requests. | Redis provides sub-millisecond rate limit checks using sliding window counters implemented via Lua scripts for atomicity. Redis Cluster enables horizontal scaling. Token bucket and sliding window algorithms are implemented for different rate limiting needs (burst vs sustained). |
| Billing & Cost Tracking Service | Go service consuming from Kafka | Records per-request token usage and costs, aggregates billing data per user/organization, generates invoices, and feeds cost data to the admin dashboard. | Go provides the performance needed for high-throughput event processing. Kafka consumption decouples billing from the critical request path — if billing is slow, it doesn't affect user experience. Go's strong typing and low GC pauses ensure accurate, reliable cost aggregation at 500M messages/day. |
| Admin Dashboard Backend | Node.js (Express) + ClickHouse queries | Serves aggregated analytics, real-time usage metrics, cost reports, user management, system health monitoring, and model performance dashboards. | Node.js is efficient for the I/O-bound dashboard API pattern. ClickHouse provides sub-second analytical queries over billions of rows for real-time dashboards. The admin backend is a lightweight API layer that translates dashboard queries into optimized ClickHouse SQL. |
| CDN & Frontend | CloudFront CDN + Next.js (React) | Serves the React-based SPA, handles static assets, and provides edge caching for shared conversation pages. | Next.js provides SSR for shared conversation pages (SEO, social previews), static generation for marketing pages, and CSR for the interactive chat UI. CloudFront provides global edge caching with ~20ms latency to users worldwide. React's ecosystem has excellent Markdown rendering libraries (react-markdown, react-syntax-highlighter). |
| Event Bus | Apache Kafka (MSK) | Decouples services by publishing domain events (message_created, conversation_shared, tokens_consumed) for downstream consumers like billing, analytics, search indexing, and notifications. | Kafka handles the 500M+ events/day throughput with ease, provides exactly-once semantics for billing accuracy, supports multiple consumer groups (billing, analytics, search indexer), and offers configurable retention for replay capability. MSK reduces operational burden. |
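The LLM Orchestrator in the table above relies on LiteLLM for provider abstraction and failover. A minimal sketch of that failover behavior, assuming `litellm` is installed and provider API keys are set in the environment; the model names and fallback order are illustrative, not prescribed by the design:

```python
import litellm

# Illustrative fallback chain; real model names and ordering would come from the model catalog.
FALLBACK_MODELS = ["gpt-4o", "claude-3-5-sonnet-20240620"]  # a self-hosted vLLM entry could follow

def stream_completion(messages, timeout_s=10):
    """Try each backend in order and yield tokens from the first one that streams successfully."""
    last_err = None
    for model in FALLBACK_MODELS:
        try:
            response = litellm.completion(
                model=model,
                messages=messages,
                stream=True,        # token-by-token streaming
                timeout=timeout_s,  # treat a slow provider as failed and move on
            )
            for chunk in response:
                delta = chunk.choices[0].delta.content or ""
                if delta:
                    yield delta
            return                  # first provider that completes wins
        except Exception as err:    # provider error or timeout -> fail over to the next backend
            last_err = err
    raise RuntimeError(f"all LLM backends failed: {last_err}")
```

A production orchestrator would additionally wrap each backend in a per-provider circuit breaker and publish per-request token usage to Kafka, as the table describes.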
| Store | Type | Justification |
|---|---|---|
| PostgreSQL (Citus) | sql | Primary data store for users, conversations, messages, and billing records. Citus extension enables horizontal sharding by user_id, distributing the 500M messages/day write load across multiple nodes while maintaining strong consistency and ACID transactions within a user's data. Immediate consistency requirement rules out eventually-consistent NoSQL options. Sharding by user_id ensures all conversation data for a user is co-located for efficient joins and queries. |
| Redis Cluster | cache | Multi-purpose caching layer: (1) Conversation context cache — stores the last N messages of active conversations to avoid DB reads on every LLM request, reducing P99 latency. (2) Session/JWT blacklist store for auth. (3) Rate limiting counters with atomic Lua scripts. (4) Celery task broker for file processing. Redis Cluster provides automatic partitioning across 6+ nodes with built-in failover. |
| Elasticsearch 8.x | search | Powers full-text search across conversation history with BM25 relevance scoring and supports vector search (kNN) for semantic similarity. Conversations are indexed asynchronously via Kafka consumers, so search indexing doesn't block the critical write path. Supports nested documents for conversation-message hierarchy and faceted filtering by date, model, folder. |
| Amazon S3 | blob | Stores uploaded files (images, PDFs, documents) and conversation export archives. S3 provides 11 nines of durability, lifecycle policies for cost optimization (move old files to Glacier), and presigned URLs for secure direct client uploads. Multipart upload support handles large files efficiently. |
| Apache Kafka (MSK) | queue | Event streaming backbone carrying domain events (message_created, tokens_consumed, file_uploaded, conversation_shared) to downstream consumers. Kafka's partitioned log model supports parallel consumption by billing, search indexer, and analytics pipelines independently. At 500M messages/day, Kafka's throughput (millions of msgs/sec per cluster) provides massive headroom. Exactly-once semantics ensure billing accuracy. |
| ClickHouse | sql | Columnar OLAP database for real-time analytics powering the admin dashboard. Handles aggregation queries over billions of events (messages, token usage, costs) with sub-second response times. MergeTree engine provides efficient time-series storage with automatic data compaction. Chosen over Redshift for lower latency on interactive queries and over Druid for simpler operations. |
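The Redis row above caches the last N messages of active conversations so the hot path avoids database reads. A minimal cache-aside sketch using `redis-py`; the key format, TTL, window size, and the `fetch_recent_messages`/`insert_message` database helpers are illustrative assumptions:

```python
import json
import redis

r = redis.Redis(host="redis-cluster", port=6379, decode_responses=True)
CONTEXT_TTL_S = 3600   # 1-hour TTL, matching the hot-context policy described above
CONTEXT_SIZE = 10      # last N messages kept for prompt assembly

def get_context(conversation_id: str, db) -> list[dict]:
    """Return recent messages, reading from Redis first and falling back to Postgres."""
    key = f"ctx:{conversation_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)                        # cache hit: no DB read
    messages = db.fetch_recent_messages(conversation_id, limit=CONTEXT_SIZE)  # hypothetical DB helper
    r.set(key, json.dumps(messages), ex=CONTEXT_TTL_S)   # populate the cache for the next request
    return messages

def append_message(conversation_id: str, message: dict, db) -> None:
    """Write through: persist to Postgres, then refresh the cached context window."""
    db.insert_message(conversation_id, message)           # hypothetical DB helper
    key = f"ctx:{conversation_id}"
    messages = db.fetch_recent_messages(conversation_id, limit=CONTEXT_SIZE)
    r.set(key, json.dumps(messages), ex=CONTEXT_TTL_S)
```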
| Method | Endpoint | Description |
|---|---|---|
POST | /api/v1/auth/login | Authenticate user with email/password or OAuth token. Returns short-lived JWT access token (15min) and long-lived refresh token. Sets secure httpOnly cookie for refresh token. |
POST | /api/v1/auth/refresh | Exchange a valid refresh token for a new JWT access token. Implements refresh token rotation — old token is invalidated in Redis upon use. |
POST | /api/v1/conversations | Create a new conversation thread. Accepts optional model selection, system prompt, and folder assignment. Returns conversation_id and initial metadata. |
GET | /api/v1/conversations | List user's conversations with pagination, filtering (by folder, date range, model), and sorting. Returns conversation metadata including title, last message timestamp, message count, and model used. |
POST | /api/v1/conversations/{conversation_id}/messages | Send a new user message to a conversation. Triggers LLM completion. Returns message_id and a stream_url for the client to connect to for receiving the streamed response. Accepts optional file attachments by reference (file_ids from upload). |
GET | /api/v1/conversations/{conversation_id}/messages | Retrieve paginated message history for a conversation. Supports cursor-based pagination (before/after message_id). Returns messages with role, content, timestamp, token count, and model info. |
GET | /api/v1/stream/{message_id} | Server-Sent Events (SSE) endpoint for streaming LLM response tokens. The client connects after sending a message and receives token-by-token events, metadata events (model, token count), and a final done event with the complete message and usage stats (see the sketch after this table).
POST | /api/v1/files/upload | Upload a file (image, PDF, document) for use in conversations. Returns a presigned S3 URL for direct upload and a file_id for referencing in messages. Validates file type and size limits per user tier. |
POST | /api/v1/conversations/{conversation_id}/share | Generate a public sharing link for a conversation. Accepts optional expiration time and whether to include future messages. Returns a unique share URL that can be accessed without authentication. |
GET | /api/v1/search | Full-text search across user's conversation history. Accepts query string, filters (date range, model, folder), and pagination. Returns matching conversations and message snippets with highlighted matches. |
PATCH | /api/v1/conversations/{conversation_id} | Update conversation metadata including title, folder assignment, tags, pinned status, and archive status. Supports partial updates. |
DELETE | /api/v1/conversations/{conversation_id} | Soft-delete a conversation and all its messages. Data is retained for 30 days before permanent deletion. Triggers cleanup of associated search index entries and cached context. |
GET | /api/v1/user/usage | Retrieve current user's usage statistics including tokens consumed today/this month, message count, rate limit status, and quota remaining for their tier. |
GET | /api/v1/admin/dashboard/metrics | Admin-only endpoint returning aggregated platform metrics: DAU, messages/hour, token costs by model, error rates, P99 latencies, active connections, and top users by usage. Powered by ClickHouse queries. |
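The streaming endpoint above is SSE served by the FastAPI-based orchestrator. A minimal, self-contained sketch; the `stream_tokens` generator is a stand-in for the real orchestrator call, and the event shapes are illustrative:

```python
import asyncio
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def stream_tokens(message_id: str):
    """Stand-in for the LLM Orchestrator call; yields a few demo tokens."""
    for token in ["Hello", ", ", "world", "!"]:
        await asyncio.sleep(0.05)   # simulate generation latency
        yield token

async def sse_events(message_id: str):
    """Wrap orchestrator tokens as SSE 'data:' events, ending with a done event."""
    count = 0
    async for token in stream_tokens(message_id):
        count += 1
        yield f"data: {json.dumps({'type': 'token', 'text': token})}\n\n"
    yield f"data: {json.dumps({'type': 'done', 'completion_tokens': count})}\n\n"

@app.get("/api/v1/stream/{message_id}")
async def stream(message_id: str):
    # text/event-stream keeps the connection open so EventSource clients receive tokens as they arrive
    return StreamingResponse(sse_events(message_id), media_type="text/event-stream")
```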
The system employs a multi-layered horizontal scaling strategy designed to handle 20M DAU and 500M messages/day with significant headroom:

- **Compute Scaling (Kubernetes):** All core services run on Kubernetes (EKS) with Horizontal Pod Autoscaler (HPA) based on CPU, memory, and custom metrics (active connections for the Streaming Gateway, queue depth for File Processing). The Streaming Gateway scales on active streaming connections with a target of 40K connections per pod (Go's goroutine efficiency allows this). The LLM Orchestrator scales on in-flight requests to LLM backends.
- **Database Scaling (Citus Sharded PostgreSQL):** Conversations and messages are sharded by user_id using Citus, distributing data across 32+ worker nodes. This keeps all data for a single user co-located (avoiding cross-shard queries) while spreading the 500M daily message writes evenly. Read replicas per shard handle read-heavy workloads (conversation history browsing). Connection pooling via PgBouncer (256 connections per pool) prevents connection exhaustion.
- **Caching Strategy:** Redis Cluster with 12+ nodes provides the caching layer. Active conversation contexts (last 10 messages) are cached with a 1-hour TTL, eliminating ~80% of database reads on the hot path (LLM context assembly). A cache-aside pattern with write-through for conversation metadata keeps the cache consistent.
- **Event Processing Scaling:** Kafka topics are partitioned by user_id (128 partitions per topic), allowing consumer groups to scale horizontally. Billing consumers run 32 instances processing events in parallel. The search indexer runs 16 instances with bulk indexing to Elasticsearch.
- **Multi-Region Deployment:** The system deploys in US-East, US-West, and EU-West. Each region has its own Streaming Gateway fleet, Kong Gateway, and Redis cache. PostgreSQL uses Citus with the primary write cluster in one region and fast read replicas in others. For users requiring data residency (EU), a fully independent EU cluster is maintained. Route53 latency-based routing directs users to the nearest region.
- **LLM Backend Scaling:** The LLM Orchestrator implements weighted round-robin across multiple API keys per provider, connection pooling to self-hosted vLLM instances (which auto-scale GPU nodes based on queue depth), and circuit breakers per backend. Self-hosted vLLM runs on p4d.24xlarge instances with auto-scaling groups targeting 70% GPU utilization.
- **CDN and Static Scaling:** CloudFront serves all static assets and SSR pages from 400+ edge locations. Shared conversation pages are cached at the edge with a 5-minute TTL and cache invalidation on update.
- **Graceful Degradation:** Under extreme load the system degrades progressively: (1) reduce max context window length, (2) disable search indexing temporarily, (3) queue non-streaming requests, (4) serve cached responses for identical recent queries, (5) display a wait-queue UI rather than errors.
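The per-user rate limits referenced above are enforced atomically with Lua scripts in Redis. A minimal sliding-window sketch using `redis-py`; the key prefix, window, and limit are illustrative defaults rather than values from the design:

```python
import time
import uuid
import redis

r = redis.Redis(host="redis-cluster", port=6379)

# Atomic sliding window: drop expired entries, count the rest, admit only if under the limit.
SLIDING_WINDOW_LUA = """
local key    = KEYS[1]
local now_ms = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit  = tonumber(ARGV[3])
local member = ARGV[4]
redis.call('ZREMRANGEBYSCORE', key, 0, now_ms - window)
if redis.call('ZCARD', key) < limit then
  redis.call('ZADD', key, now_ms, member)
  redis.call('PEXPIRE', key, window)
  return 1
end
return 0
"""
sliding_window = r.register_script(SLIDING_WINDOW_LUA)

def allow_request(user_id: str, limit: int = 60, window_ms: int = 60_000) -> bool:
    """True if the user has made fewer than `limit` requests in the trailing window."""
    now_ms = int(time.time() * 1000)
    ok = sliding_window(keys=[f"rl:{user_id}"], args=[now_ms, window_ms, limit, str(uuid.uuid4())])
    return ok == 1
```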
This system design outlines a globally distributed, highly scalable conversational AI platform capable of serving 20 million daily active users with 500 million messages per day. The architecture employs a microservices approach with dedicated services for authentication, conversation management, real-time streaming, and LLM orchestration. The design emphasizes low-latency streaming responses (sub-500ms time to first token), horizontal scalability to support 100K+ concurrent WebSocket connections per region, and robust fault tolerance with automatic LLM backend failover. The system uses a multi-region deployment with geographic load balancing, PostgreSQL (Citus-sharded, with read replicas) for durable conversation storage, Redis for session management and caching, and Kafka for asynchronous event processing. A dedicated LLM Gateway service abstracts multiple LLM providers (OpenAI, Anthropic, custom models) and implements intelligent routing, retries, and cost tracking, while a Redis-backed service enforces rate limits and quotas. Real-time bidirectional communication is handled via WebSocket connections through a scalable connection manager, while a CDN delivers static assets and cached content globally.
| Component | Technology | Responsibility | Justification |
|---|---|---|---|
| API Gateway | Kong Gateway with OpenResty (Nginx + Lua) | Single entry point for all client requests; handles routing, authentication validation, rate limiting, request/response transformation, and SSL termination | Kong provides high-performance reverse proxy with built-in plugins for authentication, rate limiting, logging, and circuit breaking. Handles 10K+ RPS per instance with horizontal scalability and proven in production at scale. |
| Authentication Service | Node.js with Passport.js + Auth0 for identity management | User registration, login, JWT token issuance and validation, OAuth integration, session management, and user profile management | Auth0 provides enterprise-grade authentication with built-in security features, MFA, social login, and scales automatically. Node.js offers fast token validation and can handle 5K+ auth requests per second per instance. |
| WebSocket Connection Manager | Go with Gorilla WebSocket library, deployed on Kubernetes with HPA | Maintains persistent WebSocket connections, handles connection lifecycle, message routing, presence management, and broadcasts streaming responses to clients | Go excels at concurrent connection handling with lightweight goroutines. Each instance can handle 10K+ concurrent WebSockets with minimal memory overhead. Stateless design allows horizontal scaling based on connection count. |
| Conversation Service | Java Spring Boot with Spring Data JPA | CRUD operations for conversation threads, message persistence, context window management, conversation search, and thread organization | Spring Boot provides mature transaction management, excellent PostgreSQL integration, and strong consistency guarantees. JPA simplifies complex queries for conversation history and search. Battle-tested at enterprise scale. |
| LLM Gateway Service | Python with FastAPI and LangChain for LLM orchestration | Abstracts multiple LLM providers, routes requests to appropriate backends, handles streaming, implements retry logic with exponential backoff, tracks costs per request, and provides automatic failover | Python ecosystem has best LLM library support (OpenAI SDK, Anthropic SDK, transformers). FastAPI provides async streaming support essential for token-by-token delivery. LangChain simplifies multi-provider integration and context management. |
| File Processing Service | Python with Celery for async processing, Tesseract for OCR, PyPDF2 for PDF parsing | Handles file uploads, validates file types and sizes, extracts text from documents (OCR, PDF parsing), processes images for vision models, and stores files in object storage | Python has rich libraries for document processing and image manipulation. Celery provides distributed task queue for async processing of large files without blocking API responses. Can scale workers independently based on queue depth. |
| Search Service | Elasticsearch with custom analyzers for semantic search | Indexes conversation content, provides full-text search across message history, supports filtering by date, model, and tags | Elasticsearch provides sub-second full-text search across billions of documents. Supports complex queries, filtering, and aggregations. Can be extended with vector embeddings for semantic search. Scales horizontally with sharding. |
| Rate Limiter Service | Redis with Lua scripts for atomic rate limiting operations | Enforces per-user and per-tier rate limits, quota management, token bucket algorithm implementation, and communicates with billing service | Redis provides in-memory performance (<1ms latency) essential for rate limit checks on every request. Lua scripts ensure atomic operations for token bucket algorithms. Redis Cluster provides high availability and scales to millions of users. |
| Analytics & Monitoring Service | ClickHouse for OLAP analytics with Grafana for visualization | Collects usage metrics, tracks costs per request and per user, monitors system health, generates reports for admin dashboard | ClickHouse excels at high-volume time-series analytics with billions of rows, providing sub-second query performance for dashboards. Columnar storage reduces costs. Grafana provides rich visualization for admin dashboards. |
| Notification Service | Node.js with SendGrid for email, Firebase Cloud Messaging for push | Sends email notifications, push notifications, and in-app alerts for quota limits, system updates, and shared conversations | SendGrid provides reliable email delivery with analytics. FCM supports cross-platform push notifications. Node.js event-driven architecture handles high-volume async notifications efficiently. |
| Share Service | Go with Redis for link metadata caching | Generates unique shareable links for conversations, manages privacy settings and expiration, renders public conversation views | Go provides fast link generation and validation. Redis caches share metadata to avoid database lookups on every public link access. Stateless design allows easy scaling for viral shared conversations. |
| Store | Type | Justification |
|---|---|---|
| PostgreSQL 15 with Citus extension for horizontal sharding | sql | Primary datastore for users, conversations, messages, and relationships. Citus enables horizontal sharding by user_id to handle billions of messages. JSONB support for flexible message metadata. Strong ACID guarantees ensure conversation consistency. Read replicas handle query load. |
| Redis Cluster | cache | Multi-purpose: JWT session storage, rate limiting counters, conversation context caching, WebSocket connection metadata, and hot conversation cache. Sub-millisecond latency critical for rate limiting and session validation. Redis Cluster provides automatic sharding and replication. |
| Elasticsearch 8.x | search | Full-text search across conversation history. Handles complex queries with filters, highlighting, and relevance scoring. Inverted indexes provide fast search across billions of messages. Can be extended with kNN for semantic search using embeddings. |
| Amazon S3 with CloudFront CDN | blob | Stores uploaded files (images, documents), exported conversations, and shared conversation snapshots. S3 provides 99.999999999% durability, lifecycle policies for cost optimization, and versioning. CloudFront accelerates file delivery globally. |
| Apache Kafka | queue | Event streaming backbone for async processing: analytics events, usage tracking, cost calculation, audit logs, and notification triggers. Kafka provides durable message storage, replay capability, and scales to millions of events per second. Decouples producers from consumers. |
| ClickHouse | sql | Time-series analytics database for usage metrics, cost tracking, and admin dashboards. Optimized for OLAP queries with aggregations across billions of rows. Columnar storage provides 10-100x compression. Real-time ingestion from Kafka. |
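Billing and analytics in this design hang off Kafka events. A minimal consumer sketch using `kafka-python`; the topic name, event shape, and per-token rates are illustrative assumptions:

```python
import json
from collections import defaultdict
from kafka import KafkaConsumer

# Illustrative per-1K-token rates; real rates would come from the model catalog.
RATE_PER_1K = {"gpt-4o": 0.01, "claude-3-5-sonnet": 0.009, "self-hosted": 0.001}

consumer = KafkaConsumer(
    "tokens_consumed",                             # assumed topic name
    bootstrap_servers=["kafka:9092"],
    group_id="billing-aggregator",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    enable_auto_commit=False,                      # commit offsets only after the cost is recorded
)

cost_by_user = defaultdict(float)

for record in consumer:
    event = record.value                           # e.g. {"user_id": ..., "model": ..., "total_tokens": ...}
    rate = RATE_PER_1K.get(event["model"], 0.0)
    cost_by_user[event["user_id"]] += event["total_tokens"] / 1000 * rate
    # In production the aggregate would be flushed to ClickHouse/Postgres in batches;
    # here we simply commit the offset after updating the in-memory total.
    consumer.commit()
```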
| Method | Endpoint | Description |
|---|---|---|
POST | /api/v1/auth/register | Register a new user account with email and password, returns JWT access and refresh tokens |
POST | /api/v1/auth/login | Authenticate user credentials and issue JWT tokens with user tier information |
POST | /api/v1/conversations | Create a new conversation thread, returns conversation_id and initial metadata |
GET | /api/v1/conversations/{conversation_id} | Retrieve full conversation thread with all messages, supports pagination and filtering |
POST | /api/v1/conversations/{conversation_id}/messages | Send a new message in a conversation, triggers LLM processing, returns message_id for tracking |
WS | /ws/v1/stream | WebSocket endpoint for real-time bidirectional communication, streams LLM responses token-by-token, handles connection lifecycle |
GET | /api/v1/conversations/search | Full-text search across user's conversation history with filters for date range, model, and tags |
POST | /api/v1/files/upload | Upload files for multimodal input, supports images and documents up to 50MB, returns file_id and processing status |
POST | /api/v1/conversations/{conversation_id}/share | Generate a public shareable link for a conversation with configurable expiration and privacy settings |
GET | /api/v1/models | List available LLM models with capabilities, pricing, and context window information |
GET | /api/v1/users/me/usage | Get current user's usage statistics, quota consumption, and rate limit status |
GET | /api/v1/admin/analytics/usage | Admin endpoint for aggregated usage metrics, costs by model, and active user statistics |
DELETE | /api/v1/conversations/{conversation_id} | Soft delete a conversation thread, marks as deleted but retains for recovery period |
**Horizontal Scaling Approach:**
1. **Stateless Services**: All application services (API Gateway, Conversation Service, LLM Gateway, Auth Service, WebSocket Manager) are stateless and containerized on Kubernetes, with auto-scaling policies based on CPU (70% threshold) and custom metrics (concurrent connections for the WS Manager, queue depth for File Processing).
2. **WebSocket Connection Distribution**: Each WebSocket Manager instance handles 10K concurrent connections. With a 100K target per region, deploy 10+ instances with sticky-session routing at the load balancer using consistent hashing on user_id. Connection metadata stored in Redis allows any instance to route messages.
3. **Database Sharding**: PostgreSQL with the Citus extension shards data by user_id across 16 initial shards, expandable to 64+. Each shard handles ~1.25M users. Read replicas (3 per shard) distribute query load. Message tables are partitioned by created_at (monthly) for efficient archival.
4. **LLM Gateway Scaling**: Python FastAPI instances scale based on request queue depth in Kafka. Each instance maintains connection pools to external LLM APIs (OpenAI, Anthropic) with circuit breakers. Geographic proximity routing to LLM endpoints reduces latency.
5. **Caching Strategy**: Redis Cluster with 12 nodes (4 shards × 3 replicas) caches conversation contexts (30min TTL), user sessions (24hr), rate limit counters (1hr sliding window), and hot conversations (top 10% by access). Cache hit rate target: 85%+.
6. **Multi-Region Deployment**: Deploy across 3 regions (US-East, EU-West, Asia-Pacific) with Route53 geo-routing; each region handles ~7M DAU. Cross-region PostgreSQL replication (async) provides disaster recovery. Kafka MirrorMaker 2 replicates events for analytics aggregation.

**Vertical Scaling Considerations:**
- PostgreSQL instances: start with r6g.4xlarge (16 vCPU, 128GB RAM), scale to r6g.8xlarge for the primary; read replicas on r6g.2xlarge.
- Redis Cluster: r6g.xlarge nodes (4 vCPU, 32GB RAM per node).
- LLM Gateway: CPU-optimized c6i.2xlarge for fast Python execution.
- ClickHouse: storage-optimized i3en.2xlarge for cost-effective analytics.

**Capacity Planning for 500M messages/day**: ~5,800 msgs/sec sustained, ~12K msgs/sec peak. At ~50 concurrent requests per LLM Gateway instance, roughly 200 instances across the three regions provide ~10K concurrent LLM request capacity. Over-provision by 50% for traffic spikes and failover capacity.
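The capacity figures above reduce to simple arithmetic; a quick back-of-the-envelope check (the ~2x burst factor and per-instance connection count are the planning assumptions stated above):

```python
MESSAGES_PER_DAY = 500_000_000
SECONDS_PER_DAY = 86_400

sustained_rps = MESSAGES_PER_DAY / SECONDS_PER_DAY            # ~5,787 msgs/sec sustained
peak_rps = sustained_rps * 2                                   # ~2x burst factor -> ~11.6K msgs/sec

WS_CONNECTIONS_PER_INSTANCE = 10_000                           # per WebSocket Manager instance
TARGET_WS_PER_REGION = 100_000
ws_instances_per_region = TARGET_WS_PER_REGION // WS_CONNECTIONS_PER_INSTANCE  # 10 instances

print(f"sustained: {sustained_rps:,.0f} msg/s, peak: {peak_rps:,.0f} msg/s, "
      f"WS instances/region: {ws_instances_per_region}")
```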
This system design describes a globally distributed, high-concurrency platform similar to ChatGPT, capable of handling 20M DAU and 500M messages per day. The architecture focuses on low-latency streaming (TTFT < 500ms), immediate consistency for conversation history, and high availability across multiple LLM backends through an intelligent inference orchestration layer. It utilizes an event-driven model for background tasks like cost tracking and search indexing, while maintaining persistent connections for real-time interaction.
| Component | Technology | Responsibility | Justification |
|---|---|---|---|
| Global Load Balancer | Google Cloud Load Balancing or AWS Global Accelerator | Routes traffic to the nearest geographic region and handles SSL termination. | Provides low-latency entry points and sophisticated health-checking across global regions. |
| Edge Gateway / API Gateway | Kong or Envoy | Handles authentication, rate limiting (per-tier), and request routing. | High-performance proxy that supports custom plugins for quota management and JWT validation. |
| Chat & Context Service | Go (Golang) | Orchestrates chat logic, manages conversation state, and formats prompts. | Golang's concurrency model (goroutines) is ideal for managing thousands of simultaneous streaming connections with low memory overhead. |
| Inference Orchestrator | Custom microservice (Python/FastAPI or Go) | Routes requests to LLM backends, handles retries, circuit breaking, and failover. | Decouples the chat logic from specific LLM APIs, allowing for dynamic weight shifting and cost optimization. |
| Streaming Engine | Server-Sent Events (SSE) over HTTP/2 | Maintains persistent connections for pushing tokens to the client. | SSE is simpler than WebSockets for unidirectional server-to-client streaming, and the browser's EventSource API reconnects automatically. |
| Usage & Billing Service | Apache Flink | Tracks token consumption and costs per user/request for real-time quota enforcement. | Stateful stream processing of token counts in real time is needed to prevent usage beyond quotas. |
| Store | Type | Justification |
|---|---|---|
| PostgreSQL (with Citus) | sql | Ensures ACID compliance and immediate consistency for chat history. Citus allows horizontal sharding to handle 500M messages/day. |
| Redis | cache | Used for session management and caching recent conversation context to minimize DB hits during active turns. |
| Elasticsearch | search | Provides full-text search capabilities over millions of conversations with complex filtering (by date, model, or folder). |
| Amazon S3 / Google Cloud Storage | blob | Durable storage for multimodal inputs (images, PDF documents) and exported chat logs. |
| Apache Kafka | queue | Decouples chat streaming from analytical/billing tasks. Ensures that slow storage or billing updates do not block the user response. |
| Method | Endpoint | Description |
|---|---|---|
POST | /v1/auth/login | Authenticates user and returns a JWT session token. |
POST | /v1/chat/completions | Primary endpoint for sending messages. Supports 'stream: true' for SSE responses. |
GET | /v1/conversations | Retrieves a paginated list of the user's conversation history. |
POST | /v1/conversations/{id}/share | Generates a public, read-only URL for a specific conversation thread. |
POST | /v1/files/upload | Uploads multimodal content; returns a file ID for inclusion in chat completions. |
GET | /v1/models | Lists available LLM backends and their specific capabilities (e.g., vision, long context). |
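The completions endpoint above supports `stream: true` over SSE. A minimal client sketch using `requests`; the base URL, auth header, payload fields, and the `token`/`[DONE]` event shape are illustrative assumptions:

```python
import json
import requests

def stream_chat(prompt: str, token: str, base_url: str = "https://api.example.com"):
    """POST a message with stream=true and print tokens as SSE events arrive."""
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        headers={"Authorization": f"Bearer {token}", "Accept": "text/event-stream"},
        json={"model": "gpt-4o", "messages": [{"role": "user", "content": prompt}], "stream": True},
        stream=True,                      # keep the HTTP connection open for SSE
        timeout=60,
    )
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue                      # skip keep-alive blank lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        print(event.get("token", ""), end="", flush=True)
```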
The system scales horizontally at the service level using Kubernetes. The Chat Service and Inference Orchestrator are stateless, allowing auto-scaling based on CPU/memory and concurrent connection counts. Database scalability is achieved through PostgreSQL sharding on user_id, which keeps a single user's history local to one shard. Each region runs its own isolated connection fleet to meet the 100K-concurrent-connection requirement, while a global Redis layer or database replication handles shared state such as public share links.
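Sharding on user_id means every request can deterministically resolve which shard holds a user's history. Citus performs this routing internally; the toy sketch below only illustrates the data-locality idea, and the shard count and DSN naming are purely illustrative:

```python
import hashlib

SHARD_DSNS = [f"postgres://chat-shard-{i}.internal/chat" for i in range(16)]   # illustrative shard fleet

def shard_for_user(user_id: str) -> str:
    """Map a user_id to a stable shard so all of that user's conversations co-locate."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARD_DSNS)
    return SHARD_DSNS[index]

# Every call with the same user_id resolves to the same shard.
print(shard_for_user("user-12345"))
```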
A distributed, event-driven architecture designed to support 20M+ DAU for a ChatGPT-like application. The system leverages persistent WebSocket connections for low-latency streaming (TTFT < 500ms), a Model Orchestration Layer to abstract various LLM backends, and a tiered storage strategy (Redis -> DynamoDB -> S3) to handle the high write throughput of 500M messages per day. The design prioritizes interactivity and durability while ensuring strict cost governance and rate limiting.
| Component | Technology | Responsibility | Justification |
|---|---|---|---|
| Edge Gateway / API Gateway | AWS Application Load Balancer + Kong Gateway | SSL termination, Geo-routing, Rate limiting, Authentication verification. | Kong provides robust plugin support for rate-limiting (Token Bucket) and JWT validation before traffic hits internal services. |
| Connection Manager (Chat Service) | Go (Golang) on Kubernetes | Manages WebSocket connections, broadcasts stream chunks, handles user state. | Go's Goroutines are ideal for handling hundreds of thousands of concurrent WebSocket connections with low memory footprint compared to Node.js or Python. |
| Model Orchestrator | Python (FastAPI) with LangChain adapters | Standardizes API calls to different LLM providers, handles retry logic, and failover. | Python ecosystem has the best libraries for LLM integration. Isolating this allows independent scaling based on inference latency. |
| Context Assembly Service | Rust Microservice | Retrieves relevant chat history and injects system prompts/RAG context before inference. | Rust provides the very low latency needed to fetch and tokenize context before inference, which helps meet the 500ms TTFT constraint. |
| Billing & Analytics Consumer | Apache Flink | Consumes completed message events to calculate costs and update quotas. | Stateful stream processing needed to aggregate token usage in real-time for strict quota enforcement. |
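The Context Assembly Service above is specified in Rust; the following Python sketch only illustrates the trimming logic it would perform, with the ~4-characters-per-token estimate and the 8K budget as assumptions:

```python
def assemble_context(system_prompt: str, history: list[dict], budget_tokens: int = 8_000) -> list[dict]:
    """Keep the system prompt plus the most recent messages that fit within the token budget."""
    def estimate_tokens(text: str) -> int:
        return max(1, len(text) // 4)           # rough heuristic: ~4 characters per token

    remaining = budget_tokens - estimate_tokens(system_prompt)
    kept: list[dict] = []
    for message in reversed(history):            # newest first, so recent turns win
        cost = estimate_tokens(message["content"])
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    return [{"role": "system", "content": system_prompt}] + list(reversed(kept))
```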
| Store | Type | Justification |
|---|---|---|
| Amazon DynamoDB | nosql | Primary store for Chat History. Supports massive write throughput (500M msgs/day) and efficient querying by Partition Key (ConversationID) and Sort Key (Timestamp). |
| Redis Cluster | cache | Stores active session state, recent conversation context (window), and user rate limit counters to minimize latency on the critical path. |
| Amazon S3 | blob | Storage for user-uploaded images/documents. Low cost, high durability, and allows offloading bandwidth via Presigned URLs. |
| PostgreSQL | sql | Stores structured relational data: User profiles, Organization hierarchies, Billing Invoices, and configuration settings. |
| Elasticsearch / OpenSearch | search | Provides full-text search capabilities over chat history, which DynamoDB cannot handle efficiently. |
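The DynamoDB row above keys chat history by ConversationID (partition key) and Timestamp (sort key). A minimal boto3 sketch of the write and read paths; the table and attribute names are assumptions consistent with that row:

```python
import time
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Messages")     # assumed table name

def append_message(conversation_id: str, role: str, content: str) -> None:
    """Write one message item under the conversation's partition key."""
    table.put_item(Item={
        "ConversationID": conversation_id,               # partition key
        "Timestamp": int(time.time() * 1000),            # sort key (ms precision)
        "Role": role,
        "Content": content,
    })

def recent_messages(conversation_id: str, limit: int = 50) -> list[dict]:
    """Read the newest messages for a conversation, newest first."""
    resp = table.query(
        KeyConditionExpression=Key("ConversationID").eq(conversation_id),
        ScanIndexForward=False,                           # descending by sort key
        Limit=limit,
    )
    return resp["Items"]
```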
| Method | Endpoint | Description |
|---|---|---|
WS | /ws/v1/chat | Main WebSocket endpoint for bi-directional streaming of prompts and LLM responses. |
POST | /v1/conversations | Creates a new conversation thread, returns conversation_id. |
GET | /v1/conversations/{id}/messages | Retrieves paginated message history for a specific conversation. |
GET | /v1/models | Lists available LLM models user is authorized to use. |
POST | /v1/files/upload-url | Generates a presigned S3 URL for uploading images or documents. |
Horizontal scaling via Kubernetes HPA based on CPU and custom metrics (Active WebSocket Connections). Database scales via DynamoDB On-Demand capacity or provisioned capacity with auto-scaling. The system is sharded by ConversationID for data locality. Redis Cluster handles hot-path reads. A Queue-based decoupling (Kafka) allows background tasks (search indexing, analytics) to scale independently of the real-time chat service.
A globally distributed, real-time conversational AI platform supporting multi-turn chat with multiple LLM backends, multimodal inputs, and rich history management. The system is built for 20M DAUs and 500M messages/day with sub-500ms time-to-first-token via a high-performance WebSocket gateway, an LLM routing layer with fast failover, and region-affine, strongly consistent storage for conversation history. Analytics, cost tracking, and admin observability are first-class through an events pipeline into ClickHouse and Prometheus/Grafana. The core data plane is stateless, horizontally scalable on Kubernetes, and tolerant of provider or regional failures.
| Component | Technology | Responsibility | Justification |
|---|---|---|---|
| Client Web App | Next.js + React, TypeScript, WebSocket, highlight.js, Markdown-it | SPA for chat UI, Markdown rendering, code highlighting, WebSocket streaming, file uploads, search, and sharing | Mature ecosystem, SSR for SEO (shared links), excellent dev productivity and performance |
| Edge CDN/WAF & Global LB | Cloudflare CDN + Cloudflare Load Balancer + Bot Management | TLS termination, caching static assets, DDoS/WAF, geo-steering to nearest healthy region | Global footprint, anycast, robust WAF and health-based geo-routing to meet latency and availability targets |
| API & WebSocket Gateway | Go microservice on Kubernetes with NGINX Ingress (ALB) and HTTP/2; gorilla/websocket; gRPC to internal services | Single entry for REST and WebSocket; authZ/authN checks, rate limiting, session validation, request fan-out to internal services; streams tokens to client | Go delivers low-latency IO and high concurrency; stable WS handling; NGINX Ingress + ALB scale well |
| Auth Service | Auth0 (OIDC) + JWT (RS256) | User identity, OAuth/social login, JWT issuance, refresh tokens, RBAC/roles (user/admin) | Fast to integrate, enterprise SSO, adaptive MFA; offloads identity risk; standards-compliant OIDC |
| Rate Limit & Quota Service | Envoy Global Rate Limit Service + Redis Cluster; Lua in NGINX for shadow checks | Enforces per-user/tier rate limits (sliding window) and quotas; provides near-real-time counters | Envoy RLS is battle-tested; Redis offers sub-ms counters and atomicity with Lua scripts |
| Session/Cache Store | Redis Cluster (6.x) with Redis Streams for ephemeral events | JWT blacklist, session metadata, ephemeral streaming buffers, recent context window cache | In-memory speed, high availability via clustering and replication |
| Conversation Service | PostgreSQL (Citus) multi-tenant sharded by user_id; Go service using pgx | CRUD for conversations/messages, context building, sharing ACLs, foldering/tags; transactional writes | Immediate consistency and SQL semantics; Citus scales horizontally and keeps p95 low with partitioning |
| Search/Indexing Service | OpenSearch (multi-az) + k-NN plugin; background workers (Go) for indexing | Full-text search over titles/messages; semantic search via embeddings; indexing pipeline | Scalable search with near real-time indexing; k-NN for semantic search without extra vector DB |
| LLM Router | Go service with gobreaker, HTTP/2 keep-alive pools; provider SDKs; configuration via Consul/etcd | Model catalog, routing to providers/in-house; health checks, circuit breakers, retries, cost-aware selection; streaming token multiplexing | Low-latency, robust control plane with per-provider health and dynamic routing rules |
| Provider Connectors | Connectors for "OpenAI/Anthropic/Azure OpenAI/Google Vertex"; retries with exponential backoff; streaming adapters | Integrations to external LLMs and embeddings | Diversity reduces provider risk and enables cost/performance optimization |
| In-house Inference Cluster | vLLM on Kubernetes GPU nodes (NVIDIA A10/A100), Triton for embeddings; Istio for mTLS | Self-hosted models (vLLM) for failover and cost control; embeddings server | High throughput, streaming-friendly; cost-efficient for baseline models and embeddings |
| File Ingestion Service | Amazon S3 + S3 Object Lambda (virus scan with ClamAV) + AWS Textract + Apache Tika; Step Functions for orchestration | Pre-signed uploads, virus scanning, OCR/text extraction, chunking; links assets to messages | Serverless pipeline scales elastically; S3 durability and cost efficiency for blobs |
| Cost & Billing Service | Kafka consumers (Go) -> ClickHouse for analytics; Postgres for authoritative balances | Compute per-request cost (provider rates, tokens, GPU time), store usage, expose invoices and quotas | ClickHouse excels at high-ingest analytics; Postgres for transactional balances and limits |
| Event Bus | Apache Kafka (AWS MSK) | Asynchronous events: usage, costs, audit logs, indexing triggers | High-throughput, durable event streaming; ecosystem support |
| Analytics & Monitoring | Prometheus + Grafana; OpenTelemetry + Jaeger; Loki for logs; CloudWatch for infra | Dashboards, alerts, traces, logs | Proven OSS stack, vendor-neutral instrumentation |
| Admin Dashboard | Next.js + RBAC; reads from ClickHouse/Prometheus/Postgres | Operational UI: usage, costs, errors, provider health, throttles; model catalog management | Unified operational control plane with low-latency analytics queries |
| CDN Assets & Static Hosting | Cloudflare + S3 static site hosting | Serve static JS/CSS/images | Global low-latency delivery for assets |
| Store | Type | Justification |
|---|---|---|
| PostgreSQL (Citus) | sql | Strong consistency and transactions for conversations/messages; Citus provides horizontal sharding by user_id with high write throughput and low-latency queries |
| Redis Cluster | cache | Sub-millisecond counters for rate limits, sessions, ephemeral streaming buffers, and hot context windows |
| Amazon S3 | blob | Durable, cost-effective storage for file uploads, images, and large attachments; lifecycle policies for tiering |
| OpenSearch | search | Full-text and semantic search with k-NN; scalable indexing and near real-time search for conversation history |
| Kafka (AWS MSK) | queue | Durable, scalable event streaming for usage, billing, indexing, and audit logs decoupling producers/consumers |
| ClickHouse | sql | High-ingest, columnar analytics for usage and cost reporting; sub-second aggregations at scale |
| Method | Endpoint | Description |
|---|---|---|
WS | /v1/ws | Bidirectional WebSocket for sending user messages and receiving token-streaming responses and events |
POST | /v1/conversations | Create a new conversation (title, tags, model selection, visibility) |
GET | /v1/conversations | List conversations with filters (folder, tag, starred) and pagination |
GET | /v1/conversations/{id} | Get a conversation with messages (server-side pagination) |
POST | /v1/conversations/{id}/messages | Add a user message to a conversation (text, file refs, tool calls) |
GET | /v1/messages/{id} | Get message detail and streaming status |
GET | /v1/search | Search conversations/messages (full-text + semantic options) |
GET | /v1/models | List available models and tiers, pricing metadata |
POST | /v1/files | Initiate file upload and get pre-signed URL; returns file_id |
POST | /v1/share/{conversation_id} | Create/update share link (public/unlisted/expire) |
GET | /v1/usage | Per-user usage and remaining quota by period |
GET | /v1/admin/metrics | Admin: provider health, error rates, throughput, cost summaries |
PUT | /v1/admin/models | Admin: manage model catalog, routing weights, and availability |
- **Traffic and sessions:** Anycast via Cloudflare to the nearest region. Sticky sessions are not required; WebSocket connections are long-lived and evenly distributed via the ALB. Gateway pods autoscale on CPU and open FDs; each Go pod targets ~4-5K concurrent WS connections, so 30 pods cover 150K WS per region with headroom.
- **Storage:** Citus shards by user_id across nodes; primaries and replicas are co-located in the same AZ to minimize latency. Connection pooling via PgBouncer. Hot partitions are handled by rebalancing shards. PITR and logical replication to a DR region.
- **Search:** The OpenSearch domain scales horizontally across data nodes. Index with 1-3 primary shards per index and ILM for rollover. Async indexers consume from Kafka for sustained throughput.
- **LLM routing:** Health probes and circuit breakers per provider/region; latency-aware load balancing and hedged requests before the first token. In-house vLLM autoscales on GPU metrics (queue depth, tokens/sec). Keep-alive HTTP/2 pools reduce TTFB.
- **Rate limiting:** Redis Cluster with hash tags for per-user keys ensures single-shard updates. Sliding windows use Lua for atomicity. Quotas are aggregated periodically from ClickHouse and persisted to Postgres as the source of authority.
- **WebSockets:** Separate HPA based on concurrent connections and network IO. Use SO_REUSEPORT and pod anti-affinity. Idle pings detect dead peers. Backpressure controls avoid OOM.
- **Multi-region:** Active-active per region; users are region-affined based on a home region stored in the profile. Cross-region failover via Cloudflare LB health checks; if the home region is down, reads come from the last durable snapshot in DR and writes run in a degraded mode (queued for backfill) with a user notice.
- **Observability:** OpenTelemetry traces propagate across the gateway, LLM router, and connectors. SLO-based autoscaling and alerting on TTFT and error budgets.
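The "hedged requests before the first token" above can be sketched with asyncio: fire the primary provider, and if no first token arrives within a deadline, race a secondary and take whichever answers first. The provider call here is a simulated stand-in, and the latencies and deadline are illustrative:

```python
import asyncio

async def first_token(provider: str, prompt: str) -> tuple[str, str]:
    """Stand-in for a provider call; returns (provider, first token) after simulated latency."""
    latency = {"primary": 0.8, "secondary": 0.2}.get(provider, 0.5)
    await asyncio.sleep(latency)
    return provider, "Hello"

async def hedged_first_token(prompt: str, hedge_after_s: float = 0.3) -> tuple[str, str]:
    """Start the primary; if it hasn't produced a first token by the deadline, hedge to the
    secondary and take whichever answers first. The loser is cancelled to avoid double cost."""
    primary = asyncio.create_task(first_token("primary", prompt))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:
        return primary.result()
    secondary = asyncio.create_task(first_token("secondary", prompt))
    done, pending = await asyncio.wait({primary, secondary}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()

print(asyncio.run(hedged_first_token("hi")))   # ('secondary', 'Hello') with the simulated latencies
```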
A globally distributed, real-time web platform that enables multi-turn conversations with configurable LLM backends, streaming responses token-by-token, durable conversation history, multimodal inputs, per-user quotas and billing, and admin monitoring. The design uses managed cloud components where appropriate (AWS examples used for concreteness) and is built for 20M DAU and ~500M messages/day — with multi-region deployment, autoscaling WebSocket clusters, strong consistency for conversation data, semantic search, and resilient LLM backend routing with automatic failover and cost accounting.
| Component | Technology | Responsibility | Justification |
|---|---|---|---|
| Edge / CDN | AWS CloudFront + AWS WAF (or Cloudflare) for global CDN & edge security | Global caching, TLS termination, hosting static assets, and routing to nearest API region. Protect against DDoS and serve prerendered content. | Low latency global content delivery and edge protections. CloudFront integrates with regional ALBs and AWS API Gateway; WAF provides DDoS/IPS rules. |
| API Gateway (REST & WebSocket) | AWS API Gateway (HTTP/APIGW v2 for WebSocket) or AWS Application Load Balancer + NLB for WebSocket if custom stack preferred | Ingress point for HTTP(S) REST APIs and managed WebSocket connections, authentication/authorization integration, metrics, and throttling. | Managed API Gateway handles large scale WebSocket connections reliably and integrates with Lambda and VPC targets. Reduces operational burden to meet 100K+ connections per region. |
| Auth Service / Identity | Auth0 or Amazon Cognito (or self-hosted Keycloak for more control) | User authentication (email/password, OAuth), session issuance, token lifecycle, MFA, and account management. | Managed identity reduces time to market; Cognito/Auth0 handle scaling, OIDC/OAuth flows, social login, and integrate with API Gateway and IAM. Can fallback to Keycloak if self-hosting required for compliance. |
| Frontend (Web & Mobile clients) | React + Next.js for Web (SSR), React Native for mobile; use WebSocket & SSE clients for streaming. Use remark/rehype for markdown rendering and DOMPurify for sanitization. | UI for conversations, streaming UI, markdown rendering/sanitization, file uploads, sharing links, offline behaviors, and websocket clients. | Next.js gives performant SSR/CSR mix and edge support; well-supported libraries for markdown and security. |
| Connection Manager / WebSocket Workers | Kubernetes (EKS) running horizontally scaled WebSocket worker pods behind API Gateway or ALB, using Envoy/ingress for routing. Use Redis for presence/connection metadata. | Maintain WebSocket connections, route tokens to clients, enforce per-connection rate limits, maintain ephemeral state, and connect to LLM streaming output. | Kubernetes provides autoscaling and lifecycle control. Breaking stream work into worker pods allows streaming token-by-token with low-latency writes to sockets; Redis stores connection mapping for routing in multi-pod deployments. |
| Conversation Service (API) | Stateless microservice in Kubernetes (gRPC/HTTP) with connection to Aurora PostgreSQL (Primary writer) and a caching layer (Redis). | Handles conversation CRUD, multi-turn context assembly, versioning, bookmarks, shareable link creation, and immediate persistent writes. | Stateless services scale easily. Aurora PostgreSQL provides strong consistency and supports high write throughput with multi-AZ. Redis accelerates hot path reads and rate-limit checks. |
| Message Ingest & Streaming Orchestrator (LLM Router) | Stateless microservice (Kubernetes) using gRPC to LLM backends; feature-rich router capability using Hystrix-like circuit breaker libraries and per-model adapters. Persist logs & events to Kafka (MSK) for downstream processing. | Orchestrates sending prompts to selected LLM backend(s), streams tokens back to Connection Manager, calculates per-request cost, applies circuit-breakers and failover to alternate models/backends, and logs telemetry. | Centralized routing simplifies failover, cost accounting, and policy enforcement. gRPC yields low-latency backend calls; Kafka provides durable eventing for billing and analytics. |
| LLM Backends | Hybrid: External providers (OpenAI/Anthropic) + Internal GPU clusters orchestrated by Kubernetes + Triton / NVIDIA TensorRT / Ray Serve for model serving. Use model proxies that expose gRPC or HTTP streaming. | Provide model inference and token streaming. Could be managed external APIs (OpenAI, Anthropic) and/or internal GPU clusters (private models). | Hybrid provides capacity and cost controls: external for burst/spiky loads and internal for steady-state/private models. Triton/Ray Serve are production-ready for large model serving with streaming support. |
| Cache & Rate-Limit Store | Redis (Amazon ElastiCache in clustered mode with clustering-enabled Redis or Redis Enterprise) | Fast token-bucket rate limits, session cache, short-lived conversation caches for hot reads, and presence store. | Redis supports very low-latency operations, atomic counters, Lua scripting for rate-limiting logic, and clustering for scale. |
| Durable Storage (Conversations / Metadata / Billing) | Amazon Aurora PostgreSQL (clustered, multi-AZ, read-replicas) with partitioning/sharding by tenant or hashed conversation id. | Immediate-consistency primary store for conversations, messages, user metadata, billing records, and access controls. | Relational strong consistency and transactions for immediate-consistency requirement; Aurora scales reads and provides high durability and automated backups. |
| Object Store (Files & Attachments) | Amazon S3 with S3 Object Lambda hooks; presigned uploads; Lambda for scanning via ClamAV or third-party virus scanning | Store uploaded files (images, docs) and serve them to model pipelines and clients via presigned URLs; lifecycle & virus-scan results. | S3 is durable, scalable, and cost-effective; presigned uploads offload bandwidth; Lambda-based scanning pipeline can be used asynchronously. |
| Search & Embeddings | Hybrid: OpenSearch (for keyword/structured search) + Vector DB (Pinecone, Milvus, or Amazon OpenSearch vector plugin) for embeddings. Use a managed embedding service or produce embeddings via dedicated model instances and store vectors in vector DB. | Text and semantic search across conversation history and attachments; embedding generation and vector search. | OpenSearch handles traditional search and filters; vector DB supports semantic similarity at scale. Separating concerns lets us scale search independently. |
| Event Bus / Streaming & Analytics | Apache Kafka (Amazon MSK) for high-throughput durable logs; Kafka Connect to data warehouse (Snowflake/BigQuery) and stream processors (Flink/Kafka Streams). | Durable eventing for audit logs, billing events, metrics, and asynchronous jobs (indexing, notifications, cost aggregation). | Kafka scales well for hundreds of thousands of events/sec and supports exactly-once processing patterns enabling accurate billing and analytics. |
| Billing & Cost Accounting | Service that consumes Kafka billing events, applies per-model cost rates, stores detailed line-items in PostgreSQL and aggregates in OLAP (BigQuery/Snowflake) for reports. Use serverless ETL for daily aggregation. | Accurate per-request cost tracking, aggregation to user billing, tier enforcement, and exports to billing system. | Event-driven accounting keeps near-real-time cost tracking for each request; OLAP enables fast analytics and admin dashboards. |
| Admin Dashboard & Observability | Prometheus + Grafana for metrics; Jaeger for distributed traces; ELK/OpenSearch for logs; Grafana dashboards with role-based access. Admin frontend built on React + RBAC. | System metrics, alerts, per-user/tier usage, cost dashboards, model-health, and structured logs. | Standard observability stack with tracing allows operators to debug and monitor the system and analyze cost/usage. |
| Security & Compliance | AWS KMS for secrets, IAM for infra access control, Vault (HashiCorp) for application secrets if self-hosting; S3 encryption and TLS everywhere. | Access controls, secret management, key rotation, audit logs, data deletion/export endpoints, encryption at rest/in transit, and DLP for file scanning. | Managed key stores and RBAC minimize operational overhead while meeting compliance. |
| Store | Type | Justification |
|---|---|---|
| Amazon Aurora PostgreSQL | sql | Provides immediate consistency, transactions, and durability for conversation history and billing line-items. Aurora supports multi-AZ, read-replicas, partitioning/sharding, and scales to high throughput with proper schema design. |
| Redis (ElastiCache Clustered) | cache | Low-latency data for rate-limiting, session/presence mapping, token-bucket counters, and ephemeral caching of recent conversation context for fast reads. |
| Amazon S3 | blob | Durable, cost-efficient object storage for user uploaded files and model artifacts. Supports presigned uploads and lifecycle policies; integrates with object-lambda for scanning/transformations. |
| Apache Kafka (Amazon MSK) | queue | Durable, high-throughput event stream for message events, billing events, and indexing streams. Enables decoupled asynchronous processing (search indexing, billing aggregation, analytics). |
| OpenSearch (Elastic) + Vector DB (Pinecone or Milvus) | search | OpenSearch for keyword/structured search and filters; vector DB for semantic similarity search on embeddings. Scales independently and supports fast retrieval of relevant conversation segments. |
| OLAP Warehouse (BigQuery or Snowflake) | olap | For cost/billing analytics and historical reporting at scale. Stores aggregated billing/usage records and enables fast analytics for admin dashboards and finance exports. |
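The Redis row above relies on token-bucket counters for rate limiting; a minimal sketch of one way to implement that with go-redis and a small Lua script follows. The key prefix, refill/burst parameters, and one-hour TTL are assumptions for illustration, not tuned values.

```go
// Minimal sketch of a Redis-backed token bucket for per-user rate limiting.
// Key layout, policy numbers, and TTL are illustrative assumptions.
package ratelimit

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// The Lua script keeps refill + consume atomic, so concurrent gateways
// operating on the same key cannot double-spend tokens.
var tokenBucket = redis.NewScript(`
local key   = KEYS[1]
local rate  = tonumber(ARGV[1]) -- tokens refilled per second
local burst = tonumber(ARGV[2]) -- bucket capacity
local now   = tonumber(ARGV[3]) -- caller-supplied clock, seconds
local cost  = tonumber(ARGV[4])
local s = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(s[1]) or burst
local ts     = tonumber(s[2]) or now
tokens = math.min(burst, tokens + (now - ts) * rate)
local allowed = 0
if tokens >= cost then tokens = tokens - cost; allowed = 1 end
redis.call('HSET', key, 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', key, 3600)
return allowed
`)

// Allow reports whether userID may spend `cost` requests under a rate/burst policy.
func Allow(ctx context.Context, rdb *redis.Client, userID string, rate, burst float64, cost int) (bool, error) {
	now := float64(time.Now().UnixMilli()) / 1000.0
	res, err := tokenBucket.Run(ctx, rdb, []string{"rl:" + userID}, rate, burst, now, cost).Int()
	return res == 1, err
}
```

Because the whole check runs server-side in one script, the counter stays consistent even when several gateway instances enforce the same user's limit.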
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/auth/login | Authenticate user (email/password or OAuth token exchange). Returns access and refresh tokens and initializes session and rate-limit metadata. |
| GET | /api/v1/conversations | List user conversations with pagination, sorting, and filters (by tag, model, shared). Served from read replicas, kept consistent via write-through cache invalidation. |
| POST | /api/v1/conversations | Create a new conversation; specify model, system prompt, privacy/sharing options, and optional attachments. |
| POST | /api/v1/conversations/{conversationId}/messages | Send a new user message to a conversation. Persists the message, triggers inference via the LLM Router, and returns an inference ID. Supports multimodal references (file IDs). |
| WS | /api/v1/conversations/{conversationId}/stream | WebSocket endpoint for real-time streaming of LLM responses (token-by-token) and message events. Supports client acknowledgements, reconnect/resume semantics, and server pings (a streaming-handler sketch follows this table). |
| POST | /api/v1/files | Request a presigned URL for upload, or register upload metadata. After upload, the file is scanned asynchronously; returns a file ID for model input. |
| GET | /api/v1/models | List available models with capabilities, estimated cost per token, latency SLAs, and fallback rules. |
| POST | /api/v1/conversations/{conversationId}/share | Create a public, read-only shareable link with optional expiry and password-protection settings. |
| GET | /api/v1/admin/metrics | Admin-only metrics endpoint aggregated from Prometheus/OLAP for usage, costs, model health, and alerts. Requires admin RBAC. |
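To illustrate the streaming endpoint above, the sketch below shows a Go handler in the style of the nhooyr/websocket-based Streaming Gateway: it accepts the upgrade and relays tokens from an inference channel to the client. The frame shape, the tokens channel, and the write timeout are assumptions, not the gateway's actual protocol.

```go
// Minimal sketch: relay tokens from an inference stream to the client over WebSocket.
// The tokens channel, frame format, and timeouts are illustrative assumptions.
package streaming

import (
	"context"
	"encoding/json"
	"net/http"
	"time"

	"nhooyr.io/websocket"
)

type frame struct {
	Type  string `json:"type"`            // "delta" for a token chunk, "done" when complete
	Delta string `json:"delta,omitempty"`
}

// StreamHandler upgrades the connection and forwards tokens until the stream ends.
func StreamHandler(tokens <-chan string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		conn, err := websocket.Accept(w, r, nil)
		if err != nil {
			return
		}
		defer conn.Close(websocket.StatusNormalClosure, "stream complete")

		ctx := r.Context()
		for tok := range tokens {
			msg, _ := json.Marshal(frame{Type: "delta", Delta: tok})
			writeCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
			err := conn.Write(writeCtx, websocket.MessageText, msg)
			cancel()
			if err != nil { // client gone or backpressure timeout: stop streaming
				return
			}
		}
		final, _ := json.Marshal(frame{Type: "done"})
		_ = conn.Write(ctx, websocket.MessageText, final)
	}
}
```

Reconnect/resume and acknowledgement handling are omitted here; in practice the gateway would tag each delta with a sequence number so a reconnecting client can resume from the last acknowledged token.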
- Multi-region deployment with region-local clusters (API Gateway + EKS + Aurora in each region, or cross-region read-only replicas depending on data residency).
- Horizontal scaling: stateless frontends and the LLM Router scale via Kubernetes HPA/KEDA on CPU, RPS, or queue length; WebSocket workers scale horizontally behind a managed API Gateway or ALB that handles connection scaling.
- Redis runs as a sharded ElastiCache cluster; Aurora scales by sharding conversations by tenant or by hashing conversationId across writer clusters for write throughput (see the routing sketch below this list).
- Kafka (MSK) partitions are sized for throughput, with consumer groups for parallel processing.
- Autoscaling GPU pools (Karpenter/Cluster Autoscaler) serve internal models, with spot instances reducing cost for non-critical capacity.
- Edge caching (CloudFront) covers static assets and read-heavy metadata; vector DB clusters for search and embeddings scale independently.
- For global throughput, traffic is steered to the nearest region with failover, using an active-passive or active-active DB strategy where legal/regulatory constraints permit.
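The sharding point above can be illustrated with a small routing helper that hashes conversationId onto a fixed set of Aurora writer clusters; the shard count and DSNs are placeholders.

```go
// Minimal sketch: route a conversation to one of N Aurora writer clusters by
// hashing its ID. Shard count and DSNs are illustrative assumptions.
package sharding

import "hash/fnv"

// writerDSNs would be loaded from config; four shards here purely for illustration.
var writerDSNs = []string{
	"postgres://writer-0.internal/chat",
	"postgres://writer-1.internal/chat",
	"postgres://writer-2.internal/chat",
	"postgres://writer-3.internal/chat",
}

// ShardFor deterministically maps a conversationId to a writer cluster, so all
// messages in a conversation land on the same shard and appends stay local.
func ShardFor(conversationID string) string {
	h := fnv.New32a()
	h.Write([]byte(conversationID))
	return writerDSNs[int(h.Sum32())%len(writerDSNs)]
}
```

A production layout would more likely use a directory table or consistent hashing so shards can be added without remapping existing conversations.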
The system is a globally distributed, multi-tenant conversational AI web application supporting authenticated users, multi-turn threads, token-streaming responses, file/multimodal inputs, conversation search, sharing links, and an admin cost/usage dashboard. It is designed for 20M DAU and ~500M messages/day with strict latency requirements (TTFT < 500ms) and high concurrency (>=100K concurrent WebSocket connections per region). The architecture separates the latency-critical request/streaming path (WebSocket Gateway + Orchestrator + LLM adapters) from durable storage, indexing, analytics, and billing pipelines. Conversations are stored in a strongly consistent SQL store, while search and analytics are powered by specialized systems. LLM backend failures are handled via circuit breakers, hedged requests, and provider failover with per-token streaming preserved.
| Component | Technology | Responsibility | Justification |
|---|---|---|---|
| Web Client (Browser) + Mobile | Next.js (React) + TypeScript; Markdown-it + DOMPurify; WebSocket client | UI for chat, conversation list, model selection, file upload, markdown rendering, and real-time streaming display via WebSocket/SSE fallback. | Next.js supports SSR/SPA, fast iteration, and edge-friendly deployments. Markdown-it is extensible for code blocks/tables; DOMPurify prevents XSS. WebSocket enables low-latency bidirectional streaming. |
| Global DNS + CDN/WAF | Cloudflare (DNS, CDN, WAF, Bot Management) | Global traffic steering, TLS termination at edge, DDoS protection, caching of static assets, WAF rules, and bot mitigation. | Strong global presence reduces latency and protects origin. Bot/DDoS controls are critical at 20M DAU. |
| API Gateway / Edge | Envoy Gateway (Kubernetes) + Cloudflare origin rules | Routing for REST APIs and WebSocket upgrades, auth pre-checks, request shaping, and regional failover. | Envoy provides high-performance L7 routing, retries, timeouts, and observability. Works well with WebSockets and service mesh patterns. |
| Auth & Session Service | Auth0 (OIDC) + internal Session API using JWT (short-lived) + Redis for session revocation | User signup/login, OAuth/OIDC, session issuance, refresh, MFA support, and entitlement lookup for tiers. | Auth0 reduces security risk and time-to-market. Short-lived JWT minimizes DB calls; Redis enables immediate revocation/ban. |
| WebSocket Gateway (Streaming Gateway) | Kubernetes-deployed Node.js (uWebSockets.js) or Go (fasthttp + websocket) service; Redis Cluster for ephemeral connection metadata | Manages WebSocket connections, fan-out of token streams, backpressure, connection state, and regional scaling to >=100K concurrent connections. | A specialized gateway isolates long-lived connections from general API traffic. Go and uWebSockets.js handle high concurrency efficiently; Redis supports lightweight presence/state without coupling to the DB. |
| Chat Orchestrator Service | Go microservice (gRPC internally) with circuit breakers (hystrix-like) and retries (Envoy + app-level) | Core chat workflow: validate quotas, build context, call LLM backends, stream tokens, handle tool/file references, persist messages atomically, and emit usage/cost events. | Go offers predictable latency and high throughput. Central orchestration simplifies consistency and billing correctness while keeping streaming path tight. |
| LLM Provider Adapter Layer | Internal service/library used by Orchestrator; supports OpenAI-compatible streaming + Bedrock + Anthropic; optional self-hosted vLLM on GPU nodes | Uniform interface for multiple model providers (e.g., OpenAI, Anthropic, AWS Bedrock, self-hosted vLLM), token streaming normalization, automatic failover/hedging, and provider-specific auth. | Decouples product from provider APIs and enables rapid switching, routing, and fallback strategies to meet availability/latency constraints. |
| Conversation Service | Java/Kotlin (Spring Boot) or Go; PostgreSQL-compatible distributed SQL (YugabyteDB) | CRUD for conversations, messages, metadata (title, tags, folders), share settings, and immediate-consistency reads. | Distributed SQL provides strong consistency with horizontal scaling and multi-region resilience. A dedicated service encapsulates schema and access patterns. |
| Search/Indexing Service | Elasticsearch (managed, e.g., Elastic Cloud) + Kafka Connect for indexing pipeline | Index conversation/message text and metadata for fast search, filtering, and ranking; supports near-real-time updates. | Elasticsearch is well-suited for full-text search and faceting at large scale. Kafka-based ingestion decouples indexing from the write path. |
| File Ingestion & Multimodal Pipeline | S3-compatible object storage (Amazon S3) + CloudFront signed URLs; ClamAV scanning; Apache Tika for parsing; optional GPU service for vision embeddings | Handle uploads, virus/malware scanning, document parsing (PDF/DOCX), image preprocessing, OCR, embedding generation, and secure storage/links. | Object storage is the standard for large binary data. Scanning and parsing protect the platform. Signed URLs reduce origin load and limit unauthorized access. |
| Rate Limiting & Quota Service | Redis Cluster (token bucket/leaky bucket) + internal Quota API; optional Envoy global rate limit service | Per-user/per-tier rate limits (RPS), token quotas, daily/monthly usage, and enforcement in the hot path. | Redis offers sub-millisecond counters suitable for the 500ms TTFT constraint. Central policy keeps enforcement consistent across gateways. |
| Usage/Cost Metering Service | Kafka + stream processing (Apache Flink) + ClickHouse for analytics + PostgreSQL ledger tables | Compute accurate costs per request (tokens in/out, model pricing, file processing costs), generate billing-grade ledgers, and expose aggregates to admin/user dashboards (a cost-calculation sketch follows this table). | Flink enables real-time aggregation while a PostgreSQL ledger ensures correctness and auditability. ClickHouse supports high-QPS analytics for dashboards. |
| Sharing Service | Go service + PostgreSQL (YugabyteDB) + CDN caching for public read views | Create public share links, snapshot/redaction, access control, and view tracking. | Share links require durable mapping and permissions. CDN accelerates read-heavy public access. |
| Admin & Observability Stack | Prometheus + Grafana; OpenTelemetry + Tempo/Jaeger; Loki; Sentry; Argo Rollouts for canary | Monitoring, tracing, logging, incident response, and admin dashboard for usage/cost/latency/provider health. | Standard cloud-native observability with strong ecosystem; canary reduces risk when changing critical streaming paths. |
| Message Bus / Event Backbone | Apache Kafka (managed, e.g., Confluent Cloud) | Decouple write path from indexing, analytics, notifications, and offline processing. | Kafka scales to very high throughput (500M messages/day) and enables replayable event streams for multiple consumers. |
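As a sketch of the per-request cost computation performed by the Usage/Cost Metering Service above, the arithmetic can be as simple as the following; the model names and per-1K-token prices are placeholders, not real provider rates.

```go
// Minimal sketch: turn token counts from a completed request into a billing
// line item. Prices are placeholder values per 1K tokens, not real rates.
package metering

type ModelPrice struct {
	InputPer1K  float64 // USD per 1,000 prompt tokens
	OutputPer1K float64 // USD per 1,000 completion tokens
}

// prices would normally come from a versioned pricing table; values are made up.
var prices = map[string]ModelPrice{
	"gpt-large":    {InputPer1K: 0.0100, OutputPer1K: 0.0300},
	"claude-large": {InputPer1K: 0.0080, OutputPer1K: 0.0240},
}

type UsageEvent struct {
	RequestID    string
	Model        string
	InputTokens  int
	OutputTokens int
}

// CostUSD computes the line-item amount written to the ledger and later
// aggregated in ClickHouse for dashboards.
func CostUSD(e UsageEvent) float64 {
	p, ok := prices[e.Model]
	if !ok {
		return 0 // unknown model: flag for reconciliation rather than guess
	}
	return float64(e.InputTokens)/1000.0*p.InputPer1K +
		float64(e.OutputTokens)/1000.0*p.OutputPer1K
}
```

The resulting amount is what lands in the PostgreSQL ledger and, in aggregate, powers the ClickHouse dashboards.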
| Store | Type | Justification |
|---|---|---|
| YugabyteDB (PostgreSQL-compatible distributed SQL) | sql | Strong consistency and durability with horizontal scaling and multi-region replication; ideal for immediately consistent conversation history and share-link metadata. |
| Redis Cluster | cache | Sub-millisecond counters for rate limiting/quota enforcement; session revocation; ephemeral WebSocket connection metadata. |
| Apache Kafka | queue | High-throughput event backbone to decouple indexing, analytics, metering, and async file processing from the latency-critical chat path. |
| Amazon S3 (Object Storage) | blob | Durable, scalable storage for user uploads (images/documents) and generated artifacts; integrates with signed URLs and lifecycle policies. |
| Elasticsearch | search | Full-text search with faceting for conversation history at scale, supporting near-real-time indexing from Kafka. |
| ClickHouse | olap | High-performance OLAP for admin/user dashboards on usage, costs, latency, and provider performance. |
| PostgreSQL (Billing Ledger) | sql | Billing-grade immutable ledger entries require strict constraints, transactions, and auditability; kept separate from high-volume chat OLTP. |
| Method | Endpoint | Description |
|---|---|---|
| POST | /v1/auth/session | Exchange OIDC code for an application session (JWT/refresh); return user profile and tier entitlements. |
| POST | /v1/conversations | Create a new conversation (optionally with selected model, system prompt, folder/tags). |
| GET | /v1/conversations/{conversationId} | Fetch conversation metadata and messages with strong consistency (latest turns). |
| POST | /v1/conversations/{conversationId}/messages | Send a user message (non-streaming fallback) and receive the assistant response when complete. |
| WS | /v1/ws/chat | WebSocket endpoint for streaming chat. Client sends message frames; server streams tokens/events (delta tokens, tool/file status, final). |
| POST | /v1/files | Request an upload session; returns signed upload URL(s) and fileId(s). |
| GET | /v1/files/{fileId} | Fetch file metadata and processing status (scanned/parsed/ready). |
| GET | /v1/search | Search conversations/messages by query, filters (date, model, tags), and pagination. |
| POST | /v1/share | Create a public share link for a conversation snapshot with optional redaction rules (a share-link sketch follows this table). |
| GET | /v1/share/{shareId} | Retrieve a shared conversation snapshot for public viewing (read-only). |
| GET | /v1/usage | Return current usage, remaining quotas, and recent cost estimates for the authenticated user. |
| GET | /v1/admin/metrics | Admin-only: aggregated metrics (DAU, messages, token volume, costs, provider error rates/latency). |
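To make the share-link endpoints concrete, here is a minimal sketch of share-ID generation and snapshot resolution; the Store interface, the 16-byte ID length, and the expiry handling are assumptions.

```go
// Minimal sketch: create an unguessable share ID that maps to an immutable
// conversation snapshot. Store interface and expiry policy are assumptions.
package sharing

import (
	"context"
	"crypto/rand"
	"encoding/base64"
	"errors"
	"time"
)

type Snapshot struct {
	ConversationID string
	RenderedJSON   []byte    // redacted, read-only copy taken at share time
	ExpiresAt      time.Time // zero value means no expiry
}

type Store interface {
	Put(ctx context.Context, shareID string, snap Snapshot) error
	Get(ctx context.Context, shareID string) (Snapshot, error)
}

// CreateShare stores the snapshot under a random, URL-safe ID so the public
// link does not leak the internal conversationId.
func CreateShare(ctx context.Context, s Store, snap Snapshot) (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	shareID := base64.RawURLEncoding.EncodeToString(buf)
	return shareID, s.Put(ctx, shareID, snap)
}

// ResolveShare enforces expiry before the CDN-cached public view is served.
func ResolveShare(ctx context.Context, s Store, shareID string) (Snapshot, error) {
	snap, err := s.Get(ctx, shareID)
	if err != nil {
		return Snapshot{}, err
	}
	if !snap.ExpiresAt.IsZero() && time.Now().After(snap.ExpiresAt) {
		return Snapshot{}, errors.New("share link expired")
	}
	return snap, nil
}
```

Using a random, URL-safe ID keeps internal identifiers out of public URLs, and serving a snapshot copy means later edits to the live conversation do not change what the shared link displays.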
- Global active-active deployment across at least three regions, with GeoDNS steering to the nearest healthy region.
- WebSocket Gateways scale horizontally behind Envoy with connection-aware load balancing; services stay stateless, keeping only ephemeral connection metadata in Redis.
- The hot path (quota check, context fetch, LLM streaming) is optimized for TTFT by (1) precomputing and caching conversation summaries, (2) limiting the context window with rolling summarization, (3) parallelizing context and file-metadata fetches, and (4) issuing hedged requests to LLM providers after a short delay when p95 latency rises (see the sketch below this list).
- Conversation history writes are strongly consistent on distributed SQL with synchronous replication and tuned transaction boundaries: persist the user message immediately, persist the assistant message incrementally with periodic checkpoints, then finalize.
- Kafka decouples indexing/analytics from the write path and supports replay; for 500M messages/day, partition topics by conversationId hash and use separate consumer groups for the Search and Metering pipelines.
- Elasticsearch scales by sharding on tenant/time; ClickHouse scales via distributed tables partitioned by date/model.
- Rate limiting uses Redis Cluster with keys hashed by userId to spread load; per-tier policies are cached at the gateways.
- LLM adapters implement circuit breakers, per-provider bulkheads, and region-aware routing; self-hosted vLLM provides a fallback capacity pool for reliability and cost control.
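The hedged-request tactic listed above can be sketched as follows: the primary provider gets a head start, and a backup request is launched only if the primary has not responded within the hedge delay (or fails outright). The Provider interface and delay handling are assumptions; in the real streaming path the race would be decided on the first token rather than the full completion.

```go
// Minimal sketch of request hedging across two LLM providers: start the
// primary, and only if it hasn't answered within hedgeDelay, start the backup;
// whichever responds first wins. Interfaces and delays are assumptions.
package llmrouter

import (
	"context"
	"time"
)

// Provider abstracts one backend (OpenAI, Anthropic, Bedrock, self-hosted vLLM).
type Provider interface {
	Complete(ctx context.Context, prompt string) (string, error)
}

type result struct {
	text string
	err  error
}

// HedgedComplete returns the first successful response; the losing call is cancelled.
func HedgedComplete(ctx context.Context, primary, backup Provider, prompt string, hedgeDelay time.Duration) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // stops whichever call is still in flight

	results := make(chan result, 2)
	call := func(p Provider) {
		text, err := p.Complete(ctx, prompt)
		results <- result{text, err}
	}

	go call(primary)
	hedge := time.NewTimer(hedgeDelay)
	defer hedge.Stop()

	var lastErr error
	started := 1
	for finished := 0; finished < started; {
		select {
		case <-hedge.C:
			if started == 1 { // primary is slow: fire the backup request
				started++
				go call(backup)
			}
		case r := <-results:
			finished++
			if r.err == nil {
				return r.text, nil
			}
			lastErr = r.err
			if started == 1 { // primary failed outright: hedge immediately
				started++
				go call(backup)
			}
		}
	}
	return "", lastErr
}
```

Per-provider circuit breakers would sit in front of this: a provider whose breaker is open is simply skipped when choosing the primary and backup for a request.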