Docker & Kubernetes Deployment
High Availability on Kubernetes
Designing and deploying Prisme.ai for high availability and fault tolerance on Kubernetes clusters.
Ensuring high availability (HA) is crucial for maintaining uptime, minimizing service disruption, and supporting critical workloads in production. This guide explains how to design, deploy, and operate Prisme.ai with HA in mind on Kubernetes.
Why High Availability Matters
A high-availability deployment ensures:
- No single point of failure
- Redundant services distributed across zones or regions
- Automated failover and recovery
- Increased uptime for business-critical use cases
Key HA Components
Kubernetes Cluster Architecture
Kubernetes Cluster Architecture
- Use multi-zone clusters: Spread nodes across at least 2 availability zones.
- Control Plane Resilience: Managed Kubernetes services like EKS, GKE, AKS provide HA for control planes by default.
- Auto-scaling enabled: Ensure node groups scale with demand to avoid overloads.
Microservice Redundancy
Microservice Redundancy
- Replicas: Deploy multiple replicas of core services:
- API Gateway: 2+
- Runtime: 3+
- Console/Studio: 2+
- Pages: 2+
- Use Kubernetes Deployments with readiness and liveness probes.
Load Balancing & Ingress
Load Balancing & Ingress
- Ingress controllers: Use cloud-native ingress controllers (e.g., ALB, NGINX, Istio) with health checks and retries.
- DNS: Configure DNS to route traffic across healthy zones (e.g., Route 53 with latency-based routing).
Stateless vs Stateful Services
Stateless vs Stateful Services
- Stateless services (API, Console, Pages, Runtime) should be scaled freely.
- Stateful services (MongoDB, Redis, Elasticsearch) must use clustered, HA configurations with persistent storage.
Storage Availability
Storage Availability
- Use replicated storage backends like:
- AWS EBS Multi-AZ (via EFS or FSx)
- Azure Files with zone-redundant storage
- GCP Filestore with regional redundancy
Example: Minimal HA Setup
Resilient Databases
MongoDB Replica Set
- Deploy MongoDB as a 3-node replica set.
- Use StatefulSets and persistent volumes.
- Prefer managed services with automatic failover.
Elasticsearch Cluster
- Use 3 data nodes and 3 master nodes.
- Enable snapshot-based backups.
- Ensure cluster quorum during restarts or scaling.
Redis HA
- Use Redis Sentinel or Redis Cluster.
- Use persistent storage and multi-zone replication.
- Prefer managed Redis services like AWS ElastiCache or Azure Cache for Redis.
Storage Redundancy
- Ensure shared volumes (for uploads or workspace files) are RWX and support replication.
- Use cloud-native backup and snapshot solutions.
Monitoring and Self-Healing
- Use Prometheus and Grafana for live dashboards and alerting.
- Implement Kubernetes PodDisruptionBudgets (PDBs) to prevent all pods from being evicted at once.
- Add Horizontal Pod Autoscalers (HPA) for runtime services.
Example PodDisruptionBudget: