Scaling
Scale your self-hosted Prisme.ai platform to meet growing demands
As your organization’s usage of Prisme.ai grows, you’ll need to scale your self-hosted platform to maintain performance and reliability. This guide provides strategies and best practices for scaling different components of your Prisme.ai deployment.
Scaling Approaches
Horizontal Scaling
Horizontal scaling involves adding more instances (pods, nodes) to distribute load:
Benefits:
- Better fault tolerance and availability
- Linear capacity scaling
- No downtime during scaling operations
Considerations:
- Requires stateless application design
- More complex networking
- Service discovery requirements
Vertical Scaling
Vertical scaling involves increasing resources (CPU, memory) of existing instances:
Benefits:
- Simpler to implement
- Better for stateful components
- Can address specific bottlenecks
Considerations:
- Limited by maximum resource sizes
- May require downtime during scaling
- Cost efficiency diminishes at larger scales
When to Scale
Performance Indicators
Monitor these key metrics to identify scaling needs:
- API response times exceeding thresholds
- CPU utilization consistently above 70%
- Memory utilization consistently above 80%
- Request queue depth increasing
- Database query times growing
Growth Indicators
Business metrics that suggest scaling requirements:
- Increasing number of users
- Growing document count
- More concurrent sessions
- Higher query volume
- Additional knowledge bases
Preventative Scaling
Proactive scaling for anticipated demands:
- Before major rollouts
- Ahead of seasonal peaks
- Prior to marketing campaigns
- In advance of organizational growth
Recovery Objectives
Scaling to meet resilience targets:
- Redundancy requirements
- High availability goals
- Load distribution needs
- Geographic distribution objectives
Scaling Core Components
Assess Current Usage
Gather metrics on current performance and resource utilization:
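For example, with the Kubernetes metrics-server installed you can inspect current utilization (the `prismeai` namespace is illustrative):

```shell
# Per-pod CPU and memory usage
kubectl top pods -n prismeai

# Node-level utilization across the cluster
kubectl top nodes
```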
Configure HPA (Horizontal Pod Autoscaler)
Set up automatic scaling based on metrics:
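A minimal HPA manifest might look like the following; the deployment name `prismeai-api-gateway` and the thresholds are illustrative and should be adapted to your services:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prismeai-api-gateway
  namespace: prismeai
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prismeai-api-gateway
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```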
Apply the configuration:
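For example, assuming the manifest is saved as `hpa.yaml`:

```shell
kubectl apply -f hpa.yaml -n prismeai
```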
Update Helm Values
Alternatively, configure scaling parameters in your Helm values:
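The exact value keys depend on your chart version; a typical shape (illustrative) is:

```yaml
# values.yaml — illustrative structure, check your chart's values.yaml
api-gateway:
  replicaCount: 3
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
```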
Apply the configuration:
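For example (the release name `prismeai` and chart reference are illustrative):

```shell
helm upgrade prismeai prismeai/prismeai-core \
  -n prismeai -f values.yaml
```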
Set Resource Requests and Limits
Define appropriate resource allocations:
Set resource requests and limits based on observed usage patterns. Start conservative and adjust based on monitoring data.
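A conservative starting point for a stateless service might be (tune against your monitoring data):

```yaml
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "2"
    memory: 2Gi
```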
Configure Pod Disruption Budgets
Ensure high availability during scaling:
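A PodDisruptionBudget keeps a minimum number of replicas running during voluntary disruptions such as node drains; the name and label selector below are illustrative:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prismeai-api-gateway-pdb
  namespace: prismeai
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: prismeai-api-gateway
```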
Scale Individual Products
Each Prisme.ai product module can be scaled independently:
Using Helm values:
Apply the configuration:
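As an illustration (the per-product value keys below are hypothetical; check your chart's values.yaml for the exact names):

```yaml
# values.yaml — hypothetical per-product replica counts
ai-knowledge:
  replicaCount: 4
ai-securechat:
  replicaCount: 3
ai-store:
  replicaCount: 2
```

Then apply with a standard upgrade (release and chart names illustrative):

```shell
helm upgrade prismeai prismeai/prismeai-core -n prismeai -f values.yaml
```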
Scale Based on Product Usage
Different products may require different scaling approaches:
AI Knowledge
- Scale for document processing load
- Increase resources for large knowledge bases
- Tune based on retrieval volume
AI SecureChat
- Scale based on concurrent user sessions
- Provision for message throughput
- Consider message storage requirements
AI Store
- Scale for catalog browsing traffic
- Provision for agent deployment operations
- Consider metadata storage needs
AI Builder
- Scale for concurrent development sessions
- Increase resources for complex builds
- Consider testing environment requirements
Scale Ingress Controller
Ensure your ingress controller can handle increased traffic:
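If you run the community ingress-nginx controller, its Helm chart exposes replica and autoscaling settings, for example:

```yaml
# ingress-nginx chart values (illustrative thresholds)
controller:
  replicaCount: 3
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 8
    targetCPUUtilizationPercentage: 70
```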
Configure Connection Pooling
Optimize connection handling for scaled deployments:
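With ingress-nginx, keepalive behavior is tuned through its ConfigMap; the values below are illustrative starting points:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  keep-alive: "75"
  keep-alive-requests: "1000"
  upstream-keepalive-connections: "320"
  upstream-keepalive-timeout: "60"
```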
Implement Caching
Use Redis to cache frequently accessed data, such as session state and hot lookup results, to reduce load on primary data stores.
Scaling Database Components
Implement Replica Sets
Deploy MongoDB with replica sets for high availability and read scaling:
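If you deploy MongoDB with the Bitnami Helm chart, for example, a three-member replica set can be configured like this (sizes illustrative):

```yaml
# bitnami/mongodb chart values
architecture: replicaset
replicaCount: 3
persistence:
  size: 100Gi
```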
Configure Sharding
For very large deployments, implement MongoDB sharding:
- Set up config servers (typically 3 nodes)
- Deploy shard servers (multiple replica sets)
- Configure mongos routers
- Define shard keys based on data access patterns
Sharding adds complexity and should only be implemented when dataset size exceeds what a single replica set can handle efficiently.
Optimize Indexes
Ensure proper indexes exist for common queries:
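A sketch in mongosh; the collection and field names here are hypothetical, so map them to your actual schema and verify candidates with `explain`:

```javascript
// Compound indexes matching common query patterns (hypothetical names)
db.messages.createIndex({ conversationId: 1, createdAt: -1 })
db.documents.createIndex({ workspaceId: 1, updatedAt: -1 })

// Confirm a query uses the index before relying on it
db.messages.find({ conversationId: "abc" }).explain("executionStats")
```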
Scale MongoDB Resources
Increase resources for MongoDB instances:
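For example, in the MongoDB chart values (sizes illustrative, sized to your working set):

```yaml
resources:
  requests:
    cpu: "2"
    memory: 8Gi
  limits:
    cpu: "4"
    memory: 16Gi
```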
Scale Cluster Size
Add more nodes to your Elasticsearch/OpenSearch cluster:
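With the official OpenSearch Helm chart, for instance, the node count and per-node resources are set in the values file (numbers illustrative):

```yaml
# opensearch chart values
replicas: 5
resources:
  requests:
    cpu: "2"
    memory: 8Gi
  limits:
    memory: 8Gi
```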
Configure Node Roles
Optimize cluster by separating node roles:
- Master nodes: Cluster management
- Data nodes: Store and search data
- Coordinating nodes: Handle queries and distribute load
- Ingest nodes: Pre-process documents
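Roles are assigned per node in `elasticsearch.yml` / `opensearch.yml`; a sketch of the per-role settings:

```yaml
# One entry per node type (uncomment the role for that node)
node.roles: [ master ]        # dedicated master node
# node.roles: [ data ]        # data node
# node.roles: [ ingest ]      # ingest node
# node.roles: [ ]             # empty list = coordinating-only node
```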
Optimize Index Settings
Configure index settings for optimal performance:
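For example, relaxing the refresh interval and replica count on a busy index (endpoint and index name illustrative):

```shell
curl -X PUT "https://search.example.internal:9200/documents/_settings" \
  -H "Content-Type: application/json" -d '
{
  "index": {
    "number_of_replicas": 1,
    "refresh_interval": "30s"
  }
}'
```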
Implement Index Lifecycle Management
Set up ILM policies for managing growing indices:
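A minimal Elasticsearch ILM policy that rolls over hot indices and deletes old data might look like this (thresholds illustrative; OpenSearch offers the equivalent via its ISM plugin):

```shell
curl -X PUT "https://search.example.internal:9200/_ilm/policy/prismeai-events" \
  -H "Content-Type: application/json" -d '
{
  "policy": {
    "phases": {
      "hot": {
        "actions": { "rollover": { "max_size": "50gb", "max_age": "7d" } }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}'
```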
Implement Redis Cluster
Deploy Redis in cluster mode for horizontal scaling:
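With the Bitnami redis-cluster chart, for example, a six-node cluster (three primaries, one replica each) is configured as:

```yaml
# bitnami/redis-cluster chart values (sizes illustrative)
cluster:
  nodes: 6
  replicas: 1
persistence:
  size: 20Gi
```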
Optimize Redis Configuration
Tune Redis settings for performance:
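Typical starting points in `redis.conf` (values illustrative; pick an eviction policy that matches how you use Redis):

```ini
maxmemory 4gb
maxmemory-policy allkeys-lru
appendonly yes
appendfsync everysec
tcp-keepalive 60
```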
Monitor and Scale
Set up monitoring to detect Redis bottlenecks:
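For a quick check from the command line (hostname illustrative):

```shell
# Throughput and keyspace activity
redis-cli -h redis.prismeai.svc INFO stats | grep -E "instantaneous_ops_per_sec|keyspace"

# Memory pressure relative to maxmemory
redis-cli -h redis.prismeai.svc INFO memory | grep -E "used_memory_human|maxmemory_human"

# Continuous latency sampling
redis-cli -h redis.prismeai.svc --latency
```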
Scaling Storage
Scale Object Storage
S3 or compatible object storage typically scales automatically, but ensure proper configuration:
Performance Options
- Enable transfer acceleration
- Use multipart uploads for large files
- Implement appropriate file organization
- Consider regional deployments for global access
Cost Optimization
- Implement lifecycle policies
- Use appropriate storage classes
- Enable compression where applicable
- Monitor usage patterns
Scale Persistent Volumes
Adjust storage for stateful components:
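For example, expanding an existing PVC in place (names illustrative; this only works when the StorageClass sets `allowVolumeExpansion: true`):

```shell
kubectl patch pvc data-mongodb-0 -n prismeai \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
```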
Not all storage classes support volume expansion. Check your cloud provider or storage system capabilities.
Scaling Infrastructure with Terraform
Scale Kubernetes Nodes
Adjust your node groups in Terraform:
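An AWS EKS example; the resource names and instance type are illustrative:

```hcl
resource "aws_eks_node_group" "prismeai" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "prismeai-workers"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids
  instance_types  = ["m5.xlarge"]

  scaling_config {
    desired_size = 6   # raised from an initial 3
    min_size     = 3
    max_size     = 12
  }
}
```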
Configure Node Autoscaling
Set up cluster autoscaler for automatic node provisioning:
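If you install the Kubernetes cluster-autoscaler via its Helm chart, the key values look roughly like this (AWS example; names illustrative):

```yaml
# cluster-autoscaler chart values
autoDiscovery:
  clusterName: prismeai-cluster
awsRegion: eu-west-1
extraArgs:
  balance-similar-node-groups: true
  scale-down-utilization-threshold: 0.5
```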
Implement Regional Deployments
For global deployments, consider multi-region architecture:
- Deploy Prisme.ai in multiple regions
- Use global load balancing (e.g., Route53, Azure Traffic Manager)
- Replicate databases across regions
- Synchronize object storage
Monitoring for Scaling Decisions
Key Metrics to Watch
Core metrics that indicate scaling needs:
- API response time > 200ms
- CPU utilization > 70% sustained
- Memory usage > 80% sustained
- Queue depth increasing
- Connection timeouts occurring
Monitoring Tools
Tools to implement for scaling insights:
- Prometheus + Grafana
- Kubernetes metrics server
- Custom dashboards for Prisme.ai services
- Database-specific monitoring
Alert Thresholds
Set up alerts to trigger scaling actions:
- Warning: 60% resource utilization
- Critical: 80% resource utilization
- Performance degradation > 50%
- Error rate increase > 10%
Scaling Dashboards
Create dashboards focused on scaling metrics:
- Resource usage trends
- Traffic patterns
- Database performance
- Storage growth rates
Scaling Best Practices
Implement Gradual Scaling
Scale resources incrementally rather than making large changes at once:
- Increase replicas by 50-100% at a time
- Monitor effects before further scaling
- Allow system to stabilize between changes
- Document performance impacts
Test Before Production
Validate scaling changes in non-production environments:
- Use load testing tools (JMeter, k6, Locust)
- Simulate real-world usage patterns
- Test both scaling up and scaling down
- Verify application behavior during scaling events
Automate Where Possible
Use automation to handle routine scaling:
- Implement Horizontal Pod Autoscalers (HPA)
- Configure cluster autoscaling
- Use scheduled scaling for predictable patterns
- Set up anomaly detection for unexpected loads
Document Scaling Procedures
Maintain clear documentation for scaling operations:
- Standard operating procedures
- Emergency scaling runbooks
- Performance baselines
- Historical scaling decisions and outcomes