Scaling
Scale your self-hosted Prisme.ai platform to meet growing demands
As your organization’s usage of Prisme.ai grows, you’ll need to scale your self-hosted platform to maintain performance and reliability. This guide provides strategies and best practices for scaling different components of your Prisme.ai deployment.
Scaling Approaches
Horizontal Scaling
Horizontal scaling involves adding more instances (pods, nodes) to distribute load:
Benefits:
- Better fault tolerance and availability
- Linear capacity scaling
- No downtime during scaling operations
Considerations:
- Requires stateless application design
- More complex networking
- Service discovery requirements
Vertical Scaling
Vertical scaling involves increasing resources (CPU, memory) of existing instances:
Benefits:
- Simpler to implement
- Better for stateful components
- Can address specific bottlenecks
Considerations:
- Limited by maximum resource sizes
- May require downtime during scaling
- Cost efficiency diminishes at larger scales
When to Scale
Performance Indicators
Monitor these key metrics to identify scaling needs:
- API response times exceeding thresholds
- CPU utilization consistently above 70%
- Memory utilization consistently above 80%
- Request queue depth increasing
- Database query times growing
Growth Indicators
Business metrics that suggest scaling requirements:
- Increasing number of users
- Growing document count
- More concurrent sessions
- Higher query volume
- Additional knowledge bases
Preventative Scaling
Proactive scaling for anticipated demands:
- Before major rollouts
- Ahead of seasonal peaks
- Prior to marketing campaigns
- In advance of organizational growth
Recovery Objectives
Scaling to meet resilience targets:
- Redundancy requirements
- High availability goals
- Load distribution needs
- Geographic distribution objectives
Scaling Core Components
API & Worker Services
When scaling API and worker services, proper resource management is crucial for optimal performance. First, assess current usage by gathering metrics on performance and resource utilization. Configure Horizontal Pod Autoscaling (HPA) to enable automatic scaling based on CPU and memory metrics, setting appropriate minimum and maximum replica counts.
Update your Helm values to configure scaling parameters, including replica counts and autoscaling settings. Set proper resource requests and limits based on observed usage patterns, starting conservatively and adjusting based on monitoring data. Configure Pod Disruption Budgets to ensure high availability during scaling operations.
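As a sketch, a HorizontalPodAutoscaler like the one below implements the CPU and memory based autoscaling described above; the target deployment name, namespace, thresholds, and replica bounds are illustrative and should be mapped to your actual release names and Helm values.

```yaml
# Illustrative HPA: adjust the target name, namespace, and bounds to your deployment.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prismeai-runtime-hpa
  namespace: core
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prismeai-runtime        # replace with your API or worker deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70%
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # scale out when average memory exceeds 80%
```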
Product Modules
Each Prisme.ai product module can be scaled independently based on specific usage patterns. AI Knowledge requires scaling for document processing load and large knowledge bases, with tuning based on retrieval volume. AI SecureChat needs scaling based on concurrent user sessions and message throughput, considering message storage requirements.
AI Store scaling focuses on catalog browsing traffic and agent deployment operations, with attention to metadata storage needs. AI Builder workspaces require scaling for concurrent development sessions and complex builds, considering testing environment requirements.
Different products may require different scaling approaches based on their specific workloads and usage patterns.
Ingress & Networking
Ensure your ingress controller can handle increased traffic by scaling it appropriately. Configure connection pooling to optimize connection handling for scaled deployments, setting appropriate database pool sizes and Redis client limits. Implement Redis caching for frequently accessed data to reduce load on backend services.
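For example, if you use the community ingress-nginx chart, replica counts and autoscaling are exposed directly in its values; the numbers below are placeholders to adapt to your traffic profile (other ingress controllers expose equivalent settings).

```yaml
# ingress-nginx Helm values (illustrative values, adjust to your traffic)
controller:
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 8
    targetCPUUtilizationPercentage: 70
```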
Resource Optimization
Requests and Limits Configuration
Proper resource configuration is essential for effective scaling. Adjust CPU and memory limits for all core services and applications to accommodate the highest expected usage peaks. Set resource limits above the largest anticipated spikes to ensure services can handle peak loads without being throttled.
Configure resource requests equal to their limits to guarantee that pods are assigned to nodes with sufficient available resources for peak loads. This approach ensures consistent performance during high-traffic periods and prevents resource contention between pods on the same node.
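A minimal sketch of this pattern in a pod spec or Helm values, with requests set equal to limits (the figures are examples, not sizing recommendations):

```yaml
resources:
  requests:
    cpu: "1"         # equal to the limit: the scheduler reserves this capacity on the node
    memory: 2Gi
  limits:
    cpu: "1"
    memory: 2Gi
```

Setting requests equal to limits also places the pod in the Guaranteed QoS class, making it the last candidate for eviction under node pressure.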
Service Crawler Optimization
The crawler service requires specific tuning for optimal performance. The DOWNLOAD_DELAY variable controls the delay between requests and should be adjusted based on target crawl throughput. REQUESTS_QUEUE_POLLING_SIZE determines how many requests are processed simultaneously, while REQUESTS_QUEUE_POLLING_INTERVAL sets the frequency of queue checks.
For typical document processing, such as a 100KB DOCX file containing 50,000 characters, recommended settings include a polling size of 8 requests, a download delay of 0.5 seconds, and a polling interval of 10 seconds. These values should be adjusted based on document types, processing time requirements, and target throughput.
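Applied to the crawler container, the recommendations above translate to environment variables such as the following (the values come from the example above and should be tuned for your workload):

```yaml
env:
  - name: REQUESTS_QUEUE_POLLING_SIZE
    value: "8"        # number of requests processed per polling cycle
  - name: DOWNLOAD_DELAY
    value: "0.5"      # seconds to wait between requests
  - name: REQUESTS_QUEUE_POLLING_INTERVAL
    value: "10"       # seconds between queue checks
```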
Internal Cluster Communication
Optimize internal API calls by forcing all internal cluster communication to use HTTP instead of routing through Load Balancer HTTPS endpoints. Configure the INTERNAL_API_URL environment variable on all backend services to use internal service URLs, such as http://core-prismeai-api-gateway.core/v2.
This optimization provides faster network communication and reduces CPU overhead from HTTPS processing, particularly beneficial for high-frequency internal API calls during runtime operations.
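As a sketch, set the variable on each backend service; the URL below is the internal service address from the example above, so adjust the namespace and service name if yours differ.

```yaml
env:
  - name: INTERNAL_API_URL
    value: "http://core-prismeai-api-gateway.core/v2"   # plain HTTP inside the cluster
```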
Runtime Configuration
Readiness Probe Tuning
Configure readiness probes with appropriate timeouts to prevent pod termination during load spikes. Set probe timeouts to at least 3 seconds with 2-3 failure attempts allowed before considering a pod unhealthy. This flexibility prevents unnecessary pod restarts during temporary high-load conditions.
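A readiness probe following these guidelines might look like the snippet below; the probe path and port are hypothetical and must match the health endpoint your service actually exposes.

```yaml
readinessProbe:
  httpGet:
    path: /healthz       # replace with the service's real health endpoint
    port: 3000           # replace with the service's port
  timeoutSeconds: 3      # allow at least 3 seconds before a probe attempt fails
  failureThreshold: 3    # tolerate 2-3 consecutive failures before marking the pod unready
  periodSeconds: 10
```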
Throttle Management
Consider disabling runtime throttling globally or specifically for AI Knowledge and AI Store workspaces to improve performance under load. Alternatively, increase throttle limits according to your performance requirements and capacity planning; see https://docs.prisme.ai/api-reference/rate-limits#configuration-options for the available configuration options.
API Gateway Timeout Adjustment
The API gateway default timeout of 60 seconds may be insufficient for LLM calls that can exceed one minute. Adjust the timeout configuration in the core-prismeai-api-gateway-config ConfigMap to accommodate longer-running requests, typically setting it to 120 seconds or based on your specific LLM response time requirements.
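The exact configuration key depends on your chart version, so the safest approach is to inspect the ConfigMap and raise the relevant timeout value, for example:

```bash
# Inspect the gateway configuration, then raise the request timeout (e.g. to 120s).
kubectl -n core get configmap core-prismeai-api-gateway-config -o yaml
kubectl -n core edit configmap core-prismeai-api-gateway-config
# Restart the gateway pods so the new timeout is picked up
# (the deployment name may differ in your release):
kubectl -n core rollout restart deployment core-prismeai-api-gateway
```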
Event Volume Management
Reduce the size of execution events that are primarily used for monitoring rather than functional purposes. The BROKER_EMIT_EXECUTED_AUTOMATION_MAXLEN and BROKER_EMIT_MAXLEN environment variables control maximum event size, with a default of 10,000 characters for runtime.automations.executed events (BROKER_EMIT_EXECUTED_AUTOMATION_MAXLEN) and 100,000 characters for all other events (BROKER_EMIT_MAXLEN). These defaults should be suitable for most monitoring needs while reducing storage and processing overhead.
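If you need to shrink events further, both limits can be set explicitly on the runtime service; the values below are the defaults quoted above, so lower them to reduce event volume.

```yaml
env:
  - name: BROKER_EMIT_EXECUTED_AUTOMATION_MAXLEN
    value: "10000"     # default: max size of runtime.automations.executed events
  - name: BROKER_EMIT_MAXLEN
    value: "100000"    # default: max size of all other events
```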
Database Scaling
MongoDB Scaling
Implement MongoDB replica sets for high availability and read scaling, typically deploying with three replicas. For very large deployments, consider implementing MongoDB sharding with config servers, shard servers, and mongos routers, though this adds complexity and should only be used when dataset size exceeds single replica set capabilities.
Optimize database indexes for common queries, including user email lookups, document text searches, and agent queries by workspace and type. Scale MongoDB resources appropriately based on observed usage patterns and performance metrics.
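A sketch of such indexes in mongosh; the collection and field names are hypothetical and must be mapped to the schema your deployment actually uses.

```javascript
// Hypothetical collection and field names: adapt to your deployment's schema.
db.users.createIndex({ email: 1 }, { unique: true });    // user email lookups
db.documents.createIndex({ text: "text" });              // document text searches
db.agents.createIndex({ workspaceId: 1, type: 1 });      // agent queries by workspace and type
```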
Elasticsearch/OpenSearch Scaling
Index Lifecycle Management (ILM) Policies
We recommend implementing ILM policies to automate index rollover, deletion, and transition between storage tiers based on criteria such as index age, size, or activity. This helps:
- Automatically remove outdated indices (e.g., logs older than 30 days),
- Reduce disk usage and IOPS pressure,
- Limit memory usage by keeping only active shards in RAM,
- Maintain fast query response times in production environments.
Example of a basic ILM policy for daily logs:
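A minimal sketch in Elasticsearch syntax, rolling indices over daily or at 50 GB and deleting them after 30 days; adjust the thresholds to your retention policy (on OpenSearch the equivalent is an ISM policy with a different API).

```
PUT _ilm/policy/daily-logs-retention
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_size": "50gb" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```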
Kubernetes Cleaner Service Deployment
In addition to ILM, we provide a lightweight Kubernetes-native cleaner service that automatically:
- Detects inactive or closed indices,
- Removes old indices not yet covered by ILM,
- Unloads stale shards from RAM to reduce heap pressure.
The service can be deployed as a CronJob or long-running sidecar. It connects to your cluster via API and applies retention or cleanup rules defined in a config file or environment variables. This is especially useful in high-ingestion environments where ILM alone is not sufficient to manage short-lived or redundant indices.
For deployment instructions, configuration examples, and security considerations, refer to the elastic-cleaner section of the deployment guide.
Optimize index settings
- Scale your search cluster by adding more nodes and optimizing node roles. Configure dedicated master nodes for cluster management and data nodes for storage and search operations.
- Optimize index settings including primary shard count, replica count, and refresh intervals based on your data volume and query patterns.
- Implement Index Lifecycle Management (ILM) policies to automatically manage index aging, including hot, warm, cold, and delete phases.
Here’s a typical configuration to apply to an index (or datastream, as for AI Knowledge) to improve write performance:
- Retrieve your AI Knowledge (or other) index template configuration:
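For example (the template name is a placeholder; list existing templates with GET _index_template to find the one used by your AI Knowledge datastream):

```
GET _index_template/<your-ai-knowledge-template>
```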
- Keep it, adjust the existing configuration as needed, and add the following template settings:
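A sketch of the resulting template update; keep the index_patterns, mappings, and other settings you retrieved, and merge in the settings block below (only the settings are shown here, and the refresh interval value is an example).

```
PUT _index_template/<your-ai-knowledge-template>
{
  "index_patterns": ["<your-datastream-pattern>"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.number_of_shards": 3,
      "index.number_of_replicas": 1,
      "index.refresh_interval": "30s"
    }
  }
}
```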
Here, we configure the index template with 3 primary shards and 1 replica per primary, allowing you to distribute write traffic across all of your 3 nodes. Decrease index.number_of_shards to 2 if you only have 2 nodes. index.refresh_interval configures how often Elasticsearch makes your freshly written data available for search.
- Roll over your datastream to create a new index with the updated template:
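For example (replace the datastream name with yours):

```
POST <your-ai-knowledge-datastream>/_rollover
```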
Elasticsearch Self-Hosted Considerations
When running a self-hosted Elasticsearch or OpenSearch cluster, ensure nodes are distributed across different physical machines for proper redundancy. Use high-performance disks and monitor CPU iowait metrics to identify potential disk bottlenecks that could impact search performance.
Pay attention to cluster health metrics and ensure adequate disk space for index growth and operations like merging and replication.
Redis Scaling
Deploy Redis in cluster mode for horizontal scaling, implementing replication with multiple replica nodes for read scaling. Optimize Redis configuration including memory policies, connection limits, and persistence settings based on your caching requirements and data patterns.
Monitor Redis performance metrics including memory usage, connected clients, and command latency to identify scaling needs and potential bottlenecks.
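As an illustration, a few redis.conf directives commonly adjusted when scaling; the values are placeholders to size against your available memory and client counts.

```
# redis.conf (illustrative values)
maxmemory 4gb
maxmemory-policy allkeys-lru   # evict least-recently-used keys when memory is full
maxclients 10000               # raise if you scale out API/runtime replicas
appendonly yes                 # persistence; disable only if losing cached data is acceptable
```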
Storage Scaling
Object Storage
S3 or compatible object storage typically scales automatically, but ensure proper configuration for performance and cost optimization. Enable transfer acceleration for faster uploads, use multipart uploads for large files, and implement appropriate file organization strategies.
Consider regional deployments for global access and implement lifecycle policies for cost optimization, using appropriate storage classes based on access patterns.
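A minimal S3 lifecycle rule illustrating this; the bucket name and transition threshold are placeholders, and you should only add expiration rules if they match your data retention requirements.

```bash
# lifecycle.json: move objects to infrequent-access storage after 30 days
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "tier-infrequent-access",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [{ "Days": 30, "StorageClass": "STANDARD_IA" }]
    }
  ]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
  --bucket <your-prismeai-bucket> \
  --lifecycle-configuration file://lifecycle.json
```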
Persistent Volumes
Adjust storage for stateful components by expanding persistent volume claims where supported by the storage class. Monitor storage usage patterns and plan for growth, ensuring adequate space for database operations, backups, and temporary files.
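Where the storage class allows volume expansion, a claim can be grown in place, for example:

```bash
# Requires a StorageClass with allowVolumeExpansion: true; the PVC name is a placeholder.
kubectl patch pvc <your-pvc-name> \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
```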
Infrastructure Scaling with Terraform
Scale Kubernetes nodes by adjusting node groups in your Terraform configuration, setting appropriate minimum, maximum, and desired node counts based on workload requirements. Configure cluster autoscaling for automatic node provisioning based on pod resource requirements and scheduling constraints.
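As a sketch, assuming an EKS cluster managed with the Terraform AWS provider, node group sizing lives in the scaling_config block; the resource names, instance types, and sizes below are placeholders.

```hcl
# Illustrative EKS node group sizing (adapt names, sizes, and instance types)
resource "aws_eks_node_group" "prismeai_workers" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "prismeai-workers"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids
  instance_types  = ["m6i.xlarge"]

  scaling_config {
    min_size     = 3
    desired_size = 4
    max_size     = 10
  }
}
```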
For global deployments, implement multi-region architecture with appropriate load balancing, database replication, and storage synchronization strategies.
Monitoring for Scaling Decisions
Key Metrics
Monitor core metrics that indicate scaling needs including API response times above 200ms, sustained CPU utilization above 70%, memory usage above 80%, increasing queue depths, and connection timeouts.
Monitoring Tools
Implement comprehensive monitoring using Prometheus and Grafana, Kubernetes metrics server, custom dashboards for Prisme.ai services, and database-specific monitoring tools.
Alert Thresholds
Set up alert thresholds to trigger scaling actions, with warnings at 60% resource utilization, critical alerts at 80% utilization, performance degradation above 50%, and error rate increases above 10%.
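A sketch of the CPU thresholds as Prometheus alerting rules; the metric selectors and namespace are illustrative and should match your actual label scheme (cAdvisor and kube-state-metrics are assumed to be scraped).

```yaml
groups:
  - name: prismeai-scaling
    rules:
      # CPU usage relative to requested CPU, per pod
      - alert: PrismeaiCpuWarning
        expr: |
          sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="core", container!=""}[5m]))
            / sum by (pod) (kube_pod_container_resource_requests{namespace="core", resource="cpu"}) > 0.6
        for: 10m
        labels:
          severity: warning
      - alert: PrismeaiCpuCritical
        expr: |
          sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="core", container!=""}[5m]))
            / sum by (pod) (kube_pod_container_resource_requests{namespace="core", resource="cpu"}) > 0.8
        for: 5m
        labels:
          severity: critical
```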
Scaling Dashboards
Create focused scaling dashboards showing resource usage trends, traffic patterns, database performance metrics, and storage growth rates to support scaling decisions.
Scaling Best Practices
Gradual Implementation
Implement scaling changes gradually rather than making large adjustments at once. Increase replicas by 50-100% increments, monitor effects before further scaling, allow systems to stabilize between changes, and document performance impacts for future reference.
Testing and Validation
Test scaling changes in non-production environments using load testing tools like JMeter, k6, or Locust. Simulate real-world usage patterns, test both scaling up and down scenarios, and verify application behavior during scaling events.
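For instance, a minimal k6 scenario that ramps virtual users up, holds the load, and ramps back down to observe scale-up and scale-down behaviour; the endpoint and stage durations are placeholders for your own usage patterns.

```javascript
import http from 'k6/http';
import { sleep } from 'k6';

// Ramp up, hold, and ramp down to exercise both scale-up and scale-down paths.
export const options = {
  stages: [
    { duration: '5m', target: 100 },   // ramp up to 100 virtual users
    { duration: '10m', target: 100 },  // sustain the load
    { duration: '5m', target: 0 },     // ramp down
  ],
};

export default function () {
  http.get('https://your-prisme-host/health'); // placeholder endpoint
  sleep(1);
}
```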
Automation
Use automation for routine scaling operations including Horizontal Pod Autoscalers, cluster autoscaling, scheduled scaling for predictable patterns, and anomaly detection for unexpected load increases.
Documentation
Maintain clear documentation for scaling operations including standard operating procedures, emergency scaling runbooks, performance baselines, and historical scaling decisions with their outcomes.
Next Steps
Continue with platform operations by implementing regular updates to keep your platform current, and establish comprehensive backup and restore strategies to protect your data and ensure business continuity.