Scaling Approaches
Horizontal Scaling
Horizontal scaling involves adding more instances (pods, nodes) to distribute load.
Benefits:
- Better fault tolerance and availability
- Linear capacity scaling
- No downtime during scaling operations
Considerations:
- Requires stateless application design
- More complex networking
- Service discovery requirements
Vertical Scaling
Vertical scaling involves increasing resources (CPU, memory) of existing instances.
Benefits:
- Simpler to implement
- Better for stateful components
- Can address specific bottlenecks
Considerations:
- Limited by maximum resource sizes
- May require downtime during scaling
- Cost efficiency diminishes at larger scales
When to Scale
Performance Indicators
Monitor these key metrics to identify scaling needs:
- API response times exceeding thresholds
- CPU utilization consistently above 70%
- Memory utilization consistently above 80%
- Request queue depth increasing
- Database query times growing
Growth Indicators
Business metrics that suggest scaling requirements:
- Increasing number of users
- Growing document count
- More concurrent sessions
- Higher query volume
- Additional knowledge bases
Preventative Scaling
Proactive scaling for anticipated demands:
- Before major rollouts
- Ahead of seasonal peaks
- Prior to marketing campaigns
- In advance of organizational growth
Recovery Objectives
Scaling to meet resilience targets:
- Redundancy requirements
- High availability goals
- Load distribution needs
- Geographic distribution objectives
Scaling Core Components
API & Worker Services
When scaling API and worker services, proper resource management is crucial for optimal performance. First, assess current usage by gathering metrics on performance and resource utilization. Configure Horizontal Pod Autoscaling (HPA) to enable automatic scaling based on CPU and memory metrics, setting appropriate minimum and maximum replica counts. Update your Helm values to configure scaling parameters, including replica counts and autoscaling settings. Set proper resource requests and limits based on observed usage patterns, starting conservatively and adjusting based on monitoring data. Configure Pod Disruption Budgets to ensure high availability during scaling operations.
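As a minimal sketch of this setup, the Helm values override below enables autoscaling and a PodDisruptionBudget for the gateway and worker services. The value keys, service names, and thresholds shown here are illustrative assumptions, not the chart's actual schema; check your chart's values.yaml before applying.

```bash
# Hedged sketch: Helm values enabling HPA and a PodDisruptionBudget.
# Key names and thresholds are assumptions; adapt them to your chart.
cat > scaling-values.yaml <<'EOF'
prismeai-api-gateway:
  replicas: 3
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
  podDisruptionBudget:
    minAvailable: 2
prismeai-runtime:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 8
    targetCPUUtilizationPercentage: 70
EOF

# Apply the override to the existing release (release and chart names are placeholders).
helm upgrade <release> <chart> -n core --reuse-values -f scaling-values.yaml
```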
Product Modules
Each Prisme.ai product module can be scaled independently based on specific usage patterns. AI Knowledge requires scaling for document processing load and large knowledge bases, with tuning based on retrieval volume. AI SecureChat needs scaling based on concurrent user sessions and message throughput, considering message storage requirements. AI Store scaling focuses on catalog browsing traffic and agent deployment operations, with attention to metadata storage needs. Specific workspaces on AI Builder require scaling for concurrent development sessions and complex builds, considering testing environment requirements. Different products may require different scaling approaches based on their specific workloads and usage patterns.
Ingress & Networking
Ensure your ingress controller can handle increased traffic by scaling it appropriately. Configure connection pooling to optimize connection handling for scaled deployments, setting appropriate database pool sizes and Redis client limits. Implement Redis caching for frequently accessed data to reduce load on backend services.
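If your cluster uses the community ingress-nginx chart (an assumption, other controllers expose different values), scaling the controller can look like the sketch below; connection pooling and Redis client limits are service-level settings and are not covered here.

```bash
# Hedged sketch, assuming the community ingress-nginx Helm chart:
# add controller replicas and enable autoscaling to absorb increased traffic.
helm upgrade ingress-nginx ingress-nginx/ingress-nginx -n ingress-nginx \
  --reuse-values \
  --set controller.replicaCount=3 \
  --set controller.autoscaling.enabled=true \
  --set controller.autoscaling.minReplicas=3 \
  --set controller.autoscaling.maxReplicas=8
```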
Resource Optimization
Requests and Limits Configuration
Proper resource configuration is essential for effective scaling. Adjust CPU and memory limits for all core services and applications to accommodate the highest expected usage peaks. Set resource limits above the largest anticipated spikes to ensure services can handle peak loads without being throttled. Configure resource requests equal to their limits to guarantee that pods are assigned to nodes with sufficient available resources for peak loads. This approach ensures consistent performance during high-traffic periods and prevents resource contention between pods on the same node.
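A hedged example of this requests-equal-to-limits (Guaranteed QoS) pattern through Helm values follows; the service key and the figures are illustrative assumptions sized for a hypothetical peak load.

```bash
# Hedged sketch: requests equal to limits so the pod gets Guaranteed QoS and is
# only scheduled on nodes with enough free resources for peak load.
cat > resources-values.yaml <<'EOF'
prismeai-runtime:
  resources:
    requests:
      cpu: "2"        # equal to the limit below
      memory: 4Gi
    limits:
      cpu: "2"
      memory: 4Gi
EOF
helm upgrade <release> <chart> -n core --reuse-values -f resources-values.yaml
```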
Service Crawler Optimization
The crawler service requires specific tuning for optimal performance. The DOWNLOAD_DELAY variable controls the delay between requests and should be adjusted based on target crawl throughput. REQUEST_QUEUES_POLLING_SIZE determines how many requests are processed simultaneously, while REQUEST_QUEUES_POLLING_INTERVAL sets the frequency of queue checks.
For typical document processing, such as a 100KB DOCX file containing 50,000 characters, recommended settings include a polling size of 8 requests, a download delay of 0.5 seconds, and a polling interval of 10 seconds. These values should be adjusted based on document types, processing time requirements, and target throughput.
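Applied as environment variables, the recommended values above can be set as in this sketch; the variable names and values come from the recommendation, but the crawler deployment name is an assumption.

```bash
# Hedged sketch: apply the recommended crawler settings as environment variables.
# The deployment name (prismeai-crawler) is an assumption; the values match the
# recommendations above for typical document processing.
kubectl -n core set env deployment/prismeai-crawler \
  DOWNLOAD_DELAY=0.5 \
  REQUEST_QUEUES_POLLING_SIZE=8 \
  REQUEST_QUEUES_POLLING_INTERVAL=10
```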
Internal Cluster Communication
Optimize internal API calls by forcing all internal cluster communication to use HTTP instead of routing through Load Balancer HTTPS endpoints. Configure the INTERNAL_API_URL environment variable on all services to use internal service URLs, such as http://core-prismeai-api-gateway.core/v2.
This optimization provides faster network communication and reduces CPU overhead from HTTPS processing, particularly beneficial for high-frequency internal API calls during runtime operations.
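One way to roll this out is sketched below; the internal URL and the variable name come from this section, but the deployment names other than prismeai-events are assumptions, so adjust the list to the services actually running in your cluster.

```bash
# Hedged sketch: point each Prisme.ai service at the internal API gateway URL so
# internal calls bypass the public HTTPS load balancer.
for d in prismeai-events prismeai-runtime prismeai-workspaces; do
  kubectl -n core set env "deployment/$d" \
    INTERNAL_API_URL=http://core-prismeai-api-gateway.core/v2
done
```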
Runtime Configuration
Readiness Probe Tuning
Configure readiness probes with appropriate timeouts to prevent pod termination during load spikes. Set probe timeouts to at least 3 seconds with 2-3 failure attempts allowed before considering a pod unhealthy. This flexibility prevents unnecessary pod restarts during temporary high-load conditions.
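A hedged probe configuration reflecting these numbers is sketched below; the timeout and failure threshold come from this section, while the value keys, probe path, and port are assumptions to adapt to the actual service.

```bash
# Hedged sketch: readiness probe tuning via Helm values. Probe path and port are
# assumptions; timeoutSeconds and failureThreshold follow the guidance above.
cat > probe-values.yaml <<'EOF'
prismeai-runtime:
  readinessProbe:
    httpGet:
      path: /sys/healthcheck   # assumed health endpoint, adjust to your service
      port: 3000
    timeoutSeconds: 3          # at least 3s so slow responses under load do not fail the probe
    failureThreshold: 3        # allow 2-3 failed attempts before marking the pod unready
    periodSeconds: 10
EOF
helm upgrade <release> <chart> -n core --reuse-values -f probe-values.yaml
```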
Throttle Management
Consider disabling runtime throttling globally or specifically for AI Knowledge and AI Store workspaces to improve performance under load. Alternatively, increase throttle limits according to your performance requirements and capacity planning; see https://docs.prisme.ai/api-reference/rate-limits#configuration-options for the available configuration options.
API Gateway Timeout Adjustment
The API gateway default timeout of 60 seconds may be insufficient for LLM calls that can exceed one minute. Adjust the timeout configuration in the core-prismeai-api-gateway-config ConfigMap to accommodate longer-running requests, typically setting it to 120 seconds or based on your specific LLM response time requirements.
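The ConfigMap name comes from this section, but the exact key that holds the timeout depends on your gateway configuration, so inspect it before editing; the deployment name in the restart step is an assumption.

```bash
# Hedged sketch: inspect and edit the gateway ConfigMap, then restart the gateway
# so the new timeout (e.g. 60s -> 120s) is picked up.
kubectl -n core get configmap core-prismeai-api-gateway-config -o yaml
kubectl -n core edit configmap core-prismeai-api-gateway-config
kubectl -n core rollout restart deployment core-prismeai-api-gateway   # deployment name assumed
```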
Event Volume Management
Reduce the size of execution events that are primarily used for monitoring rather than functional purposes. The BROKER_EMIT_MAXLEN and BROKER_EMIT_EXECUTED_AUTOMATION_MAXLEN environment variables control maximum event size, with a default of 10,000 characters for runtime.automations.executed events (BROKER_EMIT_EXECUTED_AUTOMATION_MAXLEN) and 100,000 characters for all other events (BROKER_EMIT_MAXLEN). These defaults should be suitable for most monitoring needs while reducing storage and processing overhead.
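For reference, setting these variables explicitly can look like the sketch below; the variable names and default values come from this section, while the runtime deployment name is an assumption.

```bash
# Hedged sketch: cap event sizes on the runtime service (deployment name assumed).
kubectl -n core set env deployment/prismeai-runtime \
  BROKER_EMIT_EXECUTED_AUTOMATION_MAXLEN=10000 \
  BROKER_EMIT_MAXLEN=100000
```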
Database Scaling
MongoDB Scaling
Implement MongoDB replica sets for high availability and read scaling, typically deploying with three replicas. For very large deployments, consider implementing MongoDB sharding with config servers, shard servers, and mongos routers, though this adds complexity and should only be used when the dataset size exceeds what a single replica set can handle. Optimize database indexes for common queries, including user email lookups, document text searches, and agent queries by workspace and type. Scale MongoDB resources appropriately based on observed usage patterns and performance metrics.
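A hedged sketch of indexes supporting the query patterns listed above follows; the collection and field names are illustrative assumptions and must be aligned with the actual schema before running anything.

```bash
# Hedged sketch: create indexes matching the query patterns above.
# Collection and field names are assumptions, not the real Prisme.ai schema.
mongosh "$MONGO_URI" --eval '
  db.users.createIndex({ email: 1 });                 // user email lookups
  db.documents.createIndex({ text: "text" });         // document text searches
  db.agents.createIndex({ workspaceId: 1, type: 1 }); // agent queries by workspace and type
'
```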
Elasticsearch/OpenSearch Scaling
Volume formatting
When formatting the Elasticsearch/OpenSearch filesystem volume, it is important to first shut down the prismeai-events microservice in the core namespace. This can easily be done from Kubernetes by editing the deployment and setting replicas: 0.
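A minimal command sketch for this scale-down and the later scale-up, using the deployment and namespace named above:

```bash
# Scale prismeai-events down before formatting the volume, and back up afterwards.
kubectl -n core scale deployment prismeai-events --replicas=0
# ... format the volume ...
kubectl -n core scale deployment prismeai-events --replicas=1
```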
Index mappings are initialized when prismeai-events starts up. If indexes and index mappings are deleted (such as when formatting a volume) without first stopping prismeai-events, the next event persistence request sent to the cluster causes it to automatically infer an incorrect index mapping, which makes most other persistence requests fail and results in data loss. This happens because, when the Prisme.ai events mapping is not initialized before the first event write requests, ES/OS automatically infers mappings for payload.* nested fields, which is incompatible with the required flattened (ES) / flat_object (OS) mapping on the entire payload field.
This situation causes errors like "Limit of total fields [1000] has been exceeded", as ES/OS tries to map every single payload.* nested field until reaching the 1000 fields maximum limit.
If it is acceptable to delete the events data (which includes workspace debug events and AIK usage metrics), this can be solved by the following steps; a curl sketch for removing a failed datastream follows the list.
- Shutting down prismeai-events by editing the deployment and setting replicas: 0. If a HorizontalPodAutoscaler exists for prismeai-events, first delete it or set its min/max replicas to 0.
- Removing every failed index/datastream, either from Kibana or with curl. Failing names can be found in the _index field of prismeai-events error logs. Names starting with .ds-events- are the backing indexes of datastreams; rename .ds-events-<id>-000001 to events-<id> and delete the datastream itself, which removes all underlying indexes at once.
- Restarting prismeai-events by editing the deployment and setting replicas: 1.
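The sketch below illustrates the datastream removal step; the cluster URL and workspace id placeholders are assumptions, and authentication flags are omitted.

```bash
# Hedged sketch: remove a failed events datastream and all its .ds-events-* backing indexes.
ES_URL="https://elasticsearch.example.com:9200"   # assumed cluster URL
WORKSPACE_ID="<id>"                               # taken from the failing _index name

# Confirm the exact datastream name first.
curl -s "$ES_URL/_data_stream/events-$WORKSPACE_ID"

# Delete the datastream; this removes every underlying .ds-events-<id>-* index at once.
curl -s -X DELETE "$ES_URL/_data_stream/events-$WORKSPACE_ID"
```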
Reindexing events with default mapping
Follow these steps in order to reindex a workspace events datastream with the default index settings & mappings initialized (or updated) by prismeai-events. A consolidated curl sketch of the whole procedure follows the list.
- Find the correct name for the datastream you want to reindex and make sure it exists.
- List existing index templates and make sure an index template exists for your events-* pattern. Some workspaces like AI Knowledge have a custom index template tuned for their needs, with the workspace id included in their index & component template names. Index templates apply their component templates (composed_of) to all indices matching their index_patterns; the component template is where Prisme.ai custom index settings & mappings are configured.
- Create a temporary datastream. Its name must match the index_patterns seen above so this new datastream inherits the default index settings & mappings.
- Reindex your data from the current to the temporary & remapped datastream. Add the "conflicts": "proceed" option to the request body in order to ignore documents already created in the destination index. A response with "failures": [] indicates all data has been reindexed and matches the destination mapping; in case of a mismatch between source data and destination mapping, the response contains an error describing the conflict.
- Delete the current datastream.
- Clone the temporary datastream back to the "current" datastream, exactly as we previously did the other way around. Make sure the response failures is an empty [] and total is the same as in the first _reindex response.
- Check the reindexed data with a search query returning date aggregations, where:
  - hits.total.value is the number of matching documents
  - aggregations.latestDate.value_as_string is the latest matching document date
  - aggregations.oldestDate.value_as_string is the oldest matching document date
- Check from your browser that the target workspace events feed is not empty and contains old data, and that events previously failing to persist are now persisted. If everything is fine, you can delete the temporary datastream.
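The consolidated sketch below walks through these steps with standard Elasticsearch/OpenSearch APIs. The cluster URL, the temporary datastream name, and the date field used in the aggregations are assumptions; the aggregation names and the conflicts option come from the procedure above, and authentication flags are omitted.

```bash
# Hedged sketch of the reindex procedure above. Adapt ES_URL, names and fields.
ES_URL="https://elasticsearch.example.com:9200"
SRC="events-<workspaceId>"       # current datastream
TMP="events-<workspaceId>-tmp"   # temporary datastream; must match the events-* index_patterns

# 1. Make sure the source datastream and an events-* index template exist.
curl -s "$ES_URL/_data_stream/$SRC"
curl -s "$ES_URL/_index_template/events*"

# 2. Create the temporary datastream (inherits default settings & mappings from the template).
curl -s -X PUT "$ES_URL/_data_stream/$TMP"

# 3. Reindex from the current to the temporary datastream.
curl -s -X POST "$ES_URL/_reindex" -H 'Content-Type: application/json' -d "{
  \"conflicts\": \"proceed\",
  \"source\": { \"index\": \"$SRC\" },
  \"dest\":   { \"index\": \"$TMP\", \"op_type\": \"create\" }
}"

# 4. Delete the current datastream, recreate it, then reindex back the other way around.
curl -s -X DELETE "$ES_URL/_data_stream/$SRC"
curl -s -X PUT "$ES_URL/_data_stream/$SRC"
curl -s -X POST "$ES_URL/_reindex" -H 'Content-Type: application/json' -d "{
  \"conflicts\": \"proceed\",
  \"source\": { \"index\": \"$TMP\" },
  \"dest\":   { \"index\": \"$SRC\", \"op_type\": \"create\" }
}"

# 5. Verify: document count and oldest/latest event dates in the rebuilt datastream.
#    The createdAt field is an assumption; use the timestamp field of your mapping.
curl -s "$ES_URL/$SRC/_search" -H 'Content-Type: application/json' -d '{
  "size": 0,
  "track_total_hits": true,
  "aggs": {
    "latestDate": { "max": { "field": "createdAt", "format": "date_time" } },
    "oldestDate": { "min": { "field": "createdAt", "format": "date_time" } }
  }
}'

# 6. Once the workspace events feed looks healthy, drop the temporary datastream.
curl -s -X DELETE "$ES_URL/_data_stream/$TMP"
```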
Index Lifecycle Management (ILM) Policies
Prismeai automatically configures ILM policies to automate index rollover and segment merging when the primary shard reaches 40GB, as recommended by Elasticsearch/OpenSearch.
Our Elasticsearch driver also configures an ILM policy to automate events deletion 30 days (default, configurable with EVENTS_SCHEDULED_DELETION_DAYS) after workspace deletion. This is not yet supported by the OpenSearch driver, which deletes events as soon as the workspace is deleted. Events expiration is not configured from ILM, as ILM does not offer the precision needed to tune different expiration periods depending on the different kinds of data.
Instead, events expiration is enforced by the prismeai-events /sys/cleanup/* APIs, which are automatically called from a Kubernetes CronJob as described below.
Events automated cleanup
In addition to ILM, we provide a lightweight Kubernetes-native cleaner service that automatically:
- Deletes expired events to enforce data retention regulations (e.g. GDPR).
- Deletes datastreams from small & inactive workspaces to reduce shard usage and avoid reaching the 1000 shards per node limit.
- Removes payload and output fields from runtime.automations.executed technical events to save disk space without compromising short-term audit/debug capabilities.
The first two steps are executed from a cleanup-es-indices Kubernetes CronJob scheduled every Sunday at 0AM, while step 3 is executed from a cleanup-exec-events CronJob every night at 3:30AM.
Optimize index settings
- Scale your search cluster by adding more nodes and optimizing node roles. Configure dedicated master nodes for cluster management and data nodes for storage and search operations.
- Optimize index settings including primary shard count, replica count, and refresh intervals based on your data volume and query patterns.
- Implement Index Lifecycle Management (ILM) policies to automatically manage index aging, including hot, warm, cold, and delete phases.
- Retrieve your AI Knowledge (or other) index template configuration.
- Keep it, adjust the existing configuration as needed, and update the last template settings. Decrease index.number_of_shards to 2 if you only have 2 nodes; index.refresh_interval configures how often Elasticsearch makes your freshly written data available for search.
- Rollover your datastream in order to create a new index with the updated template. A hedged command sketch of these last steps follows this list.
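In the sketch below, the cluster URL, template name, and datastream name are placeholders you must replace (AI Knowledge templates include the workspace id in their name), and authentication flags are omitted. Note that PUT _index_template replaces the whole template, so keep the composed_of, mappings, and any other fields from the retrieved body and only adjust the settings.

```bash
# Hedged sketch: retrieve an index template, re-submit it with adjusted settings,
# then roll the datastream over so a new backing index uses the updated template.
ES_URL="https://elasticsearch.example.com:9200"

# 1. Retrieve the current index template configuration.
curl -s "$ES_URL/_index_template/<templateName>"

# 2. Re-submit it with adjusted settings (merge with the retrieved body; the
#    custom Prisme.ai settings may live in a component template referenced by composed_of).
curl -s -X PUT "$ES_URL/_index_template/<templateName>" \
  -H 'Content-Type: application/json' -d '{
  "index_patterns": ["<datastream>*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.number_of_shards": 2,
      "index.number_of_replicas": 1,
      "index.refresh_interval": "30s"
    }
  }
}'

# 3. Roll the datastream over to create a new backing index with the updated template.
curl -s -X POST "$ES_URL/<datastream>/_rollover"
```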