Scale your self-hosted Prisme.ai platform to meet growing demands

As your organization’s usage of Prisme.ai grows, you’ll need to scale your self-hosted platform to maintain performance and reliability. This guide provides strategies and best practices for scaling different components of your Prisme.ai deployment.

Scaling Approaches

Horizontal Scaling

Horizontal scaling involves adding more instances (pods, nodes) to distribute load:

Benefits:

  • Better fault tolerance and availability
  • Linear capacity scaling
  • No downtime during scaling operations

Considerations:

  • Requires stateless application design
  • More complex networking
  • Service discovery requirements

Vertical Scaling

Vertical scaling involves increasing resources (CPU, memory) of existing instances:

Benefits:

  • Simpler to implement
  • Better for stateful components
  • Can address specific bottlenecks

Considerations:

  • Limited by maximum resource sizes
  • May require downtime during scaling
  • Cost efficiency diminishes at larger scales

When to Scale

Performance Indicators

Monitor these key metrics to identify scaling needs:

  • API response times exceeding thresholds
  • CPU utilization consistently above 70%
  • Memory utilization consistently above 80%
  • Request queue depth increasing
  • Database query times growing

Growth Indicators

Business metrics that suggest scaling requirements:

  • Increasing number of users
  • Growing document count
  • More concurrent sessions
  • Higher query volume
  • Additional knowledge bases

Preventative Scaling

Proactive scaling for anticipated demands:

  • Before major rollouts
  • Ahead of seasonal peaks
  • Prior to marketing campaigns
  • In advance of organizational growth

Recovery Objectives

Scaling to meet resilience targets:

  • Redundancy requirements
  • High availability goals
  • Load distribution needs
  • Geographic distribution objectives

Scaling Core Components

API & Worker Services

When scaling API and worker services, proper resource management is crucial for optimal performance. First, assess current usage by gathering metrics on performance and resource utilization. Configure Horizontal Pod Autoscaling (HPA) to enable automatic scaling based on CPU and memory metrics, setting appropriate minimum and maximum replica counts.

Update your Helm values to configure scaling parameters, including replica counts and autoscaling settings. Set proper resource requests and limits based on observed usage patterns, starting conservatively and adjusting based on monitoring data. Configure Pod Disruption Budgets to ensure high availability during scaling operations.
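
As a minimal sketch, the standard Kubernetes resources involved look like the following. The deployment name, namespace, labels, and replica bounds are placeholders; adapt them to your own release names and size them from observed load.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prismeai-runtime        # placeholder: your API or worker deployment
  namespace: core
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prismeai-runtime
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prismeai-runtime
  namespace: core
spec:
  minAvailable: 1               # keep at least one pod running during disruptions
  selector:
    matchLabels:
      app: prismeai-runtime     # placeholder: match your pod labels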

Product Modules

Each Prisme.ai product module can be scaled independently based on specific usage patterns. AI Knowledge requires scaling for document processing load and large knowledge bases, with tuning based on retrieval volume. AI SecureChat needs scaling based on concurrent user sessions and message throughput, considering message storage requirements.

AI Store scaling focuses on catalog browsing traffic and agent deployment operations, with attention to metadata storage needs. AI Builder workspaces require scaling for concurrent development sessions and complex builds, with consideration for testing environment requirements.

Different products may require different scaling approaches based on their specific workloads and usage patterns.

Ingress & Networking

Ensure your ingress controller can handle increased traffic by scaling it appropriately. Configure connection pooling to optimize connection handling for scaled deployments, setting appropriate database pool sizes and Redis client limits. Implement Redis caching for frequently accessed data to reduce load on backend services.
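
As an illustration of the ingress piece, if you use the ingress-nginx Helm chart, replica count and autoscaling can be set through values along these lines; adapt the keys if you run a different ingress controller.

controller:
  replicaCount: 3
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 6
    targetCPUUtilizationPercentage: 70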

Resource Optimization

Requests and Limits Configuration

Proper resource configuration is essential for effective scaling. Adjust CPU and memory limits for all core services and applications to accommodate the highest expected usage peaks. Set resource limits above the largest anticipated spikes to ensure services can handle peak loads without being throttled.

Configure resource requests equal to their limits to guarantee that pods are assigned to nodes with sufficient available resources for peak loads. This approach ensures consistent performance during high-traffic periods and prevents resource contention between pods on the same node.
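
A minimal sketch of a container resources block following this requests-equal-to-limits approach; the values are illustrative and should be sized from your monitoring data.

resources:
  requests:
    cpu: "1"          # equal to the limit to guarantee scheduling headroom
    memory: 2Gi
  limits:
    cpu: "1"
    memory: 2Gi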

Crawler Service Optimization

The crawler service requires specific tuning for optimal performance. The DOWNLOAD_DELAY variable controls the delay between requests and should be adjusted based on target crawl throughput. The REQUESTS_QUEUE_POLLING_SIZE determines how many requests are processed simultaneously, while REQUESTS_QUEUE_POLLING_INTERVAL sets the frequency of queue checks.

For typical document processing, such as a 100KB DOCX file containing 50,000 characters, recommended settings include a polling size of 8 requests, a download delay of 0.5 seconds, and a polling interval of 10 seconds. These values should be adjusted based on document types, processing time requirements, and target throughput.
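
Applied to the crawler container, these recommendations translate into environment variables along these lines; the exact Helm value path for injecting them depends on your chart version.

env:
  - name: REQUESTS_QUEUE_POLLING_SIZE
    value: "8"          # requests processed per polling cycle
  - name: DOWNLOAD_DELAY
    value: "0.5"        # seconds between requests
  - name: REQUESTS_QUEUE_POLLING_INTERVAL
    value: "10"         # seconds between queue checks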

Internal Cluster Communication

Optimize internal API calls by forcing all internal cluster communication to use HTTP instead of routing through Load Balancer HTTPS endpoints. Configure the INTERNAL_API_URL environment variable on all backend services to use internal service URLs, such as http://core-prismeai-api-gateway.core/v2.

This optimization provides faster network communication and reduces CPU overhead from HTTPS processing, particularly beneficial for high-frequency internal API calls during runtime operations.
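
For example, on each backend service container (the URL matches the internal service address mentioned above; adjust it to your namespace and release names):

env:
  - name: INTERNAL_API_URL
    value: "http://core-prismeai-api-gateway.core/v2"   # internal HTTP, bypasses the load balancer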

Runtime Configuration

Readiness Probe Tuning

Configure readiness probes with appropriate timeouts to prevent pod termination during load spikes. Set probe timeouts to at least 3 seconds with 2-3 failure attempts allowed before considering a pod unhealthy. This flexibility prevents unnecessary pod restarts during temporary high-load conditions.
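
A sketch of such a probe, assuming an HTTP health endpoint; the path and port are placeholders for your service's actual healthcheck.

readinessProbe:
  httpGet:
    path: /healthz        # placeholder: use the service's own healthcheck path
    port: http
  periodSeconds: 10
  timeoutSeconds: 3       # at least 3 seconds, as recommended above
  failureThreshold: 3     # tolerate 2-3 failed checks before marking the pod unready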

Throttle Management

Consider disabling runtime throttling globally, or specifically for AI Knowledge and AI Store workspaces, to improve performance under load. Alternatively, increase throttle limits according to your performance requirements and capacity planning. See https://docs.prisme.ai/api-reference/rate-limits#configuration-options for the available configuration options.

API Gateway Timeout Adjustment

The API gateway default timeout of 60 seconds may be insufficient for LLM calls that can exceed one minute. Adjust the timeout configuration in the core-prismeai-api-gateway-config ConfigMap to accommodate longer-running requests, typically setting it to 120 seconds or based on your specific LLM response time requirements.
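
The exact key inside that ConfigMap depends on your chart version; the fragment below is only a sketch, and both the data key and the timeout field shown are hypothetical placeholders indicating where the 120-second value goes.

apiVersion: v1
kind: ConfigMap
metadata:
  name: core-prismeai-api-gateway-config
  namespace: core
data:
  # Hypothetical key and field names: set the gateway request timeout to
  # 120 seconds using whatever timeout setting your ConfigMap actually exposes.
  gateway.yaml: |
    timeout: 120000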

Event Volume Management

Reduce the size of execution events that are primarily used for monitoring rather than functional purposes. The BROKER_EMIT_EXECUTED_AUTOMATION_MAXLEN and BROKER_EMIT_MAXLEN environment variables control maximum event size, with defaults of 10,000 characters for runtime.automations.executed events and 100,000 characters for all other events. These defaults suit most monitoring needs while reducing storage and processing overhead.
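
These limits are set through environment variables on the event-emitting services (typically the runtime); the values below are the documented defaults, which you can lower further if event storage is a concern.

env:
  - name: BROKER_EMIT_EXECUTED_AUTOMATION_MAXLEN
    value: "10000"      # max characters for runtime.automations.executed events
  - name: BROKER_EMIT_MAXLEN
    value: "100000"     # max characters for all other events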

Database Scaling

MongoDB Scaling

Implement MongoDB replica sets for high availability and read scaling, typically deploying with three replicas. For very large deployments, consider implementing MongoDB sharding with config servers, shard servers, and mongos routers, though this adds complexity and should only be used when dataset size exceeds single replica set capabilities.
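
As an illustration, a three-member replica set deployed with the Bitnami MongoDB Helm chart would use values along these lines; key names differ if you deploy MongoDB another way, and the resource figures are placeholders.

architecture: replicaset
replicaCount: 3
arbiter:
  enabled: false
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 4Gi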

Optimize database indexes for common queries, including user email lookups, document text searches, and agent queries by workspace and type. Scale MongoDB resources appropriately based on observed usage patterns and performance metrics.

Elasticsearch/OpenSearch Scaling

Index Lifecycle Management (ILM) Policies

We recommend implementing ILM policies to automate index rollover, deletion, and transition between storage tiers based on criteria such as index age, size, or activity. This helps:

  • Automatically remove outdated indices (e.g., logs older than 30 days),
  • Reduce disk usage and IOPS pressure,
  • Limit memory usage by keeping only active shards in RAM,
  • Maintain fast query response times in production environments.

Example of a basic ILM policy for daily logs:

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "1d",
            "max_size": "20gb"
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Kubernetes Cleaner Service Deployment

In addition to ILM, we provide a lightweight Kubernetes-native cleaner service that automatically:

  • Detects inactive or closed indices,
  • Removes old indices not yet covered by ILM,
  • Unloads stale shards from RAM to reduce heap pressure.

The service can be deployed as a CronJob or long-running sidecar. It connects to your cluster via API and applies retention or cleanup rules defined in a config file or environment variables. This is especially useful in high-ingestion environments where ILM alone is not sufficient to manage short-lived or redundant indices.

For deployment instructions, configuration examples, and security considerations, refer to the elastic-cleaner section of the deployment guide.

Optimize Index Settings

  • Scale your search cluster by adding more nodes and optimizing node roles. Configure dedicated master nodes for cluster management and data nodes for storage and search operations.
  • Optimize index settings including primary shard count, replica count, and refresh intervals based on your data volume and query patterns.
  • Implement Index Lifecycle Management (ILM) policies to automatically manage index aging, including hot, warm, cold, and delete phases.

Here’s a typical configuration to apply to an index (or data stream, as used by AI Knowledge) to improve write performance:

  1. Retrieve your AI Knowledge (or other) index template configuration:
GET _index_template/index-template-events-<workspaceId> 
  2. Keep its contents, adjust the existing configuration as needed, and add the template settings shown below:
PUT _index_template/index-template-events-<workspaceId> 
{
  "index_patterns": [ ... ],
  "composed_of": [ ... ],  
  "priority": 1,
  "data_stream": {
    "hidden": false,
    "allow_custom_routing": false
  },
  "template": {
    "settings": {
      "index.number_of_shards": 3,
      "index.number_of_replicas": 1,
      "index.refresh_interval": "5s",
      "index.translog.durability": "async"
    }
  }  
}

Here, the index template is configured with 3 primary shards and 1 replica per primary, allowing write traffic to be distributed across all 3 of your nodes.
Decrease index.number_of_shards to 2 if you only have 2 nodes.
index.refresh_interval controls how often Elasticsearch makes freshly written data available for search.

  3. Roll over your data stream to create a new index with the updated template:
POST /events-<workspaceId>/_rollover

Elasticsearch Self-Hosted Considerations

When running a self-hosted Elasticsearch or OpenSearch cluster, ensure nodes are distributed across different physical machines for proper redundancy. Use high-performance disks and monitor CPU iowait metrics to identify potential disk bottlenecks that could impact search performance.

Pay attention to cluster health metrics and ensure adequate disk space for index growth and operations like merging and replication.

Redis Scaling

Deploy Redis in cluster mode for horizontal scaling, or implement replication with multiple replica nodes for read scaling. Optimize Redis configuration, including memory policies, connection limits, and persistence settings, based on your caching requirements and data patterns.
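
For example, with the Bitnami Redis chart, read replicas and a cache-friendly memory policy can be configured roughly as follows; adapt the keys to your own Redis deployment, or use a dedicated Redis Cluster chart for full cluster mode.

architecture: replication
replica:
  replicaCount: 3               # read replicas for read scaling
master:
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "1"
      memory: 2Gi
commonConfiguration: |-
  maxmemory-policy allkeys-lru  # example eviction policy for cache-style workloads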

Monitor Redis performance metrics including memory usage, connected clients, and command latency to identify scaling needs and potential bottlenecks.

Storage Scaling

Object Storage

S3 or compatible object storage typically scales automatically, but ensure proper configuration for performance and cost optimization. Enable transfer acceleration for faster uploads, use multipart uploads for large files, and implement appropriate file organization strategies.

Consider regional deployments for global access and implement lifecycle policies for cost optimization, using appropriate storage classes based on access patterns.

Persistent Volumes

Adjust storage for stateful components by expanding persistent volume claims where supported by the storage class. Monitor storage usage patterns and plan for growth, ensuring adequate space for database operations, backups, and temporary files.
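
Where the storage class allows volume expansion, growing a claim is a matter of raising the requested size on the existing PVC, for example as below; the claim name, namespace, and size are placeholders.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: datadir-mongodb-0       # placeholder: the existing claim to expand
  namespace: core
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi            # new, larger size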

Infrastructure Scaling with Terraform

Scale Kubernetes nodes by adjusting node groups in your Terraform configuration, setting appropriate minimum, maximum, and desired node counts based on workload requirements. Configure cluster autoscaling for automatic node provisioning based on pod resource requirements and scheduling constraints.

For global deployments, implement multi-region architecture with appropriate load balancing, database replication, and storage synchronization strategies.

Monitoring for Scaling Decisions

Key Metrics

Monitor core metrics that indicate scaling needs including API response times above 200ms, sustained CPU utilization above 70%, memory usage above 80%, increasing queue depths, and connection timeouts.

Monitoring Tools

Implement comprehensive monitoring using Prometheus and Grafana, Kubernetes metrics server, custom dashboards for Prisme.ai services, and database-specific monitoring tools.

Alert Thresholds

Set up alert thresholds that trigger scaling actions, with warnings at 60% resource utilization, critical alerts at 80% utilization, alerts on response-time degradation of more than 50% from baseline, and alerts on error-rate increases above 10%.
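
As a sketch, with Prometheus the CPU thresholds above map to alerting rules of the following shape; the metric names, namespace label, and durations are illustrative and should be adapted to your exporters and environment.

groups:
  - name: prismeai-scaling
    rules:
      - alert: CpuUtilizationWarning
        # CPU usage as a fraction of configured CPU limits (requires kube-state-metrics)
        expr: sum(rate(container_cpu_usage_seconds_total{namespace="core"}[5m])) / sum(kube_pod_container_resource_limits{namespace="core", resource="cpu"}) > 0.6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "CPU utilization above 60% of limits"
      - alert: CpuUtilizationCritical
        expr: sum(rate(container_cpu_usage_seconds_total{namespace="core"}[5m])) / sum(kube_pod_container_resource_limits{namespace="core", resource="cpu"}) > 0.8
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "CPU utilization above 80% of limits"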

Scaling Dashboards

Create focused scaling dashboards showing resource usage trends, traffic patterns, database performance metrics, and storage growth rates to support scaling decisions.

Scaling Best Practices

Gradual Implementation

Implement scaling changes gradually rather than making large adjustments at once. Increase replicas by 50-100% increments, monitor effects before further scaling, allow systems to stabilize between changes, and document performance impacts for future reference.

Testing and Validation

Test scaling changes in non-production environments using load testing tools like JMeter, k6, or Locust. Simulate real-world usage patterns, test both scaling up and down scenarios, and verify application behavior during scaling events.

Automation

Use automation for routine scaling operations including Horizontal Pod Autoscalers, cluster autoscaling, scheduled scaling for predictable patterns, and anomaly detection for unexpected load increases.

Documentation

Maintain clear documentation for scaling operations including standard operating procedures, emergency scaling runbooks, performance baselines, and historical scaling decisions with their outcomes.

Next Steps

Continue with platform operations by implementing regular updates to keep your platform current, and establish comprehensive backup and restore strategies to protect your data and ensure business continuity.