Scale your self-hosted Prisme.ai platform to meet growing demands

As your organization’s usage of Prisme.ai grows, you’ll need to scale your self-hosted platform to maintain performance and reliability. This guide provides strategies and best practices for scaling different components of your Prisme.ai deployment.

Scaling Approaches

Horizontal Scaling

Horizontal scaling involves adding more instances (pods, nodes) to distribute load:

Benefits:

  • Better fault tolerance and availability
  • Linear capacity scaling
  • No downtime during scaling operations

Considerations:

  • Requires stateless application design
  • More complex networking
  • Service discovery requirements

Vertical Scaling

Vertical scaling involves increasing resources (CPU, memory) of existing instances:

Benefits:

  • Simpler to implement
  • Better for stateful components
  • Can address specific bottlenecks

Considerations:

  • Limited by maximum resource sizes
  • May require downtime during scaling
  • Cost efficiency diminishes at larger scales

When to Scale

Performance Indicators

Monitor these key metrics to identify scaling needs:

  • API response times exceeding thresholds
  • CPU utilization consistently above 70%
  • Memory utilization consistently above 80%
  • Request queue depth increasing
  • Database query times growing

Growth Indicators

Business metrics that suggest scaling requirements:

  • Increasing number of users
  • Growing document count
  • More concurrent sessions
  • Higher query volume
  • Additional knowledge bases

Preventative Scaling

Proactive scaling for anticipated demands:

  • Before major rollouts
  • Ahead of seasonal peaks
  • Prior to marketing campaigns
  • In advance of organizational growth

Recovery Objectives

Scaling to meet resilience targets:

  • Redundancy requirements
  • High availability goals
  • Load distribution needs
  • Geographic distribution objectives

Scaling Core Components

API & Worker Services

When scaling API and worker services, proper resource management is crucial for optimal performance. First, assess current usage by gathering metrics on performance and resource utilization. Configure Horizontal Pod Autoscaling (HPA) to enable automatic scaling based on CPU and memory metrics, setting appropriate minimum and maximum replica counts.

Update your Helm values to configure scaling parameters, including replica counts and autoscaling settings. Set proper resource requests and limits based on observed usage patterns, starting conservatively and adjusting based on monitoring data. Configure Pod Disruption Budgets to ensure high availability during scaling operations.
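
As a minimal sketch, the standard Kubernetes resources involved look like the following. The deployment name, namespace, labels, and replica bounds are placeholders; adapt them to your own release names and size them from observed load.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prismeai-runtime        # placeholder: your API or worker deployment
  namespace: core
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prismeai-runtime
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prismeai-runtime
  namespace: core
spec:
  minAvailable: 1               # keep at least one pod running during disruptions
  selector:
    matchLabels:
      app: prismeai-runtime     # placeholder: match your pod labels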

Product Modules

Each Prisme.ai product module can be scaled independently based on specific usage patterns. AI Knowledge requires scaling for document processing load and large knowledge bases, with tuning based on retrieval volume. AI SecureChat needs scaling based on concurrent user sessions and message throughput, considering message storage requirements.

AI Store scaling focuses on catalog browsing traffic and agent deployment operations, with attention to metadata storage needs. AI Builder workspaces require scaling for concurrent development sessions and complex builds, with consideration for testing environment requirements.

Different products may require different scaling approaches based on their specific workloads and usage patterns.

Ingress & Networking

Ensure your ingress controller can handle increased traffic by scaling it appropriately. Configure connection pooling to optimize connection handling for scaled deployments, setting appropriate database pool sizes and Redis client limits. Implement Redis caching for frequently accessed data to reduce load on backend services.
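
As an illustration of the ingress piece, if you use the ingress-nginx Helm chart, replica count and autoscaling can be set through values along these lines; adapt the keys if you run a different ingress controller.

controller:
  replicaCount: 3
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 6
    targetCPUUtilizationPercentage: 70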

Resource Optimization

Requests and Limits Configuration

Proper resource configuration is essential for effective scaling. Adjust CPU and memory limits for all core services and applications to accommodate the highest expected usage peaks. Set resource limits above the largest anticipated spikes to ensure services can handle peak loads without being throttled.

Configure resource requests equal to their limits to guarantee that pods are assigned to nodes with sufficient available resources for peak loads. This approach ensures consistent performance during high-traffic periods and prevents resource contention between pods on the same node.
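
A minimal sketch of a container resources block following this requests-equal-to-limits approach; the values are illustrative and should be sized from your monitoring data.

resources:
  requests:
    cpu: "1"          # equal to the limit to guarantee scheduling headroom
    memory: 2Gi
  limits:
    cpu: "1"
    memory: 2Gi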

Crawler Service Optimization

The crawler service requires specific tuning for optimal performance. The DOWNLOAD_DELAY variable controls the delay between requests and should be adjusted based on target crawl throughput. The REQUESTS_QUEUE_POLLING_SIZE determines how many requests are processed simultaneously, while REQUESTS_QUEUE_POLLING_INTERVAL sets the frequency of queue checks.

For typical document processing, such as a 100KB DOCX file containing 50,000 characters, recommended settings include a polling size of 8 requests, a download delay of 0.5 seconds, and a polling interval of 10 seconds. These values should be adjusted based on document types, processing time requirements, and target throughput.
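
Applied to the crawler container, these recommendations translate into environment variables along these lines; the exact Helm value path for injecting them depends on your chart version.

env:
  - name: REQUESTS_QUEUE_POLLING_SIZE
    value: "8"          # requests processed per polling cycle
  - name: DOWNLOAD_DELAY
    value: "0.5"        # seconds between requests
  - name: REQUESTS_QUEUE_POLLING_INTERVAL
    value: "10"         # seconds between queue checks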

Internal Cluster Communication

Optimize internal API calls by forcing all internal cluster communication to use HTTP instead of routing through Load Balancer HTTPS endpoints. Configure the INTERNAL_API_URL environment variable on all backend services to use internal service URLs, such as http://core-prismeai-api-gateway.core/v2.

This optimization provides faster network communication and reduces CPU overhead from HTTPS processing, particularly beneficial for high-frequency internal API calls during runtime operations.
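
For example, on each backend service container (the URL matches the internal service address mentioned above; adjust it to your namespace and release names):

env:
  - name: INTERNAL_API_URL
    value: "http://core-prismeai-api-gateway.core/v2"   # internal HTTP, bypasses the load balancer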

Runtime Configuration

Readiness Probe Tuning

Configure readiness probes with appropriate timeouts to prevent pod termination during load spikes. Set probe timeouts to at least 3 seconds with 2-3 failure attempts allowed before considering a pod unhealthy. This flexibility prevents unnecessary pod restarts during temporary high-load conditions.
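
A sketch of such a probe, assuming an HTTP health endpoint; the path and port are placeholders for your service's actual healthcheck.

readinessProbe:
  httpGet:
    path: /healthz        # placeholder: use the service's own healthcheck path
    port: http
  periodSeconds: 10
  timeoutSeconds: 3       # at least 3 seconds, as recommended above
  failureThreshold: 3     # tolerate 2-3 failed checks before marking the pod unready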

Throttle Management

Consider disabling runtime throttling globally, or specifically for AI Knowledge and AI Store workspaces, to improve performance under load. Alternatively, increase throttle limits according to your performance requirements and capacity planning. See https://docs.prisme.ai/api-reference/rate-limits#configuration-options for the available configuration options.

API Gateway Timeout Adjustment

The API gateway default timeout of 60 seconds may be insufficient for LLM calls that can exceed one minute. Adjust the timeout configuration in the core-prismeai-api-gateway-config ConfigMap to accommodate longer-running requests, typically setting it to 120 seconds or based on your specific LLM response time requirements.
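
The exact key inside that ConfigMap depends on your chart version; the fragment below is only a sketch, and both the data key and the timeout field shown are hypothetical placeholders indicating where the 120-second value goes.

apiVersion: v1
kind: ConfigMap
metadata:
  name: core-prismeai-api-gateway-config
  namespace: core
data:
  # Hypothetical key and field names: set the gateway request timeout to
  # 120 seconds using whatever timeout setting your ConfigMap actually exposes.
  gateway.yaml: |
    timeout: 120000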

Event Volume Management

Reduce the size of execution events that are primarily used for monitoring rather than functional purposes. The BROKER_EMIT_EXECUTED_AUTOMATION_MAXLEN and BROKER_EMIT_MAXLEN environment variables control maximum event size, with defaults of 10,000 characters for runtime.automations.executed events and 100,000 characters for all other events. These defaults suit most monitoring needs while reducing storage and processing overhead.
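
These limits are set through environment variables on the event-emitting services (typically the runtime); the values below are the documented defaults, which you can lower further if event storage is a concern.

env:
  - name: BROKER_EMIT_EXECUTED_AUTOMATION_MAXLEN
    value: "10000"      # max characters for runtime.automations.executed events
  - name: BROKER_EMIT_MAXLEN
    value: "100000"     # max characters for all other events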

Database Scaling

MongoDB Scaling

Implement MongoDB replica sets for high availability and read scaling, typically deploying with three replicas. For very large deployments, consider implementing MongoDB sharding with config servers, shard servers, and mongos routers, though this adds complexity and should only be used when dataset size exceeds single replica set capabilities.
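
As an illustration, a three-member replica set deployed with the Bitnami MongoDB Helm chart would use values along these lines; key names differ if you deploy MongoDB another way, and the resource figures are placeholders.

architecture: replicaset
replicaCount: 3
arbiter:
  enabled: false
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 4Gi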

Optimize database indexes for common queries, including user email lookups, document text searches, and agent queries by workspace and type. Scale MongoDB resources appropriately based on observed usage patterns and performance metrics.

Elasticsearch/OpenSearch Scaling

Index Lifecycle Management (ILM) Policies

We recommend implementing ILM policies to automate index rollover, deletion, and transition between storage tiers based on criteria such as index age, size, or activity. This helps:

  • Automatically remove outdated indices (e.g., logs older than 30 days),
  • Reduce disk usage and IOPS pressure,
  • Limit memory usage by keeping only active shards in RAM,
  • Maintain fast query response times in production environments.

Example of a basic ILM policy for daily logs:

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "1d",
            "max_size": "20gb"
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Kubernetes Cleaner Service Deployment

In addition to ILM, we provide a lightweight Kubernetes-native cleaner service that automatically:

  • Detects inactive or closed indices,
  • Removes old indices not yet covered by ILM,
  • Unloads stale shards from RAM to reduce heap pressure.

The service can be deployed as a CronJob or long-running sidecar. It connects to your cluster via API and applies retention or cleanup rules defined in a config file or environment variables. This is especially useful in high-ingestion environments where ILM alone is not sufficient to manage short-lived or redundant indices.

For deployment instructions, configuration examples, and security considerations, refer to the elastic-cleaner section of the deployment guide.

Optimize Index Settings

  • Scale your search cluster by adding more nodes and optimizing node roles. Configure dedicated master nodes for cluster management and data nodes for storage and search operations.
  • Optimize index settings including primary shard count, replica count, and refresh intervals based on your data volume and query patterns.
  • Implement Index Lifecycle Management (ILM) policies to automatically manage index aging, including hot, warm, cold, and delete phases.

Here’s a typical configuration to apply to an index (or data stream, as used by AI Knowledge) to improve write performance:

  1. Retrieve your AI Knowledge (or other) index template configuration:
GET _index_template/index-template-events-<workspaceId> 
  2. Keep its contents, adjust the existing configuration as needed, and add the template settings shown below:
PUT _index_template/index-template-events-<workspaceId> 
{
  "index_patterns": [ ... ],
  "composed_of": [ ... ],  
  "priority": 1,
  "data_stream": {
    "hidden": false,
    "allow_custom_routing": false
  },
  "template": {
    "settings": {
      "index.number_of_shards": 3,
      "index.number_of_replicas": 1,
      "index.refresh_interval": "5s",
      "index.translog.durability": "async"
    }
  }  
}

Here, the index template is configured with 3 primary shards and 1 replica per primary, allowing write traffic to be distributed across all 3 of your nodes.
Decrease index.number_of_shards to 2 if you only have 2 nodes.
index.refresh_interval controls how often Elasticsearch makes freshly written data available for search.

  3. Roll over your data stream to create a new index with the updated template:
POST /events-<workspaceId>/_rollover

Elasticsearch Self-Hosted Considerations

When running a self-hosted Elasticsearch or OpenSearch cluster, ensure nodes are distributed across different physical machines for proper redundancy. Use high-performance disks and monitor CPU iowait metrics to identify potential disk bottlenecks that could impact search performance.

Pay attention to cluster health metrics and ensure adequate disk space for index growth and operations like merging and replication.

Redis Scaling

Deploy Redis in cluster mode for horizontal scaling, or implement replication with multiple replica nodes for read scaling. Optimize Redis configuration, including memory policies, connection limits, and persistence settings, based on your caching requirements and data patterns.
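
For example, with the Bitnami Redis chart, read replicas and a cache-friendly memory policy can be configured roughly as follows; adapt the keys to your own Redis deployment, or use a dedicated Redis Cluster chart for full cluster mode.

architecture: replication
replica:
  replicaCount: 3               # read replicas for read scaling
master:
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "1"
      memory: 2Gi
commonConfiguration: |-
  maxmemory-policy allkeys-lru  # example eviction policy for cache-style workloads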

Monitor Redis performance metrics including memory usage, connected clients, and command latency to identify scaling needs and potential bottlenecks.

Storage Scaling

Object Storage

S3 or compatible object storage typically scales automatically, but ensure proper configuration for performance and cost optimization. Enable transfer acceleration for faster uploads, use multipart uploads for large files, and implement appropriate file organization strategies.

Consider regional deployments for global access and implement lifecycle policies for cost optimization, using appropriate storage classes based on access patterns.

Persistent Volumes

Adjust storage for stateful components by expanding persistent volume claims where supported by the storage class. Monitor storage usage patterns and plan for growth, ensuring adequate space for database operations, backups, and temporary files.
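
Where the storage class allows volume expansion, growing a claim is a matter of raising the requested size on the existing PVC, for example as below; the claim name, namespace, and size are placeholders.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: datadir-mongodb-0       # placeholder: the existing claim to expand
  namespace: core
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi            # new, larger size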

Infrastructure Scaling with Terraform

Scale Kubernetes nodes by adjusting node groups in your Terraform configuration, setting appropriate minimum, maximum, and desired node counts based on workload requirements. Configure cluster autoscaling for automatic node provisioning based on pod resource requirements and scheduling constraints.

For global deployments, implement multi-region architecture with appropriate load balancing, database replication, and storage synchronization strategies.

Monitoring for Scaling Decisions

Key Metrics

Monitor core metrics that indicate scaling needs including API response times above 200ms, sustained CPU utilization above 70%, memory usage above 80%, increasing queue depths, and connection timeouts.

Monitoring Tools

Implement comprehensive monitoring using Prometheus and Grafana, Kubernetes metrics server, custom dashboards for Prisme.ai services, and database-specific monitoring tools.

Alert Thresholds

Set up alert thresholds that trigger scaling actions, with warnings at 60% resource utilization, critical alerts at 80% utilization, alerts on response-time degradation of more than 50% from baseline, and alerts on error-rate increases above 10%.
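
As a sketch, with Prometheus the CPU thresholds above map to alerting rules of the following shape; the metric names, namespace label, and durations are illustrative and should be adapted to your exporters and environment.

groups:
  - name: prismeai-scaling
    rules:
      - alert: CpuUtilizationWarning
        # CPU usage as a fraction of configured CPU limits (requires kube-state-metrics)
        expr: sum(rate(container_cpu_usage_seconds_total{namespace="core"}[5m])) / sum(kube_pod_container_resource_limits{namespace="core", resource="cpu"}) > 0.6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "CPU utilization above 60% of limits"
      - alert: CpuUtilizationCritical
        expr: sum(rate(container_cpu_usage_seconds_total{namespace="core"}[5m])) / sum(kube_pod_container_resource_limits{namespace="core", resource="cpu"}) > 0.8
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "CPU utilization above 80% of limits"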

Scaling Dashboards

Create focused scaling dashboards showing resource usage trends, traffic patterns, database performance metrics, and storage growth rates to support scaling decisions.

Scaling Best Practices

Gradual Implementation

Implement scaling changes gradually rather than making large adjustments at once. Increase replicas by 50-100% increments, monitor effects before further scaling, allow systems to stabilize between changes, and document performance impacts for future reference.

Testing and Validation

Test scaling changes in non-production environments using load testing tools like JMeter, k6, or Locust. Simulate real-world usage patterns, test both scaling up and down scenarios, and verify application behavior during scaling events.

Automation

Use automation for routine scaling operations including Horizontal Pod Autoscalers, cluster autoscaling, scheduled scaling for predictable patterns, and anomaly detection for unexpected load increases.

Documentation

Maintain clear documentation for scaling operations including standard operating procedures, emergency scaling runbooks, performance baselines, and historical scaling decisions with their outcomes.

Next Steps

Continue with platform operations by implementing regular updates to keep your platform current, and establish comprehensive backup and restore strategies to protect your data and ensure business continuity.