As your organization’s usage of Prisme.ai grows, you’ll need to scale your self-hosted platform to maintain performance and reliability. This guide provides strategies and best practices for scaling different components of your Prisme.ai deployment.

Scaling Approaches

Horizontal scaling involves adding more instances (pods, nodes) to distribute load:

Benefits:

  • Better fault tolerance and availability
  • Linear capacity scaling
  • No downtime during scaling operations

Considerations:

  • Requires stateless application design
  • More complex networking
  • Service discovery requirements

When to Scale

Performance Indicators

Monitor these key metrics to identify scaling needs:

  • API response times exceeding thresholds
  • CPU utilization consistently above 70%
  • Memory utilization consistently above 80%
  • Request queue depth increasing
  • Database query times growing

Growth Indicators

Business metrics that suggest scaling requirements:

  • Increasing number of users
  • Growing document count
  • More concurrent sessions
  • Higher query volume
  • Additional knowledge bases

Preventative Scaling

Proactive scaling for anticipated demands:

  • Before major rollouts
  • Ahead of seasonal peaks
  • Prior to marketing campaigns
  • In advance of organizational growth

Recovery Objectives

Scaling to meet resilience targets:

  • Redundancy requirements
  • High availability goals
  • Load distribution needs
  • Geographic distribution objectives

Scaling Core Components

1. Assess Current Usage

Gather metrics on current performance and resource utilization:

# Check CPU and memory usage of pods
kubectl top pods -n prisme-system

# Review HPA metrics if configured
kubectl get hpa -n prisme-system

2. Configure HPA (Horizontal Pod Autoscaler)

Set up automatic scaling based on metrics:

# Sample HPA configuration for API
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prisme-api
  namespace: prisme-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prisme-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

Apply the configuration:

kubectl apply -f api-hpa.yaml
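
To confirm the autoscaler is active and to see why it scales, inspect its status and recent events:

# Current targets and replica counts
kubectl get hpa prisme-api -n prisme-system

# Scaling conditions and events
kubectl describe hpa prisme-api -n prisme-system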

3. Update Helm Values

Alternatively, configure scaling parameters in your Helm values:

# In values.yaml
api:
  replicaCount: 3
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
    
worker:
  replicaCount: 3
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80

Apply the configuration:

helm upgrade prisme-core prisme/prisme-core -f values.yaml -n prisme-system

4. Set Resource Requests and Limits

Define appropriate resource allocations:

# In values.yaml
api:
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 4Gi
      
worker:
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      cpu: 2000m
      memory: 6Gi

Base these allocations on observed usage patterns: start with conservative values and adjust them as monitoring data accumulates.
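
If the Vertical Pod Autoscaler operator is installed in your cluster, running it in recommendation-only mode is one way to derive these values from real usage. The manifest below is a minimal sketch; it assumes the VPA CRDs are present and that the API deployment is named prisme-api.

# Recommendation-only VPA: computes suggested requests, never evicts pods
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: prisme-api-vpa
  namespace: prisme-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prisme-api
  updatePolicy:
    updateMode: "Off"

Read the suggestions with kubectl describe vpa prisme-api-vpa -n prisme-system and fold them back into your Helm values.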

5. Configure Pod Disruption Budgets

Ensure a minimum number of pods stays available when nodes are drained during scale-down or upgrades:

# In values.yaml
api:
  podDisruptionBudget:
    enabled: true
    minAvailable: 1
    # or maxUnavailable: 25%

Scaling Database Components

1. Implement Replica Sets

Deploy MongoDB with replica sets for high availability and read scaling:

# MongoDB Helm values example
architecture: replicaset
replicaCount: 3
arbiter:
  enabled: false

readinessProbe:
  enabled: true
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6
  successThreshold: 1
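
Once the replica set is running, read-heavy traffic can be spread across secondaries through the connection string. The example below is a sketch: the host names follow common chart conventions, and whether Prisme.ai services tolerate slightly stale secondary reads should be validated for your deployment.

# Example URI directing reads to secondaries when available
mongodb://prisme:<password>@mongodb-0.mongodb-headless:27017,mongodb-1.mongodb-headless:27017,mongodb-2.mongodb-headless:27017/prisme?replicaSet=rs0&readPreference=secondaryPreferred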

2. Configure Sharding

For very large deployments, implement MongoDB sharding:

  1. Set up config servers (typically 3 nodes)
  2. Deploy shard servers (multiple replica sets)
  3. Configure mongos routers
  4. Define shard keys based on data access patterns

Sharding adds complexity and should only be implemented when the dataset exceeds what a single replica set can handle efficiently; a minimal shell sketch follows.
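
The sketch below assumes a prisme database and a documents collection keyed by workspace; adapt the names and shard key to your schema and access patterns.

// Run against a mongos router
sh.enableSharding("prisme")

// A hashed shard key spreads writes evenly across shards
sh.shardCollection("prisme.documents", { "workspaceId": "hashed" })

// Verify chunk distribution
sh.status()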

3. Optimize Indexes

Ensure proper indexes exist for common queries:

// Execute via MongoDB shell

// For user queries
db.users.createIndex({ "email": 1 }, { unique: true })

// For document searches
db.documents.createIndex({ "title": "text", "content": "text" })
db.documents.createIndex({ "metadata.tags": 1 })

// For agent queries
db.agents.createIndex({ "workspaceId": 1, "type": 1 })
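
To confirm that a query actually uses the intended index (an IXSCAN stage rather than a full COLLSCAN), inspect its execution plan; the query below is illustrative:

// Check the winning plan and the number of documents examined
db.agents.find({ "workspaceId": "abc123", "type": "builder" }).explain("executionStats")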

4. Scale MongoDB Resources

Increase resources for MongoDB instances:

# MongoDB Helm values
resources:
  requests:
    cpu: 2
    memory: 4Gi
  limits:
    cpu: 4
    memory: 8Gi

Scaling Storage

1. Scale Object Storage

S3 or compatible object storage typically scales automatically, but ensure proper configuration:

Performance Options

  • Enable transfer acceleration
  • Use multipart uploads for large files
  • Implement appropriate file organization
  • Consider regional deployments for global access

Cost Optimization

  • Implement lifecycle policies (see the example after this list)
  • Use appropriate storage classes
  • Enable compression where applicable
  • Monitor usage patterns
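
As an example of a lifecycle policy, the following moves objects to infrequent-access storage after 90 days and expires old noncurrent versions; the bucket name, prefix, and timings are placeholders to adapt:

# lifecycle.json
{
  "Rules": [
    {
      "ID": "archive-uploads",
      "Status": "Enabled",
      "Filter": { "Prefix": "uploads/" },
      "Transitions": [
        { "Days": 90, "StorageClass": "STANDARD_IA" }
      ],
      "NoncurrentVersionExpiration": { "NoncurrentDays": 180 }
    }
  ]
}

# Apply it to the bucket
aws s3api put-bucket-lifecycle-configuration \
  --bucket <your-prisme-bucket> \
  --lifecycle-configuration file://lifecycle.json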

2. Scale Persistent Volumes

Adjust storage for stateful components:

# Example PVC expansion (if storage class supports it)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prisme-mongodb
  namespace: prisme-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi  # Increased from previous size
  storageClassName: standard

Not all storage classes support volume expansion. Check your cloud provider or storage system capabilities.
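
You can check whether a storage class allows expansion before editing the claim, and then watch the resize complete:

# List storage classes and whether they allow volume expansion
kubectl get storageclass -o custom-columns=NAME:.metadata.name,EXPANSION:.allowVolumeExpansion

# After updating the PVC, watch its capacity and events
kubectl get pvc prisme-mongodb -n prisme-system -w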

Scaling Infrastructure with Terraform

1. Scale Kubernetes Nodes

Adjust your node groups in Terraform:

# For AWS EKS
module "eks_node_group" {
  source = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"
  
  name = "prisme-workers"
  
  min_size     = 3
  max_size     = 10
  desired_size = 5
  
  instance_types = ["m5.xlarge", "m5.2xlarge"]
  capacity_type  = "ON_DEMAND"
  
  # ... other settings
}

# For Azure AKS
resource "azurerm_kubernetes_cluster_node_pool" "prisme_workers" {
  name                  = "prismeworkers"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.prisme.id
  vm_size               = "Standard_D4_v3"
  node_count            = 5
  min_count             = 3
  max_count             = 10
  enable_auto_scaling   = true
  
  # ... other settings
}

2. Configure Node Autoscaling

Set up cluster autoscaler for automatic node provisioning:

# For AWS EKS
resource "helm_release" "cluster_autoscaler" {
  name       = "cluster-autoscaler"
  repository = "https://kubernetes.github.io/autoscaler"
  chart      = "cluster-autoscaler"
  namespace  = "kube-system"
  
  set {
    name  = "autoDiscovery.clusterName"
    value = module.eks.cluster_id
  }
  
  set {
    name  = "awsRegion"
    value = var.region
  }
  
  # ... other settings
}

3. Implement Regional Deployments

For global deployments, consider multi-region architecture:

  1. Deploy Prisme.ai in multiple regions
  2. Use global load balancing (e.g., Route 53, Azure Traffic Manager); a Terraform sketch follows this list
  3. Replicate databases across regions
  4. Synchronize object storage
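
As a sketch of step 2, latency-based DNS routing with Route 53 could look like the following; the hosted zone, record name, and per-region load balancer outputs are placeholders for your own infrastructure:

# One record per region; Route 53 answers with the lowest-latency healthy endpoint
resource "aws_route53_record" "prisme_eu" {
  zone_id        = var.hosted_zone_id
  name           = "app.prisme.example.com"
  type           = "CNAME"
  ttl            = 60
  records        = [module.prisme_eu_west.load_balancer_dns_name]
  set_identifier = "eu-west-1"

  latency_routing_policy {
    region = "eu-west-1"
  }
}

resource "aws_route53_record" "prisme_us" {
  zone_id        = var.hosted_zone_id
  name           = "app.prisme.example.com"
  type           = "CNAME"
  ttl            = 60
  records        = [module.prisme_us_east.load_balancer_dns_name]
  set_identifier = "us-east-1"

  latency_routing_policy {
    region = "us-east-1"
  }
}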

Monitoring for Scaling Decisions

Key Metrics to Watch

Core metrics that indicate scaling needs:

  • API response time > 200ms
  • CPU utilization > 70% sustained
  • Memory usage > 80% sustained
  • Queue depth increasing
  • Connection timeouts occurring

Monitoring Tools

Tools to implement for scaling insights:

  • Prometheus + Grafana
  • Kubernetes metrics server
  • Custom dashboards for Prisme.ai services
  • Database-specific monitoring

Alert Thresholds

Set up alerts to trigger scaling actions; an example Prometheus rule follows the list:

  • Warning: 60% resource utilization
  • Critical: 80% resource utilization
  • Performance degradation > 50%
  • Error rate increase > 10%
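
A minimal Prometheus alerting rule for the sustained-CPU case might look like this; the metric names come from cAdvisor and kube-state-metrics, and the label selectors are assumptions to adjust for your deployment:

groups:
  - name: prisme-scaling
    rules:
      - alert: PrismeApiHighCPU
        # CPU used by prisme-api pods as a fraction of their CPU requests
        expr: |
          sum(rate(container_cpu_usage_seconds_total{namespace="prisme-system", pod=~"prisme-api-.*"}[5m]))
            /
          sum(kube_pod_container_resource_requests{namespace="prisme-system", pod=~"prisme-api-.*", resource="cpu"})
            > 0.8
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "prisme-api CPU above 80% of requests for 10 minutes"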

Scaling Dashboards

Create dashboards focused on scaling metrics:

  • Resource usage trends
  • Traffic patterns
  • Database performance
  • Storage growth rates

Scaling Best Practices

1. Implement Gradual Scaling

Scale resources incrementally rather than making large changes at once:

  • Increase replicas by 50-100% at a time
  • Monitor effects before further scaling
  • Allow system to stabilize between changes
  • Document performance impacts

2. Test Before Production

Validate scaling changes in non-production environments:

  • Use load testing tools such as JMeter, k6, or Locust (a k6 sketch follows this list)
  • Simulate real-world usage patterns
  • Test both scaling up and scaling down
  • Verify application behavior during scaling events
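
For example, a minimal k6 scenario that ramps virtual users against an API endpoint; the URL and thresholds are placeholders:

// load-test.js (run with: k6 run load-test.js)
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 50 },   // ramp up to 50 virtual users
    { duration: '5m', target: 50 },   // hold steady
    { duration: '1m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95% of requests under 500ms
  },
};

export default function () {
  http.get('https://prisme.example.com/api/healthcheck');
  sleep(1);
}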

3. Automate Where Possible

Use automation to handle routine scaling:

  • Implement Horizontal Pod Autoscalers (HPA)
  • Configure cluster autoscaling
  • Use scheduled scaling for predictable load patterns (sketched after this list)
  • Set up anomaly detection for unexpected loads
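
If you run KEDA, scheduled scaling can be expressed as a cron trigger that raises the worker floor during business hours; the schedule, timezone, deployment name, and replica counts below are illustrative assumptions. Note that KEDA manages its own HPA, so avoid attaching a separate HPA to the same deployment.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prisme-worker-business-hours
  namespace: prisme-system
spec:
  scaleTargetRef:
    name: prisme-worker
  minReplicaCount: 2
  triggers:
    - type: cron
      metadata:
        timezone: Europe/Paris
        start: 0 8 * * 1-5        # scale up on weekday mornings
        end: 0 19 * * 1-5         # scale back down in the evening
        desiredReplicas: "6"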

4. Document Scaling Procedures

Maintain clear documentation for scaling operations:

  • Standard operating procedures
  • Emergency scaling runbooks
  • Performance baselines
  • Historical scaling decisions and outcomes

Next Steps