Kubernetes Resources & Autoscaling

This page tells you how much CPU and memory to give each Prisme.ai service, when to scale, and how to debug resource problems. Database resources are covered in Databases.

Two reference variants

We ship two sets of resource defaults. Treat them as the recommended lower and upper bounds for production — your environment should land somewhere between the two, tuned to your actual load and use case.

Variant	Use when
Balanced (chart default)	Up to a few hundred concurrent users, budget-sensitive. Each pod is sized for typical load, with HPA absorbing spikes.
Performance	Thousands of concurrent users, max throughput per pod. Requests are sized at the maximum each service can actually use.

Start at Balanced. It is the chart default and the right starting point for a fresh install: high enough to run a real workload comfortably, low enough to avoid overpaying before you know your traffic shape. Move toward Performance (or somewhere in between) once you’ve observed your actual usage.

Trade-off in plain English

High requests → fewer pods per node → reserved CPU/RAM sits idle whenever pods aren’t busy.
Low requests → more pods per node → risk that a node cannot supply the CPU/RAM its pods need during load spikes.

Pick Balanced if your traffic is mostly steady and budget matters. Pick Performance if you need each pod to absorb large bursts on its own. Most production deployments end up at an intermediate point — there is no reason to copy either column verbatim once you have real data.

Right-sizing from observed usage

Once the platform is live, monitor per-pod CPU and memory and resize from data, not from these tables:

CPU request → set to the P95 of the pod’s normal usage. CPU is compressible — the occasional P99 spike is absorbed by the node without breaking the workload, so paying for it as a permanent request wastes capacity.
Memory request → set to the P99 of the pod’s normal usage. Memory is not compressible: a pod running above its request is a candidate for eviction when the node runs short, and a pod that reaches its limit is OOMKilled, so the request must comfortably cover the worst legitimate spike.
Memory limit → keep close to the request (the chart defaults already do this) so a runaway pod is capped before it OOMs the node.
CPU limit → don’t set one (see Why no CPU limit? below).

Iterate: re-measure after a few weeks of steady traffic and after any major usage shift (new product enabled, big workspace migration, traffic doubled).
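
The sizing rules above can be sketched as a small calculation. This is a minimal illustration, assuming you have exported per-pod usage samples from your monitoring stack as plain lists. The function names and sample values are hypothetical, and the nearest-rank percentile used here is only one of several reasonable definitions.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # ceil(pct / 100 * n) computed with integers to avoid float rounding
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def recommend_requests(cpu_millicores, memory_mib):
    """Apply the rules above: CPU request = P95, memory request = P99."""
    return {
        "cpu_request_m": percentile(cpu_millicores, 95),
        "memory_request_mi": percentile(memory_mib, 99),
    }


# Hypothetical samples: 100 readings for one pod.
cpu = list(range(10, 1010, 10))   # 10m .. 1000m
mem = list(range(201, 301))       # 201Mi .. 300Mi
print(recommend_requests(cpu, mem))
# → {'cpu_request_m': 950, 'memory_request_mi': 299}
```

In practice you would feed it samples covering at least a few weeks of steady traffic, then round the results up to a convenient value.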

Cluster size baseline

The requests / limits below add up to the following node count when scheduled on 16 GB RAM, 4 vCPU worker nodes (a common cloud baseline like AWS m6i.xlarge, GCP n2-standard-4, Azure D4s_v5):

Variant	Starting nodes	Notes
Balanced	5 nodes (16 GB / 4 vCPU each)	Covers ~8 vCPU / ~14 GB of steady-state requests for `core` + `apps`, with HPA headroom up to ~20 vCPU / ~40 GB at max replicas.
Performance	8 nodes (16 GB / 4 vCPU each)	Larger per-pod requests + higher HPA ceilings. Enable the cluster autoscaler so the pool grows past 8 under sustained spikes.

These are starting points — the actual node count depends on your HPA targets, the database pods you co-locate, and any extra workloads. Rely on the cluster autoscaler in both cases.
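
To see how totals like these come together, you can multiply each service's values by its replica count and sum them. The sketch below hard-codes the Balanced rows from the Core and Apps tables on this page. The helper name is hypothetical, and the calculation ignores database pods, system reservations, and any extra workloads. It reports memory as both requests and limits.

```python
# Balanced rows: (service, cpu_request_m, mem_request_mi, mem_limit_mi, hpa_min, hpa_max)
BALANCED = [
    ("api-gateway", 300, 200, 300, 2, 4),
    ("workspaces", 300, 200, 400, 2, 4),
    ("runtime", 2000, 2500, 2500, 2, 4),
    ("events", 500, 300, 500, 2, 4),
    ("console", 100, 300, 300, 2, 4),
    ("functions", 400, 1536, 2048, 1, 6),    # 1.5Gi request, 2Gi limit
    ("crawler", 1500, 3072, 3072, 1, 4),     # 3Gi request and limit
    ("searchengine", 100, 200, 400, 1, 4),
]


def totals(rows, use_max):
    """Sum CPU requests and memory across all services at HPA min or max replicas."""
    out = {"cpu_m": 0, "mem_request_mi": 0, "mem_limit_mi": 0}
    for _name, cpu_m, req_mi, lim_mi, hpa_min, hpa_max in rows:
        replicas = hpa_max if use_max else hpa_min
        out["cpu_m"] += cpu_m * replicas
        out["mem_request_mi"] += req_mi * replicas
        out["mem_limit_mi"] += lim_mi * replicas
    return out


print(totals(BALANCED, use_max=False))
# → {'cpu_m': 8400, 'mem_request_mi': 11808, 'mem_limit_mi': 13520}
print(totals(BALANCED, use_max=True))
# → {'cpu_m': 21600, 'mem_request_mi': 36304, 'mem_limit_mi': 42176}
```

The CPU totals (8.4 and 21.6 vCPU) line up with the ~8 and ~20 vCPU figures in the table. Swap in your own tuned values to re-derive the baseline for your environment.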

Core namespace

Variant A — Balanced

Service	CPU request	Memory request	Memory limit	HPA min / max
api-gateway	300m	200Mi	300Mi	2 / 4
workspaces	300m	200Mi	400Mi	2 / 4
runtime (3 workers)	2000m	2500Mi	2500Mi	2 / 4
events	500m	300Mi	500Mi	2 / 4
console	100m	300Mi	300Mi	2 / 4

Variant B — Performance

Service	CPU request	Memory request	Memory limit	HPA min / max
api-gateway	2000m	600Mi	600Mi	2 / 5
workspaces	1250m	500Mi	500Mi	2 / 5
runtime (4 workers)	3800m	4000Mi	4Gi	2 / 5
events	1500m	600Mi	600Mi	2 / 5
console	1000m	300Mi	300Mi	2 / 5

Apps namespace

Variant A — Balanced

Service	CPU request	Memory request	Memory limit	HPA min / max
functions	400m	1.5Gi	2Gi	1 / 6
crawler	1500m	3Gi	3Gi	1 / 4
searchengine	100m	200Mi	400Mi	1 / 4

Variant B — Performance

Service	CPU request	Memory request	Memory limit	HPA min / max
functions	1500m	1500Mi	3Gi	2 / 5
crawler	3000m	5Gi	5Gi	2 / 5
searchengine	1500m	500Mi	500Mi	2 / 5

Why no CPU limit?

You’ll notice the tables above have no CPU limit. Don’t set one on Prisme.ai services.

Under contention, the Linux CFS scheduler shares CPU in proportion to each pod’s request, so a pod running above its request is already squeezed back when another pod on the same node needs that CPU.
A CPU limit on top of that wastes free CPU: when the node has spare cycles and a pod could use them, the limit blocks it for no reason.
In practice, CPU limits on Node.js workloads cause unnecessary tail-latency spikes and add zero protection.

Memory limits stay — memory is not compressible, so a runaway pod must be capped before it triggers OOM on the whole node.
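
The effect of requests without limits can be pictured with a deliberately simplified model. It assumes every pod on the node is CPU-bound at the same moment and that CPU is divided strictly in proportion to requests. Real scheduling also involves system daemons and pods that are idle, so treat the numbers as an illustration only. The pod mix and function name are hypothetical.

```python
def cpu_under_contention(node_millicores, requests_m):
    """Split a saturated node's CPU in proportion to each pod's CPU request."""
    total = sum(requests_m.values())
    return {pod: node_millicores * req // total for pod, req in requests_m.items()}


# Hypothetical pods on one 4 vCPU node, all busy at once.
requests = {"runtime": 2000, "events": 500, "api-gateway": 300}
shares = cpu_under_contention(4000, requests)
print(shares)
# → {'runtime': 2857, 'events': 714, 'api-gateway': 428}
```

Each pod still receives at least its request, and the 1200m that nobody requested is shared out instead of sitting idle, which is the capacity a CPU limit would throw away.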

Horizontal Pod Autoscaling (HPA)

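Golden rule for production: at least 2 replicas of every core service, and of functions and crawler in apps. With a single replica, one bad pod, restart, or node drain takes the service down. Note that the Balanced table lists an HPA minimum of 1 for the apps services; raise it to 2 to follow this rule.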

The HPA max values in the variant tables above are starting points, not ceilings. They were chosen to cover the typical load profile of each variant. Review them against your own observed traffic and raise them when you see the HPA sitting at max replicas during peaks, or latency degrading. The right maxReplicas is the one that absorbs your highest expected spike with headroom to spare.

Example HPA config

prismeai-runtime:
  hpa:
    enabled: true
    minReplicas: 2
    maxReplicas: 5
    targetCPUUtilizationPercentage: 70
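
To reason about what this config does under load, it helps to replay the scaling rule by hand. The sketch below follows the formula documented for the Kubernetes HPA, desired = ceil(current replicas × current utilization / target utilization), with the default 10% tolerance and clamping to minReplicas / maxReplicas. It leaves out stabilization windows and scaling policies, and the function name is hypothetical. Utilization is measured as a percentage of the pod's CPU request, which is why the request values matter for autoscaling.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                     min_replicas, max_replicas, tolerance=0.1):
    """Replica count the HPA would aim for, given average CPU utilization."""
    ratio = current_cpu_pct / target_cpu_pct
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas          # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Values from the example config: min 2, max 5, target 70%.
print(desired_replicas(2, 95, 70, 2, 5))    # → 3 (scale up)
print(desired_replicas(5, 100, 70, 2, 5))   # → 5 (wants 8, pinned at max)
print(desired_replicas(4, 20, 70, 2, 5))    # → 2 (scale down to min)
```

The second case is the "HPA pinned at maxReplicas" situation covered in Troubleshooting: the formula asks for more replicas than the ceiling allows.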

Recommended targets

Service group	Target	Notes
api-gateway, workspaces, runtime, console	70% CPU	Standard CPU-bound services
events	70% CPU	Pin clients via a consistent-hash header (or Istio `DestinationRule` on `user-agent`); events benefit from sticky sessions
functions	70% CPU	Memory is the real constraint; over-provision memory before scaling
crawler	80% CPU + 80% memory	Heavy memory pressure during indexing
searchengine	70% CPU

Scaling `prismeai-runtime` workers

prismeai-runtime is the only platform service that runs multiple workers inside the same pod. The pod runs a main thread that receives HTTP calls from api-gateway and events from the broker, then dispatches them to N worker threads, which execute the automations. Adding workers raises per-pod parallelism without spinning up new pods. The worker count is set with the Helm value prismeai-runtime.maxWorkers (which populates the RUNNER_MAX_THREADS env var inside the container). Default: 1.

The runtime resource numbers in the variant tables above are sized for a specific worker count: 3 workers for Balanced, 4 workers for Performance. If you change maxWorkers, re-derive the pod resources with the math below — keeping the defaults while bumping workers will starve each one.

Golden rule: scale workers and resources together

A worker is a Node.js thread — it competes for the pod’s CPU and memory budget alongside the main thread. Doubling maxWorkers without raising resources.requests / resources.limits halves what each worker gets and you’ll see latency degrade or workers OOMKilled. The chart slices 75% of resources.limits.memory across the workers (the other 25% goes to V8 metadata, native modules and the main thread). So sizing the pod is a two-step decision:

Pick how much heap each worker needs based on the workload it will execute. For production workloads, plan 500 MiB to 1 GiB per worker depending on the automations they run (heavier templating, large payloads or heavy in-flight contexts push it toward 1 GiB).
Set limits.memory and maxWorkers together so the chart’s auto-slice gives each worker that heap.

In practice you have two equivalent ways to think about it:

Top-down — keep limits.memory fixed and lower maxWorkers: each worker’s heap rises proportionally.
Bottom-up — target a per-worker heap and raise limits.memory to fit maxWorkers of them, plus the 25% main-thread overhead.

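Example: aiming for 4 workers at ~750 MiB heap each → limits.memory ≈ 4 GiB, maxWorkers: 4 (this is the Performance default). For CPU, budget at least one core per worker plus a small share for the main thread (e.g. requests.cpu: "4.5" for 4 workers). A request above 4 CPU cannot be scheduled on the 4 vCPU baseline nodes, so that figure needs larger nodes; the Performance table requests 3800m for 4 workers.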
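
The bottom-up approach can be written as a short calculation. This is a sketch under stated assumptions: the 75% worker share comes from the text above, the 500m main-thread CPU share is inferred from the "4.5 for 4 workers" example, and the function name is hypothetical.

```python
def runtime_pod_sizing(max_workers, heap_per_worker_mi, main_thread_cpu_m=500):
    """Derive limits.memory and requests.cpu from a per-worker heap target."""
    # Workers share 75% of the limit, so limit = ceil(heap * workers / 0.75).
    # Integer form of that ceiling: -(-a // b) with a = heap*workers*4, b = 3.
    memory_limit_mi = -(-heap_per_worker_mi * max_workers * 4 // 3)
    # One core per worker plus the assumed main-thread share.
    cpu_request_m = max_workers * 1000 + main_thread_cpu_m
    return {"limits_memory_mi": memory_limit_mi, "requests_cpu_m": cpu_request_m}


print(runtime_pod_sizing(4, 750))
# → {'limits_memory_mi': 4000, 'requests_cpu_m': 4500}
print(runtime_pod_sizing(2, 1024))
# → {'limits_memory_mi': 2731, 'requests_cpu_m': 2500}
```

The first call reproduces the example above (about 4 GiB and 4.5 CPU for 4 workers); the second shows a heavier 1 GiB-per-worker profile with fewer workers.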

Node.js workers heap is auto-sized

Each worker thread has its own heap memory hard limit, configured via Node.js’s --max-old-space-size option. You don’t need to set NODE_OPTIONS manually — the chart auto-derives each worker’s heap from resources.limits.memory and maxWorkers:

maxOldSpaceSize = floor( memory_limit_in_Mi / maxWorkers × 0.75 )
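
The formula is easy to check by hand before changing maxWorkers. The function below mirrors it for illustration; it is not the chart's template code, and the function name is hypothetical. It uses integer arithmetic (× 3 / 4 instead of × 0.75) so the floor is exact.

```python
def max_old_space_size_mi(memory_limit_mi, max_workers):
    """floor(memory_limit_in_Mi / maxWorkers * 0.75), computed with integers."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    return memory_limit_mi * 3 // (max_workers * 4)


print(max_old_space_size_mi(4096, 4))   # Performance: 4Gi, 4 workers → 768
print(max_old_space_size_mi(2500, 3))   # Balanced: 2500Mi, 3 workers → 625
print(max_old_space_size_mi(4096, 8))   # doubling workers, same limit → 384
```

The last line shows the failure mode described above: doubling the workers without raising the memory limit halves each worker's heap.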

Resource Quotas and Limit Ranges

Apply quotas at the namespace level to prevent runaway consumption.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: prismeai-core-quota
  namespace: prismeai-core
spec:
  hard:
    requests.cpu: "16"
    requests.memory: "32Gi"
    limits.memory: "64Gi"
    pods: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: prismeai-core-limits
  namespace: prismeai-core
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: "250m"
        memory: "512Mi"
      default:
        memory: "1Gi"
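
Before applying a quota, it is worth checking that it still leaves room for the HPA to reach maxReplicas, since pods that would exceed the quota are rejected. The sketch below compares the example ResourceQuota above with the core-namespace tables on this page. The helper name is hypothetical, and the check ignores surge pods during rolling updates and any other workloads in the namespace.

```python
# Core services: (cpu_request_m, mem_request_mi, mem_limit_mi, hpa_max)
CORE_BALANCED = {
    "api-gateway": (300, 200, 300, 4),
    "workspaces": (300, 200, 400, 4),
    "runtime": (2000, 2500, 2500, 4),
    "events": (500, 300, 500, 4),
    "console": (100, 300, 300, 4),
}
CORE_PERFORMANCE = {
    "api-gateway": (2000, 600, 600, 5),
    "workspaces": (1250, 500, 500, 5),
    "runtime": (3800, 4000, 4096, 5),
    "events": (1500, 600, 600, 5),
    "console": (1000, 300, 300, 5),
}
# The example ResourceQuota, expressed in millicores / MiB / pod count.
QUOTA = {"requests.cpu": 16_000, "requests.memory": 32 * 1024,
         "limits.memory": 64 * 1024, "pods": 50}


def quota_violations(services, quota):
    """Return {resource: (needed, allowed)} for each quota exceeded at HPA max."""
    needed = {"requests.cpu": 0, "requests.memory": 0, "limits.memory": 0, "pods": 0}
    for cpu_m, req_mi, lim_mi, replicas in services.values():
        needed["requests.cpu"] += cpu_m * replicas
        needed["requests.memory"] += req_mi * replicas
        needed["limits.memory"] += lim_mi * replicas
        needed["pods"] += replicas
    return {k: (needed[k], quota[k]) for k in quota if needed[k] > quota[k]}


print(quota_violations(CORE_BALANCED, QUOTA))      # → {}
print(quota_violations(CORE_PERFORMANCE, QUOTA))   # → {'requests.cpu': (47750, 16000)}
```

With these example values the quota fits the Balanced variant at max replicas, while the Performance variant would need a higher requests.cpu value.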

Troubleshooting

Pods stuck Pending

kubectl describe pod <pod> -n <namespace> | tail -30
kubectl describe nodes | grep -A 5 "Allocated resources"

“0/N nodes available: Insufficient cpu/memory” → reduce requests, increase node size, or scale the node pool.
“node(s) didn’t match Pod’s node affinity/selector” → check the pod’s nodeSelector and affinity rules against your node pool labels, and its tolerations against any node taints.

HPA pinned at maxReplicas

kubectl get hpa -n <namespace>
kubectl describe hpa/<hpa> -n <namespace>

Sustained 100% CPU on max replicas → raise maxReplicas, or move from Balanced to Performance variant for the affected service.
Memory-bound service (e.g. functions) → raise memory request rather than replica count.

Pod OOMKilled

kubectl logs <pod> -n <namespace> --previous
kubectl describe pod <pod> -n <namespace> | grep -i oom

For runtime, check BROKER_EMIT_MAXLEN — very large events can spike memory.
For functions, check that NODE_OPTIONS=--max-old-space-size=... is set below the container memory limit, leaving headroom for non-heap memory.
Raise the memory request and limit together.

CPU throttling under load

kubectl top pods -n <namespace>

High CPU usage approaching the request, with growing latency → raise the CPU request (not a limit).
Verify no CPU limit is set on Prisme.ai services.

Node saturation

kubectl top nodes
kubectl describe nodes/<node>

One node carrying disproportionate load → check pod anti-affinity on critical services, or rebalance with a rolling restart.
Persistent saturation → scale the node pool or move to larger instances.

Frequent restarts after deployments

kubectl get events -n <namespace> --sort-by=.lastTimestamp | tail -30

Failing readiness probe right after rollout → check service dependencies (DB reachable, secrets mounted).
Crash loop → kubectl logs <pod> -n <namespace> --previous.
If the regression is in a new chart version, roll back: helm rollback <release> <revision> -n <namespace> — see Helm install — Upgrade, rollback, uninstall.

Volume node affinity conflict

Symptom in kubectl get events:

FailedScheduling   0/3 nodes are available: 1 Insufficient memory, 2 node(s) had volume node affinity conflict.

A pod with a ReadWriteOnce PVC can only be scheduled on the node where the volume lives. If that node is full and the others can’t mount the volume, scheduling fails.

Fixes:

Free CPU/memory on the node that hosts the volume (delete idle pods, scale a noisy neighbor down).
Move to a ReadWriteMany storage class (EFS / Azure Files / Filestore / CephFS) for workloads that can be re-scheduled freely — see Requirements.
Add a node to the pool.

PersistentVolume nearly full

Alerts: KubePersistentVolumeFillingUp, KubePersistentVolumeFull.

kubectl get pvc -A
kubectl describe pvc <pvc> -n <namespace>

For Elasticsearch or OpenSearch, the most common cause is unbounded event growth — enable the prismeai-events cleanup job and tune retention. See Elasticsearch — Events automated cleanup.
Expand the PVC if the storage class supports it (allowVolumeExpansion: true): kubectl edit pvc <pvc> -n <namespace>, then raise spec.resources.requests.storage.
Otherwise, snapshot + restore into a larger volume.

Next Steps

Helm install

Configure values and deploy core + apps namespaces.

Databases

PostgreSQL or MongoDB, Redis, Elasticsearch or OpenSearch.

Install products

Configure your Prisme.ai AI products.

Operations

Scaling, updates and backups.
