Two reference variants
We ship two sets of resource defaults. Treat them as the recommended lower and upper bounds for production: your environment should land somewhere between the two, tuned to your actual load and use case.

| Variant | Use when |
|---|---|
| Balanced (chart default) | Up to a few hundred concurrent users, budget-sensitive. Each pod is sized for typical load, with HPA absorbing spikes. |
| High-throughput | Thousands of concurrent users, max throughput per pod. Requests are sized at the maximum each service can actually use. |
Start at Balanced.
Balanced is now the chart default and the right starting point for a fresh install: high enough to run a real workload comfortably, low enough to avoid over-paying before you know your traffic shape. Move toward High-throughput (or somewhere in between) once you’ve observed your actual usage.
Trade-off in plain English
- High requests → fewer pods per node → idle CPU/RAM is wasted when pods aren’t busy.
- Low requests → more pods per node → risk of a node being unable to provide CPU/RAM under load spikes.
Right-sizing from observed usage
Once the platform is live, monitor per-pod CPU and memory and resize from data, not from these tables:
- CPU request → set to the P95 of the pod’s normal usage. CPU is compressible: the occasional P99 spike is absorbed by the node without breaking the workload, so paying for it as a permanent request wastes capacity.
- Memory request → set to the P99 of the pod’s normal usage. Memory is not compressible — exceeding the request risks eviction or OOMKill, so the request must comfortably cover the worst legitimate spike.
- Memory limit → keep close to the request (the chart defaults already do this) so a runaway pod is capped before it OOMs the node.
- CPU limit → don’t set one (see Why no CPU limit? below).
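The percentile rules above can be turned into a small helper. This is a minimal Python sketch, assuming you have already exported per-pod usage samples (CPU in millicores, memory in MiB) from your monitoring stack; the function names and the nearest-rank percentile method are illustrative choices, not something the chart prescribes.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct% of all samples at or below it. pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, to avoid floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def recommend_requests(cpu_millicores, memory_mib):
    """Apply the rules above: CPU request at P95, memory request at P99."""
    return {
        "cpu_request_m": percentile(cpu_millicores, 95),
        "memory_request_mib": percentile(memory_mib, 99),
    }


# Synthetic samples for illustration only: CPU 1..100 m, memory 101..200 MiB.
print(recommend_requests(list(range(1, 101)), list(range(101, 201))))
# {'cpu_request_m': 95, 'memory_request_mib': 199}
```

Feed it samples that cover at least one full traffic cycle (including peak hours), otherwise the percentiles will understate real usage.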
Cluster size baseline
The requests / limits below add up to the following node count when scheduled on 16 GB RAM, 4 vCPU worker nodes (a common cloud baseline such as AWS m6i.xlarge, GCP n2-standard-4, or Azure D4s_v5):
| Variant | Starting nodes | Notes |
|---|---|---|
| Balanced | 5 nodes (16 GB / 4 vCPU each) | Covers ~8 vCPU / ~14 GB of steady-state requests for core + apps, with HPA headroom up to ~20 vCPU / ~40 GB at max replicas. |
| High-throughput | 8 nodes (16 GB / 4 vCPU each) | Larger per-pod requests + higher HPA ceilings. Enable the cluster autoscaler so the pool grows past 8 under sustained spikes. |
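As a rough cross-check of the Balanced row, the following Python sketch sums the per-pod requests from the Balanced Core and Apps tables in this page, once at HPA min replicas and once at HPA max. It only covers the services listed in those tables (1.5Gi and 3Gi are converted to 1536 and 3072 MiB). The CPU totals land close to the ~8 and ~20 vCPU quoted above; the memory totals come out somewhat lower than the quoted figures, so treat the table as rounded planning numbers rather than exact sums.

```python
# (service, CPU request in millicores, memory request in MiB, HPA min, HPA max)
BALANCED = [
    ("api-gateway", 300, 200, 2, 4),
    ("workspaces", 300, 200, 2, 4),
    ("runtime", 2000, 2500, 2, 4),
    ("events", 500, 300, 2, 4),
    ("console", 100, 300, 2, 4),
    ("functions", 400, 1536, 1, 6),
    ("crawler", 1500, 3072, 1, 4),
    ("searchengine", 100, 200, 1, 4),
]


def totals(services, at_max_replicas=False):
    """Sum CPU (millicores) and memory (MiB) requests across all replicas."""
    cpu_total = 0
    mem_total = 0
    for _name, cpu_m, mem_mib, hpa_min, hpa_max in services:
        replicas = hpa_max if at_max_replicas else hpa_min
        cpu_total += cpu_m * replicas
        mem_total += mem_mib * replicas
    return cpu_total, mem_total


print(totals(BALANCED))                        # (8400, 11808)
print(totals(BALANCED, at_max_replicas=True))  # (21600, 36304)
```

Swap in your own requests and replica counts to estimate how many nodes your tuned values need.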
Core namespace
Variant A — Balanced
| Service | CPU request | Memory request | Memory limit | HPA min / max |
|---|---|---|---|---|
| api-gateway | 300m | 200Mi | 300Mi | 2 / 4 |
| workspaces | 300m | 200Mi | 400Mi | 2 / 4 |
| runtime (3 workers) | 2000m | 2500Mi | 2500Mi | 2 / 4 |
| events | 500m | 300Mi | 500Mi | 2 / 4 |
| console | 100m | 300Mi | 300Mi | 2 / 4 |
Variant B — High-throughput
| Service | CPU request | Memory request | Memory limit | HPA min / max |
|---|---|---|---|---|
| api-gateway | 2000m | 600Mi | 600Mi | 2 / 5 |
| workspaces | 1250m | 500Mi | 500Mi | 2 / 5 |
| runtime (4 workers) | 3800m | 4000Mi | 4Gi | 2 / 5 |
| events | 1500m | 600Mi | 600Mi | 2 / 5 |
| console | 1000m | 300Mi | 300Mi | 2 / 5 |
Apps namespace
Variant A — Balanced
| Service | CPU request | Memory request | Memory limit | HPA min / max |
|---|---|---|---|---|
| functions | 400m | 1.5Gi | 2Gi | 1 / 6 |
| crawler | 1500m | 3Gi | 3Gi | 1 / 4 |
| searchengine | 100m | 200Mi | 400Mi | 1 / 4 |
Variant B — High-throughput
| Service | CPU request | Memory request | Memory limit | HPA min / max |
|---|---|---|---|---|
| functions | 1500m | 1500Mi | 3Gi | 2 / 5 |
| crawler | 3000m | 5Gi | 5Gi | 2 / 5 |
| searchengine | 1500m | 500Mi | 500Mi | 2 / 5 |
Why no CPU limit?
You’ll notice the tables above have no CPU limit. Don’t set one on Prisme.ai services.
- The Linux CFS scheduler already throttles pods that exceed their request when another pod on the same node needs that CPU.
- A CPU limit on top of that wastes free CPU: when the node has spare cycles and a pod could use them, the limit blocks it for no reason.
- In practice, CPU limits on Node.js workloads cause unnecessary tail-latency spikes and add zero protection.
Horizontal Pod Autoscaling (HPA)
Golden rule for production: at least 2 replicas of every core service, and of functions and crawler in apps. A single replica means one bad pod, restart, or node drain takes the service down.
The HPA max values in the variant tables above are starting points, not ceilings. They were chosen to cover the typical load profile of each variant; review them against your own observed traffic and raise them when you see HPA sitting at max replicas during peaks or latency degrading. The right max replicas count is the one that comfortably absorbs your highest expected spike with headroom; treat the defaults as a starting baseline to tune.
Example HPA config
Recommended targets
| Service group | Target utilization | Notes |
|---|---|---|
| api-gateway, workspaces, runtime, console | 70% CPU | Standard CPU-bound services |
| events | 70% CPU | Pin clients via a consistent-hash header (or Istio DestinationRule on user-agent); events benefit from sticky sessions |
| functions | 70% CPU | Memory is the real constraint; over-provision memory before scaling |
| crawler | 80% CPU + 80% memory | Heavy memory pressure during indexing |
| searchengine | 70% CPU | |
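To see how a utilization target translates into replica counts, here is a minimal Python sketch of the replica formula documented for the Kubernetes HPA controller: desired replicas = ceil(current replicas × current utilization / target utilization), clamped to the min / max bounds. The function name and sample numbers are illustrative, and the 10% tolerance is the controller's documented default, which your cluster may override. Utilization is measured as a percentage of the pod's CPU request, so raising the request lowers the reported utilization for the same load.

```python
import math


def desired_replicas(current_replicas, current_util_pct, target_util_pct,
                     min_replicas, max_replicas, tolerance=0.1):
    """Approximate the HPA replica calculation for one utilization metric."""
    ratio = current_util_pct / target_util_pct
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        wanted = current_replicas
    else:
        wanted = math.ceil(current_replicas * ratio)
    # Clamp to the HPA min / max bounds.
    return max(min_replicas, min(max_replicas, wanted))


# Balanced api-gateway (HPA 2 / 4) with a 70% CPU target:
print(desired_replicas(2, 90, 70, 2, 4))   # 3: scale out
print(desired_replicas(4, 100, 70, 2, 4))  # 4: wants 6, pinned at max
print(desired_replicas(4, 20, 70, 2, 4))   # 2: scale in, floored at min
```

The second call is the "HPA pinned at maxReplicas" situation: the formula asks for more pods than the ceiling allows, which is the signal to raise the max or the per-pod request.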
Scaling prismeai-runtime workers
prismeai-runtime is the only platform service that runs multiple workers inside the same pod. The pod runs a main thread that receives HTTP calls from api-gateway and events from the broker and dispatches them to N worker threads which actually execute the automations. Adding workers raises the per-pod parallelism without spinning up new pods.
Configured via the Helm value prismeai-runtime.maxWorkers (which populates the RUNNER_MAX_THREADS env var inside the container). Default: 1.
The runtime resource numbers in the variant tables above are sized for a specific worker count: 3 workers for Balanced, 4 workers for High-throughput. If you change maxWorkers, re-derive the pod resources with the math below; keeping the defaults while bumping workers will starve each one.
Golden rule: scale workers and resources together
A worker is a Node.js thread: it competes for the pod’s CPU and memory budget alongside the main thread. Doubling maxWorkers without raising resources.requests / resources.limits halves what each worker gets, and you’ll see latency degrade or workers OOMKilled.
The chart slices 75% of resources.limits.memory across the workers (the other 25% goes to V8 metadata, native modules and the main thread). So sizing the pod is a two-step decision:
- Pick how much heap each worker needs based on the workload it will execute. For production workloads, plan 500 MiB to 1 GiB per worker depending on the automations they run (heavier templating, large payloads or heavy in-flight contexts push it toward 1 GiB).
- Set limits.memory and maxWorkers together so the chart’s auto-slice gives each worker that heap.
  - Top-down: keep limits.memory fixed and lower maxWorkers; each worker’s heap rises proportionally.
  - Bottom-up: target a per-worker heap and raise limits.memory to fit maxWorkers of them, plus the 25% main-thread overhead.
For example: limits.memory ≈ 4 GiB, maxWorkers: 4 (this is the High-throughput default). For CPU, budget at least one core per worker plus a small share for the main thread (e.g. requests.cpu: "4.5" for 4 workers).
Node.js workers heap is auto-sized
Each worker thread has its own hard heap limit, configured via Node.js’s --max-old-space-size option. You don’t need to set NODE_OPTIONS manually: the chart auto-derives each worker’s heap from resources.limits.memory and maxWorkers.
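The slicing rule described in this section (75% of resources.limits.memory shared evenly across maxWorkers) can be sketched in a few lines of Python. This illustrates the arithmetic only: the function names are hypothetical and the chart's own template may round differently.

```python
import math

WORKER_SHARE = 0.75  # fraction of limits.memory split across worker heaps


def worker_heap_mib(limit_memory_mib, max_workers):
    """Top-down: heap each worker gets for a given memory limit."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    return math.floor(limit_memory_mib * WORKER_SHARE / max_workers)


def limit_for_heap_mib(heap_per_worker_mib, max_workers):
    """Bottom-up: memory limit needed so each worker gets the target heap."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    return math.ceil(heap_per_worker_mib * max_workers / WORKER_SHARE)


print(worker_heap_mib(4096, 4))     # High-throughput, 4 GiB / 4 workers: 768
print(worker_heap_mib(2500, 3))     # Balanced, 2500 MiB / 3 workers: 625
print(worker_heap_mib(4096, 8))     # doubling workers on the same limit: 384
print(limit_for_heap_mib(1024, 4))  # 1 GiB heap for each of 4 workers: 5462
```

Both variant defaults fall inside the 500 MiB to 1 GiB per-worker range recommended above, while doubling the workers on an unchanged limit drops below it.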
Resource Quotas and Limit Ranges
Apply quotas at the namespace level to prevent runaway consumption.
Troubleshooting
Pods stuck Pending
- “0/N nodes available: Insufficient cpu/memory” → reduce requests, increase node size, or scale the node pool.
- “node(s) didn’t match Pod’s node affinity/selector” → check tolerations and nodeSelector against your node pool labels.
HPA pinned at maxReplicas
- Sustained 100% CPU on max replicas → raise maxReplicas, or move from the Balanced to the High-throughput variant for the affected service.
- Memory-bound service (e.g. functions) → raise the memory request rather than the replica count.
Pod OOMKilled
- For runtime, check BROKER_EMIT_MAXLEN: very large events can spike memory.
- For functions, check that NODE_OPTIONS=--max-old-space-size=... matches the container memory limit.
- Raise the memory request and limit together.
CPU throttling under load
- High CPU usage approaching the request, with growing latency → raise the CPU request (not a limit).
- Verify no CPU limit is set on Prisme.ai services.
Node saturation
- One node carrying disproportionate load → check pod anti-affinity on critical services, or rebalance with a rolling restart.
- Persistent saturation → scale the node pool or move to larger instances.
Frequent restarts after deployments
- Failing readiness probe right after rollout → check service dependencies (DB reachable, secrets mounted).
- Crash loop → kubectl logs <pod> -n <namespace> --previous.
- If the regression is in a new chart version, roll back: helm rollback <release> <revision> -n <namespace> (see Helm install: Upgrade, rollback, uninstall).
Volume node affinity conflict
Symptom: kubectl get events reports a volume node affinity conflict. A pod with a ReadWriteOnce PVC can only be scheduled on the node where the volume lives. If that node is full and the others can’t mount the volume, scheduling fails.
Fixes:
- Free CPU/memory on the node that hosts the volume (delete idle pods, scale a noisy neighbor down).
- Move to a ReadWriteMany storage class (EFS / Azure Files / Filestore / CephFS) for workloads that can be re-scheduled freely — see Requirements.
- Add a node to the pool.
PersistentVolume nearly full
Alerts: KubePersistentVolumeFillingUp, KubePersistentVolumeFull.
- For Elasticsearch or OpenSearch, the most common cause is unbounded event growth: enable the prismeai-events cleanup job and tune retention. See Elasticsearch: Events automated cleanup.
- Expand the PVC if the storage class supports it (kubectl edit pvc <pvc> → bump spec.resources.requests.storage).
- Otherwise, snapshot + restore into a larger volume.
Next Steps
Helm install
Configure values and deploy core + apps namespaces.
Databases
PostgreSQL or MongoDB, Redis, Elasticsearch or OpenSearch.
Install products
Configure your Prisme.ai AI products.
Operations
Scaling, updates and backups.