This sizing depends on your specific use case, whether it’s full automation or chat mode. We strongly recommend conducting your own load testing tailored to your specific infrastructure and use cases.

Target Performance Metrics

User Interactions

4-10 per user
Average number of interactions each user makes with the platform

First Token Response

478ms (P95)
Time to first token from the LLM API (using OpenAI as reference)

Concurrent Users

100 new users/second
The platform should handle 100 new users each second under peak load
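A rough way to turn these targets into a request-rate budget is to multiply the peak user arrival rate by the interaction range. This is a back-of-envelope sketch that assumes one platform request per interaction, which real workloads may not match:

```python
# Back-of-envelope peak load estimate from the target metrics above.
# Assumption: each user interaction maps to one platform request;
# real interactions may fan out into several requests.

NEW_USERS_PER_SECOND = 100        # peak arrival rate from the targets
INTERACTIONS_PER_USER = (4, 10)   # average interactions per user

def peak_request_rate(new_users_per_sec, interactions):
    """Return (low, high) requests/second if every arriving user
    eventually performs all of their interactions."""
    lo, hi = interactions
    return new_users_per_sec * lo, new_users_per_sec * hi

low, high = peak_request_rate(NEW_USERS_PER_SECOND, INTERACTIONS_PER_USER)
print(low, high)  # 400 1000
```

With these numbers, load tests should exercise roughly 400-1,000 requests per second at peak before you trust a given cluster size.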

Infrastructure Components

Kubernetes Cluster

Node Configuration: 5 nodes with 8GB RAM and 4 vCPU each
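As a sanity check, the node configuration above yields the following total schedulable capacity. The per-pod request figures in the sketch are illustrative assumptions, not values from this guide:

```python
# Total capacity of the recommended cluster (5 nodes, 8GB RAM / 4 vCPU
# each), ignoring system-reserved overhead on each node.
NODES = 5
RAM_GB_PER_NODE = 8
VCPU_PER_NODE = 4

total_ram_gb = NODES * RAM_GB_PER_NODE  # 40 GB
total_vcpu = NODES * VCPU_PER_NODE      # 20 vCPU

# Illustrative only: how many pods fit if each one requests
# 0.5 vCPU and 1GB RAM (assumed example requests)?
pod_cpu_request = 0.5
pod_ram_request_gb = 1
max_pods = int(min(total_vcpu / pod_cpu_request,
                   total_ram_gb / pod_ram_request_gb))
print(total_ram_gb, total_vcpu, max_pods)  # 40 20 40
```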

Storage Systems

  • EFS
  • Object Storage
Capacity: 50GB Elastic File System for shared storage
Configuration: Can be shared between different environments or isolated for each

Databases

  • MongoDB/PostgreSQL
  • Redis Cache & Broker
  • Redis Crawler
  • Elasticsearch/OpenSearch
  • Vector Database
Data Types: RBAC permissions, users, application data
Configuration:
  • 3 nodes in replica set
  • 2GB RAM and 2 vCPU per node
  • 1,000 IOPS
Disk Space: 10GB total storage requirement
Environment Separation:
  • 1 “permissions” database per environment
  • 1 “users” database per environment
  • 1 “collections” database per environment
Version:
  • MongoDB version 6 with path to version 7
  • Or PostgreSQL >=10
The cluster can be shared across environments with proper database separation.
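The database-level separation described above can be sketched as a simple naming scheme. The environment names and the `{env}_{db}` pattern here are assumptions for illustration, not a convention mandated by this guide:

```python
# Sketch of database-level separation on a shared cluster: one
# "permissions", "users", and "collections" database per environment.
# Environment names and the naming pattern are illustrative assumptions.

ENVIRONMENTS = ["development", "staging", "production"]
LOGICAL_DBS = ["permissions", "users", "collections"]

def databases_for(env):
    """Return the logical database names for one environment."""
    return [f"{env}_{db}" for db in LOGICAL_DBS]

all_dbs = {env: databases_for(env) for env in ENVIRONMENTS}
print(all_dbs["production"])
# ['production_permissions', 'production_users', 'production_collections']
```

Keeping the separation at the database-name level is what lets development and staging share a cluster while production gets its own.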

Scaling Considerations

When planning your infrastructure, consider these scaling approaches:
  • Horizontal Scaling: Add more nodes to distribute load. Recommended for Kubernetes nodes and database replicas.
  • Vertical Scaling: Increase resources (RAM, CPU) on existing nodes. Useful for temporary peaks or when hitting connection limits.
For most production deployments, we recommend a combination of both approaches with an emphasis on horizontal scaling for better resilience.
When allocating resources across components:
  1. Start with our baseline recommendations
  2. Monitor resource utilization during testing
  3. Identify bottlenecks (usually memory or disk I/O)
  4. Scale the constrained resources first before adding more nodes
Vector databases and Elasticsearch typically benefit most from additional memory, while Redis is often constrained by CPU.
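The decision steps above can be sketched as a small helper that flags the saturated resources and suggests an action. The 80% threshold is an assumed example value, not a recommendation from this guide:

```python
# Toy scaling-decision helper following the steps above: find the
# constrained resources, scale a single hot resource vertically first,
# and fall back to adding nodes when several resources are saturated.
# The 0.80 threshold is an assumed example value.

THRESHOLD = 0.80

def scaling_advice(utilization):
    """utilization maps resource name -> fraction of capacity in use."""
    hot = [name for name, used in utilization.items() if used >= THRESHOLD]
    if not hot:
        return "no action"
    if len(hot) == 1:
        return f"scale {hot[0]} vertically first"
    return "add nodes (horizontal scaling)"

print(scaling_advice({"cpu": 0.55, "memory": 0.91, "disk_io": 0.40}))
# scale memory vertically first
```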
For multi-environment deployments:
  • Development: Can share infrastructure with minimal isolation
  • Staging: Should mimic production but can use smaller resources
  • Production: Requires dedicated resources and stricter isolation
We recommend full cluster-level isolation for production, while development and staging can share some database clusters with proper database-level separation.

Monitoring Recommendations

We recommend monitoring the following metrics to ensure optimal performance:

System Metrics

  • CPU utilization
  • Memory usage
  • Disk I/O and latency
  • Network throughput

Application Metrics

  • Request latency
  • Error rates
  • Concurrent users
  • Queue lengths
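When tracking request latency against a percentile target such as the 478ms P95 above, a nearest-rank percentile can be computed from raw samples. This is a minimal sketch; production monitoring systems usually use histogram-based estimates instead of storing every sample:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p%
    of samples at or below it. samples need not be pre-sorted."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Example: 100 synthetic latencies from 10ms to 1000ms.
latencies = [10 * i for i in range(1, 101)]
print(percentile(latencies, 95))  # 950
```

Comparing this value against the 478ms target over a sliding window gives a simple alerting signal.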

Database Metrics

  • Query performance
  • Connection pool usage
  • Index efficiency
  • Replication lag
These sizing recommendations provide a starting point, but real-world performance may vary. Always conduct load testing with scenarios that reflect your actual usage patterns.