Sizing depends on your use case, whether it is full automation or chat mode. We strongly recommend conducting your own load testing tailored to your infrastructure and actual usage patterns.

Target Performance Metrics

  • User Interactions: 4-10 per user (average number of interactions each user makes with the platform)
  • First Token Response: 478 ms at P95 (time to first token from the LLM API, using OpenAI as the reference)
  • Concurrent Users: 100 new users per second (the platform should sustain this arrival rate under peak load)
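The P95 first-token target can be checked against your own load-test measurements. A minimal sketch using the nearest-rank percentile method (the sample latencies and helper names are illustrative, not part of this guide):

```python
import math

TARGET_P95_MS = 478  # first-token latency target from the metrics above

def p95(samples_ms):
    """95th percentile of latency samples via the nearest-rank method."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

def meets_target(samples_ms, target_ms=TARGET_P95_MS):
    """True if the measured P95 is within the target."""
    return p95(samples_ms) <= target_ms

# Example: 20 recorded first-token latencies in milliseconds
samples = [310, 290, 455, 402, 388, 365, 470, 299, 350, 412,
           378, 333, 460, 391, 420, 305, 340, 398, 445, 372]
print(p95(samples), meets_target(samples))  # → 460 True
```

Run this over latencies captured during a load test that mirrors your expected interaction rate, not synthetic single-user traffic.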

Infrastructure Components

Kubernetes Cluster

Node Configuration: 5 nodes with 8GB RAM and 4 vCPU each
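As a quick sanity check, the node configuration above implies the following aggregate cluster capacity (simple arithmetic, before subtracting the resources Kubernetes reserves on each node for system daemons):

```python
NODES = 5
RAM_GB_PER_NODE = 8
VCPU_PER_NODE = 4

total_ram_gb = NODES * RAM_GB_PER_NODE  # aggregate memory across the cluster
total_vcpu = NODES * VCPU_PER_NODE      # aggregate vCPU across the cluster
print(total_ram_gb, total_vcpu)  # → 40 20
```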

Storage Systems

Capacity: 50GB Elastic File System for shared storage

Configuration: Can be shared across environments or isolated per environment

Databases

Data Types: RBAC permissions, users, application data

Configuration:

  • 3 nodes in replica set
  • 2GB RAM and 2 vCPU per node
  • 1,000 IOPS

Disk Space: 10GB total storage requirement

Environment Separation:

  • 1 “permissions” database per environment
  • 1 “users” database per environment
  • 1 “collections” database per environment

Version: MongoDB 6, with an upgrade path to version 7

The cluster can be shared across environments with proper database separation.
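The per-environment database separation described above can be expressed as small connection helpers. A sketch assuming a hypothetical three-host replica set named `rs0` and an `environment_database` naming scheme (the hostnames, set name, and naming convention are illustrative assumptions, not prescribed by this guide):

```python
# Illustrative values: adjust to your replica set topology
REPLICA_SET_HOSTS = "mongo-0:27017,mongo-1:27017,mongo-2:27017"
REPLICA_SET_NAME = "rs0"

# One "permissions", "users", and "collections" database per environment
DATABASES = ("permissions", "users", "collections")

def database_names(environment):
    """Names of the three per-environment databases on the shared cluster."""
    return [f"{environment}_{db}" for db in DATABASES]

def connection_uri():
    """Standard MongoDB connection string targeting the replica set."""
    return f"mongodb://{REPLICA_SET_HOSTS}/?replicaSet={REPLICA_SET_NAME}"

print(database_names("staging"))
# → ['staging_permissions', 'staging_users', 'staging_collections']
```

Keeping environments on one cluster but in separate databases, as sketched here, matches the shared-cluster configuration the guide describes; fully isolated environments would instead use distinct clusters.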

Scaling Considerations

Monitoring Recommendations

We recommend monitoring the following metrics to ensure optimal performance:

System Metrics

  • CPU utilization
  • Memory usage
  • Disk I/O and latency
  • Network throughput

Application Metrics

  • Request latency
  • Error rates
  • Concurrent users
  • Queue lengths

Database Metrics

  • Query performance
  • Connection pool usage
  • Index efficiency
  • Replication lag
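One lightweight way to act on these metrics is a threshold check over collected samples. A minimal sketch (the metric names and threshold values are illustrative assumptions; tune them to your own load-test results):

```python
# Illustrative alert thresholds, not prescribed values
THRESHOLDS = {
    "cpu_utilization_pct": 80,
    "memory_usage_pct": 85,
    "request_latency_p95_ms": 478,
    "replication_lag_s": 10,
}

def breaches(samples):
    """Return the metrics whose current value exceeds its threshold."""
    return {name: value for name, value in samples.items()
            if name in THRESHOLDS and value > THRESHOLDS[name]}

current = {
    "cpu_utilization_pct": 72,
    "memory_usage_pct": 91,
    "request_latency_p95_ms": 430,
    "replication_lag_s": 2,
}
print(breaches(current))  # → {'memory_usage_pct': 91}
```

In practice these checks would run inside your monitoring stack rather than as a standalone script; the point is that each metric above should have an explicit, tested threshold.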

These sizing recommendations provide a starting point, but real-world performance may vary. Always conduct load testing with scenarios that reflect your actual usage patterns.