Agent Testing
Validate and improve your knowledge-based agents through comprehensive testing approaches
Creating effective AI agents requires thorough testing to ensure they provide accurate, helpful, and appropriate responses. Prisme.ai provides comprehensive testing capabilities to validate your agents before deployment and continuously improve them over time.
Testing Approaches
Prisme.ai supports multiple testing methodologies to ensure your agents meet your organization’s standards:
- Manual testing: direct interaction with the agent to assess its responses.
- Automated evaluation: AI-powered evaluations that assess agent responses against predefined criteria.
- Human-in-the-loop evaluation: a combination of automated testing and human review for comprehensive quality control.
- Custom evaluation: specialized evaluation processes implemented via Webhooks and AI Builder.
Evaluation Framework
Prisme.ai uses a straightforward evaluation system that makes it easy to assess agent performance:
Response Quality
Assesses how well the agent answers the question
Score: 0 (Poor), 1 (Adequate), 2 (Excellent)
Context Quality
Evaluates how well the agent retrieved relevant information
Score: 0 (Poor), 1 (Adequate), 2 (Excellent)
Hallucination Check
Identifies if the agent made up information
Score: 0 (Significant), 1 (Minor), 2 (None)
This simple three-point scale makes evaluation straightforward while providing meaningful insights into agent performance.
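As an illustration, the three criteria and the 0 to 2 scale can be captured in a small data structure. This is a minimal sketch with hypothetical field names, not the Prisme.ai data model:

```python
from dataclasses import dataclass


@dataclass
class EvaluationResult:
    """One evaluated question, scored on the three criteria above (each 0-2)."""
    question: str
    response_quality: int   # 0 = Poor, 1 = Adequate, 2 = Excellent
    context_quality: int    # 0 = Poor, 1 = Adequate, 2 = Excellent
    hallucination: int      # 0 = Significant, 1 = Minor, 2 = None

    def overall(self) -> float:
        """Unweighted average across the three criteria."""
        return (self.response_quality + self.context_quality + self.hallucination) / 3
```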
Automated Evaluation Process
The automated evaluation process uses LLMs as judges to assess agent performance:
Create Test Questions
Develop a set of representative questions that users might ask your agent.
Configure Evaluation Parameters
Set up the evaluation process by selecting:
- Which LLM will serve as the evaluator
- Evaluation frequency (daily, weekly, on-demand)
- Evaluation criteria weighting
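For example, these choices could be collected in a simple configuration object. The keys and values below are illustrative assumptions, not the exact Prisme.ai settings:

```python
# Illustrative evaluation configuration; the model name, schedule values,
# and weight keys are assumptions rather than Prisme.ai's actual options.
evaluation_config = {
    "judge_model": "gpt-4o",      # the LLM that serves as the evaluator
    "schedule": "weekly",         # "daily", "weekly", or "on-demand"
    "criteria_weights": {
        "response_quality": 0.4,
        "context_quality": 0.3,
        "hallucination": 0.3,
    },
}
```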
Run Evaluations
Execute the evaluation process, either automatically on schedule or manually.
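The core of an LLM-as-judge run looks roughly like the sketch below. It assumes an OpenAI-compatible client and a judge prompt that requests the three scores as JSON; adapt the model name and client to your own setup:

```python
import json
from openai import OpenAI  # any chat-completion client works; OpenAI is only an example

client = OpenAI()

JUDGE_PROMPT = """You are evaluating an AI agent's answer.
Question: {question}
Retrieved context: {context}
Agent answer: {answer}

Score each criterion from 0 to 2:
- response_quality: 0 = Poor, 1 = Adequate, 2 = Excellent
- context_quality: 0 = Poor, 1 = Adequate, 2 = Excellent
- hallucination: 0 = Significant, 1 = Minor, 2 = None

Reply with a JSON object containing exactly those three keys."""


def judge(question: str, context: str, answer: str, model: str = "gpt-4o") -> dict:
    """Ask the judge model to score one agent response on the 0-2 scale."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, context=context, answer=answer),
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)
```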
Review Results
Analyze the evaluation scores and trends over time.
The evaluation dashboard shows:
- Overall performance scores
- Performance trends over time
- Breakdowns by question type
- Detailed analysis of retrieved contexts
Export and Share
Export test sets and results for documentation, sharing, or further analysis.
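A minimal export might write the per-question scores to CSV for sharing. The column names mirror the scoring schema above and are not a Prisme.ai export format:

```python
import csv


def export_results(results: list[dict], path: str = "evaluation_results.csv") -> None:
    """Write per-question scores to CSV; `results` uses the field names shown above."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=["question", "response_quality", "context_quality", "hallucination"],
        )
        writer.writeheader()
        writer.writerows(results)
```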
Human-in-the-Loop Evaluation
Combine automated testing with human expertise for comprehensive quality control:
Human reviewers can:
- Review and override automated evaluation scores
- Provide qualitative feedback on responses
- Identify subtle issues that automated systems miss
- Add new test questions based on emerging needs
- Validate context quality and relevance
Custom Evaluation with Webhooks
For specialized evaluation needs, you can implement custom processes using Webhooks and AI Builder:
Configure Webhook Endpoint
Set up a Webhook URL that will listen for test events.
Implement Custom Evaluation Logic
Create evaluation logic that processes test results according to your specific criteria.
Custom evaluations can include:
- Domain-specific quality metrics
- Compliance and regulatory checks
- Industry terminology validation
- Integration with existing quality systems
Return Standardized Results
Send evaluation results back to Prisme.ai in the standard scoring format.
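Putting these steps together, a webhook receiver might look like the following Flask sketch. The payload fields, endpoint path, and compliance check are assumptions used for illustration; only the 0 to 2 response shape mirrors the standard scoring format described above:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.post("/evaluation-webhook")
def evaluate():
    """Receive a test event and return scores in the standard 0-2 format."""
    event = request.get_json(force=True)
    answer = event.get("answer", "")

    # Example domain-specific rule: require a compliance disclaimer in the
    # answer. Replace this placeholder with your own evaluation logic.
    has_disclaimer = "this is not financial advice" in answer.lower()

    return jsonify({
        "response_quality": 2 if has_disclaimer else 1,
        "context_quality": 2,   # placeholder: plug in your own context check
        "hallucination": 2,     # placeholder: plug in your own hallucination check
    })


if __name__ == "__main__":
    app.run(port=8000)
```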
Strategic Benefits of Testing
Comprehensive testing delivers significant benefits beyond simple quality control:
Monitor Data Source Changes
Detect when changes to underlying data sources affect response quality.
This allows you to:
- Prevent regressions when content is updated
- Identify when knowledge gaps emerge
- Maintain consistency across content updates
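One simple way to catch such regressions is to compare scores between evaluation runs. The sketch below flags questions whose score dropped by more than a threshold; the 0.5-point threshold is an arbitrary example:

```python
def find_regressions(
    previous: dict[str, float],
    latest: dict[str, float],
    threshold: float = 0.5,
) -> list[str]:
    """Return the questions whose score dropped by at least `threshold` between runs."""
    return [
        question
        for question, score in latest.items()
        if question in previous and previous[question] - score >= threshold
    ]
```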
Optimize LLM Selection
Evaluate performance across different LLM providers and models.
This enables you to:
- Select more cost-efficient models
- Reduce energy consumption
- Use specialized or self-hosted models when appropriate
- Make data-driven model migration decisions
Engage Business Stakeholders
Foster ownership of content quality among domain experts.
This helps to:
- Demonstrate the impact of quality source material
- Create accountability for knowledge accuracy
- Build trust in AI system outputs
- Drive continuous content improvement
Establish Tech-Business Alignment
Create a shared understanding of performance metrics and goals.
This leads to:
- Clear performance contracts between teams
- Shared optimization targets
- Better resource allocation
- Transparent communication about capabilities
Testing Methodology: Start Simple
We recommend an iterative testing approach that builds from foundational tests to more complex scenarios:
Initial Test Set (15 Questions)
Start with a manageable set of diverse test cases:
Basic factual queries with straightforward answers.
Examples:
- “What is our company’s return policy?”
- “Who is the contact person for technical support?”
- “What are the operating hours for customer service?”
Purpose: Establish a baseline for core knowledge retrieval.
Queries requiring some synthesis or comparison.
Examples:
- “How do our Standard and Premium plans differ?”
- “What steps should I take if a customer requests a refund after 30 days?”
- “Explain the main benefits of our latest product update.”
Purpose: Test the agent’s ability to connect related information.
Multi-part or nuanced queries requiring deeper understanding.
Examples:
- “What are the tradeoffs between our cloud and on-premises deployment options for enterprise customers with strict data residency requirements?”
- “How have our sustainability initiatives impacted our manufacturing costs and product pricing over the past three years?”
- “What are the recommended approaches for implementing our API in a high-throughput environment with legacy system integration?”
Purpose: Challenge the agent’s advanced capabilities.
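One convenient way to maintain such a set is as plain structured data, tagged by difficulty so results can be broken down per tier. The format below is an assumption for illustration, not a Prisme.ai import format:

```python
# A few entries from the initial test set, tagged by tier; extend each tier
# to roughly five questions to reach the recommended fifteen.
test_set = [
    {"tier": "simple", "question": "What is our company's return policy?"},
    {"tier": "simple", "question": "Who is the contact person for technical support?"},
    {"tier": "intermediate", "question": "How do our Standard and Premium plans differ?"},
    {"tier": "intermediate", "question": "What steps should I take if a customer requests a refund after 30 days?"},
    {"tier": "complex", "question": "What are the tradeoffs between our cloud and on-premises deployment options for enterprise customers with strict data residency requirements?"},
]
```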
Iterative Optimization
After initial testing, systematically adjust and retest to improve performance:
Adjust LLM Parameters
Experiment with:
- Prompt engineering adjustments
- Temperature and creativity settings
- Different models or model versions
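In practice this step often amounts to a small parameter sweep: re-run the same test set for each combination and compare the 0 to 2 scores. The model names, temperature values, and the `run_evaluation` helper below are placeholders:

```python
from itertools import product

models = ["gpt-4o-mini", "gpt-4o"]   # candidate models (examples only)
temperatures = [0.0, 0.3, 0.7]       # creativity settings to compare

for model, temperature in product(models, temperatures):
    # run_evaluation is a hypothetical helper that runs the test set and
    # returns average scores for this configuration.
    # scores = run_evaluation(test_set, model=model, temperature=temperature)
    print(f"Evaluate test set with model={model}, temperature={temperature}")
```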
Refine RAG Configuration
Optimize how information is processed and retrieved:
- Chunking strategies
- Indexing methods
- Retrieval mechanisms
- Context handling
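As one concrete example of a chunking strategy to experiment with, the sketch below splits documents into fixed-size chunks with overlap; the size and overlap values are tuning parameters, not recommended defaults:

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks with overlap so content near a
    boundary appears in two adjacent chunks."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks
```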
Integrate Tools
Add specialized capabilities where needed:
- Calculators for numerical questions
- Structured data tools for comparisons
- Visualization tools for complex data
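For instance, simple numerical questions can be routed to a deterministic calculator instead of relying on the LLM. The detection pattern below is a deliberately minimal illustration:

```python
import re


def try_calculator(question: str) -> str | None:
    """Return an exact result for simple arithmetic in the question,
    or None to fall back to the LLM."""
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*([-+*/])\s*(-?\d+(?:\.\d+)?)", question)
    if match is None:
        return None
    a, op, b = float(match.group(1)), match.group(2), float(match.group(3))
    if op == "+":
        return f"{a + b:g}"
    if op == "-":
        return f"{a - b:g}"
    if op == "*":
        return f"{a * b:g}"
    return f"{a / b:g}" if b != 0 else None
```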
Expand Test Set
Once performance is optimized, increase test coverage:
- Add more edge cases
- Include newly discovered user questions
- Create tests for specific user personas
Best Practices
Next Steps