Creating effective AI agents requires thorough testing to ensure they provide accurate, helpful, and appropriate responses. Prisme.ai provides comprehensive testing capabilities to validate your agents before deployment and continuously improve them over time.

Testing Approaches

Prisme.ai supports multiple testing methodologies to ensure your agents meet your organization’s standards:

  • Manual testing: direct interaction with the agent to assess its responses
  • Automated evaluation: LLM-as-judge scoring of responses against a test set
  • Human-in-the-loop review: expert validation and override of automated scores
  • Custom evaluation: specialized checks implemented via Webhooks and AI Builder

Evaluation Framework

Prisme.ai uses a straightforward evaluation system that makes it easy to assess agent performance:

Response Quality

Assesses how well the agent answers the question

Score: 0 (Poor), 1 (Adequate), 2 (Excellent)

Context Quality

Evaluates how well the agent retrieved relevant information

Score: 0 (Poor), 1 (Adequate), 2 (Excellent)

Hallucination Check

Identifies if the agent made up information

Score: 0 (Significant), 1 (Minor), 2 (None)

This simple three-point scale makes evaluation straightforward while providing meaningful insights into agent performance.
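
For illustration, a single evaluated test case can be recorded with the three scores side by side; the field names below are hypothetical, not a fixed Prisme.ai schema.

# Hypothetical record for one evaluated test case (field names are illustrative).
evaluation_record = {
    "question": "What is our company's return policy?",
    "response_quality": 2,   # 0 = Poor, 1 = Adequate, 2 = Excellent
    "context_quality": 1,    # 0 = Poor, 1 = Adequate, 2 = Excellent
    "hallucination": 2,      # 0 = Significant, 1 = Minor, 2 = None
}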

Automated Evaluation Process

The automated evaluation process uses LLMs as judges to assess agent performance:

1

Create Test Questions

Develop a set of representative questions that users might ask your agent.
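
One simple way to keep such a set is a plain list of questions with the points a good answer should cover, so the same questions can be replayed after every change; the structure below is illustrative rather than a required format.

# Illustrative test set: representative user questions plus reference points
# a good answer should cover (placeholders, not real company data).
test_questions = [
    {
        "id": "q-001",
        "question": "What is our company's return policy?",
        "expected_points": ["return time window", "conditions for a refund"],
    },
    {
        "id": "q-002",
        "question": "Who is the contact person for technical support?",
        "expected_points": ["name or team", "contact channel"],
    },
]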

2

Configure Evaluation Parameters

Set up the evaluation process by selecting the following; a configuration sketch appears after the list:

  • Which LLM will serve as the evaluator
  • Evaluation frequency (daily, weekly, on-demand)
  • Evaluation criteria weighting
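
The configuration might look like the sketch below; the keys and values are hypothetical and simply mirror the three choices above, not an actual Prisme.ai configuration schema.

# Hypothetical evaluation configuration mirroring the choices listed above.
evaluation_config = {
    "evaluator_llm": "gpt-4o",   # which LLM serves as the judge (example value)
    "schedule": "weekly",        # "daily", "weekly", or "on-demand"
    "criteria_weights": {        # relative weighting of the evaluation criteria
        "response_quality": 0.5,
        "context_quality": 0.3,
        "hallucination": 0.2,
    },
}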
3

Run Evaluations

Execute the evaluation process, either automatically on schedule or manually.

4

Review Results

Analyze the evaluation scores and trends over time; the sketch after the list below shows how the same figures can be computed from exported results.

The evaluation dashboard shows:

  • Overall performance scores
  • Performance trends over time
  • Breakdowns by question type
  • Detailed analysis of retrieved contexts
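
The sketch below assumes a list of per-test records in the three-point format and derives an overall score and a breakdown by question type; the record structure is an assumption for illustration.

from collections import defaultdict

# Exported per-test records (hypothetical structure and sample values).
results = [
    {"type": "factual", "response_quality": 2, "context_quality": 2, "hallucination": 2},
    {"type": "factual", "response_quality": 1, "context_quality": 1, "hallucination": 2},
    {"type": "policy",  "response_quality": 2, "context_quality": 0, "hallucination": 1},
]

criteria = ("response_quality", "context_quality", "hallucination")

# Overall performance: mean of all criteria across all tests, normalized to 0-1.
overall = sum(r[c] for r in results for c in criteria) / (len(results) * len(criteria) * 2)

# Breakdown by question type.
by_type = defaultdict(list)
for r in results:
    by_type[r["type"]].append(sum(r[c] for c in criteria) / (len(criteria) * 2))

print(f"Overall: {overall:.2f}")
for question_type, scores in by_type.items():
    print(f"{question_type}: {sum(scores) / len(scores):.2f}")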
5

Export and Share

Export test sets and results for documentation, sharing, or further analysis.
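
A plain CSV is often enough for sharing; the snippet below is a generic sketch, since the export format produced by Prisme.ai may differ.

import csv

# Minimal sketch: write per-test scores to a CSV file for sharing or analysis.
rows = [
    {"test_id": "q-001", "response_quality": 2, "context_quality": 2, "hallucination": 2},
    {"test_id": "q-002", "response_quality": 1, "context_quality": 1, "hallucination": 2},
]

with open("evaluation_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["test_id", "response_quality", "context_quality", "hallucination"])
    writer.writeheader()
    writer.writerows(rows)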

Human-in-the-Loop Evaluation

Combine automated testing with human expertise for comprehensive quality control:

Human reviewers can (see the sketch after this list):

  • Review and override automated evaluation scores
  • Provide qualitative feedback on responses
  • Identify subtle issues that automated systems miss
  • Add new test questions based on emerging needs
  • Validate context quality and relevance
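
One way to capture this is to store the human review next to the automated scores and let the reviewer's values take precedence; the structure below is purely illustrative.

# Illustrative: a human review layered on top of an automated evaluation.
automated = {"test_id": "q-003", "response_quality": 2, "context_quality": 1, "hallucination": 2}

human_review = {
    "test_id": "q-003",
    "override_scores": {"response_quality": 1},   # reviewer disagrees with the judge LLM
    "feedback": "Answer is correct but misses a domain-specific nuance.",
    "reviewer": "domain-expert@example.com",
}

# Final scores: human overrides take precedence over automated scores.
final_scores = {**automated, **human_review["override_scores"]}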

Custom Evaluation with Webhooks

For specialized evaluation needs, you can implement custom processes using Webhooks and AI Builder:

1

Configure Webhook Endpoint

Set up a Webhook URL that will receive test events.

{
  "webhook_url": "https://your-custom-evaluator.com/api/evaluate",
  "event_types": ["test_execution", "test_result"],
  "authentication": {
    "type": "bearer_token",
    "token": "${ENV_SECRET_TOKEN}"
  }
}
2

Implement Custom Evaluation Logic

Create evaluation logic that processes test results according to your specific criteria; a minimal receiving service is sketched after the list below.

Custom evaluations can include:

  • Domain-specific quality metrics
  • Compliance and regulatory checks
  • Industry terminology validation
  • Integration with existing quality systems
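
A minimal receiving service might look like the sketch below, written with Flask purely as an example; the incoming payload fields are assumptions, and the response mirrors the standard scoring format shown in step 3. Align the event schema with what your instance actually sends.

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/api/evaluate", methods=["POST"])
def evaluate():
    # Incoming payload shape is assumed for illustration.
    event = request.get_json()
    answer = event.get("response", "")
    contexts = event.get("contexts", [])

    # Placeholder checks standing in for real domain-specific logic
    # (compliance rules, terminology validation, etc.).
    response_quality = 2 if answer else 0
    context_quality = 2 if contexts else 0
    hallucination = 2

    # Return results in the standard scoring format (see step 3).
    return jsonify({
        "test_id": event.get("test_id"),
        "scores": {
            "response_quality": response_quality,
            "context_quality": context_quality,
            "hallucination": hallucination,
        },
        "feedback": "Automated domain checks passed.",
        "custom_metrics": {"compliance_score": 1.0},
    })

if __name__ == "__main__":
    app.run(port=8080)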
3

Return Standardized Results

Send evaluation results back to Prisme.ai in the standard scoring format.

{
  "test_id": "test-123",
  "scores": {
    "response_quality": 2,
    "context_quality": 1,
    "hallucination": 2
  },
  "feedback": "Response was accurate but missing some context about recent policy changes.",
  "custom_metrics": {
    "compliance_score": 0.95,
    "terminology_accuracy": 0.87
  }
}

Strategic Benefits of Testing

Comprehensive testing delivers significant benefits beyond simple quality control:

Monitor Data Source Changes

Detect when changes to underlying data sources affect response quality.

This allows you to:

  • Prevent regressions when content is updated
  • Identify when knowledge gaps emerge
  • Maintain consistency across content updates

Optimize LLM Selection

Evaluate performance across different LLM providers and models.

This enables you to:

  • Select more cost-efficient models
  • Reduce energy consumption
  • Use specialized or self-hosted models when appropriate
  • Make data-driven model migration decisions

Engage Business Stakeholders

Foster ownership of content quality among domain experts.

This helps to:

  • Demonstrate the impact of quality source material
  • Create accountability for knowledge accuracy
  • Build trust in AI system outputs
  • Drive continuous content improvement

Establish Tech-Business Alignment

Create a shared understanding of performance metrics and goals.

This leads to:

  • Clear performance contracts between teams
  • Shared optimization targets
  • Better resource allocation
  • Transparent communication about capabilities

Testing Methodology: Start Simple

We recommend an iterative testing approach that builds from foundational tests to more complex scenarios:

Initial Test Set (15 Questions)

Start with a manageable set of diverse test cases:

Basic factual queries with straightforward answers.

Examples:

  • “What is our company’s return policy?”
  • “Who is the contact person for technical support?”
  • “What are the operating hours for customer service?”

Purpose: Establish a baseline for core knowledge retrieval.

Iterative Optimization

After initial testing, systematically adjust and retest to improve performance:

1

Adjust LLM Parameters

Experiment with the following (compared in the sketch after the list):

  • Prompt engineering adjustments
  • Temperature and creativity settings
  • Different models or model versions
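
In practice this usually means replaying the same test set over a few candidate settings and comparing scores; the sketch below only shows the bookkeeping, with run_test_set as a placeholder for however you execute the tests.

# Hypothetical settings to compare against the same test set.
candidate_settings = [
    {"model": "gpt-4o",      "temperature": 0.2},
    {"model": "gpt-4o-mini", "temperature": 0.2},
    {"model": "gpt-4o",      "temperature": 0.7},
]

def run_test_set(settings):
    """Placeholder: replay the test set with these settings and return an average score (0-1)."""
    return 0.0  # replace with the real evaluation run

scores = {}
for settings in candidate_settings:
    label = f'{settings["model"]} @ temperature={settings["temperature"]}'
    scores[label] = run_test_set(settings)

best = max(scores, key=scores.get)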
2

Refine RAG Configuration

Optimize how information is processed and retrieved; candidate variants to compare are sketched after this list:

  • Chunking strategies
  • Indexing methods
  • Retrieval mechanisms
  • Context handling
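
These settings can be compared the same way as LLM parameters; the keys below are generic RAG tuning knobs used for illustration, not a specific Prisme.ai schema.

# Illustrative RAG variants to replay the test set against.
rag_variants = [
    {"chunk_size": 512,  "chunk_overlap": 50,  "top_k": 4},   # smaller chunks, fewer passages
    {"chunk_size": 1024, "chunk_overlap": 100, "top_k": 4},   # larger chunks with more overlap
    {"chunk_size": 512,  "chunk_overlap": 50,  "top_k": 8},   # retrieve more passages per query
]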
3

Integrate Tools

Add specialized capabilities where needed, as illustrated after this list:

  • Calculators for numerical questions
  • Structured data tools for comparisons
  • Visualization tools for complex data
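
As a generic illustration, a tool is a function the agent can call instead of answering from text alone; the calculator-style definition below is not tied to Prisme.ai's tool configuration.

# Generic sketch of a calculator-style tool an agent could delegate numerical questions to.
def percentage_change(old_value: float, new_value: float) -> float:
    """Return the percentage change from old_value to new_value."""
    return (new_value - old_value) / old_value * 100

calculator_tool = {
    "name": "percentage_change",
    "description": "Compute the percentage change between two numbers.",
    "parameters": {"old_value": "number", "new_value": "number"},
    "handler": percentage_change,
}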
4

Expand Test Set

Once performance is optimized, increase test coverage:

  • Add more edge cases
  • Include newly discovered user questions
  • Create tests for specific user personas

Best Practices

Next Steps