The Crawler and SearchEngine microservices work together to provide powerful web content indexing and search capabilities for your Prisme.ai platform. These services enable the creation of knowledge bases from web content, support external data integration, and power the search functionality across your applications.

Overview

The Crawler and SearchEngine microservices function as a pair to deliver comprehensive web content processing:

prismeai-crawler

Discovers, fetches, and processes web content from specified sources, managing crawl schedules and content extraction

prismeai-searchengine

Indexes processed content and provides search capabilities with relevance ranking and content highlighting

These services must be deployed together: neither functions on its own, as together they form a complete indexing and search solution.

Installation Prerequisites

Before deploying these microservices, ensure you have access to the following dependencies:

ElasticSearch

Required for document storage and search functionality

  • Can use the same ElasticSearch instance as the core deployment
  • Stores indexed content and search metadata
  • Provides the search functionality backend

Redis

Required for inter-service communication

  • Can use the same Redis instance as the core deployment
  • Manages crawl queues and job scheduling
  • Facilitates communication between services
  • Stores temporary processing data

Configuration

Environment Variables

The Crawler and SearchEngine microservices are configured through environment variables. One notable behavior:

ELASTIC_SEARCH_URL can be set to an empty string (''), which prevents webpage content from being saved, effectively deactivating the search functionality while still allowing crawling.
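As an illustration of this behavior from a configuration point of view, the sketch below (hypothetical logic, not the actual service code) shows how an empty ELASTIC_SEARCH_URL can be interpreted as "crawl but do not index":

```python
import os

def search_enabled() -> bool:
    """Return True when an ElasticSearch URL is configured.

    Hypothetical sketch: the real service's internals may differ, but
    per this guide, an empty ELASTIC_SEARCH_URL disables indexing
    while crawling continues to run.
    """
    return bool(os.environ.get("ELASTIC_SEARCH_URL", "").strip())
```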

Resource Considerations

When planning your deployment, consider these resource recommendations:

Memory Requirements

  • Crawler: Min 1GB, recommended 2GB+
  • SearchEngine: Min 1GB, recommended 2GB+
  • Scale based on crawl volume and index size

CPU Allocation

  • Crawler: Min 0.5 vCPU, recommended 1+ vCPU
  • SearchEngine: Min 0.5 vCPU, recommended 1+ vCPU
  • Consider additional resources for high request volumes

Storage Needs

  • ElasticSearch: Plan for index growth based on content volume
  • Redis: Minimal requirements for queue management
  • Consider storage class with good I/O performance

Network Configuration

  • Internet access for the Crawler service
  • Internal network access between services
  • Consider bandwidth requirements for crawl activities

Deployment Process

Follow these steps to deploy the Crawler and SearchEngine microservices:

1

Configure Dependencies

Ensure ElasticSearch and Redis are accessible:

  1. Verify ElasticSearch connection with:

    curl -X GET "[ELASTIC_SEARCH_URL]:9200"
    

    You should receive a response with version and cluster information

  2. Verify Redis connection with:

    redis-cli -u [REDIS_URL] ping
    

    The response should be "PONG"
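If curl and redis-cli are not available in your environment, a basic TCP reachability check can stand in for them. This Python sketch (hostnames are placeholders for your actual service addresses) only verifies that the ports accept connections, not that the services are healthy:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder hostnames; substitute your actual service addresses:
# port_open("elasticsearch-service", 9200)
# port_open("redis-service", 6379)
```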

2

Deploy Microservices

Deploy both services using Helm as described in the Self-Hosting Apps Microservices guide.

Ensure both services are included in your values.yaml configuration:

prismeai-crawler:
  enabled: true
  config:
    redis:
      url: "redis://redis-service:6379"
    elasticsearch:
      url: "elasticsearch-service:9200"
    
prismeai-searchengine:
  enabled: true
  config:
    redis:
      url: "redis://redis-service:6379"
    elasticsearch:
      url: "elasticsearch-service:9200"

3

Verify Deployment

Check that both services are running correctly:

kubectl get pods -n apps | grep 'crawler\|searchengine'

Both services should show Running status and be ready (e.g., 1/1).

4

Configure Network Access

Ensure the services can access:

  1. ElasticSearch and Redis internally
  2. Internet access for the Crawler service
  3. Access from other Prisme.ai services that will use search functionality

Microservice Testing

After deploying the Crawler and SearchEngine microservices, verify their operation with these steps:

1

Create a Test SearchEngine

Create a searchengine instance to crawl a test website:

curl --location 'http://localhost:8000/monitor/searchengine/test/test' \
--header 'Content-Type: application/json' \
--data '{
    "websites": [
        "https://docs.eda.prisme.ai/en/workspaces/"
    ]
}'

If successful, you should receive a complete searchengine object that includes an id field.

2

Check Crawl Progress

After a few seconds, check the crawl history and statistics:

curl --location --request GET 'http://localhost:8000/monitor/searchengine/test/test/stats' \
--header 'Content-Type: application/json' \
--data '{
    "urls": ["https://docs.eda.prisme.ai/en/workspaces/"]
}'

Verify that:

  • The metrics.indexed_pages field is greater than 0
  • The metrics.pending_requests field indicates active crawling
  • The crawl_history section shows pages that have been processed
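The checks above can be scripted. This sketch assumes the stats response shape described in this guide (a metrics object with indexed_pages and pending_requests, plus a crawl_history list); treat those field names as assumptions rather than a documented schema:

```python
def crawl_healthy(stats: dict) -> bool:
    """Return True when a stats payload indicates indexing progress.

    Assumes the response shape used in this guide:
    {"metrics": {"indexed_pages": int, "pending_requests": int},
     "crawl_history": [...]}
    """
    metrics = stats.get("metrics", {})
    return (
        metrics.get("indexed_pages", 0) > 0
        and len(stats.get("crawl_history", [])) > 0
    )
```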

3

Test Search Functionality

Perform a test search query to verify indexing and search:

curl --location 'http://localhost:8000/search/test/test' \
--header 'Content-Type: application/json' \
--data '{
    "query": "workspace"
}'

The response should include a results array containing pages from the crawled website that match your search term. Each result should include relevance information and content snippets.
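To sanity-check the response programmatically, a small helper can extract the matched URLs and snippets. The field names below (results, url, snippet) are assumptions based on the description above, so adjust them to the actual payload:

```python
def summarize_results(response: dict) -> list:
    """Collect (url, snippet) pairs from a search response.

    Field names are assumed from this guide's description and may
    need adjusting to the actual API payload.
    """
    return [
        (hit.get("url", ""), hit.get("snippet", ""))
        for hit in response.get("results", [])
    ]
```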

If all tests pass, congratulations! Your Crawler and SearchEngine microservices are up and running correctly.

Features and Capabilities

Integration with Prisme.ai

The Crawler and SearchEngine microservices integrate with other Prisme.ai components:

AI Knowledge

  • Create knowledge bases from crawled web content
  • Enrich existing knowledge bases with web information
  • Use search capabilities for better information retrieval

AI Builder

  • Build custom search interfaces using search API
  • Integrate search results into workflows
  • Trigger crawls programmatically in automations

AI Store

  • Power research agents with web crawling capabilities
  • Create domain-specific search tools
  • Develop content discovery applications

Custom Code

  • Extend crawling behavior with custom functions
  • Process search results with specialized logic
  • Create advanced search and discovery experiences

Security Considerations

When deploying and using the Crawler and SearchEngine microservices, keep these security considerations in mind:

Network Security

  • Implement appropriate network policies
  • Consider using a dedicated proxy for outbound crawling
  • Monitor for unusual traffic patterns

Content Security

  • Be mindful of crawling and indexing sensitive content
  • Implement URL patterns to exclude sensitive areas
  • Consider content filtering before indexing

Authentication

  • Secure ElasticSearch and Redis with strong authentication
  • Implement API access controls for search endpoints
  • Use TLS for all service communications

Compliance

  • Respect website terms of service when crawling
  • Consider data retention policies for crawled content
  • Be aware of copyright implications of content indexing

For any issues or questions during the deployment process, contact support@prisme.ai for assistance.