The Crawler and SearchEngine microservices work together to provide powerful web content indexing and search capabilities for your Prisme.ai platform. These services enable the creation of knowledge bases from web content, support external data integration, and power the search functionality across your applications.

Overview

The Crawler and SearchEngine microservices function as a pair to deliver comprehensive web content processing:

prismeai-crawler

Discovers, fetches, and processes web content from specified sources, managing crawl schedules and content extraction

prismeai-searchengine

Indexes processed content and provides search capabilities with relevance ranking and content highlighting

These services must be deployed together: neither can be used without the other, as together they form a complete indexing and search solution.

Installation Prerequisites

Before deploying these microservices, ensure you have access to the following dependencies:

ElasticSearch

Required for document storage and search functionality
  • Can use the same ElasticSearch instance as the core deployment
  • Stores indexed content and search metadata
  • Provides the search functionality backend

Redis

Required for inter-service communication
  • Can use the same Redis instance as the core deployment
  • Manages crawl queues and job scheduling
  • Facilitates communication between services
  • Stores temporary processing data

Configuration

Environment Variables

Configure the Crawler and SearchEngine microservices with the following environment variables:
| Variable Name | Description | Default Value | Affected Services |
| --- | --- | --- | --- |
| REDIS_URL | Redis connection URL for communication between services | redis://localhost:6379 | Both |
| ELASTIC_SEARCH_URL | ElasticSearch connection URL for document storage | localhost | Both |
| MAX_CONTENT_LEN | Maximum length (in characters) of crawled documents | 150000 | prismeai-crawler |
| CONCURRENT_REQUESTS | Maximum number of simultaneous requests performed by the Scrapy downloader | 16 | prismeai-crawler |
| CONCURRENT_REQUESTS_PER_DOMAIN | Maximum number of simultaneous requests performed to any single domain | 16 | prismeai-crawler |
| DOWNLOAD_DELAY | Minimum number of seconds to wait between two consecutive requests to the same domain | 0 | prismeai-crawler |
| REQUEST_QUEUES_POLLING_INTERVAL | Interval in seconds between polls of the request queue | 5 | prismeai-crawler |
| REQUEST_QUEUES_POLLING_SIZE | Number of requests started from the queue in a single poll | 1 | prismeai-crawler |
| USER_AGENT | Crawler HTTP user agent | Prisme.ai (https://prisme.ai) | prismeai-crawler |
| ROBOTSTXT_OBEY | Whether the crawler should respect each site's robots.txt (see note below) | True | prismeai-crawler |

Note on ROBOTSTXT_OBEY: keep the recommended value of True for all public websites. For internal portals (e.g. SharePoint or an intranet), you may set it to False if robots.txt prevents access to content you are authorized to crawl. ⚠️ Disable this option only in controlled/internal environments.
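As a minimal sketch, these variables can be exported before launching the crawler locally (the values shown are illustrative, not recommendations):

export REDIS_URL="redis://localhost:6379"
export ELASTIC_SEARCH_URL="localhost"
export MAX_CONTENT_LEN=150000
export CONCURRENT_REQUESTS_PER_DOMAIN=8
export DOWNLOAD_DELAY=1
export ROBOTSTXT_OBEY=True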

Resource Considerations

When planning your deployment, consider these resource recommendations:

Memory Requirements

  • Crawler: Min 1GB, recommended 2GB+
  • SearchEngine: Min 1GB, recommended 2GB+
  • Scale based on crawl volume and index size

CPU Allocation

  • Crawler: Min 0.5 vCPU, recommended 1+ vCPU
  • SearchEngine: Min 0.5 vCPU, recommended 1+ vCPU
  • Consider additional resources for high request volumes

Storage Needs

  • ElasticSearch: Plan for index growth based on content volume
  • Redis: Minimal requirements for queue management
  • Consider storage class with good I/O performance

Network Configuration

  • Internet access for the Crawler service
  • Internal network access between services
  • Consider bandwidth requirements for crawl activities
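Once deployed, outbound connectivity can be spot-checked from inside the crawler pod, for example (the deployment name is an assumption matching the Helm values below, and the container image must ship curl):

kubectl exec -n apps deploy/prismeai-crawler -- curl -sI https://docs.eda.prisme.ai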

Deployment Process

Follow these steps to deploy the Crawler and SearchEngine microservices:
1. Configure Dependencies

Ensure ElasticSearch and Redis are accessible:
  1. Verify ElasticSearch connection with:
    curl -X GET "[ELASTIC_SEARCH_URL]:9200"
    
    You should receive a response with version and cluster information
  2. Verify Redis connection with:
    redis-cli -u [REDIS_URL] ping
    
    The response should be “PONG”
2. Deploy Microservices

Deploy both services using Helm as described in the Self-Hosting Apps Microservices guide. Ensure both services are included in your values.yaml configuration:
prismeai-crawler:
  enabled: true
  config:
    redis:
      url: "redis://redis-service:6379"
    elasticsearch:
      url: "elasticsearch-service:9200"
    
prismeai-searchengine:
  enabled: true
  config:
    redis:
      url: "redis://redis-service:6379"
    elasticsearch:
      url: "elasticsearch-service:9200"
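The exact deployment command depends on your chart layout; a typical invocation might look like this (release, chart, and namespace names are assumptions, so use the ones from the guide):

helm upgrade --install prismeai-apps prismeai/prismeai-apps --namespace apps -f values.yaml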
3. Verify Deployment

Check that both services are running correctly:
kubectl get pods -n apps | grep 'crawler\|searchengine'
Both services should show Running status and be ready (e.g., 1/1).
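If a pod is not ready, its logs usually point at the failing dependency (the deployment names are assumptions matching the Helm values above):

kubectl logs -n apps deploy/prismeai-crawler --tail=50
kubectl logs -n apps deploy/prismeai-searchengine --tail=50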
4. Configure Network Access

Ensure the services can access:
  1. ElasticSearch and Redis internally
  2. Internet access for the Crawler service
  3. Access from other Prisme.ai services that will use search functionality

Microservice Testing

After deploying the Crawler and SearchEngine microservices, verify their operation with these steps:
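The commands below assume the SearchEngine API is reachable on localhost:8000. On Kubernetes you can, for instance, port-forward the service first (the service name and port are assumptions; adjust them to your deployment):

kubectl port-forward -n apps svc/prismeai-searchengine 8000:8000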
1. Create a Test SearchEngine

Create a searchengine instance to crawl a test website:
curl --location 'http://localhost:8000/monitor/searchengine/test/test' \
--header 'Content-Type: application/json' \
--data '{
    "websites": [
        "https://docs.eda.prisme.ai/en/workspaces/"
    ]
}'
If successful, you should receive a complete searchengine object that includes an id field.
2. Check Crawl Progress

After a few seconds, check the crawl history and statistics:
curl --location --request GET 'http://localhost:8000/monitor/searchengine/test/test/stats' \
--header 'Content-Type: application/json' \
--data '{
    "urls": ["https://docs.eda.prisme.ai/en/workspaces/"]
}'
Verify that:
  • The metrics.indexed_pages field is greater than 0
  • The metrics.pending_requests field indicates active crawling
  • The crawl_history section shows pages that have been processed
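As a convenience, the metrics can be extracted directly with jq, assuming it is installed (same endpoint and payload as above):

curl -s --location --request GET 'http://localhost:8000/monitor/searchengine/test/test/stats' \
--header 'Content-Type: application/json' \
--data '{"urls": ["https://docs.eda.prisme.ai/en/workspaces/"]}' | jq '.metrics'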
3. Test Search Functionality

Perform a test search query to verify indexing and search:
curl --location 'http://localhost:8000/search/test/test' \
--header 'Content-Type: application/json' \
--data '{
    "query": "workspace"
}'
The response should include a results array containing pages from the crawled website that match your search term. Each result should include relevance information and content snippets.
If all tests pass, congratulations! Your Crawler and SearchEngine microservices are up and running correctly.

Features and Capabilities

The Crawler service provides advanced web content discovery and extraction:
  • Configurable crawl depth: Control how many links deep the crawler will explore
  • URL filtering: Include or exclude specific URL patterns
  • Rate limiting: Respect website terms of service with configurable crawl rates
  • Content extraction: Parse and clean HTML to extract meaningful content
  • Metadata extraction: Capture titles, descriptions, and other metadata
  • Scheduled crawls: Set up periodic recrawling to keep content fresh
  • Robots.txt compliance: Respect website crawling policies
The SearchEngine service delivers powerful search functionality:
  • Full-text search: Find content across all indexed documents
  • Relevance ranking: Surface the most relevant content first
  • Content highlighting: Highlight matching terms in search results
  • Faceted search: Filter results by metadata fields
  • Synonym handling: Find content using related terms
  • Language support: Index and search content in multiple languages
  • Query suggestions: Support for “did you mean” functionality
  • Result snippets: Show context around matching terms

Integration with Prisme.ai

The Crawler and SearchEngine microservices integrate with other Prisme.ai components:

AI Knowledge

  • Create knowledge bases from crawled web content
  • Enrich existing knowledge bases with web information
  • Use search capabilities for better information retrieval

AI Builder

  • Build custom search interfaces using search API
  • Integrate search results into workflows
  • Trigger crawls programmatically in automations

AI Store

  • Power research agents with web crawling capabilities
  • Create domain-specific search tools
  • Develop content discovery applications

Custom Code

  • Extend crawling behavior with custom functions
  • Process search results with specialized logic
  • Create advanced search and discovery experiences

Advanced Configuration

When creating a searchengine, you can specify advanced crawl options:
{
  "websites": ["https://example.com"],
  "options": {
    "maxDepth": 3,
    "includePatterns": ["*/blog/*", "*/products/*"],
    "excludePatterns": ["*/admin/*", "*/login/*"],
    "respectRobotsTxt": true,
    "crawlDelay": 1000,
    "userAgent": "Prisme.ai Crawler",
    "maxPagesPerSite": 1000
  }
}
These options allow you to fine-tune crawling behavior for different use cases.
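For example, the options could be passed to the same creation endpoint used in the testing section above (verify against your version which options are supported):

curl --location 'http://localhost:8000/monitor/searchengine/test/test' \
--header 'Content-Type: application/json' \
--data '{
    "websites": ["https://example.com"],
    "options": {
        "maxDepth": 3,
        "excludePatterns": ["*/admin/*", "*/login/*"]
    }
}'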
The services automatically create and manage ElasticSearch indices. For advanced use cases, you can:
  • Configure index settings like sharding and replication
  • Set up index lifecycle policies for managing index growth
  • Implement custom analyzers for specialized search needs
  • Configure cross-cluster search for large-scale deployments
Consult the ElasticSearch documentation for more information on these advanced configurations.
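As an illustration, the replica count of the generated indices can be adjusted through the standard ElasticSearch settings API; list your indices first, since the index names used by the services are not fixed here:

curl -X GET "[ELASTIC_SEARCH_URL]:9200/_cat/indices?v"
curl -X PUT "[ELASTIC_SEARCH_URL]:9200/<index-name>/_settings" \
-H 'Content-Type: application/json' \
-d '{"index": {"number_of_replicas": 2}}'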
To optimize performance for your specific needs:
  • Adjust MAX_CONTENT_LEN to balance comprehensiveness with resource usage
  • Configure crawler concurrency settings for faster crawling
  • Implement ElasticSearch performance optimizations
  • Consider Redis caching strategies for frequent searches
  • Use horizontal scaling for high-volume crawling and search scenarios
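Horizontal scaling can be as simple as raising the replica count (the deployment name is an assumption; confirm that the services scale safely in your setup before doing this in production):

kubectl scale deployment prismeai-crawler -n apps --replicas=3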

Troubleshooting

Symptom: Web pages are not being crawled or indexed

Possible causes:
  • Network connectivity issues
  • Website robots.txt restrictions
  • Rate limiting by target websites
  • URL pattern configuration excluding relevant pages
Resolution steps:
  1. Check crawler logs for specific error messages
  2. Verify network connectivity to target websites
  3. Review website robots.txt for restrictions
  4. Adjust crawl rate settings to avoid being blocked
  5. Check URL pattern configurations
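A quick first pass combines the crawler logs with a manual robots.txt check (the deployment name is an assumption; replace the URL with the site you are crawling):

kubectl logs -n apps deploy/prismeai-crawler --tail=200 | grep -iE 'error|robots|403|429'
curl -s https://docs.eda.prisme.ai/robots.txt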

Symptom: Search results are missing or irrelevant

Possible causes:
  • Content not properly indexed
  • ElasticSearch configuration issues
  • Query formatting problems
  • Content exceeding maximum length limits
Resolution steps:
  1. Verify content was successfully crawled and indexed
  2. Check ElasticSearch connectivity and health
  3. Review search query format and parameters
  4. Check if content exceeds MAX_CONTENT_LEN setting
  5. Test simple queries to validate basic functionality
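Basic ElasticSearch health and index checks use the standard APIs:

curl -X GET "[ELASTIC_SEARCH_URL]:9200/_cluster/health?pretty"
curl -X GET "[ELASTIC_SEARCH_URL]:9200/_cat/indices?v"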

Symptom: Slow crawling or search response times

Possible causes:
  • Insufficient resources allocated
  • ElasticSearch performance problems
  • Redis bottlenecks
  • Large crawl queues or index sizes
Resolution steps:
  1. Monitor resource usage during operations
  2. Check ElasticSearch performance metrics
  3. Verify Redis isn’t running out of memory
  4. Consider scaling resources horizontally or vertically
  5. Implement more targeted crawling strategies
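Useful starting points for these checks (same placeholder conventions as in the deployment section):

redis-cli -u [REDIS_URL] info memory
curl -X GET "[ELASTIC_SEARCH_URL]:9200/_nodes/stats/jvm?pretty"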

Security Considerations

When deploying and using the Crawler and SearchEngine microservices, keep these security considerations in mind:

Network Security

  • Implement appropriate network policies
  • Consider using a dedicated proxy for outbound crawling
  • Monitor for unusual traffic patterns

Content Security

  • Be mindful of crawling and indexing sensitive content
  • Implement URL patterns to exclude sensitive areas
  • Consider content filtering before indexing

Authentication

  • Secure ElasticSearch and Redis with strong authentication
  • Implement API access controls for search endpoints
  • Use TLS for all service communications
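Credentials and TLS parameters are typically carried in the connection URLs themselves; whether the services accept them in this form is an assumption to verify for your version:

REDIS_URL="rediss://:your-password@redis-service:6380"
ELASTIC_SEARCH_URL="https://elastic:your-password@elasticsearch-service:9200"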

Compliance

  • Respect website terms of service when crawling
  • Consider data retention policies for crawled content
  • Be aware of copyright implications of content indexing
For any issues or questions during the deployment process, contact support@prisme.ai for assistance.