The Crawler and SearchEngine microservices work together to provide powerful web content indexing and search capabilities for your Prisme.ai platform. These services enable the creation of knowledge bases from web content, support external data integration, and power the search functionality across your applications.

Overview

The Crawler and SearchEngine microservices function as a pair to deliver comprehensive web content processing:

prismeai-crawler

Discovers, fetches, and processes web content from specified sources, managing crawl schedules and content extraction

prismeai-searchengine

Indexes processed content and provides search capabilities with relevance ranking and content highlighting
These services must be deployed together - you cannot use one without the other as they form a complete indexing and search solution.

Installation Prerequisites

Before deploying these microservices, ensure you have access to the following dependencies:

ElasticSearch

Required for document storage and search functionality
  • Can use the same ElasticSearch instance as the core deployment
  • Stores indexed content and search metadata
  • Provides the search functionality backend

Redis

Required for inter-service communication
  • Can use the same Redis instance as the core deployment
  • Manages crawl queues and job scheduling
  • Facilitates communication between services
  • Stores temporary processing data

Configuration

Environment Variables

Configure the Crawler and SearchEngine microservices with the following environment variables:
| Variable Name | Description | Default Value | Affected Services |
|---|---|---|---|
| REDIS_URL | Redis connection URL for communication between services | redis://localhost:6379 | Both |
| ELASTIC_URL | ElasticSearch connection URL for document storage | http://localhost:9200 | Both |
| ELASTIC_USER | ElasticSearch user | (none) | Both |
| ELASTIC_PASSWORD | ElasticSearch password | (none) | Both |
| Variable Name | Description | Default Value | Affected Services |
|---|---|---|---|
| MAX_CONTENT_LEN | Maximum length (in characters) of documents crawled | 150000 | prismeai-crawler |
| CONCURRENT_REQUESTS | The maximum number of concurrent (i.e. simultaneous) requests that will be performed by the Scrapy downloader | 16 | prismeai-crawler |
| CONCURRENT_REQUESTS_PER_DOMAIN | The maximum number of concurrent (i.e. simultaneous) requests that will be performed to any single domain | 16 | prismeai-crawler |
| DOWNLOAD_DELAY | Minimum seconds to wait between 2 consecutive requests to the same domain | 0 | prismeai-crawler |
| REQUEST_QUEUES_POLLING_INTERVAL | Interval in seconds between each poll of the Redis queue for new requests to process. Lower values increase responsiveness but also Redis load. | 2 | prismeai-crawler |
| REQUEST_QUEUES_POLLING_SIZE | Number of requests to start from the queue in a single poll | 1 | prismeai-crawler |
| USER_AGENT | Crawler HTTP user agent | Prisme.ai (https://prisme.ai) | prismeai-crawler |
| ROBOTSTXT_OBEY | Whether the crawler should respect the site's robots.txt. Recommended: True for all public websites. For internal portals (e.g. SharePoint, intranet), you may set False if robots.txt prevents access to content you are authorized to crawl. ⚠️ Disabling this option should only be done in controlled/internal environments. | True | prismeai-crawler |
These settings control the acknowledgment mechanism that ensures zero data loss when the crawler crashes during request processing.
| Variable Name | Description | Default Value | Affected Services |
|---|---|---|---|
| ACK_ENABLED | Enable/disable the request acknowledgment mechanism. When enabled, requests are tracked atomically in Redis and automatically reclaimed if the crawler crashes during processing. Disabling this reverts to the original behavior where crashed requests are lost. | True | prismeai-crawler |
| PENDING_VISIBILITY_TIMEOUT | Time in seconds before a pending request is considered stale and eligible for reclaim. Set this to 2-3x your longest document processing time (e.g., large PDFs). If a request takes longer than this timeout, it will be retried. | 900 (15 minutes) | prismeai-crawler |
| PENDING_RECLAIM_INTERVAL | Interval in seconds between checks for stale pending requests. The reclaim loop runs at this frequency to find and re-queue abandoned tasks. | 60 | prismeai-crawler |
These settings control adaptive resource monitoring, which helps prevent OOM (Out of Memory) crashes by tracking CPU and memory usage. The crawler pauses polling for new requests when resources are constrained and resumes when they recover.
Thresholds are percentages of container limits, not system resources. Limits are auto-detected from cgroup files (cgroup v1/v2). For restricted Kubernetes clusters where cgroup detection fails, you can set CONTAINER_CPU_LIMIT_MILLICORES and CONTAINER_MEMORY_LIMIT_BYTES as fallbacks.
| Variable Name | Description | Default Value | Affected Services |
|---|---|---|---|
| RESOURCE_THROTTLING_ENABLED | Enable/disable resource-based throttling. When enabled, the crawler monitors memory and CPU usage and pauses polling when thresholds are exceeded to prevent OOM crashes. | True | prismeai-crawler |
| RESOURCE_THROTTLING_MODE | What resources to monitor. Options: memory_only (recommended - Kubernetes handles CPU throttling effectively), cpu_only, both, none. Memory monitoring is critical as memory exhaustion leads to OOMKilled. | memory_only | prismeai-crawler |
| MEMORY_WARNING_THRESHOLD | Memory usage percentage (of container limit) above which the crawler pauses polling for new requests. Existing requests continue processing. | 80 | prismeai-crawler |
| MEMORY_CRITICAL_THRESHOLD | Memory usage percentage (of container limit) above which the crawler stops all processing. This is a last-resort protection against OOM. | 95 | prismeai-crawler |
| CPU_WARNING_THRESHOLD | CPU usage percentage (of container limit) above which the crawler pauses polling. Only applies when RESOURCE_THROTTLING_MODE includes CPU monitoring. | 85 | prismeai-crawler |
| CPU_CRITICAL_THRESHOLD | CPU usage percentage (of container limit) above which the crawler stops all processing. Only applies when RESOURCE_THROTTLING_MODE includes CPU monitoring. | 95 | prismeai-crawler |
| RESOURCE_CHECK_INTERVAL | Interval in seconds between resource usage checks. Lower values provide faster response to resource pressure but increase overhead. | 1.0 | prismeai-crawler |
| RESOURCE_CIRCUIT_BREAKER_THRESHOLD | Number of consecutive resource check failures before escalating to a CRITICAL alert. This helps identify persistent resource issues vs. temporary spikes. | 10 | prismeai-crawler |
| RESOURCE_CIRCUIT_BREAKER_SHUTDOWN_TIMEOUT | Seconds the crawler can remain stuck in critical state before initiating automatic pod shutdown. The pod exits gracefully, allowing Kubernetes to restart it with fresh state. Set to 0 to disable auto-shutdown. | 900 (15 minutes) | prismeai-crawler |
| RESOURCE_CIRCUIT_BREAKER_FORCE_EXIT_TIMEOUT | Hard timeout in seconds for graceful reactor shutdown during circuit breaker auto-shutdown. If the reactor doesn't stop within this time, os._exit(1) is called to force termination. | 60 | prismeai-crawler |
These settings control how documents (PDFs, DOCX, etc.) are parsed and processed by the crawler.
| Variable Name | Description | Default Value | Affected Services |
|---|---|---|---|
| DOCUMENTS_DEFAULT_PARSER | Default parser for document processing. Options: unstructured (recommended, uses the Unstructured library), tika (Apache Tika - requires TIKA_PATH to be configured), docling (IBM Docling - requires HUGGINGFACE_MODEL_PATH). Can be overridden per-searchengine via parser configuration. | unstructured | prismeai-crawler |
| UNSTRUCTURED_DEFAULT_STRATEGY | Default parsing strategy for the Unstructured library. Options: fast (no OCR, fastest), auto (automatic detection), hi_res (full OCR processing, slowest but most accurate for scanned documents), ocr_only (OCR only). Can be overridden per-searchengine. | fast | prismeai-crawler |
| TIKA_OCR_SKIP | Skip OCR (Tesseract) processing when using the Apache Tika parser. Set to True for faster parsing at the cost of not extracting text from images within documents. Similar to Unstructured's fast strategy. Can be overridden per-searchengine using parser config: {"type": "tika", "skip_ocr": true}. | True | prismeai-crawler |
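For reference, here is a minimal sketch of a few of these variables expressed as shell exports; the values are illustrative only, and in practice you would set them through your Helm values, Kubernetes Deployment env, or container runtime:
export REDIS_URL="redis://redis-service:6379"
export ELASTIC_URL="http://elasticsearch-service:9200"
export ELASTIC_USER="elastic"                # only if your cluster requires authentication
export ELASTIC_PASSWORD="change-me"          # illustrative placeholder
export CONCURRENT_REQUESTS_PER_DOMAIN=8      # be gentler with individual sites
export DOWNLOAD_DELAY=1                      # wait 1 second between requests to the same domain
export ROBOTSTXT_OBEY=True                   # keep robots.txt compliance for public websites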

Resource Considerations

When planning your deployment, consider these resource recommendations:

Memory Requirements

  • Crawler: Min 1GB, recommended 2GB+
  • SearchEngine: Min 1GB, recommended 2GB+
  • Scale based on crawl volume and index size

CPU Allocation

  • Crawler: Min 0.5 vCPU, recommended 1+ vCPU
  • SearchEngine: Min 0.5 vCPU, recommended 1+ vCPU
  • Consider additional resources for high request volumes
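If you run on Kubernetes, these recommendations can be applied with kubectl set resources; the deployment names and namespace below are assumptions based on the chart entries and namespace used later in this guide, and the limits are illustrative:
kubectl set resources deployment prismeai-crawler -n apps \
  --requests=cpu=1,memory=2Gi --limits=cpu=2,memory=4Gi
kubectl set resources deployment prismeai-searchengine -n apps \
  --requests=cpu=1,memory=2Gi --limits=cpu=2,memory=4Gi
Setting explicit limits also matters for the resource throttling settings above, since their thresholds are percentages of the container limits.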

Storage Needs

  • ElasticSearch: Plan for index growth based on content volume
  • Redis: Minimal requirements for queue management
  • Consider storage class with good I/O performance

Network Configuration

  • Internet access for the Crawler service
  • Internal network access between services
  • Consider bandwidth requirements for crawl activities

Deployment Process

Follow these steps to deploy the Crawler and SearchEngine microservices:
Step 1: Configure Dependencies

Ensure ElasticSearch and Redis are accessible:
  1. Verify ElasticSearch connection with:
    curl -X GET "[ELASTIC_SEARCH_URL]:9200"
    
    You should receive a response with version and cluster information
  2. Verify Redis connection with:
    redis-cli -u [REDIS_URL] ping
    
    The response should be “PONG”
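  If your ElasticSearch cluster requires authentication (the ELASTIC_USER and ELASTIC_PASSWORD variables above), include the credentials in the check, for example:
    curl -u "elastic:your-password" -X GET "[ELASTIC_SEARCH_URL]:9200"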
Step 2: Deploy Microservices

Deploy both services using Helm as described in the Self-Hosting Apps Microservices guide. Ensure both services are included in your values.yaml configuration:
prismeai-crawler:
  enabled: true
  config:
    redis:
      url: "redis://redis-service:6379"
    elasticsearch:
      url: "elasticsearch-service:9200"
    
prismeai-searchengine:
  enabled: true
  config:
    redis:
      url: "redis://redis-service:6379"
    elasticsearch:
      url: "elasticsearch-service:9200"
Step 3: Verify Deployment

Check that both services are running correctly:
kubectl get pods -n apps | grep 'crawler\|searchengine'
Both services should show Running status and be ready (e.g., 1/1).
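You can also check the startup logs to confirm that both services connected to Redis and ElasticSearch; the deployment names below are assumptions based on the chart entries above:
kubectl logs -n apps deployment/prismeai-crawler --tail=50
kubectl logs -n apps deployment/prismeai-searchengine --tail=50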
Step 4: Configure Network Access

Ensure the following network paths are available:
  1. Internal access from both services to ElasticSearch and Redis
  2. Outbound internet access for the Crawler service
  3. Access from the other Prisme.ai services that will use the search functionality

Microservice Testing

After deploying the Crawler and SearchEngine microservices, verify their operation with these steps:
Step 1: Create a Test SearchEngine

Create a searchengine instance to crawl a test website:
curl --location 'http://localhost:8000/monitor/searchengine/test/test' \
--header 'Content-Type: application/json' \
--data '{
    "websites": [
        "https://docs.eda.prisme.ai/en/workspaces/"
    ]
}'
If successful, you should receive a complete searchengine object that includes an id field.
Step 2: Check Crawl Progress

After a few seconds, check the crawl history and statistics:
curl --location --request GET 'http://localhost:8000/monitor/searchengine/test/test/stats' \
--header 'Content-Type: application/json' \
--data '{
    "urls": ["https://docs.eda.prisme.ai/en/workspaces/"]
}'
Verify that:
  • The metrics.indexed_pages field is greater than 0
  • The metrics.pending_requests field indicates active crawling
  • The crawl_history section shows pages that have been processed
Step 3: Test Search Functionality

Perform a test search query to verify indexing and search:
curl --location 'http://localhost:8000/search/test/test' \
--header 'Content-Type: application/json' \
--data '{
    "query": "workspace"
}'
The response should include a results array containing pages from the crawled website that match your search term. Each result should include relevance information and content snippets.
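As a quick sanity check, you can count the returned matches with jq (assuming jq is installed on the machine issuing the request):
curl --silent --location 'http://localhost:8000/search/test/test' \
--header 'Content-Type: application/json' \
--data '{
    "query": "workspace"
}' | jq '.results | length'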
If all tests pass, congratulations! Your Crawler and SearchEngine microservices are up and running correctly.

Features and Capabilities

The Crawler service provides advanced web content discovery and extraction:
  • Configurable crawl depth: Control how many links deep the crawler will explore
  • URL filtering: Include or exclude specific URL patterns
  • Rate limiting: Respect website terms of service with configurable crawl rates
  • Content extraction: Parse and clean HTML to extract meaningful content
  • Metadata extraction: Capture titles, descriptions, and other metadata
  • Scheduled crawls: Set up periodic recrawling to keep content fresh
  • Robots.txt compliance: Respect website crawling policies
The SearchEngine service delivers powerful search functionality:
  • Full-text search: Find content across all indexed documents
  • Relevance ranking: Surface the most relevant content first
  • Content highlighting: Highlight matching terms in search results
  • Faceted search: Filter results by metadata fields
  • Synonym handling: Find content using related terms
  • Language support: Index and search content in multiple languages
  • Query suggestions: Support for “did you mean” functionality
  • Result snippets: Show context around matching terms

Integration with Prisme.ai

The Crawler and SearchEngine microservices integrate with other Prisme.ai components:

AI Knowledge

  • Create knowledge bases from crawled web content
  • Enrich existing knowledge bases with web information
  • Use search capabilities for better information retrieval

AI Builder

  • Build custom search interfaces using search API
  • Integrate search results into workflows
  • Trigger crawls programmatically in automations

AI Store

  • Power research agents with web crawling capabilities
  • Create domain-specific search tools
  • Develop content discovery applications

Custom Code

  • Extend crawling behavior with custom functions
  • Process search results with specialized logic
  • Create advanced search and discovery experiences

Advanced Configuration

When creating a searchengine, you can specify advanced crawl options:
{
  "websites": ["https://example.com"],
  "options": {
    "maxDepth": 3,
    "includePatterns": ["*/blog/*", "*/products/*"],
    "excludePatterns": ["*/admin/*", "*/login/*"],
    "respectRobotsTxt": true,
    "crawlDelay": 1000,
    "userAgent": "Prisme.ai Crawler",
    "maxPagesPerSite": 1000
  }
}
These options allow you to fine-tune crawling behavior for different use cases.
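As an illustration, these options can be included in the creation request body alongside websites, using the same endpoint as in the testing section above; the test/test path segments and option values are placeholders to adapt to your own setup:
curl --location 'http://localhost:8000/monitor/searchengine/test/test' \
--header 'Content-Type: application/json' \
--data '{
    "websites": ["https://example.com"],
    "options": {
        "maxDepth": 3,
        "excludePatterns": ["*/admin/*"],
        "crawlDelay": 1000
    }
}'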
The services automatically create and manage ElasticSearch indices. For advanced use cases, you can:
  • Configure index settings like sharding and replication
  • Set up index lifecycle policies for managing index growth
  • Implement custom analyzers for specialized search needs
  • Configure cross-cluster search for large-scale deployments
Consult the ElasticSearch documentation for more information on these advanced configurations.
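For example, replication for an existing index can be adjusted through the standard ElasticSearch settings API; the index name below is a placeholder, since the services manage their own index names:
curl -X PUT "[ELASTIC_SEARCH_URL]:9200/<index-name>/_settings" \
--header 'Content-Type: application/json' \
--data '{ "index": { "number_of_replicas": 1 } }'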
To optimize performance for your specific needs:
  • Adjust MAX_CONTENT_LEN to balance comprehensiveness with resource usage
  • Configure crawler concurrency settings for faster crawling
  • Implement ElasticSearch performance optimizations
  • Consider Redis caching strategies for frequent searches
  • Use horizontal scaling for high-volume crawling and search scenarios

Troubleshooting

Symptom: Web pages are not being crawled or indexed
Possible causes:
  • Network connectivity issues
  • Website robots.txt restrictions
  • Rate limiting by target websites
  • URL pattern configuration excluding relevant pages
Resolution steps:
  1. Check crawler logs for specific error messages
  2. Verify network connectivity to target websites
  3. Review website robots.txt for restrictions
  4. Adjust crawl rate settings to avoid being blocked
  5. Check URL pattern configurations
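Two quick checks that often help with steps 1 and 3, assuming the deployment name and namespace from the deployment section:
kubectl logs -n apps deployment/prismeai-crawler --tail=200 | grep -iE 'error|robots|403|429'
curl https://example.com/robots.txt   # replace example.com with the site you are trying to crawl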
Symptom: Search results are missing or irrelevant
Possible causes:
  • Content not properly indexed
  • ElasticSearch configuration issues
  • Query formatting problems
  • Content exceeding maximum length limits
Resolution steps:
  1. Verify content was successfully crawled and indexed
  2. Check ElasticSearch connectivity and health
  3. Review search query format and parameters
  4. Check if content exceeds MAX_CONTENT_LEN setting
  5. Test simple queries to validate basic functionality
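For step 2, the ElasticSearch cluster health endpoint is a quick check (add -u credentials if your cluster requires authentication):
curl -X GET "[ELASTIC_SEARCH_URL]:9200/_cluster/health?pretty"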
Symptom: Slow crawling or search response times
Possible causes:
  • Insufficient resources allocated
  • ElasticSearch performance problems
  • Redis bottlenecks
  • Large crawl queues or index sizes
Resolution steps:
  1. Monitor resource usage during operations
  2. Check ElasticSearch performance metrics
  3. Verify Redis isn’t running out of memory
  4. Consider scaling resources horizontally or vertically
  5. Implement more targeted crawling strategies
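For step 3, Redis memory usage can be inspected directly with redis-cli:
redis-cli -u [REDIS_URL] info memory | grep -E 'used_memory_human|maxmemory_human'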

Security Considerations

When deploying and using the Crawler and SearchEngine microservices, keep these security considerations in mind:

Network Security

  • Implement appropriate network policies
  • Consider using a dedicated proxy for outbound crawling
  • Monitor for unusual traffic patterns
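If you route outbound crawling through a dedicated proxy, the Scrapy-based crawler generally honors the standard proxy environment variables; a minimal sketch, assuming your deployment lets you inject extra environment variables (hostnames are placeholders):
export HTTP_PROXY="http://proxy.internal:3128"
export HTTPS_PROXY="http://proxy.internal:3128"
export NO_PROXY="elasticsearch-service,redis-service,localhost"   # keep internal traffic off the proxy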

Content Security

  • Be mindful of crawling and indexing sensitive content
  • Implement URL patterns to exclude sensitive areas
  • Consider content filtering before indexing

Authentication

  • Secure ElasticSearch and Redis with strong authentication
  • Implement API access controls for search endpoints
  • Use TLS for all service communications

Compliance

  • Respect website terms of service when crawling
  • Consider data retention policies for crawled content
  • Be aware of copyright implications of content indexing
For any issues or questions during the deployment process, contact [email protected] for assistance.