Overview
The Crawler and SearchEngine microservices function as a pair to deliver comprehensive web content processing:

prismeai-crawler
Discovers, fetches, and processes web content from specified sources, managing crawl schedules and content extraction
prismeai-searchengine
Indexes processed content and provides search capabilities with relevance ranking and content highlighting
These services must be deployed together - you cannot use one without the other as they form a complete indexing and search solution.
Installation Prerequisites
Before deploying these microservices, ensure you have access to the following dependencies:

ElasticSearch
Required for document storage and search functionality
- Can use the same ElasticSearch instance as the core deployment
- Stores indexed content and search metadata
- Provides the search functionality backend
Redis
Required for inter-service communication
- Can use the same Redis instance as the core deployment
- Manages crawl queues and job scheduling
- Facilitates communication between services
- Stores temporary processing data
Configuration
Environment Variables
Configure the Crawler and SearchEngine microservices with the following environment variables:
Common Environment Variables
| Variable Name | Description | Default Value | Affected Services |
|---|---|---|---|
| REDIS_URL | Redis connection URL for communication between services | redis://localhost:6379 | Both |
| ELASTIC_URL | ElasticSearch connection URL for document storage | http://localhost:9200 | Both |
| ELASTIC_USER | ElasticSearch user | (empty) | Both |
| ELASTIC_PASSWORD | ElasticSearch password | (empty) | Both |
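These are plain environment variables, so they can be supplied however your deployment injects configuration. For a quick local sanity check you might export example values like the following (the hosts and credentials below are placeholders, not defaults):

```bash
# Placeholder values only: point these at your own Redis and ElasticSearch instances
export REDIS_URL="redis://redis.internal.example:6379"
export ELASTIC_URL="http://elasticsearch.internal.example:9200"
export ELASTIC_USER="elastic"
export ELASTIC_PASSWORD="change-me"
```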
Crawler-Specific Environment Variables
| Variable Name | Description | Default Value | Affected Services |
|---|---|---|---|
| MAX_CONTENT_LEN | Maximum length (in characters) of documents crawled | 150000 | prismeai-crawler |
| CONCURRENT_REQUESTS | The maximum number of concurrent (i.e. simultaneous) requests performed by the Scrapy downloader | 16 | prismeai-crawler |
| CONCURRENT_REQUESTS_PER_DOMAIN | The maximum number of concurrent (i.e. simultaneous) requests performed to any single domain | 16 | prismeai-crawler |
| DOWNLOAD_DELAY | Minimum seconds to wait between 2 consecutive requests to the same domain | 0 | prismeai-crawler |
| REQUEST_QUEUES_POLLING_INTERVAL | Interval in seconds between each poll of the Redis queue for new requests to process. Lower values increase responsiveness but also Redis load. | 2 | prismeai-crawler |
| REQUEST_QUEUES_POLLING_SIZE | Number of requests to start from the queue in a single poll | 1 | prismeai-crawler |
| USER_AGENT | Crawler HTTP user agent | Prisme.ai (https://prisme.ai) | prismeai-crawler |
| ROBOTSTXT_OBEY | Whether the crawler should respect the site's robots.txt. Recommended: True for all public websites. For internal portals (e.g. SharePoint, intranet), you may set False if robots.txt prevents access to content you are authorized to crawl. ⚠️ Disabling this option should only be done in controlled/internal environments. | True | prismeai-crawler |
ACK & Reliability Environment Variables
These settings control the acknowledgment mechanism that ensures zero data loss when the crawler crashes during request processing.
| Variable Name | Description | Default Value | Affected Services |
|---|---|---|---|
| ACK_ENABLED | Enable/disable the request acknowledgment mechanism. When enabled, requests are tracked atomically in Redis and automatically reclaimed if the crawler crashes during processing. Disabling this reverts to the original behavior where crashed requests are lost. | True | prismeai-crawler |
| PENDING_VISIBILITY_TIMEOUT | Time in seconds before a pending request is considered stale and eligible for reclaim. Set this to 2-3x your longest document processing time (e.g., large PDFs). If a request takes longer than this timeout, it will be retried. | 900 (15 minutes) | prismeai-crawler |
| PENDING_RECLAIM_INTERVAL | Interval in seconds between checks for stale pending requests. The reclaim loop runs at this frequency to find and re-queue abandoned tasks. | 60 | prismeai-crawler |
Resource Throttling Environment Variables
These settings control adaptive resource monitoring to prevent OOM (Out of Memory) crashes through CPU and memory monitoring. The crawler will pause polling when resources are constrained and resume when they recover.
Thresholds are percentages of container limits, not system resources. Limits are auto-detected from cgroup files (cgroup v1/v2). For restricted Kubernetes clusters where cgroup detection fails, you can set
CONTAINER_CPU_LIMIT_MILLICORES and CONTAINER_MEMORY_LIMIT_BYTES as fallbacks.

| Variable Name | Description | Default Value | Affected Services |
|---|---|---|---|
| RESOURCE_THROTTLING_ENABLED | Enable/disable resource-based throttling. When enabled, the crawler monitors memory and CPU usage and pauses polling when thresholds are exceeded to prevent OOM crashes. | True | prismeai-crawler |
| RESOURCE_THROTTLING_MODE | What resources to monitor. Options: memory_only (recommended - Kubernetes handles CPU throttling effectively), cpu_only, both, none. Memory monitoring is critical as memory exhaustion leads to OOMKilled. | memory_only | prismeai-crawler |
| MEMORY_WARNING_THRESHOLD | Memory usage percentage (of container limit) above which the crawler pauses polling for new requests. Existing requests continue processing. | 80 | prismeai-crawler |
| MEMORY_CRITICAL_THRESHOLD | Memory usage percentage (of container limit) above which the crawler stops all processing. This is a last-resort protection against OOM. | 95 | prismeai-crawler |
| CPU_WARNING_THRESHOLD | CPU usage percentage (of container limit) above which the crawler pauses polling. Only applies when RESOURCE_THROTTLING_MODE includes CPU monitoring. | 85 | prismeai-crawler |
| CPU_CRITICAL_THRESHOLD | CPU usage percentage (of container limit) above which the crawler stops all processing. Only applies when RESOURCE_THROTTLING_MODE includes CPU monitoring. | 95 | prismeai-crawler |
| RESOURCE_CHECK_INTERVAL | Interval in seconds between resource usage checks. Lower values provide faster response to resource pressure but increase overhead. | 1.0 | prismeai-crawler |
| RESOURCE_CIRCUIT_BREAKER_THRESHOLD | Number of consecutive resource check failures before escalating to a CRITICAL alert. This helps identify persistent resource issues vs. temporary spikes. | 10 | prismeai-crawler |
| RESOURCE_CIRCUIT_BREAKER_SHUTDOWN_TIMEOUT | Seconds the crawler can remain stuck in critical state before initiating automatic pod shutdown. The pod exits gracefully, allowing Kubernetes to restart it with fresh state. Set to 0 to disable auto-shutdown. | 900 (15 minutes) | prismeai-crawler |
| RESOURCE_CIRCUIT_BREAKER_FORCE_EXIT_TIMEOUT | Hard timeout in seconds for graceful reactor shutdown during circuit breaker auto-shutdown. If the reactor doesn't stop within this time, os._exit(1) is called to force termination. | 60 | prismeai-crawler |
Document Processing Environment Variables
These settings control how documents (PDFs, DOCX, etc.) are parsed and processed by the crawler.
| Variable Name | Description | Default Value | Affected Services |
|---|---|---|---|
| DOCUMENTS_DEFAULT_PARSER | Default parser for document processing. Options: unstructured (recommended, uses the Unstructured library), tika (Apache Tika - requires TIKA_PATH to be configured), docling (IBM Docling - requires HUGGINGFACE_MODEL_PATH). Can be overridden per-searchengine via parser configuration. | unstructured | prismeai-crawler |
| UNSTRUCTURED_DEFAULT_STRATEGY | Default parsing strategy for the Unstructured library. Options: fast (no OCR, fastest), auto (automatic detection), hi_res (full OCR processing, slowest but most accurate for scanned documents), ocr_only (OCR only). Can be overridden per-searchengine. | fast | prismeai-crawler |
| TIKA_OCR_SKIP | Skip OCR (Tesseract) processing when using the Apache Tika parser. Set to True for faster parsing at the cost of not extracting text from images within documents. Similar to Unstructured's fast strategy. Can be overridden per-searchengine using parser config: {"type": "tika", "skip_ocr": true}. | True | prismeai-crawler |
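For the per-searchengine override mentioned in the table, the parser configuration is a small JSON object; for example, to use Tika without OCR:

```json
{ "type": "tika", "skip_ocr": true }
```

The same override mechanism applies to DOCUMENTS_DEFAULT_PARSER and UNSTRUCTURED_DEFAULT_STRATEGY, which can also be set per searchengine as noted above.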
Resource Considerations
When planning your deployment, consider these resource recommendations:

Memory Requirements
- Crawler: Min 1GB, recommended 2GB+
- SearchEngine: Min 1GB, recommended 2GB+
- Scale based on crawl volume and index size
CPU Allocation
- Crawler: Min 0.5 vCPU, recommended 1+ vCPU
- SearchEngine: Min 0.5 vCPU, recommended 1+ vCPU
- Consider additional resources for high request volumes
Storage Needs
- ElasticSearch: Plan for index growth based on content volume
- Redis: Minimal requirements for queue management
- Consider storage class with good I/O performance
Network Configuration
- Internet access for the Crawler service
- Internal network access between services
- Consider bandwidth requirements for crawl activities
Deployment Process
Follow these steps to deploy the Crawler and SearchEngine microservices:

1. Configure Dependencies
Ensure ElasticSearch and Redis are accessible:

Verify the ElasticSearch connection, for example:
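```bash
# Replace the URL and credentials with your own ELASTIC_URL, ELASTIC_USER and ELASTIC_PASSWORD values
curl -u "$ELASTIC_USER:$ELASTIC_PASSWORD" "$ELASTIC_URL"
```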
You should receive a response with version and cluster information
Verify the Redis connection, for example:
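```bash
# Adjust host and port to match your REDIS_URL (add -a <password> if your Redis requires authentication)
redis-cli -h localhost -p 6379 ping
```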
The response should be “PONG”
2. Deploy Microservices

Deploy both services using Helm as described in the Self-Hosting Apps Microservices guide. Ensure both services are included in your values.yaml configuration:
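A minimal sketch of what that might look like, assuming the chart exposes one block per service (the exact keys depend on your chart version, so treat the structure below as illustrative):

```yaml
# Illustrative structure only: check your chart's documentation for the actual keys
prismeai-crawler:
  enabled: true
  env:
    REDIS_URL: redis://redis.core.svc.cluster.local:6379
    ELASTIC_URL: http://elasticsearch.core.svc.cluster.local:9200
prismeai-searchengine:
  enabled: true
  env:
    REDIS_URL: redis://redis.core.svc.cluster.local:6379
    ELASTIC_URL: http://elasticsearch.core.svc.cluster.local:9200
```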
3. Verify Deployment

Check that both services are running correctly:
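For example, on Kubernetes (the namespace below is illustrative; use the one from your installation):

```bash
# Namespace is illustrative; adjust to your deployment
kubectl get pods -n prismeai-apps | grep -E 'crawler|searchengine'
```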
Both services should show Running status and be ready (e.g., 1/1).
4. Configure Network Access
Ensure the services can access:
- ElasticSearch and Redis internally
- Internet access for the Crawler service
- Access from other Prisme.ai services that will use search functionality
Microservice Testing
After deploying the Crawler and SearchEngine microservices, verify their operation with these steps:

1. Create a Test SearchEngine
Create a searchengine instance to crawl a test website:
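The exact route and payload schema are defined by the SearchEngine API; the call below is only a hypothetical sketch of such a request, with placeholder values:

```bash
# Hypothetical: replace the placeholders with the creation route from the API reference
curl -X POST "http://<searchengine-api-url>/<create-endpoint>" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "test-search",
        "urls": ["https://docs.prisme.ai"]
      }'
```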
If successful, you should receive a complete searchengine object that includes an id field.
2. Check Crawl Progress

After a few seconds, check the crawl history and statistics:
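Again, the route below is a hypothetical placeholder; use the statistics endpoint documented in the API reference:

```bash
# Hypothetical: replace the placeholders with the stats route and the id returned at creation
curl "http://<searchengine-api-url>/<stats-endpoint>/<searchengine-id>"
```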
Verify that:

- The metrics.indexed_pages field is greater than 0
- The metrics.pending_requests field indicates active crawling
- The crawl_history section shows pages that have been processed
3. Test Search Functionality

Perform a test search query to verify indexing and search:
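As before, the route and payload are placeholders for the search endpoint documented in the API reference:

```bash
# Hypothetical: replace the placeholders with the search route and the id returned at creation
curl -X POST "http://<searchengine-api-url>/<search-endpoint>/<searchengine-id>" \
  -H "Content-Type: application/json" \
  -d '{ "query": "your search term" }'
```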
The response should include a results array containing pages from the crawled website that match your search term. Each result should include relevance information and content snippets.

Features and Capabilities
Web Crawling
The Crawler service provides advanced web content discovery and extraction:
- Configurable crawl depth: Control how many links deep the crawler will explore
- URL filtering: Include or exclude specific URL patterns
- Rate limiting: Respect website terms of service with configurable crawl rates
- Content extraction: Parse and clean HTML to extract meaningful content
- Metadata extraction: Capture titles, descriptions, and other metadata
- Scheduled crawls: Set up periodic recrawling to keep content fresh
- Robots.txt compliance: Respect website crawling policies
Search Capabilities
The SearchEngine service delivers powerful search functionality:
- Full-text search: Find content across all indexed documents
- Relevance ranking: Surface the most relevant content first
- Content highlighting: Highlight matching terms in search results
- Faceted search: Filter results by metadata fields
- Synonym handling: Find content using related terms
- Language support: Index and search content in multiple languages
- Query suggestions: Support for “did you mean” functionality
- Result snippets: Show context around matching terms
Integration with Prisme.ai
The Crawler and SearchEngine microservices integrate with other Prisme.ai components:

AI Knowledge
- Create knowledge bases from crawled web content
- Enrich existing knowledge bases with web information
- Use search capabilities for better information retrieval
AI Builder
- Build custom search interfaces using search API
- Integrate search results into workflows
- Trigger crawls programmatically in automations
AI Store
- Power research agents with web crawling capabilities
- Create domain-specific search tools
- Develop content discovery applications
Custom Code
- Extend crawling behavior with custom functions
- Process search results with specialized logic
- Create advanced search and discovery experiences
Advanced Configuration
Crawl Configuration Options
When creating a searchengine, you can specify advanced crawl options. These options allow you to fine-tune crawling behavior for different use cases.
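As an illustration, such options typically cover crawl depth, URL filtering, and rate limits, as described under Features and Capabilities; the field names below are hypothetical, so consult the API reference for the actual schema:

```yaml
# Hypothetical field names for illustration; consult the API reference for the actual schema
name: docs-crawl
urls:
  - https://docs.example.com
max_depth: 3
exclude_patterns:
  - /admin/.*
download_delay: 1
```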
ElasticSearch Index Management
The services automatically create and manage ElasticSearch indices. For advanced use cases, you can:
- Configure index settings like sharding and replication
- Set up index lifecycle policies for managing index growth
- Implement custom analyzers for specialized search needs
- Configure cross-cluster search for large-scale deployments
Performance Tuning
To optimize performance for your specific needs:
- Adjust MAX_CONTENT_LEN to balance comprehensiveness with resource usage
- Configure crawler concurrency settings for faster crawling
- Implement ElasticSearch performance optimizations
- Consider Redis caching strategies for frequent searches
- Use horizontal scaling for high-volume crawling and search scenarios
Troubleshooting
Crawling Issues
Symptom: Web pages are not being crawled or indexed

Possible causes:
- Network connectivity issues
- Website robots.txt restrictions
- Rate limiting by target websites
- URL pattern configuration excluding relevant pages
Solutions:

- Check crawler logs for specific error messages
- Verify network connectivity to target websites
- Review website robots.txt for restrictions
- Adjust crawl rate settings to avoid being blocked
- Check URL pattern configurations
Search Problems
Symptom: Search results are missing or irrelevant

Possible causes:
- Content not properly indexed
- ElasticSearch configuration issues
- Query formatting problems
- Content exceeding maximum length limits
Solutions:

- Verify content was successfully crawled and indexed
- Check ElasticSearch connectivity and health
- Review search query format and parameters
- Check if content exceeds the MAX_CONTENT_LEN setting
- Test simple queries to validate basic functionality
Performance Issues
Symptom: Slow crawling or search response times

Possible causes:
- Insufficient resources allocated
- ElasticSearch performance problems
- Redis bottlenecks
- Large crawl queues or index sizes
Solutions:

- Monitor resource usage during operations
- Check ElasticSearch performance metrics
- Verify Redis isn’t running out of memory
- Consider scaling resources horizontally or vertically
- Implement more targeted crawling strategies
Security Considerations
When deploying and using the Crawler and SearchEngine microservices, keep these security considerations in mind:

Network Security
- Implement appropriate network policies
- Consider using a dedicated proxy for outbound crawling
- Monitor for unusual traffic patterns
Content Security
- Be mindful of crawling and indexing sensitive content
- Implement URL patterns to exclude sensitive areas
- Consider content filtering before indexing
Authentication
- Secure ElasticSearch and Redis with strong authentication
- Implement API access controls for search endpoints
- Use TLS for all service communications
Compliance
- Respect website terms of service when crawling
- Consider data retention policies for crawled content
- Be aware of copyright implications of content indexing