Crawler & SearchEngine Microservices
Deploy and configure the Prisme.ai web crawling and search engine capabilities for knowledge base creation and content discovery
The Crawler and SearchEngine microservices work together to provide powerful web content indexing and search capabilities for your Prisme.ai platform. These services enable the creation of knowledge bases from web content, support external data integration, and power the search functionality across your applications.
Overview
The Crawler and SearchEngine microservices function as a pair to deliver comprehensive web content processing:
prismeai-crawler
Discovers, fetches, and processes web content from specified sources, managing crawl schedules and content extraction
prismeai-searchengine
Indexes processed content and provides search capabilities with relevance ranking and content highlighting
These services must be deployed together; neither can be used without the other, as they form a complete indexing and search solution.
Installation Prerequisites
Before deploying these microservices, ensure you have access to the following dependencies:
ElasticSearch
Required for document storage and search functionality
- Can use the same ElasticSearch instance as the core deployment
- Stores indexed content and search metadata
- Provides the search functionality backend
Redis
Required for inter-service communication
- Can use the same Redis instance as the core deployment
- Manages crawl queues and job scheduling
- Facilitates communication between services
- Stores temporary processing data
Configuration
Environment Variables
Configure the Crawler and SearchEngine microservices with the following environment variables:
Common Environment Variables
| Variable Name | Description | Default Value | Affected Services |
|---|---|---|---|
| `REDIS_URL` | Redis connection URL for communication between services | `redis://localhost:6379` | Both |
| `ELASTIC_SEARCH_URL` | ElasticSearch connection URL for document storage | `localhost` | Both |
Crawler-Specific Environment Variables
| Variable Name | Description | Default Value | Affected Services |
|---|---|---|---|
| `MAX_CONTENT_LEN` | Maximum length (in characters) of crawled documents | `150000` | prismeai-crawler |
| `CONCURRENT_REQUESTS` | Maximum number of concurrent (i.e., simultaneous) requests performed by the Scrapy downloader | `16` | prismeai-crawler |
| `CONCURRENT_REQUESTS_PER_DOMAIN` | Maximum number of concurrent requests performed to any single domain | `16` | prismeai-crawler |
| `DOWNLOAD_DELAY` | Minimum number of seconds to wait between two consecutive requests to the same domain | `0` | prismeai-crawler |
| `REQUEST_QUEUES_POLLING_INTERVAL` | Interval in seconds between polls of the request queue | `5` | prismeai-crawler |
| `REQUEST_QUEUES_POLLING_SIZE` | Number of requests pulled from the queue in a single poll | `1` | prismeai-crawler |
`ELASTIC_SEARCH_URL` can be set to an empty string (`''`), which prevents webpage content from being saved, effectively deactivating the search functionality while still allowing crawling.
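For reference, here is a minimal sketch of how these variables could be set on the crawler container in a Kubernetes deployment. The image reference, hostnames, and values are illustrative; adjust them to your environment:

```yaml
# Illustrative container spec fragment -- adjust names, image, and URLs to your deployment
containers:
  - name: prismeai-crawler
    image: registry.example.com/prismeai-crawler:latest  # hypothetical image reference
    env:
      - name: REDIS_URL
        value: "redis://redis:6379"
      - name: ELASTIC_SEARCH_URL
        value: "http://elasticsearch:9200"
      - name: CONCURRENT_REQUESTS
        value: "16"
      - name: CONCURRENT_REQUESTS_PER_DOMAIN
        value: "8"
      - name: DOWNLOAD_DELAY
        value: "1"
```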
Resource Considerations
When planning your deployment, consider these resource recommendations:
Memory Requirements
- Crawler: Min 1GB, recommended 2GB+
- SearchEngine: Min 1GB, recommended 2GB+
- Scale based on crawl volume and index size
CPU Allocation
- Crawler: Min 0.5 vCPU, recommended 1+ vCPU
- SearchEngine: Min 0.5 vCPU, recommended 1+ vCPU
- Consider additional resources for high request volumes
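These memory and CPU figures translate naturally into Kubernetes resource requests and limits. A minimal sketch, using the minimums as requests and the recommended figures as limits (starting points to tune against your crawl volume and index size):

```yaml
# Illustrative resource settings, applicable to either service
resources:
  requests:
    cpu: 500m     # minimum 0.5 vCPU
    memory: 1Gi   # minimum 1GB
  limits:
    cpu: "1"      # recommended 1+ vCPU
    memory: 2Gi   # recommended 2GB+
```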
Storage Needs
- ElasticSearch: Plan for index growth based on content volume
- Redis: Minimal requirements for queue management
- Consider storage class with good I/O performance
Network Configuration
- Internet access for the Crawler service
- Internal network access between services
- Consider bandwidth requirements for crawl activities
Deployment Process
Follow these steps to deploy the Crawler and SearchEngine microservices:
Configure Dependencies
Ensure ElasticSearch and Redis are accessible:
- Verify the ElasticSearch connection:
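For example, assuming ElasticSearch is reachable at `elasticsearch:9200` inside the cluster (substitute the host from your `ELASTIC_SEARCH_URL`):

```bash
curl -s http://elasticsearch:9200
```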
You should receive a JSON response containing version and cluster information.
- Verify the Redis connection:
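For example, assuming Redis is reachable at `redis:6379` (substitute the host and port from your `REDIS_URL`):

```bash
redis-cli -h redis -p 6379 ping
```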
The response should be `PONG`.
Deploy Microservices
Deploy both services using Helm as described in the Self-Hosting Apps Microservices guide.
Ensure both services are included in your `values.yaml` configuration:
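The exact keys depend on your chart version; the following is a minimal sketch, assuming the apps chart exposes a per-service block with an enable toggle and environment configuration:

```yaml
# Illustrative values.yaml fragment -- key names may differ in your chart version
prismeai-crawler:
  enabled: true
  env:
    REDIS_URL: "redis://redis:6379"
    ELASTIC_SEARCH_URL: "http://elasticsearch:9200"

prismeai-searchengine:
  enabled: true
  env:
    REDIS_URL: "redis://redis:6379"
    ELASTIC_SEARCH_URL: "http://elasticsearch:9200"
```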
Verify Deployment
Check that both services are running correctly:
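For example, if the services run in a namespace named `prismeai-apps` (adjust to your deployment):

```bash
kubectl get pods -n prismeai-apps | grep -E 'prismeai-(crawler|searchengine)'
```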
Both services should show a `Running` status and be ready (e.g., `1/1`).
Configure Network Access
Ensure the services can access:
- ElasticSearch and Redis internally
- Internet access for the Crawler service
- Access from other Prisme.ai services that will use search functionality
Microservice Testing
After deploying the Crawler and SearchEngine microservices, verify their operation with these steps:
Create a Test SearchEngine
Create a searchengine instance to crawl a test website:
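The exact route and payload shape depend on your deployment and API version; the following is an illustrative sketch assuming the SearchEngine service is reachable at `http://prismeai-searchengine` inside the cluster:

```bash
# Hypothetical request shape -- consult your API reference for the exact route and fields
curl -s -X POST http://prismeai-searchengine/search-engines \
  -H "Content-Type: application/json" \
  -d '{
    "name": "test-engine",
    "websites": ["https://docs.prisme.ai"]
  }'
```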
If successful, you should receive a complete searchengine object that includes an `id` field.
Check Crawl Progress
After a few seconds, check the crawl history and statistics:
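Using the same illustrative base URL as above, and substituting the `id` returned at creation:

```bash
# Hypothetical route -- replace <searchengine-id> with the id from the creation response
curl -s http://prismeai-searchengine/search-engines/<searchengine-id>/stats
```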
Verify that:
- The `metrics.indexed_pages` field is greater than 0
- The `metrics.pending_requests` field indicates active crawling
- The `crawl_history` section shows pages that have been processed
Test Search Functionality
Perform a test search query to verify indexing and search:
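Again using the illustrative base URL, a search request might look like this:

```bash
# Hypothetical route and payload -- adjust to your API reference
curl -s -X POST http://prismeai-searchengine/search-engines/<searchengine-id>/search \
  -H "Content-Type: application/json" \
  -d '{"query": "getting started"}'
```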
The response should include a `results` array containing pages from the crawled website that match your search term. Each result should include relevance information and content snippets.
If all tests pass, congratulations! Your Crawler and SearchEngine microservices are up and running correctly.
Features and Capabilities
Web Crawling
The Crawler service provides advanced web content discovery and extraction:
- Configurable crawl depth: Control how many links deep the crawler will explore
- URL filtering: Include or exclude specific URL patterns
- Rate limiting: Respect website terms of service with configurable crawl rates
- Content extraction: Parse and clean HTML to extract meaningful content
- Metadata extraction: Capture titles, descriptions, and other metadata
- Scheduled crawls: Set up periodic recrawling to keep content fresh
- Robots.txt compliance: Respect website crawling policies
Search Capabilities
The SearchEngine service delivers powerful search functionality:
- Full-text search: Find content across all indexed documents
- Relevance ranking: Surface the most relevant content first
- Content highlighting: Highlight matching terms in search results
- Faceted search: Filter results by metadata fields
- Synonym handling: Find content using related terms
- Language support: Index and search content in multiple languages
- Query suggestions: Support for “did you mean” functionality
- Result snippets: Show context around matching terms
Integration with Prisme.ai
The Crawler and SearchEngine microservices integrate with other Prisme.ai components:
AI Knowledge
- Create knowledge bases from crawled web content
- Enrich existing knowledge bases with web information
- Use search capabilities for better information retrieval
AI Builder
- Build custom search interfaces using search API
- Integrate search results into workflows
- Trigger crawls programmatically in automations
AI Store
- Power research agents with web crawling capabilities
- Create domain-specific search tools
- Develop content discovery applications
Custom Code
- Extend crawling behavior with custom functions
- Process search results with specialized logic
- Create advanced search and discovery experiences
Advanced Configuration
Crawl Configuration Options
When creating a searchengine, you can specify advanced crawl options:
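The field names below are illustrative rather than the definitive schema; consult your API reference for the exact option names:

```json
{
  "name": "docs-engine",
  "websites": ["https://docs.example.com"],
  "crawl_options": {
    "max_depth": 3,
    "include_patterns": ["^https://docs\\.example\\.com/.*"],
    "exclude_patterns": [".*/private/.*"],
    "recrawl_interval_hours": 24
  }
}
```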
These options allow you to fine-tune crawling behavior for different use cases.
ElasticSearch Index Management
The services automatically create and manage ElasticSearch indices. For advanced use cases, you can:
- Configure index settings like sharding and replication
- Set up index lifecycle policies for managing index growth
- Implement custom analyzers for specialized search needs
- Configure cross-cluster search for large-scale deployments
Consult the ElasticSearch documentation for more information on these advanced configurations.
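As one concrete example, the replica count of an existing index can be changed through the standard ElasticSearch settings API (the index name here is illustrative):

```bash
curl -s -X PUT http://elasticsearch:9200/searchengine-index/_settings \
  -H "Content-Type: application/json" \
  -d '{"index": {"number_of_replicas": 1}}'
```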
Performance Tuning
To optimize performance for your specific needs:
- Adjust `MAX_CONTENT_LEN` to balance comprehensiveness with resource usage
- Configure crawler concurrency settings for faster crawling
- Implement ElasticSearch performance optimizations
- Consider Redis caching strategies for frequent searches
- Use horizontal scaling for high-volume crawling and search scenarios
Troubleshooting
Crawling Issues
Symptom: Web pages are not being crawled or indexed
Possible causes:
- Network connectivity issues
- Website robots.txt restrictions
- Rate limiting by target websites
- URL pattern configuration excluding relevant pages
Resolution steps:
- Check crawler logs for specific error messages
- Verify network connectivity to target websites
- Review website robots.txt for restrictions
- Adjust crawl rate settings to avoid being blocked
- Check URL pattern configurations
Search Problems
Symptom: Search results are missing or irrelevant
Possible causes:
- Content not properly indexed
- ElasticSearch configuration issues
- Query formatting problems
- Content exceeding maximum length limits
Resolution steps:
- Verify content was successfully crawled and indexed
- Check ElasticSearch connectivity and health
- Review search query format and parameters
- Check if content exceeds the `MAX_CONTENT_LEN` setting
- Test simple queries to validate basic functionality
Performance Issues
Symptom: Slow crawling or search response times
Possible causes:
- Insufficient resources allocated
- ElasticSearch performance problems
- Redis bottlenecks
- Large crawl queues or index sizes
Resolution steps:
- Monitor resource usage during operations
- Check ElasticSearch performance metrics
- Verify Redis isn’t running out of memory
- Consider scaling resources horizontally or vertically
- Implement more targeted crawling strategies
Security Considerations
When deploying and using the Crawler and SearchEngine microservices, keep these security considerations in mind:
Network Security
- Implement appropriate network policies
- Consider using a dedicated proxy for outbound crawling
- Monitor for unusual traffic patterns
Content Security
- Be mindful of crawling and indexing sensitive content
- Implement URL patterns to exclude sensitive areas
- Consider content filtering before indexing
Authentication
- Secure ElasticSearch and Redis with strong authentication
- Implement API access controls for search endpoints
- Use TLS for all service communications
Compliance
- Respect website terms of service when crawling
- Consider data retention policies for crawled content
- Be aware of copyright implications of content indexing
For any issues or questions during the deployment process, contact support@prisme.ai for assistance.