Crawler & SearchEngine Microservices
Deploy and configure the Prisme.ai web crawling and search engine capabilities for knowledge base creation and content discovery
The Crawler and SearchEngine microservices work together to provide powerful web content indexing and search capabilities for your Prisme.ai platform. These services enable the creation of knowledge bases from web content, support external data integration, and power the search functionality across your applications.
Overview
The Crawler and SearchEngine microservices function as a pair to deliver comprehensive web content processing:
prismeai-crawler
Discovers, fetches, and processes web content from specified sources, managing crawl schedules and content extraction
prismeai-searchengine
Indexes processed content and provides search capabilities with relevance ranking and content highlighting
These services must be deployed together: you cannot use one without the other, as they form a complete indexing and search solution.
Installation Prerequisites
Before deploying these microservices, ensure you have access to the following dependencies:
ElasticSearch
Required for document storage and search functionality
- Can use the same ElasticSearch instance as the core deployment
- Stores indexed content and search metadata
- Provides the search functionality backend
Redis
Required for inter-service communication
- Can use the same Redis instance as the core deployment
- Manages crawl queues and job scheduling
- Facilitates communication between services
- Stores temporary processing data
Configuration
Environment Variables
Configure the Crawler and SearchEngine microservices with the following environment variables:
ELASTIC_SEARCH_URL
Connection URL of the ElasticSearch instance used to store indexed content. It can be set to an empty string (''), which prevents webpage content from being saved, effectively deactivating the search functionality while still allowing crawling.
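For example, in a container environment the variable might be set as follows; the URL is a placeholder for your own cluster address:

```bash
# Point both services at your ElasticSearch cluster (placeholder URL).
export ELASTIC_SEARCH_URL='http://elasticsearch.example.internal:9200'

# Setting it to an empty string disables content persistence (and therefore
# search) while still allowing crawling:
# export ELASTIC_SEARCH_URL=''
```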
Resource Considerations
When planning your deployment, consider these resource recommendations (a sample Kubernetes resource block follows the list):
Memory Requirements
- Crawler: Min 1GB, recommended 2GB+
- SearchEngine: Min 1GB, recommended 2GB+
- Scale based on crawl volume and index size
CPU Allocation
- Crawler: Min 0.5 vCPU, recommended 1+ vCPU
- SearchEngine: Min 0.5 vCPU, recommended 1+ vCPU
- Consider additional resources for high request volumes
Storage Needs
- ElasticSearch: Plan for index growth based on content volume
- Redis: Minimal requirements for queue management
- Consider storage class with good I/O performance
Network Configuration
- Internet access for the Crawler service
- Internal network access between services
- Consider bandwidth requirements for crawl activities
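As a rough sketch, the memory and CPU recommendations above could translate into the following Kubernetes resource settings. The top-level keys are illustrative assumptions; adjust them to your Helm chart's actual values schema:

```yaml
# Illustrative resource requests/limits matching the recommendations above.
# Key names are assumptions; adapt them to your chart's values schema.
prismeai-crawler:
  resources:
    requests:
      cpu: 500m     # minimum 0.5 vCPU
      memory: 1Gi   # minimum 1GB
    limits:
      cpu: "1"      # recommended 1+ vCPU
      memory: 2Gi   # recommended 2GB+
prismeai-searchengine:
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: "1"
      memory: 2Gi
```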
Deployment Process
Follow these steps to deploy the Crawler and SearchEngine microservices:
Configure Dependencies
Ensure ElasticSearch and Redis are accessible:
- Verify the ElasticSearch connection. You should receive a response with version and cluster information.
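For example (the host below is a placeholder for your own ElasticSearch endpoint; add credentials if your cluster requires authentication):

```bash
curl http://elasticsearch.example.internal:9200
```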
- Verify the Redis connection. The response should be `PONG`.
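For example (host and port are placeholders; add `-a <password>` if your instance is secured):

```bash
redis-cli -h redis.example.internal -p 6379 ping
```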
Deploy Microservices
Deploy both services using Helm as described in the Self-Hosting Apps Microservices guide. Ensure both services are included in your `values.yaml` configuration:
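A minimal sketch, assuming the chart exposes one key per microservice (check your chart's documentation for the exact schema):

```yaml
# Illustrative values.yaml excerpt: include both microservices.
prismeai-crawler:
  enabled: true
prismeai-searchengine:
  enabled: true
```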
Verify Deployment
Check that both services are running correctly:
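For example, with kubectl (the namespace name is an assumption; substitute your own):

```bash
kubectl get pods -n prismeai-apps | grep -E 'crawler|searchengine'
```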
Both services should show `Running` status and be ready (e.g., `1/1`).
Configure Network Access
Ensure the following connectivity (a NetworkPolicy sketch follows the list):
- Internal access from both services to ElasticSearch and Redis
- Outbound internet access for the Crawler service
- Inbound access from other Prisme.ai services that will use the search functionality
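One way to express part of these rules in Kubernetes is a NetworkPolicy. The sketch below (namespace and label values are assumptions) grants the crawler pods unrestricted egress for outbound crawling:

```yaml
# Illustrative policy: unrestricted egress for the crawler pods.
# Namespace and labels are assumptions; adapt them to your deployment.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: crawler-egress
  namespace: prismeai-apps
spec:
  podSelector:
    matchLabels:
      app: prismeai-crawler
  policyTypes:
    - Egress
  egress:
    - {}  # allow all outbound traffic (internet plus internal services)
```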
Microservice Testing
After deploying the Crawler and SearchEngine microservices, verify their operation with these steps:
Create a Test SearchEngine
Create a searchengine instance to crawl a test website:
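The exact endpoint and payload depend on your deployed version; the request below is an illustrative sketch only (the hostname, path, and body fields are assumptions), so consult the API reference for the real schema:

```bash
# Hypothetical request: create a searchengine that crawls a test website.
curl -X POST 'http://prismeai-searchengine/search-engines' \
  -H 'Content-Type: application/json' \
  -d '{"websites": ["https://docs.prisme.ai"]}'
```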
If successful, you should receive a complete searchengine object that includes an `id` field.
Check Crawl Progress
After a few seconds, check the crawl history and statistics:
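Again as a hypothetical sketch (the path is an assumption; substitute the `id` returned by the previous step):

```bash
curl 'http://prismeai-searchengine/search-engines/<id>/stats'
```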
Verify that:
- The `metrics.indexed_pages` field is greater than 0
- The `metrics.pending_requests` field indicates active crawling
- The `crawl_history` section shows pages that have been processed
Test Search Functionality
Perform a test search query to verify indexing and search:
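As before, a hypothetical sketch (path and parameters are assumptions; adjust them to the real API schema):

```bash
curl -X POST 'http://prismeai-searchengine/search-engines/<id>/search' \
  -H 'Content-Type: application/json' \
  -d '{"query": "your search term"}'
```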
The response should include a `results` array containing pages from the crawled website that match your search term. Each result should include relevance information and content snippets.
If all tests pass, congratulations! Your Crawler and SearchEngine microservices are up and running correctly.
Features and Capabilities
Integration with Prisme.ai
The Crawler and SearchEngine microservices integrate with other Prisme.ai components:
AI Knowledge
- Create knowledge bases from crawled web content
- Enrich existing knowledge bases with web information
- Use search capabilities for better information retrieval
AI Builder
- Build custom search interfaces using search API
- Integrate search results into workflows (see the sketch at the end of this section)
- Trigger crawls programmatically in automations
AI Store
- Power research agents with web crawling capabilities
- Create domain-specific search tools
- Develop content discovery applications
Custom Code
- Extend crawling behavior with custom functions
- Process search results with specialized logic
- Create advanced search and discovery experiences
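To illustrate the AI Builder integration above, here is a hedged sketch of an automation that forwards a user query to the search API using the generic `fetch` instruction. The endpoint URL, path, and body fields are placeholders, not the service's confirmed API:

```yaml
# Illustrative AI Builder automation: forward a user query to the search API.
# The endpoint URL and body fields are assumptions.
slug: search-crawled-content
name: Search crawled content
do:
  - fetch:
      url: 'http://prismeai-searchengine/search-engines/<id>/search'
      method: post
      body:
        query: '{{payload.query}}'
      output: searchResults
  - set:
      name: output
      value: '{{searchResults}}'
```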
Security Considerations
When deploying and using the Crawler and SearchEngine microservices, keep these security considerations in mind:
Network Security
- Implement appropriate network policies
- Consider using a dedicated proxy for outbound crawling
- Monitor for unusual traffic patterns
Content Security
- Be mindful of crawling and indexing sensitive content
- Implement URL patterns to exclude sensitive areas
- Consider content filtering before indexing
Authentication
- Secure ElasticSearch and Redis with strong authentication
- Implement API access controls for search endpoints
- Use TLS for all service communications
Compliance
- Respect website terms of service when crawling
- Consider data retention policies for crawled content
- Be aware of copyright implications of content indexing
For any issues or questions during the deployment process, contact support@prisme.ai for assistance.