Deploy and configure the Prisme.ai web crawling and search engine capabilities for knowledge base creation and content discovery
The Crawler and SearchEngine microservices work together to provide powerful web content indexing and search capabilities for your Prisme.ai platform. These services enable the creation of knowledge bases from web content, support external data integration, and power the search functionality across your applications.
The Crawler and SearchEngine microservices function as a pair to deliver comprehensive web content processing:
Discovers, fetches, and processes web content from specified sources, managing crawl schedules and content extraction
Indexes processed content and provides search capabilities with relevance ranking and content highlighting
These services must be deployed together - you cannot use one without the other as they form a complete indexing and search solution.
Before deploying these microservices, ensure you have access to the following dependencies:
Required for document storage and search functionality
Required for inter-service communication
Configure the Crawler and SearchEngine microservices with the following environment variables:
Common Environment Variables
Variable Name | Description | Default Value | Affected Services |
---|---|---|---|
REDIS_URL | Redis connection URL for communication between services | redis://localhost:6379 | Both |
ELASTIC_SEARCH_URL | ElasticSearch connection URL for document storage | localhost | Both |
Crawler-Specific Environment Variables
Variable Name | Description | Default Value | Affected Services |
---|---|---|---|
MAX_CONTENT_LEN | Maximum length (in characters) of documents crawled | 150000 | prismeai-crawler |
CONCURRENT_REQUESTS | The maximum number of concurrent (i.e. simultaneous) requests that will be performed by the Scrapy downloader | 16 | prismeai-crawler |
CONCURRENT_REQUESTS_PER_DOMAIN | The maximum number of concurrent (i.e. simultaneous) requests that will be performed to any single domain. | 16 | prismeai-crawler |
DOWNLOAD_DELAY | Minimum seconds to wait between 2 consecutive requests to the same domain. | 0 | prismeai-crawler |
REQUEST_QUEUES_POLLING_INTERVAL | Interval in seconds between successive polls of the request queue | 5 | prismeai-crawler |
REQUEST_QUEUES_POLLING_SIZE | Number of requests started from the queue in a single poll | 1 | prismeai-crawler |
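For reference, the variables above can be expressed as plain environment exports, for example in a container entrypoint or a local run. The values below are simply the documented defaults; adjust them to your workload (`REDIS_URL` is shared by both services, the rest apply to prismeai-crawler only):

```shell
# Documented defaults, expressed as environment variables.
export REDIS_URL='redis://localhost:6379'        # shared by both services
export MAX_CONTENT_LEN=150000                    # truncate crawled documents beyond this length
export CONCURRENT_REQUESTS=16                    # global Scrapy downloader concurrency
export CONCURRENT_REQUESTS_PER_DOMAIN=16         # per-domain concurrency cap
export DOWNLOAD_DELAY=0                          # seconds between requests to the same domain
export REQUEST_QUEUES_POLLING_INTERVAL=5         # seconds between queue polls
export REQUEST_QUEUES_POLLING_SIZE=1             # requests started per poll
```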
`ELASTIC_SEARCH_URL` can be set to an empty string (`''`), which will prevent webpage content from being saved, effectively deactivating the search functionality while still allowing crawling.
When planning your deployment, consider these resource recommendations:
Follow these steps to deploy the Crawler and SearchEngine microservices:
Configure Dependencies
Ensure ElasticSearch and Redis are accessible:
Verify ElasticSearch connection with:
You should receive a response with version and cluster information
Verify Redis connection with:
The response should be `PONG`.
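The two dependency checks above can be scripted as follows. The endpoints are assumptions for a local setup; substitute the URLs used in your own deployment:

```shell
# Fall back to local endpoints when the variables are not already set.
ES_URL="${ELASTIC_SEARCH_URL:-http://localhost:9200}"
REDIS="${REDIS_URL:-redis://localhost:6379}"

# ElasticSearch should answer with JSON containing cluster and version info.
curl -s "$ES_URL" || echo "ElasticSearch not reachable at $ES_URL"

# Redis should answer with PONG.
redis-cli -u "$REDIS" ping || echo "Redis not reachable at $REDIS"
```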
Deploy Microservices
Deploy both services using Helm as described in the Self-Hosting Apps Microservices guide.
Ensure both services are included in your values.yaml
configuration:
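A minimal sketch of the relevant values.yaml fragment is shown below. The key names are assumptions and may differ between chart versions; check them against the Self-Hosting Apps Microservices guide:

```yaml
# Illustrative only: exact keys depend on your chart version.
prismeai-crawler:
  enabled: true
  env:
    REDIS_URL: redis://redis-master:6379
    ELASTIC_SEARCH_URL: http://elasticsearch:9200
prismeai-searchengine:
  enabled: true
  env:
    REDIS_URL: redis://redis-master:6379
    ELASTIC_SEARCH_URL: http://elasticsearch:9200
```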
Verify Deployment
Check that both services are running correctly:
Both services should show `Running` status and be ready (e.g., `1/1`).
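The pod check can be run with kubectl; the namespace below is an assumption, so replace it with wherever your Helm release is installed:

```shell
# "prismeai-apps" is a placeholder namespace -- use your own.
NAMESPACE="${PRISMEAI_NAMESPACE:-prismeai-apps}"

# List only the crawler and searchengine pods and their readiness.
kubectl get pods -n "$NAMESPACE" \
  | grep -E 'prismeai-(crawler|searchengine)' \
  || echo "kubectl unavailable or pods not found in namespace $NAMESPACE"
```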
Configure Network Access
Ensure the services can access:
After deploying the Crawler and SearchEngine microservices, verify their operation with these steps:
Create a Test SearchEngine
Create a searchengine instance to crawl a test website:
If successful, you should receive a complete searchengine object that includes an `id` field.
Check Crawl Progress
After a few seconds, check the crawl history and statistics:
Verify that:
- The `metrics.indexed_pages` field is greater than 0
- The `metrics.pending_requests` field indicates active crawling
- The `crawl_history` section shows pages that have been processed
Test Search Functionality
Perform a test search query to verify indexing and search:
The response should include a `results` array containing pages from the crawled website that match your search term. Each result should include relevance information and content snippets.
If all tests pass, congratulations! Your Crawler and SearchEngine microservices are up and running correctly.
Web Crawling
The Crawler service provides advanced web content discovery and extraction:
Search Capabilities
The SearchEngine service delivers powerful search functionality:
The Crawler and SearchEngine microservices integrate with other Prisme.ai components:
Crawl Configuration Options
When creating a searchengine, you can specify advanced crawl options:
These options allow you to fine-tune crawling behavior for different use cases.
ElasticSearch Index Management
The services automatically create and manage ElasticSearch indices. For advanced use cases, you can:
Consult the ElasticSearch documentation for more information on these advanced configurations.
Performance Tuning
To optimize performance for your specific needs:
- Adjust `MAX_CONTENT_LEN` to balance comprehensiveness with resource usage
Crawling Issues
Symptom: Web pages are not being crawled or indexed
Possible causes:
Resolution steps:
Search Problems
Symptom: Search results are missing or irrelevant
Possible causes:
Resolution steps:
- Review the `MAX_CONTENT_LEN` setting
Performance Issues
Symptom: Slow crawling or search response times
Possible causes:
Resolution steps:
When deploying and using the Crawler and SearchEngine microservices, keep these security considerations in mind:
For any issues or questions during the deployment process, contact support@prisme.ai for assistance.