Overview
The Crawler and SearchEngine microservices function as a pair to deliver comprehensive web content processing:

prismeai-crawler
Discovers, fetches, and processes web content from specified sources, managing crawl schedules and content extraction
prismeai-searchengine
Indexes processed content and provides search capabilities with relevance ranking and content highlighting
These services must be deployed together - you cannot use one without the other as they form a complete indexing and search solution.
Installation Prerequisites
Before deploying these microservices, ensure you have access to the following dependencies:

ElasticSearch
Required for document storage and search functionality
- Can use the same ElasticSearch instance as the core deployment
- Stores indexed content and search metadata
- Provides the search functionality backend
Redis
Required for inter-service communication
- Can use the same Redis instance as the core deployment
- Manages crawl queues and job scheduling
- Facilitates communication between services
- Stores temporary processing data
Configuration
Environment Variables
Configure the Crawler and SearchEngine microservices with the following environment variables:
Common Environment Variables
| Variable Name | Description | Default Value | Affected Services |
|---|---|---|---|
| REDIS_URL | Redis connection URL for communication between services | redis://localhost:6379 | Both |
| ELASTIC_URL | ElasticSearch connection URL for document storage | http://localhost:9200 | Both |
| ELASTIC_USER | ElasticSearch user | (empty) | Both |
| ELASTIC_PASSWORD | ElasticSearch password | (empty) | Both |
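These are plain environment variables, so they can be supplied however your deployment injects configuration. For a quick local sanity check you might export example values like the following (the hosts and credentials below are placeholders, not defaults):

```bash
# Placeholder values only: point these at your own Redis and ElasticSearch instances
export REDIS_URL="redis://redis.internal.example:6379"
export ELASTIC_URL="http://elasticsearch.internal.example:9200"
export ELASTIC_USER="elastic"
export ELASTIC_PASSWORD="change-me"
```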
Crawler-Specific Environment Variables
| Variable Name | Description | Default Value | Affected Services |
|---|---|---|---|
| MAX_CONTENT_LEN | Maximum length (in characters) of documents crawled | 150000 | prismeai-crawler |
| CONCURRENT_REQUESTS | The maximum number of concurrent (i.e. simultaneous) requests performed by the Scrapy downloader | 16 | prismeai-crawler |
| CONCURRENT_REQUESTS_PER_DOMAIN | The maximum number of concurrent (i.e. simultaneous) requests performed to any single domain | 16 | prismeai-crawler |
| DOWNLOAD_DELAY | Minimum seconds to wait between 2 consecutive requests to the same domain | 0 | prismeai-crawler |
| REQUEST_QUEUES_POLLING_INTERVAL | Interval in seconds between each poll of the Redis queue for new requests to process. Lower values increase responsiveness but also Redis load. | 2 | prismeai-crawler |
| REQUEST_QUEUES_POLLING_SIZE | Number of requests to start from the queue in a single poll | 1 | prismeai-crawler |
| USER_AGENT | Crawler HTTP user agent | Prisme.ai (https://prisme.ai) | prismeai-crawler |
| ROBOTSTXT_OBEY | Whether the crawler should respect the site's robots.txt. Recommended: True for all public websites. For internal portals (e.g. SharePoint, intranet), you may set False if robots.txt prevents access to content you are authorized to crawl. ⚠️ Disabling this option should only be done in controlled/internal environments. | True | prismeai-crawler |
ACK & Reliability Environment Variables
These settings control the acknowledgment mechanism that ensures zero data loss when the crawler crashes during request processing.
| Variable Name | Description | Default Value | Affected Services |
|---|---|---|---|
| ACK_ENABLED | Enable/disable the request acknowledgment mechanism. When enabled, requests are tracked atomically in Redis and automatically reclaimed if the crawler crashes during processing. Disabling this reverts to the original behavior where crashed requests are lost. | True | prismeai-crawler |
| PENDING_VISIBILITY_TIMEOUT | Time in seconds before a pending request is considered stale and eligible for reclaim. Set this to 2-3x your longest document processing time (e.g., large PDFs). If a request takes longer than this timeout, it will be retried. | 900 (15 minutes) | prismeai-crawler |
| PENDING_RECLAIM_INTERVAL | Interval in seconds between checks for stale pending requests. The reclaim loop runs at this frequency to find and re-queue abandoned tasks. | 60 | prismeai-crawler |
Resource Throttling Environment Variables
These settings control adaptive resource monitoring to prevent OOM (Out of Memory) crashes through CPU and memory monitoring. The crawler will pause polling when resources are constrained and resume when they recover.
Thresholds are percentages of container limits, not system resources. Limits are auto-detected from cgroup files (cgroup v1/v2). For restricted Kubernetes clusters where cgroup detection fails, you can set
CONTAINER_CPU_LIMIT_MILLICORES and CONTAINER_MEMORY_LIMIT_BYTES as fallbacks.

| Variable Name | Description | Default Value | Affected Services |
|---|---|---|---|
| RESOURCE_THROTTLING_ENABLED | Enable/disable resource-based throttling. When enabled, the crawler monitors memory and CPU usage and pauses polling when thresholds are exceeded to prevent OOM crashes. | True | prismeai-crawler |
| RESOURCE_THROTTLING_MODE | What resources to monitor. Options: memory_only (recommended - Kubernetes handles CPU throttling effectively), cpu_only, both, none. Memory monitoring is critical as memory exhaustion leads to OOMKilled. | memory_only | prismeai-crawler |
| MEMORY_WARNING_THRESHOLD | Memory usage percentage (of container limit) above which the crawler pauses polling for new requests. Existing requests continue processing. | 80 | prismeai-crawler |
| MEMORY_CRITICAL_THRESHOLD | Memory usage percentage (of container limit) above which the crawler stops all processing. This is a last-resort protection against OOM. | 95 | prismeai-crawler |
| CPU_WARNING_THRESHOLD | CPU usage percentage (of container limit) above which the crawler pauses polling. Only applies when RESOURCE_THROTTLING_MODE includes CPU monitoring. | 85 | prismeai-crawler |
| CPU_CRITICAL_THRESHOLD | CPU usage percentage (of container limit) above which the crawler stops all processing. Only applies when RESOURCE_THROTTLING_MODE includes CPU monitoring. | 95 | prismeai-crawler |
| RESOURCE_CHECK_INTERVAL | Interval in seconds between resource usage checks. Lower values provide faster response to resource pressure but increase overhead. | 1.0 | prismeai-crawler |
| RESOURCE_CIRCUIT_BREAKER_THRESHOLD | Number of consecutive resource check failures before escalating to a CRITICAL alert. This helps identify persistent resource issues vs. temporary spikes. | 10 | prismeai-crawler |
| RESOURCE_CIRCUIT_BREAKER_SHUTDOWN_TIMEOUT | Seconds the crawler can remain stuck in critical state before initiating automatic pod shutdown. The pod exits gracefully, allowing Kubernetes to restart it with fresh state. Set to 0 to disable auto-shutdown. | 900 (15 minutes) | prismeai-crawler |
| RESOURCE_CIRCUIT_BREAKER_FORCE_EXIT_TIMEOUT | Hard timeout in seconds for graceful reactor shutdown during circuit breaker auto-shutdown. If the reactor doesn't stop within this time, os._exit(1) is called to force termination. | 60 | prismeai-crawler |
Document Processing Environment Variables
These settings control how documents (PDFs, DOCX, etc.) are parsed and processed by the crawler.
| Variable Name | Description | Default Value | Affected Services |
|---|---|---|---|
| DOCUMENTS_DEFAULT_PARSER | Default parser for document processing. Options: unstructured (recommended, uses the Unstructured library), tika (Apache Tika - requires TIKA_PATH to be configured), docling (IBM Docling - requires HUGGINGFACE_MODEL_PATH). Can be overridden per-searchengine via parser configuration. | unstructured | prismeai-crawler |
| UNSTRUCTURED_DEFAULT_STRATEGY | Default parsing strategy for the Unstructured library. Options: fast (no OCR, fastest), auto (automatic detection), hi_res (full OCR processing, slowest but most accurate for scanned documents), ocr_only (OCR only). Can be overridden per-searchengine. | fast | prismeai-crawler |
| TIKA_OCR_SKIP | Skip OCR (Tesseract) processing when using the Apache Tika parser. Set to True for faster parsing at the cost of not extracting text from images within documents. Similar to Unstructured's fast strategy. Can be overridden per-searchengine using parser config: {"type": "tika", "skip_ocr": true}. | True | prismeai-crawler |
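For the per-searchengine override mentioned in the table, the parser configuration is a small JSON object; for example, to use Tika without OCR:

```json
{ "type": "tika", "skip_ocr": true }
```

The same override mechanism applies to DOCUMENTS_DEFAULT_PARSER and UNSTRUCTURED_DEFAULT_STRATEGY, which can also be set per searchengine as noted above.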
Resource Considerations
When planning your deployment, consider these resource recommendations:

Memory Requirements
- Crawler: Min 1GB, recommended 2GB+
- SearchEngine: Min 1GB, recommended 2GB+
- Scale based on crawl volume and index size
CPU Allocation
- Crawler: Min 0.5 vCPU, recommended 1+ vCPU
- SearchEngine: Min 0.5 vCPU, recommended 1+ vCPU
- Consider additional resources for high request volumes
Storage Needs
- ElasticSearch: Plan for index growth based on content volume
- Redis: Minimal requirements for queue management
- Consider storage class with good I/O performance
Network Configuration
- Internet access for the Crawler service
- Internal network access between services
- Consider bandwidth requirements for crawl activities
Deployment Process
Follow these steps to deploy the Crawler and SearchEngine microservices:

1. Configure Dependencies
Ensure ElasticSearch and Redis are accessible:

Verify the ElasticSearch connection, for example:
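```bash
# Replace the URL and credentials with your own ELASTIC_URL, ELASTIC_USER and ELASTIC_PASSWORD values
curl -u "$ELASTIC_USER:$ELASTIC_PASSWORD" "$ELASTIC_URL"
```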
You should receive a response with version and cluster information
Verify the Redis connection, for example:
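```bash
# Adjust host and port to match your REDIS_URL (add -a <password> if your Redis requires authentication)
redis-cli -h localhost -p 6379 ping
```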
The response should be “PONG”
2. Deploy Microservices

Deploy both services using Helm as described in the Self-Hosting Apps Microservices guide. Ensure both services are included in your values.yaml configuration:
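A minimal sketch of what that might look like, assuming the chart exposes one block per service (the exact keys depend on your chart version, so treat the structure below as illustrative):

```yaml
# Illustrative structure only: check your chart's documentation for the actual keys
prismeai-crawler:
  enabled: true
  env:
    REDIS_URL: redis://redis.core.svc.cluster.local:6379
    ELASTIC_URL: http://elasticsearch.core.svc.cluster.local:9200
prismeai-searchengine:
  enabled: true
  env:
    REDIS_URL: redis://redis.core.svc.cluster.local:6379
    ELASTIC_URL: http://elasticsearch.core.svc.cluster.local:9200
```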
3. Verify Deployment

Check that both services are running correctly:
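For example, on Kubernetes (the namespace below is illustrative; use the one from your installation):

```bash
# Namespace is illustrative; adjust to your deployment
kubectl get pods -n prismeai-apps | grep -E 'crawler|searchengine'
```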
Both services should show Running status and be ready (e.g., 1/1).
4. Configure Network Access
Ensure the services can access:
- ElasticSearch and Redis internally
- Internet access for the Crawler service
- Access from other Prisme.ai services that will use search functionality
Microservice Testing
After deploying the Crawler and SearchEngine microservices, verify their operation with these steps:

1. Create a Test SearchEngine
Create a searchengine instance to crawl a test website:
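The exact route and payload schema are defined by the SearchEngine API; the call below is only a hypothetical sketch of such a request, with placeholder values:

```bash
# Hypothetical: replace the placeholders with the creation route from the API reference
curl -X POST "http://<searchengine-api-url>/<create-endpoint>" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "test-search",
        "urls": ["https://docs.prisme.ai"]
      }'
```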
If successful, you should receive a complete searchengine object that includes an id field.
2. Check Crawl Progress

After a few seconds, check the crawl history and statistics:
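Again, the route below is a hypothetical placeholder; use the statistics endpoint documented in the API reference:

```bash
# Hypothetical: replace the placeholders with the stats route and the id returned at creation
curl "http://<searchengine-api-url>/<stats-endpoint>/<searchengine-id>"
```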
Verify that:

- The metrics.indexed_pages field is greater than 0
- The metrics.pending_requests field indicates active crawling
- The crawl_history section shows pages that have been processed
3. Test Search Functionality

Perform a test search query to verify indexing and search:
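As before, the route and payload are placeholders for the search endpoint documented in the API reference:

```bash
# Hypothetical: replace the placeholders with the search route and the id returned at creation
curl -X POST "http://<searchengine-api-url>/<search-endpoint>/<searchengine-id>" \
  -H "Content-Type: application/json" \
  -d '{ "query": "your search term" }'
```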
The response should include a results array containing pages from the crawled website that match your search term. Each result should include relevance information and content snippets.

Features and Capabilities
Web Crawling
The Crawler service provides advanced web content discovery and extraction:
- Configurable crawl depth: Control how many links deep the crawler will explore
- URL filtering: Include or exclude specific URL patterns
- Rate limiting: Respect website terms of service with configurable crawl rates
- Content extraction: Parse and clean HTML to extract meaningful content
- Metadata extraction: Capture titles, descriptions, and other metadata
- Scheduled crawls: Set up periodic recrawling to keep content fresh
- Robots.txt compliance: Respect website crawling policies
Search Capabilities
The SearchEngine service delivers powerful search functionality:
- Full-text search: Find content across all indexed documents
- Relevance ranking: Surface the most relevant content first
- Content highlighting: Highlight matching terms in search results
- Faceted search: Filter results by metadata fields
- Synonym handling: Find content using related terms
- Language support: Index and search content in multiple languages
- Query suggestions: Support for “did you mean” functionality
- Result snippets: Show context around matching terms
Integration with Prisme.ai
The Crawler and SearchEngine microservices integrate with other Prisme.ai components:

AI Knowledge
- Create knowledge bases from crawled web content
- Enrich existing knowledge bases with web information
- Use search capabilities for better information retrieval
AI Builder
- Build custom search interfaces using search API
- Integrate search results into workflows
- Trigger crawls programmatically in automations
AI Store
- Power research agents with web crawling capabilities
- Create domain-specific search tools
- Develop content discovery applications
Custom Code
- Extend crawling behavior with custom functions
- Process search results with specialized logic
- Create advanced search and discovery experiences
Advanced Configuration
Crawl Configuration Options
When creating a searchengine, you can specify advanced crawl options. These options allow you to fine-tune crawling behavior for different use cases.
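As an illustration, such options typically cover crawl depth, URL filtering, and rate limits, as described under Features and Capabilities; the field names below are hypothetical, so consult the API reference for the actual schema:

```yaml
# Hypothetical field names for illustration; consult the API reference for the actual schema
name: docs-crawl
urls:
  - https://docs.example.com
max_depth: 3
exclude_patterns:
  - /admin/.*
download_delay: 1
```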
ElasticSearch Index Management
The services automatically create and manage ElasticSearch indices. For advanced use cases, you can:
- Configure index settings like sharding and replication
- Set up index lifecycle policies for managing index growth
- Implement custom analyzers for specialized search needs
- Configure cross-cluster search for large-scale deployments
Performance Tuning
To optimize performance for your specific needs:
- Adjust MAX_CONTENT_LEN to balance comprehensiveness with resource usage
- Configure crawler concurrency settings for faster crawling
- Implement ElasticSearch performance optimizations
- Consider Redis caching strategies for frequent searches
- Use horizontal scaling for high-volume crawling and search scenarios
Troubleshooting
Crawling Issues
Symptom: Web pages are not being crawled or indexed

Possible causes:
- Network connectivity issues
- Website robots.txt restrictions
- Rate limiting by target websites
- URL pattern configuration excluding relevant pages
Solutions:

- Check crawler logs for specific error messages
- Verify network connectivity to target websites
- Review website robots.txt for restrictions
- Adjust crawl rate settings to avoid being blocked
- Check URL pattern configurations
Search Problems
Symptom: Search results are missing or irrelevant

Possible causes:
- Content not properly indexed
- ElasticSearch configuration issues
- Query formatting problems
- Content exceeding maximum length limits
Solutions:

- Verify content was successfully crawled and indexed
- Check ElasticSearch connectivity and health
- Review search query format and parameters
- Check if content exceeds the MAX_CONTENT_LEN setting
- Test simple queries to validate basic functionality
Performance Issues
Symptom: Slow crawling or search response times

Possible causes:
- Insufficient resources allocated
- ElasticSearch performance problems
- Redis bottlenecks
- Large crawl queues or index sizes
Solutions:

- Monitor resource usage during operations
- Check ElasticSearch performance metrics
- Verify Redis isn’t running out of memory
- Consider scaling resources horizontally or vertically
- Implement more targeted crawling strategies
Security Considerations
When deploying and using the Crawler and SearchEngine microservices, keep these security considerations in mind:

Network Security
- Implement appropriate network policies
- Consider using a dedicated proxy for outbound crawling
- Monitor for unusual traffic patterns
Content Security
- Be mindful of crawling and indexing sensitive content
- Implement URL patterns to exclude sensitive areas
- Consider content filtering before indexing
Authentication
- Secure ElasticSearch and Redis with strong authentication
- Implement API access controls for search endpoints
- Use TLS for all service communications
Compliance
- Respect website terms of service when crawling
- Consider data retention policies for crawled content
- Be aware of copyright implications of content indexing