The Crawler app is a powerful infrastructure component in the Prisme.ai ecosystem that enables you to extract, process, and utilize content from websites. It transforms web pages and documents (PDF, PPT, and similar formats) into structured data that can be used in your AI solutions, knowledge bases, and automations.

Overview

The Crawler is a specialized microservice provided by Prisme.ai that handles the complex process of web content extraction:

  • Web Content Extraction: Automatically extract content from websites and web pages
  • Content Processing: Transform web content and documents into structured, usable data
  • Selective Crawling: Target specific content through URL patterns and CSS selectors
  • Scheduling: Set up regular crawling jobs to keep information current

This infrastructure app is particularly valuable for creating AI knowledge bases, maintaining up-to-date information, and automating content-based workflows.

Key Features

Extract various types of content from websites:

  • Text Content: Articles, documentation, product information
  • Structured Data: Tables, lists, and other formatted content
  • Metadata: Page titles, descriptions, authors, dates
  • Navigation Structure: Site hierarchies and relationships
  • Links and References: Internal and external connections

The Crawler uses advanced techniques to identify and extract meaningful content while filtering out navigation elements, advertisements, and other non-essential components.
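
For illustration, a single crawled page could be represented as a record along these lines. This is a hypothetical sketch of the kinds of fields involved, not the Crawler's actual output schema:

# Hypothetical example of an extracted page record (field names are illustrative only)
url: https://example.com/docs/getting-started
title: Getting Started
metadata:
  description: Introduction to the product
  lastModified: 2024-01-15
content:
  text: Cleaned article text, with navigation, ads, and boilerplate removed...
  headings:
    - Getting Started
    - Installation
links:
  internal:
    - https://example.com/docs/installation
  external:
    - https://github.com/example/project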

How the Crawler Works

The Crawler follows a systematic process to extract and process web content:

1. Configuration

Define what to crawl and how to process it (a configuration sketch follows this list):

  • Specify starting URLs
  • Set URL patterns for inclusion/exclusion
  • Define content selectors to target the relevant parts of each page
  • Configure authentication if needed
  • Set the crawl periodicity
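
A minimal configuration sketch for this step is shown below. The websiteURL, websites_config, and xpath_filter keys match the example given later on this page; the pattern and scheduling keys are assumptions added purely for illustration and may not match the Crawler's actual option names:

slug: Crawler
config:
  websiteURL:
    - https://www.example.com/
  websites_config:
    example.com:
      # keep only content inside <article> or "main-content" containers
      xpath_filter: '(ancestor::article or ancestor::*[contains(@class, "main-content")])'
      # Hypothetical placeholders for URL filtering and scheduling; check the
      # Crawler reference for the exact option names.
      include_patterns:
        - ^https://example\.com/docs/.*
      exclude_patterns:
        - ^https://example\.com/blog/.*
      schedule: weekly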

2. Discovery

Start with initial URLs and discover additional content:

  • Visit starting pages
  • Identify links to follow
  • Filter links based on patterns
  • Build a crawl frontier
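
As an illustration of this filtering, a snapshot of the resulting crawl frontier might look like the following. This is illustrative only; the Crawler builds and manages the frontier internally:

# Illustrative snapshot of link filtering during discovery (not actual Crawler output)
frontier:                                      # discovered links kept for crawling
  - https://example.com/docs/getting-started
  - https://example.com/docs/configuration
excluded:                                      # discovered links that will not be followed
  - https://example.com/blog/announcement      # matches an exclusion pattern
  - https://partner-site.com/landing           # external domain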

3. Extraction

Visit pages and extract content:

  • Render pages (including JavaScript content)
  • Apply CSS selectors to target specific content
  • Extract text, structured data, and metadata
  • Process and clean the extracted content
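
The per-domain xpath_filter option used in the configuration example later on this page is one way to target content during extraction. For instance, the following filter keeps only nodes inside an <article> element or an element whose class contains "main-content":

websites_config:
  example.com:
    # restrict extraction to the main article area of each page
    xpath_filter: '(ancestor::article or ancestor::*[contains(@class, "main-content")])'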

4. Processing

Transform extracted content into usable formats:

  • Clean and normalize text
  • Identify and preserve structure (headings, lists, etc.)
  • Extract metadata and attributes
  • Categorize and classify content

5. Storage

Store processed content for use in your applications:

  • Save to structured storage
  • Index for search and retrieval
  • Associate with metadata and source information
  • Make available for knowledge bases and automations

This process transforms web content into structured, searchable information that can power your AI applications.

Configuration Options

The Crawler app provides extensive configuration options to tailor its behavior to your specific needs, domain by domain.

Note: do not include the www. prefix in websites_config domain keys.
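
For example, for pages served from https://www.example.com, the websites_config key should be the bare domain:

websites_config:
  example.com:        # correct: bare domain, without the www. prefix
    xpath_filter: '(ancestor::article or ancestor::*[contains(@class, "main-content")])'
  # www.example.com:  # incorrect: the www. prefix must be omitted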

Common Use Cases

The Crawler app enables a wide range of use cases:

Knowledge Base Creation

Build comprehensive AI knowledge bases from website content:

  • Documentation portals
  • Product information sites
  • Support knowledge bases
  • Internal wikis

Content Monitoring

Keep track of changes and updates on important websites:

  • Competitor websites
  • Industry news sources
  • Regulatory publications
  • Product documentation

Data Collection

Gather structured data from web sources:

  • Product catalogs
  • Price information
  • Company directories
  • Research publications

Website to RAG Agent

Transform websites into conversational AI agents:

  • Company websites
  • Documentation portals
  • Educational resources
  • Knowledge repositories

Integration with Other Prisme.ai Products

The Crawler app works seamlessly with other Prisme.ai products:

The Crawler is a primary data source for AI Knowledge:

  • Extract web content for knowledge bases
  • Keep information current through scheduled crawls
  • Process and structure content for optimal retrieval
  • Preserve source attribution for transparency

This integration enables the creation of AI agents that can answer questions based on website content.

Example: Creating a Website RAG Agent

One of the most popular use cases for the Crawler is creating a Retrieval-Augmented Generation (RAG) agent based on website content:

1. Configure the Crawler

Set up the Crawler to extract content from the target website:

slug: Crawler
config:
  websiteURL:
    - https://www.example.com/
  websites_config:
    example.com:
      xpath_filter: '(ancestor::article or ancestor::*[contains(@class, "main-content")])'

2. Create an AI Knowledge Project

Set up a new project in AI Knowledge to house the crawled content:

  • Create a new project
  • Configure embedding and chunk settings
  • Set up appropriate processing options
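
As a purely hypothetical illustration of the kind of settings this step involves (the key names below are assumptions, not the actual AI Knowledge schema):

# Hypothetical illustration of AI Knowledge project settings (key names are assumptions)
project:
  name: Website Knowledge Base
  chunking:
    size: 500        # approximate tokens per chunk
    overlap: 50      # overlap between consecutive chunks
  embeddings:
    model: your-embedding-model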

3. Connect the Crawler to AI Knowledge

Configure the connection between the Crawler and AI Knowledge:

  • Select the crawler as a data source
  • Map crawled content to knowledge base structure
  • Configure metadata and attribute mapping

4. Run the Initial Crawl

Execute the first crawl to populate your knowledge base:

  • Monitor the crawl progress
  • Verify content extraction quality
  • Address any issues or adjustments needed

5. Configure the RAG Agent

Set up an AI agent that uses the crawled content:

  • Configure prompt templates
  • Set retrieval parameters
  • Define response formatting
  • Test with sample questions
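
A hypothetical sketch of the kind of agent settings this step covers (parameter names are assumptions for illustration, not the actual AI Knowledge agent schema):

# Hypothetical illustration of RAG agent settings (parameter names are assumptions)
agent:
  prompt: |
    Answer using only the retrieved website content.
    Cite the source URL of each passage you rely on.
  retrieval:
    topK: 5          # number of chunks retrieved per question
    minScore: 0.7    # minimum similarity score
  response:
    format: markdown
    includeSources: true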

6. Deploy and Monitor

Make the agent available and keep it current:

  • Publish the agent to AI Store
  • Set up scheduled crawls for updates
  • Monitor usage and performance
  • Refine based on feedback

The result is an AI agent that can answer questions based on the content of the website, providing accurate and up-to-date information.

Best Practices

Follow these recommendations to get the most from the Crawler app:

  • Scope crawls with URL patterns and content selectors so only relevant content is extracted
  • Schedule regular crawls to keep knowledge bases and agents current
  • Verify extraction quality after the initial crawl and adjust selectors as needed
  • Respect the terms of service of the websites you crawl
  • Match crawl scope and frequency to your available resources

Limitations and Considerations

When using the Crawler app, be aware of these considerations:

  • JavaScript-Heavy Sites: Some websites rely heavily on JavaScript for content rendering. The Crawler includes JavaScript processing capabilities, but complex single-page applications might present challenges.

  • Authentication Requirements: Websites requiring login can be crawled, but require additional configuration for authentication handling.

  • Legal and Terms of Service: Always ensure you have the right to crawl and use the content from websites, respecting their terms of service.

  • Dynamic Content: Content that changes based on user interaction or personalization may not be fully captured.

  • Resource Intensity: Comprehensive crawling can be resource-intensive. Consider scope and frequency based on your needs and available resources.

Next Steps