Crawler
Extract and process web content for AI knowledge bases and automations
The Crawler app is a powerful infrastructure component in the Prisme.ai ecosystem that enables you to extract, process, and utilize content from websites. It transforms web content and documents (PDF, PPT, etc.) into structured data that can be used in your AI solutions, knowledge bases, and automations.
Overview
The Crawler is a specialized microservice provided by Prisme.ai that handles the complex process of web content extraction:
Web Content Extraction
Automatically extract content from websites and web pages
Content Processing
Transform web content and documents into structured, usable data
Selective Crawling
Target specific content through URL patterns and CSS selectors
Scheduling
Set up regular crawling jobs to keep information current
This infrastructure app is particularly valuable for creating AI knowledge bases, maintaining up-to-date information, and automating content-based workflows.
Key Features
Extract various types of content from websites:
- Text Content: Articles, documentation, product information
- Structured Data: Tables, lists, and other formatted content
- Metadata: Page titles, descriptions, authors, dates
- Navigation Structure: Site hierarchies and relationships
- Links and References: Internal and external connections
The Crawler uses advanced techniques to identify and extract meaningful content while filtering out navigation elements, advertisements, and other non-essential components.
Control exactly what content gets crawled:
- URL Patterns: Include or exclude content based on URL patterns
- CSS Selectors: Target specific page elements by CSS selectors
- Content Types: Filter by content type (text, images, etc.)
- Depth Control: Limit crawling to a specific number of levels
- Rate Limiting: Control crawl speed to be respectful of websites
- Periodicity: Control how often content is re-crawled
These targeting options allow you to focus on the most relevant content for your needs.
Transform extracted content for better usability:
- Content Cleaning: Remove boilerplate text and formatting
- Text Extraction: Convert HTML to clean, usable text
- Structure Preservation: Maintain headings, lists, and tables
- Metadata Extraction: Capture page properties and attributes
- Page Transformation: Convert web pages to Markdown (an LLM-friendly format)
These processing capabilities ensure that the extracted content is ready for use in your AI solutions.
Keep your content current with scheduling options:
- Recurring Crawls: Schedule regular updates
- Incremental Crawling: Focus on new or changed content
- Event-Triggered Crawls: Start crawls based on specific events
- Notification System: Get alerts about crawl status
These scheduling features help maintain the freshness of your knowledge base.
How the Crawler Works
The Crawler follows a systematic process to extract and process web content:
Configuration
Define what to crawl and how to process it:
- Specify starting URLs
- Set URL patterns for inclusion/exclusion
- Define content selectors and configure authentication if needed
- Set the crawl periodicity
Discovery
Start with initial URLs and discover additional content:
- Visit starting pages
- Identify links to follow
- Filter links based on patterns
- Build a crawl frontier
Extraction
Visit pages and extract content:
- Render pages (including JavaScript content)
- Apply CSS selectors to target specific content
- Extract text, structured data, and metadata
- Process and clean the extracted content
Processing
Transform extracted content into usable formats:
- Clean and normalize text
- Identify and preserve structure (headings, lists, etc.)
- Extract metadata and attributes
- Categorize and classify content
Storage
Store processed content for use in your applications:
- Save to structured storage
- Index for search and retrieval
- Associate with metadata and source information
- Make available for knowledge bases and automations
This process transforms web content into structured, searchable information that can power your AI applications.
Configuration Options
The Crawler app provides extensive configuration options to tailor its behavior to your specific needs, domain by domain.
Do not include the www. prefix in websites_config domains.
Target websites
Control which websites are crawled:
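A minimal sketch of such a configuration (the websites_config and websiteURL names come from this page; the first two domains are placeholders and the exact schema may differ):

```yaml
# Illustrative sketch — adapt field names to the Crawler app's actual schema
websites_config:
  example-docs.com:
    websiteURL: https://example-docs.com/
  another-example.org:
    websiteURL: https://another-example.org/
  quotes.toscrape.com:
    # Only URLs under the /tag/love/ path will be followed
    websiteURL: https://quotes.toscrape.com/tag/love/
```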
This configuration starts crawling the three configured domains and follows every discovered URL belonging to these domains and paths.
For https://quotes.toscrape.com/tag/love/, it only follows URLs under the /tag/love/ path: in that case, only https://quotes.toscrape.com/tag/love/page/2/ will be discovered, and any other URL found will be ignored.
URL blacklist
Blacklist specific URL patterns:
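A sketch of such a configuration, assuming a per-domain blacklist field (the name blacklisted_patterns is an assumption):

```yaml
# Illustrative sketch — the blacklist field name is an assumption
websites_config:
  issy-tourisme-international.com:
    websiteURL: https://www.issy-tourisme-international.com/
    blacklisted_patterns:
      - /recherche        # exclude paths beginning with /recherche
  issy.com:
    websiteURL: https://issy.com/
    blacklisted_patterns:
      - /publications     # exclude paths beginning with /publications
      - "aid="            # exclude URLs with the aid= query string parameter
  quotes.toscrape.com:
    websiteURL: https://quotes.toscrape.com/   # no blacklist: crawl everything discovered
```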
This configuration:
- excludes all URL paths beginning with /recherche for www.issy-tourisme-international.com
- excludes all URL paths beginning with /publications, as well as URLs with the aid= query string parameter, for issy.com
- crawls every discovered URL for quotes.toscrape.com
Content xpath filter
Configure which text content is extracted with XPath filters:
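For instance, to keep only text inside <main> elements (a sketch; only the xpath_filter parameter name comes from this page):

```yaml
# Illustrative sketch — other field names are assumptions
websites_config:
  docs.prisme.ai:
    websiteURL: https://docs.prisme.ai/
    xpath_filter: ancestor::main   # keep only text nodes that sit inside a <main> element
```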
This configuration only extracts text under any <main> HTML tag.
The xpath_filter is always embedded inside this parent XPath: /html/body/descendant::text()[not(ancestor::style) and not(ancestor::script) and not(ancestor::header) and not(ancestor::footer) and {xpath_filter}]
Sitemap crawling
Provide a websiteURL ending with sitemap.xml to crawl only the URLs listed in that sitemap:
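For example (a sketch reusing the websiteURL field; docs.prisme.ai is taken from the parser example below):

```yaml
# Illustrative sketch — point websiteURL at the sitemap
websites_config:
  docs.prisme.ai:
    websiteURL: https://docs.prisme.ai/sitemap.xml   # only URLs listed in this sitemap are crawled
```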
Scheduling Configuration
Set up recurring crawl schedules:
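A sketch, assuming a per-domain periodicity field (the field name and its unit are assumptions):

```yaml
# Illustrative sketch — the periodicity field and its unit are assumptions
websites_config:
  docs.prisme.ai:
    websiteURL: https://docs.prisme.ai/
    periodicity: 3   # re-crawl every 3 days
```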
This configuration schedules a crawl to run every 3 days.
Content extraction methods
You can choose different content extraction strategies, separately for HTML or Documents.
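A sketch, assuming per-domain parser fields (the names document_parser and html_parser are assumptions; the possible values are listed below):

```yaml
# Illustrative sketch — field names are assumptions
websites_config:
  docs.prisme.ai:
    websiteURL: https://docs.prisme.ai/
    document_parser: docling   # parse PDF, PPT, … with docling instead of unstructured
    html_parser: docling       # parse HTML with docling instead of xpath
```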
This configuration uses docling to parse both the documents and the HTML from the domain docs.prisme.ai.
Possible values are:
- Documents: unstructured (default) or docling
- HTML: xpath (default) or docling
The docling option returns a Markdown-formatted body for both documents and HTML, while unstructured and xpath return plain text with no specific structure.
docling is slower to process documents (and more resource intensive) than unstructured, but on par with xpath for HTML.
You can send additional arguments specific to Docling by creating a dedicated alias for this parser.
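A sketch of such an alias (the alias structure is an assumption; the options shown mirror standard Docling pipeline options and are not an exhaustive list):

```yaml
# Illustrative sketch — alias structure is an assumption
parser_aliases:
  docling_with_ocr:
    parser: docling
    options:
      do_ocr: true              # run OCR on scanned pages
      do_table_structure: true  # reconstruct table layouts
websites_config:
  docs.prisme.ai:
    websiteURL: https://docs.prisme.ai/
    document_parser: docling_with_ocr
```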
Common Use Cases
The Crawler app enables a wide range of use cases:
Knowledge Base Creation
Build comprehensive AI knowledge bases from website content:
- Documentation portals
- Product information sites
- Support knowledge bases
- Internal wikis
Content Monitoring
Keep track of changes and updates on important websites:
- Competitor websites
- Industry news sources
- Regulatory publications
- Product documentation
Data Collection
Gather structured data from web sources:
- Product catalogs
- Price information
- Company directories
- Research publications
Website to RAG Agent
Transform websites into conversational AI agents:
- Company websites
- Documentation portals
- Educational resources
- Knowledge repositories
Integration with Other Prisme.ai Products
The Crawler app works seamlessly with other Prisme.ai products:
The Crawler is a primary data source for AI Knowledge:
- Extract web content for knowledge bases
- Keep information current through scheduled crawls
- Process and structure content for optimal retrieval
- Preserve source attribution for transparency
This integration enables the creation of AI agents that can answer questions based on website content.
Use the Crawler in your automation workflows:
- Trigger automations based on website changes
- Process and transform web content
- Extract specific data for decision-making
- Integrate web content with other data sources
This enables sophisticated automations that leverage web data.
Combine the Crawler with Custom Code for advanced processing:
- Apply custom transformations to extracted content
- Implement specialized parsing logic
- Analyze content with custom algorithms
- Generate derived insights from crawled data
This combination provides maximum flexibility for handling web content.
Store crawled content in Collection for persistence and querying:
- Save structured content for later use
- Build queryable repositories of web information
- Track changes over time
- Combine with other data sources
This integration provides persistent storage and retrieval capabilities for crawled content.
Example: Creating a Website RAG Agent
One of the most popular use cases for the Crawler is creating a Retrieval-Augmented Generation (RAG) agent based on website content:
Configure the Crawler
Set up the Crawler to extract content from the target website:
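A sketch of a starting configuration (hypothetical domain; field names as assumed earlier on this page):

```yaml
# Illustrative sketch — hypothetical domain, assumed field names
websites_config:
  your-company-website.com:
    websiteURL: https://your-company-website.com/
    xpath_filter: ancestor::main   # focus on the main content area
    periodicity: 7                 # refresh weekly
    html_parser: docling           # Markdown output chunks well for RAG
```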
Create an AI Knowledge Project
Set up a new project in AI Knowledge to house the crawled content:
- Create a new project
- Configure embedding and chunk settings
- Set up appropriate processing options
Connect the Crawler to AI Knowledge
Configure the connection between the Crawler and AI Knowledge:
- Select the crawler as a data source
- Map crawled content to knowledge base structure
- Configure metadata and attribute mapping
Run the Initial Crawl
Execute the first crawl to populate your knowledge base:
- Monitor the crawl progress
- Verify content extraction quality
- Address any issues or adjustments needed
Configure the RAG Agent
Set up an AI agent that uses the crawled content:
- Configure prompt templates
- Set retrieval parameters
- Define response formatting
- Test with sample questions
Deploy and Monitor
Make the agent available and keep it current:
- Publish the agent to AI Store
- Set up scheduled crawls for updates
- Monitor usage and performance
- Refine based on feedback
The result is an AI agent that can answer questions based on the content of the website, providing accurate and up-to-date information.
Best Practices
Follow these recommendations to get the most from the Crawler app:
Respectful Crawling
Be a good web citizen:
- Respect robots.txt directives
- Implement appropriate rate limiting
- Identify your crawler with a meaningful user agent
- Crawl during off-peak hours when possible
- Only extract content you have permission to use
These practices help maintain good relationships with the websites you crawl.
Content Selection
Be selective about what you crawl:
- Focus on the most valuable content
- Use specific CSS selectors for precision
- Exclude boilerplate and repetitive content
- Be mindful of content that changes frequently
- Consider the information architecture of the site
Targeted crawling improves efficiency and content quality.
Incremental Updates
Optimize ongoing crawling:
- Use incremental updates when possible
- Schedule crawls based on content update frequency
- Focus on changed content rather than recrawling everything
- Implement change detection mechanisms
- Archive historical versions when appropriate
These approaches minimize resource usage while keeping content current.
Error Handling
Prepare for crawling challenges:
- Monitor for crawl failures
- Implement retries for transient errors
- Have fallback mechanisms for critical content
- Set up notifications for persistent issues
- Regularly review crawl logs
Robust error handling ensures reliable content extraction.
Limitations and Considerations
When using the Crawler app, be aware of these considerations:
- JavaScript-Heavy Sites: Some websites rely heavily on JavaScript for content rendering. The Crawler includes JavaScript processing capabilities, but complex single-page applications might present challenges.
- Authentication Requirements: Websites requiring login can be crawled, but require additional configuration for authentication handling.
- Legal and Terms of Service: Always ensure you have the right to crawl and use the content from websites, respecting their terms of service.
- Dynamic Content: Content that changes based on user interaction or personalization may not be fully captured.
- Resource Intensity: Comprehensive crawling can be resource-intensive. Consider scope and frequency based on your needs and available resources.