The Crawler app is a powerful infrastructure component in the Prisme.ai ecosystem that enables you to extract, process, and use content from websites. It transforms web content and documents (PDF, PPT, etc.) into structured data that can be used in your AI solutions, knowledge bases, and automations.

Overview

The Crawler is a specialized microservice provided by Prisme.ai that handles the complex process of web content extraction:

Web Content Extraction

Automatically extract content from websites and web pages

Content Processing

Transform web content and documents into structured, usable data

Selective Crawling

Target specific content through URL patterns and CSS selectors

Scheduling

Set up regular crawling jobs to keep information current
This infrastructure app is particularly valuable for creating AI knowledge bases, maintaining up-to-date information, and automating content-based workflows.

Key Features

  • Content Extraction
  • Targeting Options
  • Processing Capabilities
  • Scheduling and Automation
Extract various types of content from websites:
  • Text Content: Articles, documentation, product information
  • Structured Data: Tables, lists, and other formatted content
  • Metadata: Page titles, descriptions, authors, dates
  • Navigation Structure: Site hierarchies and relationships
  • Links and References: Internal and external connections
The Crawler uses advanced techniques to identify and extract meaningful content while filtering out navigation elements, advertisements, and other non-essential components.

How the Crawler Works

The Crawler follows a systematic process to extract and process web content:
1. Configuration

Define what to crawl and how to process it:
  • Specify starting URLs
  • Set URL patterns for inclusion/exclusion
  • Define content selectors to target specific content
  • Define periodicity
2. Discovery

Start with initial URLs and discover additional content:
  • Visit starting pages
  • Identify links to follow
  • Filter links based on patterns
  • Build a crawl frontier
3. Extraction

Visit pages and extract content:
  • Render pages (including JavaScript content)
  • Apply CSS selectors to target specific content
  • Extract text, structured data, and metadata
  • Process and clean the extracted content
4. Processing

Transform extracted content into usable formats:
  • Clean and normalize text
  • Identify and preserve structure (headings, lists, etc.)
  • Extract metadata and attributes
  • Categorize and classify content
5. Storage

Store processed content for use in your applications:
  • Save to structured storage
  • Index for search and retrieval
  • Associate with metadata and source information
  • Make available for knowledge bases and automations
This process transforms web content into structured, searchable information that can power your AI applications.
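As an illustrative sketch only, the configuration step above can be expressed with the keys documented in the next section (websiteURL, websites_config, blacklisted_patterns, xpath_filter, periodicity); the domain and patterns below are placeholders, not a real setup:
  slug: Crawler
  config:
    crawlerId: example.com
    periodicity: 259200              # Re-crawl every 3 days (value in seconds)
    websiteURL:
      - https://www.example.com/     # Starting URL for discovery
    websites_config:
      example.com:                   # Domain key without the www. prefix
        blacklisted_patterns:
          - /search.*                # URL patterns to exclude from the crawl
        xpath_filter: ancestor::main # Only extract text inside <main>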

Configuration Options

The Crawler app provides extensive configuration options to tailor its behavior to your specific needs, domain by domain.
Do not include the www. prefix in websites_config domains.
Control which websites are crawled:
  slug: Crawler
  config:
    websiteURL:
      - https://www.issy.com/
      - https://www.issy-tourisme-international.com
      - https://quotes.toscrape.com/tag/love/
    crawlerId: issy.com
This configuration starts crawling the three configured domains and follows every discovered URL belonging to these domains and paths.
For https://quotes.toscrape.com/tag/love/, only URLs under the /tag/love/ path are followed: in that case, only https://quotes.toscrape.com/tag/love/page/2/ will be discovered, and other found URLs will be ignored.
Blacklist specific URL patterns:
slug: Crawler
config:
  mode: auto
  paused_crawl: false
  websites_config:
    issy.com:
      blacklisted_patterns:
        - /publications.*?id=.*
    issy-tourisme-international.com:
      blacklisted_patterns:
        - /recherche.*
  websiteURL:
    - https://www.issy.com/
    - https://www.issy-tourisme-international.com
    - https://quotes.toscrape.com/
  crawlerId: issy.com
This configuration:
  • excludes all URL paths beginning with /recherche for issy-tourisme-international.com
  • excludes all URL paths beginning with /publications and containing an id= query string parameter for issy.com
  • crawls every discovered URL for quotes.toscrape.com
Configure which text content is extracted with XPath filters:
slug: Crawler
config:
  websiteURL:
    - https://www.mywebsite.fr/    
  websites_config:
    mywebsite.fr:
      xpath_filter: ancestor::main
This configuration only extracts text under <main> HTML tags.
xpath_filter is always injected into this parent XPath expression: /html/body/descendant::text()[not(ancestor::style) and not(ancestor::script) and not(ancestor::header) and not(ancestor::footer) and {xpath_filter}]
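For example, with xpath_filter: ancestor::main, the effective expression evaluated against each page becomes:
  /html/body/descendant::text()[not(ancestor::style) and not(ancestor::script) and not(ancestor::header) and not(ancestor::footer) and ancestor::main]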
Provide a websiteURL ending with sitemap.xml to only crawl the URLs listed in that sitemap:
  slug: Crawler
  config:
    websiteURL:
      - https://www.example.com/sitemap.xml
    crawlerId: example.com
Set up recurring crawl schedules:
  slug: Crawler
  config:
    periodicity: 259200
This configuration schedules a crawl to be run every 3 days.
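Since periodicity is expressed in seconds (259200 seconds = 3 days), other common schedules follow the same pattern, for example:
  slug: Crawler
  config:
    periodicity: 86400    # Daily crawl (24 * 60 * 60 seconds)
    # periodicity: 604800 # Weekly crawl (7 * 24 * 60 * 60 seconds)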
You can choose different content extraction strategies, separately for HTML and documents:
  slug: Crawler
  config:
    websites_config:
      docs.prisme.ai:
        parsers:
          documents: docling
          html: docling
This configuration will use docling to parse both documents and HTML from the domain docs.prisme.ai. Possible values are:
  • documents: unstructured (default) or docling
  • html: xpath (default) or docling
The docling option returns a Markdown-formatted body for both documents and HTML, while unstructured and xpath return plain text with no specific structure.
docling is slower and more resource intensive than unstructured when processing documents, but on par with xpath for HTML.
You can pass additional arguments specific to Docling by creating a dedicated alias for this parser. Here is an example showing the available options:
  slug: Crawler
  config:
    parsers:
      myDoclingWithOptions:
        type: docling
        PdfPipelineOptions:
          generate_picture_images: true  # Upload each image found in the file
          generate_page_images: true     # Generate an image of each whole page
          images_scale: 2                # Image scale factor; higher values capture more detail and improve parsing quality
    websites_config:
      docs.prisme.ai:
        parsers:
          documents: myDoclingWithOptions
          html: myDoclingWithOptions

Common Use Cases

The Crawler app enables a wide range of use cases:

Knowledge Base Creation

Build comprehensive AI knowledge bases from website content:
  • Documentation portals
  • Product information sites
  • Support knowledge bases
  • Internal wikis

Content Monitoring

Keep track of changes and updates on important websites:
  • Competitor websites
  • Industry news sources
  • Regulatory publications
  • Product documentation

Data Collection

Gather structured data from web sources:
  • Product catalogs
  • Price information
  • Company directories
  • Research publications

Website to RAG Agent

Transform websites into conversational AI agents:
  • Company websites
  • Documentation portals
  • Educational resources
  • Knowledge repositories

Integration with Other Prisme.ai Products

The Crawler app works seamlessly with other Prisme.ai products:
  • AI Knowledge
  • AI Builder
  • Custom Code
  • Collection
The Crawler is a primary data source for AI Knowledge:
  • Extract web content for knowledge bases
  • Keep information current through scheduled crawls
  • Process and structure content for optimal retrieval
  • Preserve source attribution for transparency
This integration enables the creation of AI agents that can answer questions based on website content.

Example: Creating a Website RAG Agent

One of the most popular use cases for the Crawler is creating a Retrieval-Augmented Generation (RAG) agent based on website content:
1. Configure the Crawler

Set up the Crawler to extract content from the target website:
slug: Crawler
config:
  websiteURL:
    - https://www.example.com/
  websites_config:
    example.com:
      xpath_filter: "ancestor::article or ancestor::*[contains(@class, 'main-content')]"
2. Create an AI Knowledge Project

Set up a new project in AI Knowledge to house the crawled content:
  • Create a new project
  • Configure embedding and chunk settings
  • Set up appropriate processing options
3. Connect the Crawler to AI Knowledge

Configure the connection between the Crawler and AI Knowledge:
  • Select the crawler as a data source
  • Map crawled content to knowledge base structure
  • Configure metadata and attribute mapping
4. Run the Initial Crawl

Execute the first crawl to populate your knowledge base:
  • Monitor the crawl progress
  • Verify content extraction quality
  • Address any issues or adjustments needed
5. Configure the RAG Agent

Set up an AI agent that uses the crawled content:
  • Configure prompt templates
  • Set retrieval parameters
  • Define response formatting
  • Test with sample questions
6. Deploy and Monitor

Make the agent available and keep it current:
  • Publish the agent to AI Store
  • Set up scheduled crawls for updates
  • Monitor usage and performance
  • Refine based on feedback
The result is an AI agent that can answer questions based on the content of the website, providing accurate and up-to-date information.

Best Practices

Follow these recommendations to get the most from the Crawler app:
Be a good web citizen:
  • Respect robots.txt directives
  • Implement appropriate rate limiting
  • Identify your crawler with a meaningful user agent
  • Crawl during off-peak hours when possible
  • Only extract content you have permission to use
These practices help maintain good relationships with the websites you crawl.
Be selective about what you crawl:
  • Focus on the most valuable content
  • Use specific CSS selectors for precision
  • Exclude boilerplate and repetitive content
  • Be mindful of content that changes frequently
  • Consider the information architecture of the site
Targeted crawling improves efficiency and content quality.
Optimize ongoing crawling:
  • Use incremental updates when possible
  • Schedule crawls based on content update frequency
  • Focus on changed content rather than recrawling everything
  • Implement change detection mechanisms
  • Archive historical versions when appropriate
These approaches minimize resource usage while keeping content current.
Prepare for crawling challenges:
  • Monitor for crawl failures
  • Implement retries for transient errors
  • Have fallback mechanisms for critical content
  • Set up notifications for persistent issues
  • Regularly review crawl logs
Robust error handling ensures reliable content extraction.

Limitations and Considerations

When using the Crawler app, be aware of these considerations:
  • JavaScript-Heavy Sites: Some websites rely heavily on JavaScript for content rendering. The Crawler includes JavaScript processing capabilities, but complex single-page applications might present challenges.
  • Authentication Requirements: Websites requiring login can be crawled, but require additional configuration for authentication handling.
  • Legal and Terms of Service: Always ensure you have the right to crawl and use the content from websites, respecting their terms of service.
  • Dynamic Content: Content that changes based on user interaction or personalization may not be fully captured.
  • Resource Intensity: Comprehensive crawling can be resource-intensive. Consider scope and frequency based on your needs and available resources.

Next Steps
