Crawler
Extract and process web content for AI knowledge bases and automations
The Crawler app is a powerful infrastructure component in the Prisme.ai ecosystem that enables you to extract, process, and utilize content from websites. It transforms web content and documents (PDF, PPT, etc.) into structured data that can be used in your AI solutions, knowledge bases, and automations.
Overview
The Crawler is a specialized microservice provided by Prisme.ai that handles the complex process of web content extraction:
Web Content Extraction
Automatically extract content from websites and web pages
Content Processing
Transform web content and documents into structured, usable data
Selective Crawling
Target specific content through URL patterns and CSS selectors
Scheduling
Set up regular crawling jobs to keep information current
This infrastructure app is particularly valuable for creating AI knowledge bases, maintaining up-to-date information, and automating content-based workflows.
Key Features
Extract various types of content from websites:
- Text Content: Articles, documentation, product information
- Structured Data: Tables, lists, and other formatted content
- Metadata: Page titles, descriptions, authors, dates
- Navigation Structure: Site hierarchies and relationships
- Links and References: Internal and external connections
The Crawler uses advanced techniques to identify and extract meaningful content while filtering out navigation elements, advertisements, and other non-essential components.
Control exactly what content gets crawled:
- URL Patterns: Include or exclude content based on URL patterns
- CSS Selectors: Target specific page elements by CSS selectors
- Content Types: Filter by content type (text, images, etc.)
- Depth Control: Limit crawling to a specific number of levels
- Rate Limiting: Control crawl speed to be respectful of websites
- Periodicity: Control how often content is re-crawled
These targeting options allow you to focus on the most relevant content for your needs.
Transform extracted content for better usability:
- Content Cleaning: Remove boilerplate text and formatting
- Text Extraction: Convert HTML to clean, usable text
- Structure Preservation: Maintain headings, lists, and tables
- Metadata Extraction: Capture page properties and attributes
- Page Transformation: Convert web pages to Markdown (an LLM-friendly format)
These processing capabilities ensure that the extracted content is ready for use in your AI solutions.
Keep your content current with scheduling options:
- Recurring Crawls: Schedule regular updates
- Incremental Crawling: Focus on new or changed content
- Event-Triggered Crawls: Start crawls based on specific events
- Notification System: Get alerts about crawl status
These scheduling features help maintain the freshness of your knowledge base.
How the Crawler Works
The Crawler follows a systematic process to extract and process web content:
Configuration
Define what to crawl and how to process it:
- Specify starting URLs
- Set URL patterns for inclusion/exclusion
- Define content selectors and configure authentication if needed
- Set the crawl periodicity
Discovery
Start with initial URLs and discover additional content:
- Visit starting pages
- Identify links to follow
- Filter links based on patterns
- Build a crawl frontier
Extraction
Visit pages and extract content:
- Render pages (including JavaScript content)
- Apply CSS selectors to target specific content
- Extract text, structured data, and metadata
- Process and clean the extracted content
Processing
Transform extracted content into usable formats:
- Clean and normalize text
- Identify and preserve structure (headings, lists, etc.)
- Extract metadata and attributes
- Categorize and classify content
Storage
Store processed content for use in your applications:
- Save to structured storage
- Index for search and retrieval
- Associate with metadata and source information
- Make available for knowledge bases and automations
This process transforms web content into structured, searchable information that can power your AI applications.
Configuration Options
The Crawler app provides extensive configuration options to tailor its behavior to your specific needs, domain by domain.
Do not include the www. prefix in websites_config domains.
Target websites
Control which websites are crawled:
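A minimal sketch of such a configuration (the websites_config and websiteURL names come from this page; the first two domains are placeholders and the exact schema may differ):

```yaml
# Illustrative sketch — adapt field names to the Crawler app's actual schema
websites_config:
  example-docs.com:
    websiteURL: https://example-docs.com/
  another-example.org:
    websiteURL: https://another-example.org/
  quotes.toscrape.com:
    # Only URLs under the /tag/love/ path will be followed
    websiteURL: https://quotes.toscrape.com/tag/love/
```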
This configuration starts crawling the three configured domains and follows every discovered URL belonging to these domains and paths.
For https://quotes.toscrape.com/tag/love/, it only follows URLs under the /tag/love/ path: in that case, only https://quotes.toscrape.com/tag/love/page/2/ will be discovered, and any other URL found will be ignored.
URL blacklist
Blacklist specific URL patterns:
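A sketch of such a configuration, assuming a per-domain blacklist field (the name blacklisted_patterns is an assumption):

```yaml
# Illustrative sketch — the blacklist field name is an assumption
websites_config:
  issy-tourisme-international.com:
    websiteURL: https://www.issy-tourisme-international.com/
    blacklisted_patterns:
      - /recherche        # exclude paths beginning with /recherche
  issy.com:
    websiteURL: https://issy.com/
    blacklisted_patterns:
      - /publications     # exclude paths beginning with /publications
      - "aid="            # exclude URLs with the aid= query string parameter
  quotes.toscrape.com:
    websiteURL: https://quotes.toscrape.com/   # no blacklist: crawl everything discovered
```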
This configuration:
- excludes all URL paths beginning with /recherche for www.issy-tourisme-international.com
- excludes all URL paths beginning with /publications, as well as URLs with the aid= query string parameter, for issy.com
- crawls every discovered URL for quotes.toscrape.com
Content xpath filter
Configure which text content is extracted with XPath filters:
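For instance, to keep only text inside <main> elements (a sketch; only the xpath_filter parameter name comes from this page):

```yaml
# Illustrative sketch — other field names are assumptions
websites_config:
  docs.prisme.ai:
    websiteURL: https://docs.prisme.ai/
    xpath_filter: ancestor::main   # keep only text nodes that sit inside a <main> element
```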
This configuration only extracts text under any <main> HTML tag.
The xpath_filter is always embedded inside this parent XPath: /html/body/descendant::text()[not(ancestor::style) and not(ancestor::script) and not(ancestor::header) and not(ancestor::footer) and {xpath_filter}]
Sitemap crawling
Provide a websiteURL ending with sitemap.xml to crawl only the URLs listed in that sitemap:
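For example (a sketch reusing the websiteURL field; docs.prisme.ai is taken from the parser example below):

```yaml
# Illustrative sketch — point websiteURL at the sitemap
websites_config:
  docs.prisme.ai:
    websiteURL: https://docs.prisme.ai/sitemap.xml   # only URLs listed in this sitemap are crawled
```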
Scheduling Configuration
Set up recurring crawl schedules:
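A sketch, assuming a per-domain periodicity field (the field name and its unit are assumptions):

```yaml
# Illustrative sketch — the periodicity field and its unit are assumptions
websites_config:
  docs.prisme.ai:
    websiteURL: https://docs.prisme.ai/
    periodicity: 3   # re-crawl every 3 days
```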
This configuration schedules a crawl to run every 3 days.
Content extraction methods
You can choose different content extraction strategies, separately for HTML or Documents.
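A sketch, assuming per-domain parser fields (the names document_parser and html_parser are assumptions; the possible values are listed below):

```yaml
# Illustrative sketch — field names are assumptions
websites_config:
  docs.prisme.ai:
    websiteURL: https://docs.prisme.ai/
    document_parser: docling   # parse PDF, PPT, … with docling instead of unstructured
    html_parser: docling       # parse HTML with docling instead of xpath
```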
This configuration uses docling to parse both the documents and the HTML from the domain docs.prisme.ai.
Possible values are:
- Documents: unstructured (default) or docling
- HTML: xpath (default) or docling
The docling option returns a Markdown-formatted body for both documents and HTML, while unstructured and xpath return plain text with no specific structure.
docling is slower to process documents (and more resource intensive) than unstructured, but on par with xpath for HTML.
You can send additional arguments specific to Docling by creating a dedicated alias for this parser.
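A sketch of such an alias (the alias structure is an assumption; the options shown mirror standard Docling pipeline options and are not an exhaustive list):

```yaml
# Illustrative sketch — alias structure is an assumption
parser_aliases:
  docling_with_ocr:
    parser: docling
    options:
      do_ocr: true              # run OCR on scanned pages
      do_table_structure: true  # reconstruct table layouts
websites_config:
  docs.prisme.ai:
    websiteURL: https://docs.prisme.ai/
    document_parser: docling_with_ocr
```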
Common Use Cases
The Crawler app enables a wide range of use cases:
Knowledge Base Creation
Build comprehensive AI knowledge bases from website content:
- Documentation portals
- Product information sites
- Support knowledge bases
- Internal wikis
Content Monitoring
Keep track of changes and updates on important websites:
- Competitor websites
- Industry news sources
- Regulatory publications
- Product documentation
Data Collection
Gather structured data from web sources:
- Product catalogs
- Price information
- Company directories
- Research publications
Website to RAG Agent
Transform websites into conversational AI agents:
- Company websites
- Documentation portals
- Educational resources
- Knowledge repositories
Integration with Other Prisme.ai Products
The Crawler app works seamlessly with other Prisme.ai products:
The Crawler is a primary data source for AI Knowledge:
- Extract web content for knowledge bases
- Keep information current through scheduled crawls
- Process and structure content for optimal retrieval
- Preserve source attribution for transparency
This integration enables the creation of AI agents that can answer questions based on website content.
Use the Crawler in your automation workflows:
- Trigger automations based on website changes
- Process and transform web content
- Extract specific data for decision-making
- Integrate web content with other data sources
This enables sophisticated automations that leverage web data.
Combine the Crawler with Custom Code for advanced processing:
- Apply custom transformations to extracted content
- Implement specialized parsing logic
- Analyze content with custom algorithms
- Generate derived insights from crawled data
This combination provides maximum flexibility for handling web content.
Store crawled content in Collection for persistence and querying:
- Save structured content for later use
- Build queryable repositories of web information
- Track changes over time
- Combine with other data sources
This integration provides persistent storage and retrieval capabilities for crawled content.
Example: Creating a Website RAG Agent
One of the most popular use cases for the Crawler is creating a Retrieval-Augmented Generation (RAG) agent based on website content:
Configure the Crawler
Set up the Crawler to extract content from the target website:
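A sketch of a starting configuration (hypothetical domain; field names as assumed earlier on this page):

```yaml
# Illustrative sketch — hypothetical domain, assumed field names
websites_config:
  your-company-website.com:
    websiteURL: https://your-company-website.com/
    xpath_filter: ancestor::main   # focus on the main content area
    periodicity: 7                 # refresh weekly
    html_parser: docling           # Markdown output chunks well for RAG
```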
Create an AI Knowledge Project
Set up a new project in AI Knowledge to house the crawled content:
- Create a new project
- Configure embedding and chunk settings
- Set up appropriate processing options
Connect the Crawler to AI Knowledge
Configure the connection between the Crawler and AI Knowledge:
- Select the crawler as a data source
- Map crawled content to knowledge base structure
- Configure metadata and attribute mapping
Run the Initial Crawl
Execute the first crawl to populate your knowledge base:
- Monitor the crawl progress
- Verify content extraction quality
- Address any issues or adjustments needed
Configure the RAG Agent
Set up an AI agent that uses the crawled content:
- Configure prompt templates
- Set retrieval parameters
- Define response formatting
- Test with sample questions
Deploy and Monitor
Make the agent available and keep it current:
- Publish the agent to AI Store
- Set up scheduled crawls for updates
- Monitor usage and performance
- Refine based on feedback
The result is an AI agent that can answer questions based on the content of the website, providing accurate and up-to-date information.
Best Practices
Follow these recommendations to get the most from the Crawler app:
Respectful Crawling
Be a good web citizen:
- Respect robots.txt directives
- Implement appropriate rate limiting
- Identify your crawler with a meaningful user agent
- Crawl during off-peak hours when possible
- Only extract content you have permission to use
These practices help maintain good relationships with the websites you crawl.
Content Selection
Be selective about what you crawl:
- Focus on the most valuable content
- Use specific CSS selectors for precision
- Exclude boilerplate and repetitive content
- Be mindful of content that changes frequently
- Consider the information architecture of the site
Targeted crawling improves efficiency and content quality.
Incremental Updates
Optimize ongoing crawling:
- Use incremental updates when possible
- Schedule crawls based on content update frequency
- Focus on changed content rather than recrawling everything
- Implement change detection mechanisms
- Archive historical versions when appropriate
These approaches minimize resource usage while keeping content current.
Error Handling
Prepare for crawling challenges:
- Monitor for crawl failures
- Implement retries for transient errors
- Have fallback mechanisms for critical content
- Set up notifications for persistent issues
- Regularly review crawl logs
Robust error handling ensures reliable content extraction.
Limitations and Considerations
When using the Crawler app, be aware of these considerations:
- JavaScript-Heavy Sites: Some websites rely heavily on JavaScript for content rendering. The Crawler includes JavaScript processing capabilities, but complex single-page applications might present challenges.
- Authentication Requirements: Websites requiring login can be crawled, but require additional configuration for authentication handling.
- Legal and Terms of Service: Always ensure you have the right to crawl and use the content from websites, respecting their terms of service.
- Dynamic Content: Content that changes based on user interaction or personalization may not be fully captured.
- Resource Intensity: Comprehensive crawling can be resource-intensive. Consider scope and frequency based on your needs and available resources.