Crawler
Extract and process web content for AI knowledge bases and automations
The Crawler app is a powerful infrastructure component in the Prisme.ai ecosystem that enables you to extract, process, and use content from websites. It transforms web pages and documents (PDF, PPT, and other formats) into structured data that can power your AI solutions, knowledge bases, and automations.
Overview
The Crawler is a specialized microservice provided by Prisme.ai that handles the complex process of web content extraction:
Web Content Extraction
Automatically extract content from websites and web pages
Content Processing
Transform web content and documents into structured, usable data
Selective Crawling
Target specific content through URL patterns and CSS selectors
Scheduling
Set up regular crawling jobs to keep information current
This infrastructure app is particularly valuable for creating AI knowledge bases, maintaining up-to-date information, and automating content-based workflows.
Key Features
Web Content Extraction
Extract various types of content from websites:
- Text Content: Articles, documentation, product information
- Structured Data: Tables, lists, and other formatted content
- Metadata: Page titles, descriptions, authors, dates
- Navigation Structure: Site hierarchies and relationships
- Links and References: Internal and external connections
The Crawler uses advanced techniques to identify and extract meaningful content while filtering out navigation elements, advertisements, and other non-essential components.
Selective Crawling
Control exactly what content gets crawled:
- URL Patterns: Include or exclude content based on URL patterns
- CSS Selectors: Target specific page elements by CSS selectors
- Content Types: Filter by content type (text, images, etc.)
- Depth Control: Limit crawling to a specific number of levels
- Rate Limiting: Control crawl speed to be respectful of websites
- Periodicity: Control how often content is re-crawled
These targeting options allow you to focus on the most relevant content for your needs.
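The exact configuration schema is defined by your Prisme.ai workspace; purely as an illustration, a targeting setup along these lines could be described as follows (all field names here are hypothetical, not the Crawler's actual schema):

```python
# Illustrative targeting configuration -- field names are hypothetical,
# not the Crawler's actual schema.
targeting_config = {
    "start_urls": ["https://docs.example.com/"],
    "include_patterns": [r"^https://docs\.example\.com/guides/.*"],
    "exclude_patterns": [r".*\?print=1$", r".*/archive/.*"],
    "css_selectors": ["main article", ".doc-content"],
    "content_types": ["text/html", "application/pdf"],
    "max_depth": 3,            # follow links at most 3 levels deep
    "requests_per_second": 1,  # rate limit to stay respectful of the site
}
```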
Content Processing
Transform extracted content for better usability:
- Content Cleaning: Remove boilerplate text and formatting
- Text Extraction: Convert HTML to clean, usable text
- Structure Preservation: Maintain headings, lists, and tables
- Metadata Extraction: Capture page properties and attributes
- Page Transformation: Convert web pages to Markdown (an LLM-friendly format)
These processing capabilities ensure that the extracted content is ready for use in your AI solutions.
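The Crawler's internal processing is not exposed, but the kind of HTML-to-Markdown conversion described above can be sketched with the Python standard library (a deliberately minimal illustration, not the production pipeline):

```python
from html.parser import HTMLParser

class MarkdownExtractor(HTMLParser):
    """Minimal sketch: keep headings, list items, and text; drop everything else."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._prefix = "#" * int(tag[1]) + " "   # e.g. h2 -> "## "
        elif tag == "li":
            self._prefix = "- "

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(self._prefix + text)
            self._prefix = ""

extractor = MarkdownExtractor()
extractor.feed("<h1>Crawler</h1><p>Extract web content.</p><li>Fast</li>")
print("\n\n".join(extractor.parts))   # "# Crawler", "Extract web content.", "- Fast"
```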
Scheduling
Keep your content current with scheduling options:
- Recurring Crawls: Schedule regular updates
- Incremental Crawling: Focus on new or changed content
- Event-Triggered Crawls: Start crawls based on specific events
- Notification System: Get alerts about crawl status
These scheduling features help maintain the freshness of your knowledge base.
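As an illustration only, a scheduling block might combine a recurring cron expression, an incremental flag, and event triggers; the field and event names below are assumptions, not the Crawler's documented options:

```python
# Illustrative scheduling configuration -- field and event names are hypothetical.
schedule_config = {
    "recurring": {"cron": "0 3 * * 1"},            # re-crawl every Monday at 03:00
    "incremental": True,                            # focus on new or changed pages
    "triggers": ["source.updated"],                 # hypothetical event name
    "notifications": {"on_failure": ["ops@example.com"]},
}
```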
How the Crawler Works
The Crawler follows a systematic process to extract and process web content:
Configuration
Define what to crawl and how to process it:
- Specify starting URLs
- Set URL patterns for inclusion/exclusion
- Define content selectors and, if needed, authentication settings
- Define periodicity
Discovery
Start with initial URLs and discover additional content:
- Visit starting pages
- Identify links to follow
- Filter links based on patterns
- Build a crawl frontier
Extraction
Visit pages and extract content:
- Render pages (including JavaScript content)
- Apply CSS selectors to target specific content
- Extract text, structured data, and metadata
- Process and clean the extracted content
Processing
Transform extracted content into usable formats:
- Clean and normalize text
- Identify and preserve structure (headings, lists, etc.)
- Extract metadata and attributes
- Categorize and classify content
Storage
Store processed content for use in your applications:
- Save to structured storage
- Index for search and retrieval
- Associate with metadata and source information
- Make available for knowledge bases and automations
This process transforms web content into structured, searchable information that can power your AI applications.
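To make the cycle concrete, here is a toy frontier-driven crawl loop in Python. It is not the Crawler's implementation; the fetch, extraction, and storage steps are passed in as placeholder callables so the sketch stays self-contained:

```python
from collections import deque
from urllib.parse import urljoin

def crawl(start_url, fetch, extract_links, extract_content, store,
          is_allowed=lambda url: True, max_depth=2):
    """Toy frontier-driven crawl loop; illustrates the cycle, not the real Crawler."""
    frontier = deque([(start_url, 0)])              # the crawl frontier
    seen = {start_url}
    while frontier:
        url, depth = frontier.popleft()
        html = fetch(url)                           # Extraction: visit and render the page
        store(url, extract_content(html))           # Processing + Storage
        if depth >= max_depth:
            continue
        for link in extract_links(html):            # Discovery: find links to follow
            link = urljoin(url, link)               # resolve relative links
            if link not in seen and is_allowed(link):
                seen.add(link)
                frontier.append((link, depth + 1))
```

In practice the Crawler handles page rendering, rate limiting, and persistence for you; the sketch only illustrates the frontier-and-depth logic behind the process described above.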
Configuration Options
The Crawler app provides extensive configuration options that let you tailor its behavior to your specific needs on a per-domain basis.
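Because configuration is applied per domain, a multi-site setup typically combines shared defaults with domain-specific overrides. The structure below is purely illustrative; keys and field names are hypothetical:

```python
# Illustrative per-domain configuration -- keys and field names are hypothetical.
crawler_config = {
    "defaults": {"max_depth": 3, "requests_per_second": 1},
    "domains": {
        "docs.example.com": {
            "css_selectors": ["main article"],
            "schedule": {"cron": "0 3 * * *"},      # nightly refresh
        },
        "blog.example.com": {
            "include_patterns": [r"^https://blog\.example\.com/2\d{3}/.*"],
            "max_depth": 2,
        },
    },
}
```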
Common Use Cases
The Crawler app enables a wide range of use cases:
Knowledge Base Creation
Build comprehensive AI knowledge bases from website content:
- Documentation portals
- Product information sites
- Support knowledge bases
- Internal wikis
Content Monitoring
Keep track of changes and updates on important websites:
- Competitor websites
- Industry news sources
- Regulatory publications
- Product documentation
Data Collection
Gather structured data from web sources:
- Product catalogs
- Price information
- Company directories
- Research publications
Website to RAG Agent
Transform websites into conversational AI agents:
- Company websites
- Documentation portals
- Educational resources
- Knowledge repositories
Integration with Other Prisme.ai Products
The Crawler app works seamlessly with other Prisme.ai products:
AI Knowledge
The Crawler is a primary data source for AI Knowledge:
- Extract web content for knowledge bases
- Keep information current through scheduled crawls
- Process and structure content for optimal retrieval
- Preserve source attribution for transparency
This integration enables the creation of AI agents that can answer questions based on website content.
Automations
Use the Crawler in your automation workflows:
- Trigger automations based on website changes
- Process and transform web content
- Extract specific data for decision-making
- Integrate web content with other data sources
This enables sophisticated automations that leverage web data.
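The exact wiring depends on your workspace, but conceptually an automation reacts to crawl events and routes the extracted data onward. A rough sketch, with an assumed event payload shape and a hypothetical event name:

```python
def on_page_crawled(event: dict) -> None:
    """Hypothetical handler for a 'page crawled' event emitted by the Crawler."""
    page = event.get("page", {})
    if "pricing" in page.get("url", ""):
        # Extract a specific value and hand it to the rest of the workflow.
        notify_pricing_team(page["url"], page.get("markdown", ""))

def notify_pricing_team(url: str, content: str) -> None:
    # Placeholder action: in a real workflow this could post to chat, email, etc.
    print(f"Pricing page changed: {url} ({len(content)} characters extracted)")
```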
Custom Code
Combine the Crawler with Custom Code for advanced processing:
- Apply custom transformations to extracted content
- Implement specialized parsing logic
- Analyze content with custom algorithms
- Generate derived insights from crawled data
This combination provides maximum flexibility for handling web content.
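For example, a Custom Code function could post-process crawled records before they reach a knowledge base. A minimal sketch, assuming a simple record shape with a text field:

```python
import re

def clean_crawled_text(record: dict) -> dict:
    """Strip common boilerplate phrases and collapse whitespace in a crawled record."""
    text = record.get("text", "")
    text = re.sub(r"(?i)cookie policy|accept all cookies", "", text)  # drop common boilerplate
    text = re.sub(r"\s+", " ", text).strip()                          # normalize whitespace
    return {**record, "text": text}

cleaned = clean_crawled_text({"url": "https://example.com", "text": "Accept all cookies   Welcome!"})
print(cleaned["text"])  # -> "Welcome!"
```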
Collection
Store crawled content in Collection for persistence and querying:
- Save structured content for later use
- Build queryable repositories of web information
- Track changes over time
- Combine with other data sources
This integration provides persistent storage and retrieval capabilities for crawled content.
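A persisted record usually carries the content plus enough source metadata to support querying and change tracking. The shape below is an assumption for illustration, not Collection's actual schema:

```python
from datetime import datetime, timezone

# Illustrative record shape for persisting a crawled page -- not Collection's actual schema.
crawled_record = {
    "url": "https://docs.example.com/guides/getting-started",
    "title": "Getting started",
    "markdown": "# Getting started\n\nBody of the page as Markdown.",
    "source": "docs.example.com",
    "crawled_at": datetime.now(timezone.utc).isoformat(),
    "checksum": "sha256:<content-hash>",   # useful for tracking changes between crawls
}
```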
Example: Creating a Website RAG Agent
One of the most popular use cases for the Crawler is creating a Retrieval-Augmented Generation (RAG) agent based on website content:
Configure the Crawler
Set up the Crawler to extract content from the target website:
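For a documentation site, that typically means pointing the Crawler at the docs root, restricting crawling to the documentation paths, and selecting the main content area. The values below are placeholders to adapt to your own site, and the field names are illustrative rather than the Crawler's exact schema:

```python
# Placeholder configuration for the target website -- adapt URLs and selectors to your site.
rag_crawl_config = {
    "start_urls": ["https://docs.example.com/"],
    "include_patterns": [r"^https://docs\.example\.com/.*"],
    "css_selectors": ["main article"],
    "max_depth": 4,
    "schedule": {"cron": "0 2 * * *"},   # nightly refresh to keep the knowledge base current
}
```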
Create an AI Knowledge Project
Set up a new project in AI Knowledge to house the crawled content:
- Create a new project
- Configure embedding and chunk settings
- Set up appropriate processing options
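For the embedding and chunk settings, typical values look something like the following; these names and numbers are illustrative, and the actual options are configured in the AI Knowledge interface:

```python
# Illustrative chunking/embedding settings -- actual options are set in AI Knowledge.
knowledge_settings = {
    "chunk_size": 512,        # tokens per chunk
    "chunk_overlap": 64,      # tokens shared between consecutive chunks
    "embedding_model": "your-embedding-model",
}
```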
Connect the Crawler to AI Knowledge
Configure the connection between the Crawler and AI Knowledge:
- Select the crawler as a data source
- Map crawled content to knowledge base structure
- Configure metadata and attribute mapping
Run the Initial Crawl
Execute the first crawl to populate your knowledge base:
- Monitor the crawl progress
- Verify content extraction quality
- Address any issues or adjustments needed
Configure the RAG Agent
Set up an AI agent that uses the crawled content:
- Configure prompt templates
- Set retrieval parameters
- Define response formatting
- Test with sample questions
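The retrieval and prompting side boils down to a handful of parameters; the names below are illustrative rather than the product's exact settings:

```python
# Illustrative RAG agent settings -- parameter names are hypothetical.
agent_config = {
    "prompt_template": (
        "Answer using only the context below. Cite the source URL.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    ),
    "retrieval": {"top_k": 5, "min_score": 0.7},
    "response_format": "markdown_with_sources",
}
```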
Deploy and Monitor
Make the agent available and keep it current:
- Publish the agent to AI Store
- Set up scheduled crawls for updates
- Monitor usage and performance
- Refine based on feedback
The result is an AI agent that can answer questions based on the content of the website, providing accurate and up-to-date information.
Best Practices
Follow these recommendations to get the most from the Crawler app:
Limitations and Considerations
When using the Crawler app, be aware of these considerations:
- JavaScript-Heavy Sites: Some websites rely heavily on JavaScript for content rendering. The Crawler includes JavaScript processing capabilities, but complex single-page applications might present challenges.
- Authentication Requirements: Websites requiring login can be crawled, but require additional configuration for authentication handling.
- Legal and Terms of Service: Always ensure you have the right to crawl and use the content from websites, respecting their terms of service.
- Dynamic Content: Content that changes based on user interaction or personalization may not be fully captured.
- Resource Intensity: Comprehensive crawling can be resource-intensive. Consider scope and frequency based on your needs and available resources.