Extract and process web content for AI knowledge bases and automations
Configuration
Discovery
Extraction
Processing
Storage
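The five stages above suggest a linear pipeline in which each stage feeds the next. Below is a minimal illustrative sketch of that flow; the function names and toy stages are assumptions for illustration, not the crawler's actual implementation.

```python
from typing import Any, Callable

def run_pipeline(seed: str, stages: list[Callable[[Any], Any]]) -> Any:
    """Feed each stage's output into the next, starting from a seed URL."""
    result: Any = seed
    for stage in stages:
        result = stage(result)
    return result

# Toy stand-ins for the real stages (Configuration drives which ones run):
discover = lambda url: [url + "/page/1", url + "/page/2"]   # Discovery
extract = lambda urls: [f"text of {u}" for u in urls]       # Extraction
process = lambda docs: [d.upper() for d in docs]            # Processing
store = lambda docs: {"stored": len(docs)}                  # Storage

print(run_pipeline("https://example.com", [discover, extract, process, store]))
# → {'stored': 2}
```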
Target websites
/tag/love/
path: in that case, only https://quotes.toscrape.com/tag/love/page/2/ will be discovered, and any other URL that is found will be ignored.
URL blacklist
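As a rough sketch of how blacklist rules could be applied during discovery: the docs do not specify the exact matching semantics, so the path-prefix and query-parameter matching below, and the helper name itself, are assumptions for illustration only.

```python
from urllib.parse import parse_qs, urlparse

def is_blacklisted(url: str, path_prefixes: list[str], banned_params: list[str]) -> bool:
    """Return True if the URL matches a blacklisted path prefix or query parameter."""
    parts = urlparse(url)
    if any(parts.path.startswith(prefix) for prefix in path_prefixes):
        return True
    return any(param in parse_qs(parts.query) for param in banned_params)

# Blacklisting URLs with an id= query string parameter:
print(is_blacklisted("https://issy.com/page?id=42", [], ["id"]))  # → True
# Blacklisting a /recherche path:
print(is_blacklisted("https://www.issy-tourisme-international.com/recherche?q=x",
                     ["/recherche"], []))  # → True
```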
/recherche
for www.issy-tourisme-international.com/publications
for issy.com and with an id= query string parameter
Content xpath filter
<main>
HTML tags.
/html/body/descendant::text()[not(ancestor::style) and not(ancestor::script) and not(ancestor::header) and not(ancestor::footer) and {xpath_filter}]
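The template above contains an {xpath_filter} placeholder. A minimal sketch of how a tag filter such as <main> might be spliced into it is shown below; the ancestor::main predicate is an assumption about how the crawler expands the filter, not confirmed by the docs.

```python
# Base XPath template copied from the docs, with the user filter left open.
BASE = ("/html/body/descendant::text()[not(ancestor::style) and "
        "not(ancestor::script) and not(ancestor::header) and "
        "not(ancestor::footer) and {xpath_filter}]")

def xpath_for_tag(tag: str) -> str:
    """Restrict extraction to text nodes inside the given tag, e.g. 'main'."""
    return BASE.format(xpath_filter=f"ancestor::{tag}")

print(xpath_for_tag("main"))
```

The resulting expression can be run with any full XPath 1.0 engine (Python's stdlib ElementTree only supports a subset, so a library such as lxml would be needed to evaluate it).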
Sitemap crawling
websiteURL
ending with sitemap.xml to only crawl the URLs listed in that sitemap:
Scheduling Configuration
Content extraction methods
docs.prisme.ai.
Possible values are: unstructured (default) or docling for documents, and xpath (default) or docling for HTML.
The docling option will return a markdown-formatted body for documents and HTML, while unstructured and xpath will return plain text with no specific structure. docling is slower to process documents (and more resource-intensive) than unstructured, but on par with xpath for HTML.
Configure the Crawler
Create an AI Knowledge Project
Connect the Crawler to AI Knowledge
Run the Initial Crawl
Configure the RAG Agent
Deploy and Monitor
Respectful Crawling
Content Selection
Incremental Updates
Error Handling