Website crawling turns a public website or a section of a website into searchable documents. Use it when an agent needs answers from multiple pages that change over time, such as documentation, help centers, public policies, or product pages.
[Screenshot: a knowledge base with configured web sources and crawled pages]

Choose the right ingestion method

| Method | Use it when |
| --- | --- |
| Single URL | You only need one page, such as one policy, article, or release note. |
| Auto-Discovery | You need Knowledges to follow links and index several pages from the same site. |
| Connector | The content lives in an authenticated source such as SharePoint or Google Drive. |
| File upload | You already have exported files such as PDFs, DOCX, CSV, or HTML. |
For a normal website crawl, use Auto-Discovery.

Before you start

Prepare the crawl scope before opening Knowledges:
  • Pick the smallest useful starting URL. Prefer https://docs.example.com/product over the website homepage when only one section matters.
  • Decide which paths must be included and excluded.
  • Check whether the site publishes a sitemap.
  • Confirm that the content is accessible to the Prisme.ai platform from the network where it runs.
  • Avoid crawling private or sensitive content unless the knowledge base sharing rules are already configured.
The crawler follows the target website’s robots.txt rules by default. Keep this enabled unless your organization explicitly controls the crawled site and allows a different behavior.
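
In practice, respecting robots.txt means checking each URL against the site's published rules before fetching it. Here is a minimal sketch of that check using Python's standard library; the user agent string is a placeholder, not the one the Prisme.ai crawler actually sends.

```python
# A minimal sketch of the robots.txt check a compliant crawler performs
# before fetching a page, using only Python's standard library. The user
# agent string here is a placeholder, not the one Prisme.ai sends.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://docs.example.com/robots.txt")
robots.read()  # fetch and parse the site's crawling rules

if robots.can_fetch("ExampleCrawler/1.0", "https://docs.example.com/product/intro"):
    print("allowed: this page may be crawled")
else:
    print("disallowed: robots.txt blocks this path")
```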

Start a website crawl

  1. Open Knowledges.
  2. Open the knowledge base that should receive the website pages.
  3. Click Add Web Source.
  4. Select Auto-Discovery.
  5. Enter the starting URL.
  6. Open Hostname settings if you need to narrow the crawl.
  7. Click Start Crawling.
[Screenshot: Auto-Discovery configuration for a Knowledges website crawl]
The crawl starts in the background. Knowledges creates one document per indexed page and processes each page like any other document: extraction, chunking, embeddings, then indexing.
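
To make the pipeline concrete, here is an illustrative sketch of the four stages. The function names, chunk size, and stub bodies are assumptions for illustration, not Prisme.ai internals.

```python
# Illustrative sketch of the per-page pipeline: extraction, chunking,
# embeddings, then indexing. All names and stub bodies are assumptions.
def extract_text(html: str) -> str:
    """Strip markup and keep the page's readable text (stubbed here)."""
    return html

def chunk(text: str, size: int = 500) -> list[str]:
    """Split extracted text into fixed-size chunks for retrieval."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunks: list[str]) -> list[list[float]]:
    """Turn each chunk into a vector (stubbed with zeros)."""
    return [[0.0] * 8 for _ in chunks]

def index_page(url: str, html: str, store: dict) -> None:
    """Store (chunk, vector) pairs under the page URL, one document per page."""
    chunks = chunk(extract_text(html))
    store[url] = list(zip(chunks, embed(chunks)))

store: dict = {}
index_page("https://docs.example.com/product/intro", "<main>Example content</main>", store)
```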

Configure the crawl

Use the hostname settings to control what gets indexed.
| Setting | What it does | Example |
| --- | --- | --- |
| Path filter | Limits discovery to one section of the site. | /docs or /self-hosting |
| Blacklisted patterns | Skips URLs that match a pattern. | /admin/.*, /search.*, /reference/.* |
| Respect robots.txt | Follows the site's crawling rules. | Keep enabled for third-party sites. |
| Sitemap only | Crawls only URLs listed in the sitemap. | Use for large or well-structured sites. |
| XPath filter | Extracts only the useful page content. | ancestor::main |
| HTTP headers | Sends custom headers to the target host. | Use for controlled internal sites that require a header. |

Scope by path

Path filters are the first protection against noisy crawls. If the site has product docs under /product, do not start from / unless the whole website is useful.

Good scope:
  • Starting URL: https://docs.example.com/product
  • Path filter: /product

Noisy scope:
  • Starting URL: https://www.example.com
  • Path filter: /
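
A sketch of the scoping rule this implies, assuming the filter keeps only same-host URLs whose path starts with the configured prefix; the exact matching semantics in Knowledges may differ.

```python
# Sketch of path-filter scoping: keep only same-host URLs whose path starts
# with the configured prefix. The matching rule is an assumption.
from urllib.parse import urlparse

def in_scope(url: str, host: str, path_filter: str) -> bool:
    parsed = urlparse(url)
    return parsed.hostname == host and parsed.path.startswith(path_filter)

print(in_scope("https://docs.example.com/product/setup", "docs.example.com", "/product"))  # True
print(in_scope("https://docs.example.com/blog/news", "docs.example.com", "/product"))      # False
```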

Exclude low-value pages

Add blacklisted patterns for pages that create duplicates or poor retrieval context:
  • Search pages
  • Login and account pages
  • Tag archives
  • Generated API reference pages when they are not useful to the agent
  • Large legal or navigation sections unrelated to the use case
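
A sketch of how blacklisted patterns might be applied, assuming each pattern is a regular expression tested against the URL's path and query; the exact semantics in Knowledges may differ.

```python
# Sketch of blacklisted-pattern matching: a URL is skipped when its path
# and query fully match any configured regular expression (an assumption).
import re

BLACKLIST = [r"/admin/.*", r"/search.*", r"/tag/.*"]

def is_blacklisted(path_and_query: str) -> bool:
    return any(re.fullmatch(p, path_and_query) for p in BLACKLIST)

print(is_blacklisted("/search?q=crawling"))  # True: matches /search.*
print(is_blacklisted("/docs/crawling"))      # False: kept for indexing
```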

Use sitemap-only mode for large sites

If the site exposes a sitemap, sitemap-only mode is usually more predictable than link discovery. It avoids crawling pages that are linked only from navigation, filters, or dynamic widgets.
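
A minimal sketch of what sitemap-only discovery amounts to: the crawl list comes entirely from the sitemap's loc entries, and in-page links are never followed.

```python
# Minimal sketch of sitemap-only discovery: fetch sitemap.xml and take its
# <loc> entries as the complete crawl list, using only the standard library.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen("https://docs.example.com/sitemap.xml") as resp:
    tree = ET.parse(resp)

urls = [loc.text for loc in tree.findall(".//sm:loc", SITEMAP_NS)]
print(f"{len(urls)} pages queued for crawling")
```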

Use an XPath filter when pages contain boilerplate

Use XPath filtering when indexed chunks contain menus, footers, cookie banners, or repeated navigation text. A common starting point is:
ancestor::main
This keeps text located under the page’s main content area and removes most surrounding layout.
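
To see why this drops boilerplate, here is a sketch using the third-party lxml package. Applying the configured expression as a predicate on extracted text nodes is an assumption about how the filter is evaluated.

```python
# Demonstrates what an XPath filter like ancestor::main keeps: text under
# <main> survives, navigation and footer text is dropped. Requires lxml.
from lxml import html

page = html.fromstring("""
<html><body>
  <nav>Home | Docs | Pricing</nav>
  <main><h1>Crawling</h1><p>Useful content.</p></main>
  <footer>© Example Corp</footer>
</body></html>
""")

kept = page.xpath("//text()[ancestor::main]")
print(" ".join(t.strip() for t in kept if t.strip()))
# -> Crawling Useful content.
```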

Monitor indexing

After the crawl starts, the knowledge base shows:
  • The configured web source.
  • The path, page limit, recrawl mode, JavaScript mode, and robots.txt mode.
  • Each discovered page as a document.
  • The processing status for each page.
Common statuses:
| Status | Meaning |
| --- | --- |
| Processing | The page has been discovered and is being extracted or embedded. |
| Ready | The page is indexed and searchable. |
| Failed | Extraction or indexing failed. Open the row to inspect the error. |
You can use Web Pages to focus the document list on crawled content only.

Keep the crawl current

Use Re-crawl when the source website changes and the knowledge base should refresh its pages. Re-crawl after:
  • A documentation release.
  • A website migration.
  • A large content update.
  • Changes to path filters, blacklist patterns, XPath filters, or headers.
For large sources, start with a small path and verify the output before broadening the crawl.

Test retrieval quality

Once a few pages are Ready:
  1. Open the agent or test interface that uses the knowledge base.
  2. Ask a precise question whose answer exists on a crawled page.
  3. Check the returned sources.
  4. If the right page is not retrieved, inspect the page chunks in the knowledge base.
  5. Adjust path filters, XPath filters, or chunking settings, then re-crawl.
Test with questions that reuse exact terms from the pages and with questions phrased in the user's own vocabulary. A good crawl should support both exact and semantic retrieval.
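
A small harness can make this check repeatable. The search function below is a hypothetical stand-in for whatever interface queries the knowledge base; it is not a Prisme.ai API.

```python
# Repeatable retrieval check: for each test question, verify that the page
# which holds the answer appears among the returned sources.
def search(question: str) -> list[dict]:
    """Hypothetical placeholder: return retrieved chunks as {"url": ..., "text": ...}."""
    return []

TESTS = [
    # (question, URL of the page that should be among the sources)
    ("How do I enable sitemap-only mode?", "https://docs.example.com/product/crawl"),
    ("The crawl picks up menus, how do I fix it?", "https://docs.example.com/product/xpath"),
]

for question, expected_url in TESTS:
    retrieved = [hit["url"] for hit in search(question)]
    print("ok " if expected_url in retrieved else "MISS", question)
```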

Troubleshooting

The crawl finds no pages
Check that the URL is reachable from the Prisme.ai platform (see the reachability sketch below), the path filter is not too restrictive, robots.txt allows crawling, and the page links are normal links that the crawler can discover.

The crawl indexes too many pages
Narrow the starting URL, add a path filter, add blacklisted patterns, or switch to sitemap-only mode if the site has a sitemap.

Indexed chunks contain boilerplate
Inspect chunks from a Ready page. If chunks contain navigation, footers, or repeated boilerplate, add an XPath filter and re-crawl.

A page fails or produces almost no content
The page may be empty, blocked, unsupported, or mostly rendered in a way the crawler cannot extract. Try a more specific page, check the page source, and add an XPath filter when useful content is present but mixed with layout.

Duplicate pages appear in the knowledge base
Add blacklisted patterns for archives, query parameters, print pages, and alternate routes. Then re-crawl the source.
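
For the reachability check in the first item, a quick probe run from the network where the Prisme.ai platform is deployed can rule out connectivity issues. This uses only the standard library.

```python
# Quick reachability probe: confirm the starting URL responds from the
# network where the platform runs before blaming crawl configuration.
import urllib.request

req = urllib.request.Request(
    "https://docs.example.com/product",
    headers={"User-Agent": "reachability-check"},
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(resp.status, resp.headers.get("Content-Type"))
except Exception as exc:
    print("unreachable:", exc)
```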

Document management

Manage uploaded files, single URLs, and crawled pages.

RAG settings

Tune chunking and retrieval after pages are indexed.