Document Management

Documents are the foundation of your knowledge bases. This guide covers how to add, organize, and maintain your document collections.

Uploading Files

Drag and Drop

The simplest way to add files:

Open a knowledge base
Drag files from your computer onto the page
Drop them in the upload zone
Wait for processing

File Picker

Alternatively:

Click Upload Files
Select files from your computer
Click Open

Bulk Upload

To add many files at once:

Drag a folder onto the upload zone (supported in Chrome and Edge)
Or select multiple files in the file picker
Files are processed in parallel

Supported File Types

Category	Formats	Notes
Documents	PDF, DOCX, DOC, TXT, RTF	Text is extracted automatically
Presentations	PPTX, PPT	Slide content and notes
Spreadsheets	XLSX, XLS, CSV	Cell values and headers
Web	HTML, Markdown	Rendered content
Code	Most languages	Syntax-aware chunking

Maximum file size depends on your organization’s configuration. Typical limits are 50 to 100 MB per file.
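If you prepare uploads with a script, a quick local check can catch files that would likely be rejected. The following is a minimal sketch, not part of the product: the extension list is derived from the table above (the exact extensions accepted for HTML and Markdown are an assumption), the Code row is left out because the supported languages are not listed here, and `MAX_SIZE_BYTES` is a hypothetical value you should replace with your organization's limit.

```python
from pathlib import Path

# Extensions derived from the table above. ".md" and ".html" are assumptions
# about how the "Web" row maps to file extensions.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".txt", ".rtf",  # Documents
    ".pptx", ".ppt",                          # Presentations
    ".xlsx", ".xls", ".csv",                  # Spreadsheets
    ".html", ".md",                           # Web
}

# Hypothetical limit (lower end of the typical range); use your own value.
MAX_SIZE_BYTES = 50 * 1024 * 1024


def check_file(name: str, size_bytes: int) -> str:
    """Return 'ok' or a short reason the upload would likely fail."""
    extension = Path(name).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        return "unsupported format"
    if size_bytes > MAX_SIZE_BYTES:
        return "too large"
    return "ok"


print(check_file("Quarterly Report.PDF", 2_000_000))
print(check_file("archive.zip", 1_000))
print(check_file("big.csv", 80 * 1024 * 1024))
```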

Images and Scans

For documents that are scanned images or contain images with text:

OCR is applied automatically to extract text
Quality depends on image clarity
Consider re-scanning poor quality documents

Adding Web Content

Single URLs

To add an individual web page:

Click Add URL
Enter the full URL (including https://)
Click Add

The page is fetched immediately and its content indexed.
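When URLs come from a spreadsheet or a script, it can help to check that each one is a full URL before pasting it in. The sketch below uses Python's standard `urllib.parse`; the helper name `is_full_url` is illustrative and not part of the product.

```python
from urllib.parse import urlparse


def is_full_url(url: str) -> bool:
    """True when the URL has an http(s) scheme and a host name."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(is_full_url("https://example.com/docs/intro"))
print(is_full_url("example.com/docs/intro"))
```

The second call returns `False` because the scheme is missing, which is the most common reason a pasted address is rejected.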

Web Crawling

For multiple pages from a website:

Click Add Web Source
Enter the starting URL
Configure basic settings such as path filters, blacklisted patterns, sitemap mode, and XPath filtering
Click Start Crawling

The crawler discovers pages by following links and indexes their content. For the detailed workflow and current crawl settings, see Crawl a Website.

Advanced Crawler Settings

For more control, expand Hostname settings to configure:

Path filters - Only crawl specific sections (e.g., /docs/)
Blacklisted patterns - Skip low-value or duplicate URL patterns
Robots.txt - Keep the site’s crawling rules enabled by default
Sitemap mode - Only follow links from the sitemap
XPath filter - Extract only the useful page content
HTTP headers - Send custom headers to controlled internal sites

These settings can be configured per hostname if your source spans multiple domains.
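To reason about which pages a combination of path filters and blacklisted patterns would let through, you can model the rules locally. This is only a sketch under stated assumptions: it treats path filters as path prefixes and blacklisted patterns as regular expressions, and the `SETTINGS` keys are hypothetical. The crawler's actual matching syntax may differ, so check Crawl a Website for the current behavior.

```python
import re
from urllib.parse import urlparse

# Hypothetical settings for one hostname; key names are illustrative only.
SETTINGS = {
    "path_filters": ["/docs/"],
    "blacklisted_patterns": [r"/docs/archive/", r"\?print=1$"],
}


def should_crawl(url: str, settings: dict) -> bool:
    """Apply path filters first, then drop URLs matching a blacklisted pattern."""
    path = urlparse(url).path
    filters = settings["path_filters"]
    if filters and not any(path.startswith(prefix) for prefix in filters):
        return False
    return not any(re.search(p, url) for p in settings["blacklisted_patterns"])


for candidate in [
    "https://example.com/docs/intro",
    "https://example.com/blog/post",
    "https://example.com/docs/archive/2019",
    "https://example.com/docs/intro?print=1",
]:
    print(candidate, should_crawl(candidate, SETTINGS))
```

Only the first URL passes: the second is outside `/docs/`, and the last two match a blacklisted pattern.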

Crawl Status

While crawling, a status banner shows:

Pages discovered, indexed, and skipped
Any errors encountered
Estimated completion

You can pause a crawl in progress and resume later.

Automatic Recrawling

Keep content fresh with scheduled recrawls:

Open the web source settings
Set Recrawl Schedule:
- Manual only
- Every 12 hours
- Daily
- Weekly
- Monthly
Save

The crawler checks for new and updated pages on schedule. Unchanged pages are skipped to save processing time.
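One common way to skip unchanged pages is to compare a fingerprint of each page's content with the one stored at the previous crawl. The sketch below illustrates that idea with a SHA-256 hash; it is an assumption about how such a check could work, not a description of how this crawler detects changes. The function names are illustrative.

```python
import hashlib


def content_fingerprint(page_text: str) -> str:
    """Stable fingerprint of a page's extracted text."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()


def pages_to_reindex(previous: dict[str, str], fetched: dict[str, str]) -> list[str]:
    """Return URLs that are new or whose content changed since the last crawl.

    previous maps URL -> fingerprint stored at the last crawl.
    fetched maps URL -> text fetched in this crawl.
    """
    changed = []
    for url, text in fetched.items():
        if previous.get(url) != content_fingerprint(text):
            changed.append(url)
    return changed


previous = {"https://example.com/a": content_fingerprint("hello")}
fetched = {"https://example.com/a": "hello", "https://example.com/b": "new page"}
print(pages_to_reindex(previous, fetched))
```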

Document Processing

When you add a document, several things happen:

1. Text Extraction

Content is extracted from the file format. This includes:

Body text
Headers and titles
Table content
Image captions (if available)
Metadata (author, date, etc.)

2. Chunking

Text is split into smaller pieces called chunks. This is necessary because:

Search works better with focused passages
AI models have context limits
Relevant information can be isolated

Default settings:

Chunk size: 512 tokens
Overlap: 50 tokens (consecutive chunks share context)
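To see what these defaults mean in practice, the sketch below splits a token list into overlapping windows. It is a simplified fixed-size strategy for illustration only: placeholder strings stand in for real model tokens, and the product's chunker may also respect headings, sentences, or code syntax.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Placeholder tokens; a real tokenizer would produce these from the text.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])          # [512, 512, 76]
print(chunks[0][-50:] == chunks[1][:50])  # True
```

With the defaults, each new chunk starts 462 tokens after the previous one, so a 1,000-token document produces three chunks, and the last 50 tokens of one chunk are the first 50 of the next.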

3. Embedding

Each chunk is converted to a vector (a list of numbers) using the embedding model. This enables semantic search - finding content by meaning, not just keywords.
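Semantic search typically compares vectors with a similarity measure such as cosine similarity. The sketch below uses made-up three-dimensional vectors to show the idea; real embeddings have hundreds or thousands of dimensions, and the similarity measure your deployment uses is an assumption here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for embedding model output.
query = [0.9, 0.1, 0.0]
chunk_about_refunds = [0.8, 0.2, 0.1]
chunk_about_shipping = [0.1, 0.1, 0.9]

print(cosine_similarity(query, chunk_about_refunds))
print(cosine_similarity(query, chunk_about_shipping))
```

The first score is much higher than the second, so the refunds chunk would rank first even if it shares no exact keywords with the query.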

4. Indexing

Chunks and their embeddings are stored in a vector database, ready for search.
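Conceptually, the index stores each chunk's text next to its vector and returns the closest chunks for a query vector. The toy class below is a sketch of that idea only: `VectorIndex` and its methods are hypothetical names, scoring uses a plain dot product, and a production vector database uses approximate search structures rather than scanning every entry.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """Toy in-memory index of (document id, chunk text, vector) entries."""
    entries: list[tuple[str, str, list[float]]] = field(default_factory=list)

    def add(self, doc_id: str, text: str, vector: list[float]) -> None:
        self.entries.append((doc_id, text, vector))

    def remove_document(self, doc_id: str) -> None:
        """Drop every chunk that belongs to one document."""
        self.entries = [e for e in self.entries if e[0] != doc_id]

    def search(self, query: list[float], top_k: int = 3) -> list[str]:
        """Return the texts of the top_k chunks by dot product with the query."""
        def score(entry: tuple[str, str, list[float]]) -> float:
            return sum(q * v for q, v in zip(query, entry[2]))
        return [text for _, text, _ in heapq.nlargest(top_k, self.entries, key=score)]


index = VectorIndex()
index.add("doc-a", "refund policy", [1.0, 0.0])
index.add("doc-b", "shipping times", [0.0, 1.0])
print(index.search([0.9, 0.1], top_k=1))
```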

Filtering Documents

Use the source filter to narrow the document list:

Filter	Shows
All	Everything in the knowledge base
Files	Uploaded documents only
Web	Crawled web pages only

Combine with search to quickly find specific documents.

Document Status

Each document shows a status that updates in real time:

Status	Meaning
Queued	Waiting to be processed
Processing	Currently being extracted and indexed
Ready	Successfully processed and searchable
Error	Something went wrong during processing

Status changes appear automatically, so there is no need to refresh the page. After uploading, watch as documents move from Queued to Processing to Ready.

Click an Error status to see details. Common issues:

Unsupported format - File type not recognized
Password protected - Document is encrypted
Extraction failed - Content couldn’t be read
Too large - File exceeds size limit
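The status lifecycle described above can be modelled as a small state machine, which is handy if you track documents in your own tooling. This is a sketch: the four status names come from the table, but the transitions out of Ready and Error (for example after a reindex) are assumptions, and `can_move` is a hypothetical helper.

```python
from enum import Enum


class Status(Enum):
    QUEUED = "Queued"
    PROCESSING = "Processing"
    READY = "Ready"
    ERROR = "Error"


# Assumed transitions, inferred from the status descriptions.
ALLOWED = {
    Status.QUEUED: {Status.PROCESSING},
    Status.PROCESSING: {Status.READY, Status.ERROR},
    Status.READY: {Status.QUEUED},  # assumption: a reindex queues the document again
    Status.ERROR: {Status.QUEUED},  # assumption: a retry queues the document again
}


def can_move(current: Status, new: Status) -> bool:
    """True when moving from `current` to `new` is an expected transition."""
    return new in ALLOWED[current]


print(can_move(Status.QUEUED, Status.PROCESSING))
print(can_move(Status.QUEUED, Status.READY))
```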

Viewing Document Details

Click any document to see:

File information - Name, type, size, dates
Processing details - Chunk count, tokens, parser used
Chunks viewer - See exactly how the document was split

The Chunks Viewer

Understanding how documents are chunked helps debug retrieval issues:

Click View Chunks on any document
Browse through chunks (paginated for large documents)
Expand any chunk to see its full text
Search within the chunks to find specific content
Copy chunk text for testing or debugging

Each chunk shows:

Text content (expandable)
Page number (for PDFs)
Token count
Position in document

If important information is split awkwardly across chunks, consider adjusting the chunk size or using a different chunking strategy in RAG Settings.

Document Tags

Organize documents with tags:

Select a document
Click Edit Tags
Add or remove tags
Save

Tags help with:

Filtering the document list
Finding specific content types
Organizing large collections

Updating Documents

To replace a document with a new version:

Delete the old document
Upload the new version

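If the file itself has not changed and you only need to re-process it:

Click Reindex on the document
This re-processes the existing file; it does not pick up a newer version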

For frequently updated content, consider using connectors that sync automatically rather than manual uploads.

Deleting Documents

To remove a document:

Find it in the documents list
Click the delete icon (trash)
Confirm deletion

The document and all its chunks are removed. This affects search results immediately.

Bulk Deletion

To delete multiple documents:

Use filters to narrow the list
Select documents using checkboxes
Click Delete Selected
Confirm

Reindexing

When you change RAG settings (chunk size, embedding model, etc.), existing documents keep their old chunks. To apply the new settings, reindex the affected documents:

Single Document

Click Reindex on any document to reprocess it with current settings.

All Documents

To reindex the entire knowledge base:

Go to Settings
Scroll to Danger Zone
Click Reindex All Documents
Confirm

Reindexing large knowledge bases takes time and consumes processing resources. Documents remain searchable during reindexing, but results may be inconsistent until complete.

Best Practices

Use clean source documents

Well-formatted documents with clear headings produce better chunks and retrieval. Clean up messy documents before uploading.

Test with representative queries

After adding documents, test search in the Playground. Verify that relevant content is retrieved for typical questions.

Remove duplicates

Duplicate content hurts retrieval quality. If the same information appears in multiple documents, keep the most authoritative version.
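Before uploading a large folder, you can look for exact duplicates locally by hashing file contents. The sketch below groups files with identical bytes; `find_duplicates` is an illustrative helper, and it will not catch near-duplicates such as the same text saved in two formats.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose bytes are identical."""
    by_digest = defaultdict(list)
    for name, data in files.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


files = {
    "policy.txt": b"Refunds are issued within 14 days.",
    "policy-copy.txt": b"Refunds are issued within 14 days.",
    "shipping.txt": b"Orders ship in 2 business days.",
}
print(find_duplicates(files))
```

To run this against a real folder, read each file with `pathlib.Path.read_bytes()` and pass the resulting mapping to the function.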

Keep documents focused

Many focused documents are better than a few giant ones. Split large documents by topic if they cover multiple subjects.

Use meaningful filenames

Filenames become part of the metadata and can help with retrieval. Use descriptive names, not “Document1.pdf”.

Next Steps

Connect external sources

Configure RAG settings
