
Website Training Context

Automatically crawl your website to extract content that your AI assistant can reference when answering questions about your business.

Overview

Website crawling enables your AI to:

  • Answer questions based on your website content
  • Reference company information from your site
  • Provide accurate details from web pages
  • Stay current with website updates

How It Works

Crawling Process

  1. Set Base URL: Configure your website's starting URL
  2. Start Crawl: The system begins crawling from the base URL
  3. Follow Links: The system follows links within your domain (see the sketch after this list)
  4. Extract Content: Text content is extracted from each page
  5. Index Content: Content is indexed for quick retrieval
  6. Make Available: Content becomes available to the AI during conversations
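
Steps 2-4 amount to a same-domain, breadth-first crawl. The following is a minimal sketch of that loop, assuming the requests and beautifulsoup4 packages and a page cap added for safety; it is illustrative, not the production crawler.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(base_url, max_pages=50):
    """Breadth-first crawl of pages under the base URL's domain."""
    domain = urlparse(base_url).netloc
    queue, seen, pages = deque([base_url]), {base_url}, {}

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip unreachable pages and keep crawling

        soup = BeautifulSoup(resp.text, "html.parser")
        pages[url] = soup.get_text(separator=" ", strip=True)  # step 4

        # Step 3: follow links, but never leave the starting domain.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Filtering rules such as robots.txt handling are covered under What Doesn't Get Crawled below.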

What Gets Crawled

The system crawls:

  • All pages under your base URL's domain
  • Only pages on that same domain (a security restriction)
  • Only publicly accessible pages
  • Text content from the page HTML

What Doesn't Get Crawled

The system does NOT crawl:

  • Pages on other domains (external links are never followed)
  • Password-protected pages
  • Pages blocked by robots.txt (see the sketch below)
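
A hedged sketch of how these restrictions can be enforced per candidate URL. The function name is_crawlable and the use of Python's standard urllib.robotparser are illustrative assumptions, not the product's actual implementation.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_crawlable(url, base_url, robots=None):
    """Illustrative filter: same domain, http(s) only, robots.txt honored."""
    base, target = urlparse(base_url), urlparse(url)

    # Same-domain restriction: never leave the base URL's host.
    if target.netloc != base.netloc:
        return False
    if target.scheme not in ("http", "https"):
        return False

    # Respect robots.txt if one was loaded for the site.
    if robots is not None and not robots.can_fetch("*", url):
        return False
    return True

# Usage: load robots.txt once, then filter candidate links.
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()
print(is_crawlable("https://example.com/about", "https://example.com", robots))
print(is_crawlable("https://other-site.com/", "https://example.com", robots))  # False
```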

Configuration

Base URL

The starting URL for website crawling:

Format: https://example.com or https://www.example.com

Requirements:

  • Must be a valid URL
  • Must be publicly accessible
  • Should be your main website URL

Examples:

  • https://mycompany.com
  • https://www.mycompany.com
  • https://mycompany.com/about
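
A minimal sketch of checking those requirements before saving a base URL. What counts as "valid" and "publicly accessible" here is an assumption (a well-formed http(s) URL, reachable without authentication); the product may apply different checks.

```python
from urllib.parse import urlparse

import requests

def validate_base_url(url):
    """Check that a base URL is well-formed and publicly reachable."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False, "URL must look like https://example.com"
    try:
        resp = requests.get(url, timeout=10, allow_redirects=True)
    except requests.RequestException as exc:
        return False, f"URL is not reachable: {exc}"
    if resp.status_code >= 400:
        return False, f"Server returned HTTP {resp.status_code}"
    return True, "OK"

print(validate_base_url("https://mycompany.com"))  # example URL from above
print(validate_base_url("not-a-url"))              # (False, "URL must look like ...")
```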

Crawl Frequency

Website content can be re-crawled:

  • Manually: Click "Crawl" button when needed
  • Automatically: Set up scheduled crawls (if available)

When to Re-crawl:

  • After major website updates
  • When adding new pages
  • After content changes
  • Periodically to stay current

Crawl Process

Processing Time

Crawl time depends on:

  • Site size: Number of pages
  • Page complexity: Content amount per page
  • Server response: Website speed

Typical times:

  • Small sites (< 50 pages): 5-10 minutes
  • Medium sites (50-200 pages): 10-20 minutes
  • Large sites (200+ pages): 20-60 minutes

Background Processing

After pages are crawled, content is processed in the background:

  • Chunking: Content is split into manageable chunks
  • Embedding Generation: AI embeddings are created for each chunk
  • Storage: Chunks and embeddings are stored for retrieval

Processing happens asynchronously:

  • Pages are crawled immediately
  • Chunking and embedding happen in background jobs (see the sketch after this list)
  • Content becomes available to AI as processing completes
  • You can monitor job status via the Job Queue API
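
A sketch of such a background job body, chunking a page and embedding each chunk. The chunk size, overlap, and the embed and store interfaces are illustrative assumptions; the real pipeline may differ.

```python
def chunk_text(text, size=1000, overlap=200):
    """Split page text into overlapping chunks (illustrative sizes)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def process_page(url, text, embed, store):
    """Background job: chunk the text, embed each chunk, store the results."""
    for i, chunk in enumerate(chunk_text(text)):
        vector = embed(chunk)  # embedding generation
        store.save(url=url, chunk_index=i, text=chunk, embedding=vector)

# Usage with stand-in dependencies:
class ListStore:
    def __init__(self):
        self.rows = []
    def save(self, **row):
        self.rows.append(row)

store = ListStore()
process_page("https://example.com", "Some page text. " * 200,
             embed=lambda s: [float(len(s))], store=store)
print(len(store.rows), "chunks stored")
```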

Typical processing times:

  • Small pages: 1-2 minutes each
  • Medium pages: 2-5 minutes each
  • Large pages: 5-10 minutes each

What Gets Extracted

From each page, the system extracts:

  • Page title
  • Main content text
  • Headings and subheadings
  • List items
  • Paragraph text
  • Metadata (if available)
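
The extraction above can be approximated with an HTML parser; this is a simplified sketch using beautifulsoup4, not the exact extractor the service runs.

```python
from bs4 import BeautifulSoup

def extract_content(html):
    """Pull the title, headings, and visible body text out of page HTML."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop non-content tags so only visible text remains.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "headings": [h.get_text(strip=True)
                     for h in soup.find_all(["h1", "h2", "h3"])],
        "text": soup.get_text(separator=" ", strip=True),
    }

html = ("<html><head><title>About Us</title></head>"
        "<body><h1>Our Story</h1><p>We build things.</p></body></html>")
print(extract_content(html))
```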

Content Indexing

Extracted content is:

  • Indexed for search
  • Made available to AI
  • Counted toward character limit
  • Stored for quick retrieval
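
"Indexed for quick retrieval" generally means similarity search over the stored embeddings; a toy sketch assuming cosine similarity and an in-memory list, rather than whatever index the service actually uses.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def top_chunks(query_vec, indexed, k=3):
    """Return the k stored chunks most similar to the query embedding."""
    return sorted(indexed, key=lambda c: cosine(query_vec, c["embedding"]),
                  reverse=True)[:k]

# Usage with toy 2-d embeddings:
indexed = [
    {"text": "Our pricing plans", "embedding": [0.9, 0.1]},
    {"text": "Contact support",   "embedding": [0.1, 0.9]},
]
print(top_chunks([1.0, 0.0], indexed, k=1))  # matches the pricing chunk
```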

Viewing Crawled Content

Content List

View all crawled pages showing:

  • Page title: Extracted page title
  • URL: Page URL
  • Character count: Extracted text length
  • Crawl date: When the page was last crawled

Content Details

Click on a page to see:

  • Full extracted content
  • Page URL
  • Character count
  • Last crawl time

Character Limits

Counting

Character count includes:

  • All text extracted from every crawled page
  • Page titles and headings
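
Usage is then just the sum of extracted text lengths; a trivial sketch for estimating it locally, assuming you have the extracted text (titles and headings included) for each page.

```python
def total_characters(pages):
    """Sum extracted text lengths across all crawled pages."""
    return sum(len(text) for text in pages.values())

pages = {
    "https://mycompany.com/": "Welcome to My Company...",
    "https://mycompany.com/about": "About our company...",
}
print(total_characters(pages), "characters used")
```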

Managing Limits

To stay within limits:

  • Crawl only essential pages
  • Exclude pages with unnecessary content
  • Remove outdated crawled content
  • Upgrade plan for higher limits

Best Practices

  1. Set Correct Base URL: Use your main website URL
  2. Re-crawl Regularly: Keep content current with website updates
  3. Monitor Character Usage: Check how much content is extracted
  4. Exclude Unnecessary Pages: Focus on important content
  5. Test After Changes: Re-crawl after major website updates

Troubleshooting

Crawl Fails

  • Verify base URL is accessible
  • Check website is not blocking crawlers
  • Ensure URL is correct format
  • Try accessing URL in browser

No Content Extracted

  • Check if pages have text content
  • Verify pages are not image-only
  • Ensure content is present in the HTML itself (not rendered only by client-side JavaScript)
  • Check that robots.txt isn't blocking the crawler

Character Limit Exceeded

  • Delete old crawled content
  • Re-crawl with fewer pages
  • Focus on essential pages only
  • Upgrade plan for higher limits

Security Considerations

Domain Restriction

The crawler accesses only:

  • Pages on the same domain as the base URL
  • Publicly accessible pages

Password-protected areas are never entered.

Data Privacy

  • Only public content is crawled
  • Avoid publishing private or sensitive data on public pages, since anything public may be crawled
  • Crawled content is stored securely
  • Only your tenant can access your crawled content

Monitoring Processing

Check Job Status

Monitor background processing jobs:

  • Use the Job Queue API to check job status
  • View processing progress and completion
  • Identify any failed jobs that need attention

Processing Indicators

  • Pending: Job is queued, waiting to start
  • Processing: Job is actively being processed
  • Completed: Content is ready for AI use
  • Failed: Processing failed (check error details)
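
A hedged polling sketch against the Job Queue API. The base URL, endpoint path, response shape, and authentication header below are assumptions made for illustration; consult the Job Queue API reference for the real contract.

```python
import time

import requests

API_BASE = "https://api.example.com"  # hypothetical API host
JOB_PATH = "/v1/jobs/{job_id}"        # hypothetical endpoint

def wait_for_job(job_id, api_key, poll_seconds=10, timeout=600):
    """Poll a job until it reaches a terminal state (Completed or Failed)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(
            API_BASE + JOB_PATH.format(job_id=job_id),
            headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
            timeout=10,
        )
        resp.raise_for_status()
        status = resp.json().get("status")  # assumed response field
        if status in ("Completed", "Failed"):
            return status
        time.sleep(poll_seconds)  # Pending / Processing: keep waiting
    raise TimeoutError(f"Job {job_id} did not finish within {timeout}s")
```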
