Website Training Context
Automatically crawl your website to extract content that your AI assistant can reference when answering questions about your business.
Overview
Website crawling enables your AI to:
- Answer questions based on your website content
- Reference company information from your site
- Provide accurate details from web pages
- Stay current with website updates
How It Works
Crawling Process
- Set Base URL: Configure your website's starting URL
- Start Crawl: The system begins crawling from the base URL
- Follow Links: The system follows links within your domain
- Extract Content: Text content is extracted from each page
- Index Content: Content is indexed for quick retrieval
- Make Available: Content becomes available to the AI during conversations
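The steps above can be sketched as a simple breadth-first traversal that stays on one domain. This is an illustrative sketch, not the product's actual implementation; `crawl` and `fetch_links` are hypothetical names, and `fetch_links` stands in for downloading and parsing a page's HTML:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(base_url, fetch_links):
    """Breadth-first crawl restricted to the base URL's domain.

    fetch_links(url) is assumed to return the outgoing links found on
    a page; a real crawler would fetch and parse the HTML here.
    """
    domain = urlparse(base_url).netloc
    queue, visited = deque([base_url]), set()
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):
            absolute = urljoin(url, link)            # resolve relative links
            if urlparse(absolute).netloc == domain:  # same-domain only
                queue.append(absolute)
    return visited
```

Links pointing to other domains are resolved but never enqueued, which is what keeps the crawl inside your site.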
What Gets Crawled
The system crawls:
- All pages under your base URL's domain
- Only pages on the same domain as the base URL (a security restriction)
- Publicly accessible pages
- Text content from HTML
What Doesn't Get Crawled
The system does NOT crawl:
- Pages on different domains
- Password-protected pages
- Pages blocked by robots.txt
- External links
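These inclusion and exclusion rules can be expressed as a small filter. The sketch below uses Python's standard `urllib.robotparser`; the function name and the choice to pass the robots.txt contents in directly (so the check needs no network access) are illustrative assumptions. Password-protected pages aren't filtered here; those simply fail at fetch time:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_crawlable(url, base_url, robots_txt):
    """Apply the exclusion rules: same domain as the base URL,
    and not blocked by robots.txt.

    robots_txt is the raw contents of the site's robots.txt file.
    """
    if urlparse(url).netloc != urlparse(base_url).netloc:
        return False  # different domain / external link
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch("*", url)
```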
Configuration
Base URL
The starting URL for website crawling:
Format: https://example.com or https://www.example.com
Requirements:
- Must be a valid URL
- Must be publicly accessible
- Should be your main website URL
Example:
- https://mycompany.com
- https://www.mycompany.com
- https://mycompany.com/about
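The format requirements can be checked up front with a minimal sketch; `validate_base_url` is a hypothetical helper, and public accessibility can only be confirmed by actually fetching the URL:

```python
from urllib.parse import urlparse

def validate_base_url(url):
    """Check the format requirements: an absolute http(s) URL
    with a hostname. Does NOT verify the site is reachable."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```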
Crawl Frequency
Website content can be re-crawled:
- Manually: Click "Crawl" button when needed
- Automatically: Set up scheduled crawls (if available)
When to Re-crawl:
- After major website updates
- When adding new pages
- After content changes
- Periodically to stay current
Crawl Process
Processing Time
Crawl time depends on:
- Site size: Number of pages
- Page complexity: Content amount per page
- Server response: Website speed
Typical times:
- Small sites (< 50 pages): 5-10 minutes
- Medium sites (50-200 pages): 10-20 minutes
- Large sites (200+ pages): 20-60 minutes
Background Processing
After pages are crawled, content is processed in the background:
- Chunking: Content is split into manageable chunks
- Embedding Generation: AI embeddings are created for each chunk
- Storage: Chunks and embeddings are stored for retrieval
Processing happens asynchronously:
- Pages are crawled immediately
- Chunking and embedding happen in background jobs
- Content becomes available to AI as processing completes
- You can monitor job status via the Job Queue API
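The chunking step can be illustrated with a simple fixed-size splitter. The chunk size, overlap, and strategy here are assumptions for illustration; the actual pipeline may split on sentence or heading boundaries instead:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split extracted page text into overlapping chunks.

    The overlap keeps sentences that straddle a chunk boundary
    retrievable from either side. Each chunk would then be sent
    to an embedding model and stored alongside its vector.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```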
Typical processing times:
- Small pages: 1-2 minutes per page
- Medium pages: 2-5 minutes per page
- Large pages: 5-10 minutes per page
What Gets Extracted
From each page, the system extracts:
- Page title
- Main content text
- Headings and subheadings
- List items
- Paragraph text
- Metadata (if available)
Content Indexing
Extracted content is:
- Indexed for search
- Made available to AI
- Counted toward character limit
- Stored for quick retrieval
Viewing Crawled Content
Content List
View all crawled pages showing:
- Page title: Extracted page title
- URL: Page URL
- Character count: Extracted text length
- Crawl date: When page was last crawled
Content Details
Click on a page to see:
- Full extracted content
- Page URL
- Character count
- Last crawl time
Character Limits
Counting
Character count includes:
- All extracted text from all pages
- Page titles and headings
- Content from every crawled page
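The counting rule above can be sketched as a small helper. The `title` and `content` field names are assumptions for illustration, not the product's actual data model:

```python
def total_characters(pages):
    """Sum extracted characters across all crawled pages,
    counting page titles as well as body text."""
    return sum(len(p.get("title", "")) + len(p.get("content", ""))
               for p in pages)
```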
Managing Limits
To stay within limits:
- Crawl only essential pages
- Exclude pages with unnecessary content
- Remove outdated crawled content
- Upgrade plan for higher limits
Best Practices
- Set Correct Base URL: Use your main website URL
- Re-crawl Regularly: Keep content current with website updates
- Monitor Character Usage: Check how much content is extracted
- Exclude Unnecessary Pages: Focus on important content
- Test After Changes: Re-crawl after major website updates
Troubleshooting
Crawl Fails
- Verify the base URL is accessible
- Check that the website is not blocking crawlers
- Ensure the URL is in the correct format
- Try accessing the URL in a browser
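A quick diagnostic along these lines can be scripted with Python's standard `urllib`. The injectable `opener` parameter is a testing convenience, not part of the product; by default the function performs a real HTTP request:

```python
import urllib.error
import urllib.request

def check_url_accessible(url, opener=urllib.request.urlopen, timeout=10):
    """Pre-crawl check: can the URL be fetched at all?

    Returns (ok, detail) instead of raising, so it can be used
    in a troubleshooting script.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            code = getattr(resp, "status", None) or 200
            return code < 400, f"HTTP {code}"
    except urllib.error.URLError as exc:
        return False, str(exc.reason)
    except ValueError as exc:  # malformed URL, e.g. missing scheme
        return False, str(exc)
```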
No Content Extracted
- Check whether pages have text content
- Verify pages are not image-only
- Ensure content is in the HTML (not JavaScript-rendered)
- Check that robots.txt isn't blocking the crawler
Character Limit Exceeded
- Delete old crawled content
- Re-crawl with fewer pages
- Focus on essential pages only
- Upgrade plan for higher limits
Security Considerations
Domain Restriction
The crawler accesses only:
- Pages on the same domain as the base URL
- Publicly accessible pages
Password-protected areas are never accessed.
Data Privacy
- Only public content is crawled
- No private or sensitive data should be on public pages
- Crawled content is stored securely
- Only your tenant can access your crawled content
Monitoring Processing
Check Job Status
Monitor background processing jobs:
- Use the Job Queue API to check job status
- View processing progress and completion
- Identify any failed jobs that need attention
Processing Indicators
- Pending: Job is queued, waiting to start
- Processing: Job is actively being processed
- Completed: Content is ready for AI use
- Failed: Processing failed (check error details)
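A hypothetical polling loop over these states might look like the following. The `get_status` callable stands in for a Job Queue API request, and the exact status strings (assumed lowercase here) should be checked against the API documentation:

```python
import time

TERMINAL = {"completed", "failed"}  # states after which polling stops

def wait_for_job(job_id, get_status, poll_interval=5, timeout=600,
                 sleep=time.sleep):
    """Poll a background job until it reaches a terminal state.

    get_status(job_id) is assumed to return one of:
    "pending", "processing", "completed", "failed".
    """
    status, elapsed = None, 0
    while elapsed < timeout:
        status = get_status(job_id)
        if status in TERMINAL:
            return status
        sleep(poll_interval)
        elapsed += poll_interval
    raise TimeoutError(f"job {job_id} still {status!r} after {timeout}s")
```

The injectable `sleep` keeps the loop testable; in production the defaults apply.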
Next Steps
- View Overview for general training context information
- Learn about Products for product data management
- Learn about Documents for document uploads
- Learn about Instructions for custom instructions
- Monitor processing with Job Queue API

