
Website Training Context

Automatically crawl your website to extract content that your AI assistant can reference when answering questions about your business.

Overview

Website crawling enables your AI to:

  • Answer questions based on your website content
  • Reference company information from your site
  • Provide accurate details from web pages
  • Stay current with website updates

How It Works

Crawling Process

  1. Set Base URL: Configure your website's starting URL
  2. Start Crawl: The system begins crawling from the base URL
  3. Follow Links: The system follows links within your domain (see the sketch after this list)
  4. Extract Content: Text content is extracted from each page
  5. Index Content: Content is indexed for quick retrieval
  6. Make Available: Content becomes available to the AI during conversations
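
Steps 2-4 amount to a same-domain, breadth-first crawl. The following is a minimal sketch of that loop, assuming the requests and beautifulsoup4 packages and a page cap added for safety; it is illustrative, not the production crawler.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(base_url, max_pages=50):
    """Breadth-first crawl of pages under the base URL's domain."""
    domain = urlparse(base_url).netloc
    queue, seen, pages = deque([base_url]), {base_url}, {}

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip unreachable pages and keep crawling

        soup = BeautifulSoup(resp.text, "html.parser")
        pages[url] = soup.get_text(separator=" ", strip=True)  # step 4

        # Step 3: follow links, but never leave the starting domain.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Filtering rules such as robots.txt handling are covered under What Doesn't Get Crawled below.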

What Gets Crawled

The system crawls:

  • All pages under your base URL's domain
  • Only pages on that same domain (a security restriction)
  • Only publicly accessible pages
  • Text content from the page HTML

What Doesn't Get Crawled

The system does NOT crawl:

  • Pages on other domains (external links are never followed)
  • Password-protected pages
  • Pages blocked by robots.txt (see the sketch below)
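
A hedged sketch of how these restrictions can be enforced per candidate URL. The function name is_crawlable and the use of Python's standard urllib.robotparser are illustrative assumptions, not the product's actual implementation.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_crawlable(url, base_url, robots=None):
    """Illustrative filter: same domain, http(s) only, robots.txt honored."""
    base, target = urlparse(base_url), urlparse(url)

    # Same-domain restriction: never leave the base URL's host.
    if target.netloc != base.netloc:
        return False
    if target.scheme not in ("http", "https"):
        return False

    # Respect robots.txt if one was loaded for the site.
    if robots is not None and not robots.can_fetch("*", url):
        return False
    return True

# Usage: load robots.txt once, then filter candidate links.
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()
print(is_crawlable("https://example.com/about", "https://example.com", robots))
print(is_crawlable("https://other-site.com/", "https://example.com", robots))  # False
```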

Configuration

Base URL

The starting URL for website crawling:

Format: https://example.com or https://www.example.com

Requirements:

  • Must be a valid URL
  • Must be publicly accessible
  • Should be your main website URL

Examples:

  • https://mycompany.com
  • https://www.mycompany.com
  • https://mycompany.com/about
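
A minimal sketch of checking those requirements before saving a base URL. What counts as "valid" and "publicly accessible" here is an assumption (a well-formed http(s) URL, reachable without authentication); the product may apply different checks.

```python
from urllib.parse import urlparse

import requests

def validate_base_url(url):
    """Check that a base URL is well-formed and publicly reachable."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False, "URL must look like https://example.com"
    try:
        resp = requests.get(url, timeout=10, allow_redirects=True)
    except requests.RequestException as exc:
        return False, f"URL is not reachable: {exc}"
    if resp.status_code >= 400:
        return False, f"Server returned HTTP {resp.status_code}"
    return True, "OK"

print(validate_base_url("https://mycompany.com"))  # example URL from above
print(validate_base_url("not-a-url"))              # (False, "URL must look like ...")
```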

Crawl Frequency

Website content can be re-crawled:

  • Manually: Click "Crawl" button when needed
  • Automatically: Set up scheduled crawls (if available)

When to Re-crawl:

  • After major website updates
  • When adding new pages
  • After content changes
  • Periodically to stay current

Crawl Process

Processing Time

Crawl time depends on:

  • Site size: Number of pages
  • Page complexity: Content amount per page
  • Server response: Website speed

Typical times:

  • Small sites (< 50 pages): 5-10 minutes
  • Medium sites (50-200 pages): 10-20 minutes
  • Large sites (200+ pages): 20-60 minutes

Background Processing

After pages are crawled, content is processed in the background:

  • Chunking: Content is split into manageable chunks
  • Embedding Generation: AI embeddings are created for each chunk
  • Storage: Chunks and embeddings are stored for retrieval

Processing happens asynchronously:

  • Pages are crawled immediately
  • Chunking and embedding happen in background jobs (see the sketch after this list)
  • Content becomes available to AI as processing completes
  • You can monitor job status via the Job Queue API
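
A sketch of such a background job body, chunking a page and embedding each chunk. The chunk size, overlap, and the embed and store interfaces are illustrative assumptions; the real pipeline may differ.

```python
def chunk_text(text, size=1000, overlap=200):
    """Split page text into overlapping chunks (illustrative sizes)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def process_page(url, text, embed, store):
    """Background job: chunk the text, embed each chunk, store the results."""
    for i, chunk in enumerate(chunk_text(text)):
        vector = embed(chunk)  # embedding generation
        store.save(url=url, chunk_index=i, text=chunk, embedding=vector)

# Usage with stand-in dependencies:
class ListStore:
    def __init__(self):
        self.rows = []
    def save(self, **row):
        self.rows.append(row)

store = ListStore()
process_page("https://example.com", "Some page text. " * 200,
             embed=lambda s: [float(len(s))], store=store)
print(len(store.rows), "chunks stored")
```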

Typical processing times:

  • Small pages: 1-2 minutes each
  • Medium pages: 2-5 minutes each
  • Large pages: 5-10 minutes each

What Gets Extracted

From each page, the system extracts:

  • Page title
  • Main content text
  • Headings and subheadings
  • List items
  • Paragraph text
  • Metadata (if available)
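
The extraction above can be approximated with an HTML parser; this is a simplified sketch using beautifulsoup4, not the exact extractor the service runs.

```python
from bs4 import BeautifulSoup

def extract_content(html):
    """Pull the title, headings, and visible body text out of page HTML."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop non-content tags so only visible text remains.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "headings": [h.get_text(strip=True)
                     for h in soup.find_all(["h1", "h2", "h3"])],
        "text": soup.get_text(separator=" ", strip=True),
    }

html = ("<html><head><title>About Us</title></head>"
        "<body><h1>Our Story</h1><p>We build things.</p></body></html>")
print(extract_content(html))
```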

Content Indexing

Extracted content is:

  • Indexed for search
  • Made available to AI
  • Counted toward character limit
  • Stored for quick retrieval
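
"Indexed for quick retrieval" generally means similarity search over the stored embeddings; a toy sketch assuming cosine similarity and an in-memory list, rather than whatever index the service actually uses.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def top_chunks(query_vec, indexed, k=3):
    """Return the k stored chunks most similar to the query embedding."""
    return sorted(indexed, key=lambda c: cosine(query_vec, c["embedding"]),
                  reverse=True)[:k]

# Usage with toy 2-d embeddings:
indexed = [
    {"text": "Our pricing plans", "embedding": [0.9, 0.1]},
    {"text": "Contact support",   "embedding": [0.1, 0.9]},
]
print(top_chunks([1.0, 0.0], indexed, k=1))  # matches the pricing chunk
```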

Viewing Crawled Content

Content List

View all crawled pages showing:

  • Page title: Extracted page title
  • URL: Page URL
  • Character count: Extracted text length
  • Crawl date: When the page was last crawled

Content Details

Click on a page to see:

  • Full extracted content
  • Page URL
  • Character count
  • Last crawl time

Character Limits

Counting

Character count includes:

  • All text extracted from every crawled page
  • Page titles and headings
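
Usage is then just the sum of extracted text lengths; a trivial sketch for estimating it locally, assuming you have the extracted text (titles and headings included) for each page.

```python
def total_characters(pages):
    """Sum extracted text lengths across all crawled pages."""
    return sum(len(text) for text in pages.values())

pages = {
    "https://mycompany.com/": "Welcome to My Company...",
    "https://mycompany.com/about": "About our company...",
}
print(total_characters(pages), "characters used")
```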

Managing Limits

To stay within limits:

  • Crawl only essential pages
  • Exclude pages with unnecessary content
  • Remove outdated crawled content
  • Upgrade plan for higher limits

Best Practices

  1. Set Correct Base URL: Use your main website URL
  2. Re-crawl Regularly: Keep content current with website updates
  3. Monitor Character Usage: Check how much content is extracted
  4. Exclude Unnecessary Pages: Focus on important content
  5. Test After Changes: Re-crawl after major website updates

Troubleshooting

Crawl Fails

  • Verify base URL is accessible
  • Check website is not blocking crawlers
  • Ensure URL is correct format
  • Try accessing URL in browser

No Content Extracted

  • Check if pages have text content
  • Verify pages are not image-only
  • Ensure content is present in the HTML itself (not rendered only by client-side JavaScript)
  • Check that robots.txt isn't blocking the crawler

Character Limit Exceeded

  • Delete old crawled content
  • Re-crawl with fewer pages
  • Focus on essential pages only
  • Upgrade plan for higher limits

Security Considerations

Domain Restriction

The crawler accesses only:

  • Pages on the same domain as the base URL
  • Publicly accessible pages

Password-protected areas are never entered.

Data Privacy

  • Only public content is crawled
  • Avoid publishing private or sensitive data on public pages, since anything public may be crawled
  • Crawled content is stored securely
  • Only your tenant can access your crawled content

Monitoring Processing

Check Job Status

Monitor background processing jobs:

  • Use the Job Queue API to check job status
  • View processing progress and completion
  • Identify any failed jobs that need attention

Processing Indicators

  • Pending: Job is queued, waiting to start
  • Processing: Job is actively being processed
  • Completed: Content is ready for AI use
  • Failed: Processing failed (check error details)
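
A hedged polling sketch against the Job Queue API. The base URL, endpoint path, response shape, and authentication header below are assumptions made for illustration; consult the Job Queue API reference for the real contract.

```python
import time

import requests

API_BASE = "https://api.example.com"  # hypothetical API host
JOB_PATH = "/v1/jobs/{job_id}"        # hypothetical endpoint

def wait_for_job(job_id, api_key, poll_seconds=10, timeout=600):
    """Poll a job until it reaches a terminal state (Completed or Failed)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(
            API_BASE + JOB_PATH.format(job_id=job_id),
            headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
            timeout=10,
        )
        resp.raise_for_status()
        status = resp.json().get("status")  # assumed response field
        if status in ("Completed", "Failed"):
            return status
        time.sleep(poll_seconds)  # Pending / Processing: keep waiting
    raise TimeoutError(f"Job {job_id} did not finish within {timeout}s")
```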
