Website Crawl
Automatically extract content from your website pages
Overview
The website crawler automatically discovers and extracts content from multiple pages on your website. Instead of adding pages one by one, you can crawl your entire site (or specific sections) and have all the content added to your AI's knowledge base automatically.
The crawler follows links to discover pages, extracts text content, and processes everything so your AI can answer questions about your website.
Crawl Modes
You can choose between two crawl modes depending on your needs:
Automatic Mode
The crawler starts from your homepage and automatically discovers pages by following links. It also checks your sitemap.xml if available. Best for crawling your entire website or large sections of it.
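The discovery process works like a breadth-first search over your site's links. The product's actual crawler isn't shown here, but a minimal sketch of the same idea using only Python's standard library (the `max_pages` limit and same-domain restriction are illustrative assumptions):

```python
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from html.parser import HTMLParser
from collections import deque

class LinkParser(HTMLParser):
    """Collect href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=50):
    """Breadth-first crawl: follow same-domain links from the start page."""
    domain = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, deque([start_url]), []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to load
        pages.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

The real crawler also consults your sitemap.xml, which is simply another source of URLs to seed the queue.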
All Plans

Manual Mode
You specify exact URLs to crawl (comma-separated). The crawler only visits those specific pages. Best when you only want certain pages added to your knowledge base.
Starter+

Crawl Limits by Plan
The maximum number of pages you can crawl depends on your plan:
| Plan | Max Pages | Manual Mode | Password Protected |
|---|---|---|---|
| Free | 50 pages | — | — |
| Starter | 250 pages | ✓ | ✓ |
| Standard | 1,000 pages | ✓ | ✓ |
| Pro | 5,000 pages | ✓ | ✓ |
Password Protected Pages Starter+
Need to crawl pages behind a login? Enable the "Password Protected Pages" option to crawl members-only content, dashboards, or any password-protected areas of your website.
How to Use
- Enable the "Password Protected Pages" toggle on the crawl page
- Enter your Login Page URL (e.g., yoursite.com/login)
- Enter your Username/Email and Password
- Click Start Crawling - the system will log in first, then crawl protected pages
How It Works
When you enable password protection, the crawler:
- Visits your login page and detects the form fields automatically
- Submits your credentials (including any CSRF tokens)
- Maintains the authenticated session while crawling
- Starts from where you're redirected after login (e.g., your dashboard)
- Discovers and crawls all protected pages it can find
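The login sequence above can be sketched with Python's standard library. This is a hypothetical illustration of the general technique, not the product's implementation; the default field names `email` and `password` are assumptions:

```python
import urllib.request
from urllib.parse import urlencode
from http.cookiejar import CookieJar
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collect hidden <input> fields (e.g., CSRF tokens) from a login form."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden" and a.get("name"):
            self.fields[a["name"]] = a.get("value", "")

def login_and_get_start_url(login_url, username, password,
                            user_field="email", pass_field="password"):
    """Log in with a cookie-keeping session, return it plus the post-login URL."""
    cookies = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookies))  # persists the session
    html = opener.open(login_url).read().decode("utf-8", errors="replace")
    parser = HiddenFieldParser()
    parser.feed(html)                 # picks up CSRF tokens automatically
    form = dict(parser.fields)
    form[user_field] = username
    form[pass_field] = password
    resp = opener.open(login_url, data=urlencode(form).encode())  # POST login
    # geturl() reflects the redirect target (e.g., your dashboard) -
    # crawling of protected pages starts from there with the same opener
    return opener, resp.geturl()
```

Reusing the same opener for every subsequent request is what keeps the authenticated session alive during the crawl.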
Tip: The crawler automatically detects form fields (email, username, password) and security tokens, so it works with most login forms without additional configuration.
Advanced: Custom Field Names
If your login form uses non-standard field names, expand the "Advanced" section and specify:
- Username field name - The form field name for username/email (e.g., user_email)
- Password field name - The form field name for password (e.g., user_pass)
Limitations: Password protected crawling works with standard HTML login forms. It may not work with:
- JavaScript-based logins (React, Vue, Angular single-page apps)
- CAPTCHA or reCAPTCHA protected logins
- Two-factor authentication (2FA)
- OAuth logins (Google, Facebook, etc.)
- Multi-step login flows
Tip: Instead of using your personal account, consider creating a dedicated account specifically for crawling. This lets you control exactly what the crawler can access.
Best Practices
Before Crawling
- Make sure your website is accessible and pages load correctly
- Check that important pages are linked from your homepage or sitemap
- For password-protected crawls, verify your credentials work
Choosing Pages
- Start with your most important pages - product pages, FAQs, services
- Use Manual mode if you only need specific pages
- Avoid crawling pages with outdated or inaccurate information
After Crawling
- Review the crawled content in your knowledge base
- Remove any irrelevant pages that were captured
- Test your AI with questions about the crawled content
- Re-crawl periodically to keep content up to date
Note: Each new crawl replaces the previous one for that website. Your AI will always use the most recently crawled content.
Troubleshooting
Crawl returns fewer pages than expected
- Pages might not be linked from discoverable pages
- Some pages might be blocked by robots.txt
- Cloudflare or other security services might block the crawler
- Solution: Use Manual mode to specify exact URLs
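If you suspect robots.txt is blocking pages, you can check a URL yourself before crawling. A small sketch using Python's standard-library `urllib.robotparser` (the user agent string is an assumption; substitute your crawler's actual agent):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(page_url, user_agent="*"):
    """Return True if robots.txt permits fetching the given URL."""
    parts = urlparse(page_url)
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetches and parses the site's robots.txt
    return rp.can_fetch(user_agent, page_url)
```

Pages disallowed here will be skipped by well-behaved crawlers regardless of which crawl mode you choose.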
Password protected crawl fails
- Verify your credentials are correct
- Check if your login uses CAPTCHA or 2FA
- Try specifying custom field names in Advanced settings
- Your site might use JavaScript-based authentication (not supported)
Alternatives if crawling doesn't work:
- Temporarily make the pages public, crawl them, then re-enable protection
- Save the pages as HTML files and upload them via Bulk Upload
Content appears incomplete
- Some content might be loaded via JavaScript (not extracted)
- Content might be in images (not extracted as text)
- Solution: Add missing content manually via Text or PDF upload