Website Crawl
Automatically extract content from your website pages
Overview
The website crawler automatically discovers and extracts content from multiple pages on your website. Instead of adding pages one by one, you can crawl your entire site (or specific sections) and have all the content added to your AI's knowledge base automatically.
The crawler follows links to discover pages, extracts text content, and processes everything so your AI can answer questions about your website.
Crawl Modes
You can choose between two crawl modes depending on your needs:
Automatic Mode
The crawler starts from your homepage and automatically discovers pages by following links. It also checks your sitemap.xml if available. Best for crawling your entire website or large sections of it.
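The discovery process works like a breadth-first search over your site's links. The product's actual crawler isn't shown here, but a minimal sketch of the same idea using only Python's standard library (the `max_pages` limit and same-domain restriction are illustrative assumptions):

```python
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from html.parser import HTMLParser
from collections import deque

class LinkParser(HTMLParser):
    """Collect href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=50):
    """Breadth-first crawl: follow same-domain links from the start page."""
    domain = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, deque([start_url]), []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to load
        pages.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

The real crawler also consults your sitemap.xml, which is simply another source of URLs to seed the queue.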
All Plans

Manual Mode
You specify exact URLs to crawl (comma-separated). The crawler only visits those specific pages. Best when you only want certain pages added to your knowledge base.
Starter+

Crawl Limits by Plan
The maximum number of pages you can crawl depends on your plan:
| Plan | Max Pages | Manual Mode | Password Protected |
|---|---|---|---|
| Free | 50 pages | — | — |
| Starter | 250 pages | ✓ | ✓ |
| Standard | 1,000 pages | ✓ | ✓ |
| Pro | 5,000 pages | ✓ | ✓ |
Password Protected Pages Starter+
Need to crawl pages behind a login? Enable the "Password Protected Pages" option to crawl members-only content, dashboards, or any password-protected areas of your website.
How to Use
- Enable the "Password Protected Pages" toggle on the crawl page
- Enter your Login Page URL (e.g., yoursite.com/login)
- Enter your Username/Email and Password
- Click Start Crawling - the system will log in first, then crawl protected pages
How It Works
When you enable password protection, the crawler:
- Visits your login page and detects the form fields automatically
- Submits your credentials (including any CSRF tokens)
- Maintains the authenticated session while crawling
- Starts from where you're redirected after login (e.g., your dashboard)
- Discovers and crawls all protected pages it can find
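The login sequence above can be sketched with Python's standard library. This is a hypothetical illustration of the general technique, not the product's implementation; the default field names `email` and `password` are assumptions:

```python
import urllib.request
from urllib.parse import urlencode
from http.cookiejar import CookieJar
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collect hidden <input> fields (e.g., CSRF tokens) from a login form."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden" and a.get("name"):
            self.fields[a["name"]] = a.get("value", "")

def login_and_get_start_url(login_url, username, password,
                            user_field="email", pass_field="password"):
    """Log in with a cookie-keeping session, return it plus the post-login URL."""
    cookies = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookies))  # persists the session
    html = opener.open(login_url).read().decode("utf-8", errors="replace")
    parser = HiddenFieldParser()
    parser.feed(html)                 # picks up CSRF tokens automatically
    form = dict(parser.fields)
    form[user_field] = username
    form[pass_field] = password
    resp = opener.open(login_url, data=urlencode(form).encode())  # POST login
    # geturl() reflects the redirect target (e.g., your dashboard) -
    # crawling of protected pages starts from there with the same opener
    return opener, resp.geturl()
```

Reusing the same opener for every subsequent request is what keeps the authenticated session alive during the crawl.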
Tip: The crawler automatically detects form fields (email, username, password) and security tokens, so it works with most login forms without additional configuration.
Advanced: Custom Field Names
If your login form uses non-standard field names, expand the "Advanced" section and specify:
- Username field name - The form field name for username/email (e.g., user_email)
- Password field name - The form field name for password (e.g., user_pass)
Limitations: Password protected crawling works with standard HTML login forms. It may not work with:
- JavaScript-based logins (React, Vue, Angular single-page apps)
- CAPTCHA or reCAPTCHA protected logins
- Two-factor authentication (2FA)
- OAuth logins (Google, Facebook, etc.)
- Multi-step login flows
Tip: Instead of using your personal account, consider creating a dedicated account specifically for crawling. This lets you control exactly what the crawler can access.
Best Practices
Before Crawling
- Make sure your website is accessible and pages load correctly
- Check that important pages are linked from your homepage or sitemap
- For password-protected crawls, verify your credentials work
Choosing Pages
- Start with your most important pages - product pages, FAQs, services
- Use Manual mode if you only need specific pages
- Avoid crawling pages with outdated or inaccurate information
After Crawling
- Review the crawled content in your knowledge base
- Remove any irrelevant pages that were captured
- Test your AI with questions about the crawled content
- Re-crawl periodically to keep content up to date
Note: Each new crawl replaces the previous one for that website. Your AI will always use the most recently crawled content.
Troubleshooting
Crawl returns fewer pages than expected
- Pages might not be linked from discoverable pages
- Some pages might be blocked by robots.txt
- Cloudflare or other security services might block the crawler
- Solution: Use Manual mode to specify exact URLs
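If you suspect robots.txt is blocking pages, you can check a URL yourself before crawling. A small sketch using Python's standard-library `urllib.robotparser` (the user agent string is an assumption; substitute your crawler's actual agent):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(page_url, user_agent="*"):
    """Return True if robots.txt permits fetching the given URL."""
    parts = urlparse(page_url)
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetches and parses the site's robots.txt
    return rp.can_fetch(user_agent, page_url)
```

Pages disallowed here will be skipped by well-behaved crawlers regardless of which crawl mode you choose.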
Password protected crawl fails
- Verify your credentials are correct
- Check if your login uses CAPTCHA or 2FA
- Try specifying custom field names in Advanced settings
- Your site might use JavaScript-based authentication (not supported)
Alternatives if crawling doesn't work:
- Temporarily make the pages public, crawl them, then re-enable protection
- Save the pages as HTML files and upload them via Bulk Upload
Content appears incomplete
- Some content might be loaded via JavaScript (not extracted)
- Content might be in images (not extracted as text)
- Solution: Add missing content manually via Text or PDF upload