Website Crawl

Automatically extract content from your website pages


Overview

The website crawler automatically discovers and extracts content from multiple pages on your website. Instead of adding pages one by one, you can crawl your entire site (or specific sections) and have all the content added to your AI's knowledge base automatically.

The crawler follows links to discover pages, extracts text content, and processes everything so your AI can answer questions about your website.
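Conceptually, the link-following step is a same-site walk: parse each fetched page for anchors, resolve them against the page's URL, and keep only links on the same host. A minimal sketch in Python (illustrative only; these class and function names are made up, not the product's actual code):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects same-site links from <a href="..."> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    url = urljoin(self.base_url, value)
                    # Stay on the same host, as a site crawler would
                    if urlparse(url).netloc == urlparse(self.base_url).netloc:
                        self.links.add(url.split("#")[0])

def discover_links(html, base_url):
    """Return the set of same-site URLs linked from one page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A real crawler would additionally fetch each discovered URL in turn, honor robots.txt, and stop at your plan's page limit.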

Crawl Modes

You can choose between two crawl modes depending on your needs:

Automatic Mode

The crawler starts from your homepage and automatically discovers pages by following links. It also checks your sitemap.xml if available. Best for crawling your entire website or large sections of it. Available on all plans.
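The sitemap check is possible because a standard sitemap.xml is just an XML list of `<loc>` entries. A rough sketch of reading one (assumes Python and the standard sitemaps.org namespace; the helper name is hypothetical, not the product's actual code):

```python
import xml.etree.ElementTree as ET

def sitemap_urls(xml_text):
    """Extract page URLs from a standard sitemap.xml document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    # Every page entry lives in a <url><loc>...</loc></url> pair
    return [loc.text.strip() for loc in root.findall(".//sm:loc", ns)]
```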

Manual Mode

You specify exact URLs to crawl (comma-separated). The crawler only visits those specific pages. Best when you only want certain pages added to your knowledge base.
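Since Manual mode input is a single comma-separated string, handling it amounts to splitting on commas, trimming whitespace, and dropping blanks and duplicates. A small sketch (a hypothetical helper, not the product's actual parser):

```python
def parse_manual_urls(raw):
    """Split a comma-separated URL list, trimming whitespace and
    dropping blanks and duplicates while preserving order."""
    seen, urls = set(), []
    for part in raw.split(","):
        url = part.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```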

Crawl Limits by Plan

The maximum number of pages you can crawl depends on your plan:

Plan       Max Pages
Free       50 pages
Starter    250 pages
Standard   1,000 pages
Pro        5,000 pages

Password Protected Pages

Need to crawl pages behind a login? Enable the "Password Protected Pages" option to crawl members-only content, dashboards, or any password-protected areas of your website.

How to Use

  1. Enable the "Password Protected Pages" toggle on the crawl page
  2. Enter your Login Page URL (e.g., yoursite.com/login)
  3. Enter your Username/Email and Password
  4. Click Start Crawling. The system logs in first, then crawls the protected pages

How It Works

When you enable password protection, the crawler:

  1. Visits your login page and detects the form fields automatically
  2. Submits your credentials (including any CSRF tokens)
  3. Maintains the authenticated session while crawling
  4. Starts from where you're redirected after login (e.g., your dashboard)
  5. Discovers and crawls all protected pages it can find

Tip: The crawler automatically detects form fields (email, username, password) and security tokens, so it works with most login forms without additional configuration.
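Steps 1 and 2 above amount to reading the login form's `<input>` fields, filling in the ones that look like credentials, and passing hidden fields (such as CSRF tokens) through untouched. A simplified sketch of that detection step (all names here are hypothetical; the real crawler's logic is more involved):

```python
from html.parser import HTMLParser

class FormFieldParser(HTMLParser):
    """Collects <input> names and their default values from a login page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            a = dict(attrs)
            name = a.get("name")
            if name:
                self.fields[name] = a.get("value") or ""

def build_login_payload(login_html, username, password):
    """Fill detected credential fields; keep hidden fields
    (e.g. CSRF tokens) exactly as the form provided them."""
    parser = FormFieldParser()
    parser.feed(login_html)
    payload = dict(parser.fields)
    for name in payload:
        lowered = name.lower()
        if "pass" in lowered:
            payload[name] = password
        elif "user" in lowered or "email" in lowered or "login" in lowered:
            payload[name] = username
    return payload
```

Submitting this payload and reusing the resulting session cookies for subsequent requests covers steps 2 and 3.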

Advanced: Custom Field Names

If your login form uses non-standard field names, expand the "Advanced" section and specify:

  • Username field name - The form field name for username/email (e.g., user_email)
  • Password field name - The form field name for password (e.g., user_pass)

Limitations: Password-protected crawling works with standard HTML login forms. It may not work with:

  • JavaScript-based logins (React, Vue, Angular single-page apps)
  • CAPTCHA or reCAPTCHA protected logins
  • Two-factor authentication (2FA)
  • OAuth logins (Google, Facebook, etc.)
  • Multi-step login flows

Tip: Instead of using your personal account, consider creating a dedicated account specifically for crawling. This lets you control exactly what the crawler can access.

Best Practices

Before Crawling

  • Make sure your website is accessible and pages load correctly
  • Check that important pages are linked from your homepage or sitemap
  • For password-protected crawls, verify your credentials work

Choosing Pages

  • Start with your most important pages - product pages, FAQs, services
  • Use Manual mode if you only need specific pages
  • Avoid crawling pages with outdated or inaccurate information

After Crawling

  • Review the crawled content in your knowledge base
  • Remove any irrelevant pages that were captured
  • Test your AI with questions about the crawled content
  • Re-crawl periodically to keep content up to date

Note: Each new crawl replaces the previous one for that website. Your AI will always use the most recently crawled content.

Troubleshooting

Crawl returns fewer pages than expected

  • Pages might not be linked from discoverable pages
  • Some pages might be blocked by robots.txt
  • Cloudflare or other security services might block the crawler
  • Solution: Use Manual mode to specify exact URLs
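If you suspect robots.txt is the cause, you can check which URLs a rule-honoring crawler would skip using Python's standard library (the helper name is illustrative; the rules themselves come from your site's own robots.txt):

```python
from urllib.robotparser import RobotFileParser

def blocked_paths(robots_txt, urls, agent="*"):
    """Return which of the given URLs a crawler honoring
    this robots.txt content would refuse to fetch."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [u for u in urls if not rp.can_fetch(agent, u)]
```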

Password protected crawl fails

  • Verify your credentials are correct
  • Check if your login uses CAPTCHA or 2FA
  • Try specifying custom field names in Advanced settings
  • Your site might use JavaScript-based authentication (not supported)

Alternatives if crawling doesn't work:

  • Temporarily make the pages public, crawl them, then re-enable protection
  • Save the pages as HTML files and upload them via Bulk Upload

Content appears incomplete

  • Some content might be loaded via JavaScript (not extracted)
  • Content might be in images (not extracted as text)
  • Solution: Add missing content manually via Text or PDF upload