Website Crawl

Automatically extract content from your website pages


Overview

The website crawler automatically discovers and extracts content from multiple pages on your website. Instead of adding pages one by one, you can crawl your entire site (or specific sections) and have all the content added to your AI's knowledge base automatically.

The crawler follows links to discover pages, extracts text content, and processes everything so your AI can answer questions about your website.
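Conceptually, the discovery step is a same-site link walk: fetch a page, collect its links, and keep the ones that stay on your domain. The sketch below illustrates just the link-collection step with Python's standard library; it is an assumption about how discovery works in general, not the product's actual implementation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect href targets from <a> tags, resolved against the page URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def same_site_links(html, base_url):
    """Return links that stay on the same host -- the candidates a crawler would follow."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    host = urlparse(base_url).netloc
    return [u for u in parser.links if urlparse(u).netloc == host]
```

For example, a page containing `<a href="/pricing">` and `<a href="https://other.com/x">` yields only `https://yoursite.com/pricing` when walked from `https://yoursite.com/`; off-site links are skipped.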

Crawl Modes

You can choose between two crawl modes depending on your needs:

Automatic Mode

The crawler starts from your homepage and automatically discovers pages by following links. It also checks your sitemap.xml if available. Best for crawling your entire website or large sections of it.
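A sitemap.xml is simply an XML list of page URLs, which is why it helps discovery: every listed page is found even if nothing links to it. As a rough illustration (standard sitemap format, standard library only; not the product's code), extracting those URLs looks like this:

```python
import xml.etree.ElementTree as ET

# Standard namespace used by sitemap.xml files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Extract the page URLs listed in a standard sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]
```

Each `<loc>` entry becomes one crawl candidate, so keeping your sitemap current directly improves what Automatic mode can find.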


Manual Mode

You specify exact URLs to crawl (comma-separated). The crawler only visits those specific pages. Best when you only want certain pages added to your knowledge base.
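A comma-separated URL field is typically split on commas, trimmed, and given a scheme if one is missing. The sketch below shows one plausible interpretation of that input format (the actual parsing happens server-side, and the HTTPS default is an assumption):

```python
from urllib.parse import urlparse

def parse_manual_urls(raw):
    """Split a comma-separated URL list and normalize entries that omit a scheme."""
    urls = []
    for part in raw.split(","):
        url = part.strip()
        if not url:
            continue  # ignore stray commas / blank entries
        if not urlparse(url).scheme:
            url = "https://" + url  # assumption: default to HTTPS when no scheme is given
        urls.append(url)
    return urls
```

So an input like `yoursite.com/about, https://yoursite.com/faq` becomes two crawl targets, with the schemeless entry promoted to `https://yoursite.com/about`.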

Crawl Limits by Plan

The maximum number of pages you can crawl depends on your plan:

Plan       Max Pages
Free       50 pages
Starter    250 pages
Standard   1,000 pages
Pro        5,000 pages

Password Protected Pages

Need to crawl pages behind a login? Enable the "Password Protected Pages" option to crawl members-only content, dashboards, or any password-protected areas of your website.

How to Use

  1. Enable the "Password Protected Pages" toggle on the crawl page
  2. Enter your Login Page URL (e.g., yoursite.com/login)
  3. Enter your Username/Email and Password
  4. Click Start Crawling; the system will log in first, then crawl the protected pages

How It Works

When you enable password protection, the crawler:

  1. Visits your login page and detects the form fields automatically
  2. Submits your credentials (including any CSRF tokens)
  3. Maintains the authenticated session while crawling
  4. Starts from where you're redirected after login (e.g., your dashboard)
  5. Discovers and crawls all protected pages it can find

Tip: The crawler automatically detects form fields (email, username, password) and security tokens, so it works with most login forms without additional configuration.
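The field-detection step (items 1–2 above) can be sketched as a scan of the login page's `<input>` elements: the `type="password"` field is the password, hidden fields (such as CSRF tokens) are captured as-is, and a text field whose name hints at "user", "email", or "login" is taken as the username. This is an illustrative heuristic, not the crawler's actual code, and the field names in the example (`user_email`, `_token`) are hypothetical:

```python
from html.parser import HTMLParser

class LoginFormScanner(HTMLParser):
    """Classify <input> fields on a login page: username, password, and hidden tokens."""
    def __init__(self):
        super().__init__()
        self.fields = {"username": None, "password": None, "hidden": {}}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        a = dict(attrs)
        name, itype = a.get("name", ""), a.get("type", "text")
        if itype == "password":
            self.fields["password"] = name
        elif itype == "hidden":
            # Hidden fields (e.g., a CSRF token) must be submitted unchanged.
            self.fields["hidden"][name] = a.get("value", "")
        elif any(hint in name.lower() for hint in ("user", "email", "login")):
            self.fields["username"] = name

def detect_login_fields(html):
    scanner = LoginFormScanner()
    scanner.feed(html)
    return scanner.fields
```

A crawler would then POST the detected fields (hidden tokens included) together with your credentials, and reuse the resulting session cookie for every subsequent request.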

Advanced: Custom Field Names

If your login form uses non-standard field names, expand the "Advanced" section and specify:

  • Username field name - The form field name for username/email (e.g., user_email)
  • Password field name - The form field name for password (e.g., user_pass)

Limitations: Password-protected crawling works with standard HTML login forms. It may not work with:

  • JavaScript-based logins (React, Vue, Angular single-page apps)
  • CAPTCHA or reCAPTCHA protected logins
  • Two-factor authentication (2FA)
  • OAuth logins (Google, Facebook, etc.)
  • Multi-step login flows

Tip: Instead of using your personal account, consider creating a dedicated account specifically for crawling. This lets you control exactly what the crawler can access.

Best Practices

Before Crawling

  • Make sure your website is accessible and pages load correctly
  • Check that important pages are linked from your homepage or sitemap
  • For password-protected crawls, verify your credentials work

Choosing Pages

  • Start with your most important pages: product pages, FAQs, and services
  • Use Manual mode if you only need specific pages
  • Avoid crawling pages with outdated or inaccurate information

After Crawling

  • Review the crawled content in your knowledge base
  • Remove any irrelevant pages that were captured
  • Test your AI with questions about the crawled content
  • Re-crawl periodically to keep content up to date

Note: Each new crawl replaces the previous one for that website. Your AI will always use the most recently crawled content.

Managing Crawled Pages

After a crawl completes, you can preview and manage individual pages from the Knowledge Base section on your Dashboard.

Previewing Page Content

  1. Go to your Dashboard and open the Knowledge Base section
  2. Click on a crawl item to open it — you'll see a list of all crawled pages
  3. Click any page title to preview its extracted content
  4. Use the Back to pages button to return to the page list

Tip: Previewing pages is a great way to verify the crawler extracted the right content. If a page looks wrong, you can edit it directly or delete it and add the content manually instead.

Editing Individual Pages

You can edit the extracted content of any crawled page. This is useful for fixing formatting issues, removing irrelevant sections, or adding missing information.

  1. Open the crawl item and click a page title to view its content
  2. Click the Edit button at the top of the preview
  3. Modify the title or content as needed
  4. Click Save & Re-embed — the page's AI embeddings will be regenerated with the updated content

Note: Editing a page only re-embeds that specific page, not the entire crawl. Your other crawled pages are unaffected.

Re-crawling Individual Pages

If a page on your website has been updated, you can re-crawl just that page without re-crawling your entire website.

  1. Open the crawl item from your Knowledge Base
  2. Click the re-crawl button next to the page you want to update
  3. Confirm — the page will be re-fetched and its embeddings updated with the latest content

Tip: This is great for keeping individual pages up to date after content changes, without having to re-crawl hundreds of pages.

Deleting Individual Pages

You can remove specific pages from a crawl without deleting the entire crawl. This is useful for removing irrelevant, duplicate, or incorrectly crawled pages.

  1. Open the crawl item from your Knowledge Base
  2. Click the delete button next to the page you want to remove
  3. Confirm the deletion — the page and its embeddings will be permanently removed

Note: If you delete all pages from a crawl, the entire crawl entry will be automatically removed from your knowledge base.

Troubleshooting

Crawl returns fewer pages than expected

  • Pages might not be linked from discoverable pages
  • Some pages might be blocked by robots.txt
  • Cloudflare or other security services might block the crawler
  • Solution: Use Manual mode to specify exact URLs
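To check whether robots.txt is the culprit, you can test a URL against your own robots rules locally with Python's standard library. (The crawler's actual user-agent string is not documented here, so the one below is a placeholder.)

```python
from urllib.robotparser import RobotFileParser

def is_blocked(robots_txt, user_agent, url):
    """Check whether robots.txt rules would block a URL for a given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)
```

For example, with rules `User-agent: *` / `Disallow: /private/`, any URL under `/private/` is blocked for all crawlers, while the rest of the site remains fetchable.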

Password-protected crawl fails

  • Verify your credentials are correct
  • Check if your login uses CAPTCHA or 2FA
  • Try specifying custom field names in Advanced settings
  • Your site might use JavaScript-based authentication (not supported)

Alternatives if crawling doesn't work:

  • Temporarily make the pages public, crawl them, then re-enable protection
  • Save the pages as HTML files and upload them via Bulk Upload

Content appears incomplete

  • Some content might be loaded via JavaScript (not extracted)
  • Content might be in images (not extracted as text)
  • Solution: Add missing content manually via Text or PDF upload