Tutorial · 8 min read

Bulk-Train Your Chatbot With a Sitemap

Train your Alee chatbot on a whole site at once using a sitemap.xml or many URLs. Choose pages, manage large sites, keep the brain fresh.

Adding pages one URL at a time is fine when you have five of them. When your site has fifty, two hundred, or two thousand pages, you want Alee to swallow the whole thing in one go. That is exactly what sitemap and bulk-URL training is for: you point Alee at your sitemap.xml (or paste a list of URLs), it crawls every page, chunks the text, and builds your knowledge brain so the bot can answer from your entire site.

This guide walks through the whole workflow, every option you get along the way, and the practical decisions for big sites.

What "bulk training" actually does

When you add a sitemap or a list of URLs, Alee does the same thing it does for a single page, just at scale. For each URL it:

  1. Fetches the page and pulls out the readable text (it ignores nav menus, footers, and scripts).
  2. Splits that text into chunks.
  3. Turns each chunk into a vector embedding and stores it in your bot's pgvector index, the "knowledge brain".
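To make step 2 concrete, here is a minimal sketch of word-based chunking with overlap. The function name `chunk_text`, the chunk size, and the overlap are illustrative assumptions; the tutorial does not say how Alee actually splits text.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word windows (sizes are assumed, not Alee's)."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far each new window advances
    chunks: list[str] = []

    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, which tends to help retrieval.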

Later, when a visitor asks a question, Alee embeds the question, finds the closest chunks across everything you ingested, and an LLM writes a grounded answer with sources. If the answer is not in your content, the bot says it does not know rather than making something up. This is the Advanced RAG method, and it works the same whether the brain came from one page or two thousand.
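The retrieval step can be sketched in a few lines. This is a toy illustration under stated assumptions, not Alee's code: the vectors are hand-written stand-ins for real embeddings, and `top_k_chunks` and the `min_score` cutoff are hypothetical names and values. In production the similarity search would run inside the vector index rather than in a Python loop.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(question_vec, indexed_chunks, k=3, min_score=0.25):
    """Return up to k (score, text) pairs, best first.

    indexed_chunks is a list of (text, vector) pairs. An empty result is the
    signal to answer "I don't know" instead of guessing (assumed threshold).
    """
    scored = [(cosine_similarity(question_vec, vec), text)
              for text, vec in indexed_chunks]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The chunks that come back are what the LLM is given as context, which is why the answer can cite its sources.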

The takeaway: a sitemap import is not a different feature. It is the fastest way to fill the brain.

Before you start: find your sitemap

Most sites already publish a sitemap. Common locations:

  • https://yoursite.com/sitemap.xml
  • https://yoursite.com/sitemap_index.xml (a sitemap that links to other sitemaps)
  • https://yoursite.com/robots.txt often lists the sitemap URL at the bottom
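If you would rather check robots.txt programmatically, the sketch below pulls out every `Sitemap:` line from the file's text. The helper name `sitemaps_from_robots` is hypothetical, and fetching the file is left to you.

```python
def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs listed in 'Sitemap:' lines of a robots.txt body."""
    found: list[str] = []
    for line in robots_txt.splitlines():
        # Drop trailing comments (assumes the sitemap URL has no '#' fragment).
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        # The field name is matched case-insensitively.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found
```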

Platform notes:

  • WordPress with Yoast or Rank Math: usually sitemap_index.xml, which fans out into post-sitemap.xml, page-sitemap.xml, and so on.
  • Shopify: /sitemap.xml is generated automatically and splits into products, collections, pages, and blogs.
  • Wix, Squarespace, Webflow, Framer, Ghost: all auto-generate a sitemap at /sitemap.xml.

If your site has no sitemap, do not worry. You can still bulk-import by pasting a list of URLs, or point Alee at a single hub page (like your docs index) and add the pages it links to.
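As a rough picture of the hub-page approach, the sketch below collects the same-site links from a page's HTML so you can paste them as a URL list. It uses only Python's standard library; `LinkCollector` and `links_from_hub` are hypothetical names, and this is not a description of how Alee discovers links internally.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects unique same-host links from <a href="..."> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        same_host = urlparse(absolute).netloc == urlparse(self.base_url).netloc
        if same_host and absolute not in self.links:
            self.links.append(absolute)


def links_from_hub(html: str, base_url: str) -> list[str]:
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```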

Step-by-step: train from a sitemap

  1. Open the bot you want to train and go to its Sources area (the place where you add knowledge). Click to add a new source.
  2. Choose the option for a sitemap or many pages rather than a single URL.
  3. Paste your sitemap URL, for example https://yoursite.com/sitemap.xml. If you have a sitemap index, paste that instead; Alee follows the child sitemaps.
  4. Alee fetches the sitemap and shows you the list of discovered URLs. This is your chance to review before anything is ingested.
  5. Select which pages to include. Tick the ones you want, or select all. Deselect anything you would not want the bot quoting (more on that below).
  6. Confirm and start the crawl. Alee works through the URLs, crawling and embedding each one.
  7. Watch the status as pages move from queued to done. Large jobs run in the background, so you can close the tab and come back.

Once the run finishes, your brain is live. Open the bot's preview and ask a question that should be answered by a deep page to confirm the new content is retrievable.
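For a sense of what "following the child sitemaps" involves, here is a minimal sketch that expands a sitemap or sitemap index into a flat list of page URLs. It assumes the standard sitemaps.org XML format. The `fetch` parameter is a hypothetical callable you supply (it takes a URL and returns the XML text), and `discover_urls` is an illustrative name, not an Alee API.

```python
import xml.etree.ElementTree as ET
from typing import Callable

# Namespace used by standard sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def discover_urls(sitemap_url: str,
                  fetch: Callable[[str], str],
                  max_depth: int = 3) -> list[str]:
    """Return page URLs from a sitemap, recursing into a sitemap index."""
    root = ET.fromstring(fetch(sitemap_url))
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]

    # A plain <urlset> lists pages directly.
    if root.tag != SITEMAP_NS + "sitemapindex":
        return locs

    # A <sitemapindex> lists child sitemaps; cap the depth to avoid loops.
    if max_depth <= 0:
        return []
    pages: list[str] = []
    for child_url in locs:
        pages.extend(discover_urls(child_url, fetch, max_depth - 1))
    return pages
```

Passing `fetch` in as a parameter keeps the parsing logic testable without any network access.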

Training from a list of URLs instead

No sitemap, or you only want specific pages? Use the bulk-URL option:

  1. In the same add-source flow, choose to add multiple URLs.
  2. Paste your URLs, one per line.
  3. Review the list and start the crawl.
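Before pasting a long list, it can help to tidy it first. The sketch below strips whitespace, drops blank or malformed lines, and removes duplicates while keeping the original order. `clean_url_list` is a hypothetical helper; whether Alee does similar cleanup on its side is not stated in this guide.

```python
from urllib.parse import urlparse


def clean_url_list(pasted: str) -> list[str]:
    """Return unique http(s) URLs from a one-per-line block of text."""
    seen: set[str] = set()
    urls: list[str] = []
    for line in pasted.splitlines():
        url = line.strip()
        if not url:
            continue
        parts = urlparse(url)
        # Keep only absolute web URLs.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```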

This is the cleanest method when you want tight control, for example ingesting only your 30 best help-desk articles instead of the entire blog archive.

Choosing which pages to ingest

More pages is not automatically better. The brain answers best when it is full of useful, answerable content and not padded with noise. When you review the discovered URL list, include:

  • Product and pricing pages
  • Service and feature descriptions
  • FAQ and help/support articles
  • Documentation and how-to guides
  • Important blog posts that explain your offering
  • About, policies, shipping, returns, and contact info

Deselect or skip:

  • Tag, category, and pagination pages (/tag/, /page/2/): they are mostly lists of links with little real text.
  • Author archives and date archives.
  • Cart, checkout, login, and account pages.
  • Thank-you and confirmation pages.
  • Thin landing pages that are pure design with almost no copy.
  • Anything outdated or contradictory. Conflicting pages make the bot's answers wobble.

A good rule: if a human could not answer a customer question by reading that page, the bot cannot either. Leave it out.
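If you are preparing a URL list yourself, the skip list above translates naturally into a path filter. The patterns below are assumptions based on common WordPress-style and store-style paths. Adjust them to your own URL scheme, and treat `worth_ingesting` as an illustrative name.

```python
import re
from urllib.parse import urlparse

# Path patterns that usually mark low-value pages (assumed conventions).
SKIP_PATTERNS = [
    r"/tag/",
    r"/category/",
    r"/author/",
    r"/page/\d+/?$",                           # pagination
    r"^/\d{4}/\d{2}/?$",                       # date archives like /2024/05/
    r"/(cart|checkout|login|account)(/|$)",
    r"/thank-you/?$",
]
_SKIP_RE = re.compile("|".join(SKIP_PATTERNS))


def worth_ingesting(url: str) -> bool:
    """True when the URL's path matches none of the skip patterns."""
    return _SKIP_RE.search(urlparse(url).path) is None
```

A filter like this only catches the structural noise. Outdated or contradictory pages still need a human review.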

Managing large sites

Big sites need a little strategy so the brain stays sharp and your plan's limits are respected.

Start with a focused slice

You rarely need all 2,000 pages on day one. Import the sections that drive the most questions first, usually pricing, FAQ, docs, and top products. Test the bot, see what it answers well, then add more sections in later passes. The brain grows incrementally; you can add sources any time.

Use the right sitemap, not the index

If your sitemap index links to ten child sitemaps, you can often add just the ones you care about, for example page-sitemap.xml and product-sitemap.xml, and skip post-sitemap.xml if your blog is off-topic for support. This keeps the brain lean.
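To see which child sitemaps an index contains and keep only the ones you want, something like the following sketch works. It assumes the standard sitemaps.org format; `pick_child_sitemaps` is a hypothetical helper, and the file names follow the Yoast-style examples above.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def pick_child_sitemaps(index_xml: str, wanted: tuple[str, ...]) -> list[str]:
    """Return child sitemap URLs whose file name contains a wanted name."""
    root = ET.fromstring(index_xml)
    children = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]
    return [
        url for url in children
        # Compare against the last path segment only, e.g. "page-sitemap.xml".
        if any(name in url.rsplit("/", 1)[-1] for name in wanted)
    ]
```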

Mind your message and bot limits

Training pages does not cost you monthly messages; those are spent only when visitors chat. But the number of bots and your overall capacity depend on your plan. The Free plan gives you 1 bot and 200 messages a month, which is plenty to test a sitemap import. Pro adds a second bot, and Agency and Scale are built for running many bots across clients or sites. See pricing for the full breakdown. If you are weighing this against another tool, the Alee vs SiteGPT comparison covers how the crawling and training differ.

Keep the brain fresh

Content changes. When you update pages, re-crawl so the brain reflects reality. Re-running a source pulls the latest text. For sites that change often, plan a periodic re-crawl, for example monthly for a fast-moving help center, quarterly for a stable marketing site. After a re-crawl, spot-check a question whose answer recently changed.
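One way to decide what to re-crawl is to compare each page's `<lastmod>` date in the sitemap against the date of your last crawl. The sketch below assumes your sitemap publishes `<lastmod>` values, which is optional in the format and not always accurate. `changed_since` is a hypothetical helper, not an Alee feature.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def changed_since(urlset_xml: str, last_crawl: date) -> list[str]:
    """Return page URLs modified after last_crawl, or with no usable date."""
    root = ET.fromstring(urlset_xml)
    stale: list[str] = []
    for url_el in root.iter(SITEMAP_NS + "url"):
        loc = url_el.findtext(SITEMAP_NS + "loc", default="").strip()
        lastmod = url_el.findtext(SITEMAP_NS + "lastmod", default="").strip()
        if not loc:
            continue
        try:
            # lastmod may carry a time part; the first 10 chars are YYYY-MM-DD.
            modified = date.fromisoformat(lastmod[:10])
        except ValueError:
            # Missing or partial date: re-crawl to be safe.
            stale.append(loc)
            continue
        if modified > last_crawl:
            stale.append(loc)
    return stale
```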

A worked example: a Shopify store

Say you run a Shopify store selling fitness gear, and you want the bot to answer product, shipping, and returns questions.

  1. You grab https://store.com/sitemap.xml, which is a sitemap index pointing at sitemap_products_1.xml, sitemap_collections_1.xml, sitemap_pages_1.xml, and sitemap_blogs_1.xml.
  2. You add the index to your bot's Sources. Alee expands it and shows every URL.
  3. You include the products, collections, and pages sitemaps (these hold your catalog, your shipping policy, returns, and FAQ). You deselect the blog for now since those posts are general fitness tips, not store policy.
  4. You start the crawl. A few hundred product pages embed in the background.
  5. You test: "Do you ship to Bengaluru and what is the return window?" The bot pulls the shipping and returns pages and answers with a source link. If it cannot find a clear returns window, that tells you the returns page is thin, so you paste a clean FAQ as a raw-text source to fill the gap.

That last move is worth remembering: when a crawl leaves a gap, the fastest fix is often to paste a short, clean FAQ directly rather than re-crawling a weak page.

Polishing after the bulk import

Once the heavy lifting is done, tighten things up:

  • Check Top Questions in analytics after a few days. If real visitors ask things the bot fumbles, that page either was not ingested or is too thin. Add or rewrite the source.
  • Use the question triage inbox to mark important questions and teach better answers where the crawl fell short.
  • Mix in other source types. Sitemaps are great for web pages, but you can also add PDFs, YouTube videos (Alee uses the transcript), and pasted FAQ text to round out the brain. See features for everything you can train on, and more guides for source-specific walkthroughs.

A sitemap gets you 90 percent of the way in one click. The polish is what makes the bot feel like it actually knows your business.

Frequently asked questions

How many pages can I train one bot on from a sitemap?

There is no hard per-page wall on training itself; the brain is designed to grow as you add sources. The practical limits to watch are your plan's number of bots and your monthly message allowance, which is spent when visitors chat, not when you crawl. Start with a focused slice of your biggest site and expand in passes.

Will Alee re-crawl my pages automatically when content changes?

Re-crawling is something you trigger when you want the latest text pulled in. For sites that change often, set yourself a reminder to re-run the source on a schedule: monthly for a busy help center, quarterly for a stable site. After each re-crawl, spot-check a question whose answer recently changed.

What if my site does not have a sitemap.xml?

You can still bulk-train. Use the multiple-URLs option and paste your page links one per line, or point Alee at a hub page like your docs index and add the pages it links to. You can also fill gaps with pasted FAQ text, PDFs, and YouTube transcripts.

Ready to feed your whole site to a bot in one click? [Start free](/signup) and bulk-train your first Alee chatbot from a sitemap today.
