Sitemap Checker: Find and Fix Every XML Error

A practical guide to using a sitemap checker to validate XML sitemaps, find broken URLs, and fix indexing errors that hold your rankings back.

Your sitemap is supposed to act like a table of contents for search engines: a clean, authoritative list of every URL you want crawled and indexed. But many sitemaps carry at least a few broken links, blocked pages, or duplicate entries that quietly undermine your SEO. A sitemap checker finds those problems before Google does.

This guide covers what the tool actually examines, how to read the results, common errors and their fixes, and how your sitemap connects to the broader job of keeping your site indexable.

Key takeaways

A sitemap checker validates structure, HTTP status codes, and crawlability — it goes well beyond confirming the file loads.
The most common errors are wrong content-type headers, 301 redirects listed as canonical URLs, and pages blocked by robots.txt.
Google Search Console's Sitemap report is the ground-truth tool; third-party checkers are faster for ad-hoc debugging.
Sitemaps matter most for large sites, new sites with few inbound links, and sites with frequently updated content.
If your site runs a chatbot trained on your content, your sitemap is also the primary source input — broken sitemaps mean an incomplete knowledge base.

What a sitemap checker actually does

A sitemap is an XML file that lists the URLs you want search engines to crawl and index. A sitemap checker is a tool — online or command-line — that fetches that file, parses it, and reports problems that would prevent Google, Bing, or any other crawler from processing it correctly.

The check is more nuanced than "does the file exist." A good validator covers:

XML structure — is the <urlset> or <sitemapindex> root element correct? Are all tags properly closed? Is the namespace declaration present (xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")?
HTTP response of the sitemap file itself — does it return 200 OK? A 301 or 404 on the sitemap URL is an immediate problem.
Content-Type header — should be application/xml or text/xml, not text/html.
Each listed URL — does every URL in the file resolve with a 200? Are any returning 301, 302, 404, or 5xx?
`robots.txt` conflict — are any listed URLs disallowed by your robots.txt? (A page that's both in the sitemap and blocked by robots is a common mistake.)
Canonical tags — does each page's <link rel="canonical"> point back to the URL listed in the sitemap, or to a different URL?
`noindex` meta tag — a page listed in a sitemap that also has <meta name="robots" content="noindex"> sends conflicting signals.
`<lastmod>` accuracy — stale or absent lastmod dates reduce the usefulness of your sitemap for crawl prioritization.
URL count — a single sitemap file can contain at most 50,000 URLs and must not exceed 50 MB uncompressed. Larger sites use a sitemap index.
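The structural checks at the top and bottom of that list (root element, namespace, URL count) are easy to sketch with Python's standard library. This is a minimal illustration rather than a full validator; the function name `check_structure` is made up for this example, and it assumes Python 3.9 or newer.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file limit from the sitemap protocol


def check_structure(xml_text: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    # ElementTree exposes namespaced tags as "{namespace}localname".
    if root.tag.startswith("{"):
        namespace, _, local = root.tag[1:].partition("}")
    else:
        namespace, local = "", root.tag

    if local not in ("urlset", "sitemapindex"):
        problems.append(f"unexpected root element <{local}>")
    if namespace != SITEMAP_NS:
        problems.append(f"missing or wrong namespace: {namespace!r}")

    locs = root.findall(f".//{{{SITEMAP_NS}}}loc")
    if len(locs) > MAX_URLS:
        problems.append(f"{len(locs)} entries exceeds the {MAX_URLS} limit")
    return problems
```

It only covers structure. Status codes, robots.txt conflicts, and canonical tags need the page-level checks described in the rest of this guide.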

Why sitemap errors kill indexing

Search engines don't guarantee indexing even for a perfect sitemap. But a broken one actively slows things down.

When Googlebot fetches a sitemap full of 301 redirects, it has to follow each redirect before determining the final URL. On large sites that uses crawl budget it could spend discovering new content. When pages are listed in the sitemap but blocked in robots.txt, Google sees a contradictory signal and tends to follow the robots.txt instruction, so the page is not crawled. When lastmod dates are all set to today's date regardless of actual change (a common CMS misconfiguration), Googlebot learns to ignore them, and you lose the prioritization benefit entirely.

New sites feel this most acutely. A new domain with few backlinks relies heavily on its sitemap to get pages discovered at all. If the sitemap is malformed or empty, discovery slows to whatever random crawl path the bot takes.

How to run a sitemap check: step by step

Running a thorough sitemap check doesn't require expensive tools. Here's a practical workflow.

Step 1 — locate your sitemap

Most sites follow predictable conventions:

| Platform | Default sitemap URL |
|---|---|
| WordPress (Yoast) | /sitemap_index.xml |
| WordPress (Rank Math) | /sitemap_index.xml |
| Shopify | /sitemap.xml |
| Webflow | /sitemap.xml |
| Ghost | /sitemap.xml |
| Next.js (next-sitemap) | /sitemap.xml or /server-sitemap.xml |
| Custom / manual | Check /robots.txt for Sitemap: directive |

Your robots.txt file (at yourdomain.com/robots.txt) should contain a Sitemap: line pointing to the exact URL. If it doesn't, add one — Googlebot reads robots.txt before crawling and uses that line to find your sitemap.
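If you'd rather check for the Sitemap: line programmatically, Python's built-in `urllib.robotparser` can extract it. The sketch below works on robots.txt text you've already downloaded; the helper name `sitemaps_from_robots` is an assumption for this example, and `site_maps()` requires Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return every URL declared on a Sitemap: line in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []
```

An empty result means the file has no Sitemap: line and you should add one.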

Step 2 — run a quick HTTP check

Before any XML parsing, verify the sitemap file itself responds correctly:

```
curl -I https://yourdomain.com/sitemap.xml
```

You want HTTP/2 200 and Content-Type: application/xml (or text/xml). If you get a 301 or 302, fix the redirect so the sitemap URL itself is canonical. If you get 404, the file is missing or the URL path is wrong.
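The same decision rules can be written down as a small function, which is handy if you want to script the check across many sites. This is a sketch under simple assumptions: you pass in the status code and Content-Type value you read from the `curl -I` output, and the verdict strings are illustrative.

```python
def classify_sitemap_response(status: int, content_type: str) -> str:
    """Map a status code and Content-Type header value to a short verdict."""
    if status in (301, 302, 307, 308):
        return "redirect: reference the final sitemap URL instead"
    if status == 404:
        return "missing: file absent or path wrong"
    if status != 200:
        return f"unexpected status {status}"
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type not in ("application/xml", "text/xml"):
        return f"wrong content type: {media_type or 'none'}"
    return "ok"
```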

Step 3 — validate XML structure

Paste the sitemap URL into a dedicated validation tool, such as Google Search Console, Screaming Frog, an online XML sitemap validator, or the site audit in Ahrefs or Semrush. Depending on the tool, it will:

Fetch the XML and check it parses without errors.
Count the URLs listed.
Probe a sample (or all) of the listed URLs for HTTP status.
Flag robots.txt conflicts.
Report canonical mismatches.
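To make the "probe listed URLs" step concrete, here is a rough standard-library sketch. `head_status` sends a real HEAD request without following redirects, so a 301 shows up as 301 rather than as the final page's 200. `probe` accepts any replacement fetcher, which makes it easy to try offline. Both names are invented for this example, network errors such as timeouts are not handled, and some servers answer HEAD differently from GET.

```python
import urllib.error
import urllib.request
from typing import Callable


class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for 3xx instead of following.
        return None


def head_status(url: str) -> int:
    """HEAD request that reports the raw status code of the URL itself."""
    opener = urllib.request.build_opener(NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code


def probe(urls: list[str], fetch: Callable[[str], int] = head_status) -> dict[str, int]:
    """Return only the URLs whose status is not 200, mapped to that status."""
    statuses = {url: fetch(url) for url in urls}
    return {url: code for url, code in statuses.items() if code != 200}
```

For a large sitemap you would want to sample the URL list and pause between requests rather than hit every page at once.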

Step 4 — cross-reference with Google Search Console

This is the most important step. Open Search Console → Indexing → Sitemaps, submit your sitemap URL if not already present, and read the report. GSC shows:

Status — "Success" means it was fetched and parsed; "Couldn't fetch" means a network or auth error.
Discovered URLs — how many URLs GSC found in the file.
Indexed — how many of those are actually in the index. A large gap here is a signal to investigate.

Don't confuse "submitted" with "indexed." GSC may discover 800 URLs from your sitemap but only index 620. The rest may be duplicates, thin content, or pages with canonical or noindex issues.
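The gap in that example is simple arithmetic, shown here as a tiny helper you could drop into a reporting script. The function name is hypothetical; the numbers are the ones from the paragraph above.

```python
def index_coverage(discovered: int, indexed: int) -> tuple[int, float]:
    """Return (number of URLs not indexed, share of discovered URLs indexed)."""
    gap = discovered - indexed
    rate = indexed / discovered if discovered else 0.0
    return gap, rate


gap, rate = index_coverage(800, 620)
print(f"{gap} URLs not indexed, coverage {rate:.1%}")
# prints: 180 URLs not indexed, coverage 77.5%
```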

Step 5 — audit individual URL errors

If GSC reports a subset of submitted URLs as having issues, dig into Indexing → Pages and filter by reason. Common reasons:

Crawled, currently not indexed — Google visited but chose not to index (content quality or duplication concern).
Alternate page with proper canonical tag — there's a preferred canonical version; the sitemap URL is not it.
Blocked by robots.txt — fix your robots.txt or remove from sitemap.
Not found (404) — the URL was deleted or moved; update your sitemap.
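Two of the causes behind these reasons, a noindex tag and a canonical pointing elsewhere, can be spotted in a page's HTML before Google reports them. The sketch below uses Python's built-in `html.parser` on HTML you've already fetched. The class and function names are assumptions for illustration, it ignores the X-Robots-Tag HTTP header, and it compares URLs as exact strings, so a trailing-slash difference counts as a mismatch.

```python
from html.parser import HTMLParser


class IndexSignals(HTMLParser):
    """Collects the canonical URL and any robots noindex directive from a page."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical = None
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attributes = {name: (value or "") for name, value in attrs}
        if tag == "link" and "canonical" in attributes.get("rel", "").lower().split():
            self.canonical = attributes.get("href")
        if tag == "meta" and attributes.get("name", "").lower() == "robots":
            if "noindex" in attributes.get("content", "").lower():
                self.noindex = True


def conflicts(sitemap_url: str, html: str) -> list[str]:
    """List the signals on the page that contradict its presence in the sitemap."""
    signals = IndexSignals()
    signals.feed(html)
    found = []
    if signals.noindex:
        found.append("noindex")
    if signals.canonical and signals.canonical != sitemap_url:
        found.append(f"canonical points to {signals.canonical}")
    return found
```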

Common sitemap errors and how to fix them

Redirected URLs in the sitemap

Listing http:// URLs when your site is https://, or listing pre-redirect URLs, is one of the most widespread sitemap mistakes. Each redirect costs crawl budget and dilutes signals.

Fix: Update the sitemap to list the final, canonical HTTPS URL. In WordPress with Yoast or Rank Math, regenerating the sitemap after switching to HTTPS usually handles this automatically. Double-check by running your sitemap through a checker that resolves redirects.
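A quick way to catch leftover http:// entries is to scan the `<loc>` values yourself. This is a minimal sketch that only inspects the URL scheme; it does not detect other pre-redirect URLs (for that you still need to request each page), and the function name is made up for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def insecure_locs(xml_text: str) -> list[str]:
    """Return every <loc> value whose scheme is not https."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.findall(".//sm:loc", NS) if el.text]
    return [url for url in locs if urlsplit(url).scheme != "https"]
```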

Pages blocked by robots.txt

A robots.txt rule such as Disallow: /blog/ under User-agent: * contradicts a sitemap that lists all blog URLs. Google honors the robots.txt block and does not crawl those pages, and Search Console may report the listed URLs as blocked by robots.txt.

Fix: Either remove the block in robots.txt (if those pages should be indexed) or remove the pages from the sitemap (if they should stay blocked). Never list a URL in your sitemap that you don't want crawled.
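You can find these contradictions yourself by running the sitemap's URLs through Python's built-in robots.txt parser. Treat this as an approximation: `urllib.robotparser` uses plain prefix matching and does not implement Google's `*` and `$` wildcard rules, so results can differ for robots.txt files that rely on them. The function name and the default user agent are assumptions for this example.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls: list[str], agent: str = "Googlebot") -> list[str]:
    """Return the URLs that the given robots.txt text disallows for the agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]
```

Every URL it returns needs one of the two fixes above: unblock it or drop it from the sitemap.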

Wrong or missing namespace

The root <urlset> element must include the correct namespace:

```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
```

Missing or mistyped namespace causes some parsers to reject the entire file. Any decent XML validator will flag this immediately.

Sitemap too large

Exceeding 50,000 URLs or 50 MB uncompressed puts the file outside the sitemap protocol's limits, and search engines may reject it or ignore entries past the limit. The fix is a sitemap index file that references multiple child sitemaps:

```xml
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yourdomain.com/sitemap-posts.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://yourdomain.com/sitemap-products.xml</loc>
  </sitemap>
</sitemapindex>
```
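If you generate sitemaps yourself, the split can be automated. The following sketch chunks a URL list at the 50,000-entry limit and renders an index file for the resulting child sitemaps. The helper names and the child sitemap URLs are hypothetical, and it enforces only the URL count, not the 50 MB size limit.

```python
from xml.sax.saxutils import escape

MAX_URLS = 50_000  # per-file URL limit


def chunk(urls: list[str], size: int = MAX_URLS) -> list[list[str]]:
    """Split a URL list into consecutive groups of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def sitemap_index(child_urls: list[str]) -> str:
    """Render a sitemap index that references each child sitemap URL."""
    entries = "\n".join(
        f"  <sitemap>\n    <loc>{escape(url)}</loc>\n  </sitemap>" for url in child_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</sitemapindex>\n"
    )
```

`escape` matters here: an ampersand in a URL must be written as `&amp;` inside `<loc>`, or the file stops being well-formed XML.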

Stale `<lastmod>` dates

Some CMSs set <lastmod> to the current date on every regeneration, even for pages that haven't changed. Googlebot learns from your site over time — if lastmod is always "today," it stops trusting the field.

Fix: Set <lastmod> only when actual page content changes. In WordPress, Yoast and Rank Math both have settings to control this. If you can't control it, omitting <lastmod> entirely is better than lying about it.
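One way to spot the "always today" pattern is to look at how lastmod dates are distributed across the file. This sketch reports how many entries lack a date and what share of the dated entries carry the single most common date. The function name and report keys are invented for this example, and what counts as suspicious is your call; a share near 1.0 on a site that clearly hasn't changed every page at once is a reasonable warning sign.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def lastmod_report(xml_text: str) -> dict:
    """Summarize how <lastmod> dates are distributed across a urlset."""
    root = ET.fromstring(xml_text)
    entries = root.findall("sm:url", NS)
    # Compare on the date part only, so timestamps seconds apart still match.
    dates = [
        el.text.strip()[:10]
        for el in (entry.find("sm:lastmod", NS) for entry in entries)
        if el is not None and el.text
    ]
    most_common = Counter(dates).most_common(1)
    return {
        "urls": len(entries),
        "missing": len(entries) - len(dates),
        "top_date": most_common[0][0] if most_common else None,
        "top_share": most_common[0][1] / len(dates) if dates else 0.0,
    }
```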

Choosing the right sitemap checker tool

There are four categories of validation tool, each with different trade-offs.

Google Search Console (free, authoritative)

The only tool that shows you what Google actually sees and has indexed. Use it for: confirming your sitemap was fetched, seeing the indexed/submitted gap, and diagnosing specific URL-level errors.

Limitation: Slow. It can take days for GSC data to reflect recent changes. Not suitable for pre-launch validation or rapid iteration.

Online XML validators (free, fast for small sites)

Tools like XML-Sitemaps.com's validator, W3C's XML validator, or dedicated sitemap linting tools will immediately parse your XML and report structural errors. Most are free and require no account.

Limitation: They typically don't check each URL's HTTP status, don't check robots.txt conflicts, and have URL-count limits (usually 500–2,000 URLs for free tiers).

Screaming Frog SEO Spider (paid, most thorough)

Screaming Frog crawls your entire site and can also crawl your sitemap separately, comparing listed URLs against what it actually finds. It checks status codes, canonical mismatches, and noindex conflicts for every URL in the file.

Limitation: Desktop software (Windows/Mac/Linux). The free tier caps at 500 URLs. The paid licence (roughly $250/year) is worthwhile for agencies and larger sites.

SEO suite audits (Ahrefs, Semrush, Moz — paid)

These platforms run site audits that include sitemap validation as part of a broader technical SEO sweep. Good for ongoing monitoring and issue tracking over time. Ahrefs Site Audit and Semrush Site Audit both surface sitemap issues in their crawl reports.

Limitation: Premium pricing. Overkill if you only need a sitemap check. If you want to compare AI chatbot platforms that also validate your sitemap during ingestion, see our side-by-side comparison or review Alee's pricing to find a plan that covers your site size.

Quick comparison

| Tool | Cost | Speed | URL limit | Checks status codes | GSC integration |
|---|---|---|---|---|---|
| Google Search Console | Free | Slow (days) | Unlimited | Partial | Native |
| XML-Sitemaps validator | Free | Instant | ~2,000 | No | No |
| Screaming Frog | Free/Paid | Fast | 500 / unlimited | Yes | Yes |
| Ahrefs Site Audit | Paid | Fast | Unlimited | Yes | Yes |
| Semrush Site Audit | Paid | Fast | Unlimited | Yes | Yes |

Sitemaps and your AI chatbot knowledge base

This is a connection most guides miss: your XML sitemap is not only for search engines. If you're running an AI chatbot trained on your website content — like Alee — the sitemap is the most efficient way to tell the bot which pages to ingest.

Alee's "Website URL / Sitemap" ingestion option reads your sitemap, follows every listed URL, and chunks the content into a vector knowledge base. If your sitemap lists 200 pages, the bot gets trained on all 200. If your sitemap has errors — broken URLs, robots.txt blocks, redirect chains — the bot silently skips those pages, and your chatbot ends up with a partial, inconsistent knowledge base.

Running a sitemap check before you train a chatbot on your site is the same discipline as running it before an SEO audit. Clean sitemap = complete, accurate knowledge base. The chatbot then answers visitor questions grounded only in your content, with sources — no hallucinations. You can explore Alee's features to see exactly how ingestion works.

Sitemap best practices for 2026

A few things have shifted in how search engines process sitemaps that aren't reflected in older guides.

Priority and changefreq are mostly ignored. The <priority> and <changefreq> fields were deprecated in practice years ago. Google has said publicly it doesn't use them. Don't spend time on them; leave them out or set them uniformly.

Video and image sitemaps still matter. If you have video content or image-heavy pages, separate video or image sitemap extensions help Google discover media that might otherwise be missed. These use additional XML namespaces (xmlns:video, xmlns:image).

Hreflang in sitemaps vs. HTML. For multilingual sites, you can declare hreflang annotations in the sitemap rather than (or in addition to) the HTML head. Both approaches work; in-sitemap declarations are easier to manage at scale. If you're using in-sitemap hreflang, a tool that validates hreflang consistency is essential — mismatched language codes or missing x-default values are extremely common.

Dynamic sitemaps beat static files. A weekly-regenerated static sitemap misses pages added in between. A dynamically generated sitemap — served from your routing layer — always reflects current site state. Most modern CMSs and frameworks support this out of the box.
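As a rough illustration of the dynamic approach, the function below builds sitemap XML on demand from (URL, last-modified date) pairs. Where those pairs come from is up to you; in practice they would be read from your database or CMS at request time. The name `render_sitemap` is an assumption for this sketch, and the `xml_declaration` argument needs Python 3.8 or newer.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def render_sitemap(pages: list[tuple[str, date]]) -> bytes:
    """Build sitemap XML from (url, last_modified) pairs."""
    # Setting xmlns as a plain attribute keeps the output free of ns0: prefixes.
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, modified in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url  # special characters are escaped on output
        ET.SubElement(entry, "lastmod").text = modified.isoformat()
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```

A route handler would return these bytes with a Content-Type of application/xml, so the file always matches the current state of the site.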

Submit to Bing too. Bing Webmaster Tools has a sitemap interface similar to GSC, and Bing powers Yahoo results. If you're not submitting there, you're leaving discovery coverage on the table.

Automating sitemap checks

Manual checks are fine for a one-time audit but break down for sites that publish frequently or have large product catalogs. Automation options:

Screaming Frog + scheduled crawl — Screaming Frog's scheduled crawl feature (paid) can email you a report when new errors appear.
Ahrefs / Semrush alerts — both platforms support "notify me when critical issues are found" emails triggered by their periodic site crawls.
GitHub Actions / CI pipeline — for developer-owned sites, running a lightweight sitemap lint step in your deploy pipeline catches structural errors before they reach production. Tools like sitemap-linter (npm) or a simple xmllint invocation can do this.
Google Search Console email alerts — GSC sends an email when it can't fetch your sitemap or when a page's index status changes significantly. Make sure Search Console email notifications are enabled for your property.
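For the CI pipeline option, a lint step can be as small as the sketch below: it checks that each generated sitemap file is well-formed XML and under the 50 MB limit, and returns a non-zero code when something fails. The function names are hypothetical, and the sketch assumes your build writes sitemap files to disk before deploy.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

MAX_BYTES = 50 * 1024 * 1024  # 50 MB uncompressed


def lint_file(path: str) -> list[str]:
    """Return error messages for one sitemap file; an empty list means it passed."""
    data = Path(path).read_bytes()
    errors = []
    if len(data) > MAX_BYTES:
        errors.append(f"{path}: {len(data)} bytes exceeds 50 MB")
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        errors.append(f"{path}: {exc}")
    return errors


def main(paths: list[str]) -> int:
    """Lint every path, print errors to stderr, and return a process exit code."""
    errors = [error for path in paths for error in lint_file(path)]
    for error in errors:
        print(error, file=sys.stderr)
    return 1 if errors else 0
```

In a pipeline you would call `sys.exit(main(sys.argv[1:]))` from the script's entry point so that a failed lint stops the deploy.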

For most teams, monthly Screaming Frog audits plus GSC email alerts catch sitemap issues before they cause ranking drops. For a broader workflow covering sitemaps, robots.txt, and schema, the tutorials section has step-by-step walkthroughs.

Sitemap checker for large and enterprise sites

Larger sites introduce complications that smaller sites don't face.

Faceted navigation and parameter URLs

E-commerce and content sites often generate thousands of near-duplicate URLs through facets, sort parameters, and pagination. Include only canonical, indexable URLs in the sitemap. Use robots.txt to block parameter URL crawling and keep crawl budget on pages that actually deserve attention.
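A simple filter can keep facet and sort URLs out of a generated sitemap. The parameter names below are placeholders, not a standard; replace them with whatever your own facets, sorting, and pagination actually use.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical parameter names; adjust to match your site's facets and sorting.
FACET_PARAMS = {"sort", "color", "size", "page"}


def is_sitemap_candidate(url: str) -> bool:
    """True when the URL carries no facet parameters and no fragment."""
    parts = urlsplit(url)
    params = {name for name, _ in parse_qsl(parts.query, keep_blank_values=True)}
    return not (params & FACET_PARAMS) and not parts.fragment
```

This only screens URL shape. A URL that passes should still be checked for a self-referencing canonical tag and a 200 response before it goes into the sitemap.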

JavaScript-rendered content

If your site is rendered client-side (React, Vue, Angular SPA without SSR), Googlebot may not see the rendered content when it crawls. Your sitemap may list URLs that technically exist but whose content Google sees as near-empty.

Verify with the URL Inspection tool in GSC: paste a listed URL, click "Test Live URL," and compare the HTML source with what you see in a browser. If the rendered content isn't showing in the "View tested page" output, you have a JavaScript rendering problem that no amount of sitemap optimization will fix — you need SSR or static generation.

Crawl budget allocation

For sites with millions of pages, Googlebot won't crawl everything in one pass. A well-organized sitemap index with accurate <lastmod> dates helps it prioritize recently changed pages and skip unchanged ones.

When sitemaps don't help (and what to do instead)

Sitemaps are not a ranking signal. Submitting a URL in a sitemap doesn't guarantee indexing, and it won't cause Google to rank a page it otherwise wouldn't.

Pages with thin content, duplicate content, or poor user signals (high bounce rate, low dwell time) may stay out of the index even with a perfect sitemap. Validation tools confirm technical health — they can't tell you whether the content deserves to rank. For the content side of the equation, browse more guides on building indexable, high-quality pages.

If you have a large indexed/submitted gap in GSC and the sitemap itself validates cleanly, the sitemap is doing its job; the gap more likely reflects content quality or crawl policy decisions Google is making about your pages. In that case, the next step is reviewing the "Pages" report in GSC for the specific reason each non-indexed URL was excluded.

Alee users often find this in reverse: training surfaces which pages were ingested, and the gaps reveal unreachable pages — a fast signal that something in the sitemap or URL structure needs attention. Start free to try ingesting your site.

Frequently asked questions

What is a sitemap checker and why do I need one?

A sitemap checker is a tool that fetches your XML sitemap and validates its structure, the HTTP responses of each listed URL, and conflicts with robots.txt or canonical tags. You need one because a broken sitemap quietly prevents pages from being discovered and indexed, and the symptoms — missing traffic, pages not appearing in search — often don't surface for weeks.

How often should I run a sitemap check?

For most sites, a monthly check is sufficient. Sites that publish daily or run large e-commerce catalogs benefit from weekly checks or automated monitoring. Always run a check after a major site migration, platform change, or CMS update — those events most commonly introduce sitemap errors.

Is Google Search Console's sitemap tool enough, or do I need a third-party sitemap checker?

GSC is authoritative for what Google has processed, but it's slow and doesn't proactively catch every structural error. The best workflow pairs a third-party sitemap checker — Screaming Frog or an online XML validator — for fast pre-launch checks, with GSC for ongoing monitoring and ground-truth indexing data.

Can a sitemap improve my search rankings?

Not directly — sitemaps are a discovery and crawl-efficiency tool, not a ranking signal. A clean sitemap helps ensure your pages are discovered and considered for indexing. Once indexed, rankings depend on content quality, relevance, and authority signals. A broken sitemap can hurt rankings indirectly by keeping pages out of the index entirely.

My sitemap shows 500 URLs but GSC only indexes 380. What's wrong?

This gap is common and usually means the 120 non-indexed URLs have a quality, duplication, or crawl issue rather than a sitemap problem. Open the Pages report in GSC, filter by "Not indexed," and read the reason for each. Common reasons are "Duplicate without user-selected canonical"; "Crawled, currently not indexed" (often thin content); and "Alternate page with proper canonical tag." Fix the underlying content or canonical issue; cleaning up the sitemap to remove genuinely non-indexable URLs is good housekeeping but won't by itself resolve the gap.

---

If there's one habit worth building into your SEO workflow, it's running a sitemap check after every significant site change. A ten-minute validation catches errors that would otherwise drain indexing coverage for months — and if your site powers an AI chatbot, a clean sitemap means a complete knowledge base for every visitor question.

[Alee](/) ingests your sitemap automatically when you add your site as a training source — train your chatbot in minutes and see exactly which pages made it into the knowledge base. [Get started free](/signup).
