Guides · 15 min read

Sitemap Issues Checker: Diagnose & Fix Every Error

Use a sitemap issues checker to find broken URLs, robots.txt conflicts, canonical mismatches, and more. Step-by-step diagnosis with tools and fixes.

Your sitemap could be silently broken right now — and you'd have no idea until you notice a ranking drop months later. A sitemap issues checker closes that gap: it probes every URL in your XML file, validates the structure, and surfaces the exact conflicts that stop search engines from processing your pages.

This guide goes beyond the basics. You'll get a full diagnostic workflow, a severity-ranked issue map, concrete fixes, and a monitoring framework — so you catch problems before Google does.

Key takeaways

  • A sitemap issues checker validates structure, probes every listed URL for HTTP status, and flags robots.txt conflicts, canonical mismatches, and noindex contradictions.
  • Triage by severity: an unreachable sitemap file and 404 errors are critical; robots.txt blocks are high priority; stale <lastmod> dates are low priority. Fix in that order.
  • Pair Google Search Console (authoritative, slow) with a third-party checker (fast, thorough) for complete coverage.
  • A large indexed/submitted gap in GSC signals content quality or canonical issues at the page level — not a sitemap problem.
  • A broken sitemap silently limits AI chatbots trained on your content; fix issues before ingesting your site.

What a sitemap issues checker actually looks for

Most people assume a sitemap is fine if it loads in a browser. That misses most real issues. A proper sitemap issues checker covers at least six categories of validation:

1. XML structure validation

The sitemap must be well-formed XML — valid declaration, correct root element (<urlset> or <sitemapindex>), right namespace, properly closed tags. One typo in the namespace string causes some parsers to reject the file silently. It loads fine in a browser; Googlebot quietly ignores it.

```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
```
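The structural checks above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not how any particular checker works; the function name `check_structure` is hypothetical, and it only verifies well-formedness, the root element, and the namespace.

```python
import xml.etree.ElementTree as ET

# Namespace required by the sitemap protocol (same string as in the snippet above).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def check_structure(xml_text):
    """Return a list of structural problems; an empty list means none were found."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]

    # ElementTree renders a namespaced tag as "{namespace}localname".
    if root.tag.startswith("{"):
        namespace, _, local = root.tag[1:].partition("}")
    else:
        namespace, local = "", root.tag

    problems = []
    if local not in ("urlset", "sitemapindex"):
        problems.append(f"unexpected root element: {local}")
    if namespace != SITEMAP_NS:
        problems.append(f"unexpected namespace: {namespace or '(none)'}")
    return problems
```

A one-character typo in the namespace string is exactly the case a browser won't show you: the file still parses as XML, but `check_structure` reports the unexpected namespace.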

2. Sitemap file HTTP status

The sitemap URL itself must return 200 OK. A 301 on the sitemap URL forces crawlers to follow the redirect before they can start reading — common after migrations where /sitemap.xml moved to /sitemap_index.xml.

3. URL-level HTTP status checks

Every listed URL gets probed for its HTTP response. A sitemap with 40 URLs returning 404 and 15 returning 301 signals a poorly maintained site — dead URLs burn crawl budget that could go to your best content.

4. robots.txt conflict detection

A URL both listed in your sitemap and blocked by robots.txt sends contradictory instructions. Google typically respects the robots.txt block, notes the conflict in Search Console, and keeps the page out of the index.

5. Canonical tag mismatches

Every page in your sitemap should have a <link rel="canonical"> pointing back to the same URL. If the canonical points elsewhere — say, the sitemap lists /products?page=2 but the canonical points to /products — the sitemap entry is effectively useless.

6. noindex contradictions

A page with <meta name="robots" content="noindex"> shouldn't appear in your sitemap. Google resolves the contradiction by honoring noindex, but the conflicting signal wastes the crawler's time and confuses your intent.

The full map of sitemap issues by severity

Not every issue deserves the same urgency. Here's how to triage when your audit returns a list of problems.

| Issue type | Severity | Impact | Fix urgency |
|---|---|---|---|
| Sitemap file returns 404 or 500 | Critical | Entire sitemap is inaccessible | Immediate |
| Sitemap exceeds 50,000 URLs or 50 MB | Critical | Parser stops at the limit; URLs after it are missed | Immediate |
| URLs returning 404 | Critical | Dead pages waste crawl budget; no indexing possible | Immediate |
| robots.txt blocks listed URLs | High | Pages blocked from indexing; contradictory signal | Within 24 hours |
| URLs returning 301/302 | High | Crawl budget wasted; canonical URL not submitted | Within 24 hours |
| noindex pages listed in sitemap | High | Contradictory signal; no indexing will occur | Within 24 hours |
| Canonical mismatches | High | Sitemap entry ignored; authority not consolidated | Within 48 hours |
| Wrong Content-Type header | Medium | Some parsers may reject the file | Within a week |
| Missing XML namespace | Medium | Some parsers silently skip the file | Within a week |
| Stale or inaccurate <lastmod> | Low | Googlebot stops trusting the field over time | Next audit cycle |
| Missing <lastmod> | Low | Minor crawl prioritization loss | Optional |
| <priority> / <changefreq> inaccuracies | Negligible | These fields are largely ignored by Google | Don't bother |

Work top-to-bottom. A sitemap with 20 canonical mismatches is annoying; a sitemap that returns 500 is a five-alarm fire.
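If your audit tool exports findings as a flat list, a small sort puts them in the order the table suggests. The sketch below is illustrative only: the issue keys (`url_404`, `robots_blocked`, and so on) are hypothetical labels for the table rows, not names any real checker emits.

```python
# Lower rank = fix sooner. Mirrors the severity column in the table above.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3, "negligible": 4}

# Hypothetical issue keys mapped to the severities from the table.
ISSUE_SEVERITY = {
    "sitemap_unreachable": "critical",
    "over_size_limit": "critical",
    "url_404": "critical",
    "robots_blocked": "high",
    "url_redirect": "high",
    "noindex_listed": "high",
    "canonical_mismatch": "high",
    "wrong_content_type": "medium",
    "missing_namespace": "medium",
    "stale_lastmod": "low",
    "missing_lastmod": "low",
    "priority_changefreq": "negligible",
}


def triage(findings):
    """Sort (issue_key, url) pairs so the most severe issues come first.

    sorted() is stable, so findings with equal severity keep their input order.
    """
    return sorted(findings, key=lambda finding: SEVERITY_RANK[ISSUE_SEVERITY[finding[0]]])
```

Calling `triage([("stale_lastmod", "/a"), ("url_404", "/b")])` moves the 404 to the front of the list.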

Step-by-step: how to run a complete sitemap issues check

A thorough check takes 20–30 minutes the first time; 10 minutes once you have a routine.

Step 1 — find your sitemap URL

Before you can check it, you need to know where it is. Most platforms use predictable paths:

| Platform | Sitemap URL |
|---|---|
| WordPress (Yoast or Rank Math) | /sitemap_index.xml |
| Shopify | /sitemap.xml |
| Webflow | /sitemap.xml |
| Ghost | /sitemap.xml |
| Next.js (next-sitemap) | /sitemap.xml |
| Squarespace | /sitemap.xml |
| Custom build | Check robots.txt for Sitemap: directive |

Open yourdomain.com/robots.txt and look for Sitemap: https://yourdomain.com/sitemap.xml. If it's not there, add it — every crawler that respects robots.txt will find your sitemap automatically.
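If you'd rather script the lookup than eyeball the file, the directive is easy to extract. This is a minimal sketch that assumes you have already downloaded robots.txt into a string; `sitemap_directives` is a hypothetical helper name.

```python
def sitemap_directives(robots_txt):
    """Return every URL declared with a Sitemap: directive in a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0]          # drop comments
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep and key.strip().lower() == "sitemap":
            found.append(value.strip())
    return found
```

An empty result means no sitemap is declared, which is your cue to add the directive.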

Step 2 — check the sitemap file itself

Before checking the content, verify the container:

```bash
curl -I https://yourdomain.com/sitemap.xml
```

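You want a 200 status (shown as HTTP/2 200 or HTTP/1.1 200 OK, depending on the server) and content-type: application/xml; text/xml is generally accepted as well. A 301 means the sitemap URL has moved, so update robots.txt and Search Console to point to the final destination. A 404 means the file is missing or the path is wrong.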
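The same reading of the curl output can be written as a small rule set, which is handy if you run the check across many sites. This sketch assumes you pass in the status code and Content-Type header yourself; the function name and the decision to accept `text/xml` alongside `application/xml` are assumptions, not requirements stated by any search engine.

```python
# Assumption: both XML media types are treated as acceptable.
ACCEPTED_TYPES = {"application/xml", "text/xml"}


def evaluate_sitemap_response(status, content_type):
    """Return problems with the sitemap file's own HTTP response (empty list = fine)."""
    problems = []
    if status in (301, 302):
        problems.append(f"{status}: sitemap URL redirects; point robots.txt and Search Console at the final URL")
    elif status != 200:
        problems.append(f"{status}: sitemap file is not being served")

    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if status == 200 and media_type not in ACCEPTED_TYPES:
        problems.append(f"unexpected Content-Type: {media_type or '(missing)'}")
    return problems
```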

Step 3 — validate XML structure

Paste your sitemap URL into a dedicated checker. Google Search Console is authoritative but slow; for immediate structural feedback use one of these:

  • XML-Sitemaps.com validator — free, instant, handles up to ~2,000 URLs
  • Screaming Frog SEO Spider (sitemap mode) — crawls all listed URLs, checks status codes and canonical tags; free up to 500 URLs, paid beyond that
  • Ahrefs Site Audit or Semrush Site Audit — include sitemap checks as part of a broader technical sweep

Step 4 — check every listed URL

The checker fetches each URL and records the HTTP status. You're looking for 200 OK across the board. Flag anything else:

  • 301/302 — redirect; update the sitemap to list the final URL directly
  • 404 — broken; remove or restore the page
  • 403 — forbidden; crawlers can't read it
  • 5xx — server error; fix at the server level

For sitemaps over 10,000 URLs, run Screaming Frog in sitemap crawl mode overnight rather than relying on a free online tool.

Step 5 — audit robots.txt, canonicals, and noindex tags

Compare your robots.txt disallow rules against the URL patterns in your sitemap. Any URL both listed and blocked by Disallow: is a contradiction — decide which instruction you mean and remove the other.
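Python's standard library ships a robots.txt parser, so the comparison can be automated. This is a rough sketch: `blocked_by_robots` is a hypothetical helper, and `urllib.robotparser` may resolve overlapping Allow/Disallow rules differently from Googlebot, so treat the output as a list of candidates to review rather than a verdict.

```python
from urllib.robotparser import RobotFileParser


def blocked_by_robots(robots_txt, urls, agent="Googlebot"):
    """Return the sitemap URLs that the given robots.txt body disallows for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from a string; no network access
    return [url for url in urls if not parser.can_fetch(agent, url)]
```

Feed it the robots.txt body and the list of `<loc>` values from your sitemap; anything it returns is both listed and blocked.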

For canonical and noindex issues: inspect the HTML head of each flagged page. The canonical should match the sitemap URL exactly: same protocol, same trailing slash, no extra parameters. Any <meta name="robots" content="noindex"> on a sitemap-listed page should be removed, or that page dropped from the sitemap.
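Inspecting the head by hand gets tedious past a few pages. The following sketch parses a page's HTML with the standard library and applies the two rules above; the class and function names are hypothetical, it only looks at meta tags (not the `X-Robots-Tag` HTTP header), and it compares the canonical as a literal string, which is deliberately strict.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Collect the canonical URL and any robots noindex directive from a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "link" and "canonical" in attr.get("rel", "").lower().split():
            self.canonical = attr.get("href")
        elif tag == "meta" and attr.get("name", "").lower() == "robots":
            if "noindex" in attr.get("content", "").lower():
                self.noindex = True


def audit_page(sitemap_url, html):
    """Return contradictions between a sitemap entry and the page's own tags."""
    signals = HeadSignals()
    signals.feed(html)
    problems = []
    if signals.noindex:
        problems.append("noindex on a sitemap-listed page")
    if signals.canonical is None:
        problems.append("no canonical tag found")
    elif signals.canonical != sitemap_url:  # exact match: protocol, slash, parameters
        problems.append(f"canonical mismatch: {signals.canonical}")
    return problems
```

Because the comparison is exact, a missing trailing slash is reported as a mismatch.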

Step 6 — cross-reference with Google Search Console

Submit your sitemap in Search Console → Indexing → Sitemaps and check the report. Watch for:

  • Discovered URLs vs. Indexed — a large gap usually means content quality or canonicalization issues at the page level, not a broken sitemap
  • "Couldn't fetch" — network or auth error; check server logs
  • "Has errors" — expand for specific error types

Third-party tools validate what the file says; GSC shows what Googlebot actually did with it.

Common sitemap issues explained and fixed

Redirect chains in the sitemap

This is the most common issue in sitemaps after site migrations. You switched from HTTP to HTTPS, or restructured URLs, but the sitemap still lists the old paths. Googlebot follows each redirect before landing on the final URL — burning crawl budget on every hop.

Fix: List only the final URL. For HTTP → HTTPS migrations, confirm every sitemap entry starts with https://. In WordPress, regenerating the sitemap after a migration usually handles this; verify with an audit tool before resubmitting.

URL in sitemap but blocked by robots.txt

You want the page indexed (that's why you listed it), but your robots.txt says Disallow: /category/. This happens when developers block crawling during development and forget to undo specific rules before going live.

Fix: Decide which instruction you mean. Should the page be indexed? Remove or narrow the robots.txt disallow rule. Should it stay hidden? Remove the URL from the sitemap. Never leave contradictory instructions in place.

noindex on a page in the sitemap

Tag pages, author archives, and thin content pages often get noindex tags added by SEO plugins — correctly. But the same plugin may still include those URLs in the sitemap. Running a tool that crawls each listed URL and inspects meta robots tags catches this immediately.

Fix: In Yoast or Rank Math, configure which post types and taxonomies appear in the sitemap. Pages set to "noindex" should auto-exclude — confirm this in your plugin's sitemap settings.

Canonical points to a different URL than the sitemap entry

Imagine your sitemap lists https://yourdomain.com/blog/post-title/ (with trailing slash), but the canonical tag says https://yourdomain.com/blog/post-title (no trailing slash). Google treats these as different URLs — the sitemap entry is ignored.

Fix: Standardize your URL format across the site — canonical tags and sitemap entries must match exactly. In WordPress, the Yoast "Canonical URL" setting controls this.

Sitemap includes paginated URLs

Paginated archive URLs such as /blog/page/2/ and /blog/page/3/ are usually duplicate or thin content and generally shouldn't appear in your sitemap. They send crawlers down a path that leads nowhere indexable.

Fix: In Yoast, go to SEO → Search Appearance → Archives → Post pagination and set to noindex — those pages auto-exclude from the sitemap. Rank Math has the same option under sitemap settings.

<lastmod> dates always equal "today"

Some CMS plugins set <lastmod> to the current date every time the sitemap regenerates — even for pages unchanged in years. Googlebot eventually learns to ignore the field entirely.

Fix: Configure your plugin to use actual content modification dates. In WordPress that's the post_modified_gmt field. If the plugin can't be configured this way, omit <lastmod> entirely — an absent date is less damaging than a systematically wrong one.
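One way to spot this pattern is to count how many entries share the same date. The sketch below is a heuristic under stated assumptions: the function name, the 90% threshold, and the five-URL minimum are all arbitrary choices for illustration, and a site that genuinely republished everything on one day would trigger it too.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def lastmod_looks_regenerated(xml_text, threshold=0.9, min_urls=5):
    """Return True when nearly every <lastmod> in the sitemap falls on the same day."""
    root = ET.fromstring(xml_text)
    # Keep only the date part (YYYY-MM-DD) of each timestamp.
    dates = [el.text.strip()[:10] for el in root.findall("sm:url/sm:lastmod", NS) if el.text]
    if len(dates) < min_urls:
        return False  # too few entries to say anything useful
    most_common_count = Counter(dates).most_common(1)[0][1]
    return most_common_count / len(dates) >= threshold
```

If it returns True for a site whose content spans years, the plugin is very likely stamping regeneration time rather than modification time.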

Choosing the right sitemap issues checker

Different tools suit different situations. Here's how to pick without overspending.

Google Search Console (free, authoritative, slow): best for confirming what Google processed and monitoring the indexed/submitted gap. Data lags 48–72 hours — don't rely on it for rapid post-change validation.

Online XML validators (free, fast, limited): XML-Sitemaps.com and the W3C validator confirm the file is well-formed. Free and instant, but most cap at 500–2,000 URLs and don't check HTTP status codes.

Screaming Frog SEO Spider (best for one-off audits): run in "sitemap" mode to fetch every listed URL, check status codes, canonical tags, noindex flags, and robots.txt conflicts. Free up to 500 URLs; roughly £250/year for unlimited.

Ahrefs / Semrush Site Audit (best for ongoing monitoring): periodic crawls with email alerts when new issues appear. Higher cost — these are full SEO platforms — but automatic scheduling is the main advantage.

The sweet spot for most teams: Screaming Frog for fast post-change audits, GSC for ongoing ground-truth monitoring. See how Alee compares to SiteGPT if you're also evaluating AI-assisted content ingestion alongside your SEO tooling.

Sitemap issues checker and your AI knowledge base

There's a less-obvious reason to take sitemap health seriously: your AI chatbot may be reading your sitemap too.

If you're running an AI assistant trained on your website content — like Alee — the sitemap is typically how the platform discovers which pages to ingest. Alee's "Website URL / Sitemap" source reads your sitemap, follows every listed URL, and builds a vector knowledge base from the content it finds.

If your sitemap has broken URLs, those pages are skipped. If it has robots.txt conflicts, those pages can't be read. If entire sections are missing because your plugin excluded the wrong post types, your chatbot simply won't know about those topics.

Running this check before setting up a content-trained chatbot is the same due diligence as running one before a Search Console submission. A clean sitemap means a complete knowledge base — and a chatbot that can answer your visitors' questions accurately, sourced from your real content.

Explore how this works on Alee's features page. If you're weighing plans, the pricing page covers what's included at each tier.

Automating your sitemap issue monitoring

A one-time check is a start — but sites change constantly. Here's how to build recurring coverage without much overhead.

GSC email alerts. Search Console emails you when it can't fetch your sitemap or when index status changes significantly. Enable under Settings → Messages → Email. Free, zero maintenance.

Scheduled Screaming Frog crawls. The paid tier runs crawls on a schedule and emails issue-diff reports — only flagging problems that weren't present in the previous crawl. Set weekly or monthly based on your publish cadence.

CI/CD sitemap linting. For developer-owned sites, add structural validation to your deploy pipeline:

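```yaml
- name: Validate sitemap
  # curl fetches the file; xmllint reads it from stdin ("-") and checks well-formedness
  run: curl -fsS https://yoursite.com/sitemap.xml | xmllint --noout -
```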

This catches XML errors before they reach production — a fast, free safety net that won't cover URL-level issues.

Ahrefs / Semrush crawl alerts. Both platforms generate issue-diff reports on a schedule and email you when errors appear or worsen.

Sitemap issues checkers for large and enterprise sites

Large sites introduce complexity that standard tools don't handle well.

Sitemap index validation

Sites over 50,000 URLs use a sitemap index — a parent XML file that references child sitemaps. A good sitemap audit tool validates each child, not just the index. Common problems: child sitemaps returning 404, individual child files exceeding the 50,000 URL limit, and namespace inconsistencies between the index and child files. Screaming Frog and GSC both follow sitemap index references automatically.
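The two index-level checks, listing the children and testing each child against the size limits, can be sketched as follows. The helper names are hypothetical, and the code assumes you have already downloaded each file; it does not fetch anything itself.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MB, uncompressed


def child_sitemaps(index_xml):
    """Return the <loc> of every child sitemap referenced by a sitemap index."""
    root = ET.fromstring(index_xml)
    return [el.text.strip() for el in root.findall("sm:sitemap/sm:loc", NS) if el.text]


def over_limits(sitemap_xml):
    """Check one (uncompressed) sitemap file, passed as bytes, against the limits."""
    problems = []
    if len(sitemap_xml) > MAX_BYTES:
        problems.append(f"{len(sitemap_xml)} bytes exceeds the 50 MB limit")
    count = len(ET.fromstring(sitemap_xml).findall("sm:url", NS))
    if count > MAX_URLS:
        problems.append(f"{count} URLs exceeds the {MAX_URLS:,} limit")
    return problems
```

Run `over_limits` on each file returned by `child_sitemaps`, not on the index alone.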

JavaScript-rendered pages

SPAs (React, Vue, Angular without SSR) often list URLs in the sitemap correctly but serve near-empty HTML to crawlers — the content loads client-side. A sitemap checker sees 200 OK and moves on; Googlebot sees a blank page. Verify with the URL Inspection tool in GSC: "Test Live URL" → compare the rendered output to what you see in a browser. If it's empty, fix your rendering setup first — no sitemap tool can resolve that for you.

Faceted navigation URL explosion

E-commerce filter URLs (/products?color=blue&size=M) can generate thousands of near-duplicate pages. If these end up in the sitemap — usually a plugin misconfiguration — you're wasting crawl budget. Configure your generator to exclude query-string URLs and use canonical tags on filter pages to consolidate authority to the base category URL.
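If your generator has no exclusion setting, a quick pass over the URL list shows how much of the sitemap is filter noise. This is a minimal sketch with a hypothetical helper name; it treats any URL with a query string as faceted, which is an assumption that won't suit sites whose canonical pages legitimately use parameters.

```python
from urllib.parse import urlsplit


def split_faceted(urls):
    """Separate clean URLs from ones carrying a query string.

    Returns (clean, faceted) so you can keep the first list and review the second.
    """
    clean, faceted = [], []
    for url in urls:
        (faceted if urlsplit(url).query else clean).append(url)
    return clean, faceted
```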

Before you submit: a sitemap validation checklist

Run through this before submitting or resubmitting any sitemap — whether it's a first-time setup, post-migration, or after a CMS update.

  • [ ] Sitemap URL returns 200 OK with content-type: application/xml
  • [ ] XML is well-formed with correct root element and namespace
  • [ ] All listed URLs return 200 OK — no 404s, 301s, or 5xx
  • [ ] No listed URL is blocked by a robots.txt Disallow rule
  • [ ] Every page's canonical matches its sitemap URL exactly
  • [ ] No listed URL carries a noindex meta tag
  • [ ] <lastmod> reflects actual content modification dates, or is omitted
  • [ ] File is under 50,000 URLs and 50 MB uncompressed
  • [ ] Sitemap URL is declared in robots.txt with a Sitemap: directive
  • [ ] Submitted in Google Search Console and Bing Webmaster Tools

See Alee's tutorials for how this checklist applies when connecting your sitemap as a chatbot knowledge source.

When your sitemap check is clean but pages still aren't indexed

A common frustration: you fix every flagged issue, resubmit to GSC — and pages still aren't indexed. That's almost never a sitemap problem. Here's what it actually is:

Thin content. Pages with little original text or near-duplicate content get excluded by Google's quality filters regardless of sitemap health — a checker validates structure, not content quality.

Crawled but not indexed. GSC's "Pages" report labels these "Crawled - currently not indexed." Google fetched the page and chose not to index it; the problem is a quality signal, not a technical error.

Soft 404s. A page returns 200 OK but shows "no results found." Checkers pass it; only manual review catches it.

Orphan pages. Pages with no internal links get crawled infrequently even when listed in the sitemap. Google favors well-linked pages.

The path forward is content and architecture work, not more sitemap debugging. Explore more guides on building indexable, well-linked pages.

Frequently asked questions

What is a sitemap issues checker and what does it find?

A sitemap issues checker fetches your XML sitemap, validates its structure, and probes every listed URL for problems — including broken pages (404s), redirect chains (301s), pages blocked by robots.txt, noindex contradictions, and canonical mismatches. It's the diagnostic tool that tells you exactly why certain pages aren't being crawled or indexed.

Which free sitemap issues checker is most reliable?

Google Search Console is the most authoritative — it shows exactly what Google processed and any errors it encountered. For faster structural validation before submitting to GSC, XML-Sitemaps.com works well for sitemaps up to 2,000 URLs. For URL-level checking on small sites, Screaming Frog's free tier (500 URLs) is the most thorough without a paid account.

How do I fix a sitemap that GSC says "has errors"?

Click through to the specific sitemap and expand the error list. "Couldn't fetch" means the sitemap URL isn't accessible — check HTTP status and robots.txt. "Couldn't parse" means malformed XML — run it through an XML validator. URL-specific errors link to the Pages report with the reason each URL was excluded.

My checker shows broken URLs — remove them or redirect them?

Both are valid, but updating the sitemap is required either way. If you've redirected the old URL, update the sitemap to list the final destination directly. If the page is gone with no useful redirect target, remove it from the sitemap. Listing dead or redirected URLs wastes crawl budget.

How often should I run a sitemap issues check?

Monthly is sufficient for most sites. Run an immediate check after any site migration, URL restructure, CMS upgrade, major content deletion, or SEO plugin switch. High-volume sites benefit from weekly automated checks via Screaming Frog scheduled crawls or Ahrefs Site Audit.

---

Sitemap maintenance isn't glamorous work, but a 20-minute audit can uncover years' worth of quiet indexing losses. Fix critical issues first, automate monitoring for ongoing coverage, and resubmit to Search Console after any significant change.

[Alee](/) reads your sitemap automatically when you add your site as a training source — a clean sitemap means a complete AI knowledge base from day one. [Start free](/signup) and see which pages make it in.
