Sitemap Analytics: Read the Data That Drives Indexing

Learn how to use sitemap analytics in Google Search Console and beyond to fix crawl gaps, improve indexation, and turn submission data into SEO wins.

Most site owners submit an XML sitemap and immediately forget it exists. That's a mistake — because the same sitemap that tells Google where to crawl also generates a stream of data showing exactly which pages got indexed, which got ignored, and why. Sitemap analytics is the practice of reading that data systematically and acting on it.

Done well, this practice closes the loop between "I published content" and "search engines can actually find and rank it." Done poorly — or not done at all — you can run a technically clean site for months while a large slice of your pages sit in indexing limbo.

Key takeaways

The Sitemaps report in Google Search Console is your primary sitemap analytics dashboard: paired with the page indexing view for each sitemap, it gives you submitted vs. indexed counts per sitemap file, refreshed as Google re-reads each file.
A big gap between "submitted" and "indexed" is a signal, not a verdict — the root cause could be content quality, duplicate content, crawl budget, or server speed.
Segment your sitemaps by content type (blog posts, product pages, landing pages) so you can read crawl data per segment instead of averaging everything together.
The <lastmod> tag helps Googlebot prioritize fresh content — but only if the dates are accurate and change only when content actually changes.
Crawl and indexing data feeds into your content pipeline: pages that Google ignores are candidates for consolidation, noindex, or improvement before you invest more in similar content.
Tools like Alee can use your sitemap as a live content source, meaning sitemap health directly affects your chatbot's knowledge base — bad indexing and stale sitemaps hurt both channels simultaneously.

---

What sitemap analytics actually measures

This practice isn't captured by a single tool or a single number. It's a cluster of data points spread across a few places, and the skill is knowing which signal belongs where.

The core metrics break into three categories:

Submission metrics — what you've told Google about your site:

Number of URLs submitted across each sitemap file
Last submission date
Whether the sitemap was fetched without errors (HTTP 200, parseable XML, correct content-type)

Index metrics — what Google actually did with those submissions:

URLs indexed vs. URLs submitted (the gap is where most SEO problems live)
Indexed pages that are NOT in the sitemap (can indicate orphaned content or canonicalization issues)
Index coverage status per URL: Indexed, Crawled - currently not indexed, Discovered - currently not indexed, or excluded (for example by a noindex tag)

Crawl metrics — how often and how deeply Googlebot visits:

Crawl frequency per URL (approximated from server logs or log-analysis tools)
Crawl budget allocation across sections of the site
Time-to-index for newly submitted URLs

Most site owners only look at the first category. The second and third are where the actionable decisions actually live.

---

Setting up sitemap analytics in Google Search Console

Google Search Console (GSC) is the canonical source for this data. If you haven't submitted your sitemap yet, start at Indexing → Sitemaps in your GSC property and paste the sitemap URL (usually yourdomain.com/sitemap.xml or yourdomain.com/sitemap_index.xml).

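Once submitted, the Sitemaps report shows one row per sitemap file:

Status: Success, Has errors, or Couldn't fetch
Submitted: the date you submitted the sitemap
Last read: when Google last fetched your sitemap file
Discovered pages: how many URLs Google found in the file

To get the indexed count for a file, open its row and follow the link to the page indexing view for that sitemap. The total of submitted URLs there is your denominator and the indexed total is your numerator.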

The single most important figure here is the submitted-to-indexed ratio. A ratio below 70% on a content site that's been live for more than six months deserves investigation. On a new site or a site that publishes dozens of pages per week, a lower ratio is normal while Google catches up.
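
The ratio check above is simple arithmetic, and a minimal sketch makes the threshold explicit. The file names and counts below are hypothetical values copied by hand from GSC, and the 70% default simply mirrors the rule of thumb in the text rather than any Google-defined cutoff.

```python
def indexed_ratio(submitted: int, indexed: int) -> float:
    """Return indexed / submitted as a fraction between 0 and 1."""
    if submitted <= 0:
        raise ValueError("submitted must be a positive URL count")
    return indexed / submitted


def needs_investigation(submitted: int, indexed: int, threshold: float = 0.70) -> bool:
    """True when the ratio falls below the threshold (70% by default)."""
    return indexed_ratio(submitted, indexed) < threshold


# Hypothetical (submitted, indexed) counts per sitemap file.
counts = {
    "sitemap-blog.xml": (200, 180),
    "sitemap-products.xml": (500, 280),
    "sitemap-landing-pages.xml": (40, 18),
}
flagged = [name for name, (sub, idx) in counts.items() if needs_investigation(sub, idx)]
print(flagged)  # ['sitemap-products.xml', 'sitemap-landing-pages.xml']
```

For a new or fast-publishing site, pass a lower `threshold` so the check does not flag files that are simply waiting for Google to catch up.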

Reading the Coverage report alongside sitemaps

The Sitemaps report gives you counts; the Coverage report (Indexing → Pages in GSC) gives you the breakdown by status. Switch its filter from "All known pages" to "All submitted pages" to see only URLs you've explicitly told Google about. The status categories you'll see:

| Status | What it means | Typical action |
|---|---|---|
| Indexed, not submitted in sitemap | Google found it elsewhere — could be good or messy | Decide whether to add to sitemap or noindex |
| Submitted and indexed | Healthy | Monitor for drops |
| Submitted URL marked 'noindex' | Conflict: your sitemap and meta tag disagree | Remove from sitemap or remove the noindex |
| Crawled, currently not indexed | Google crawled it but didn't add it | Check thin content, duplication, internal linking |
| Discovered, currently not indexed | In the queue but not yet crawled | Could be crawl budget, internal link depth, or just recency |
| Excluded by noindex | You blocked it | Correct if unintentional |

Don't try to fix every row at once. Sort by volume and work top-down — the biggest clusters tell you the systemic problem.

---

The submitted vs. indexed gap: diagnosing the real cause

A gap between submitted and indexed URLs is the most common finding in sitemap analytics, and it's almost never just one thing. Here's a structured approach to diagnosing it.

Step 1 — Rule out submission errors first

Before assuming content quality, confirm your sitemap file is actually valid:

Does yourdomain.com/sitemap.xml return HTTP 200?
Is the content-type application/xml or text/xml?
Does the XML parse without errors? (Paste it into an XML validator.)
If using a sitemap index, does each child sitemap file also load cleanly?

One common trap: a plugin generates the sitemap correctly, but a caching layer serves a stale, malformed version. Fetch the file with curl so you see what the server returns right now, not a copy from your browser cache.
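
The checks in this step can be sketched as one small function. This is an illustration under assumptions, not a full validator: `check_sitemap` is a hypothetical helper that takes the status code, content-type header, and body you already fetched (with curl or any HTTP client), and it only confirms the basics listed above.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XML_TYPES = {"application/xml", "text/xml"}


def check_sitemap(status: int, content_type: str, body: bytes) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type not in XML_TYPES:
        problems.append(f"unexpected content-type: {content_type!r}")
    try:
        root = ET.fromstring(body)
    except ET.ParseError as exc:
        problems.append(f"XML does not parse: {exc}")
        return problems
    expected_roots = (f"{{{SITEMAP_NS}}}urlset", f"{{{SITEMAP_NS}}}sitemapindex")
    if root.tag not in expected_roots:
        problems.append(f"unexpected root element: {root.tag}")
    return problems


good = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://yoursite.com/</loc></url></urlset>"
)
print(check_sitemap(200, "application/xml; charset=utf-8", good))  # []
print(check_sitemap(200, "text/html", b"<html><body>cached error page"))
```

For a sitemap index, run the same check on each child file listed in its `<loc>` entries.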

Step 2 — Check for thin or near-duplicate content

"Crawled, currently not indexed" almost always points here. Google's guidelines are clear: pages with little unique value relative to what already exists in its index won't be indexed. This applies to:

Category pages with only two or three products
Tag pages that duplicate category content
Location pages generated from a template with minimal unique text per city
Blog posts under 400 words with no original perspective

If you're seeing large volumes in this bucket, open a sample of those URLs and ask honestly: does this page deserve to exist as a standalone document?

Step 3 — Check crawl budget allocation

Crawl budget matters on sites with 10,000+ URLs. If Googlebot is spending its allocation on pagination, parameter URLs, or filtered pages, important content waits in the queue longer. Signs of crawl budget waste:

The Coverage report shows "Discovered - currently not indexed" for important pages that have been live for weeks
Server logs show Googlebot hitting /category?sort=price&page=47 dozens of times daily
Your robots.txt doesn't disallow parameter-heavy URLs

Fix it by disallowing low-value URL patterns in robots.txt and pointing duplicate variants at a canonical URL. Don't count on <priority> or <changefreq> here: Google ignores both values.
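
To put a number on crawl budget waste, one rough approach is to measure what share of Googlebot requests carry a query string. The sketch below assumes you have already extracted the requested paths for Googlebot from your logs; `parameter_crawl_share` and the sample paths are hypothetical.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def parameter_crawl_share(paths: list[str]) -> tuple[float, Counter]:
    """Return the share of requests with a query string and a count per parameter name."""
    param_hits = 0
    param_names = Counter()
    for path in paths:
        query = urlsplit(path).query
        if query:
            param_hits += 1
            param_names.update(name for name, _ in parse_qsl(query, keep_blank_values=True))
    share = param_hits / len(paths) if paths else 0.0
    return share, param_names


# Hypothetical paths requested by Googlebot in one day.
googlebot_paths = [
    "/category?sort=price&page=47",
    "/category?sort=price&page=48",
    "/blog/sitemap-analytics",
    "/products/blue-widget",
]
share, names = parameter_crawl_share(googlebot_paths)
print(f"{share:.0%} of Googlebot requests hit parameter URLs")
print(names.most_common())
```

The parameter names that dominate the count are the first candidates for a robots.txt disallow rule.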

Step 4 — Look at page speed and server response time

Googlebot adjusts how much it crawls to how your server responds. Pages that regularly take several seconds to respond, or that return server errors, get crawled less often. If you're on shared hosting or your server slows under load, that alone can suppress indexation rates.

---

Segmenting your sitemaps for better data

One sitemap file containing all your URLs makes it impossible to read crawl data by content type. Split your sitemap by section and submit each file separately to GSC. That way you see indexed ratios per segment:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yoursite.com/sitemap-blog.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://yoursite.com/sitemap-products.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://yoursite.com/sitemap-landing-pages.xml</loc>
  </sitemap>
</sitemapindex>
```

Now your GSC Sitemaps report shows three rows. If your blog has a 90% indexed ratio but your landing pages are at 45%, you know exactly where to focus. Without segmentation you'd see one blended number and have no idea where the problem lives.

This is the single highest-leverage structural change most sites can make to their sitemap setup. It costs almost nothing to implement and makes the resulting data dramatically more actionable.
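
If your CMS does not split sitemaps for you, the grouping itself is easy to sketch. The example below assumes the first path segment identifies the content type, which will not hold for every site; the `SEGMENT_FILES` mapping and `segment_urls` helper are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical mapping from the first path segment to a sitemap file name.
SEGMENT_FILES = {"blog": "sitemap-blog.xml", "products": "sitemap-products.xml"}
DEFAULT_FILE = "sitemap-landing-pages.xml"


def segment_urls(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs by the sitemap file their first path segment maps to."""
    groups = defaultdict(list)
    for url in urls:
        first_segment = urlsplit(url).path.strip("/").split("/")[0]
        groups[SEGMENT_FILES.get(first_segment, DEFAULT_FILE)].append(url)
    return dict(groups)


urls = [
    "https://yoursite.com/blog/sitemap-analytics",
    "https://yoursite.com/products/blue-widget",
    "https://yoursite.com/pricing",
    "https://yoursite.com/",
]
for filename, members in segment_urls(urls).items():
    print(filename, len(members))
```

Each group then becomes one child file referenced from the sitemap index.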

---

Using `<lastmod>` to improve crawl prioritization

The <lastmod> tag in your sitemap tells Google when a URL was last modified. Used correctly, it helps Googlebot recrawl updated content faster. Used incorrectly, it trains Googlebot to ignore your signals entirely.

What "correctly" looks like:

Update <lastmod> only when the page's actual content changes — a published revision, a price update, a new section
Use W3C Datetime format, a profile of ISO 8601: either a plain date (2026-06-18) or a full timestamp with offset (2026-06-18T09:30:00+05:30)
Don't update it on every sitemap regeneration if the page content didn't change

What "incorrectly" looks like:

Auto-generating <lastmod> as the current date/time every time the sitemap is requested or rebuilt (the default behavior of some plugins)
Setting every page's <lastmod> to the same date
Omitting <lastmod> entirely on content that changes frequently

When you set <lastmod> accurately and consistently, you should start to see shorter time-to-index for updated pages. GSC doesn't report that figure directly, so track it yourself by comparing update dates with crawl and index dates. That's a direct feedback loop: better metadata → faster crawls → better data → further improvements.
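
One way to keep `<lastmod>` honest is to tie it to a hash of the page content, so the date moves only when the content does. This is a sketch of that idea, not how any particular plugin works; `next_lastmod` and the stored hash and date are hypothetical, and you would persist them wherever your sitemap generator keeps state.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def next_lastmod(content: str, stored_hash: Optional[str],
                 stored_lastmod: Optional[str], now: datetime) -> tuple[str, str]:
    """Return (hash, lastmod); lastmod only moves when the content hash changes."""
    new_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if new_hash == stored_hash and stored_lastmod is not None:
        return new_hash, stored_lastmod  # unchanged content keeps its old date
    return new_hash, now.isoformat(timespec="seconds")


day_one = datetime(2026, 6, 18, 9, 30, tzinfo=timezone.utc)
day_two = datetime(2026, 6, 19, 9, 30, tzinfo=timezone.utc)

h1, lm1 = next_lastmod("Price: $10", None, None, day_one)
h2, lm2 = next_lastmod("Price: $10", h1, lm1, day_two)  # rebuild, no change
h3, lm3 = next_lastmod("Price: $12", h2, lm2, day_two)  # real price update
print(lm1, lm2, lm3)
```

Hash the meaningful content only (body text, price, primary image URL), not the full HTML, or a rotating banner or timestamp in the template will change the hash on every build.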

---

Log-file analytics vs. sitemap analytics: complementary sources

GSC's Sitemap report is Google's view of your sitemap. Your server access logs are the unfiltered record of every request Googlebot actually made. The two sources answer different questions:

The GSC Sitemaps report tells you:

Which URLs Google chose to index from your submissions
Which submissions have errors
How the indexed count changes over time

Log-file analytics tells you:

Which URLs Googlebot is crawling most (often very different from what you'd expect)
How often each URL is revisited
Whether Googlebot is hitting URLs not in your sitemap at all
Whether crawl frequency correlates with your <lastmod> updates

Together they close the loop. A URL that's "Discovered - currently not indexed" in GSC and shows zero Googlebot hits in your logs hasn't been fetched at all; if that lasts for weeks, suspect crawl budget exhaustion, weak internal linking, or a server that looked overloaded when Google tried to schedule the crawl. A URL showing frequent crawl hits but "Crawled - currently not indexed" in GSC suggests Google is visiting but not liking what it finds.

Log-file analysis requires raw server log access (via your hosting panel or a CDN like Cloudflare) or a tool like Screaming Frog Log File Analyser. It's overkill for small sites; essential for sites with 10,000+ URLs.
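
For a first look before reaching for a dedicated tool, counting Googlebot hits per URL takes only a few lines. The sketch below assumes the common "combined" access log format and matches on the user-agent string alone; user agents can be spoofed, so treat the counts as approximate. The regex, `googlebot_hits`, and the sample lines are illustrative.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer, and user-agent fields of a combined log line.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines) -> Counter:
    """Count requests per path whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1
    return hits


BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
lines = [
    f'66.249.66.1 - - [18/Jun/2026:09:30:00 +0000] "GET /blog/post-a HTTP/1.1" 200 5123 "-" "{BOT}"',
    f'66.249.66.1 - - [18/Jun/2026:10:30:00 +0000] "GET /blog/post-a HTTP/1.1" 200 5123 "-" "{BOT}"',
    f'66.249.66.1 - - [18/Jun/2026:10:31:00 +0000] "GET /category?sort=price&page=47 HTTP/1.1" 200 900 "-" "{BOT}"',
    '203.0.113.7 - - [18/Jun/2026:10:32:00 +0000] "GET /blog/post-b HTTP/1.1" 200 4100 "-" "Mozilla/5.0 Chrome/125.0"',
]
hits = googlebot_hits(lines)

sitemap_paths = {"/blog/post-a", "/blog/post-b"}  # hypothetical sitemap contents
never_crawled = sitemap_paths - set(hits)
outside_sitemap = set(hits) - sitemap_paths
print(hits.most_common(), never_crawled, outside_sitemap)
```

The two set differences map onto the questions above: sitemap URLs Googlebot never requested, and crawled URLs that are not in the sitemap at all.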

---

Crawl data for news and e-commerce content

Static blog sites have it relatively simple — publish, submit, monitor. Two content types create their own indexing challenges though.

News sites and Google News sitemaps

Google News uses a separate sitemap extension (the news: namespace), and a News sitemap should list only articles published within the last two days. The analytics challenge here is speed: you need to know within hours whether a story got picked up, not days.

The GSC News report (Search results → filter by Search type: Google News) gives you impression data per article. Cross it with sitemap submission timestamps to calculate time-to-impression — a practical proxy for time-to-index on news content.
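
The time-to-impression calculation is a subtraction per article plus a summary statistic. The sketch below assumes you have recorded both timestamps yourself as ISO 8601 strings; the article paths and times are made up, and the granularity you can actually get from GSC exports may be coarser than hours.

```python
from datetime import datetime
from statistics import median


def hours_to_first_impression(submitted_at: str, first_impression_at: str) -> float:
    """Hours between sitemap submission and the first recorded impression."""
    delta = datetime.fromisoformat(first_impression_at) - datetime.fromisoformat(submitted_at)
    return delta.total_seconds() / 3600


# Hypothetical (submitted, first impression) timestamps per article.
articles = {
    "/news/story-a": ("2026-06-18T09:30:00+00:00", "2026-06-18T12:30:00+00:00"),
    "/news/story-b": ("2026-06-18T10:00:00+00:00", "2026-06-19T10:00:00+00:00"),
    "/news/story-c": ("2026-06-18T11:00:00+00:00", "2026-06-18T17:00:00+00:00"),
}
lags = {path: hours_to_first_impression(*times) for path, times in articles.items()}
print(median(lags.values()))  # 6.0
```

The median is used here rather than the mean so one slow story does not hide an otherwise healthy pickup time.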

E-commerce and product catalog sitemaps

Product pages have unique challenges: high turnover (products go out of stock or get discontinued), parameter-heavy URLs that inflate sitemap size with low-value duplicates, and price or availability changes that should trigger <lastmod> updates but often don't.

Check your crawl data weekly for e-commerce. A product going out of stock doesn't need an immediate sitemap removal, but pages returning 404 or redirecting to a dead category should be cleaned up within days, not weeks.

A practical checklist for e-commerce sitemap hygiene:

[ ] Exclude paginated URLs (?page=2, etc.) from the sitemap or use canonical tags
[ ] Exclude parameter-filtered URLs (?color=red, ?sort=price)
[ ] Include only canonical product URLs (not variant URLs unless each variant has distinct content)
[ ] Update <lastmod> when price, description, or primary image changes
[ ] Monitor 404 rates in GSC's Coverage report weekly
[ ] Reconcile sitemap size against actual product count monthly
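
The exclusion and reconciliation items in this checklist can be automated with a simple set comparison. This sketch assumes you can export canonical product URLs from your catalog; `reconcile` and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit


def reconcile(sitemap_urls: list[str], catalog_urls: list[str]) -> dict[str, list[str]]:
    """Compare sitemap entries against the canonical product URLs from the catalog."""
    parameter_urls = sorted(u for u in sitemap_urls if urlsplit(u).query)
    clean = {u for u in sitemap_urls if not urlsplit(u).query}
    catalog = set(catalog_urls)
    return {
        "parameter_urls": parameter_urls,                   # should not be in the sitemap
        "missing_from_sitemap": sorted(catalog - clean),    # live products Google was never told about
        "not_in_catalog": sorted(clean - catalog),          # likely discontinued or dead pages
    }


sitemap_urls = [
    "https://yoursite.com/products/blue-widget",
    "https://yoursite.com/products/blue-widget?color=red",
    "https://yoursite.com/products/discontinued-widget",
]
catalog_urls = [
    "https://yoursite.com/products/blue-widget",
    "https://yoursite.com/products/green-widget",
]
report = reconcile(sitemap_urls, catalog_urls)
for label, found in report.items():
    print(label, found)
```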

---

Sitemap data as an input to your content strategy

Here's the signal most content teams miss: the gap in your crawl data is your content strategy's report card.

If 40% of your blog posts sit in "crawled, not indexed" for more than 90 days, Google is telling you the posts don't meet the bar for standalone indexation. That's not a technical problem — it's a content quality and differentiation problem. No amount of resubmission fixes it.

Build a monthly ritual:

Export the "Crawled, not indexed" list from your GSC Coverage report
Filter for pages that have been in that status for 60+ days
Group them by content type or topic cluster
For each group, decide: update to meet the bar, consolidate into a stronger piece, or noindex and remove from sitemap
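
The filter-and-group steps above can be scripted against the export. Two assumptions are baked into this sketch: that the CSV has `URL` and `Last crawled` columns with ISO dates, and that the last crawl date is an acceptable proxy for how long a page has sat in the status, since the export may not state that directly. Adjust the column names to match your file.

```python
import csv
import io
from collections import defaultdict
from datetime import date
from urllib.parse import urlsplit


def stale_pages_by_section(csv_text: str, today: date, min_days: int = 60) -> dict[str, list[str]]:
    """Group URLs last crawled at least min_days ago by their first path segment."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        last_crawled = date.fromisoformat(row["Last crawled"])
        if (today - last_crawled).days >= min_days:
            section = urlsplit(row["URL"]).path.strip("/").split("/")[0] or "(root)"
            groups[section].append(row["URL"])
    return dict(groups)


# Hypothetical export contents.
export = (
    "URL,Last crawled\n"
    "https://yoursite.com/blog/thin-post,2026-03-01\n"
    "https://yoursite.com/blog/new-post,2026-06-10\n"
    "https://yoursite.com/locations/springfield,2026-02-14\n"
)
groups = stale_pages_by_section(export, today=date(2026, 6, 18))
for section, pages in groups.items():
    print(section, len(pages))
```

Each section in the output is one decision to make: update, consolidate, or noindex and remove from the sitemap.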

This is what turns sitemap analytics from passive monitoring into an active editorial process. The pages you noindex free up crawl budget for the ones that deserve attention.

Sitemap health and your chatbot knowledge base

If you're running an AI chatbot trained on your website's content — the way Alee ingests your sitemap to build a knowledge brain — your sitemap health is doubly important. Every URL Googlebot can't index is often also a URL your chatbot's ingestion pipeline struggles to process or retrieve reliably.

A sitemap with broken URLs, redirects pointing to dead pages, or wildly outdated <lastmod> dates creates an incomplete knowledge base. Visitors ask questions that should be answerable — your site has the content — but the chatbot either can't find it or surfaces a stale version from a page that's since been updated.

The fix is the same in both cases: clean sitemaps, accurate <lastmod> tags, and a segmented sitemap structure so both Google and your chatbot pipeline know which content is fresh and authoritative. If you're already keeping your sitemap clean for SEO, your chatbot knowledge base stays current with no extra work.

---

Common mistakes that undercut your sitemap data

Submitting and never revisiting. Crawl data is not a one-time task. Indexed counts change as content gets updated, consolidated, or removed. Check GSC monthly at minimum.

Obsessing over 100% indexation. Not every page should be indexed. Faceted navigation pages, internal search results, thin tag pages — these belong behind a noindex or excluded from the sitemap, not force-submitted until Google relents.

Ignoring small submission errors. A sitemap URL that returns a 301 redirect rather than 200, or a file with one malformed <loc> entry, won't break everything, but GSC will flag it and the affected entries may be skipped. Fix errors promptly.

Leaning on `<priority>` and `<changefreq>`. Google ignores both values, so setting every page to <priority>1.0, or carefully tiering them, changes nothing about how Googlebot crawls. Your effort is better spent on accurate <lastmod> dates and strong internal links.

Not excluding recently noindexed pages. If you add a noindex meta tag to a page, remove it from the sitemap in the same deploy. A URL in your sitemap that carries a noindex tag creates a contradictory signal and wastes crawl budget.

Treating sitemap data as separate from page analytics. The best programs overlay sitemap status with organic traffic data. A page that drops from the index shows up in both your GSC sitemap report and your traffic analytics — connecting those two views is where real diagnosis happens.
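
The noindex conflict described above is easy to catch before a deploy. The sketch below checks page HTML for a robots meta tag using only the standard library; it assumes you already have the HTML for each sitemap URL and it does not look at the `X-Robots-Tag` HTTP header, which can carry the same directive. `RobotsMetaParser` and `has_noindex` are hypothetical names.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a robots or googlebot meta tag asks for noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            directives = {d.strip().lower() for d in attr.get("content", "").split(",")}
            # "none" is shorthand for "noindex, nofollow".
            if "noindex" in directives or "none" in directives:
                self.noindex = True


def has_noindex(html: str) -> bool:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


# Hypothetical sitemap URLs mapped to their fetched HTML.
pages = {
    "https://yoursite.com/blog/post-a": '<html><head><title>A</title></head></html>',
    "https://yoursite.com/tag/misc": '<html><head><meta name="robots" content="noindex, follow"></head></html>',
}
conflicts = [url for url, html in pages.items() if has_noindex(html)]
print(conflicts)  # ['https://yoursite.com/tag/misc']
```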

---

A practical monthly sitemap analytics routine

Here's a repeatable checklist that takes about 30 minutes:

GSC Sitemaps report: for each sitemap file, compare submitted vs. indexed. Flag files where indexed < 70% of submitted (mature sites) or < 50% (growing sites).
GSC Coverage report: filter by "All submitted pages". Note the "Crawled - currently not indexed" total vs. last month.
Error triage: open the URLs listed under error reasons such as server errors, redirect errors, and 404s. Fix promptly.
Resubmit changed sitemaps: if you've added significant content or changed structure, resubmit the affected sitemap in GSC.
Content review: export the 20 oldest pages in "Crawled - currently not indexed". Make a consolidation or improvement decision for each.
Log file spot check (sites > 5k pages): confirm Googlebot is hitting priority pages at least weekly.

This routine surfaces more real business problems than a typical monthly analytics review. Bookmark the tutorials section for deeper dives into specific tools within this workflow.

---

Frequently asked questions

How long does it take for a submitted sitemap URL to get indexed?

There's no guaranteed timeline. Google typically crawls new sitemaps within a few days, but moving a URL from "discovered" to "indexed" can take hours (high-authority sites) to several weeks (new sites or pages with few internal links). Check the GSC Coverage report at the two-week and six-week marks; if a page is still "Discovered - currently not indexed" after six weeks, investigate crawl budget or internal link depth.

Does submitting a sitemap guarantee indexation?

No. A sitemap submission is a request, not a command. Google decides whether to index a URL based on its own quality assessments — content uniqueness, authority, and duplication relative to existing indexed pages. Tracking indexing status helps you understand which URLs Google is declining and why, so you can improve them rather than resubmitting and hoping for a different result.

Should I submit my sitemap to Bing as well?

Yes. Bing Webmaster Tools has its own sitemap report — submission status, indexed count, crawl status — that runs independently of Google's data. Bing's market share is smaller but non-trivial, especially on desktop. The submission process is identical to GSC: add your property, verify ownership, submit the sitemap URL.

How many URLs can a single sitemap file contain?

The XML Sitemap protocol caps a single file at 50,000 URLs and 50 MB uncompressed. In practice, stay well under that — large files take longer to parse and can delay GSC's "last read" timestamp. If you're approaching 10,000 URLs in one file, split into multiple files under a sitemap index. It also improves crawl data granularity in GSC.
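
Splitting a long URL list into files that respect the per-file cap is a short slicing exercise. This sketch enforces only the URL count; it does not check the 50 MB size limit, and `split_into_files` is a hypothetical helper.

```python
def split_into_files(urls: list[str], max_urls: int = 50_000) -> list[list[str]]:
    """Split a URL list into consecutive chunks of at most max_urls entries."""
    if max_urls < 1:
        raise ValueError("max_urls must be at least 1")
    return [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]


# Hypothetical catalog of 120,000 product URLs.
urls = [f"https://yoursite.com/products/item-{n}" for n in range(120_000)]
chunks = split_into_files(urls)
print([len(chunk) for chunk in chunks])  # [50000, 50000, 20000]
```

Pass a smaller `max_urls`, such as 10,000, to follow the more conservative split suggested above.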

My sitemap shows 500 URLs submitted but only 280 indexed — what's the most likely cause?

The most common causes in order: (1) thin or near-duplicate content Google doesn't consider worthy of standalone indexation; (2) crawl budget exhaustion from too many low-value URLs; (3) internal linking gaps — important pages buried three-plus clicks deep get crawled infrequently; (4) page speed issues causing Googlebot timeouts; (5) canonical tag conflicts pointing elsewhere. Review five to ten non-indexed URLs qualitatively before assuming a technical cause — in most cases the content is the answer.

---

Ready to close the loop between your sitemap and what visitors actually find on your site? [Start free](/signup) with Alee — it reads your sitemap, embeds every page into a searchable knowledge brain, and answers visitor questions instantly, grounded only in your content.

Want to go deeper? The features overview shows how sitemap-synced knowledge bases work in practice, and our pricing page covers plans from a free trial to agency-scale deployments. You can also browse more guides on crawl strategy, indexation, and content quality — or see how Alee compares to similar tools on the Alee vs SiteGPT breakdown.
