Guides · 13 min read

Sitemap URL Counter: The Complete Practical Guide

Learn how to use a sitemap url counter to audit your site's indexed pages, find crawl bloat, and optimize your XML sitemap for better rankings.

Every SEO audit eventually runs into the same uncomfortable question: how many URLs are actually in that sitemap? Not "about how many" — the exact count, broken down by sitemap file if you're using a sitemap index, flagged if it's over the 50,000-URL limit Google enforces, and cross-referenced against what's actually indexed. A sitemap url counter sounds like a trivial tool. In practice, it's the first diagnostic you reach for when organic traffic drops unexpectedly, when a crawl budget audit reveals wasted Googlebot calls, or when you've just migrated a large site and need to confirm every important URL survived.

Key takeaways

  • Google and Bing cap individual sitemap files at 50,000 URLs and 50 MB uncompressed. A file that exceeds either limit is rejected or only partially processed, and the only warning is an error in the webmaster tools report.
  • Counting sitemap URLs lets you verify limits before submission, catch bloat from paginated or filtered URLs, and prioritize crawl budget toward high-value pages.
  • Sitemap index files can reference up to 50,000 child sitemaps, each capped at 50,000 URLs, so the theoretical ceiling per sitemap index is 2.5 billion submitted URLs (50,000 × 50,000). Realistic crawl-budget thinking puts the practical limit far lower.
  • Free tools (XML Sitemaps, Screaming Frog free tier, Google Search Console) cover most needs; paid tools are worth it for 100k+ URL sites with multiple sitemap indexes.
  • High URL counts are often a symptom — the root cause is usually noindex pages, paginated filters, or tag archives leaking into the sitemap generator.
  • If your site uses content from multiple sources (PDFs, knowledge bases, product catalogs), Alee can help you understand which content is actually findable and answerable — useful context when deciding what belongs in a sitemap.

---

Why counting sitemap URLs matters more than you think

When a site has 200 pages, counting sitemap URLs is a ten-second sanity check. When a site has 80,000 product pages, 12,000 tag archives, and a sitemap index pointing to seven child sitemaps — counting becomes a genuine audit discipline.

Three situations make this non-negotiable:

1. You're approaching the 50,000-URL hard limit. The sitemap protocol (sitemaps.org), which Google follows, states that a single sitemap file must contain no more than 50,000 URLs and be no larger than 50 MB uncompressed. Submit a sitemap with 50,001 URLs and Google may reject the entire file. Search Console will show an error, but nothing alerts you proactively, so you might not notice for days. Counting before you submit catches this.

2. You've inherited a site with unknown structure. Acquisition targets, replatforming projects, and agency handoffs all share the same problem: nobody documented what's in the sitemap. A quick count — especially broken down by sitemap type (products, categories, blog posts) — tells you more about a site's content inventory than a 30-minute stakeholder call.

3. Crawl budget is being wasted. Google crawls a finite number of pages per day on any given site. If your sitemap is padding that budget with low-value URLs — empty category pages, duplicate filtered facets, thin tag archives — you're trading Googlebot's attention away from your money pages. Counting and categorizing sitemap URLs is the first step in reclaiming that budget.

---

How sitemap URL counting works: the mechanics

An XML sitemap is just a text file. Each <url> element in a standard sitemap file represents one URL. In a sitemap index file, each <sitemap> element points to a child sitemap file. Counting URLs means summing the <url> elements across all child files — not just the sitemap index itself.

This distinction trips up a lot of people. If you point a counter at a sitemap index and it tells you "8 URLs," it's counting the eight child sitemap references, not the potentially 120,000 URLs those child files contain. A good sitemap url counter follows all child links recursively and reports both the per-file count and the aggregate total.

What a typical XML sitemap looks like

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/about/</loc>
<lastmod>2026-05-10</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://example.com/services/</loc>
<lastmod>2026-06-01</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
</urlset>
```

Counting the <url> nodes in this file gives you 2. Do that recursively across all files referenced in your sitemap index, and you have the true total.
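As a minimal sketch of that count, the snippet below parses a trimmed copy of the sample above with Python's standard library. The helper name count_url_nodes and the SAMPLE constant are illustrative, not part of any tool mentioned in this guide.

```python
from xml.etree import ElementTree as ET

# Namespace every sitemap element lives in (from the xmlns attribute above).
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Trimmed copy of the sample sitemap, kept inline so no network is needed.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/about/</loc></url>
  <url><loc>https://example.com/services/</loc></url>
</urlset>"""


def count_url_nodes(xml_text):
    """Count the <url> elements that sit directly under the root element."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return len(root.findall(f"{{{NS}}}url"))


print(count_url_nodes(SAMPLE))  # 2
```

Parsing the XML rather than matching text means the count stays correct even when the file is minified onto a single line.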

Sitemap index vs. standard sitemap

| Feature | Standard Sitemap | Sitemap Index |
|---|---|---|
| Contains | <url> elements (actual pages) | <sitemap> elements (pointers to child files) |
| Max entries | 50,000 URLs | 50,000 child sitemaps |
| Max file size | 50 MB uncompressed | 50 MB uncompressed |
| Counting method | Count <url> nodes directly | Follow each <sitemap> → count child <url> nodes |
| Typical use case | Sites under 50k pages | Large sites, multi-type sitemaps |

---

Tools for counting URLs in a sitemap

You don't need to write a script every time. Several free and paid tools handle this well, with different trade-offs.

Free tools

Google Search Console (Sitemaps + Page indexing reports)
The Sitemaps section in GSC shows how many URLs Google discovered in each submitted sitemap, and the Page indexing report, filtered to that sitemap, shows how many of them are indexed. That gap is often the most important data point in your audit. It's the most authoritative source because it reflects Google's own view of your sitemap.

Limitation: it only shows what you've submitted. For unsubmitted sitemaps or competitor research, GSC won't help.

XML Sitemaps Validator (xmlsitemaps.com)
Paste a sitemap URL and this free tool fetches it, validates the XML, and shows a URL count. It follows child references in sitemap index files. The free tier handles most audits under 500 URLs; larger sitemaps may time out.

Screaming Frog SEO Spider (free tier)
The free version crawls up to 500 URLs, which isn't enough to count a large sitemap's contents. But it's excellent for small-to-medium sites, and the Sitemaps tab will tell you exactly how many URLs it found in each file. The paid version removes the crawl limit and adds bulk export.

curl + grep (terminal, any OS)
The fastest method on a standard sitemap:

```bash
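# grep -c counts matching lines, so a minified one-line sitemap would report 1; count every match instead
curl -s https://example.com/sitemap.xml | grep -o "<url>" | wc -l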
```

For a sitemap index, fetch each child file and sum the counts. See our tutorials for a ready-made shell script.

Python (for large or complex sitemaps)
Python's requests and xml.etree.ElementTree libraries handle recursive sitemap counting cleanly:

```python
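import requests
from xml.etree import ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def count_sitemap(url, visited=None):
    """Return the number of <url> entries reachable from a sitemap or sitemap index."""
    if visited is None:
        visited = set()
    # Guard against index files that reference each other (or themselves).
    if url in visited:
        return 0
    visited.add(url)

    r = requests.get(url, timeout=30)
    r.raise_for_status()  # fail loudly on a 404 instead of parsing an error page
    root = ET.fromstring(r.content)

    # A sitemap index lists child files under <sitemap><loc>; recurse into each.
    sitemaps = root.findall(f"{{{NS}}}sitemap/{{{NS}}}loc")
    if sitemaps:
        return sum(count_sitemap(s.text.strip(), visited) for s in sitemaps)

    # Otherwise this is a standard sitemap: count its <url> nodes.
    return len(root.findall(f"{{{NS}}}url"))


total = count_sitemap("https://example.com/sitemap_index.xml")
print(f"Total URLs: {total}")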
```

This recursively follows sitemap index files and returns the true total.

Paid tools worth knowing

Screaming Frog (paid): Best for full site audits where you want URL count plus broken links, redirect chains, and canonical mismatches all in one pass.

Sitebulb: Strong sitemap audit reporting with visual breakdowns by URL type. Better UX for presenting findings to clients.

Ahrefs Site Audit / Semrush Site Audit: If you're already paying for either platform, their crawlers will report sitemap URL counts alongside broader technical SEO metrics. Convenient but not their core strength for this task.

---

Step-by-step: running a sitemap url counter audit

Here's how to run a proper audit, in order.

Step 1: Find your sitemap

If you don't already know the URL, try these standard locations first:

  • https://yourdomain.com/sitemap.xml
  • https://yourdomain.com/sitemap_index.xml (WordPress with Yoast SEO or Rank Math)
  • https://yourdomain.com/wp-sitemap.xml (WordPress core's built-in sitemap, version 5.5 and later)

If those return 404s, check robots.txt (https://yourdomain.com/robots.txt) — the Sitemap: directive should point to the file.
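A rough sketch of that robots.txt lookup, assuming you have already fetched the file body as text. The function name sitemaps_from_robots and the sample robots.txt content are made up for illustration.

```python
def sitemaps_from_robots(robots_txt):
    """Return the URLs listed on Sitemap: lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Drop trailing comments. This simple sketch would also cut a URL
        # containing "#", which sitemap URLs normally do not.
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, because the URL itself contains colons.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


# Hypothetical robots.txt body, used only to exercise the function.
example = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap_index.xml
sitemap: https://example.com/news-sitemap.xml  # news
"""

print(sitemaps_from_robots(example))
```

The directive name is matched case-insensitively, and a file can list several sitemaps, so the function returns a list rather than a single URL.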

Step 2: Identify whether it's a standard sitemap or an index

Open the file in a browser or fetch it with curl. If the root element is <urlset>, it's a standard sitemap file. If it's <sitemapindex>, it's an index pointing to child files. Knowing this tells you whether to count <url> nodes directly or to recurse.
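The same check can be scripted. This is a small sketch using the standard library; sitemap_kind and its return labels ("standard", "index", "unknown") are names chosen here for illustration.

```python
from xml.etree import ElementTree as ET


def sitemap_kind(xml_bytes):
    """Classify a sitemap document by its root element."""
    root = ET.fromstring(xml_bytes)
    # ElementTree reports namespaced tags as "{namespace}name"; keep the name.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name == "urlset":
        return "standard"
    if local_name == "sitemapindex":
        return "index"
    return "unknown"


standard = b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"></urlset>'
index = b'<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"></sitemapindex>'

print(sitemap_kind(standard))  # standard
print(sitemap_kind(index))     # index
```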

Step 3: Count and break down by type

Don't stop at the total count. If you're using a sitemap index with separate files for products, categories, posts, and pages, count each child file separately. A breakdown like this is far more useful:

| Sitemap file | URL count | % of total |
|---|---|---|
| sitemap-products.xml | 42,000 | 71.1% |
| sitemap-categories.xml | 3,200 | 5.4% |
| sitemap-posts.xml | 12,800 | 21.7% |
| sitemap-pages.xml | 1,100 | 1.9% |
| Total | 59,100 | 100% |

No single file exceeds the 50,000-URL limit yet, but the product sitemap is at 42,000 and should be split before the catalog grows past it. The category count looks high, so it's worth investigating whether those category pages earn traffic.
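A breakdown like the table above can be produced from per-file counts with a few lines of Python. This sketch takes the table's figures as input; the function name breakdown and the 80% "near limit" warning threshold are assumptions made for illustration, not a rule from Google or Bing.

```python
URL_LIMIT = 50_000  # per-file cap from the sitemap protocol


def breakdown(counts, warn_ratio=0.8):
    """Return (total, rows), each row being (file, count, percent, status)."""
    total = sum(counts.values())
    rows = []
    for name, n in counts.items():
        share = round(100 * n / total, 1) if total else 0.0
        if n > URL_LIMIT:
            status = "over limit"
        elif n >= warn_ratio * URL_LIMIT:  # warn_ratio is an arbitrary choice
            status = "near limit"
        else:
            status = "ok"
        rows.append((name, n, share, status))
    return total, rows


counts = {
    "sitemap-products.xml": 42_000,
    "sitemap-categories.xml": 3_200,
    "sitemap-posts.xml": 12_800,
    "sitemap-pages.xml": 1_100,
}

total, rows = breakdown(counts)
for name, n, share, status in rows:
    print(f"{name:26} {n:>7,} {share:>5}% {status}")
print(f"{'Total':26} {total:>7,}")
```

With these inputs the product file is reported as "near limit" (42,000 is above 80% of 50,000) and the other three as "ok".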

Step 4: Cross-reference against indexed URLs

Pull the submitted vs. indexed counts from Google Search Console. If you're submitting 59,100 URLs but Google has indexed 18,000, something is blocking indexation at scale. Common causes:

  • Pages blocked by robots.txt accidentally included in the sitemap
  • Noindexed pages included in the sitemap (a technical SEO anti-pattern)
  • Thin or duplicate content Google has chosen to exclude
  • Crawl budget exhaustion on large sites

The indexed-to-submitted ratio is a health metric. Anything below 50% on a site older than six months warrants investigation.
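Worked through with the numbers above, as a small sketch (index_ratio is an illustrative helper, and the 0.5 threshold is the 50% rule of thumb from this section):

```python
def index_ratio(submitted, indexed):
    """Share of submitted URLs that are indexed, as a fraction from 0 to 1."""
    if submitted <= 0:
        raise ValueError("submitted must be a positive number")
    return indexed / submitted


ratio = index_ratio(submitted=59_100, indexed=18_000)
needs_investigation = ratio < 0.5

print(f"{ratio:.1%}")        # 30.5%
print(needs_investigation)   # True
```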

Step 5: Check for URL bloat

High counts aren't always good. Common bloat sources to look for:

  • Faceted navigation: filter URLs like /shoes?color=red&size=9 that add thousands of low-value combinations
  • Pagination: /blog/page/2, /blog/page/3 — rarely worth including
  • Tag and category archives: WordPress sites in particular can generate thousands of thin archive pages
  • Session IDs or tracking parameters left in canonical URLs
  • Staging or development URLs accidentally published to production sitemaps
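The bloat sources above can be approximated with simple URL pattern checks. The sketch below is a heuristic only: the regular expressions, the tracking parameter names, and the staging hostname prefixes are all assumptions that would need adjusting to a real site's URL scheme.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# All patterns below are illustrative guesses, not a standard.
PAGINATION = re.compile(r"/page/\d+/?$")
ARCHIVE = re.compile(r"/(tag|category)/")
TRACKING_KEYS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "sid"}
STAGING_PREFIXES = ("staging.", "dev.", "localhost")


def classify(url):
    """Label a sitemap URL with the first bloat category it matches."""
    parts = urlsplit(url)
    if parts.netloc.lower().startswith(STAGING_PREFIXES):
        return "staging"
    query_keys = {k.lower() for k in parse_qs(parts.query, keep_blank_values=True)}
    if query_keys & TRACKING_KEYS:
        return "tracking_or_session"
    if query_keys:
        return "faceted"  # any other query string is treated as a filter URL
    if PAGINATION.search(parts.path):
        return "pagination"
    if ARCHIVE.search(parts.path):
        return "archive"
    return "ok"


urls = [
    "https://example.com/shoes?color=red&size=9",
    "https://example.com/blog/page/2",
    "https://example.com/tag/seo/",
    "https://example.com/about/?utm_source=newsletter",
    "https://staging.example.com/about/",
    "https://example.com/about/",
]

print(Counter(classify(u) for u in urls))
```

Treat the "archive" label as a prompt to review rather than a verdict, since category pages are often legitimate sitemap entries.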

---

The 50,000-URL limit: what actually happens when you exceed it

Google won't process more than 50,000 URLs from a single sitemap file and won't process a file larger than 50 MB uncompressed. What you see when you exceed either limit varies:

  • Google Search Console will report an error in the Sitemaps section.
  • Googlebot may still attempt to parse the file but will stop at the limit.
  • URLs beyond position 50,000 in the file simply won't be submitted via this channel — though they may still be crawled through internal links.

The fix is straightforward: split the oversized sitemap into multiple child files and reference them from a sitemap index. Most sitemap generators handle this automatically once you configure the split threshold.
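If your generator doesn't split automatically, the split itself is simple to sketch. The example below chunks a URL list at 50,000 entries and renders minimal child sitemaps plus an index. The function names and the child file naming scheme (sitemap-products-N.xml) are hypothetical, and the sketch does not check the 50 MB size limit.

```python
from xml.sax.saxutils import escape

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
URL_LIMIT = 50_000
XML_HEADER = '<?xml version="1.0" encoding="UTF-8"?>'


def chunk(urls, size=URL_LIMIT):
    """Split a list of URLs into consecutive groups of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def render_urlset(urls):
    """Render a minimal standard sitemap containing only <loc> entries."""
    entries = "".join(f"<url><loc>{escape(u)}</loc></url>" for u in urls)
    return f'{XML_HEADER}<urlset xmlns="{NS}">{entries}</urlset>'


def render_index(child_urls):
    """Render a sitemap index pointing at the given child sitemap URLs."""
    entries = "".join(f"<sitemap><loc>{escape(u)}</loc></sitemap>" for u in child_urls)
    return f'{XML_HEADER}<sitemapindex xmlns="{NS}">{entries}</sitemapindex>'


# Hypothetical catalog of 120,000 product URLs.
urls = [f"https://example.com/p/{i}" for i in range(120_000)]
chunks = chunk(urls)
children = [f"https://example.com/sitemap-products-{n}.xml" for n in range(1, len(chunks) + 1)]
index_xml = render_index(children)

print([len(c) for c in chunks])  # [50000, 50000, 20000]
```

escape() matters here because URLs with query strings contain "&", which must be written as "&amp;" inside XML.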

Bing Webmaster Tools enforces the same limits and has the same behavior at the margin.

---

Sitemap URL counts by platform: what to expect

Different CMS and e-commerce platforms generate very different default URL counts.

WordPress (with Yoast SEO or Rank Math)
Generates a sitemap index automatically. Posts, pages, categories, tags, authors, and custom post types each get their own child file. A typical 500-post blog might generate 2,000–4,000 sitemap URLs once you include all archive types. Rank Math lets you disable specific post types from the sitemap in one click — useful for keeping the count lean.

Shopify
Shopify generates a sitemap index at /sitemap.xml with child files for products, collections, blogs, and pages. A store with 10,000 products and 500 collections might submit 12,000–15,000 URLs total. Shopify automatically handles the 50,000-URL split for large catalogs.

Wix
Wix generates sitemaps automatically and updates them dynamically. You have limited control over what's included. Larger sites sometimes include dynamic page variants that inflate the count unnecessarily.

Next.js / Headless sites
Sitemaps are generated programmatically. Libraries like next-sitemap handle this well but need explicit configuration to exclude draft pages, API routes, and admin paths. Misconfiguration goes unnoticed until you run a count audit.

Magento / large e-commerce
Large catalogs with configurable products can generate hundreds of thousands of sitemap entries. Submit the canonical product page, not every attribute variation.

---

Common mistakes when running a sitemap url counter

Counting the index file instead of the child files. A sitemap index with 8 child sitemaps will show "8" if you count its <sitemap> elements. That's not 8 URLs — it's 8 pointers to files that may contain 100,000 URLs combined. Always recurse into child files.

Treating a high count as a success. More URLs in your sitemap is not better. The right number is "all important, indexable, canonical pages and no others." Padding the sitemap with low-value URLs tells Google nothing useful and potentially dilutes crawl attention.

Ignoring the submitted vs. indexed gap. Knowing what you're submitting doesn't tell you what Google is accepting. Always pair the count audit with a Search Console check.

Not checking after CMS updates or plugin changes. Sitemap generators can silently change behavior after updates. Build a periodic check into your technical SEO calendar.

Submitting duplicate URLs with and without trailing slashes. Many sitemap generators include both https://example.com/page/ and https://example.com/page as separate entries. Fix this at the server level with a 301 redirect and ensure the generator uses one form consistently.
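A quick way to spot the trailing-slash problem is to group URLs by their slash-stripped form. This is a minimal sketch; trailing_slash_duplicates is an illustrative name, and it ignores query strings and case differences.

```python
from collections import defaultdict


def trailing_slash_duplicates(urls):
    """Map each slash-stripped URL to its variants when more than one appears."""
    groups = defaultdict(set)
    for u in urls:
        groups[u.rstrip("/")].add(u)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


urls = [
    "https://example.com/page/",
    "https://example.com/page",
    "https://example.com/about/",
]

print(trailing_slash_duplicates(urls))
```

Here only the /page pair is reported, because /about/ appears in a single form.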

Compare how different platforms handle these edge cases on our compare page or explore platform breakdowns in our resources section.

---

When to use a sitemap url counter for competitor research

You can run a count against any publicly accessible sitemap — including competitors'. It's standard SEO research. Here's what the data tells you:

  • Content scale: 80,000 blog URLs signals heavy content investment — useful to know before planning your own content calendar.
  • Content types: The child sitemap breakdown (products vs. posts vs. pages) reveals their publishing strategy.
  • Update frequency: <lastmod> dates show how actively they publish. Stale dates suggest a site in maintenance mode.
  • Priorities: URLs in the sitemap are ones the site owner considers worth indexing. Dedicated child files often mark their highest-priority content types.
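To read the update-frequency signal from the list above, you can summarize the <lastmod> values in a fetched sitemap. The sketch below is one possible approach; lastmod_summary is an illustrative name, and it only keeps the date part of each value, skipping anything that is not a full YYYY-MM-DD date.

```python
from datetime import date
from xml.etree import ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def lastmod_summary(xml_bytes):
    """Summarize <lastmod> dates in a sitemap, or return None if none parse."""
    root = ET.fromstring(xml_bytes)
    days = []
    for node in root.iter(f"{{{NS}}}lastmod"):
        text = (node.text or "").strip()
        try:
            # Keep the date part of values like 2026-05-10T12:00:00+00:00.
            days.append(date.fromisoformat(text[:10]))
        except ValueError:
            continue  # year-only or otherwise partial values are skipped
    if not days:
        return None
    return {
        "count": len(days),
        "oldest": min(days),
        "newest": max(days),
        "distinct": len(set(days)),
    }


sample = b"""<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/about/</loc><lastmod>2026-05-10</lastmod></url>
  <url><loc>https://example.com/services/</loc><lastmod>2026-06-01</lastmod></url>
</urlset>"""

print(lastmod_summary(sample))
```

A "distinct" value of 1 across thousands of URLs usually points to a generator stamping every entry with the build date rather than to a real publishing pattern.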

Find a competitor's sitemap via robots.txt — most CMS platforms list it there. See the resources section for a full breakdown of competitor analysis techniques.

---

Sitemap URL counts and AI chatbot content audits

When you're building or auditing a knowledge base for an AI chatbot — training it on your website's content — the sitemap is your most reliable inventory of what's published and indexable.

Tools like Alee can ingest your website content by URL, sitemap, or direct upload. If you're feeding a sitemap-based crawl into a chatbot's knowledge base, the URL count matters: too many thin or duplicate pages degrade the quality of retrieved answers. A chatbot trained on 8,000 URLs where 5,000 are near-duplicate category pages will produce vague, unfocused answers.

Running a count audit before connecting your site to a chatbot platform is the same discipline as running it before an SEO audit — it tells you what's actually there, not what you think is there. Cleaning up the sitemap (removing noindex pages, eliminating thin archives, fixing duplicates) before training a knowledge base improves both SEO and chatbot answer quality at once.

If you're managing multiple client sites, Alee's Agency plan lets you run separate knowledge bases per client, each trained only on that client's clean, relevant content. A clean sitemap is a useful prerequisite to that onboarding step.

---

Automating sitemap URL count checks

For sites that publish frequently or run large e-commerce catalogs, a quarterly manual check isn't enough. Three options:

Cron job with Python: Adapt the counting script above to run weekly. Log totals per file and alert when the count shifts by more than ±5%.

Google Search Console API: Exposes submitted and indexed URL counts programmatically. Pull it daily and chart the trend.

Third-party monitoring (ContentKing, Conductor): Tracks sitemap changes in real time and sends alerts for URL additions, removals, and count shifts. Higher cost, lower maintenance than rolling your own.

For most sites, a monthly manual check plus a lightweight cron log is sufficient. Browse more automation approaches in our tutorials.
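For the cron option, the alerting logic is the only part that needs writing beyond the counting itself. A minimal sketch, assuming you store the previous run's per-file counts somewhere (the dictionaries below stand in for that log, and the function names are illustrative):

```python
def count_shift(previous, current):
    """Relative change between two counts, e.g. 0.07 for a 7% increase."""
    if previous == 0:
        return float("inf") if current else 0.0
    return (current - previous) / previous


def files_to_alert(previous_counts, current_counts, threshold=0.05):
    """Return {file: shift} for files whose count moved more than the threshold."""
    alerts = {}
    for name in sorted(set(previous_counts) | set(current_counts)):
        shift = count_shift(previous_counts.get(name, 0), current_counts.get(name, 0))
        if abs(shift) > threshold:
            alerts[name] = shift
    return alerts


# Hypothetical counts from last week's and this week's runs.
last_week = {"sitemap-products.xml": 42_000, "sitemap-posts.xml": 12_800}
this_week = {"sitemap-products.xml": 45_000, "sitemap-posts.xml": 12_900}

for name, shift in files_to_alert(last_week, this_week).items():
    print(f"{name}: {shift:+.1%}")
```

A file that appears or disappears between runs is always flagged, because its shift is treated as infinite or as -100%.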

---

Checklist: sitemap URL audit

Use this before any major site change or monthly during active publishing periods:

  • [ ] Locate the sitemap at /sitemap.xml or via robots.txt
  • [ ] Determine: standard sitemap or sitemap index?
  • [ ] Count total URLs across all child files (not just index pointers)
  • [ ] Break down count by sitemap type (posts, products, categories, pages)
  • [ ] Check no single file exceeds 50,000 URLs
  • [ ] Compare submitted count to indexed count in Google Search Console
  • [ ] Investigate indexed-to-submitted ratio if below 50%
  • [ ] Scan for paginated, filtered, or archive URLs that shouldn't be included
  • [ ] Check <lastmod> values are accurate and not all identical (a sign of a mis-configured generator)
  • [ ] Confirm all submitted URLs are canonical (no noindex pages, no redirect URLs)
  • [ ] Rerun after any CMS update or sitemap plugin change

---

Frequently asked questions

How do I count URLs in an XML sitemap quickly?

The fastest method for a single sitemap file is curl -s https://example.com/sitemap.xml | grep -o "<url>" | wc -l in a terminal. Using grep -o counts every match, so the result is correct even when the XML is minified onto one line. For a sitemap index with multiple child files, you need to recurse into each child and sum the counts. Use the Python script in the "Tools" section above, or a dedicated tool like Screaming Frog. Google Search Console also displays the submitted URL count for any sitemap you've submitted, which is accurate and authoritative for your own site.

What is the maximum number of URLs allowed in a sitemap?

Google and Bing both enforce a limit of 50,000 URLs per sitemap file and a maximum uncompressed file size of 50 MB. If you exceed either limit, split the sitemap into multiple files and reference them from a sitemap index file. The sitemap index itself can reference up to 50,000 child sitemaps, but in practice you'd rarely have more than a dozen.

Why does my sitemap URL count differ from what's indexed in Google Search Console?

This is common and has several causes: pages blocked by robots.txt that are included in the sitemap, pages marked noindex that the sitemap generator is including, thin or duplicate content Google has decided not to index, or crawl budget constraints on large sites. The gap between submitted and indexed URLs is one of the most diagnostic numbers in technical SEO.

Should I include all my site's URLs in the sitemap?

No. Include only canonical, indexable, non-redirect URLs that have meaningful content and that you want Google to discover and rank. Exclude: paginated archive pages beyond page 1, filtered facet URLs, noindex pages, redirect URLs, admin or login pages, and any URL that isn't the canonical version of its content. A smaller, cleaner sitemap generally performs better than a padded one.

How often should I audit my sitemap URL count?

For actively publishing sites (daily or weekly posts), audit monthly. For stable brochure sites, quarterly is usually enough. Always audit after a CMS migration, a major plugin update, or any infrastructure change that could affect URL structure. If you're preparing to submit a site to Google Search Console for the first time — or resubmitting after a migration — audit the count before submission.

---

Ready to put your content to work beyond SEO? Alee turns your cleanest, highest-value pages into an AI chatbot that answers visitor questions with pinpoint accuracy — no hallucinations, just answers grounded in your actual content. It's a natural next step once you've audited what's actually worth indexing.
