How to Crawl Sitemaps Efficiently with Python

SwiftProxy
By Emily Chan
2025-09-05 15:18:39

Websites can contain tens of thousands or even hundreds of thousands of pages. Crawling them manually is a nightmare. Sitemaps offer a smarter and faster solution because they act as a website's blueprint, showing exactly which pages exist.

Sitemaps can save you hours or even days of scraping work. Instead of jumping from link to link, you can collect every URL in an organized way. However, there is a complication. Many websites use index sitemaps that reference other sitemaps. Parsing these manually is both tedious and prone to errors.

Enter ultimate-sitemap-parser (usp). This Python library takes the hassle out of sitemap crawling. Let's walk through how to use usp to crawl the ASOS sitemap and extract every available URL in minutes.

Foundational Requirements

Before diving in, make sure you have the basics in place:

1. Install Python

You'll need Python installed. If you don't have it yet:

Download and install the latest version from python.org.

Verify the installation:

python3 --version

2. Install ultimate-sitemap-parser

Next, grab the usp library:

pip install ultimate-sitemap-parser
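
To confirm the install worked, you can run a quick import check. It uses the same entry point the rest of this guide relies on:

python3 -c "from usp.tree import sitemap_tree_for_homepage; print('usp is ready')"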

How to Crawl Sitemaps with ultimate-sitemap-parser

With usp installed, you're ready to extract URLs from ASOS—or any site. Here's how.

1. Grab the Sitemap and Extract URLs

Parsing XML manually is a pain. With usp, it's just a few lines:

from usp.tree import sitemap_tree_for_homepage

url = "https://www.asos.com/"
tree = sitemap_tree_for_homepage(url)

for page in tree.all_pages():
    print(page.url)

Boom. That's it. All URLs, fetched and ready to use.
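
Each page object carries more than just the URL. Assuming your version of usp exposes them, fields such as last_modified and priority (taken from the sitemap's <lastmod> and <priority> tags) are handy for prioritizing fresh pages. A minimal sketch:

from usp.tree import sitemap_tree_for_homepage

url = "https://www.asos.com/"
tree = sitemap_tree_for_homepage(url)

for page in tree.all_pages():
    # last_modified and priority may be None when the sitemap omits those tags
    print(page.url, page.last_modified, page.priority)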

2. Manage Nested Sitemaps Automatically

Many sites, including ASOS, split their sitemap into separate child sitemaps for products, categories, and blog posts. Normally, you'd have to crawl each one individually. Not with usp.

It will:

Detect index sitemaps.

Fetch child sitemaps automatically.

Return every URL across the site.

No extra loops. No messy recursion. Just results.
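
usp performs that traversal for you inside all_pages(). If you want to inspect the structure yourself, here is a minimal sketch; it assumes index sitemaps in the parsed tree expose a sub_sitemaps attribute, while leaf sitemaps simply have no children (the getattr default covers that case):

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage("https://www.asos.com/")

def print_sitemaps(sitemap, depth=0):
    # Print each sitemap's URL, indented by its nesting depth
    print("  " * depth + sitemap.url)
    # Index sitemaps reference child sitemaps; page sitemaps have none
    for child in getattr(sitemap, "sub_sitemaps", []):
        print_sitemaps(child, depth + 1)

print_sitemaps(tree)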

3. Extract Only the URLs You Need

Want just product pages? Easy. Filter by URL patterns:

product_urls = [page.url for page in tree.all_pages() if "/product/" in page.url]

for url in product_urls:
    print(url)

Targeted extraction. Minimal effort. Maximum efficiency.
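
The same approach scales to a quick structural overview. As a minimal sketch using only the standard library (and the tree object built in step 1), you can count URLs by their first path segment to see how the site's pages are distributed:

from collections import Counter
from urllib.parse import urlparse

# Tally pages by the top-level section of the path, e.g. "product"
sections = Counter(
    urlparse(page.url).path.strip("/").split("/")[0] or "(root)"
    for page in tree.all_pages()
)

for section, count in sections.most_common(10):
    print(f"{section}: {count}")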

4. Save URLs to a File

Instead of printing URLs, store them for analysis:

import csv
from usp.tree import sitemap_tree_for_homepage

url = "https://www.asos.com/"
tree = sitemap_tree_for_homepage(url)

urls = [page.url for page in tree.all_pages()]

csv_filename = "asos_sitemap_urls.csv"
with open(csv_filename, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["URL"])  # header row
    for page_url in urls:     # separate name so the homepage url isn't shadowed
        writer.writerow([page_url])

print(f"Extracted {len(urls)} URLs and saved to {csv_filename}")

Now you've got a complete, ready-to-analyze CSV of every page.
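
From here the file loads like any other CSV. For example, a minimal sketch with pandas (a separate dependency, installed via pip install pandas):

import pandas as pd

# Load the CSV produced above; it has a single "URL" column
df = pd.read_csv("asos_sitemap_urls.csv")

print(len(df), "URLs loaded")
print(df["URL"].head())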

Wrapping Up

With ultimate-sitemap-parser, crawling sitemaps becomes effortless. It extracts every URL quickly, handles nested sitemaps automatically, and makes it easy to filter and save exactly the pages you need. Whether it's for SEO audits, competitive analysis, or large-scale website scraping, usp turns a tedious task into an efficient, predictable one.

About the author

Emily Chan
Lead Writer at Swiftproxy
Emily Chan is the lead writer at Swiftproxy, bringing over a decade of experience in technology, digital infrastructure, and strategic communications. Based in Hong Kong, she combines regional insight with a clear, practical voice to help businesses navigate the evolving world of proxy solutions and data-driven growth.
The content provided on the Swiftproxy Blog is intended solely for informational purposes and is presented without warranty of any kind. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information contained herein, nor does it assume any responsibility for content on third-party websites referenced in the blog. Prior to engaging in any web scraping or automated data collection activities, readers are strongly advised to consult with qualified legal counsel and to review the applicable terms of service of the target website. In certain cases, explicit authorization or a scraping permit may be required.