How to Crawl Sitemaps Efficiently with Python

SwiftProxy
By Emily Chan
2025-09-05 15:18:39


Websites can contain tens of thousands or even hundreds of thousands of pages. Crawling them manually is a nightmare. Sitemaps offer a smarter and faster solution because they act as a website's blueprint, showing exactly which pages exist.

Sitemaps can save you hours or even days of scraping work. Instead of jumping from link to link, you can collect every URL in an organized way. However, there is a complication. Many websites use index sitemaps that reference other sitemaps. Parsing these manually is both tedious and prone to errors.
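To get a feel for what that manual work looks like, here is a rough sketch of parsing a sitemap index yourself with requests and the standard library. The domain and the /sitemap.xml path are placeholders (sitemap locations vary by site), and this code only lists the child sitemaps; each of those would still need its own fetch-and-parse pass:

import xml.etree.ElementTree as ET

import requests

# Placeholder URL: many sites publish a sitemap index at /sitemap.xml,
# but the real location varies (robots.txt usually points to it).
index_url = "https://www.example.com/sitemap.xml"

response = requests.get(index_url, timeout=30)
root = ET.fromstring(response.content)

# Sitemap files use the sitemaps.org XML namespace; each <sitemap> entry
# in an index points at a child sitemap that still has to be parsed.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
for child in root.findall("sm:sitemap", ns):
    loc = child.find("sm:loc", ns)
    if loc is not None:
        print(loc.text)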

Enter ultimate-sitemap-parser (usp). This Python library takes the hassle out of sitemap crawling. Let's walk through how to use usp to crawl the ASOS sitemap and extract every available URL in minutes.

Foundational Requirements

Before diving in, make sure you have the basics in place:

1. Install Python

You'll need Python installed. If you don't have it yet:

Download and install the latest version from python.org.

Verify the installation:

python3 --version

2. Install ultimate-sitemap-parser

Next, grab the usp library:

pip install ultimate-sitemap-parser

How to Crawl Sitemaps with ultimate-sitemap-parser

With usp installed, you're ready to extract URLs from ASOS—or any site. Here's how.

1. Grab Sitemap and Extract URLs

Parsing XML manually is a pain. With usp, it's just a few lines:

from usp.tree import sitemap_tree_for_homepage

# Point usp at the homepage; it locates the site's sitemaps on its own.
url = "https://www.asos.com/"
tree = sitemap_tree_for_homepage(url)

# all_pages() walks every sitemap in the tree and yields each page entry.
for page in tree.all_pages():
    print(page.url)

Boom. That's it. All URLs, fetched and ready to use.
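Each page object carries more than just a URL. usp exposes fields such as last_modified and priority when the sitemap provides them; exact availability depends on the site and the usp version, so confirm against the library's docs. A quick sketch:

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage("https://www.asos.com/")

for page in tree.all_pages():
    # These fields are filled in only when the sitemap publishes them,
    # so expect None for many entries.
    print(page.url, page.last_modified, page.priority)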

2. Manage Nested Sitemaps Automatically

Many sites, such as ASOS, divide their sitemaps into different sections for products, categories, and blogs. Normally, you'd have to crawl each one individually. Not with usp.

It will:

Detect index sitemaps.

Fetch child sitemaps automatically.

Return every URL across the site.

No extra loops. No messy recursion. Just results.
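If you want to see the nesting for yourself, you can walk the tree and print its structure. This is a minimal sketch that assumes index sitemap objects expose a sub_sitemaps list (true in recent usp releases; worth verifying for the version you install):

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage("https://www.asos.com/")

def print_tree(sitemap, depth=0):
    # Leaf sitemaps have no children, so fall back to an empty list.
    children = getattr(sitemap, "sub_sitemaps", [])
    print("  " * depth + f"{sitemap.url} ({len(children)} child sitemaps)")
    for child in children:
        print_tree(child, depth + 1)

print_tree(tree)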

3. Extract Only the URLs You Need

Want just product pages? Easy. Filter by URL patterns:

product_urls = [page.url for page in tree.all_pages() if "/product/" in page.url]

for url in product_urls:
    print(url)

Targeted extraction. Minimal effort. Maximum efficiency.
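If a plain substring match is too loose, the same filter works with a regular expression. The pattern below is purely illustrative; check the target site's real URL structure before relying on it:

import re

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage("https://www.asos.com/")

# Illustrative pattern: a /prd/ or /product/ path segment followed by digits.
product_pattern = re.compile(r"/(prd|product)/\d+")

product_urls = [page.url for page in tree.all_pages() if product_pattern.search(page.url)]
print(f"Matched {len(product_urls)} product URLs")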

4. Save URLs to a File

Instead of printing URLs, store them for analysis:

import csv
from usp.tree import sitemap_tree_for_homepage

url = "https://www.asos.com/"
tree = sitemap_tree_for_homepage(url)

urls = [page.url for page in tree.all_pages()]

csv_filename = "asos_sitemap_urls.csv"
with open(csv_filename, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["URL"])
    # One row per extracted URL.
    for page_url in urls:
        writer.writerow([page_url])

print(f"Extracted {len(urls)} URLs and saved to {csv_filename}")

Now you've got a complete, ready-to-analyze CSV of every page.
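From there, the CSV drops straight into your analysis tool of choice. For example, a quick sanity check with pandas (a separate install, not required by usp):

import pandas as pd

# Load the exported URLs and take a quick look at what was collected.
df = pd.read_csv("asos_sitemap_urls.csv")
print(df.shape)          # (rows, columns)
print(df["URL"].head())  # first few URLs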

Wrapping Up

With ultimate-sitemap-parser, crawling sitemaps becomes effortless. It extracts every URL quickly, handles nested sitemaps automatically, and lets you filter and save exactly the content you need. Whether it's for SEO audits, competitive analysis, or large-scale website scraping, usp turns a tedious task into an efficient, predictable one.

About the Author

SwiftProxy
Emily Chan
Editor-in-Chief at Swiftproxy
Emily Chan is the Editor-in-Chief at Swiftproxy, with more than ten years of experience in technology, digital infrastructure, and strategic communication. Based in Hong Kong, she combines deep regional knowledge with a clear, practical voice to help businesses navigate the evolving world of proxy solutions and data-driven growth.
The content provided on the Swiftproxy blog is for informational purposes only and is presented without any warranty. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information it contains, nor does it assume responsibility for the content of third-party sites referenced in the blog. Before engaging in any web scraping or automated data collection, readers are strongly advised to consult qualified legal counsel and review the applicable terms of service of the target site. In some cases, explicit authorization or a scraping permit may be required.