How to Crawl Sitemaps Efficiently with Python

SwiftProxy
By Emily Chan
2025-09-05 15:18:39

Websites can contain tens of thousands or even hundreds of thousands of pages. Crawling them manually is a nightmare. Sitemaps offer a smarter and faster solution because they act as a website's blueprint, showing exactly which pages exist.

Sitemaps can save you hours or even days of scraping work. Instead of jumping from link to link, you can collect every URL in an organized way. However, there is a complication. Many websites use index sitemaps that reference other sitemaps. Parsing these manually is both tedious and prone to errors.

Enter ultimate-sitemap-parser (usp). This Python library takes the hassle out of sitemap crawling. Let's walk through how to use usp to crawl the ASOS sitemap and extract every available URL in minutes.

Foundational Requirements

Before diving in, make sure you have the basics in place:

1. Install Python

You'll need Python installed. If you don't have it yet:

Download and install the latest version from python.org.

Verify the installation:

python3 --version

2. Install ultimate-sitemap-parser

Next, grab the usp library:

pip install ultimate-sitemap-parser
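
To confirm the package installed correctly, you can ask pip for its metadata:

pip show ultimate-sitemap-parser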

How to Crawl Sitemaps with ultimate-sitemap-parser

With usp installed, you're ready to extract URLs from ASOS—or any site. Here's how.

1. Grab Sitemap and Extract URLs

Parsing XML manually is a pain. With usp, it's just a few lines:

from usp.tree import sitemap_tree_for_homepage

url = "https://www.asos.com/"

# Discovers the site's sitemaps (via robots.txt and common locations) and parses them all
tree = sitemap_tree_for_homepage(url)

# all_pages() yields every page from every discovered sitemap
for page in tree.all_pages():
    print(page.url)

Boom. That's it. All URLs, fetched and ready to use.
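
Each page object carries more than the URL. As a sketch, the attributes below (last_modified, priority) come from the sitemap entries themselves, so expect None values or defaults when a site omits them:

for page in tree.all_pages():
    # These fields may be None (or a default) if the sitemap doesn't provide them
    print(page.url, page.last_modified, page.priority)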

2. Manage Nested Sitemaps Automatically

Many sites, such as ASOS, divide their sitemaps into different sections for products, categories, and blogs. Normally, you'd have to crawl each one individually. Not with usp.

It will:

Detect index sitemaps.

Fetch child sitemaps automatically.

Return every URL across the site.

No extra loops. No messy recursion. Just results.
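
Curious what that nesting actually looks like? You can walk the tree usp returns. This is a minimal sketch: index-type nodes expose their children via sub_sitemaps, and getattr is used defensively in case a node is a leaf:

# `tree` is the object returned by sitemap_tree_for_homepage above
def print_sitemap_tree(sitemap, depth=0):
    # Show this sitemap's URL, indented by nesting depth
    print("  " * depth + sitemap.url)
    # Index sitemaps expose children via sub_sitemaps; leaf sitemaps have none
    for child in getattr(sitemap, "sub_sitemaps", []):
        print_sitemap_tree(child, depth + 1)

print_sitemap_tree(tree)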

3. Extract Only the URLs You Need

Want just product pages? Easy. Filter by URL patterns:

product_urls = [page.url for page in tree.all_pages() if "/product/" in page.url]

for url in product_urls:
    print(url)

Targeted extraction. Minimal effort. Maximum efficiency.
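
Not sure which patterns a site actually uses? One rough way to find out is to tally the first path segment of every URL. Treat the first segment as a heuristic for the site's sections, nothing more:

from collections import Counter
from urllib.parse import urlparse

# Count each URL's first path segment to get a rough map of the site's sections
sections = Counter(
    urlparse(page.url).path.strip("/").split("/")[0]
    for page in tree.all_pages()
)

for section, count in sections.most_common(10):
    print(section, count)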

4. Save URLs to a File

Instead of printing URLs, store them for analysis:

import csv
from usp.tree import sitemap_tree_for_homepage

homepage = "https://www.asos.com/"
tree = sitemap_tree_for_homepage(homepage)

urls = [page.url for page in tree.all_pages()]

csv_filename = "asos_sitemap_urls.csv"
with open(csv_filename, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["URL"])  # header row
    for page_url in urls:
        writer.writerow([page_url])

print(f"Extracted {len(urls)} URLs and saved to {csv_filename}")

Now you've got a complete, ready-to-analyze CSV of every page.
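
One caveat: if the same URL is listed in more than one child sitemap, all_pages() can yield it more than once, so the CSV may contain duplicates. A one-line, order-preserving dedupe before writing handles that:

# dict.fromkeys drops duplicates while preserving first-seen order
unique_urls = list(dict.fromkeys(urls))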

Wrapping Up

With ultimate-sitemap-parser, crawling sitemaps becomes effortless. It quickly extracts all URLs, automatically handles nested sitemaps, and lets you filter and save exactly the URLs you need. Whether it's for SEO audits, competitive analysis, or large-scale website scraping, usp makes a tedious task efficient and predictable.

About the Author

Emily Chan
Lead Writer at Swiftproxy
Emily Chan is the lead writer at Swiftproxy, with more than a decade of experience in technology, digital infrastructure, and strategic communications. Based in Hong Kong, she combines regional insight with clear, practical writing to help businesses navigate evolving proxy solutions and data-driven growth.
Content on the Swiftproxy blog is provided for informational purposes only and without warranty of any kind. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information it contains, and accepts no responsibility for the content of third-party websites referenced in the blog. Before undertaking any web scraping or automated data collection, readers are strongly advised to consult qualified legal counsel and to review the target website's terms of service. In some cases, explicit authorization or permission to scrape may be required.