Ever faced a website with thousands of entries? Endless scrolling, multiple pages—how can you grab every single item without leaving anything behind? Welcome to the world of pagination. From eCommerce catalogs to social media feeds, it’s everywhere. Handling it properly is key to getting complete, accurate data.

Websites use pagination to split content into manageable chunks. Load too many items at once, and the page slows down—or even crashes. Pagination improves performance and usability, but for web scrapers, it introduces an extra layer of complexity. Each page has to be captured, tracked, and processed without duplicating or skipping data.

Some sites are simple—“Next” or numbered page links. Others rely on infinite scrolling, AJAX, or API-based responses. Each requires a different approach.

Websites organize content in several common ways. Knowing which pattern you're dealing with is the first step.
Next/Previous links: Straightforward and common on older websites. Scrapers follow anchor tags like Next or Previous until no more pages remain. Simple, reliable, and perfect for small datasets.
Numbered pages: Common on eCommerce and news sites. URLs often include query parameters like ?page=2 or &p=3. Scrapers loop through numbers, incrementing parameters to reach every page. Easy to implement when the URL structure is consistent.
Infinite scroll: Sites like Instagram, Twitter, or YouTube load content dynamically as users scroll. No buttons, no page numbers. Here, scrapers must simulate scrolling using tools like Playwright or Selenium, waiting for new elements to appear before continuing.
"Load More" buttons: A hybrid approach where clicking a button fetches additional content without changing the URL. Scrapers must repeatedly click the button or replicate the underlying network request. Pinterest and SoundCloud often use this pattern.
API-based pagination: Some sites expose structured data through APIs with parameters like page, limit, or cursor. Scraping APIs is the cleanest method: fast, reliable, and structured. Platforms like Reddit, GitHub, and Shopify stores often provide this.
Other variations: Dropdowns, arrows, tabbed pagination, or ellipses. The visuals vary, but the logic remains the same: segment content for controlled loading.
Before writing code, you need to inspect how the site loads new content. Here's a practical workflow:
Check HTML elements near the bottom of the page. Look for anchor tags, query parameters (?page=2), or buttons like Load more. Identify classes or attributes (data-page, aria-label) that control navigation. A quick scripted version of this check is sketched after this workflow.
Monitor the Network tab while interacting with pagination. Look for XHR or Fetch requests returning JSON—these reveal API endpoints you can target directly. Track recurring parameters (page, offset, cursor) to see how pagination progresses.
Simulate scrolling with window.scrollTo(0, document.body.scrollHeight) to check if new content loads dynamically. If it does, you likely need a browser automation tool.
Search for JavaScript functions like loadMore, nextPage, or similar. They often control asynchronous content loading.
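If you prefer to script that first pass, a quick check with requests and BeautifulSoup can surface likely pagination controls. This is only a rough sketch: the selectors (rel="next", li.next a, data-page, the "load more" text match) are common conventions rather than guarantees, and the URL is just the example site used later in this article.

import requests
from bs4 import BeautifulSoup

# Example target; swap in the page you are inspecting.
url = "https://books.toscrape.com/catalogue/page-1.html"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Common conventions for pagination controls -- adjust to what DevTools shows.
next_links = soup.select('a[rel="next"], li.next a')
numbered = soup.select("[data-page]")
load_more = [b for b in soup.find_all(["a", "button"])
             if "load more" in b.get_text(strip=True).lower()]

print("Next links:", [a.get("href") for a in next_links])
print("Elements with data-page:", len(numbered))
print("Load-more style buttons:", len(load_more))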
Different pagination types require different strategies. Here's how to handle them effectively.
When pages follow predictable URL patterns:
import requests
from bs4 import BeautifulSoup

pages = 5
for i in range(1, pages + 1):
    url = f"https://books.toscrape.com/catalogue/page-{i}.html"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # Each product card on books.toscrape.com uses the .product_pod class.
    items = soup.select(".product_pod")
    print(f"Page {i}: Found {len(items)} products")
For sites without page numbers:
from playwright.sync_api import sync_playwright

MAX_PAGES = 5

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://books.toscrape.com/catalogue/page-1.html")

    current_page = 1
    while True:
        print("Scraping current page...")
        titles = page.query_selector_all(".product_pod h3 a")
        for t in titles:
            print("-", t.inner_text())

        if current_page >= MAX_PAGES:
            break

        # Stop when the "next" link is no longer present on the page.
        next_btn = page.locator("li.next a")
        if not next_btn.is_visible():
            break

        next_btn.click()
        page.wait_for_timeout(2000)  # give the next page time to load
        current_page += 1

    browser.close()
Simulate scrolling to load all items (a click-based "Load More" variant is sketched after this example):
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://infinite-scroll.com/demo/full-page/")

    previous_height = 0
    while True:
        # Scroll to the bottom and wait for new content to render.
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)

        # If the page height stopped growing, nothing more is loading.
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == previous_height:
            break
        previous_height = new_height

    print("All results loaded.")
    browser.close()
Fetch data directly from JSON endpoints (a cursor-based variant is sketched after this example):
import requests

base_url = "https://dummyjson.com/products"
# dummyjson paginates with offset-style limit/skip parameters;
# other APIs use page numbers or cursors instead.
params = {"limit": 50, "skip": 0}
max_pages = 10

for page_num in range(1, max_pages + 1):
    response = requests.get(base_url, params=params)
    data = response.json()
    items = data.get("products", [])
    if not items:
        break  # an empty page means we have reached the end
    print(f"Fetched {len(items)} items from page {page_num}")
    params["skip"] += params["limit"]
Unknown Number of Pages: Stop when results are fewer than expected or the “Next” button disappears. Add a maximum page limit to avoid infinite loops.
JavaScript/AJAX Content: Traditional scraping won't see dynamically loaded elements. Use Playwright or Selenium.
Session Data and Cookies: Store cookies/tokens for authenticated pages. Refresh periodically to avoid session expiry.
Hybrid Pagination: Combine techniques for sites mixing "Load More," filters, or tabs.
Rate Limiting and Backoff: Mimic human browsing to avoid blocks. Use random delays or exponential backoff.
Respect Guidelines: Check robots.txt and terms of service.
Error Handling and Retries: Handle timeouts, failed requests, or CAPTCHAs gracefully.
Deduplicate Data: Use unique identifiers, check counts, and validate consistency. A sketch combining backoff, retries, and deduplication follows this list.
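The sketch below pulls a few of these points together: a shared requests.Session, retries with exponential backoff plus random jitter, polite delays between pages, and deduplication by a unique identifier. Treat it as a pattern rather than a drop-in scraper; the URL and the id field come from the dummyjson example above, and the retry counts and delays are arbitrary choices.

import random
import time

import requests

session = requests.Session()  # reuses cookies and connections across requests
seen_ids = set()              # unique identifiers already collected
max_retries = 3

for skip in range(0, 200, 50):  # offset-style pagination, 50 items per request
    url = f"https://dummyjson.com/products?limit=50&skip={skip}"

    for attempt in range(max_retries):
        try:
            response = session.get(url, timeout=10)
            response.raise_for_status()
            break
        except requests.RequestException:
            # Exponential backoff with jitter: ~1s, 2s, 4s between attempts.
            time.sleep(2 ** attempt + random.uniform(0, 1))
    else:
        print(f"Giving up on {url}")
        continue

    for product in response.json().get("products", []):
        if product["id"] in seen_ids:
            continue  # skip duplicates
        seen_ids.add(product["id"])

    # Polite random delay between page requests.
    time.sleep(random.uniform(1, 3))

print(f"Collected {len(seen_ids)} unique items")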
Beautiful Soup + Requests: Static HTML, simple URL-based pagination.
Selenium / Playwright: Dynamic JavaScript-driven sites, infinite scroll, or buttons.
Scrapy: Scalable crawlers, asynchronous handling, automatic pagination (a minimal spider is sketched after this list).
aiohttp: Async performance for multiple API requests (see the concurrent-fetch sketch below).
Web Scraping APIs: Managed solutions with proxies, JS rendering, and pre-built templates.
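As an illustration of how Scrapy handles pagination, a minimal spider for the books.toscrape.com site used earlier might look like this sketch: response.follow on the "next" link queues the following page automatically, and Scrapy's scheduler takes care of concurrency and duplicate-request filtering.

import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/catalogue/page-1.html"]

    def parse(self, response):
        # Yield one item per product on the current page.
        for title in response.css(".product_pod h3 a::attr(title)").getall():
            yield {"title": title}

        # Follow the "next" link if it exists; Scrapy schedules the request.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Saved as a standalone file, it can be run with scrapy runspider, for example: scrapy runspider books_spider.py -o books.json.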
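For aiohttp, fetching several API pages concurrently can look like the sketch below, again against the dummyjson endpoint used earlier; asyncio.gather issues the requests in parallel instead of one at a time.

import asyncio

import aiohttp

BASE_URL = "https://dummyjson.com/products"


async def fetch_page(session, skip, limit=50):
    # One GET per "page", using offset-style limit/skip parameters.
    async with session.get(BASE_URL, params={"limit": limit, "skip": skip}) as resp:
        data = await resp.json()
        return data.get("products", [])


async def main():
    async with aiohttp.ClientSession() as session:
        # Fetch four pages concurrently instead of sequentially.
        pages = await asyncio.gather(
            *(fetch_page(session, skip) for skip in range(0, 200, 50))
        )
        total = sum(len(p) for p in pages)
        print(f"Fetched {total} items across {len(pages)} pages")


asyncio.run(main())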
How do you know when you have reached the last page? Watch for these signals:
Next-page button disappears or is disabled.
Latest request returns empty or duplicate data.
Items retrieved are fewer than expected.
Set a maximum page limit as a safeguard.
Pagination doesn't need to hold you back. By using the right tools and strategies, you can gather data from multiple pages both reliably and efficiently. It's important to be precise, track your requests carefully, respect the website's limits, and check that your data is accurate. Python allows you to handle everything from simple URL loops to complex dynamic scrapes. With careful planning, even very large websites can be managed effectively.