BeautifulSoup Tutorial: Extract Structured Data from HTML

The vast majority of the world’s data lives on the web, and most of it isn’t neatly packaged for analysis. It’s buried in HTML. That’s where BeautifulSoup quietly earns its reputation. Web scraping can get complex quickly, but building a solid parser is refreshingly approachable: Python does the heavy lifting, and BeautifulSoup gives you clean, readable access to messy markup without turning your code into a science project. In this BeautifulSoup tutorial, we’ll walk you through parsing HTML with BeautifulSoup, step by step. You’ll start small, then level up to dynamic pages rendered with JavaScript using Selenium. By the end, you’ll know how to extract structured data and export it for real analysis. Let’s get hands-on.

SwiftProxy
By Linh Tran
2026-01-26 15:30:33

1. Install BeautifulSoup

Before touching code, make sure your Python environment is ready. Any IDE works, but PyCharm is a solid choice if you want fewer distractions and better debugging out of the box.

On Windows, pay attention during Python installation: enable the "Add Python to PATH" option so that commands like python and pip run from any directory without pointing to their install location. It saves time. Every time.
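
A quick way to confirm the PATH setup worked is to open a fresh terminal and check both commands:

python --version
pip --version

If either command isn't recognized, re-run the installer and make sure the PATH option is checked.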

Now install BeautifulSoup 4:

pip install beautifulsoup4

If you're on Windows, running your terminal as administrator can help avoid permission issues.

2. Inspect the HTML You're Parsing

Here's the sample HTML file we'll work with. It's intentionally simple, but the same techniques apply to complex production pages.

<!DOCTYPE html>
<html>
    <head>
        <title>What is a Proxy?</title>
        <meta charset="utf-8">
    </head>

    <body>
        <h2>Proxy types</h2>

        <p>
          There are many different ways to categorize proxies. However, two of
          the most popular types are residential and data center proxies.
        </p>

        <ul id="proxytypes">
            <li>Residential proxies</li>
            <li>Datacenter proxies</li>
            <li>Shared proxies</li>
            <li>Semi-dedicated proxies</li>
            <li>Private proxies</li>
        </ul>
    </body>
</html>

Save this file as index.html in your project directory. Once that's done, create a new Python file. This is where the fun starts.

3. Discover All Tags in the Document

Before extracting anything specific, it helps to understand what's actually there.

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")

# descendants walks every node of the tree in document order;
# text nodes have no name, so we skip them
for child in soup.descendants:
    if child.name:
        print(child.name)

Run this, and you'll see every tag in order:

html
head
title
meta
body
h2
p
ul
li
li
li
li
li

This step is underrated. It gives you a mental map of the document before you start extracting data blindly.

4. Extract Full Content From Tags

Want the full HTML of specific elements? BeautifulSoup makes that trivial.

print(soup.h2)
print(soup.p)
print(soup.li)

Output:

<h2>Proxy types</h2>
<p>There are many different ways to categorize proxies...</p>
<li>Residential proxies</li>

If you only want the text, strip the markup:

print(soup.h2.text)

Keep in mind that this returns only the first matching tag. That behavior matters once you're working with lists.
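
For instance, soup.li is just shorthand for soup.find('li'), and both stop at the first hit:

print(soup.li)          # <li>Residential proxies</li>
print(soup.find('li'))  # same element: only the first match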

5. Discover Elements by ID

IDs are your best friend when scraping. They're usually unique and stable.

print(soup.find('ul', id='proxytypes'))

or

print(soup.find('ul', attrs={'id': 'proxytypes'}))

Both produce the same result. Use whichever reads better to you.
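
An element found by ID also makes a convenient starting point for narrower searches. For example, this scopes the search to the list itself before reading its items:

proxy_list = soup.find('ul', id='proxytypes')
for item in proxy_list.find_all('li'):
    print(item.text)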

6. Extract All Instances of a Tag

Lists are common targets. Here's how to extract every <li> cleanly.

for tag in soup.find_all('li'):
    print(tag.text)

Output:

Residential proxies
Datacenter proxies
Shared proxies
Semi-dedicated proxies
Private proxies

This pattern shows up everywhere in real scraping projects. Master it early.
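
The same loop turns into a plain Python list with a comprehension, which comes in handy when you export the data later (proxy_types is just an illustrative name):

proxy_types = [tag.text for tag in soup.find_all('li')]
print(proxy_types)
# ['Residential proxies', 'Datacenter proxies', 'Shared proxies', 'Semi-dedicated proxies', 'Private proxies']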

7. Parse Using CSS Selectors

BeautifulSoup supports CSS selectors through the soupsieve package, installed automatically.

Two methods matter most:

  • select() returns a list
  • select_one() returns the first match

Extract the page title:

print(soup.select('html head title'))

Target the first list item:

print(soup.select_one('body ul li'))

Need precision? Use positional selectors:

print(soup.select_one('body ul li:nth-of-type(3)'))

That line grabs “Shared proxies” exactly. No guesswork.
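
Selectors also combine IDs and tags neatly. For instance, this grabs every item inside the proxytypes list from the sample file:

for item in soup.select('#proxytypes li'):
    print(item.text)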

8. Parse Dynamic Pages With Selenium

Static HTML is easy. JavaScript changes everything. BeautifulSoup alone can't render JavaScript. For that, you need Selenium.

Step 1: Install Selenium

pip install selenium

Selenium 4.6+ automatically downloads browser drivers. If yours doesn't, you'll need to install the appropriate WebDriver manually.

Step 2: Import Dependencies

from selenium import webdriver
from bs4 import BeautifulSoup

Step 3: Open the Browser

driver = webdriver.Chrome()

This opens a real browser instance. JavaScript runs. Content loads fully.
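
If you'd rather not have a window pop up, Chrome's headless mode is an option. A minimal sketch (newer Chrome builds accept --headless=new; older ones use plain --headless):

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)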

Step 4: Fetch a Dynamic Page

driver.get("http://quotes.toscrape.com/js/")
js_content = driver.page_source

Now you have rendered HTML, not placeholders.
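
If the page renders its content after a short delay, reading page_source immediately can still return an incomplete document. A common approach is to wait explicitly for a known element first; here is a sketch using Selenium's explicit waits (the "text" class is the quote element used in the next step):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for at least one quote to be present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "text"))
)
js_content = driver.page_source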

Step 5: Parse With BeautifulSoup

soup = BeautifulSoup(js_content, "html.parser")
quote = soup.find("span", class_="text")
print(quote.text)

Note the underscore in class_. Because class is a reserved keyword in Python, BeautifulSoup uses class_ to refer to the HTML class attribute.
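
The same keyword works with find_all if you want every quote on the page rather than just the first:

for quote in soup.find_all("span", class_="text"):
    print(quote.text)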

One warning. Many sites detect Selenium traffic aggressively. IP blocks are common. If that happens, rotating proxies and browser fingerprinting strategies become essential, especially at scale.

9. Export Parsed Data to CSV

Scraping isn't useful unless the data leaves your script.

Install pandas:

pip install pandas

Then export your results:

from bs4 import BeautifulSoup
import pandas as pd

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")

# keep only the text of each <li>, not the tag objects,
# so the CSV contains clean values rather than raw markup
results = [li.text for li in soup.find_all('li')]

df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')

Run it, and a CSV appears in your project directory. Clean. Structured. Ready for analysis.
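
The same export pattern works for the Selenium results from step 8. A self-contained sketch, assuming the quotes page structure used earlier (the quotes.csv filename is just an example):

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome()
driver.get("http://quotes.toscrape.com/js/")
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()  # close the browser once the HTML is in hand

quotes = [q.text for q in soup.find_all("span", class_="text")]
pd.DataFrame({'Quotes': quotes}).to_csv('quotes.csv', index=False, encoding='utf-8')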

Final Thoughts

You now have the core tools to extract data from both static and dynamic web pages, and to turn that data into a structured CSV. The real power comes when you apply this workflow to your own projects—whether it's price tracking, competitor research, or building a dataset for analysis. Start small, stay consistent, and you'll be surprised how quickly your scraping skills improve.

About the author

Linh Tran
Senior Technology Analyst at Swiftproxy
Linh Tran is a Hong Kong-based technology writer with a background in computer science and over eight years of experience in the digital infrastructure space. At Swiftproxy, she specializes in making complex proxy technologies accessible, offering clear, actionable insights for businesses navigating the fast-evolving data landscape across Asia and beyond.
The content provided on the Swiftproxy Blog is intended solely for informational purposes and is presented without warranty of any kind. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information contained herein, nor does it assume any responsibility for content on third-party websites referenced in the blog. Prior to engaging in any web scraping or automated data collection activities, readers are strongly advised to consult with qualified legal counsel and to review the applicable terms of service of the target website. In certain cases, explicit authorization or a scraping permit may be required.