Much of the world’s data lives on the web, and most of it isn’t neatly packaged for analysis. It’s buried in HTML. That’s where BeautifulSoup quietly earns its reputation. Web scraping can get complex quickly. But building a solid parser? That part is refreshingly approachable. Python does the heavy lifting, and BeautifulSoup gives you clean, readable access to messy markup without turning your code into a science project. In this BeautifulSoup tutorial, we’ll walk you through parsing HTML with BeautifulSoup, step by step. You’ll start small. Then you’ll level up to dynamic pages rendered with JavaScript using Selenium. By the end, you’ll know how to extract structured data and export it for real analysis. Let’s get hands-on.

Before touching code, make sure your Python environment is ready. Any IDE works, but PyCharm is a solid choice if you want fewer distractions and better debugging out of the box.
On Windows, pay attention during Python installation. Check the "Add Python to PATH" option in the installer. This lets commands like python and pip run globally without pointing to their install directory. It saves time. Every time.
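To confirm everything is reachable, open a new terminal after installing and check the versions:
python --version
pip --version
If both commands print a version number instead of an error, PATH is set up correctly.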
Now install BeautifulSoup 4:
pip install beautifulsoup4
If you're on Windows, running your terminal as administrator avoids permission issues.
Here's the sample HTML file we'll work with. It's intentionally simple, but the same techniques apply to complex production pages.
<!DOCTYPE html>
<html>
  <head>
    <title>What is a Proxy?</title>
    <meta charset="utf-8">
  </head>
  <body>
    <h2>Proxy types</h2>
    <p>
      There are many different ways to categorize proxies. However, two of
      the most popular types are residential and data center proxies.
    </p>
    <ul id="proxytypes">
      <li>Residential proxies</li>
      <li>Datacenter proxies</li>
      <li>Shared proxies</li>
      <li>Semi-dedicated proxies</li>
      <li>Private proxies</li>
    </ul>
  </body>
</html>
Save this file as index.html in your project directory. Once that's done, create a new Python file. This is where the fun starts.
Before extracting anything specific, it helps to understand what's actually there.
from bs4 import BeautifulSoup

# Read the sample page from disk
with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")

# Walk every node in the parse tree and print tag names in document order
for child in soup.descendants:
    if child.name:
        print(child.name)
Run this, and you'll see every tag in order:
html
head
title
meta
body
h2
p
ul
li
li
li
li
li
This step is underrated. It gives you a mental map of the document before you start extracting data blindly.
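If you want the same overview with the markup intact, BeautifulSoup's prettify() method prints the whole parsed document with indentation:
print(soup.prettify())
It's a handy way to eyeball nesting before deciding which tags to target.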
Want the full HTML of specific elements? BeautifulSoup makes that trivial.
print(soup.h2)
print(soup.p)
print(soup.li)
Output:
<h2>Proxy types</h2>
<p>There are many different ways to categorize proxies...</p>
<li>Residential proxies</li>
If you only want the text, strip the markup:
print(soup.h2.text)
Keep in mind that this returns only the first matching tag. That behavior matters once you're working with lists.
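For example, soup.li here gives you just the first item, even though the sample page has five:
print(soup.li.text)  # Residential proxies - only the first <li> is returned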
IDs are your best friend when scraping. They're usually unique and stable.
print(soup.find('ul', id='proxytypes'))
or
print(soup.find('ul', attrs={'id': 'proxytypes'}))
Both produce the same result. Use whichever reads better to you.
Lists are common targets. Here's how to extract every <li> cleanly.
for tag in soup.find_all('li'):
    print(tag.text)
Output:
Residential proxies
Datacenter proxies
Shared proxies
Semi-dedicated proxies
Private proxies
This pattern shows up everywhere in real scraping projects. Master it early.
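You can also combine it with the id lookup from the previous step, so the search is scoped to one container instead of the whole page. A minimal sketch on the same sample file:
proxy_list = soup.find('ul', id='proxytypes')

# Only <li> tags inside this <ul> are returned, even if the page had other lists
for tag in proxy_list.find_all('li'):
    print(tag.text)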
BeautifulSoup supports CSS selectors through the soupsieve package, installed automatically.
Two methods matter most: select() returns a list of every match, while select_one() returns only the first match. Extract the page title:
print(soup.select('html head title'))
Target the first list item:
print(soup.select_one('body ul li'))
Need precision? Use positional selectors:
print(soup.select_one('body ul li:nth-of-type(3)'))
That line grabs “Shared proxies” exactly. No guesswork.
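select() also makes it easy to collect the whole list in one pass, for example by id:
# '#proxytypes li' matches every <li> inside the element with id="proxytypes"
names = [li.text for li in soup.select('#proxytypes li')]
print(names)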
Static HTML is easy. JavaScript changes everything. BeautifulSoup alone can't render JavaScript. For that, you need Selenium.
pip install selenium
Selenium 4.6+ automatically downloads browser drivers. If yours doesn't, you'll need to install the appropriate WebDriver manually.
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
This opens a real browser instance. JavaScript runs. Content loads fully.
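If you'd rather not have a browser window pop up on every run, Chrome can also run headless. A small sketch using Selenium's standard options object:
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # render pages without opening a visible window
driver = webdriver.Chrome(options=options)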
driver.get("http://quotes.toscrape.com/js/")
js_content = driver.page_source
Now you have rendered HTML, not placeholders.
soup = BeautifulSoup(js_content, "html.parser")
quote = soup.find("span", class_="text")
print(quote.text)
Note the trailing underscore in class_. It's needed because class is a reserved keyword in Python.
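To grab every quote rather than just the first, and to close the browser when you're done, reuse the find_all pattern from earlier. This sketch assumes the page keeps its usual structure, where each div with class "quote" wraps a span for the text and a small tag for the author:
# Each quote sits in a div with class "quote" (per the site's markup at the time of writing)
for block in soup.find_all("div", class_="quote"):
    text = block.find("span", class_="text").text
    author = block.find("small", class_="author").text
    print(f"{text} by {author}")

driver.quit()  # close the browser once you're done with the page source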
One warning. Many sites detect Selenium traffic aggressively. IP blocks are common. If that happens, rotating proxies and browser fingerprinting strategies become essential, especially at scale.
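If you do need to route traffic through a proxy, Chrome accepts one as a command-line switch when the driver starts. A minimal sketch with a placeholder address (substitute your provider's endpoint):
options = webdriver.ChromeOptions()
# Hypothetical proxy address shown for illustration only
options.add_argument("--proxy-server=http://203.0.113.5:8080")
driver = webdriver.Chrome(options=options)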
Scraping isn't useful unless the data leaves your script.
Install pandas:
pip install pandas
Then export your results:
from bs4 import BeautifulSoup
import pandas as pd

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")

# Keep just the text of each <li> so the CSV contains names, not markup
results = [tag.text for tag in soup.find_all('li')]

df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')
Run it, and a CSV appears in your project directory. Clean. Structured. Ready for analysis.
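A quick sanity check is to read the file straight back with pandas:
print(pd.read_csv('names.csv'))
If the printed frame shows one proxy name per row, the export worked as expected.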
You now have the core tools to extract data from both static and dynamic web pages, and to turn that data into a structured CSV. The real power comes when you apply this workflow to your own projects—whether it's price tracking, competitor research, or building a dataset for analysis. Start small, stay consistent, and you'll be surprised how quickly your scraping skills improve.