
Imagine having thousands of job listings, salary figures, and employer reviews at your fingertips. That's the power of Glassdoor, one of the most valuable resources for job seekers and recruiters alike. But here's the catch: scraping Glassdoor for its rich data isn't as simple as it sounds. With strong anti-bot measures in place, traditional scraping methods like Python's requests library can quickly land you in the "blocked" zone.
This is where Playwright comes into play. It's not just another scraping tool: it drives a real browser, letting us simulate human browsing behavior and making it far less likely we trigger blockers like CAPTCHAs and IP bans. Let's dive into scraping job listings from Glassdoor using Python and Playwright.
If you've tried scraping Glassdoor before, you know the challenges. The website's anti-bot mechanisms can easily flag suspicious activity. That's why Playwright is so useful: it lets us control a real browser session, complete with proxies and browser headers, so our requests look like they come from a genuine visitor. This significantly reduces the chances of detection and lets you gather data reliably.
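For example, Playwright lets you attach a realistic user agent and extra HTTP headers to a browser context, so every request its pages make resembles ordinary browser traffic. Here's a minimal sketch; the user-agent string is just an illustrative value, not one Glassdoor requires:
import asyncio
from playwright.async_api import async_playwright
async def demo_headers():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        # A browser context carries the user agent and any extra headers
        # on every request made by its pages.
        context = await browser.new_context(
            user_agent=("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                        "AppleWebKit/537.36 (KHTML, like Gecko) "
                        "Chrome/120.0.0.0 Safari/537.36")
        )
        await context.set_extra_http_headers({"Accept-Language": "en-US,en;q=0.9"})
        page = await context.new_page()
        await page.goto("https://www.glassdoor.com", timeout=60000)
        await browser.close()
asyncio.run(demo_headers())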
Before we get into the nitty-gritty of the code, here's what you need to set up:
· Python: We'll be using Python, of course.
· Playwright: The tool that makes browser automation possible.
· lxml: A fast and efficient library for parsing HTML.
To install these dependencies, run the following commands:
pip install playwright lxml
playwright install
Let's break it down. We're going to use Playwright to launch a browser, navigate to the job listings page, extract the job data, and save it into a CSV file.
We begin by launching a browser with Playwright. Don't forget to use a proxy to avoid getting blocked.
from playwright.async_api import async_playwright
from lxml.html import fromstring
async def scrape_job_listings():
    async with async_playwright() as p:
        # Launch a visible browser through a proxy; replace the
        # placeholder credentials with your own proxy details.
        browser = await p.chromium.launch(
            headless=False,
            proxy={
                "server": "your_proxy_server",
                "username": "your_username",
                "password": "your_password"
            }
        )
        page = await browser.new_page()
        # Navigate to the listings page; the generous timeout gives
        # the page time to render fully.
        await page.goto(
            'https://www.glassdoor.com/Job/united-states-software-engineer-jobs-SRCH_IL.0,13_IN1_KO14,31.htm',
            timeout=60000
        )
        # Grab the rendered HTML, then close the browser.
        content = await page.content()
        await browser.close()
        return content
The key here is launching the browser through a proxy, which helps us avoid detection. We navigate to the desired job listings page, retrieve the rendered HTML, and then close the browser.
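Since scrape_job_listings is a coroutine, one way to run it and hand its result to the parsing step is with asyncio. This short sketch defines the html_content variable that the next snippet assumes:
import asyncio
# Fetch the rendered HTML once; the parsing code below expects
# it to be available as html_content.
html_content = asyncio.run(scrape_job_listings())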
Now that we have the page's HTML, we can use lxml to parse and extract the job details.
parser = fromstring(html_content)
# Each job card sits in an <li> tagged with data-test="jobListing".
job_posting_elements = parser.xpath('//li[@data-test="jobListing"]')
jobs_data = []
for element in job_posting_elements:
    job_title = element.xpath('.//a[@data-test="job-title"]/text()')[0]
    job_location = element.xpath('.//div[@data-test="emp-location"]/text()')[0]
    salary = ' '.join(element.xpath('.//div[@data-test="detailSalary"]/text()')).strip()
    # Listing links are relative, so prepend the domain.
    job_link = "https://www.glassdoor.com" + element.xpath('.//a[@data-test="job-title"]/@href')[0]
    easy_apply = bool(element.xpath('.//div[@data-role-variant="featured"]'))
    # Note: this hashed class name is generated by Glassdoor's build
    # and may change; update it if the selector stops matching.
    company = element.xpath('.//span[@class="EmployerProfile_compactEmployerName__LE242"]/text()')[0]
    
    job_data = {
        'company': company,
        'job_title': job_title,
        'job_location': job_location,
        'job_link': job_link,
        'salary': salary,
        'easy_apply': easy_apply
    }
    jobs_data.append(job_data)
Here, we loop through each job posting, extracting key details like job title, location, salary, and company name.
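One caveat: indexing with [0] raises an IndexError whenever a card is missing a field, which happens on Glassdoor when, for example, no salary or location is posted. A small helper makes the extraction tolerant of missing fields. The first_text function below is a hypothetical addition, not part of the original code:
def first_text(element, xpath_expr, default=''):
    """Return the first text match for an XPath query, or a default."""
    matches = element.xpath(xpath_expr)
    return matches[0].strip() if matches else default
# Example: this returns '' instead of raising when no title is found.
# job_title = first_text(element, './/a[@data-test="job-title"]/text()')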
Once we've collected the job data, we save it into a CSV file for further analysis.
import csv
# Write one row per job; DictWriter maps each dict's keys to columns.
with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply'])
    writer.writeheader()
    writer.writerows(jobs_data)
This will save all the extracted data in a neat CSV file, which you can then analyze or import into a database.
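To sanity-check the output, you can read the file straight back with csv.DictReader, a quick sketch:
import csv
with open('glassdoor_job_listings.csv', newline='', encoding='utf-8') as file:
    for row in csv.DictReader(file):
        print(row['job_title'], '|', row['company'], '|', row['salary'])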
Here's the complete code, all put together:
import asyncio
import csv
from playwright.async_api import async_playwright
from lxml.html import fromstring
async def scrape_job_listings():
    async with async_playwright() as p:
        # Launch a visible browser through a proxy to reduce the odds
        # of being flagged; fill in your own proxy credentials.
        browser = await p.chromium.launch(
            headless=False,
            proxy={
                "server": "your_proxy_server",
                "username": "your_username",
                "password": "your_password"
            }
        )
        page = await browser.new_page()
        await page.goto(
            'https://www.glassdoor.com/Job/united-states-software-engineer-jobs-SRCH_IL.0,13_IN1_KO14,31.htm',
            timeout=60000
        )
        content = await page.content()
        await browser.close()
        
        # Parse the rendered HTML with lxml.
        parser = fromstring(content)
        job_posting_elements = parser.xpath('//li[@data-test="jobListing"]')
        
        jobs_data = []
        for element in job_posting_elements:
            job_title = element.xpath('.//a[@data-test="job-title"]/text()')[0]
            job_location = element.xpath('.//div[@data-test="emp-location"]/text()')[0]
            salary = ' '.join(element.xpath('.//div[@data-test="detailSalary"]/text()')).strip()
            # Listing links are relative, so prepend the domain.
            job_link = "https://www.glassdoor.com" + element.xpath('.//a[@data-test="job-title"]/@href')[0]
            easy_apply = bool(element.xpath('.//div[@data-role-variant="featured"]'))
            # This hashed class name may change when Glassdoor redeploys.
            company = element.xpath('.//span[@class="EmployerProfile_compactEmployerName__LE242"]/text()')[0]
            
            job_data = {
                'company': company,
                'job_title': job_title,
                'job_location': job_location,
                'job_link': job_link,
                'salary': salary,
                'easy_apply': easy_apply
            }
            jobs_data.append(job_data)
        
        # Save everything to CSV.
        with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as file:
            writer = csv.DictWriter(file, fieldnames=['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply'])
            writer.writeheader()
            writer.writerows(jobs_data)
if __name__ == '__main__':
    asyncio.run(scrape_job_listings())
Key Takeaways
· Playwright is a game-changer for scraping Glassdoor. By driving a real browser, it makes our traffic look like genuine human behavior and greatly reduces the odds of detection.
· Proxies and headers are essential to avoid getting blocked.
· After scraping, you can easily store and analyze the data in CSV format.
Remember, scraping is not a free-for-all. Always ensure your actions align with Glassdoor's terms of service, and review those terms regularly, since they can change. To be respectful of the site's resources, implement rate limits so you don't bombard it with too many requests in a short period, and consider rotating proxies to minimize the risk of being flagged. Ethical scraping practices should always come first.
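As a concrete illustration of rate limiting, here's a minimal sketch that pauses between page fetches. The urls list is a placeholder for whatever listing pages you iterate over, and the two-to-five-second delay is just a reasonable starting point, not a value Glassdoor documents:
import asyncio
import random
async def fetch_politely(page, urls):
    """Visit each URL with a randomized pause in between."""
    for url in urls:
        await page.goto(url, timeout=60000)
        # ... extract data from the page here ...
        # Randomized delay avoids a rigid, bot-like request rhythm.
        await asyncio.sleep(random.uniform(2, 5))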