How to Use Playwright to Scrape Glassdoor Data

SwiftProxy
By Linh Tran
2025-02-28 15:35:24


Imagine accessing thousands of job listings, salary information, and employer reviews at your fingertips. That's the power of Glassdoor, one of the most valuable resources for job seekers and recruiters alike. But here's the catch: scraping Glassdoor for its rich data isn't as simple as it sounds. With strong anti-bot measures in place, using traditional scraping methods like requests can quickly land you in the "blocked" zone.
This is where Playwright comes into play. It's not just another scraping tool—it allows us to simulate real human browsing behavior, sidestepping the typical blockers like CAPTCHAs and IP bans. Let's dive into scraping job listings from Glassdoor using Python and Playwright.

Why Use Playwright for Scraping

If you've tried scraping Glassdoor before, you know the challenges. The website's anti-bot mechanisms can easily flag suspicious activity. That's why Playwright is crucial—it lets us control a real browser session, complete with proxies and browser headers to make our requests look genuine. This dramatically reduces the chances of detection and lets you gather data seamlessly.

Necessary Tools and Resources

Before we get into the nitty-gritty of the code, here's what you need to set up:

· Python: We'll be using Python, of course.

· Playwright: The tool that makes browser automation possible.

· lxml: A fast and efficient library for parsing HTML.
To install these dependencies, run the following commands:
pip install playwright lxml
playwright install

How to Scrape Glassdoor Job Listings Effectively

Let's break it down. We're going to use Playwright to launch a browser, navigate to the job listings page, extract the job data, and save it into a CSV file.

Step 1: Configuring Playwright and Making Requests

We begin by launching a browser with Playwright. Don't forget to use a proxy to avoid getting blocked.

from playwright.async_api import async_playwright
from lxml.html import fromstring

async def scrape_job_listings():
    async with async_playwright() as p:
        # Launch a real Chromium instance; route traffic through your proxy
        browser = await p.chromium.launch(
            headless=False,
            proxy={"server": "your_proxy_server", "username": "your_username", "password": "your_password"}
        )
        page = await browser.new_page()
        await page.goto('https://www.glassdoor.com/Job/united-states-software-engineer-jobs-SRCH_IL.0,13_IN1_KO14,31.htm', timeout=60000)
        content = await page.content()
        await browser.close()
        return content

The key here is launching the browser with a proxy, which will help us bypass detection. We navigate to the desired job listings page, retrieve the HTML content, and then close the browser.

Step 2: Parsing the HTML to Extract Data

Now that we have the page's HTML, we can use lxml to parse and extract the job details.

# html_content is the HTML string returned by scrape_job_listings()
parser = fromstring(html_content)
job_posting_elements = parser.xpath('//li[@data-test="jobListing"]')

jobs_data = []
for element in job_posting_elements:
    job_title = element.xpath('.//a[@data-test="job-title"]/text()')[0]
    job_location = element.xpath('.//div[@data-test="emp-location"]/text()')[0]
    salary = ' '.join(element.xpath('.//div[@data-test="detailSalary"]/text()')).strip()
    # hrefs are relative, so prepend the domain to get a usable URL
    job_link = "https://www.glassdoor.com" + element.xpath('.//a[@data-test="job-title"]/@href')[0]
    easy_apply = bool(element.xpath('.//div[@data-role-variant="featured"]'))
    company = element.xpath('.//span[@class="EmployerProfile_compactEmployerName__LE242"]/text()')[0]
    
    job_data = {
        'company': company,
        'job_title': job_title,
        'job_location': job_location,
        'job_link': job_link,
        'salary': salary,
        'easy_apply': easy_apply
    }
    jobs_data.append(job_data)

Here, we loop through each job posting, extracting key details like job title, location, salary, and company name.
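One caveat: indexing `[0]` on an XPath result raises IndexError whenever a listing lacks that field (many postings omit salary, for instance). A small helper makes extraction tolerant of missing fields; the HTML fragment below is a made-up minimal listing used only to demonstrate it:

```python
from lxml.html import fromstring

def first_text(element, xpath_expr, default=''):
    """Return the first XPath match stripped of whitespace, or a default if nothing matched."""
    results = element.xpath(xpath_expr)
    return results[0].strip() if results else default

# Hypothetical minimal listing, just for demonstration
el = fromstring('<li data-test="jobListing"><a data-test="job-title">Engineer</a></li>')
print(first_text(el, './/a[@data-test="job-title"]/text()'))       # Engineer
print(first_text(el, './/div[@data-test="detailSalary"]/text()'))  # empty string, no crash
```

Swapping `first_text(element, ...)` in for the bare `[0]` lookups in the loop above keeps a single incomplete listing from aborting the whole scrape.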

Step 3: Saving the Data

Once we've collected the job data, we save it into a CSV file for further analysis.

import csv

with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply'])
    writer.writeheader()
    writer.writerows(jobs_data)

This will save all the extracted data in a neat CSV file, which you can then analyze or import into a database.
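As a quick sanity check, you can read the file back with csv.DictReader. The row below is a made-up example; note that CSV stores everything as text, so the boolean easy_apply field comes back as the string 'True':

```python
import csv

# A made-up listing, just to demonstrate the round trip
fieldnames = ['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply']
sample = [{'company': 'Acme', 'job_title': 'Engineer', 'job_location': 'Remote',
           'job_link': 'https://www.glassdoor.com/job/1', 'salary': '$100K', 'easy_apply': True}]

with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(sample)

# Read it back: every value, including the boolean, is now a string
with open('glassdoor_job_listings.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))
```

If you need real booleans or numbers downstream, convert them explicitly after reading, or load the file with a library such as pandas instead.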

Full Code in Action

Here's the complete code all put together:

import csv
from playwright.async_api import async_playwright
from lxml.html import fromstring

async def scrape_job_listings():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            proxy={"server": "your_proxy_server", "username": "your_username", "password": "your_password"}
        )
        page = await browser.new_page()
        await page.goto('https://www.glassdoor.com/Job/united-states-software-engineer-jobs-SRCH_IL.0,13_IN1_KO14,31.htm', timeout=60000)
        content = await page.content()
        await browser.close()
        
        parser = fromstring(content)
        job_posting_elements = parser.xpath('//li[@data-test="jobListing"]')
        
        jobs_data = []
        for element in job_posting_elements:
            job_title = element.xpath('.//a[@data-test="job-title"]/text()')[0]
            job_location = element.xpath('.//div[@data-test="emp-location"]/text()')[0]
            salary = ' '.join(element.xpath('.//div[@data-test="detailSalary"]/text()')).strip()
            job_link = "https://www.glassdoor.com" + element.xpath('.//a[@data-test="job-title"]/@href')[0]
            easy_apply = bool(element.xpath('.//div[@data-role-variant="featured"]'))
            company = element.xpath('.//span[@class="EmployerProfile_compactEmployerName__LE242"]/text()')[0]
            
            job_data = {
                'company': company,
                'job_title': job_title,
                'job_location': job_location,
                'job_link': job_link,
                'salary': salary,
                'easy_apply': easy_apply
            }
            jobs_data.append(job_data)
        
        with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as file:
            writer = csv.DictWriter(file, fieldnames=['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply'])
            writer.writeheader()
            writer.writerows(jobs_data)

import asyncio

if __name__ == '__main__':
    asyncio.run(scrape_job_listings())

Key Takeaways

· Playwright is the game-changer for scraping Glassdoor. It allows us to bypass detection by simulating real browser behavior.

· Proxies and headers are essential to avoid getting blocked.

· After scraping, you can easily store and analyze the data in CSV format.
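To make the proxy-and-headers point concrete, here's a minimal sketch of the two settings involved. The proxy address and User-Agent string are placeholders, not working values:

```python
# Placeholder proxy credentials -- substitute your own provider's details
PROXY = {
    "server": "http://proxy.example.com:8080",
    "username": "your_username",
    "password": "your_password",
}

# Extra headers that make the session resemble a normal browser visit
HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
    "Accept-Language": "en-US,en;q=0.9",
}

# In the scraper these would be applied as:
#   browser = await p.chromium.launch(proxy=PROXY)
#   page = await browser.new_page()
#   await page.set_extra_http_headers(HEADERS)
```

Playwright's `page.set_extra_http_headers()` attaches the headers to every request the page makes, so you set them once per page rather than per navigation.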

Conclusion

Remember, scraping is not a free-for-all. Always ensure that your actions align with Glassdoor's terms of service. To be respectful of their resources, implement rate limits to avoid bombarding the site with too many requests in a short period. Using rotating proxies can help minimize the risk of being flagged, and ethical scraping practices should always be followed. Regularly review the terms of service to ensure that your methods to scrape Glassdoor data comply with them.
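The rate-limiting advice above can be sketched with a small helper that sleeps a random interval between page loads. The 2-5 second range is an illustrative choice, not a limit Glassdoor publishes:

```python
import asyncio
import random

async def polite_pause(min_s=2.0, max_s=5.0):
    """Sleep a random interval so requests arrive at a human-like, gentle pace."""
    delay = random.uniform(min_s, max_s)
    await asyncio.sleep(delay)
    return delay

# In a scraping loop, call this between successive page.goto() requests:
#   await page.goto(next_url)
#   await polite_pause()
```

Randomizing the delay, rather than sleeping a fixed interval, avoids the perfectly regular request timing that anti-bot systems can flag.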

About the author

SwiftProxy
Linh Tran
Senior Technology Analyst at Swiftproxy
Linh Tran is a Hong Kong-based technology writer with a background in computer science and over eight years of experience in the digital infrastructure space. At Swiftproxy, she specializes in making complex proxy technologies accessible, offering clear, actionable insights for businesses navigating the fast-evolving data landscape across Asia and beyond.
The content provided on the Swiftproxy Blog is intended solely for informational purposes and is presented without warranty of any kind. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information contained herein, nor does it assume any responsibility for content on third-party websites referenced in the blog. Prior to engaging in any web scraping or automated data collection activities, readers are strongly advised to consult with qualified legal counsel and to review the applicable terms of service of the target website. In certain cases, explicit authorization or a scraping permit may be required.