How to Use Playwright to Scrape Glassdoor Data

SwiftProxy
By Linh Tran
2025-02-28 15:35:24

Imagine having thousands of job listings, salary figures, and employer reviews at your fingertips. That's the power of Glassdoor, one of the most valuable resources for job seekers and recruiters alike. But here's the catch: scraping Glassdoor for its rich data isn't as simple as it sounds. With strong anti-bot measures in place, traditional approaches such as Python's requests library can quickly land you in the "blocked" zone.
This is where Playwright comes in. It's not just another scraping tool: it lets us drive a real browser and simulate human browsing behavior, sidestepping typical blockers like CAPTCHAs and IP bans. Let's dive into scraping job listings from Glassdoor using Python and Playwright.

Why Use Playwright for Scraping

If you've tried scraping Glassdoor before, you know the challenges. The site's anti-bot mechanisms readily flag suspicious activity. That's why Playwright is crucial: it lets us control a real browser session, complete with proxies and browser headers, so our requests look like they come from a genuine user. This dramatically reduces the chance of detection and lets you gather data reliably.
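
Beyond the proxy itself, a realistic browser context (user agent, locale, viewport, headers) helps a session blend in. Here's a minimal sketch; the user-agent string, header values, and viewport size are illustrative placeholders, not values Glassdoor requires:

import asyncio
from playwright.async_api import async_playwright

async def open_realistic_session():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        # A browser context carries its own user agent, locale, and viewport,
        # which makes the session look like an ordinary desktop browser.
        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/122.0.0.0 Safari/537.36"
            ),  # illustrative user agent, not a required value
            locale="en-US",
            viewport={"width": 1366, "height": 768},
        )
        # Extra headers are sent with every request made from this context.
        await context.set_extra_http_headers({"Accept-Language": "en-US,en;q=0.9"})
        page = await context.new_page()
        await page.goto("https://www.glassdoor.com", timeout=60000)
        await browser.close()

asyncio.run(open_realistic_session())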

Necessary Tools and Resources

Before we get into the nitty-gritty of the code, here's what you need to set up:

· Python: We'll be using Python, of course.

· Playwright: The tool that makes browser automation possible.

· lxml: A fast and efficient library for parsing HTML.

To install these dependencies, run the following commands:

pip install playwright lxml
playwright install

How to Scrape Glassdoor Job Listings Effectively

Let's break it down. We're going to use Playwright to launch a browser, navigate to the job listings page, extract the job data, and save it into a CSV file.

Step 1: Configuring Playwright and Making Requests

We begin by launching a browser with Playwright. Don't forget to use a proxy to avoid getting blocked.

from playwright.async_api import async_playwright
from lxml.html import fromstring

async def scrape_job_listings():
    async with async_playwright() as p:
        # Launch a visible browser through a proxy; replace the placeholder
        # credentials with your own proxy details.
        browser = await p.chromium.launch(
            headless=False,
            proxy={"server": "your_proxy_server", "username": "your_username", "password": "your_password"}
        )
        page = await browser.new_page()
        # The generous timeout gives the page time to finish loading behind the proxy.
        await page.goto('https://www.glassdoor.com/Job/united-states-software-engineer-jobs-SRCH_IL.0,13_IN1_KO14,31.htm', timeout=60000)
        # Capture the fully rendered HTML before closing the browser.
        content = await page.content()
        await browser.close()
        return content

The key here is launching the browser with a proxy, which will help us bypass detection. We navigate to the desired job listings page, retrieve the HTML content, and then close the browser.
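
Glassdoor renders its listings with JavaScript, so it's worth confirming the job cards exist in the DOM before capturing the HTML. Here's a hedged variant of the function above using Playwright's wait_for_selector; the selector matches the data-test attribute we parse in the next step:

from playwright.async_api import async_playwright

async def scrape_job_listings_when_ready():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            proxy={"server": "your_proxy_server", "username": "your_username", "password": "your_password"}
        )
        page = await browser.new_page()
        await page.goto('https://www.glassdoor.com/Job/united-states-software-engineer-jobs-SRCH_IL.0,13_IN1_KO14,31.htm', timeout=60000)
        # Block until at least one job card is attached to the DOM, so the
        # captured HTML actually contains the listings.
        await page.wait_for_selector('li[data-test="jobListing"]', timeout=30000)
        content = await page.content()
        await browser.close()
        return content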

Step 2: Parsing the HTML to Extract Data

Now that we have the page's HTML, we can use lxml to parse and extract the job details.

# html_content is the HTML string returned by scrape_job_listings()
parser = fromstring(html_content)
job_posting_elements = parser.xpath('//li[@data-test="jobListing"]')

jobs_data = []
for element in job_posting_elements:
    job_title = element.xpath('.//a[@data-test="job-title"]/text()')[0]
    job_location = element.xpath('.//div[@data-test="emp-location"]/text()')[0]
    # Salary can span several text nodes, so join and trim them.
    salary = ' '.join(element.xpath('.//div[@data-test="detailSalary"]/text()')).strip()
    # Listing links are relative, so prepend the site root.
    job_link = "https://www.glassdoor.com" + element.xpath('.//a[@data-test="job-title"]/@href')[0]
    easy_apply = bool(element.xpath('.//div[@data-role-variant="featured"]'))
    # This hashed class name is brittle; Glassdoor may change it between deployments.
    company = element.xpath('.//span[@class="EmployerProfile_compactEmployerName__LE242"]/text()')[0]
    
    job_data = {
        'company': company,
        'job_title': job_title,
        'job_location': job_location,
        'job_link': job_link,
        'salary': salary,
        'easy_apply': easy_apply
    }
    jobs_data.append(job_data)

Here, we loop through each job posting, extracting key details like job title, location, salary, and company name.
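
One caveat: indexing [0] into an XPath result raises an IndexError whenever a card lacks that field, and not every listing shows, say, a location. Here's a small defensive helper you could substitute into the loop; the function name is our own, not part of lxml:

def first_or_default(element, xpath, default=''):
    # Return the first XPath match, or a fallback when nothing matches.
    matches = element.xpath(xpath)
    return matches[0] if matches else default

# Usage inside the loop, e.g. for fields that may be absent:
# job_location = first_or_default(element, './/div[@data-test="emp-location"]/text()')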

Step 3: Saving the Data

Once we've collected the job data, we save it into a CSV file for further analysis.

import csv

with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply'])
    writer.writeheader()
    writer.writerows(jobs_data)

This will save all the extracted data in a neat CSV file, which you can then analyze or import into a database.
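
From here the CSV drops straight into whatever analysis tool you prefer. For example, a quick look with pandas (assuming you have pandas installed; it is not a dependency of the scraper itself):

import pandas as pd

df = pd.read_csv('glassdoor_job_listings.csv')
print(df.head())                         # preview the first few listings
print(df['easy_apply'].value_counts())   # count Easy Apply vs. regular postings
print(df['company'].nunique(), 'distinct companies')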

Full Code in Action

Here's the complete code, all put together:

import asyncio
import csv
from playwright.async_api import async_playwright
from lxml.html import fromstring

async def scrape_job_listings():
    async with async_playwright() as p:
        # Launch a proxied browser session; replace the placeholder credentials.
        browser = await p.chromium.launch(
            headless=False,
            proxy={"server": "your_proxy_server", "username": "your_username", "password": "your_password"}
        )
        page = await browser.new_page()
        await page.goto('https://www.glassdoor.com/Job/united-states-software-engineer-jobs-SRCH_IL.0,13_IN1_KO14,31.htm', timeout=60000)
        content = await page.content()
        await browser.close()
        
        # Parse the rendered HTML and pull out each job card.
        parser = fromstring(content)
        job_posting_elements = parser.xpath('//li[@data-test="jobListing"]')
        
        jobs_data = []
        for element in job_posting_elements:
            job_title = element.xpath('.//a[@data-test="job-title"]/text()')[0]
            job_location = element.xpath('.//div[@data-test="emp-location"]/text()')[0]
            salary = ' '.join(element.xpath('.//div[@data-test="detailSalary"]/text()')).strip()
            # Listing links are relative, so prepend the site root.
            job_link = "https://www.glassdoor.com" + element.xpath('.//a[@data-test="job-title"]/@href')[0]
            easy_apply = bool(element.xpath('.//div[@data-role-variant="featured"]'))
            company = element.xpath('.//span[@class="EmployerProfile_compactEmployerName__LE242"]/text()')[0]
            
            job_data = {
                'company': company,
                'job_title': job_title,
                'job_location': job_location,
                'job_link': job_link,
                'salary': salary,
                'easy_apply': easy_apply
            }
            jobs_data.append(job_data)
        
        # Write everything to CSV for later analysis.
        with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as file:
            writer = csv.DictWriter(file, fieldnames=['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply'])
            writer.writeheader()
            writer.writerows(jobs_data)

asyncio.run(scrape_job_listings())
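
One caveat: asyncio.run() raises a RuntimeError in environments that already run an event loop, such as Jupyter notebooks. There, await the coroutine directly instead:

# In a Jupyter notebook, where an event loop is already running:
await scrape_job_listings()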

Key Takeaways

· Playwright is the game-changer for scraping Glassdoor. It allows us to bypass detection by simulating real browser behavior.

· Proxies and headers are essential to avoid getting blocked.

· After scraping, you can easily store and analyze the data in CSV format.

Conclusion

Remember, scraping is not a free-for-all. Always ensure that your actions align with Glassdoor's terms of service. To be respectful of their resources, implement rate limits to avoid bombarding the site with too many requests in a short period. Using rotating proxies can help minimize the risk of being flagged, and ethical scraping practices should always be followed. Regularly review the terms of service to ensure that your methods to scrape Glassdoor data comply with them.
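
As a concrete illustration of those last points, here is a minimal rate-limiting and proxy-rotation sketch; the proxy endpoints and delay range are placeholders to adapt to your own setup:

import asyncio
import random
from playwright.async_api import async_playwright

# Hypothetical proxy pool; substitute your own endpoints and credentials.
PROXIES = [
    {"server": "http://proxy1.example.com:8000", "username": "user", "password": "pass"},
    {"server": "http://proxy2.example.com:8000", "username": "user", "password": "pass"},
]

async def polite_fetch(url):
    proxy = random.choice(PROXIES)  # rotate proxies across requests
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy)
        page = await browser.new_page()
        await page.goto(url, timeout=60000)
        content = await page.content()
        await browser.close()
    # Pause between requests so we don't hammer the site.
    await asyncio.sleep(random.uniform(5, 15))
    return content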

About the Author

SwiftProxy
Linh Tran
Senior Technical Analyst at Swiftproxy
Linh Tran is a Hong Kong-based technical writer with a background in computer science and more than eight years of experience in the digital infrastructure field. At Swiftproxy, she focuses on making complex proxy technology easy to understand, providing businesses with clear, actionable insights to help them navigate the fast-evolving data landscape in Asia and beyond.
The content on the Swiftproxy blog is provided for informational purposes only and comes with no warranty of any kind. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information it contains, nor does it accept responsibility for the content of third-party websites referenced in the blog. Before engaging in any web scraping or automated data collection, readers are strongly advised to consult qualified legal counsel and carefully review the target website's terms of service. In some cases, explicit authorization or permission to scrape may be required.