
Imagine having thousands of job listings, salary figures, and employer reviews at your fingertips. That's the power of Glassdoor, one of the most valuable resources for job seekers and recruiters alike. But here's the catch: scraping Glassdoor for its rich data isn't as simple as it sounds. With strong anti-bot measures in place, traditional scraping methods like Python's requests library can quickly land you in the "blocked" zone.
This is where Playwright comes into play. It's not just another scraping tool: it drives a real browser, letting us simulate human browsing behavior and making it far less likely we trigger blockers like CAPTCHAs and IP bans. Let's dive into scraping job listings from Glassdoor using Python and Playwright.
If you've tried scraping Glassdoor before, you know the challenges. The website's anti-bot mechanisms can easily flag suspicious activity. That's why Playwright is so useful: it lets us control a real browser session, complete with proxies and browser headers, so our requests look like they come from a genuine visitor. This significantly reduces the chances of detection and lets you gather data reliably.
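For example, Playwright lets you attach a realistic user agent and extra HTTP headers to a browser context, so every request its pages make resembles ordinary browser traffic. Here's a minimal sketch; the user-agent string is just an illustrative value, not one Glassdoor requires:
import asyncio
from playwright.async_api import async_playwright
async def demo_headers():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        # A browser context carries the user agent and any extra headers
        # on every request made by its pages.
        context = await browser.new_context(
            user_agent=("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                        "AppleWebKit/537.36 (KHTML, like Gecko) "
                        "Chrome/120.0.0.0 Safari/537.36")
        )
        await context.set_extra_http_headers({"Accept-Language": "en-US,en;q=0.9"})
        page = await context.new_page()
        await page.goto("https://www.glassdoor.com", timeout=60000)
        await browser.close()
asyncio.run(demo_headers())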
Before we get into the nitty-gritty of the code, here's what you need to set up:
· Python: We'll be using Python, of course.
· Playwright: The tool that makes browser automation possible.
· lxml: A fast and efficient library for parsing HTML.
To install these dependencies, run the following commands:
pip install playwright lxml
playwright install
Let's break it down. We're going to use Playwright to launch a browser, navigate to the job listings page, extract the job data, and save it into a CSV file.
We begin by launching a browser with Playwright. Don't forget to use a proxy to avoid getting blocked.
from playwright.async_api import async_playwright
from lxml.html import fromstring
async def scrape_job_listings():
    async with async_playwright() as p:
        # Launch a visible browser through a proxy; replace the
        # placeholder credentials with your own proxy details.
        browser = await p.chromium.launch(
            headless=False,
            proxy={
                "server": "your_proxy_server",
                "username": "your_username",
                "password": "your_password"
            }
        )
        page = await browser.new_page()
        # Navigate to the listings page; the generous timeout gives
        # the page time to render fully.
        await page.goto(
            'https://www.glassdoor.com/Job/united-states-software-engineer-jobs-SRCH_IL.0,13_IN1_KO14,31.htm',
            timeout=60000
        )
        # Grab the rendered HTML, then close the browser.
        content = await page.content()
        await browser.close()
        return content
The key here is launching the browser through a proxy, which helps us avoid detection. We navigate to the desired job listings page, retrieve the rendered HTML, and then close the browser.
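Since scrape_job_listings is a coroutine, one way to run it and hand its result to the parsing step is with asyncio. This short sketch defines the html_content variable that the next snippet assumes:
import asyncio
# Fetch the rendered HTML once; the parsing code below expects
# it to be available as html_content.
html_content = asyncio.run(scrape_job_listings())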
Now that we have the page's HTML, we can use lxml to parse and extract the job details.
parser = fromstring(html_content)
# Each job card sits in an <li> tagged with data-test="jobListing".
job_posting_elements = parser.xpath('//li[@data-test="jobListing"]')
jobs_data = []
for element in job_posting_elements:
    job_title = element.xpath('.//a[@data-test="job-title"]/text()')[0]
    job_location = element.xpath('.//div[@data-test="emp-location"]/text()')[0]
    salary = ' '.join(element.xpath('.//div[@data-test="detailSalary"]/text()')).strip()
    # Listing links are relative, so prepend the domain.
    job_link = "https://www.glassdoor.com" + element.xpath('.//a[@data-test="job-title"]/@href')[0]
    easy_apply = bool(element.xpath('.//div[@data-role-variant="featured"]'))
    # Note: this hashed class name is generated by Glassdoor's build
    # and may change; update it if the selector stops matching.
    company = element.xpath('.//span[@class="EmployerProfile_compactEmployerName__LE242"]/text()')[0]
    
    job_data = {
        'company': company,
        'job_title': job_title,
        'job_location': job_location,
        'job_link': job_link,
        'salary': salary,
        'easy_apply': easy_apply
    }
    jobs_data.append(job_data)
Here, we loop through each job posting, extracting key details like job title, location, salary, and company name.
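One caveat: indexing with [0] raises an IndexError whenever a card is missing a field, which happens on Glassdoor when, for example, no salary or location is posted. A small helper makes the extraction tolerant of missing fields. The first_text function below is a hypothetical addition, not part of the original code:
def first_text(element, xpath_expr, default=''):
    """Return the first text match for an XPath query, or a default."""
    matches = element.xpath(xpath_expr)
    return matches[0].strip() if matches else default
# Example: this returns '' instead of raising when no title is found.
# job_title = first_text(element, './/a[@data-test="job-title"]/text()')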
Once we've collected the job data, we save it into a CSV file for further analysis.
import csv
# Write one row per job; DictWriter maps each dict's keys to columns.
with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply'])
    writer.writeheader()
    writer.writerows(jobs_data)
This will save all the extracted data in a neat CSV file, which you can then analyze or import into a database.
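To sanity-check the output, you can read the file straight back with csv.DictReader, a quick sketch:
import csv
with open('glassdoor_job_listings.csv', newline='', encoding='utf-8') as file:
    for row in csv.DictReader(file):
        print(row['job_title'], '|', row['company'], '|', row['salary'])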
Here's the complete code, all put together:
import asyncio
import csv
from playwright.async_api import async_playwright
from lxml.html import fromstring
async def scrape_job_listings():
    async with async_playwright() as p:
        # Launch a visible browser through a proxy to reduce the odds
        # of being flagged; fill in your own proxy credentials.
        browser = await p.chromium.launch(
            headless=False,
            proxy={
                "server": "your_proxy_server",
                "username": "your_username",
                "password": "your_password"
            }
        )
        page = await browser.new_page()
        await page.goto(
            'https://www.glassdoor.com/Job/united-states-software-engineer-jobs-SRCH_IL.0,13_IN1_KO14,31.htm',
            timeout=60000
        )
        content = await page.content()
        await browser.close()
        
        # Parse the rendered HTML with lxml.
        parser = fromstring(content)
        job_posting_elements = parser.xpath('//li[@data-test="jobListing"]')
        
        jobs_data = []
        for element in job_posting_elements:
            job_title = element.xpath('.//a[@data-test="job-title"]/text()')[0]
            job_location = element.xpath('.//div[@data-test="emp-location"]/text()')[0]
            salary = ' '.join(element.xpath('.//div[@data-test="detailSalary"]/text()')).strip()
            # Listing links are relative, so prepend the domain.
            job_link = "https://www.glassdoor.com" + element.xpath('.//a[@data-test="job-title"]/@href')[0]
            easy_apply = bool(element.xpath('.//div[@data-role-variant="featured"]'))
            # This hashed class name may change when Glassdoor redeploys.
            company = element.xpath('.//span[@class="EmployerProfile_compactEmployerName__LE242"]/text()')[0]
            
            job_data = {
                'company': company,
                'job_title': job_title,
                'job_location': job_location,
                'job_link': job_link,
                'salary': salary,
                'easy_apply': easy_apply
            }
            jobs_data.append(job_data)
        
        # Save everything to CSV.
        with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as file:
            writer = csv.DictWriter(file, fieldnames=['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply'])
            writer.writeheader()
            writer.writerows(jobs_data)
if __name__ == '__main__':
    asyncio.run(scrape_job_listings())
Key Takeaways
· Playwright is a game-changer for scraping Glassdoor. By driving a real browser, it makes our traffic look like genuine human behavior and greatly reduces the odds of detection.
· Proxies and headers are essential to avoid getting blocked.
· After scraping, you can easily store and analyze the data in CSV format.
Remember, scraping is not a free-for-all. Always ensure your actions align with Glassdoor's terms of service, and review those terms regularly, since they can change. To be respectful of the site's resources, implement rate limits so you don't bombard it with too many requests in a short period, and consider rotating proxies to minimize the risk of being flagged. Ethical scraping practices should always come first.
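As a concrete illustration of rate limiting, here's a minimal sketch that pauses between page fetches. The urls list is a placeholder for whatever listing pages you iterate over, and the two-to-five-second delay is just a reasonable starting point, not a value Glassdoor documents:
import asyncio
import random
async def fetch_politely(page, urls):
    """Visit each URL with a randomized pause in between."""
    for url in urls:
        await page.goto(url, timeout=60000)
        # ... extract data from the page here ...
        # Randomized delay avoids a rigid, bot-like request rhythm.
        await asyncio.sleep(random.uniform(2, 5))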