How to get and parse proxies from a URL in Python 3

SwiftProxy
By - Emily Chan
2024-12-09 20:33:16

In the process of web crawling and data collection, proxy servers play a vital role. They can help us bypass IP restrictions, hide our true identity, and improve the efficiency of crawling. This article will detail how to obtain and parse proxy information from URLs in a Python 3 environment for use in subsequent crawling tasks.

What is a proxy?

A proxy server is an intermediary server that sits between a client and a target server. It receives requests from the client, forwards them to the target server, and returns the server's response to the client. Using a proxy hides our real IP address and helps us avoid being blocked or restricted by the target website.
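For instance, here is a minimal sketch of sending a single request through a proxy with the requests library (installed in the next section); the proxy address below is a placeholder for illustration, not a working proxy:

import requests

# Placeholder proxy address for illustration only; substitute a real one
proxy_address = 'http://203.0.113.10:8080'

# requests routes the call through the proxy matching the target URL's scheme
response = requests.get(
    'http://httpbin.org/ip',
    proxies={'http': proxy_address, 'https': proxy_address},
    timeout=10,
)
print(response.json())  # shows the IP address the target server saw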

Install related libraries

Before we start, we need to make sure that Python 3 is installed, along with a network request library (such as requests) and an HTML parsing library (such as BeautifulSoup). Both can be installed easily with pip:

pip install requests beautifulsoup4

Get proxy list from URL

First, we need a URL that serves proxy information. This can be a website offering free or paid proxy lists. We will use the requests library to send an HTTP request and retrieve the page content.

import requests

# Suppose we have a URL containing a list of proxies
proxy_url = 'http://example.com/proxies'

# Send a GET request to obtain the web page content
# (a timeout prevents the request from hanging indefinitely)
response = requests.get(proxy_url, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    page_content = response.text
else:
    print(f"Failed to retrieve page with status code: {response.status_code}")
    raise SystemExit(1)

Parsing proxy information

Next, we need to parse the web page content to extract the proxy information. This usually involves parsing HTML, and we can use the BeautifulSoup library to accomplish this task.

from bs4 import BeautifulSoup

# Parse the web page content using BeautifulSoup
soup = BeautifulSoup(page_content, 'html.parser')

# Assume the proxy information is stored in table rows, for example:
# <tr><td>IP address</td><td>Port</td></tr>
proxies = []
for row in soup.find_all('tr'):
    columns = row.find_all('td')
    if len(columns) == 2:
        ip = columns[0].text.strip()
        port = columns[1].text.strip()
        proxies.append(f"{ip}:{port}")

# Print the parsed proxy list
print(proxies)
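Not every source serves proxies as an HTML table. If the URL returns a plain-text list with one ip:port entry per line (a common format for proxy APIs), no HTML parsing is needed. A minimal sketch, assuming that format:

# Assuming the response body is plain text with one "ip:port" per line
proxies = []
for line in page_content.splitlines():
    line = line.strip()
    if line and ':' in line:
        proxies.append(line)

print(proxies)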

Verify Proxies

After getting the list of proxies, we need to verify that they actually work. We can do this by requesting a test URL through each proxy and checking the response. An endpoint like httpbin.org/ip is convenient because it echoes back the IP address it saw, which should be the proxy's address rather than our own.

import requests

# Define a test URL that echoes back the caller's IP address
test_url = 'http://httpbin.org/ip'

# Verify each proxy
valid_proxies = []
for proxy in proxies:
    # requests expects proxy URLs to include a scheme, e.g. "http://ip:port"
    proxy_address = f"http://{proxy}"
    try:
        # A short timeout keeps dead proxies from stalling the loop
        response = requests.get(
            test_url,
            proxies={'http': proxy_address, 'https': proxy_address},
            timeout=10,
        )
        if response.status_code == 200:
            valid_proxies.append(proxy)
            print(f"Valid proxy: {proxy}")
        else:
            print(f"Invalid proxy: {proxy} (Status code: {response.status_code})")
    except requests.exceptions.RequestException as e:
        print(f"Error testing proxy {proxy}: {e}")

Conclusion

Through the steps above, we have obtained and parsed proxy information from a URL and verified which proxies work. These proxies can now be used in our crawler tasks to improve their efficiency and stability. Keep in mind that you should comply with the target website's usage policies and with applicable laws and regulations so that your crawling stays legal and ethical.
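As a final illustration, here is a minimal sketch of putting the verified proxies to work, picking a random proxy from valid_proxies for each request so traffic is spread across them (the target URL is a placeholder):

import random

import requests

# Hypothetical target page; replace with the site you are crawling
target_url = 'http://example.com/page'

# Rotate: pick a random verified proxy (assumes valid_proxies is non-empty)
proxy = random.choice(valid_proxies)
proxy_address = f"http://{proxy}"

response = requests.get(
    target_url,
    proxies={'http': proxy_address, 'https': proxy_address},
    timeout=10,
)
print(response.status_code)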

About the author

Emily Chan
Lead Writer at Swiftproxy
Emily Chan is the lead writer at Swiftproxy, bringing over a decade of experience in technology, digital infrastructure, and strategic communications. Based in Hong Kong, she combines regional insight with a clear, practical voice to help businesses navigate the evolving world of proxy solutions and data-driven growth.
The content provided on the Swiftproxy Blog is intended solely for informational purposes and is presented without warranty of any kind. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information contained herein, nor does it assume any responsibility for content on third-party websites referenced in the blog. Prior to engaging in any web scraping or automated data collection activities, readers are strongly advised to consult with qualified legal counsel and to review the applicable terms of service of the target website. In certain cases, explicit authorization or a scraping permit may be required.