How to get and parse proxies from a URL in a Python 3 environment

SwiftProxy
By - Emily Chan
2024-12-09 20:33:16

In the process of web crawling and data collection, proxy servers play a vital role. They can help us bypass IP restrictions, hide our true identity, and improve the efficiency of crawling. This article will detail how to obtain and parse proxy information from URLs in a Python 3 environment for use in subsequent crawling tasks.

What is a proxy?

A proxy server is an intermediary server located between a client and a server. It receives requests from clients, forwards them to the target server, and returns the server's response to the client. Using a proxy can hide our real IP address and prevent being blocked or restricted by the target website.
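
To make this concrete, below is a minimal sketch of routing a single request through a proxy with the requests library; the address 203.0.113.10:8080 is a placeholder (a documentation-only IP), not a working proxy.

import requests

# Placeholder proxy address; replace with a real proxy before running
proxies = {
    'http': 'http://203.0.113.10:8080',
    'https': 'http://203.0.113.10:8080',
}

# The target site sees the proxy's IP address instead of ours
response = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=10)
print(response.json())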

Install related libraries

Before we start, make sure that Python 3 is installed, along with a library for network requests (such as requests) and one for HTML parsing (such as BeautifulSoup). Both can be installed easily with pip:

pip install requests beautifulsoup4

Get proxy list from URL

First, we need a URL containing proxy information. This URL can be a website that provides free or paid proxy services. We will use the requests library to send HTTP requests and get the web page content.
 

import requests

# Suppose we have a URL containing a list of proxies
proxy_url = 'http://example.com/proxies'

# Send a GET request to obtain the web page content
response = requests.get(proxy_url, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    page_content = response.text
else:
    print(f"Failed to retrieve page with status code: {response.status_code}")
    exit()

Parsing proxy information

Next, we need to parse the web page content to extract the proxy information. This usually involves parsing HTML, and we can use the BeautifulSoup library to accomplish this task.

from bs4 import BeautifulSoup

# Parsing web content using BeautifulSoup
soup = BeautifulSoup(page_content, 'html.parser')

# Assume that proxy information is stored in a specific HTML tag format
# For example: <tr><td>IP address</td><td>Port</td></tr>
proxies = []
for row in soup.find_all('tr'):
    columns = row.find_all('td')
    if len(columns) == 2:
        ip = columns[0].text.strip()
        port = columns[1].text.strip()
        proxies.append(f"{ip}:{port}")

# Print the parsed proxy list
print(proxies)
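
Every proxy site structures its pages differently, so here is a self-contained sketch of the same parsing logic applied to made-up HTML that follows the assumed <tr><td>IP</td><td>Port</td></tr> format, so you can see it end to end:

from bs4 import BeautifulSoup

# Made-up HTML in the assumed table format (documentation-only IPs)
sample_html = """
<table>
  <tr><td>203.0.113.10</td><td>8080</td></tr>
  <tr><td>198.51.100.7</td><td>3128</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
proxies = []
for row in soup.find_all('tr'):
    columns = row.find_all('td')
    if len(columns) == 2:
        proxies.append(f"{columns[0].text.strip()}:{columns[1].text.strip()}")

print(proxies)  # ['203.0.113.10:8080', '198.51.100.7:3128']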

Verify Proxies

After getting the list of proxies, we need to verify that they actually work. We can do this by requesting a test website through each proxy and checking the response; a short timeout keeps dead proxies from stalling the check.

import requests

# Define a test URL
test_url = 'http://httpbin.org/ip'

# Verify each proxy
valid_proxies = []
for proxy in proxies:
    try:
        response = requests.get(
            test_url,
            proxies={'http': f'http://{proxy}', 'https': f'http://{proxy}'},
            timeout=5,  # short timeout so dead proxies do not hang the check
        )
        if response.status_code == 200:
            valid_proxies.append(proxy)
            print(f"Valid proxy: {proxy}")
        else:
            print(f"Invalid proxy: {proxy} (Status code: {response.status_code})")
    except requests.exceptions.RequestException as e:
        print(f"Error testing proxy {proxy}: {e}")

Conclusion

Through the above steps, we have successfully obtained and parsed the proxy information from the URL and verified the effectiveness of these proxies. These proxies can now be used in our crawler tasks to improve the efficiency and stability of the crawler. It should be noted that we should comply with the usage policies and laws and regulations of the target website to ensure that our crawler behavior is legal and ethical.

About the Author

SwiftProxy
Emily Chan
Lead Writer at Swiftproxy
Emily Chan is the lead writer at Swiftproxy, with more than a decade of experience in technology, digital infrastructure, and strategic communications. Based in Hong Kong, she combines regional insight with clear, practical writing to help businesses navigate evolving proxy IP solutions and data-driven growth.
The content provided on the Swiftproxy blog is for informational purposes only and comes with no warranty of any kind. Swiftproxy does not guarantee the accuracy, completeness, or legal compliance of the information it contains, and accepts no responsibility for the content of third-party websites referenced in the blog. Readers are strongly advised to consult qualified legal counsel and carefully review the target website's terms of service before undertaking any web scraping or automated data collection. In some cases, explicit authorization or a scraping license may be required.