Web Scrapping using proxy server in python
7th May 2020
Many a time we come across the situation where we have to scrap the website but due to various factors like
API time restrictions
we are not able to scrap the website or we cannot fully scrap the website before being detected as a bot. so to
solve this issue I am going to use
selenium webdriver for Scrapping the website using a rotating proxy so for this we require a bunch of proxies if we
have private IPs with us we can use that else
in this blog I am going to scrap
https://free-proxy-list.net/ to get a bunch of proxies
so overall my blog is divided into two-part first is scraping to get IPs and second is using that IPs to scarp
our intended website.
let's begin with scraping https://free-proxy-list.net/
for this, we will use Beautiful Soup which is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
you must be wondering about DesiredCapabilities what exactly it is . It is just a class which is used to modify multiple properties of web driver. Desired Capabilities class provides a set of key-value pairs to change individual properties of web driver such as browser name, browser platform, etc and in each request those properties are used. you can check different capabilities option on https://github.com/SeleniumHQ/selenium/wiki/DesiredCapabilities
let's begin with scraping https://free-proxy-list.net/
for this, we will use Beautiful Soup which is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
code sample for scraping the https://free-proxy-list.net/
from bs4 import BeautifulSoup
import requests
data = requests.get("https://free-proxy-list.net/")
parseddata = BeautifulSoup(data.text, "html.parser")
ips = []
raw_ips = parseddata.findAll('table')[0].tbody.findAll('tr')
for ip in raw_ips:
data = ip.findAll('td')
ips.append(data[0].contents[0] + ":" + data[1].contents[0])
This will give out the array of ip addresses in format ip:port and we will use this IPs to connect to external website
kindly note Free proxies available on the internet are always abused and end up being in blacklists used by
anti-scraping tools and web servers.
If you are doing serious large-scale data extraction, you should pay for some good proxies.
we can use any method to use this available IPs whether it be some random ip or we can use round-robin to pick up the
IPs
Second part consists of scraping using this IPs
We will use any of this available IPs to set manual proxy in selenium to scrap
suppose from array of IPs we selected the any IP using any method of our choice and it gave us
114.119.191.83:80 as the ip address to be used
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
PROXY = '114.119.191.83:80' # this is ip address selected from array of ips
webdriver.DesiredCapabilities.CHROME['proxy']={
"httpProxy":PROXY,
"ftpProxy":PROXY,
"sslProxy":PROXY,
"proxyType":"MANUAL",
}
driver = webdriver.Chrome(executable_path = "path",options=chrome_options)
using this driver we can visit any web page and scrap the content of it and before being detected as the Bot we
can change our proxy using any ip address from array of IP addresses. you must be wondering about DesiredCapabilities what exactly it is . It is just a class which is used to modify multiple properties of web driver. Desired Capabilities class provides a set of key-value pairs to change individual properties of web driver such as browser name, browser platform, etc and in each request those properties are used. you can check different capabilities option on https://github.com/SeleniumHQ/selenium/wiki/DesiredCapabilities