
Web Scraping Using a Proxy Server in Python

7th May 2020

We often need to scrape a website but run into obstacles such as API rate limits, or we get detected as a bot before we can finish scraping. To work around this, I am going to use Selenium WebDriver to scrape the website through a rotating proxy. For this we need a pool of proxies. If you have private proxy IPs, you can use those; otherwise, as in this blog, you can scrape https://free-proxy-list.net/ to collect some. The blog is therefore divided into two parts: the first scrapes the proxy list to get IPs, and the second uses those IPs to scrape our intended website.
Let's begin with scraping https://free-proxy-list.net/.
For this, we will use Beautiful Soup, a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

Code sample for scraping https://free-proxy-list.net/:

 
                     
    from bs4 import BeautifulSoup
    import requests

    # Download the page that lists the free proxies
    response = requests.get("https://free-proxy-list.net/")

    parsed = BeautifulSoup(response.text, "html.parser")
    ips = []
    # The first table on the page holds one proxy per row
    rows = parsed.find_all("table")[0].tbody.find_all("tr")
    for row in rows:
        cells = row.find_all("td")
        # Column 0 is the IP address, column 1 is the port
        ips.append(cells[0].get_text() + ":" + cells[1].get_text())

This gives us a list of proxy addresses in ip:port format, which we will use to connect to the external website. Kindly note that free proxies available on the internet are heavily abused and often end up on the blacklists used by anti-scraping tools and web servers. If you are doing serious large-scale data extraction, you should pay for some good proxies. We can pick from the available IPs using any method we like, for example choosing one at random or cycling through them round-robin.
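The two selection strategies mentioned above can be sketched in a few lines of standard-library Python. The addresses below are made-up placeholders standing in for the scraped list, and the helper names `pick_random` and `round_robin` are hypothetical, so treat this as an illustration rather than a fixed recipe.

```python
import itertools
import random

# Placeholder addresses (documentation ranges) standing in for the scraped `ips` list
ips = ["192.0.2.10:8080", "192.0.2.11:3128", "198.51.100.7:80"]


def pick_random(proxies):
    """Return one proxy chosen uniformly at random."""
    return random.choice(proxies)


def round_robin(proxies):
    """Return an endless iterator that cycles through the proxies in order."""
    return itertools.cycle(proxies)


rotation = round_robin(ips)
first_four = [next(rotation) for _ in range(4)]
# After the last proxy, the cycle wraps around to the first one again
print(first_four)
print(pick_random(ips))
```

Round-robin spreads requests evenly across the pool, while random selection makes the request pattern harder to predict; which one suits better depends on the target site.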

The second part consists of scraping using these IPs.
We will set one of the available IPs as a manual proxy in Selenium. Suppose that, using a method of our choice, we selected 114.119.191.83:80 from the list of IPs.

            
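    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    chrome_options.add_argument("--headless")  # run Chrome without a visible window
    PROXY = "114.119.191.83:80"  # the address selected from the list of IPs
    # Route HTTP, FTP and SSL traffic through the same manually configured proxy
    webdriver.DesiredCapabilities.CHROME["proxy"] = {
        "httpProxy": PROXY,
        "ftpProxy": PROXY,
        "sslProxy": PROXY,
        "proxyType": "MANUAL",
    }
    # "path" is a placeholder for the location of the chromedriver executable.
    # This call follows the Selenium 3 API; newer Selenium 4 releases drop the
    # executable_path argument in favor of a Service object.
    driver = webdriver.Chrome(executable_path="path", options=chrome_options)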

Using this driver we can visit any web page and scrape its content, and before being detected as a bot we can switch proxies by picking another address from the list of IPs.
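One possible way to switch proxies is to start a fresh driver for each address and move on to the next one when a page fails to load. The sketch below assumes that approach. The names `make_chrome_driver`, `scrape_with_rotation` and the `make_driver` parameter are hypothetical, and Chrome's `--proxy-server` command-line flag is used here instead of the DesiredCapabilities setting shown above because it can be passed separately to each driver instance.

```python
def make_chrome_driver(proxy):
    """Start a headless Chrome that sends its traffic through `proxy` (ip:port).

    Assumes Selenium and a matching chromedriver are installed.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    options.add_argument(f"--proxy-server=http://{proxy}")
    return webdriver.Chrome(options=options)


def scrape_with_rotation(url, proxies, make_driver=make_chrome_driver):
    """Try each proxy in turn and return (proxy, page_source) for the first success."""
    for proxy in proxies:
        driver = None
        try:
            driver = make_driver(proxy)
            driver.get(url)
            return proxy, driver.page_source
        except Exception as error:  # broad on purpose: dead proxies fail in many ways
            print(f"{proxy} failed: {error}")
        finally:
            # Always close the browser, whether the attempt worked or not
            if driver is not None:
                driver.quit()
    raise RuntimeError("no working proxy found")
```

With the scraped list in hand, a call such as `scrape_with_rotation("https://example.com", ips)` would try the addresses one by one. Passing the driver factory as a parameter also makes it easy to swap in a different browser or a stub for testing.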
You may be wondering what exactly DesiredCapabilities is. It is a class used to modify multiple properties of the web driver. It provides a set of key-value pairs for changing individual properties such as browser name and browser platform, and those properties are used in each request. You can check the different capability options at https://github.com/SeleniumHQ/selenium/wiki/DesiredCapabilities
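Because the proxy capability is just a set of key-value pairs, it can be built by a small helper before being assigned to the driver. This is a minimal sketch, assuming a hypothetical helper named `manual_proxy_capability` that also checks the ip:port format with the standard `ipaddress` module, since entries scraped from a public list may be malformed.

```python
import ipaddress


def manual_proxy_capability(proxy):
    """Build the "proxy" capability dictionary for an IPv4 address in ip:port form."""
    host, _, port = proxy.partition(":")
    ipaddress.ip_address(host)  # raises ValueError if the host is not a valid IP
    if not port.isdigit() or not 0 < int(port) < 65536:
        raise ValueError(f"invalid port in {proxy!r}")
    # Same keys as the manual proxy configuration used earlier in this blog
    return {
        "httpProxy": proxy,
        "ftpProxy": proxy,
        "sslProxy": proxy,
        "proxyType": "MANUAL",
    }


capability = manual_proxy_capability("114.119.191.83:80")
print(capability)
```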