Scraping free proxy listing website
I am trying to scrape one of the free proxy listing websites, but I haven't been able to extract the proxies.

Below is my code:

import requests
import re

url = 'https://free-proxy-list.net/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}

source = requests.get(url, headers=headers, timeout=10).text

proxies = re.findall(r'([0-9]{1,3}\.){3}[0-9]{1,3}(:[0-9]{2,4})?', source)

print(proxies)

I would highly appreciate it if someone could help me without using additional libraries/modules like BeautifulSoup.

Birdhouse answered 24/1, 2018 at 15:58 Comment(0)

It is generally best to use a parser such as BeautifulSoup to extract data from HTML rather than regular expressions, because it is very difficult to reproduce BeautifulSoup's accuracy with regex. However, you can try this with pure regex:

import re
import requests

url = 'https://free-proxy-list.net/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
source = requests.get(url, headers=headers, timeout=10).text
# Each findall result is a tuple of the two alternated groups; keep the non-empty one.
data = [list(filter(None, i))[0] for i in re.findall('<td class="hm">(.*?)</td>|<td>(.*?)</td>', source)]
# The cells repeat in groups of four: IP, port, country code, anonymity level.
groupings = [dict(zip(['ip', 'port', 'code', 'using_anonymous'], data[i:i+4])) for i in range(0, len(data), 4)]

Sample output (actual length is 300):

[{'ip': '47.88.242.10', 'port': '80', 'code': 'SG', 'using_anonymous': 'anonymous'}, {'ip': '118.189.172.136', 'port': '80', 'code': 'SG', 'using_anonymous': 'elite proxy'}, {'ip': '147.135.210.114', 'port': '54566', 'code': 'PL', 'using_anonymous': 'anonymous'}, {'ip': '5.148.150.155', 'port': '8080', 'code': 'GB', 'using_anonymous': 'elite proxy'}, {'ip': '186.227.8.21', 'port': '3128', 'code': 'BR', 'using_anonymous': 'anonymous'}, {'ip': '49.151.155.60', 'port': '8080', 'code': 'PH', 'using_anonymous': 'anonymous'}, {'ip': '52.170.255.17', 'port': '80', 'code': 'US', 'using_anonymous': 'anonymous'}, {'ip': '51.15.35.239', 'port': '3128', 'code': 'NL', 'using_anonymous': 'elite proxy'}, {'ip': '163.172.27.213', 'port': '3128', 'code': 'GB', 'using_anonymous': 'elite proxy'}, {'ip': '94.137.31.214', 'port': '8080', 'code': 'RU', 'using_anonymous': 'anonymous'}]

Edit:

To concatenate the IP and the port, iterate over each grouping and use string formatting:

final_groupings = [{'full_ip':"{ip}:{port}".format(**i)} for i in groupings]

Output:

[{'full_ip': '47.88.242.10:80'}, {'full_ip': '118.189.172.136:80'}, {'full_ip': '147.135.210.114:54566'}, {'full_ip': '5.148.150.155:8080'}, {'full_ip': '186.227.8.21:3128'}, {'full_ip': '49.151.155.60:8080'}, {'full_ip': '52.170.255.17:80'}, {'full_ip': '51.15.35.239:3128'}, {'full_ip': '163.172.27.213:3128'}, {'full_ip': '94.137.31.214:8080'}]
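
As a side note, the pattern in the question failed because re.findall returns only the capture groups whenever a pattern contains any; making the groups non-capturing yields whole matches. A minimal sketch of that fix (note that a bare IP regex can also pick up stray dotted numbers elsewhere in the page, so the table-cell approach above is more precise):

import re
import requests

url = 'https://free-proxy-list.net/'
source = requests.get(url, timeout=10).text

# (?:...) groups do not capture, so findall returns the full 'ip' or 'ip:port' string.
# The port range is widened to {2,5} since many listed ports have five digits.
proxies = re.findall(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}(?::[0-9]{2,5})?', source)
print(proxies[:10])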
Siphonostele answered 24/1, 2018 at 16:16 Comment(0)

You can also do something like the following, using BeautifulSoup instead of regex:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://free-proxy-list.net/', headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(res.text, "lxml")  # the lxml parser requires the lxml package; "html.parser" also works
for items in soup.select("#proxylisttable tbody tr"):
    # The first two cells of each row are the IP address and the port.
    proxy = ':'.join([item.text for item in items.select("td")[:2]])
    print(proxy)

Partial output:

122.183.139.109:8080
154.66.122.130:53281
110.77.183.158:42619
159.192.226.247:54214
47.89.41.164:80
Theory answered 24/1, 2018 at 20:47 Comment(1)
Thanks! This is a much cleaner solution than using regex above. – Deterge

An alternative to BeautifulSoup that you could use is pandas. I have had success scraping free-proxy-list.net using the pandas.read_html function:

import requests
import pandas as pd

resp = requests.get('https://free-proxy-list.net/')
# read_html parses every <table> on the page (it needs lxml or html5lib installed);
# the proxy list is the first table.
df = pd.read_html(resp.text)[0]

The resulting DataFrame is stored in df:

         IP Address     Port Code               Country    Anonymity Google Https    Last Checked
0      2.50.154.155  53281.0   AE  United Arab Emirates  elite proxy     no   yes   6 seconds ago
1    134.249.165.49  53281.0   UA               Ukraine  elite proxy     no   yes   6 seconds ago
2    158.58.133.106  41258.0   RU    Russian Federation  elite proxy     no   yes   6 seconds ago
3     92.52.186.123  32329.0   UA               Ukraine  elite proxy     no   yes   6 seconds ago
4     178.213.0.207  35140.0   UA               Ukraine  elite proxy     no   yes   6 seconds ago
..              ...      ...  ...                   ...          ...    ...   ...             ...
296    93.185.96.60  41003.0   CZ        Czech Republic  elite proxy     no   yes  22 minutes ago
297    1.20.103.248  52574.0   TH              Thailand  elite proxy     no   yes  22 minutes ago
298    190.210.8.92   8080.0   AR             Argentina  elite proxy     no   yes  22 minutes ago
299  166.150.32.182  56074.0   US         United States  elite proxy     no   yes  22 minutes ago
300             NaN      NaN  NaN                   NaN          NaN    NaN   NaN             NaN

[301 rows x 8 columns]

This DataFrame can now be manipulated any which way. For example, say I only wanted elite proxies that are also listed in the United States; I could do something like df[(df['Anonymity'] == 'elite proxy') & (df['Country'] == 'United States')], which would return

         IP Address     Port Code        Country    Anonymity Google Https    Last Checked
32    138.68.53.220   5836.0   US  United States  elite proxy     no   yes   6 seconds ago
76   173.217.255.36  33351.0   US  United States  elite proxy     no    no  10 seconds ago
86    24.172.34.114  40675.0   US  United States  elite proxy     no    no  10 seconds ago
111   209.190.32.28   3128.0   US  United States  elite proxy     no   yes  10 seconds ago
150  104.148.76.176   3128.0   US  United States  elite proxy     no    no  11 minutes ago
151  104.148.76.185   3128.0   US  United States  elite proxy     no    no  11 minutes ago
168  104.148.76.136   3128.0   US  United States  elite proxy     no    no  11 minutes ago
169  104.148.76.182   3128.0   US  United States  elite proxy     no    no  11 minutes ago
182  104.148.76.183   3128.0   US  United States  elite proxy     no   yes  11 minutes ago
184      3.95.11.66   3128.0   US  United States  elite proxy     no   yes  12 minutes ago
190    63.249.67.70  53281.0   US  United States  elite proxy     no    no  12 minutes ago
288  205.201.49.141  53281.0   US  United States  elite proxy     no   yes  22 minutes ago
299  166.150.32.182  56074.0   US  United States  elite proxy     no   yes  22 minutes ago

From here, it's as easy as df['IP Address'] and df['Port'] to get the IP addresses and their associated ports.
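
As an illustration (not part of the original answer), the two columns can be joined into host:port strings; read_html parses the ports as floats, so they need an integer cast first:

# Drop the trailing NaN row, cast the float ports back to int, and join.
clean = df.dropna(subset=['IP Address', 'Port'])
endpoints = (clean['IP Address'] + ':' + clean['Port'].astype(int).astype(str)).tolist()
print(endpoints[:2])  # e.g. ['2.50.154.155:53281', '134.249.165.49:53281']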

Irmgardirmina answered 16/5, 2020 at 3:7 Comment(1)
This works for me, and is a much simpler and faster solution. +1! – Detonate

If you just need a proxy list, you can use the following library:

https://pypi.org/project/free-proxy/

It scrapes proxies from https://www.sslproxies.org/. I have tested it against a few proxies; both sites have the same data.
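
A minimal usage sketch, assuming the FreeProxy interface documented on that PyPI page (install with pip install free-proxy):

from fp.fp import FreeProxy

# Returns a single working proxy URL, e.g. 'http://113.160.218.14:8888'.
proxy = FreeProxy(timeout=1, rand=True).get()
print(proxy)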

Peristyle answered 6/10, 2021 at 17:7 Comment(1)
I was able to get a residential proxy gateway and this solution working; many thanks, this will save me a lot. – Cyma

You can use the Agenty Chrome extension to write and test CSS selectors easily, and then use that configuration to run it with BeautifulSoup. Here is an example: https://forum.agenty.com/t/how-to-scrape-free-proxy-list-from-internet/19

Full disclosure: I am the developer of this product.

Sphenoid answered 23/7, 2019 at 10:55 Comment(0)

Building on @Abdul Majeed's answer: the free-proxy package fetches the entire proxy list but then only returns a single proxy to the user. To get the entire list, its code can be amended as follows:

import requests
import lxml.html as lh

class FreeProxyException(Exception):
    '''Exception class with message as a required parameter'''
    def __init__(self, message) -> None:
        self.message = message
        super().__init__(self.message)

try:
    page = requests.get('https://www.sslproxies.org')
    doc = lh.fromstring(page.content)
except requests.exceptions.RequestException as e:
    raise FreeProxyException('Request to www.sslproxies.org failed') from e
try:
    # Each table row holds one proxy; cell 0 is the IP address, cell 1 the port.
    tr_elements = doc.xpath('//*[@id="list"]//tr')
    proxy_list = [f'{tr_elements[i][0].text_content()}:{tr_elements[i][1].text_content()}' for i in range(1, len(tr_elements))]
except Exception as e:
    raise FreeProxyException('Failed to get list of proxies') from e

This returns the proxy list:

['85.195.104.71:80',
 '177.12.238.100:3128',
 '151.181.91.10:80',
 '149.129.131.46:8080',
 '85.214.124.194:5001',
 '54.194.252.228:3128',
 '103.169.20.46:8080',
 '177.12.238.1:3128',
 '204.185.204.64:8080',
 '45.169.162.1:3128',
 '170.39.194.156:3128',
 '198.59.191.234:8080',
 '200.105.215.18:33630',
 '212.71.255.43:38613',
 '115.75.70.79:4100',
 '193.242.138.1:3128',
 '192.53.163.144:3128',
 '193.122.71.184:3128',
 '8.209.249.96:8080',
 '144.217.131.61:3148',
 '45.56.75.90:5344',
 '149.129.239.170:8080',
 '143.198.40.24:8888',
 '66.175.223.147:4153',
 '194.195.213.197:1080',
...
 '185.204.170.116:80',
 '111.225.152.74:8089',
 '111.225.153.204:8089',
 '151.234.44.60:8080',
 '91.106.212.14:3128']
Augustina answered 20/10, 2022 at 14:19 Comment(0)

An extended version of the regex approach above that captures all eight table columns:

import requests
import re

url = "https://free-proxy-list.net/"
#url = "https://www.sslproxies.org/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
source = requests.get(url, headers=headers, timeout=10).text
# Match table cells of every class variant the page uses.
data_list = re.findall(r"""<td class="hm">(.*?)</td>|<td>(.*?)</td>|<td class="hx">(.*?)</td>|<td class='hm'>(.*?)</td>|<td class='hx'>(.*?)</td>""", source)
data_list = data_list[0:250*8]  # cap the records so we don't stray into another table on the page
data = [list(filter(None, i))[0] if len(list(filter(None, i))) > 0 else '' for i in data_list]
groupings = [dict(zip(['ip', 'port', 'code', 'country', 'anonymity', 'google', 'https', 'last_checked'], data[i:i+8])) for i in range(0, len(data), 8)]
print(groupings)
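
Whichever scraping approach you use, free proxies go stale quickly, so it is worth checking them before use. A quick sketch using requests' standard proxies parameter (httpbin.org serves purely as an illustrative echo endpoint):

import requests

def is_alive(proxy, timeout=5):
    """Return True if the proxy can relay a simple HTTPS request."""
    try:
        resp = requests.get('https://httpbin.org/ip',
                            proxies={'http': f'http://{proxy}', 'https': f'http://{proxy}'},
                            timeout=timeout)
        return resp.ok
    except requests.exceptions.RequestException:
        return False

# Filter the groupings from above down to proxies that actually respond.
working = [p for p in groupings if is_alive(f"{p['ip']}:{p['port']}")]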
Itacolumite answered 26/2 at 7:57 Comment(0)
