Finding "all" domains of a country

I'm looking for a way to find "all" the sites ending with a given TLD. I had several ideas on how to realize this, but I'm not sure what the best/most effective way is. I'm aware that pages that aren't linked anywhere can't be found by spiders and the like, so for this example I won't care about such isolated pages. What I want is to give my program a TLD as input and get a list of sites as output. For example:

# <program> .de
- spiegel.de
- deutsche-bank.de
...
- bild.de

So what is the best way to achieve this? Are there tools available to help me, or how would you program this?

Osrick answered 23/8, 2012 at 18:24 Comment(3)
Sure? A DNS zone transfer could give you the list, if and only if you are authorized to do an AXFR: en.wikipedia.org/wiki/DNS_zone_transfer – Moultrie
Hello Rene, thanks for your answer. I did some research based on your post, and I'm able to perform such AXFR queries for a single domain, but I'm unsure how I would do it for an entire TLD. I used dig for my tests. Are there better tools? – Osrick
AFAIK the DNS servers in the wild don't allow AXFR requests from non-authoritative hosts, which is what you and I probably have. If such a transfer is allowed, dig should be up to the task. – Moultrie
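
For illustration, here is a minimal sketch of what such a zone transfer looks like programmatically, using the third-party dnspython package (my own choice, not something mentioned in the comments above). The nameserver and zone below are placeholders, and in practice public TLD registries will refuse the transfer unless you are explicitly authorized:

import dns.query
import dns.zone

# Placeholder values: an authoritative nameserver for the zone, and the zone itself.
# A public TLD registry will almost certainly refuse this request (AXFR not allowed).
nameserver = "a.nic.de"
zone_name = "de."

try:
    zone = dns.zone.from_xfr(dns.query.xfr(nameserver, zone_name))
    for name in sorted(zone.nodes.keys()):
        print(name)
except Exception as e:
    print("Zone transfer refused or failed:", e)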

This answer might be a bit late but I've just found this.

You could try using Common Crawl's awesome data.

So, what is Common Crawl?

Common Crawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.

Using their URL search tool, query for .de, then download the result as a JSON file.

You will get a nice file of results, but you will then need to do some work on it, since it includes the full site map of each domain (hence crawling).

Another drawback is that some sites use a restrictive robots.txt file, so crawlers won't include them. Still, it's the best result I could find so far.
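
As a rough, hypothetical sketch (not part of the original answer): the Common Crawl data can also be queried programmatically through its index API at index.commoncrawl.org. The collection name, the *.de wildcard query, and the page parameter below are assumptions on my part and should be checked against the current documentation:

import json
from urllib.request import urlopen
from urllib.parse import urlencode

# Hypothetical collection name; pick a current one from https://index.commoncrawl.org/.
index = "https://index.commoncrawl.org/CC-MAIN-2015-40-index"
params = urlencode({"url": "*.de", "output": "json", "page": 0})

domains = set()
with urlopen(index + "?" + params) as response:
    for line in response:
        record = json.loads(line)
        # Keep only the hostname; the index lists every captured page, not just domains.
        host = record["url"].split("/")[2]
        domains.add(host)

for host in sorted(domains):
    print(host)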

Timothee answered 18/10, 2015 at 12:15 Comment(1)
It seems like their data is not updated often. For the .by domain the results contain only pages prior to Mar 2012. – Hughes

The code below is a multithreaded domain-checker script in Python 3. It uses a brute-force string generator: every possible combination of the given characters, up to the specified length, is appended to a list of candidate URLs, and each candidate is then requested. You may need to add some characters for your language; I successfully used it for Chinese, Russian, and Dutch sites.

from multiprocessing.pool import ThreadPool
from urllib.request import urlopen
from itertools import product
import sys

chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890'  # add all characters used in your language

# Build the list of candidate URLs: every combination of chars up to the given length.
urls = []
for length in range(1, 9999):  # change this upper bound to limit the label length
    to_attempt = product(chars, repeat=length)
    for attempt in to_attempt:
        urls.append("https://" + ''.join(attempt) + ".de")

sys.stdout = open('de.csv', 'wt')  # write the hits to a CSV file

def fetch_url(url):
    try:
        response = urlopen(url)
        return url, response.read(), None
    except Exception as e:
        return url, None, e

results = ThreadPool(4000).imap_unordered(fetch_url, urls)
for url, html, error in results:
    if error is None:
        print(url)
Truditrudie answered 9/8, 2018 at 11:0 Comment(2)
Good idea. One could also try a dictionary, as used by password crackers. – Solubilize
This code fills up memory while generating the links. I was not able to use it. – Fascicule
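
Regarding the memory issue in the last comment: a possible variant (a sketch under the same assumptions as the answer, not a drop-in replacement) is to feed candidates to the thread pool from a generator instead of building the full list first, so the candidates are never all in memory at once:

from multiprocessing.pool import ThreadPool
from urllib.request import urlopen
from itertools import product

chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890'  # add all characters used in your language

def candidate_urls(max_length):
    # Yield candidate URLs one at a time instead of storing them all in a list.
    for length in range(1, max_length + 1):
        for attempt in product(chars, repeat=length):
            yield "https://" + ''.join(attempt) + ".de"

def fetch_url(url):
    try:
        urlopen(url, timeout=5)
        return url
    except Exception:
        return None

# Pool size, timeout and max_length are arbitrary choices for this sketch.
with ThreadPool(100) as pool:
    # imap_unordered consumes the generator lazily, so only results pass through memory.
    for url in pool.imap_unordered(fetch_url, candidate_urls(3)):
        if url is not None:
            print(url)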
