Find http:// and/or www. and strip them from a URL, leaving domain.com

Asked 31/1, 2013 at 12:22 Answered 14/2, 2019 at 9:57

I'm quite new to Python. I'm trying to parse a file of URLs to leave only the domain name.

Some of the URLs in my log file begin with http://, some begin with www., and some begin with both.

This is the part of my code which strips the http:// part. What do I need to add to it to look for both http and www. and remove both?

line = re.findall(r'(https?://\S+)', line)

Currently, when I run the code, only http:// is stripped. If I change the code to the following:

line = re.findall(r'(https?://www.\S+)', line)

Only domains starting with both are affected. I need the code to be more conditional. TIA

edit... here is my full code...

import re
import sys
from urlparse import urlparse

f = open(sys.argv[1], "r")

for line in f.readlines():
 line = re.findall(r'(https?://\S+)', line)
 if line:
  parsed=urlparse(line[0])
  print parsed.hostname
f.close()

I mistagged my original post as regex. It is indeed using urlparse.

Seymour answered 31/1, 2013 at 12:22 Comment(4)

Just a note: You do realise that www.domain.com is different from domain.com, right, and may point at wildly different IPs? – Greathouse 31/1, 2013 at 12:23

What about the domains www.www.com and www.com? – Frisket 31/1, 2013 at 12:30

Duplicate: https://mcmap.net/q/545620/-get-root-domain-of-link – Marisolmarissa 31/1, 2013 at 12:31

Duplicate: #569637 I'll delete my existing post now that I can comment :) – Xylotomy 5/3, 2013 at 14:42

You can do without regexes here.

with open("file_path","r") as f:
    lines = f.read()
    lines = lines.replace("http://","")
    lines = lines.replace("www.", "") # May replace some false positives ('www.com')
    urls = [url.split('/')[0] for url in lines.split()]
    print '\n'.join(urls)

Example file input:

http://foo.com/index.html
http://www.foobar.com
www.bar.com/?q=res
www.foobar.com

Output:

foo.com
foobar.com
bar.com
foobar.com

Edit:

There could be a tricky URL like foobarwww.com, and the above approach would strip the www. from it. In that case we have to go back to using regexes.

Replace the line lines = lines.replace("www.", "") with lines = re.sub(r'www\.(?!com)', '', lines) (note the escaped dot, so that it matches a literal "."). Of course, every possible TLD should be used for the not-match pattern.
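
An alternative that avoids listing TLDs is to match the prefixes only at the start of each line. The following is a minimal Python 3 sketch, assuming one URL per line as in the example input; the names `PREFIX` and `strip_prefixes` are illustrative. The lookahead keeps `www.` when no further dot follows in the host, so `www.com` survives, and `foobarwww.com` is never touched because the match is anchored.

```python
import re

# Optional scheme, then an optional "www." that is only removed when
# another dot follows in the host part (so "www.com" stays intact).
PREFIX = re.compile(r'^(?:https?://)?(?:www\.(?=[^/\s]*\.))?', re.MULTILINE)

def strip_prefixes(text):
    """Return the host part of every whitespace-separated URL in text."""
    cleaned = PREFIX.sub('', text)
    return [url.split('/')[0] for url in cleaned.split()]

sample = (
    "http://foo.com/index.html\n"
    "http://www.foobar.com\n"
    "www.bar.com/?q=res\n"
    "foobarwww.com\n"
    "www.com\n"
)
print('\n'.join(strip_prefixes(sample)))
```

This still treats `www.domain.com` and `domain.com` as the same host, which, as the comments on the question point out, is not guaranteed to be true.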

Poniard answered 31/1, 2013 at 12:25 Comment(6)

@DSM Don't worry, it isn't in use ;) – Marisolmarissa 31/1, 2013 at 12:32

Thanks, That works :) Any idea how I can remove everything after the .co.uk/.com etc? – Seymour 31/1, 2013 at 12:56

I didn't get what you mean by everything. Can you explain by an example? – Poniard 31/1, 2013 at 12:58

sure. some urls are links to pages. so in the case of foo.com/index.htm i would like to be left with just foo.com – Seymour 31/1, 2013 at 13:8

That's fantastic, works as I wanted it to. many many thanks. Sorry to be a pain, I find the docs for python difficult to understand. could you perhaps explain some of the amendments you made to your code to give me some idea of how it works? thanks again. – Seymour 31/1, 2013 at 13:51

About the edit: even better just look at the beginning of the string lines = re.sub(r'(^www.)',r'',lines) so not only works for .com. – Ofelia 1/6, 2021 at 23:48

It might be overkill for this specific situation, but I'd generally use urlparse.urlsplit (Python 2) or urllib.parse.urlsplit (Python 3).

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2
import re

url = 'www.python.org'

# URLs must have a scheme
# www.python.org is an invalid URL
# http://www.python.org is valid

if not re.match(r'http(s?)\:', url):
    url = 'http://' + url

# url is now 'http://www.python.org'

parsed = urlsplit(url)

# parsed.scheme is 'http'
# parsed.netloc is 'www.python.org'
# parsed.path is '' (an empty string), since (strictly speaking) the path was not defined

host = parsed.netloc  # www.python.org

# Removing www.
# This is a bad idea, because www.python.org could 
# resolve to something different than python.org

if host.startswith('www.'):
    host = host[4:]
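
The steps above can be wrapped into a small helper for processing a whole file. This is a sketch for Python 3 only; the function name `hostname_of` is made up for illustration, and it deliberately leaves any `www.` in place, given the caveat about the two names resolving differently.

```python
import re
from urllib.parse import urlsplit

def hostname_of(url):
    """Return the host of url, prepending a scheme when it is missing."""
    url = url.strip()
    if not re.match(r'https?://', url, re.IGNORECASE):
        # urlsplit only fills in netloc when the URL carries a scheme
        url = 'http://' + url
    # .hostname is netloc without any port or credentials, lower-cased
    return urlsplit(url).hostname

for raw in ['http://foo.com/index.html', 'www.bar.com/?q=res', 'https://www.python.org']:
    print(hostname_of(raw))
```

Checking for the scheme only at the start of the string means a URL embedded in a query string does not prevent the default scheme from being added.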

Mideast answered 31/1, 2013 at 12:31 Comment(5)

Doesn't immediately work for URLs starting without 'http://'. urlparse.urlsplit("www.foo.com").netloc will return ''. – Poniard 31/1, 2013 at 13:26

Yes, that's because www.foo.com is not a valid URL. – Mideast 31/1, 2013 at 13:35

The problem is that some of the urls in OP's file are of this format. – Poniard 31/1, 2013 at 13:38

Trying to mutate a SplitResult.netloc this way will result in an AttributeError being raised. In order to change netloc you will need to use _replace like so replaced = parsed._replace(netloc=host[4:]) – Humidifier 4/5, 2019 at 17:27

I'm not changing netloc. Am I? – Mideast 4/5, 2019 at 18:29

I came across the same problem. This is a solution based on regular expressions:

>>> import re
>>> rec = re.compile(r"https?://(www\.)?")

>>> rec.sub('', 'https://domain.com/bla/').strip().strip('/')
'domain.com/bla'

>>> rec.sub('', 'https://domain.com/bla/    ').strip().strip('/')
'domain.com/bla'

>>> rec.sub('', 'http://domain.com/bla/    ').strip().strip('/')
'domain.com/bla'

>>> rec.sub('', 'http://www.domain.com/bla/    ').strip().strip('/')
'domain.com/bla'

Mho answered 20/4, 2016 at 20:16 Comment(0)

Check out the urlparse module (urllib.parse in Python 3), which can do these things for you automatically.

>>> import urlparse  # Python 2
>>> urlparse.urlsplit('http://www.google.com.au/q?test')
SplitResult(scheme='http', netloc='www.google.com.au', path='/q', query='test', fragment='')

Jeannajeanne answered 31/1, 2013 at 12:27 Comment(0)

You can use urlparse. Also, the solution should be generic to remove things other than 'www' before the domain name (i.e., handle cases like server1.domain.com). The following is a quick try that should work:

from urlparse import urlparse

url = 'http://www.muneeb.org/files/alan_turing_thesis.jpg'

o = urlparse(url)

domain = o.hostname

temp = domain.split('.')

if len(temp) == 3:
    domain = temp[1] + '.' + temp[2]

print domain

Universally answered 3/7, 2013 at 17:54 Comment(0)

I believe @Muneeb Ali's answer is the closest to a solution, but a problem appears with something like frontdomain.domain.co.uk.

I suppose:

domain = ""
for i in range(1, len(temp) - 1):
    domain += temp[i] + "."  # append each middle label, dropping temp[0]
domain += temp[-1]

Is there a nicer way to do this?
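
One possible approach, sketched here for illustration: decide how many labels the public suffix occupies, then keep that suffix plus one label to its left. The set `TWO_LABEL_SUFFIXES` and the function `base_domain` are hypothetical and the set is deliberately incomplete; a robust version would consult the full Public Suffix List rather than a hand-written sample.

```python
# Hypothetical, incomplete sample of two-label public suffixes; a real
# implementation would load the full Public Suffix List instead.
TWO_LABEL_SUFFIXES = {'co.uk', 'org.uk', 'com.au'}

def base_domain(hostname):
    """Keep the suffix plus one label to its left, dropping any subdomains."""
    labels = hostname.lower().rstrip('.').split('.')
    suffix_len = 2 if '.'.join(labels[-2:]) in TWO_LABEL_SUFFIXES else 1
    return '.'.join(labels[-(suffix_len + 1):])

print(base_domain('frontdomain.domain.co.uk'))  # domain.co.uk
print(base_domain('www.muneeb.org'))            # muneeb.org
print(base_domain('server1.domain.com'))        # domain.com
```

Unlike slicing off the first label, this leaves a host that is already a bare domain, such as `domain.com`, unchanged.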

Reaction answered 14/2, 2019 at 9:57 Comment(0)
