I have a pandas DataFrame (df) with thousands of links like the ones below, for different users, in a column labeled url:
https://www.google.com/something
https://mail.google.com/anohtersomething
https://calendar.google.com/somethingelse
https://www.amazon.com/yetanotherthing
I have the following code:
import urlparse

# Create the output columns up front
df['protocol'] = ''
df['domain'] = ''
df['path'] = ''
df['query'] = ''
df['fragment'] = ''

unique_urls = df.url.unique()
l = len(unique_urls)
i = 0
for url in unique_urls:
    i += 1
    print "\r%d / %d" % (i, l),  # progress counter
    split = urlparse.urlsplit(url)
    # Boolean mask over the whole column for every unique URL
    row_index = df.url == url
    df.loc[row_index, 'protocol'] = split.scheme
    df.loc[row_index, 'domain'] = split.netloc
    df.loc[row_index, 'path'] = split.path
    df.loc[row_index, 'query'] = split.query
    df.loc[row_index, 'fragment'] = split.fragment
The code parses and splits the URLs correctly, but it is slow: I loop over every unique URL, and each iteration scans the whole url column to find the matching rows. Is there a more efficient way to parse the URLs?
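For reference, here is a minimal sketch of one direction that might avoid the repeated column scans: parse each distinct URL once, build a small lookup table, and join it back onto the original frame with a single merge. The helper name `add_url_parts` and the `url_col` parameter are hypothetical, and the import fallback is only there so the sketch runs on both Python 2 and Python 3. It is not benchmarked.

```python
import pandas as pd

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2, as in the code above


def add_url_parts(df, url_col="url"):
    """Hypothetical helper: split each distinct URL once, then join back."""
    # Parse only the distinct, non-null URLs
    unique_urls = df[url_col].dropna().unique()

    # One row per distinct URL; urlsplit returns a 5-tuple of
    # (scheme, netloc, path, query, fragment)
    parts = pd.DataFrame(
        [tuple(urlsplit(u)) for u in unique_urls],
        columns=["protocol", "domain", "path", "query", "fragment"],
    )
    parts[url_col] = unique_urls

    # A single left join replaces the per-URL boolean masks
    out = df.merge(parts, on=url_col, how="left")
    # The lookup keys are unique, so row count and order match df
    out.index = df.index
    return out
```

Called as `df = add_url_parts(df)`, this assumes df does not already contain the five output columns; if it does, the merge would add suffixed duplicates, so the empty placeholder columns would need to be dropped first.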