How to extract a filename from a URL and append a word to it?
S

12

97

I have the following URL:

url = http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg

I would like to extract the file name in this URL: 09-09-201315-47-571378756077.jpg

Once I get this file name, I'm going to save it with this name to the Desktop.

filename = **extracted file name from the url**     
download_photo = urllib.urlretrieve(url, "/home/ubuntu/Desktop/%s.jpg" % (filename))

After this, I'm going to resize the photo. Once that is done, I'm going to save the resized version and append "_small" to the end of the filename.

downloadedphoto = Image.open("/home/ubuntu/Desktop/%s.jpg" % (filename))
resize_downloadedphoto = downloadedphoto.resize((300, 300), Image.ANTIALIAS)
resize_downloadedphoto.save("/home/ubuntu/Desktop/%s.jpg" % (filename + "_small"))

From this, what I am trying to achieve is to get two files, the original photo with the original name, then the resized photo with the modified name. Like so:

09-09-201315-47-571378756077.jpg

rename to:

09-09-201315-47-571378756077_small.jpg

How can I go about doing this?

Stain answered 10/9, 2013 at 19:32 Comment(0)
B
237

You can use urllib.parse.urlparse with os.path.basename:

import os
from urllib.parse import urlparse

url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
a = urlparse(url)
print(a.path)                    # Output: /kyle/09-09-201315-47-571378756077.jpg
print(os.path.basename(a.path))  # Output: 09-09-201315-47-571378756077.jpg

Your URL might contain percent-encoded characters like %20 for space or %E7%89%B9%E8%89%B2 for "特色". If that's the case, you'll need to unquote (or unquote_plus) them. You can also use pathlib.Path().name instead of os.path.basename, which makes it easier to add a suffix to the name (as asked in the original question):

from pathlib import Path
from urllib.parse import urlparse, unquote

url = "http://photographs.500px.com/kyle/09-09-2013%20-%2015-47-571378756077.jpg"
url_parsed = urlparse(url)
print(unquote(url_parsed.path))  # Output: /kyle/09-09-2013 - 15-47-571378756077.jpg
file_path = Path("/home/ubuntu/Desktop/") / unquote(Path(url_parsed.path).name)
print(file_path)        # Output: /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077.jpg

new_file = file_path.with_stem(file_path.stem + "_small")  # with_stem requires Python 3.9+
print(new_file)         # Output: /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077_small.jpg

Also, an alternative is to use unquote(urlparse(url).path.split("/")[-1]).
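
Putting the pieces together for the original question, a minimal sketch might look like the following. The helper name `local_paths_for` and its `suffix` parameter are made up for illustration. It uses `with_name` rather than `with_stem` so that it also runs on Python versions before 3.9.

```python
import posixpath
from pathlib import Path
from urllib.parse import urlparse, unquote


def local_paths_for(url, directory, suffix="_small"):
    """Return (original_path, suffixed_path) for the file named in `url`.

    Hypothetical helper: parse first, unquote second, then build both
    target paths inside `directory`.
    """
    # Take only the path component, so the query string and fragment are dropped.
    url_path = urlparse(url).path
    # URL paths always use "/", so posixpath.basename picks out the last segment.
    filename = unquote(posixpath.basename(url_path))
    if not filename:
        raise ValueError(f"no filename in URL: {url!r}")
    original = Path(directory) / filename
    # Insert the suffix between the stem and the (last) extension.
    suffixed = original.with_name(original.stem + suffix + original.suffix)
    return original, suffixed


original, small = local_paths_for(
    "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg?size=1000px",
    "/home/ubuntu/Desktop",
)
print(original.name)  # 09-09-201315-47-571378756077.jpg
print(small.name)     # 09-09-201315-47-571378756077_small.jpg
```

The two paths can then be passed to the download and `save` calls from the question. Raising `ValueError` for URLs that end in "/" is a design choice of this sketch, not something the answer prescribes.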

Bilow answered 10/9, 2013 at 19:41 Comment(8)
Caution: os.path on Windows might expect "\". - Yulan
You don't even need urlparse. os.path.basename(url) works perfectly. - Ironware
@Ironware One does need urlparse. Only with urlparse will a URL with a query string, like http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg?size=1000px, yield the filename 09-09-201315-47-571378756077.jpg. If you only use os.path.basename(url), the extracted filename will include the query string: 09-09-201315-47-571378756077.jpg?size=1000px. This is usually not what you want. - Flasher
Although the path separator on Windows is different, I have confirmed that this solution works on Windows. - Barm
@Jean-Francois Let's not add too much to the answer. I also think you should urlparse the URL as it is before you unquote, because unquote doesn't expect a URL; it expects just /the/path/part of the URL. - Nigh
@BorisV Good point :) Although unquote before urlparse does actually work, and the code looks slightly neater. - Scurvy
@Jean-FrancoisT. It doesn't work; you just didn't think of the edge cases, like when you have a percent-encoded #. Try Path(unquote(urlparse('http://example.com/my%20%23superawesome%20picture.jpg').path)).name vs Path(urlparse(unquote('http://example.com/my%20%23superawesome%20picture.jpg')).path).name. It's just never a good idea to blindly modify something you intend to parse before parsing it. - Nigh
@BorisV Good point. Corrected. - Scurvy
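
The ordering issue discussed in these comments can be checked with a short sketch. It uses `posixpath.basename` instead of `Path(...).name` only so that the result does not depend on the operating system; the variable names are illustrative.

```python
import posixpath
from urllib.parse import urlparse, unquote

url = "http://example.com/my%20%23superawesome%20picture.jpg"

# Parse first, then unquote only the path: the encoded "#" stays in the name.
parse_then_unquote = posixpath.basename(unquote(urlparse(url).path))

# Unquote first: "%23" becomes a literal "#", which urlparse then treats as
# the start of the fragment, so the name is cut short.
unquote_then_parse = posixpath.basename(urlparse(unquote(url)).path)

print(repr(parse_then_unquote))  # 'my #superawesome picture.jpg'
print(repr(unquote_then_parse))  # a truncated name without "superawesome"
```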
P
39

os.path.basename(url)

Why try harder?

In [1]: os.path.basename("https://example.com/file.html")
Out[1]: 'file.html'

In [2]: os.path.basename("https://example.com/file")
Out[2]: 'file'

In [3]: os.path.basename("https://example.com/")
Out[3]: ''

In [4]: os.path.basename("https://example.com")
Out[4]: 'example.com'

Note 2020-12-20

Nobody has thus far provided a complete solution.

A URL can contain a ?[query-string] and/or a #[fragment Identifier] (but only in that order: ref)

In [1]: from os import path

In [2]: def get_filename(url):
   ...:     fragment_removed = url.split("#")[0]  # keep to left of first #
   ...:     query_string_removed = fragment_removed.split("?")[0]
   ...:     scheme_removed = query_string_removed.split("://")[-1].split(":")[-1]
   ...:     if scheme_removed.find("/") == -1:
   ...:         return ""
   ...:     return path.basename(scheme_removed)
   ...:

In [3]: get_filename("a.com/b")
Out[3]: 'b'

In [4]: get_filename("a.com/")
Out[4]: ''

In [5]: get_filename("https://a.com/")
Out[5]: ''

In [6]: get_filename("https://a.com/b")
Out[6]: 'b'

In [7]: get_filename("https://a.com/b?c=d#e")
Out[7]: 'b'
Plasmagel answered 7/8, 2018 at 11:49 Comment(7)
@Pi "Nobody has thus far provided a complete solution": the accepted answer is a "complete solution" that throws out the # and ? parts of the URL, which it does using the URL parsing built into Python (which might handle an edge case you didn't think of). - Nigh
I prefer this answer to the one above that uses urllib.parse.urlparse with os.path.basename by @Boris, because this answer only imports the os package, not urllib which is mostly duplicated by Requests and superseded by urllib2. One less dependency to become obsolete and cause future code maintenance. - Excruciate
@RichLysakowskiPhD There is no such thing as urllib2 on Python 3, and requests uses urllib.parse under the hood. How is implementing URL parsing yourself a smaller maintenance burden than an import? - Nigh
@Boris You are right. urllib2 does not exist in Python 3, so urllib built into Python or requests is the way to go. Thank you for clarifying with a source URL: github.com/psf/requests/blob/… - Excruciate
I find the topmost solution cleaner. I guess this is just an old post? - Asthenosphere
@BorisV Edge cases like "https://toto.com/dir/my%20file%20has%20spaces.txt", which contains encoded spaces... This would be handled by unquote in urllib.parse. - Scurvy
Wouldn't that be easier with a regex instead of multiple split / find calls? - Scurvy
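
The "edge case you didn't think of" concern can be made concrete. The sketch below copies `get_filename` from the answer and compares it with a `urlparse`-based helper (the name `get_filename_urlparse` is invented here) on a URL whose last path segment contains a colon. It is one example, not an exhaustive comparison.

```python
from os import path
from urllib.parse import urlparse, unquote


def get_filename(url):
    # Copied from the answer above, for comparison only.
    fragment_removed = url.split("#")[0]  # keep to left of first #
    query_string_removed = fragment_removed.split("?")[0]
    scheme_removed = query_string_removed.split("://")[-1].split(":")[-1]
    if scheme_removed.find("/") == -1:
        return ""
    return path.basename(scheme_removed)


def get_filename_urlparse(url):
    # Hypothetical alternative: let urlparse isolate the path, then take
    # the text after the last "/" and decode percent-escapes.
    return unquote(urlparse(url).path.rsplit("/", 1)[-1])


url = "https://a.com/b:c.txt"
print(repr(get_filename(url)))           # ''
print(repr(get_filename_urlparse(url)))  # 'b:c.txt'
```

The hand-rolled version splits on every ":", so a colon inside the path is mistaken for the scheme or port separator.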
P
21
filename = url[url.rfind("/")+1:]
filename_small = filename.replace(".", "_small.")

Maybe use ".jpg" in the replace call, since a "." can also appear elsewhere in the filename.

Phyto answered 10/9, 2013 at 19:39 Comment(2)
Just as a note, /path/to/image27.08.2016.jpg would become image27_small.08_small.2016_small.jpg. - Bast
Yeah, it doesn't work in all cases, so it shouldn't be considered the correct answer. - Sluggish
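
One way around the multiple-dots problem described in the comments is to split at the last dot only, for instance with `os.path.splitext`. This is a minimal sketch; the function name `add_suffix` is made up for illustration.

```python
import os.path


def add_suffix(filename, suffix="_small"):
    # splitext splits at the last dot only, so earlier dots are left alone.
    stem, extension = os.path.splitext(filename)
    return f"{stem}{suffix}{extension}"


print(add_suffix("image27.08.2016.jpg"))  # image27.08.2016_small.jpg
print(add_suffix("README"))               # README_small
```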
C
19

You could just split the url by "/" and retrieve the last member of the list:

url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
filename = url.split("/")[-1] 
#09-09-201315-47-571378756077.jpg

Then use replace to change the ending:

small_jpg = filename.replace(".jpg", "_small.jpg")
#09-09-201315-47-571378756077_small.jpg
Confine answered 10/9, 2013 at 19:52 Comment(2)
Easy to read and does not use any external package, best answer. - Pernickety
For websites like GitHub that add args to the URL like '?raw=true', this will not work. - Martingale
I
11

With Python 3 (3.4 and later) you can abuse the pathlib library in the following way:

from pathlib import Path

p = Path('http://example.com/somefile.html')
print(p.name)
# >>> 'somefile.html'

print(p.stem)
# >>> 'somefile'

print(p.suffix)
# >>> '.html'

print(f'{p.stem}-spamspam{p.suffix}')
# >>> 'somefile-spamspam.html'

❗️ WARNING

The pathlib module is NOT meant for parsing URLs; it is designed to work with filesystem paths only. Don't use it in production code! It's a quick and dirty hack for non-critical code. The fact that pathlib also works with URLs can be considered an accident that might be fixed in future releases. The code is only provided as an example of what you can but probably should not do. If you need to parse URLs in a canonical way, prefer urllib.parse or an alternative. Or, if you assume that the portion after the domain and before the parameters, query, and fragment is a POSIX-style path, you can extract just the path component with urllib.parse.urlparse and then use pathlib.Path to manipulate it.

Ironhanded answered 3/1, 2021 at 18:58 Comment(1)
This breaks with URLs with stuff after the path. Path('http://example.com/somefile.html?some-querystring#some-id').name will return 'somefile.html?some-querystring#some-id'. - Nigh
N
10

Use urllib.parse.urlparse to get just the path part of the URL, and then use pathlib.Path on that path to get the filename:

from urllib.parse import urlparse
from pathlib import Path


url = "http://example.com/some/long/path/a_filename.jpg?some_query_params=true&some_more=true#and-an-anchor"
a = urlparse(url)
a.path             # '/some/long/path/a_filename.jpg'
Path(a.path).name  # 'a_filename.jpg'
Nigh answered 10/3, 2020 at 19:44 Comment(2)
Seems like this might not work if you were running on Windows, right? - Frizzell
@Frizzell It will work, because pathlib accepts forward slashes when defining paths, even on Windows. Note, however, that pathlib converts "/" to "\" on Windows when you convert Path objects to str or bytes. So if you're modifying the above code to do something different, like getting the filename and the part before it (as in path/a_filename.jpg), and you want to keep forward slashes as forward slashes, use str(PurePosixPath(urlparse(url).path)) instead of str(Path(urlparse(url).path)). - Nigh
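
The `PurePosixPath` suggestion from the last comment can be sketched as follows. Because `PurePosixPath` never touches the filesystem and always uses "/", the printed strings should be the same on Windows and on POSIX systems. The `_small` renaming at the end is added for illustration, tying it back to the question.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

url = "http://example.com/some/long/path/a_filename.jpg?some_query_params=true#and-an-anchor"
url_path = PurePosixPath(urlparse(url).path)

print(url_path.name)         # a_filename.jpg
print(str(url_path.parent))  # /some/long/path

# Build the "_small" variant while keeping forward slashes.
small_path = url_path.with_name(url_path.stem + "_small" + url_path.suffix)
print(str(small_path))       # /some/long/path/a_filename_small.jpg
```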
E
1

Sometimes there is a query string:

filename = url.split("/")[-1].split("?")[0] 
new_filename = filename.replace(".jpg", "_small.jpg")
Elnora answered 10/6, 2019 at 3:38 Comment(1)
Sometimes there's a #fragment, like this: tools.ietf.org/html/rfc3986#section-3.5 - Nigh
I
1

A simple version using the os package:

import os

def get_url_file_name(url):
    url = url.split("#")[0]
    url = url.split("?")[0]
    return os.path.basename(url)

Examples:

print(get_url_file_name("example.com/myfile.tar.gz"))  # 'myfile.tar.gz'
print(get_url_file_name("example.com/"))  # ''
print(get_url_file_name("https://example.com/"))  # ''
print(get_url_file_name("https://example.com/hello.zip"))  # 'hello.zip'
print(get_url_file_name("https://example.com/args.tar.gz?c=d#e"))  # 'args.tar.gz'
Interradial answered 17/2, 2021 at 18:37 Comment(0)
A
1

Sometimes the link you have redirects somewhere else (that was the case for me). In that case you have to resolve the redirects first:

import requests
url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
response = requests.head(url, allow_redirects=True)  # HEAD does not follow redirects by default
url = response.url

Then you can continue with the best answer at the moment (Ofir's):

import os
from urllib.parse import urlparse


a = urlparse(url)
print(a.path)                    # Output: /kyle/09-09-201315-47-571378756077.jpg
print(os.path.basename(a.path))  # Output: 09-09-201315-47-571378756077.jpg

It doesn't work with this particular URL, however, as the page isn't available anymore.

Asthenosphere answered 6/10, 2021 at 13:8 Comment(0)
D
0

I see people using the pathlib library to parse URLs. This is not a good idea! pathlib is not designed for it; use a dedicated library such as urllib instead.

This is the most stable version I could come up with. It handles query parameters as well as fragments:

from posixpath import splitext
from urllib.parse import urlparse

def update_filename(url):
    parsed_url = urlparse(url)
    path = parsed_url.path

    # Split the path right after its last "/".
    cut = path.rfind('/') + 1
    directory, filename = path[:cut], path[cut:]

    if not filename:
        return None

    # splitext splits at the last dot and also copes with names that have no extension.
    file, extension = splitext(filename)

    # Rebuild only the final segment, so an identical name earlier in the path is untouched.
    new_path = f"{directory}{file}_small{extension}"

    return parsed_url._replace(path=new_path).geturl()

Example:

assert update_filename('https://example.com/') is None
assert update_filename('https://example.com/path/to/') is None
assert update_filename('https://example.com/path/to/report.pdf') == 'https://example.com/path/to/report_small.pdf'
assert update_filename('https://example.com/path/to/filename with spaces.pdf') == 'https://example.com/path/to/filename with spaces_small.pdf'
assert update_filename('https://example.com/path/to/report_01.01.2022.pdf') == 'https://example.com/path/to/report_01.01.2022_small.pdf'
assert update_filename('https://example.com/path/to/report.pdf?param=1&param2=2') == 'https://example.com/path/to/report_small.pdf?param=1&param2=2'
assert update_filename('https://example.com/path/to/report.pdf?param=1&param2=2#test') == 'https://example.com/path/to/report_small.pdf?param=1&param2=2#test'
Disario answered 21/8, 2022 at 13:10 Comment(0)
C
-1

The question "Python split url to find image name and extension" helps you extract the image name. To append the suffix:

imageName =  '09-09-201315-47-571378756077'

new_name = '{0}_small.jpg'.format(imageName) 
Castigate answered 10/9, 2013 at 19:41 Comment(0)
P
-2

We can extract the filename from a URL by using the ntpath module.

import ntpath
url = 'http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg'
name, ext = ntpath.splitext(ntpath.basename(url))
# 09-09-201315-47-571378756077  .jpg


print(name + '_small' + ext)
# 09-09-201315-47-571378756077_small.jpg
Premier answered 11/7, 2020 at 4:43 Comment(0)
