How to (re)download file with wget only when the file is newer or the size changed?

I am downloading an archive with wget. How can I make wget redownload the file only when it is newer on the server or its size has changed?

I'm aware of the -N flag but it doesn't work.

Gratian answered 6/9, 2023 at 9:29 Comment(0)

TL;DR: A critical bug introduced in or around wget 1.17 broke this feature.

  • With older wget, run: wget -N https://example.com/file.zip

  • With newer wget, run: wget -N --no-if-modified-since https://example.com/file.zip
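
If a script has to run on both old and new wget, the choice between the two commands above can be automated. This is a minimal sketch, assuming the cutoff really is 1.17 (I only know it is "in or around" that release); wget_timestamp_flags is a hypothetical helper name, not part of wget.

```shell
# Hypothetical helper: print the timestamping flags for a wget 1.x version.
# Assumption: the behavior changed in 1.17. wget2 is not considered here.
wget_timestamp_flags() {
    version=$1
    major=${version%%.*}   # "1.21.2" -> "1"
    rest=${version#*.}     # "1.21.2" -> "21.2"
    minor=${rest%%.*}      # "21.2"   -> "21"
    if [ "$major" -eq 1 ] && [ "$minor" -lt 17 ]; then
        echo "-N"
    else
        echo "-N --no-if-modified-since"
    fi
}

# Possible usage (left commented out because it needs network access):
# version=$(wget --version | sed -n '1s/^GNU Wget \([0-9.]*\).*/\1/p')
# wget $(wget_timestamp_flags "$version") https://example.com/file.zip
```

The unquoted $(...) in the usage line is deliberate, so that the two flags are passed to wget as separate arguments.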

The server must support HEAD requests and return both a timestamp (Last-Modified) and a size (Content-Length).
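
To check whether a server returns both headers, the response headers can be piped through a small filter. This is a rough sketch; has_timestamp_headers is a hypothetical helper that only looks for the two header names in whatever text it reads on stdin.

```shell
# Hypothetical helper: read raw HTTP response headers on stdin and succeed
# only if both Last-Modified and Content-Length are present.
# Leading whitespace is allowed because wget -S indents header lines.
has_timestamp_headers() {
    headers=$(cat)
    printf '%s\n' "$headers" | grep -qi '^[[:space:]]*Last-Modified:' || return 1
    printf '%s\n' "$headers" | grep -qi '^[[:space:]]*Content-Length:' || return 1
    return 0
}

# Possible usage (left commented out because it needs network access):
# curl -sI https://example.com/file.zip | has_timestamp_headers && echo "headers present"
# wget --spider -S https://example.com/file.zip 2>&1 | has_timestamp_headers && echo "headers present"
```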

Debugging

Use the -d flag to display the request and response headers for debugging.

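# Show the wget version, then download the file with debug output
wget --version
wget -N -d https://example.com/file.zip
# Simulate a partial download by shrinking the local file to 1 byte
truncate --size 1 file.zip
# Run again: a working -N should notice the size mismatch and redownload
wget -N -d https://example.com/file.zip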

In older versions, where it used to work, wget sends a HEAD request to obtain the last modified time and the file size. If either has changed, wget then sends a GET request (without If-Modified-Since) to download the file.

In newer versions, where it's broken, wget sends a single GET request (with If-Modified-Since), so the file is only downloaded if its date has changed. Unfortunately that doesn't work.

The new behavior is broken by design: it simply cannot detect changes in file size. As a side effect, wget will never recover from a partial or interrupted download.

When the client sends an HTTP GET request with a timestamp, the server can respond with a 304 Not Modified code, with no content and no file size. The 304 decision is based only on the last modification time provided by the client. Unfortunately this gives wget no chance to ever learn the file size or to redownload the file.
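
The difference between the two strategies can be modeled in a few lines. This is only a sketch of the decision rules as described above, not wget's actual implementation; both function names are hypothetical and the numbers are arbitrary example values.

```shell
# Sketch only: both functions are hypothetical and model the two strategies
# described above; they are not wget's actual implementation.
# Arguments: local_size local_mtime remote_size remote_mtime
# (sizes in bytes, times as epoch seconds). Each prints "download" or "skip".

# HEAD-based check: the client sees both the remote size and the timestamp.
decide_head_based() {
    if [ "$1" -ne "$3" ] || [ "$4" -gt "$2" ]; then
        echo download
    else
        echo skip
    fi
}

# If-Modified-Since check: the server compares timestamps only and answers
# 304 when the file is not newer, so the size arguments are never consulted.
decide_if_modified_since() {
    if [ "$4" -gt "$2" ]; then
        echo download
    else
        echo skip
    fi
}

# Truncated local copy (1 byte) with an unchanged timestamp:
# decide_head_based 1 1693506140 1234 1693506140
# decide_if_modified_since 1 1693506140 1234 1693506140
```

For the truncated copy, the first call prints download and the second prints skip, which is the failure shown in the log below.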

# wget 1.21 on Ubuntu 22: broken
wget -N https://example.com/file.zip -d
truncate --size 1 file.zip
wget -N https://example.com/file.zip -d

---request begin---
GET /file.zip HTTP/1.1
Host: example.com
If-Modified-Since: Thu, 31 Aug 2023 18:22:20 GMT
User-Agent: Wget/1.21.2
Accept: */*
Accept-Encoding: identity
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 304 Not Modified
Date: Wed, 06 Sep 2023 09:10:16 GMT
Connection: keep-alive
Last-Modified: Thu, 31 Aug 2023 18:22:20 GMT
ETag: f37ffefc58f99f0b996a38154d87820344d86d41
Accept-Ranges: bytes
Content-Disposition: attachment; filename="file.zip"; filename*=UTF-8''file.zip

---response end---
304 Not Modified
Registered socket 3 for persistent reuse.
File ‘file.zip’ not modified on server. Omitting download.

Web browsers do not suffer from this caching issue because they store the ETag header from the initial response, an identifier for one specific version of the file. Apache and nginx generate the ETag automatically when serving static files, based on the last modification time and the file size.

