I'm trying to use HTTrack or Wget to download some .docx files from a website. I want to do this only for one folder and its subfolders, e.g. www.examplewebsite.com/doc (which goes down 5 more levels).
What would be a good way to do this?
The previously proposed answer is off base: wget's --spider option has never downloaded anything; it only follows and checks links.
Better late than never, here is the command you seek. It mirrors the files with the desired extensions locally and, as a bonus, pulls down the target HTML and rewrites its links, so that when you open the pages locally and click through, the links point to the copies on your local drive.
wget -e robots=off -r -k -A docx,doc "https://<url>"
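Since the question asks to stay within www.examplewebsite.com/doc and its subfolders, adding -np (--no-parent) stops wget from ascending above that directory; a sketch using the question's placeholder URL:
wget -e robots=off -r -k -np -A docx,doc "https://www.examplewebsite.com/doc/"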
You can use --spider with -r (the recursive option) and --accept to filter for the files of interest:
wget --spider -r --accept "*.docx" <url>
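Note, however, that --spider makes wget traverse and check links without saving anything to disk, so drop it if you actually want the files; a sketch using the question's placeholder URL:
wget -r -np -A "*.docx" "https://www.examplewebsite.com/doc/"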
Usage
wget -r -np -A pdf,doc https://web.cs.ucla.edu/~harryxu/
Result
tree
└── web.cs.ucla.edu
    ├── ~harryxu
    │   ├── papers
    │   │   ├── chianina-pldi21.pdf
    │   │   ├── dorylus-osdi21.pdf
    │   │   ├── genc-pldi20.pdf
    │   │   ├── jaaru-asplos21.pdf
    │   │   ├── jportal-pldi21.pdf
    │   │   ├── li-sigcomm20.pdf
    │   │   ├── trimananda-fse20.pdf
    │   │   ├── vigilia-sec18.pdf
    │   │   ├── vora-asplos17.pdf
    │   │   ├── wang-asplos17.pdf
    │   │   ├── wang-osdi18.pdf
    │   │   ├── wang-osdi20.pdf
    │   │   ├── wang-pldi19.pdf
    │   │   └── zuo-eurosys19.pdf
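The same pattern adapts to the original question's .docx case. One caveat: wget's default recursion depth is 5 levels, and the question's folder goes down 5 more levels below the start URL, so raising the limit with -l is a safe hedge; a sketch with the question's placeholder URL:
wget -r -np -l inf -A docx "https://www.examplewebsite.com/doc/"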