I'm trying to use HTTrack or Wget to download some .docx files from a website. I want to do this only for one folder and its subfolders, e.g. www.examplewebsite.com/doc (which goes down 5 more levels).
What would be a good way to do this?
The previously proposed answer is off base: wget's --spider option has never downloaded anything; it only follows and checks links.
Better late than never, here is the command you seek. It mirrors the files with the desired extensions locally and, as a bonus, pulls down the target HTML and rewrites its links, so that when you open the pages locally and click through, the links point to the copies on your local drive.
wget -e robots=off -r -k -A docx,doc "https://<url>"
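Since the question asks to stay within www.examplewebsite.com/doc and its subfolders, adding -np (--no-parent) stops wget from ascending above that directory; a sketch using the question's placeholder URL:
wget -e robots=off -r -k -np -A docx,doc "https://www.examplewebsite.com/doc/"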
You can use --spider with -r (the recursive option) and --accept to filter for the files of interest:
wget --spider -r --accept "*.docx" <url>
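Note, however, that --spider makes wget traverse and check links without saving anything to disk, so drop it if you actually want the files; a sketch using the question's placeholder URL:
wget -r -np -A "*.docx" "https://www.examplewebsite.com/doc/"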
Usage
wget -r -np -A pdf,doc https://web.cs.ucla.edu/~harryxu/
Result
tree
└── web.cs.ucla.edu
    ├── ~harryxu
    │   ├── papers
    │   │   ├── chianina-pldi21.pdf
    │   │   ├── dorylus-osdi21.pdf
    │   │   ├── genc-pldi20.pdf
    │   │   ├── jaaru-asplos21.pdf
    │   │   ├── jportal-pldi21.pdf
    │   │   ├── li-sigcomm20.pdf
    │   │   ├── trimananda-fse20.pdf
    │   │   ├── vigilia-sec18.pdf
    │   │   ├── vora-asplos17.pdf
    │   │   ├── wang-asplos17.pdf
    │   │   ├── wang-osdi18.pdf
    │   │   ├── wang-osdi20.pdf
    │   │   ├── wang-pldi19.pdf
    │   │   └── zuo-eurosys19.pdf
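The same pattern adapts to the original question's .docx case. One caveat: wget's default recursion depth is 5 levels, and the question's folder goes down 5 more levels below the start URL, so raising the limit with -l is a safe hedge; a sketch with the question's placeholder URL:
wget -r -np -l inf -A docx "https://www.examplewebsite.com/doc/"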