Recover old website from the Wayback Machine [closed]

Is there a way to recover an entire website from the Wayback Machine?

I have an old site that is archived, but I no longer have the website files needed to revive it. Is there a way to recover the old data so I can get my long-lost files back?

Audiovisual asked 16/3, 2012 at 1:01 (3 comments)
What do you mean by 'website files'? Just the HTML? If yes, then surely you could just go to that webpage and download the source through your browser (see the sketch after these comments). – Cystectomy
Yes: HTML, CSS, images, and possibly PHP files. The site has multiple pages with images and custom CSS. – Audiovisual
I came across the same issue and ended up writing a gem. To install: gem install wayback_machine_downloader, then run it with the base URL of the website you want to retrieve as a parameter: wayback_machine_downloader http://example.com. More information: github.com/hartator/wayback_machine_downloader – Thick
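
For the single-page case raised in the first comment, a minimal command-line alternative to saving from the browser; this is only a sketch, and the snapshot URL is the illustrative Google example from the answer below, not the asker's site:

 wget --page-requisites --convert-links http://web.archive.org/web/19970708161549/http://www.google.com/

--page-requisites also pulls the CSS and images the page references, and --convert-links rewrites the links so the local copy renders offline.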

wget is a great tool to mirror an entire site, and if you are on Windows you can use Cygwin to install it. The following command will mirror a site: wget -m domain.name

Update from comments:

The example wget command won't ascend to the parent directory (-np), ignores robots.txt (-e robots=off), restricts the crawl to the archive's domains (--domains=...), and mirrors the given URL (here, an illustrative snapshot of google.com). All together you get:

 wget -np -e robots=off --mirror --domains=staticweb.archive.org,web.archive.org http://web.archive.org/web/19970708161549/http://www.google.com/

If you are dealing with HTTPS and a self-signed certificate, you can use --no-check-certificate to disable the certificate check. The wget help is the best place to see the possible options.
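
As a sketch of that HTTPS case, the same mirror command with certificate checking disabled (same illustrative snapshot URL as above):

 wget -np -e robots=off --no-check-certificate --mirror --domains=staticweb.archive.org,web.archive.org https://web.archive.org/web/19970708161549/http://www.google.com/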

Grappling answered 16/3, 2012 at 1:08 (9 comments)
Thank you for the resource, much appreciated. I have a Mac and an app called SiteSucker which seems to do the same thing. The problem is downloading through a full archive.org URL. – Audiovisual
+1 for help with blocked recursive crawling! This should be the accepted answer. – Vena
-np helps keep the crawl from wandering off the specified date path. – Viniferous
Great, thanks. And for a good guide to installing wget on Mac OS X without Homebrew or similar, check out coolestguidesontheplanet.com/install-and-configure-wget-on-os-x – Remediosremedy
When using HTTPS, add --no-check-certificate. – Groats
Good stuff, I will update the example. – Grappling
@Grappling But is there any way to download the CSS and photos with that command? – Hawkinson
@Hawkinson You'll need to remove -np, and then it's a good idea to limit recursion, for example with -l 3. – Slavey
Replying to @Hawkinson: no, you need a few more options, e.g. wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains domain.tld my.domain.tld/; take a look at linuxjournal.com/content/downloading-entire-web-site-wget (note: this will work for web.archive.org as well, just add the extra options). – Legendary
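
Pulling the last few comments together, a sketch of a requisites-aware crawl against the archive; the snapshot URL is the illustrative one from the answer, and the depth limit of 3 is an assumption, not a tested value:

 wget -e robots=off --recursive -l 3 --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains=web.archive.org http://web.archive.org/web/19970708161549/http://www.google.com/

Without -np the crawl can wander to other snapshots under /web/, which is why limiting recursion depth with -l matters here.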
