Downloading Your Web Site in Static Form on Linux or Macs with WGET

Short version

On a Mac or Linux machine, install wget. Then run the command below from a terminal prompt, replacing the URL with your site's URL.

wget --limit-rate=400k --no-clobber \
--convert-links --restrict-file-names=windows \
--random-wait -r -p -E \
-e robots=off -U mozilla \
http://www.example.com
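
For reference, here is a rough breakdown of what each option in that command does, written as a small shell sketch. The helper name static_wget_args is made up for illustration and is not part of wget; the comments are my reading of the wget manual, so check "man wget" for your version.

```shell
#!/bin/sh
# Sketch: the short-version options, one per line, each with a comment.
# static_wget_args is a hypothetical helper name, not something wget provides.
static_wget_args() {
  set --                                      # start with an empty argument list
  set -- "$@" --limit-rate=400k               # cap download speed at roughly 400 KB/s
  set -- "$@" --no-clobber                    # skip files that already exist locally
  set -- "$@" --convert-links                 # rewrite links so the copy browses offline
  set -- "$@" --restrict-file-names=windows   # escape characters Windows cannot store
  set -- "$@" --random-wait                   # vary the pause between requests
  set -- "$@" -r                              # recurse through links on the site
  set -- "$@" -p                              # fetch page requisites (images, CSS)
  set -- "$@" -E                              # save HTML pages with an .html extension
  set -- "$@" -e robots=off                   # ignore robots.txt (your own site only)
  set -- "$@" -U mozilla                      # send "mozilla" as the User-Agent
  printf '%s\n' "$@"                          # print one argument per line
}
```

Running static_wget_args on its own just prints the options. To use them, something like: wget $(static_wget_args) http://www.example.com (left unquoted on purpose so each printed line becomes its own argument). One thing worth knowing: --random-wait scales the value given to --wait, so without a --wait setting it has little effect.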

Longer version, including a bash alias

Personally I use a .bash_profile alias that stops just short of the URL, so I can simply type:

 getstatic http://www.example.com

That alias lives in ~/.bash_profile on Linux, which bash reads for login shells. On Ubuntu 12.04 and 14.04 the default .bashrc also checks for a ~/.bash_aliases file and loads it if it exists, so that is another place to put it. My alias looks very similar to this:

alias getstatic="cd /var/www/backups/ && wget --limit-rate=400k --no-clobber --convert-links --restrict-file-names=windows --random-wait -r -p -E -e robots=off -U mozilla ";

If it's a one-time job and you'd rather type it out step by step, open a terminal and run something along the lines of:

sudo mkdir -p /var/www/backups
sudo chown "$USER" /var/www/backups
cd /var/www/backups
wget --limit-rate=400k --no-clobber --convert-links --restrict-file-names=windows --random-wait -r -p -E -e robots=off -U mozilla http://www.example.com

Obviously, replace http://www.example.com with the site you are downloading, and check that the protocol is exact: http:// or https:// (note the "s" on the second one).
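
If you'd rather have the shell catch a missing or wrong protocol for you, a function can do what the alias does plus a quick check. This is only a sketch: the names valid_site_url and getstatic_checked are made up, and it assumes the same /var/www/backups/ directory as above.

```shell
#!/bin/sh
# Sketch of a function alternative to the alias.
# valid_site_url and getstatic_checked are hypothetical names.

# Succeeds only when the argument starts with http:// or https://
# and has at least one character after the protocol.
valid_site_url() {
  case "$1" in
    http://?*|https://?*) return 0 ;;
    *) return 1 ;;
  esac
}

getstatic_checked() {
  if ! valid_site_url "$1"; then
    echo "usage: getstatic_checked http://www.example.com" >&2
    return 1
  fi
  # Run in a subshell so the caller's working directory is left alone.
  (
    cd /var/www/backups/ || exit 1
    wget --limit-rate=400k --no-clobber --convert-links \
         --restrict-file-names=windows --random-wait \
         -r -p -E -e robots=off -U mozilla "$1"
  )
}
```

Paste it into ~/.bash_profile in place of the alias and call it the same way: getstatic_checked http://www.example.com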

If you are on a Mac, this can be done more easily by buying a $5 program from the App Store called SiteSucker.

Words of Caution - Use Common Sense

ISPs are constantly under attack. Their job is to defend against those attacks, and if you run this either incorrectly or for the wrong reasons, your traffic will look like one. That will get you banned and is possibly illegal.

There is a cure that helps you avoid this danger: common sense. Wanna be really smart about it? Call or email ahead and let them know you are creating a static fail-over replica. And don't download someone else's copyrighted material. These are not Internet issues, they are legal issues, so read up on the laws in your country.

Specific Suggestions on Static Backups

  • If you have a large site, tell the ISP you are going to schedule a weekly static backup. Otherwise we will block your IP, sometimes temporarily and sometimes permanently, using tools like fail2ban ( https://github.com/fail2ban/fail2ban ). If you get banned, honestly, you probably deserve it, as the cloud is a public good. So be cool.
  • Have someone in IT run it for you. They will understand both sides and the dangers.
  • Don't be in a hurry. Throttle the download rate and give it a longer delay. So what if it takes a few days to download your site? That means your site also stays UP for other actual visitors.
  • ONLY USE IT ON YOUR OWN SITE
  • I recommend you tell your provider ahead of time regardless.
  • Another recommendation, if you are tech savvy, is to spin up another instance of Linux to do the wget calls. Then tar the folders, download the compressed file, and kill the temporary instance you used. It's always faster to go cloud to cloud.
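
To make the "don't be in a hurry" and "tar the folders" suggestions concrete, here is a rough sketch. The 5-second wait, the 100k rate, and the function names getstatic_gentle and pack_site are all my own illustrative choices, so tune them to whatever your provider is comfortable with.

```shell
#!/bin/sh
# Sketch of a slower, more polite run followed by packing the result.
# getstatic_gentle and pack_site are hypothetical names; the wait and
# rate values are illustrative assumptions, not recommendations.

# Same options as the main command, but with a base pause between
# requests (--wait) and a lower bandwidth cap. With --wait set,
# --random-wait varies each pause around that base value.
getstatic_gentle() {
  wget --wait=5 --random-wait --limit-rate=100k \
       --no-clobber --convert-links --restrict-file-names=windows \
       -r -p -E -e robots=off -U mozilla "$1"
}

# Pack the directory wget created (named after the host, for example
# www.example.com) into a gzip-compressed tarball next to it.
pack_site() {
  tar -czf "$1.tar.gz" "$1"
}
```

On the temporary instance that would be: getstatic_gentle http://www.example.com and then pack_site www.example.com, after which you download www.example.com.tar.gz and shut the instance down.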

Other Sites - For Offline Reading - Can I Download Those?

Yes, but don't use these tools for that. There are offline reader plugins available for most browsers. Just be careful, as some contain malware. I use one in Firefox for offline reading so I can study on airplanes. The legitimate offline reader plugins are careful not to pound the server with traffic, and they identify themselves in the header so the sites allow them in.

Yet again: if you are too aggressive with your settings, you can and will get an entire subnet, like your entire university or every user in your company, blocked based on IP address reputation. Common sense. And if used illegally, you may find yourself talking to your attorneys, and hey, the rest of us like that the law is catching up on these issues. There is a difference between a carefully made static backup of content you own and ripping a competitor's site. Common sense.

Usage of Static Sites

A typical use of a static version of your site is fail-over redundancy. Sometimes, for SEO reasons, you also cache a static version for faster response time. Not everything will work perfectly, but it's a way to maintain some control and your own backup. It's in your best interest when done properly.

Peace :)


Edited 01 Feb, 2016 16:16