Ryan Rampersad
Thoughts, opinions, ideas and now links


wget caching

During a recent podcast, the studio had some major Internet connectivity issues. We couldn’t use any direct quotes or reference any of the material, so we were kind of stuck. I thought I could rig my server to prefetch all of our show note pages before future shows. I only needed single pages with mostly complete assets. I found a great Superuser thread on this topic via user35651.

From the wget manual (1.12):

“Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to ‘-p’:”

wget -E -H -k -K -p url

This solution works great, but I also add -nd -w 3. The -nd switch means “no directories”, which prevents the creation of thousands of folders, and -w 3 adds a three-second wait between requests so I don’t destroy the local Internet connection in the house, nor the remote servers.
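
Put together, the full command could be wrapped in a small helper like the one below. This is only a sketch: the function name prefetch_page and the WGET override variable are hypothetical names chosen for illustration, not something from the Superuser thread or the wget manual. The flags themselves are the ones discussed above.

```shell
#!/bin/sh
# Sketch only: prefetch_page and the WGET variable are hypothetical names.
# WGET can point at a different wget binary, or at a stub when testing.
WGET="${WGET:-wget}"

prefetch_page() {
    # -E     save HTML pages with an .html extension
    # -H     allow requisites hosted on other domains
    # -k     convert links so the page works locally
    # -K     keep a .orig backup of each converted file
    # -p     fetch page requisites (images, CSS, scripts)
    # -nd    do not create a directory hierarchy
    # -w 3   wait 3 seconds between requests
    "$WGET" -E -H -k -K -p -nd -w 3 "$1"
}
```

Calling prefetch_page with a show note URL (for example prefetch_page http://example.com/show-notes.html, a placeholder address) should drop that page and its assets into the current directory.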

wget: background download

Here I am at the University of Minnesota and I find out that I need to download this huge 3.4 gigabyte file. I don’t need it now, but I know I’ll need it eventually. What do I do? I ssh to my server at home and start up wget. In the past, though, I found that wget would fail to function properly if I closed the terminal, and since I have class, that would happen quite soon. And this download will take hours.

wget is smart enough, though, to offer a background option that decouples it from the terminal process that started it.

Startup:
  -V,  --version           display the version of Wget and exit.
  -h,  --help              print this help.
  -b,  --background        go to background after startup.
  -e,  --execute=COMMAND   execute a `.wgetrc'-style command.

This allows me to freely leave the connection hanging, and it’ll still continue at home without me. But what about progress now? Ever heard of tail?

Usage: tail [OPTION]... [FILE]...
Print the last 10 lines of each FILE to standard output.

There’s another option, -f, that follows the file as it grows. After printing the last 10 lines, tail keeps running and prints each new line as soon as it is appended to the file. When you started the background wget, you were also told the name of its log file: “wget-log”.

So just run a tail -f wget-log and you’ll see the output of the progress of your super massive download that is decoupled from your terminal session! It’s fantastic.
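
The whole workflow can be sketched as two small shell functions. The function names, the WGET override variable and the example URL are all hypothetical, and the sketch assumes wget writes its log to wget-log in the current directory, as described above.

```shell
#!/bin/sh
# Sketch only: the function names and the WGET variable are hypothetical.
WGET="${WGET:-wget}"

start_background_download() {
    # -b sends wget to the background right after startup; with no -o
    # option given, its output goes to a log file (wget-log by default).
    "$WGET" -b "$1"
}

peek_progress() {
    # Print the last few lines of the log once and exit.
    # $1 = log file (default wget-log), $2 = number of lines (default 10).
    tail -n "${2:-10}" "${1:-wget-log}"
}

# To keep watching instead of peeking once, follow the file:
#   tail -f wget-log
```

For example, start_background_download http://example.com/big.iso (a placeholder URL) kicks off the download, and peek_progress shows where it stands after logging back in later.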

© 2013 Ryan Rampersad.