wget Recipes

Website mirroring / scraping

Mirror a website

wget --mirror --no-clobber --no-parent --wait=3 --execute robots=off --user-agent=Mozilla

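A quick explanation for these flags:

--mirror turns on recursion with infinite depth and timestamping; it is shorthand for -r -N -l inf --no-remove-listing.

--no-clobber skips files that already exist locally instead of downloading them again.

--no-parent keeps wget from ascending to the parent directory of the starting URL.

--wait=3 pauses three seconds between requests.

--execute robots=off runs robots=off as if it were a .wgetrc command, so wget ignores robots.txt.

--user-agent=Mozilla sends Mozilla as the User-Agent header instead of wget's default.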

Download sequential URLs
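
One possible approach, sketched under assumptions: generate the numbered URLs in the shell and hand them to wget on standard input. The helper name sequential_urls and the example.com template are hypothetical; only the -i - / --input-file=- option comes from wget itself.

```shell
# sequential_urls is a hypothetical helper: it prints one URL per number
# in the range first..last, substituting the number into a printf-style
# template such as 'https://example.com/page-%d.html'.
sequential_urls() {
  template=$1 first=$2 last=$3
  i=$first
  while [ "$i" -le "$last" ]; do
    # The template is used as the printf format string on purpose.
    printf "$template\n" "$i"
    i=$((i + 1))
  done
}

# Usage (not run here, since it needs network access):
#   sequential_urls 'https://example.com/page-%d.html' 1 20 | wget --input-file=-
```

A template such as page-%02d.html produces zero-padded numbers, which is handy when the site names its files 01, 02, and so on.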


Download a list of URLs

The URLs can be specified in a separate file list-of-URLs.txt, with one URL per line.

wget --input-file=list-of-URLs.txt
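
If the list is maintained by hand, it may pick up blank lines, notes, or duplicates. A minimal sketch for tidying it before wget sees it, assuming that lines starting with # are meant as comments (that is a convention of this sketch, not something wget defines); clean_url_list is a hypothetical helper name.

```shell
# clean_url_list is a hypothetical helper: it reads a URL list on stdin,
# drops blank lines and lines starting with '#', and removes duplicates
# while keeping the first occurrence of each URL in its original order.
clean_url_list() {
  grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' | awk '!seen[$0]++'
}

# Usage (not run here, since it needs network access):
#   clean_url_list < list-of-URLs.txt | wget --input-file=-
```

Passing - as the file name makes wget read the URLs from standard input.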

Content auditing

Find broken internal links

With the --spider option, wget crawls a website without downloading anything and produces a summary of all broken internal links:

wget --recursive --level=0 --spider
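
The summary sits at the end of a long log, so it can help to save the log and pull out just the broken URLs. The sketch below assumes an English-language log (LC_ALL=C) in which the URLs follow a "Found N broken links." line and precede a "FINISHED" line, as GNU Wget 1.x appears to print it; list_broken_links, spider.log, and the example.com URL are hypothetical.

```shell
# list_broken_links is a hypothetical helper. It reads a wget log on stdin
# and prints the URLs listed after the "Found N broken link(s)." line.
# Assumption: the log is in English and uses the summary layout described
# in the text above; other wget versions or locales may differ.
list_broken_links() {
  awk '
    /^Found [0-9]+ broken links?\.$/ { in_summary = 1; next }
    in_summary && /^FINISHED /       { exit }
    in_summary && NF > 0             { print }
  '
}

# Usage (not run here, since it needs network access):
#   LC_ALL=C wget --recursive --level=0 --spider --output-file=spider.log https://example.com/
#   list_broken_links < spider.log
```

When the log contains no summary line, the helper prints nothing.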

Find broken outbound links

Although the -H option allows wget to look beyond the domain it is supplied, checking outbound links poses multiple challenges.

Most importantly, wget doesn't have an option to configure a separate recursion level for external domains. You can't have the infinite internal recursion that lets you crawl the website starting from its homepage because that will instruct wget to crawl the entire World Wide Web once it escapes the original domain.

When you have access to a sitemap, you can use it to grab a list of all URLs from the domain with hred, as below. In the absence of sitemap.xml, you can gather all URLs from the website using wget and a grep-like tool.

wget -O - \
| hred -xcr loc@.textContent

This list of URLs can then be piped back to wget with the -i - option and crawled with --level 1 recursion to check for broken outbound links:

wget -O - \
| hred -xcr loc@.textContent \
| wget -i - -H --spider --recursive --level 1 --tries 1 --timeout 2

The second challenge is that the process can take a long time (41 minutes for the quite modest website you're now reading), and the results are only shown once (and if) the process ends. To keep any single link from stalling the process, set how long to wait for a request (--timeout) and how many times to retry (--tries).

A final challenge, which is more of a nuisance, is that wget litters the current working directory with empty folders (one for each domain it finds while crawling), which then need to be deleted manually.
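
The cleanup can be scripted. A minimal sketch, assuming a find that supports -mindepth, -empty, and -delete (GNU and BSD find do; they are not part of POSIX); remove_empty_dirs is a hypothetical helper name.

```shell
# remove_empty_dirs is a hypothetical helper: it deletes every empty
# directory below the given directory (default: the current one), but
# never the starting directory itself. Because -delete processes a
# directory's contents first, nested empty directories are removed too.
remove_empty_dirs() {
  find "${1:-.}" -mindepth 1 -type d -empty -delete
}

# Usage, after a spider run has finished:
#   remove_empty_dirs .
```

Running the crawl from a scratch directory created with mktemp -d is an alternative that keeps the leftovers out of the working directory in the first place.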

Gather the list of URLs from a website

Inspired by this article by Elliot Cooper, but using ripgrep:

wget --spider -r 2>&1 \
| rg '^--.+-- (.+)$' --replace '$1' \
| sort -u

Note that this lists everything, including links to images, styles, and scripts.
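
When only pages are of interest, the list can be narrowed with one more filter. This is a sketch, not part of the original recipe: pages_only is a hypothetical helper, and the extension list is an assumption to adjust for the site at hand.

```shell
# pages_only is a hypothetical helper: it reads URLs on stdin and drops
# those that end in a common asset extension, optionally followed by a
# query string or fragment. Matching is case-insensitive.
pages_only() {
  grep -v -i -E '\.(css|js|png|jpe?g|gif|svg|webp|ico|woff2?)([?#].*)?$'
}

# Usage: append it to the pipeline above, for example
#   ... | sort -u | pages_only
```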

The 2>&1 part redirects stderr (file descriptor 2) to stdout (file descriptor 1). Without it, wget's progress can't be piped to another tool, because wget writes that information to the stderr stream.
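
The effect of the redirect can be seen without touching the network. In this sketch, noisy is a hypothetical stand-in for wget that, like wget's progress log, writes only to stderr; the line it prints is made up for illustration.

```shell
# noisy is a hypothetical stand-in for wget: its only output goes to stderr.
noisy() {
  echo "--2024-01-01 00:00:00--  https://example.com/" >&2
}

# Without the redirect the pipe receives nothing; the message bypasses
# the pipe and goes to the terminal.
count_without=$(noisy | wc -l | tr -d ' ')

# With 2>&1, stderr is merged into stdout before the pipe, so wc sees it.
count_with=$(noisy 2>&1 | wc -l | tr -d ' ')

echo "lines piped without 2>&1: $count_without, with 2>&1: $count_with"
```

The order matters in a pipeline: 2>&1 is written before the | so that it applies to the command on the left.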

Further reading