wget Recipes

Website mirroring / scraping

Mirror a website

wget --mirror --no-clobber --no-parent --wait=3 --execute robots=off --domains=danburzo.ro,assets.danburzo.ro --user-agent=Mozilla danburzo.ro

A quick explanation for these flags:

--mirror is shorthand for -r -N -l inf --no-remove-listing: recursive retrieval, timestamp checking, unlimited depth, and keeping FTP directory listings
--no-clobber tells wget to skip files that already exist locally instead of downloading them again or saving numbered duplicates
--no-parent tells wget never to ascend to the parent directory while crawling
--wait=seconds pauses for the given number of seconds between requests, a.k.a. being nice to the server
--execute robots=off will ignore the robots.txt file on the website and any nofollow attributes on links
--domains=domain1,domain2,... is a list of domain names to consider when crawling (you'll want to include here any domains that hold the assets for the page)
--user-agent=UserAgentString can be used in case the server blocks access based on the User Agent
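To make the shorthand concrete, here is a minimal sketch that prints (rather than runs) the mirror command with --mirror spelled out in long options. The function name mirror_cmd is hypothetical. The sketch leaves out --no-clobber on the assumption, worth checking against your own wget version, that wget refuses to combine it with the timestamping (-N) that --mirror switches on.

```shell
# Print the mirror command with --mirror expanded into its long options.
# mirror_cmd is a hypothetical helper; $1 is the site, $2 the assets domain.
mirror_cmd() {
  # --recursive --timestamping --level=inf --no-remove-listing
  # is what --mirror stands for.
  echo wget \
    --recursive --timestamping --level=inf --no-remove-listing \
    --no-parent --wait=3 \
    --execute robots=off \
    --domains="$1,$2" \
    --user-agent=Mozilla \
    "$1"
}

# Preview the command for the site used in the recipe above.
mirror_cmd danburzo.ro assets.danburzo.ro
```

Removing the leading echo inside the function turns the preview into the real download.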

Download sequential URLs

wget http://example.com/records/{1..1000}
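Note that {1..1000} is expanded by the shell (bash, zsh), not by wget, so in a shell without brace expansion wget would receive the literal string. A rough portable alternative, assuming seq is installed, generates the URLs first and feeds them to wget on standard input. The helper name make_urls is made up for this sketch.

```shell
# Generate sequential URLs without relying on brace expansion.
# make_urls is a hypothetical helper; example.com/records is the
# placeholder address from the recipe above.
make_urls() {
  # $1 = first number, $2 = last number
  seq "$1" "$2" | sed 's|^|http://example.com/records/|'
}

# Preview a short range before committing to a long download.
make_urls 1 3

# Then hand the full list to wget on stdin (uncomment to download):
# make_urls 1 1000 | wget --wait=1 --input-file=-
```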

Download a list of URLs

The URLs can be specified in a separate file list-of-URLs.txt, with one URL per line.

wget --input-file=list-of-URLs.txt

Content auditing

Find broken internal links

With the --spider option, wget crawls a website without downloading anything and produces a summary of all broken internal links:

wget --recursive --level=0 --spider danburzo.ro
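On a large site the summary is easy to lose in the rest of the log. The sketch below saves the log to a file and prints only the broken URLs. It assumes the summary starts with a line such as "Found 3 broken links." followed by one URL per line, which is how the wget builds I have seen print it; the function name broken_links and the file name spider.log are made up.

```shell
# Print only the URLs from the broken-link summary of a saved wget log.
# Assumption: the summary begins with "Found N broken link(s)." and lists
# one URL per line after it.
broken_links() {
  # $1 = path to a wget log file
  sed -n '/^Found [0-9][0-9]* broken link/,$p' "$1" | grep -E '^https?://'
}

# Typical use (uncomment to run against the live site):
# wget --recursive --level=0 --spider --output-file=spider.log danburzo.ro
# broken_links spider.log
```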

Find broken outbound links

Although the -H option (--span-hosts) lets wget follow links beyond the domain it was given, checking outbound links poses several challenges.

Most importantly, wget has no option to set a separate recursion level for external domains. Combining -H with the infinite recursion needed to crawl the website from its homepage would send wget crawling the entire World Wide Web as soon as it leaves the original domain.

When you have access to a sitemap, you can use it to grab a list of all URLs from the domain with hred, as below. In the absence of sitemap.xml, you can gather all URLs from the website using wget and a grep-like tool.

wget -O - https://danburzo.ro/sitemap.xml \
| hred -xcr loc@.textContent
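If hred is not installed, the loc values can be approximated with grep and sed. This is a swapped-in technique rather than a faithful replacement: it treats the sitemap as text, so it assumes each loc element sits on a single line, and it does not decode XML entities or follow sitemap index files. The function name extract_locs is made up.

```shell
# Pull the contents of every <loc> element out of a sitemap read from stdin.
# Text-based approximation: no XML parsing, no entity decoding.
extract_locs() {
  grep -o '<loc>[^<]*</loc>' | sed -e 's|<loc>||' -e 's|</loc>||'
}

# Typical use (uncomment to fetch the live sitemap):
# wget -O - https://danburzo.ro/sitemap.xml | extract_locs
```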

This list of URLs can then be piped back to wget with the -i - option and crawled with --level 1 recursion to check for broken outbound links:

wget -O - https://danburzo.ro/sitemap.xml \
| hred -xcr loc@.textContent \
| wget -i - -H --spider --recursive --level 1 --tries 1 --timeout 2

The second challenge is that the process can take quite a long time (41 minutes for the fairly modest website you're now reading), and the results are shown only once the process ends, if it ends at all. To keep any single link from stalling the run, set how long to wait for a response (--timeout) and how many times to retry (--tries).

A final challenge, more of a nuisance, is that wget litters the current working directory with a whole bunch of empty folders (one for each domain it finds while crawling), which then need to be deleted manually.
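One way to sidestep the cleanup is to run the crawl inside a throwaway directory and delete it afterwards. The sketch below assumes mktemp -d is available; the wrapper name crawl_in_scratch is made up.

```shell
# Run a command inside a temporary directory, then remove that directory,
# so folders created by the command never land in the working directory.
crawl_in_scratch() {
  scratch=$(mktemp -d) || return 1
  status=0
  # The subshell keeps the cd from affecting the caller.
  ( cd "$scratch" && "$@" ) || status=$?
  rm -rf "$scratch"
  return "$status"
}

# Typical use: stdin still reaches wget through the wrapper.
# wget -O - https://danburzo.ro/sitemap.xml \
# | hred -xcr loc@.textContent \
# | crawl_in_scratch wget -i - -H --spider --recursive --level 1 --tries 1 --timeout 2
```

The wrapper returns the exit status of the wrapped command, so it can still be used in scripts that check whether the crawl succeeded.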

Gather the list of URLs from a website

Inspired by this article by Elliot Cooper, but using ripgrep:

wget --spider -r danburzo.ro 2>&1 \
| rg '^--.+--\s+(.+)$' --replace '$1' \
| sort -u

Note that this lists everything, including links to images, styles, and scripts.
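Where ripgrep is not available, the same filtering can be sketched with sed. It relies on the same assumption as the pipeline above, namely that wget logs each request as a line of the form --timestamp-- URL; the function name extract_urls is made up.

```shell
# Keep only the URL from wget's "--<timestamp>--  <URL>" request lines,
# then sort and de-duplicate the result.
extract_urls() {
  sed -n 's/^--.*--  *\(.*\)$/\1/p' | sort -u
}

# Typical use (uncomment to crawl the live site):
# wget --spider -r danburzo.ro 2>&1 | extract_urls
```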

The 2>&1 part redirects stderr (file descriptor 2) to stdout (file descriptor 1). Without it, wget's log can't be piped to another tool, because wget writes that information to the stderr stream.
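The effect of the redirection is easy to see with a stand-in command. In this small sketch, noisy is a made-up function that, like wget, writes its log to stderr.

```shell
# A stand-in for wget: it writes one log line to stderr and nothing to stdout.
noisy() {
  echo "log line" >&2
}

# Without the redirection the pipe receives nothing to count;
# the log line bypasses it and goes straight to the terminal.
noisy | wc -l

# With 2>&1 the log line travels through the pipe and gets counted.
noisy 2>&1 | wc -l
```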