wget Recipes

Note: wget2 is a successor to wget, currently in development.

Website mirroring / scraping

Mirror a website

wget --mirror --no-clobber --no-parent --wait=3 --execute robots=off --domains=danburzo.ro,assets.danburzo.ro --user-agent=Mozilla danburzo.ro

A quick explanation for these flags:

--mirror: shorthand for -r -N -l inf --no-remove-listing, i.e. recursive downloading with unlimited depth and timestamping;
--no-clobber: skip files that have already been downloaded;
--no-parent: never ascend above the starting directory when recursing;
--wait=3: pause three seconds between requests, to go easy on the server;
--execute robots=off: ignore the directives in the site's robots.txt;
--domains: restrict the crawl to the comma-separated list of domains;
--user-agent=Mozilla: send a browser-like User-Agent, for servers that treat the default wget identification with suspicion.
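If the goal is a copy that can be browsed offline rather than a raw mirror, a variant worth trying is to also rewrite links and fetch page assets. A sketch, with the same politeness caveats as above; adjust the domain to the site you're mirroring:

wget --mirror --convert-links --adjust-extension --page-requisites --wait=3 --execute robots=off --user-agent=Mozilla danburzo.ro

--convert-links rewrites the links in downloaded pages to point to their local copies, --adjust-extension saves pages with an .html suffix, and --page-requisites also downloads the images, stylesheets and scripts each page needs.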

Download sequential URLs

wget http://example.com/records/{1..1000}
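Note that the {1..1000} range is expanded by the shell (bash or zsh), not by wget, so wget simply receives a thousand separate URLs. In a shell without brace expansion, or when you need more control over the number format, an equivalent sketch (assuming seq is available) is to generate the list and feed it to wget over standard input:

seq -f 'http://example.com/records/%g' 1 1000 | wget -i -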

Download a list of URLs

The URLs can be specified in a separate file list-of-URLs.txt, with one URL per line.

wget --input-file=list-of-URLs.txt
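For illustration, with made-up URLs, the file is just plain text:

printf '%s\n' https://example.com/a https://example.com/b https://example.com/c > list-of-URLs.txt
wget --input-file=list-of-URLs.txt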

Content auditing

With the --spider option, wget crawls a website without downloading anything and produces a summary of all broken internal links:

wget --recursive --level=0 --spider danburzo.ro

This works particularly well for running against a local development server, as crawling is very fast even for a huge set of pages:

wget --recursive --level=0 --spider http://localhost:8080

Before running the command, make sure you remove the build folder (e.g. rm -rf dist) to clean up any lingering old pages, which would otherwise satisfy links that are in fact broken and hide them from the report.
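Put together, an audit run might look like the sketch below. The build and serve commands are placeholders for whatever your static site setup uses, and tee just keeps a copy of the report for later inspection:

rm -rf dist                 # clear stale pages from previous builds
npm run build               # placeholder: rebuild the site into dist/
npx serve -l 8080 dist &    # placeholder: serve the build locally
wget --recursive --level=0 --spider http://localhost:8080 2>&1 | tee spider.log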

Some things to be aware of:

Find broken outbound links

Although the -H (--span-hosts) option allows wget to follow links beyond the domain it was given, checking outbound links poses multiple challenges.

Most importantly, wget doesn't have an option to set a separate recursion level for external domains. You can't keep the unlimited internal recursion that lets you crawl the entire website starting from its homepage, because the same setting would send wget off to crawl the whole World Wide Web as soon as it escapes the original domain.

When you have access to a sitemap, you can use it to grab the list of all URLs on the domain with hred, as below. In the absence of a sitemap.xml, you can gather the URLs by crawling the website with wget and a grep-like tool, as shown in the last recipe on this page.

wget -O - https://danburzo.ro/sitemap.xml \
| hred -xcr loc@.textContent

This list of URLs can then be piped back into wget with the -i - option and crawled with --level 1 recursion to check for broken outbound links:

wget -O - https://danburzo.ro/sitemap.xml \
| hred -xcr loc@.textContent \
| wget -i - -H --spider --recursive --level 1 --tries 1 --timeout 2

The second challenge is that the process can take quite a long time, 41 minutes for the rather modest website you're reading right now, and that the results are only shown once (and if) the process ends. To keep any one link from stalling the crawl for too long, the options for how long to wait for a request (--timeout) and how many times to retry it (--tries) come in handy.

A final challenge, which is more of a nuisance, is that wget litters the current working directory with a bunch of empty folders (one for each domain it encounters while crawling), which then need to be deleted manually.
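A quick way to remove them, assuming the working directory contains no other empty folders you want to keep (both GNU and BSD find support these flags):

find . -type d -empty -delete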

Gather the list of URLs from a website

Inspired by this article by Elliot Cooper, but using ripgrep:

wget --spider -r danburzo.ro 2>&1 \
| rg '^--.+-- (.+)$' --replace '$1' \
| sort -u

Note that this lists everything, including links to images, styles, and scripts.
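To keep only the page URLs, one option is to filter out common asset extensions with an extra rg pass; the extension list below is just an example, extend it as needed:

wget --spider -r danburzo.ro 2>&1 \
| rg '^--.+-- (.+)$' --replace '$1' \
| rg -v '\.(css|js|png|jpe?g|gif|svg|ico|woff2?)$' \
| sort -u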

The 2>&1 part redirects stderr (file descriptor 2) to stdout (file descriptor 1). Without it, we couldn't pipe wget's progress to a different tool, because wget writes this information to the stderr stream. In bash and zsh, |& can be used as a shorthand for 2>&1 |.

Further reading