wget Recipes
Note: wget2 is a successor to wget, currently in development.
Website mirroring / scraping
Mirror a website
wget --mirror --no-clobber --no-parent --wait=3 --execute robots=off --domains=danburzo.ro,assets.danburzo.ro --user-agent=Mozilla danburzo.ro
A quick explanation for these flags:
- --mirror is a shorthand for -r -N -l inf --no-remove-listing, which are some switches useful for scraping
- --no-clobber instructs wget not to fetch each occurrence of the same URL as a separate file
- --no-parent: don't go up the hierarchy
- --wait=seconds is used to wait for N seconds between requests, a.k.a. being nice to the server
- --execute robots=off will ignore the robots.txt file on the website and any nofollow attributes on links
- --domains=domain1,domain2,... is a list of domain names to consider when crawling (you'll want to include here any domains that hold the assets for the page)
- --user-agent=UserAgentString can be used in case the server blocks access based on the User Agent
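If the goal is a copy you can browse offline, rather than a raw mirror, a few extra flags help; a sketch along the same lines, where --convert-links rewrites links to point to the local copies, --page-requisites pulls in the CSS, images and scripts each page needs, --adjust-extension saves HTML files with an .html extension, and --random-wait varies the delay around the --wait value:
wget --mirror --convert-links --page-requisites --adjust-extension --wait=3 --random-wait --execute robots=off --domains=danburzo.ro,assets.danburzo.ro --user-agent=Mozilla danburzo.ro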
Download sequential URLs
wget http://example.com/records/{1..1000}
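The {1..1000} part is brace expansion, so the URLs are generated by the shell before wget ever sees them; if the records use zero-padded numbers, a padded range works too (Bash 4+ and zsh):
wget http://example.com/records/{0001..1000}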
Download a list of URLs
The URLs can be specified in a separate file, list-of-URLs.txt, with one URL per line.
wget --input-file=list-of-URLs.txt
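To keep the downloaded files out of the current directory, the list can be combined with --directory-prefix (the folder name below is just an example):
wget --input-file=list-of-URLs.txt --directory-prefix=downloads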
Content auditing
Find broken internal links
With the --spider option, wget crawls a website without downloading anything and produces a summary of all broken internal links:
wget --recursive --level=0 --spider danburzo.ro
This works particularly well for running against a local development server, as crawling is very fast even for a huge set of pages:
wget --recursive --level=0 --spider http://localhost:8080
Before running the command, make sure you remove the build folder (e.g. rm -rf dist) to clean up any lingering old pages, which can potentially cause false-negative results for broken links.
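Since the broken-link summary scrolls by along with the rest of the output, it can be handy to log everything to a file with --output-file and search it afterwards; a sketch, keeping in mind that the log file name is arbitrary and the exact wording of the summary may vary between wget versions:
wget --recursive --level=0 --spider http://localhost:8080 --output-file=spider.log
rg -i 'broken link' spider.log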
Some things to be aware of:
- even with --max-redirect=0, wget will follow HTTP 301 redirects from links that don't have a trailing slash to links that have one, e.g. from /page to /page/, so you can't use it to find these sorts of links. The issue may be fixed in wget2, but I haven't checked. To remind myself to add trailing slashes to links, I use a rule in my debug stylesheet.
- running the wget command leaves behind an empty directory called localhost:8080, which can be safely deleted.
Find broken outbound links
Although the -H option allows wget to look beyond the domain it was given, checking outbound links poses multiple challenges.
Most importantly, wget doesn't have an option to configure a separate recursion level for external domains. You can't have the infinite internal recursion that lets you crawl the website starting from its homepage, because that would instruct wget to crawl the entire World Wide Web once it escapes the original domain.
When you have access to a sitemap, you can use it to grab a list of all URLs from the domain with hred, as below. In the absence of sitemap.xml, you can gather all URLs from the website using wget and a grep-like tool.
wget -O - https://danburzo.ro/sitemap.xml \
| hred -xcr loc@.textContent
This list of URLs can then be piped back to wget with the -i - option, and crawled with --level 1 recursion to check for broken outbound links:
wget -O - https://danburzo.ro/sitemap.xml \
| hred -xcr loc@.textContent \
| wget -i - -H --spider --recursive --level 1 --tries 1 --timeout 2
The second challenge is that the process can take quite a long time (41 minutes for the quite modest website you're now reading) and the results are only shown once (and if) the process ends. To prevent any one link from stalling the process too much, options for how long to wait for a request (--timeout) and how many times to retry (--tries) are useful.
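Since the results only appear at the very end, it also helps to keep a copy of the whole log as the check runs; wget writes its progress to stderr, so it can be redirected and passed through tee (the log file name below is arbitrary):
wget -O - https://danburzo.ro/sitemap.xml \
| hred -xcr loc@.textContent \
| wget -i - -H --spider --recursive --level 1 --tries 1 --timeout 2 2>&1 \
| tee outbound.log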
A final challenge, which is more of a nuisance, is that wget will litter the current working directory with a whole bunch of empty folders (one for each domain it finds while crawling), which then need to be manually deleted.
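One way to clean up after such a run is to remove every empty directory directly under the current one; a sketch, worth double-checking before running since it catches any empty directory, not just the ones wget created:
find . -maxdepth 1 -type d -empty -delete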
Gather the list of URLs from a website
Inspired by this article by Elliot Cooper, but using ripgrep:
wget --spider -r danburzo.ro 2>&1 \
| rg '^--.+-- (.+)$' --replace '$1' \
| sort -u
Note that this lists everything, including links to images, styles, and scripts.
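To narrow the list down to pages, one option is to filter out common asset extensions with an extra rg pass (the extension list below is just a starting point):
wget --spider -r danburzo.ro 2>&1 \
| rg '^--.+-- (.+)$' --replace '$1' \
| rg -v '\.(css|js|png|jpe?g|gif|svg|woff2?|ico)$' \
| sort -u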
The 2>&1 part redirects stderr (which has the handle 2) to stdout (1). Without this we can't pipe wget's progress to a different tool, because it outputs the information on the stderr stream.