Website mirroring / scraping
Mirror a website
wget --mirror --no-clobber --no-parent --wait=3 --execute robots=off --domains=danburzo.ro,assets.danburzo.ro --user-agent=Mozilla danburzo.ro
A quick explanation for these flags:
--mirror is shorthand for -r -N -l inf --no-remove-listing, a set of switches useful for scraping
--no-clobber instructs wget not to fetch each occurrence of the same URL as a separate file
--no-parent tells wget not to go up the directory hierarchy
--wait=seconds makes wget wait the given number of seconds between requests, a.k.a. being nice to the server
--execute robots=off makes wget ignore the robots.txt file on the website and any nofollow attributes on links
--domains=domain1,domain2,... is a list of domain names to consider when crawling (you'll want to include here any domains that hold the assets for the page)
--user-agent=UserAgentString can be used in case the server blocks access based on the User Agent
Download sequential URLs
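One way to approach this, sketched under the assumption that the URLs differ only by a running number, is to generate the list in the shell and hand it to wget on standard input. The function name sequential_urls and the example.com URL pattern are hypothetical.

```shell
# Print one URL per line for a numbered sequence.
# Usage: sequential_urls PREFIX FIRST LAST SUFFIX
sequential_urls() {
  prefix=$1
  first=$2
  last=$3
  suffix=$4
  i=$first
  while [ "$i" -le "$last" ]; do
    printf '%s%s%s\n' "$prefix" "$i" "$suffix"
    i=$((i + 1))
  done
}

# Example with a hypothetical URL pattern; "-i -" makes wget read URLs from stdin:
# sequential_urls 'https://example.com/images/photo-' 1 20 '.jpg' | wget -i -
```

Generating the list first also makes it easy to inspect the URLs before anything is downloaded.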
Download a list of URLs
The URLs can be specified in a separate file, list-of-URLs.txt, with one URL per line.
wget --input-file=list-of-URLs.txt
Find broken internal links
With the --spider option, wget crawls a website without downloading anything and produces a summary of all broken internal links:
wget --recursive --level=0 --spider danburzo.ro
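Since the summary only appears at the very end of a long log, it can help to write the log to a file and pull the summary out afterwards. The sketch below assumes the log contains a "Found N broken links." line followed by a blank line and then one URL per line; the function name broken_links and the file name spider.log are hypothetical.

```shell
# Print the URLs listed in wget's broken-link summary.
# Assumption: the log has a "Found N broken link(s)." line, then a blank
# line, then one URL per line, ending at the next blank line.
broken_links() {
  awk '
    /^Found [0-9]+ broken links?\./ { in_summary = 1; next }
    in_summary && /^[[:space:]]*$/  { if (seen) exit; next }
    in_summary                      { seen = 1; print }
  ' "$1"
}

# Example: write the crawl log to a file, then read the summary from it.
# wget --recursive --level=0 --spider --output-file=spider.log danburzo.ro
# broken_links spider.log
```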
Find broken outbound links
Although the -H option allows wget to look beyond the domain it is supplied, checking outbound links poses multiple challenges.
The first is that wget doesn't have an option to configure a separate recursion level for external domains. You can't have the infinite internal recursion that lets you crawl the website starting from its homepage, because that would instruct wget to crawl the entire World Wide Web once it escapes the original domain.
When you have access to a sitemap, you can use it to grab a list of all URLs from the domain with hred, as below. In the absence of sitemap.xml, you can gather all URLs from the website using wget and a grep-like tool.
wget -O - https://danburzo.ro/sitemap.xml \
  | hred -xcr loc@.textContent
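If hred is not at hand, a rough stand-in can be built from grep and sed. This is only a sketch: it assumes that no loc element spans multiple lines, it does not decode XML entities such as &amp;, and the function name sitemap_urls is hypothetical.

```shell
# Print the text of every <loc> element read from stdin, one per line.
# Assumptions: no <loc> element spans multiple lines; XML entities such
# as &amp; are left undecoded. grep -o is a widely available extension.
sitemap_urls() {
  grep -o '<loc>[^<]*</loc>' \
    | sed -e 's/^<loc>//' -e 's/<\/loc>$//' \
          -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# Example:
# wget -O - https://danburzo.ro/sitemap.xml | sitemap_urls
```

A real XML-aware tool is the more robust choice; this version trades correctness on unusual sitemaps for having no extra dependencies.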
This list of URLs can then be piped back to wget with the -i - option, and crawled with --level 1 recursion to check for broken outbound links:
wget -O - https://danburzo.ro/sitemap.xml \
  | hred -xcr loc@.textContent \
  | wget -i - -H --spider --recursive --level 1 --tries 1 --timeout 2
The second challenge is that the process can take quite a long time (41 minutes for the quite modest website you're now reading), and the results are only shown once (and if) the process ends. To prevent any one link from stalling the process too much, the options for how long to wait for a request (--timeout) and how many times to retry (--tries) are useful.
A final challenge, which is more of a nuisance, is that wget will litter the current working directory with a whole bunch of empty folders (one for each domain it finds while crawling), which then need to be manually deleted.
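The cleanup of those empty folders can be scripted. The sketch below assumes a find implementation that supports -mindepth, -empty and -delete (GNU and BSD find do, POSIX does not require them); the function name remove_empty_dirs is hypothetical.

```shell
# Delete all empty directories below DIR; DIR itself is kept.
# -delete implies depth-first traversal, so a directory that only
# contained empty directories is removed as well.
remove_empty_dirs() {
  find "$1" -mindepth 1 -type d -empty -delete
}

# Example: clean up the current directory after a crawl.
# remove_empty_dirs .
```

Running the crawl from a dedicated scratch directory keeps the cleanup from touching empty folders that were there for other reasons.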
Gather the list of URLs from a website
Inspired by this article by Elliot Cooper, but using rg (ripgrep):
wget --spider -r danburzo.ro 2>&1 \
  | rg '^--.+-- (.+)$' --replace '$1' \
  | sort -u
Note that this lists everything, including links to images, styles, and scripts.
The 2>&1 part redirects stderr (which has the handle 2) to stdout (which has the handle 1). Without this we can't pipe wget's progress to a different tool, because it outputs the information on stderr.
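Because the list includes images, styles, and scripts, a further filter can narrow it down to page-like URLs. The sketch below drops URLs by file extension; the extension list is illustrative rather than exhaustive, and the function name pages_only is hypothetical.

```shell
# Keep only URLs that do not end in a common asset extension
# (optionally followed by a query string or fragment).
# The extension list is illustrative, not exhaustive.
pages_only() {
  grep -E -i -v '\.(png|jpe?g|gif|svg|webp|ico|css|js|woff2?)([?#].*)?$'
}

# Example: append the filter to the pipeline that produces the URL list.
# ... | sort -u | pages_only
```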