hred recipes
hred extracts JSON data from HTML and XML documents using QSX, a query language inspired by CSS selectors. hred reads its input from standard input, so the document must be fetched separately, with curl or a similar tool.
Working with XML
Atom and RSS feeds
Below are abridged versions of typical Atom and RSS feeds, with a focus on the URLs of the individual posts.
<!-- Atom -->
<feed>
<entry>
<title>My post</title>
<link href="https://example.com/my-post"/>
</entry>
</feed>
<!-- RSS -->
<rss>
<channel>
<item>
<title>My post</title>
<link>https://example.com/my-post</link>
</item>
</channel>
</rss>
To extract these URLs with hred, one per line:
# Atom feed
curl https://example.com/posts.xml | hred -xcr 'entry > link:is([rel=alternate],:not([rel]))@href';
# RSS feed
curl https://example.com/posts.xml | hred -xcr 'item > link:is([rel=alternate],:not([rel]))@.textContent';
The -xcr set of flags is short for:
--xml, which interprets the input as XML rather than the default HTML;
--concat, instead of outputting one big array, puts a record on each line, a format known as concatenated JSON;
--raw, since this is an array of strings, outputs raw, unquoted strings.
The approach can be adapted to extracting URLs from sitemap.xml files, or from feed subscription lists in OPML format.
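For the sitemap case, a minimal sketch might look like the following. It assumes the usual sitemap layout, in which each url element contains a loc element holding the page address as text. The sitemap URL is a placeholder, the variable names are made up for illustration, and the selector is an untested adaptation of the RSS one above rather than a recipe taken from hred's documentation.

```shell
# Hypothetical sitemap location; replace with a real one.
sitemap_url='https://example.com/sitemap.xml'

# Assumed selector: the text content of every <loc> directly inside a <url>.
selector='url > loc@.textContent'

# Run the pipeline only if hred is available on this machine.
if command -v hred >/dev/null 2>&1; then
  curl -s "$sitemap_url" | hred -xcr "$selector"
else
  echo 'hred is not installed' >&2
fi
```

The flags are the same -xcr set as before, since a sitemap is XML and the result is again a list of strings.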
See also
The JSON produced by hred can be further processed with jq.
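As a small illustration of that hand-off, the sketch below filters a list of URLs with jq. To keep it runnable without a network connection, two made-up sample records stand in for the output of hred with the -xc flags (without --raw, so each line is a quoted JSON string). The variable names and the second URL are assumptions for the example only.

```shell
# In practice the records would come from hred, for example:
#   curl https://example.com/posts.xml | hred -xc 'entry > link:is([rel=alternate],:not([rel]))@href'
# Here two sample records stand in for that output, one JSON string per line.
hred_output='"https://example.com/my-post"
"http://example.com/old-post"'

# jq reads the records one at a time; select() keeps those whose value
# starts with https://, and -r prints them unquoted.
https_urls=$(printf '%s\n' "$hred_output" | jq -r 'select(startswith("https://"))')
printf '%s\n' "$https_urls"
```

Leaving out --raw on the hred side keeps every record valid JSON, which is what jq expects on its input.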