Toolbox

Command-Line Tools

Alternatives to, and complements for, standard CLI tools

find

grep

ack

curl

Data formats

Benchmarking

General purpose

puppeteer

Run a headless version of Chrome from Node.js

Working with specific formats

See also: structured text tools

jq

For processing JSON files. Further reading: Reshaping JSON with jq.

See also: fx, gron
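To make "reshaping" concrete, here is a small sketch in plain Python of the kind of transformation jq is typically used for: flattening nested records down to the fields you care about. The input records and the `reshape` helper are made up for illustration. In jq itself the same idea would be a one-line filter along the lines of `[.[] | {name: .name, city: .address.city}]`.

```python
import json

# Hypothetical input: nested records, as an API might return them.
raw = """
[
  {"name": "Ada", "address": {"city": "London", "zip": "N1"}},
  {"name": "Linus", "address": {"city": "Helsinki", "zip": "00100"}}
]
"""


def reshape(records):
    """Keep only the name and the nested city from each record."""
    return [{"name": r["name"], "city": r["address"]["city"]} for r in records]


flat = reshape(json.loads(raw))
print(json.dumps(flat, indent=2))
```

The appeal of jq is that this whole script collapses into a single filter expression on the command line.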

pup

For processing HTML. It reads from stdin, prints to stdout, and allows the user to filter parts of the page using CSS selectors.

Because it reads from stdin, it should pair well with wget or curl for extracting data from web pages.
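The stdin-to-stdout filter pattern can be sketched with Python's standard library. This is not how pup is implemented, and it only matches one fixed tag (`a`) rather than arbitrary CSS selectors. The names `LinkCollector` and `filter_links` are hypothetical.

```python
import io
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag seen."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def filter_links(src, dst):
    """Read HTML from src and write one link target per line to dst."""
    parser = LinkCollector()
    parser.feed(src.read())
    parser.close()
    for href in parser.hrefs:
        dst.write(href + "\n")


# Demo on an in-memory page; the HTML here is a made-up sample.
page = io.StringIO('<p>See <a href="/docs">docs</a> and <a href="/faq">FAQ</a>.</p>')
out = io.StringIO()
filter_links(page, out)
print(out.getvalue(), end="")
```

Calling `filter_links(sys.stdin, sys.stdout)` instead turns the sketch into a pipe-friendly filter that can sit after a download command.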

fonttools

Converts TTF/OTF fonts to and from XML, so you can edit a font (e.g. its metadata) in plain text and then rebuild it.
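fonttools ships a `ttx` command for this round trip. The same thing can be sketched through its Python API, assuming the fontTools package is installed. The function names and the file paths in the usage note are placeholders, not real files.

```python
def font_to_xml(font_path, xml_path):
    """Dump a TTF/OTF font to an editable XML (TTX) file."""
    # Imported lazily so this module still loads where fontTools is absent.
    from fontTools.ttLib import TTFont

    font = TTFont(font_path)
    font.saveXML(xml_path)


def xml_to_font(xml_path, font_path):
    """Rebuild a binary font from a (possibly edited) XML dump."""
    from fontTools.ttLib import TTFont

    font = TTFont()
    font.importXML(xml_path)
    font.save(font_path)
```

A typical session would call `font_to_xml("MyFont.ttf", "MyFont.ttx")`, edit the name table in a text editor, then call `xml_to_font("MyFont.ttx", "MyFont-edited.ttf")`.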

osmosis

Filter & merge OpenStreetMap data files (XML, PBF).

electron-pdf

Generate a PDF from a URL, HTML or Markdown file.

textkit

For manipulating and analyzing text.

monolith

For saving complete web pages as a single HTML file.

csvkit

A suite of utilities for converting to and working with CSV, the king of tabular file formats.
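As an illustration of the conversions such a suite covers, here is a minimal CSV-to-JSON sketch using only Python's standard library. It is not csvkit's own code, the sample data is invented, and every value stays a string, whereas a dedicated tool may infer numbers and dates.

```python
import csv
import io
import json


def csv_to_json(text):
    """Turn CSV text with a header row into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return json.dumps(rows, indent=2)


# Made-up sample data.
sample = "name,qty\napple,3\npear,5\n"
print(csv_to_json(sample))
```

csvkit bundles this kind of conversion (see its `csvjson` utility) alongside tools for inspecting, slicing, and querying CSV files.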

Utilities

For de-warping scans

unproject_text: perspective recovery of text using transformed ellipses. Write-up.

page_dewarp: page dewarping and thresholding using a "cubic sheet" model. Write-up.

For upscaling images

RAISR: Google's Rapid and Accurate Image Super Resolution, a technique that uses machine learning to upscale images. There are a few implementations of the algorithm on GitHub: movehand/raisr, MKFMIKU/RAISR.