Data sets
A collection of data sets with permissive licenses.
Linguistic
- atebits/Words, a huge dataset of words in four languages (English, German, Spanish and French) used in Atebits' game Letterpress.
- dariusk/corpora, a collection of small corpuses of interesting data for the creation of bots and similar stuff. See also danburzo/corpora.
Geographical
- Natural Earth Vector, a global, public domain map dataset available at three scales and featuring tightly integrated vector and raster data.
- mledoze/countries, world countries in JSON, CSV, XML and YAML, as defined by ISO 3166-1.
- GeoNames contains over 27 million geographical names and consists of over 12 million unique features whereof 4.8 million populated places and 15 million alternate names.
- Geofabrik OpenStreetMap Data Extracts at the continent/country level.
- Interline OSM Extracts, city-sized portions of OpenStreetMap, produced daily.
- Who's On First, a gazetteer or big list of places, each with a stable identifier and some number of descriptive properties about that location.