Brainstorm: word lists for multiple languages
I haven't looked very extensively for a Spanish word list specific to crosswords, but dictionaries for many languages are easy to get from the myspell-xx packages. E.g.
$ wc -l /usr/share/myspell/es*.dic
70159 /usr/share/myspell/es.dic
57086 /usr/share/myspell/es_ES.dic
57997 /usr/share/myspell/es_MX.dic
It is somewhat likely that people will already have a spell checking dictionary for their language. What if we built a word list at runtime from those? They are hunspell files.
Note to self: an introduction to the hunspell affix syntax.
Hunspell dictionaries have an affix system. From en_US.dic:
astonish/DSLG
astonishing/Y
astonishment/M
astound/GDS
astounding/Y
astraddle
astrakhan/M
astral
astray
astride
So a verb like astonish has /DSLG there, and the en_US.aff file has enough information to derive astonishes and astonished from it; likewise, /Y is what lets it derive astoundingly from astounding. I don't know the details of how it works, but hunspell is widely used by LibreOffice / Firefox / Enchant / etc.
A minimum viable version would probably just ignore the affixes and use the plain words from the .dic files. I don't know if libenchant allows us to consume this raw data, or if it only has a "look up this word" kind of API.
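A rough sketch of the "ignore the affixes" idea, reading the .dic text directly instead of going through libenchant. The function names plain_words and load_dic are made up for illustration. It assumes the first line is the usual entry count, that anything after the first whitespace on a line is extra morphological data we can drop, and that the caller knows the file's encoding (the matching .aff file normally declares it on a SET line; UTF-8 here is only a guess).

```python
from pathlib import Path


def plain_words(dic_text):
    """Yield the bare word of each .dic entry, ignoring any /FLAGS suffix."""
    lines = dic_text.splitlines()
    # The first line of a .dic file is normally an (approximate) entry count.
    if lines and lines[0].strip().isdigit():
        lines = lines[1:]
    for line in lines:
        fields = line.split(None, 1)
        if not fields:
            continue  # skip blank lines
        # Simplification: keep only the first whitespace-separated token,
        # then cut off the flags after the first slash.
        word = fields[0].split("/", 1)[0]
        if word:
            yield word


def load_dic(path, encoding="utf-8"):
    """Hypothetical helper: read a .dic file and return its plain words."""
    # Assumption: the caller passes the right encoding for this dictionary.
    return list(plain_words(Path(path).read_text(encoding=encoding)))
```

Something like load_dic("/usr/share/myspell/es.dic") would then give the unexpanded word list for Spanish, provided the encoding matches.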
A more complete version would explode the affixes and stuff everything in our big word list.
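To get a feel for what "exploding" would involve, here is a very simplified sketch that handles suffix rules only. As far as I understand the format, a rule line in the .aff file looks like "SFX flag strip add condition", where 0 means "nothing" and the condition is a small pattern matched against the end of the word. The SAMPLE_AFF rules below are invented for illustration and are not copied from the real en_US.aff; the function names are hypothetical too. Prefixes, multi-character flags, and rules that add further flags are all ignored here.

```python
import re
from collections import defaultdict

# Illustrative rules only, NOT the real contents of en_US.aff.
SAMPLE_AFF = """\
SFX S Y 2
SFX S 0 es [sxzh]
SFX S 0 s [^sxzh]
SFX D Y 2
SFX D 0 d e
SFX D 0 ed [^e]
SFX Y Y 1
SFX Y 0 ly .
"""


def parse_suffix_rules(aff_text):
    """Map each flag to a list of (strip, add, condition_regex) suffix rules."""
    rules = defaultdict(list)
    for line in aff_text.splitlines():
        fields = line.split()
        # Rule lines have at least 5 fields; 4-field header lines are skipped.
        if len(fields) < 5 or fields[0] != "SFX":
            continue
        _, flag, strip, add, condition = fields[:5]
        strip = "" if strip == "0" else strip
        add = add.split("/", 1)[0]  # drop any continuation flags
        add = "" if add == "0" else add
        # The condition applies to the end of the word.
        rules[flag].append((strip, add, re.compile(condition + "$")))
    return rules


def expand(entry, rules):
    """Return the set of forms for one .dic entry such as 'astonish/DSLG'."""
    word, _, flags = entry.partition("/")
    forms = {word}
    for flag in flags:  # assumes single-character flags
        for strip, add, condition in rules.get(flag, []):
            if condition.search(word) and word.endswith(strip):
                stem = word[: len(word) - len(strip)]
                forms.add(stem + add)
    return forms


rules = parse_suffix_rules(SAMPLE_AFF)
print(sorted(expand("astonish/DSLG", rules)))  # ['astonish', 'astonished', 'astonishes']
```

The L and G flags produce nothing here only because the sample rules don't define them; with the real .aff file they would add more forms.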
Of course, those dictionaries don't have priorities like Peter Broda's wordlist.