Add pt-br wordlist (#60)
The wordlist was generated from two sources of words:

- The file /usr/share/dict/brazilian from Debian's wbrazilian package.
- A dump of the pages of Wikipedia in Portuguese.

The final pt-br wordlist was generated as follows:

1. Download a dump of Portuguese Wikipedia pages, process all pages and
   determine the frequency of each word.
2. Start from /usr/share/dict/brazilian and filter out:
   - words not matching /^[a-z]+$/,
   - words shorter than 4 characters, and
   - words longer than 8 characters.
3. Remove all words that are a suffix of any other word in the list.
4. Sort the remaining words by their frequency in Portuguese Wikipedia.
5. Take the 7776 most frequent words.

No further curation was done. This approach has obvious drawbacks (e.g. many very frequent words are left out because they are too short, too long, or contain accents or a cedilla), but it was the best cost-benefit trade-off I could think of.
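The steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the script actually used for this commit: the function names (`count_frequencies`, `filter_candidates`, `drop_suffixes`, `build_wordlist`) are hypothetical, the Wikipedia dump is assumed to be already extracted to plain text, and the suffix removal is assumed to run on the already filtered list.

```python
import re
from collections import Counter

LETTERS_RE = re.compile(r"[^\W\d_]+")   # runs of Unicode letters
PLAIN_RE = re.compile(r"[a-z]+")        # unaccented lowercase ASCII only


def count_frequencies(pages):
    """Step 1: count how often each lowercased word occurs in the pages."""
    counts = Counter()
    for text in pages:
        counts.update(LETTERS_RE.findall(text.lower()))
    return counts


def filter_candidates(words, min_len=4, max_len=8):
    """Step 2: keep words that are [a-z]+ and 4 to 8 characters long."""
    return {
        w for w in words
        if PLAIN_RE.fullmatch(w) and min_len <= len(w) <= max_len
    }


def drop_suffixes(words):
    """Step 3: remove words that are a proper suffix of another word."""
    words = set(words)
    suffixes = {w[i:] for w in words for i in range(1, len(w))}
    return words - suffixes


def build_wordlist(dictionary_words, frequencies, size=7776):
    """Steps 2 to 5: filter, drop suffixes, rank by frequency, truncate.

    Ties are broken alphabetically (an assumption; the commit message
    does not say how ties were handled). 7776 equals 6**5.
    """
    candidates = drop_suffixes(filter_candidates(dictionary_words))
    ranked = sorted(candidates, key=lambda w: (-frequencies.get(w, 0), w))
    return ranked[:size]
```

In a real run, `dictionary_words` would be the stripped lines of /usr/share/dict/brazilian and `pages` the text of each Wikipedia page; how the dump is parsed is left out here because the commit message does not describe it.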
0 → 100644