Improve pt-br wordlist (#63)
The pt-br wordlist was first introduced in 7743ed55. This version differs
from it as follows:

- 9-character words are now included.
- suffix removal is done after accounting for popularity.
- less frequent words that differ only in the last character are removed.

The current pt-br wordlist was generated as follows:

1. Download a dump of Portuguese Wikipedia pages, process all pages and
   determine the frequency of each word.
2. Start from /usr/share/dict/brazilian and filter out:
   - words not matching /^[a-z]+$/,
   - words shorter than 4 characters, and
   - words longer than 9 characters.
3. Sort the remaining words by their pt Wikipedia frequency.
4. Take the top 30K words (chosen only because, after the filtering in
   step 5, this still leaves roughly the number of words we need).
5. Filter out:
   - all words that are a suffix of any other word in the list.
   - less frequent words that differ only in the last character.
6. Take the 7776 most frequent words.

No further curation was done.
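Steps 2 to 6 can be sketched roughly as below. This is an illustration, not the script that produced the list: the function name `build_wordlist`, its parameters, the alphabetical tie-break, and the order of the two step-5 filters are all assumptions. `freq` stands in for the word counts from step 1, and "differ only in the last character" is read as "same length and same leading characters".

```python
import re

# Lowercase ASCII letters only, as in step 2.
WORD_RE = re.compile(r"[a-z]+")


def build_wordlist(dictionary_words, freq, pool_size=30_000, final_size=7776):
    """Hypothetical sketch of steps 2-6; not the original generation script."""
    # Step 2: keep words made of a-z only, 4 to 9 characters long.
    candidates = {
        w for w in dictionary_words
        if WORD_RE.fullmatch(w) and 4 <= len(w) <= 9
    }

    # Step 3: most frequent first. Ties are broken alphabetically
    # (an assumption, only to make the result deterministic).
    ranked = sorted(candidates, key=lambda w: (-freq.get(w, 0), w))

    # Step 4: keep a pool of the most frequent words.
    pool = ranked[:pool_size]

    # Step 5a: drop every word that is a proper suffix of another pool word.
    suffixes = {w[i:] for w in pool for i in range(1, len(w))}
    pool = [w for w in pool if w not in suffixes]

    # Step 5b: among words sharing everything but the last character,
    # keep only the most frequent one (the pool is still frequency-ordered).
    seen_stems = set()
    kept = []
    for w in pool:
        stem = w[:-1]
        if stem in seen_stems:
            continue
        seen_stems.add(stem)
        kept.append(w)

    # Step 6: final cut.
    return kept[:final_size]
```

With a toy input such as the words "casa", "caso", "mesa", "rasa" and "brasa", ranked in that order of popularity with "mesa" second, "rasa" is dropped as a suffix of "brasa" and "caso" is dropped because the more frequent "casa" shares its first three letters.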