Improve pt-br wordlist (#63)

The pt-br wordlist was first introduced in 7743ed55. Compared to that
version, this one differs as follows:

- 9-character words are introduced.
- Suffix removal is done after accounting for popularity.
- Less frequent words that differ only in the last character are removed.

The current pt-br wordlist was generated as follows:

1. Download a dump of Portuguese Wikipedia pages, process all pages, and
   determine the frequency of each word.
2. Start from /usr/share/dict/brazilian and filter out:
   - words not matching /^[a-z]+$/,
   - words shorter than 4 characters, and
   - words longer than 9 characters.
3. Sort the remaining words by their pt Wikipedia frequencies.
4. Take the top 30K words (chosen only because, after the subsequent
   filtering, roughly the needed number of words still remains).
5. Filter out:
   - all words that are a suffix of any other word in the list, and
   - less frequent words that differ only by the last character.
6. Take the 7776 most frequent words.

No further curation was done.
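
For reference, steps 2 to 6 above can be sketched roughly as follows. This is not the script that produced the list; it is a minimal illustration under stated assumptions. The function name `build_wordlist` and its parameters are hypothetical, the Wikipedia frequencies from step 1 are assumed to be available already as a word-to-count mapping, the two filters of step 5 are assumed to run one after the other, and "differ only by the last character" is read as "same length and same prefix up to the last character".

```python
import re
from typing import Dict, Iterable, List

# Step 2 pattern: lowercase ASCII letters only (fullmatch anchors both ends).
WORD_RE = re.compile(r"[a-z]+")


def build_wordlist(
    dictionary_words: Iterable[str],
    frequencies: Dict[str, int],
    shortlist_size: int = 30_000,
    final_size: int = 7776,
    min_len: int = 4,
    max_len: int = 9,
) -> List[str]:
    """Hypothetical sketch of steps 2-6; names and defaults are assumptions."""
    # Step 2: keep only words of 4..9 lowercase ASCII letters.
    candidates = {
        w
        for w in dictionary_words
        if WORD_RE.fullmatch(w) and min_len <= len(w) <= max_len
    }

    # Steps 3-4: sort by descending frequency (ties broken alphabetically
    # so the result is deterministic) and keep the top of the ranking.
    ranked = sorted(candidates, key=lambda w: (-frequencies.get(w, 0), w))
    ranked = ranked[:shortlist_size]

    # Step 5a: drop every word that is a proper suffix of another word
    # in the shortlist.
    proper_suffixes = {w[i:] for w in ranked for i in range(1, len(w))}
    ranked = [w for w in ranked if w not in proper_suffixes]

    # Step 5b: among words that share everything but the last character,
    # keep only the most frequent one (the first seen in ranked order).
    seen_stems = set()
    kept = []
    for w in ranked:
        stem = w[:-1]
        if stem in seen_stems:
            continue
        seen_stems.add(stem)
        kept.append(w)

    # Step 6: take the most frequent survivors.
    return kept[:final_size]
```

With the real inputs, `dictionary_words` would be the lines of /usr/share/dict/brazilian and `frequencies` the counts derived from the Wikipedia dump; the default `final_size` of 7776 matches step 6.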