Commit 9c101bee authored by drebs, committed by ulif

Improve pt-br wordlist (#63)

A pt-br wordlist was first introduced in 7743ed55. This version differs
from that one as follows:

  - 9-character words are now included.
  - suffix removal is done after accounting for popularity.
  - less frequent words that differ only in the last character are
    removed.

The current pt-br wordlist was generated as follows:

  1. Download a dump of Portuguese Wikipedia pages, process all pages
     and determine the frequency of each word.
  2. Start from /usr/share/dict/brazilian and filter out:
       - words not matching /^[a-z]+$/,
       - words shorter than 4 characters, and
       - words longer than 9 characters.
  3. Sort the remaining words by Portuguese Wikipedia frequency.
  4. Take the top 30K words (just because after filtering we still get
     roughly the amount we need).
  5. Filter out:
       - all words that are a suffix of any other word in the list.
       - less frequent words that differ only by the last character.
  6. Take the 7776 most frequent words.
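
Steps 2 to 6 above can be sketched roughly as follows. This is not the
script that was actually used, only a minimal illustration under some
assumptions: `build_wordlist` and its parameters are hypothetical names,
`freq` is assumed to be a word-to-count mapping already produced by
step 1, sorting is assumed to be most-frequent-first with alphabetical
tie-breaking, and the two filters of step 5 are assumed to run one after
the other in the order listed.

```python
import re

# Only plain lowercase ASCII words are kept (step 2).
WORD_RE = re.compile(r"[a-z]+")


def build_wordlist(dictionary_words, freq, pool_size=30000, final_size=7776):
    """Hypothetical sketch of steps 2-6; `freq` maps word -> count."""
    # Step 2: keep words of 4 to 9 characters made only of a-z.
    candidates = {
        w for w in dictionary_words
        if WORD_RE.fullmatch(w) and 4 <= len(w) <= 9
    }

    # Step 3: most frequent first; ties broken alphabetically (assumption).
    ranked = sorted(candidates, key=lambda w: (-freq.get(w, 0), w))

    # Step 4: keep a working pool of the top-ranked words.
    pool = ranked[:pool_size]

    # Step 5a: drop every word that is a proper suffix of another pool word.
    suffixes = {w[i:] for w in pool for i in range(1, len(w))}
    pool = [w for w in pool if w not in suffixes]

    # Step 5b: among words differing only in the last character, keep the
    # most frequent one. The pool is still in frequency order, so the
    # first word seen for each stem wins.
    seen_stems = set()
    kept = []
    for w in pool:
        stem = w[:-1]
        if stem in seen_stems:
            continue
        seen_stems.add(stem)
        kept.append(w)

    # Step 6: take the most frequent survivors.
    return kept[:final_size]
```

With a dictionary read from /usr/share/dict/brazilian (one word per
line) and a frequency mapping from step 1, a call such as
`build_wordlist(words, freq)` would return the final list in frequency
order.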

No further curation was done.
parent 6a1a7623