Commit 7743ed55 authored by drebs, committed by ulif

Add pt-br wordlist (#60)

The wordlist was generated from 2 different sources of words:

  - The file /usr/share/dict/brazilian from Debian's wbrazilian package.
  - A dump of the pages of the Portuguese Wikipedia.

The final pt-br wordlist was generated as follows:

  1. Download a dump of the Portuguese Wikipedia pages, process all pages
     and determine the frequency of each word.
  2. Start from /usr/share/dict/brazilian and filter out:
       - words not matching /^[a-z]+$/,
       - words shorter than 4 characters, and
       - words longer than 8 characters.
  3. Remove all words that are a suffix of any other word in the list.
  4. Sort remaining words using pt Wikipedia frequencies.
  5. Take the 7776 most frequent words.
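
The steps above could be sketched roughly as follows. This is only an
illustration under stated assumptions, not the script used for this
commit: the function names are hypothetical, word counting is reduced
to splitting on whitespace (a real Wikipedia dump would need markup
stripped first), and ties in frequency are broken alphabetically, which
the steps above do not specify. 7776 is 6^5, i.e. one word per outcome
of five dice rolls.

```python
import re
from collections import Counter

# Same intent as /^[a-z]+$/ in step 2; fullmatch makes the anchors implicit.
WORD_RE = re.compile(r"[a-z]+")


def count_words(text):
    """Step 1 (simplified assumption): count lowercase whitespace-separated tokens."""
    return Counter(text.lower().split())


def filter_candidates(words, min_len=4, max_len=8):
    """Step 2: keep only unaccented a-z words of 4 to 8 characters."""
    return {
        w for w in words
        if WORD_RE.fullmatch(w) and min_len <= len(w) <= max_len
    }


def drop_suffixes(words):
    """Step 3: drop every word that is a proper suffix of another word in the set."""
    suffixes = set()
    for w in words:
        for i in range(1, len(w)):
            suffixes.add(w[i:])
    return set(words) - suffixes


def build_wordlist(dictionary_words, frequencies, size=7776):
    """Steps 2 to 5: filter, remove suffixes, rank by frequency, truncate."""
    candidates = drop_suffixes(filter_candidates(dictionary_words))
    # Most frequent first; alphabetical order as an assumed tie-breaker.
    ranked = sorted(candidates, key=lambda w: (-frequencies.get(w, 0), w))
    return ranked[:size]
```

In this sketch, dictionary_words would be the stripped lines of
/usr/share/dict/brazilian and frequencies the Counter built from the
Wikipedia text.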

No further curation was done.

There are obvious drawbacks to this approach (e.g. many very frequent
words are left out because they are too short, too long, or contain
accents or a cedilla), but it was the best cost-benefit trade-off I
could think of.
parent 65a801e6