  1. Mar 27, 2021
  2. Mar 24, 2021
    • Update README. · ad66e31d
      ulif authored
      Reflect changes in output of ``--help``.
      ad66e31d
  3. Mar 16, 2021
  4. Mar 15, 2021
  5. Sep 25, 2020
  6. Aug 24, 2020
  7. Aug 19, 2020
  8. Jul 26, 2020
  9. Apr 24, 2020
  10. Apr 20, 2020
  11. Apr 19, 2020
  12. Apr 16, 2020
  13. Jan 17, 2020
  14. Dec 21, 2019
  15. May 28, 2019
    • Improve pt-br wordlist (#63) · 9c101bee
      drebs authored and ulif committed
      The pt-br wordlist was first introduced in 7743ed55. This version
      differs from it as follows:
      
        - 9-character words are now included.
        - suffix removal is done after ranking words by popularity.
        - less frequent words that differ only in the last character are
          removed.
      
      The current pt-br wordlist was generated as follows:
      
        1. Download a dump of Portuguese Wikipedia pages, process all pages
           and determine the frequency of each word.
        2. Start from /usr/share/dict/brazilian and filter out:
             - words not matching /^[a-z]+$/,
             - words shorter than 4 characters, and
             - words longer than 9 characters.
        3. Sort the remaining words by their pt Wikipedia frequency.
        4. Take the top 30K words (chosen because, after the filtering in
           step 5, this still leaves roughly the number of words we need).
        5. Filter out:
             - all words that are a suffix of any other word in the list.
             - less frequent words that differ only by the last character.
        6. Take the 7776 most frequent words.
      
      No further curation was done.
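The steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the script used for the commit: the function names, the `freq` mapping (word to Wikipedia count) and the tie-breaking by alphabetical order are all assumptions. "Suffix of any other word" is read literally (a word is dropped if some other word in the list ends with it), and "differ only by the last character" is read as "same length, same characters except the last".

```python
import re


def basic_filter(words, min_len=4, max_len=9):
    """Step 2: keep only lowercase a-z words of 4 to 9 characters."""
    return [w for w in words
            if re.fullmatch(r"[a-z]+", w) and min_len <= len(w) <= max_len]


def drop_suffix_words(words):
    """Step 5a: drop every word that is a proper suffix of another word."""
    suffixes = {w[i:] for w in words for i in range(1, len(w))}
    return [w for w in words if w not in suffixes]


def drop_last_char_variants(ranked):
    """Step 5b: `ranked` is ordered most frequent first. Among words that
    share everything but the last character, keep only the first one."""
    seen_stems = set()
    kept = []
    for w in ranked:
        stem = w[:-1]
        if stem not in seen_stems:
            seen_stems.add(stem)
            kept.append(w)
    return kept


def build_wordlist(dictionary_words, freq, pool_size=30000, final_size=7776):
    """Steps 2 to 6. `freq` maps a word to its (assumed) Wikipedia count."""
    candidates = basic_filter(set(dictionary_words))
    # Step 3: most frequent first; ties broken alphabetically (assumption).
    ranked = sorted(candidates, key=lambda w: (-freq.get(w, 0), w))
    pool = ranked[:pool_size]                 # step 4
    pool = drop_suffix_words(pool)            # step 5a, order is preserved
    pool = drop_last_char_variants(pool)      # step 5b
    return pool[:final_size]                  # step 6
```

With a toy dictionary such as `["casa", "caso", "amado", "mado", "sol"]`, "sol" is too short, "mado" is a suffix of "amado", and the less frequent of "casa"/"caso" is dropped.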
      9c101bee
  16. Apr 26, 2019
    • Update year. · 6a1a7623
      ulif authored
      Next release won't be in 2018.
      6a1a7623
    • Add pt-br wordlist (#60) · 7743ed55
      drebs authored and ulif committed
      The wordlist was generated from 2 different sources of words:
      
        - The file /usr/share/dict/brazilian from Debian's wbrazilian package.
        - A dump of the pages of the Portuguese Wikipedia.
      
      The final pt-br wordlist was generated as follows:
      
        1. Download a dump of Portuguese Wikipedia pages, process all pages
           and determine the frequency of each word.
        2. Start from /usr/share/dict/brazilian and filter out:
             - words not matching /^[a-z]+$/,
             - words shorter than 4 characters, and
             - words longer than 8 characters.
        3. Remove all words that are a suffix of any other word in the list.
        4. Sort the remaining words by their pt Wikipedia frequency.
        5. Take the 7776 most frequent words.
      
      No further curation was done.
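As a rough illustration of step 1 and of the ordering used in this first version (suffix removal before ranking, 8-character limit), a minimal sketch might look like this. The tokenizer, the function names, and the assumption that the dump has already been reduced to plain text lines are hypothetical. The commit does not say how the Wikipedia pages were actually processed.

```python
import re
from collections import Counter

# Runs of letters, including accented ones (assumed tokenization rule).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def count_words(lines):
    """Step 1: count how often each lowercased word occurs in the text."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN_RE.findall(line.lower()))
    return counts


def build_wordlist_v1(dictionary_words, counts, size=7776):
    """Steps 2 to 5 of this first version of the list."""
    # Step 2: only a-z words of 4 to 8 characters.
    words = {w for w in dictionary_words if re.fullmatch(r"[a-z]{4,8}", w)}
    # Step 3: drop words that are a proper suffix of another word.
    suffixes = {w[i:] for w in words for i in range(1, len(w))}
    remaining = [w for w in words if w not in suffixes]
    # Step 4: most frequent first; ties broken alphabetically (assumption).
    remaining.sort(key=lambda w: (-counts.get(w, 0), w))
    # Step 5: keep the most frequent `size` words.
    return remaining[:size]
```

Because suffix removal happens before ranking here, a frequent word can be dropped merely because a rare dictionary word happens to end with it, which is the ordering the later commit 9c101bee changes.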
      
      There are obvious drawbacks to this approach (e.g. many very frequent
      words are left out because they are too short, too long, or contain
      accents or a cedilla), but it was the best trade-off I could think
      of.
      7743ed55
  17. Dec 19, 2018
  18. Dec 12, 2018
  19. Dec 11, 2018
  20. Apr 14, 2018
  21. Apr 07, 2018