python_cchardet

    micah authored
    and for Cython modifying sources on build (src/cchardet/_cchardet.cpp)
    2f4691c7

    cChardet

    cChardet is a high-speed universal character encoding detector, implemented as a binding to charsetdetect.

    Supported codecs

    • Big5
    • EUC-JP
    • EUC-KR
    • GB18030
    • HZ-GB-2312
    • IBM855
    • IBM866
    • ISO-2022-CN
    • ISO-2022-JP
    • ISO-2022-KR
    • ISO-8859-2
    • ISO-8859-5
    • ISO-8859-7
    • ISO-8859-8
    • KOI8-R
    • Shift_JIS
    • TIS-620
    • UTF-8
    • UTF-16BE
    • UTF-16LE
    • UTF-32BE
    • UTF-32LE
    • WINDOWS-1250
    • WINDOWS-1251
    • WINDOWS-1252
    • WINDOWS-1253
    • WINDOWS-1255
    • EUC-TW
    • X-ISO-10646-UCS-4-2143
    • X-ISO-10646-UCS-4-3412
    • x-mac-cyrillic

    Requirements

    For example, on Ubuntu 12.04:

    $ sudo apt-get install build-essential python-dev cython

    Installation

    $ cd /tmp
    $ git clone git://github.com/PyYoshi/cChardet.git
    $ cd cChardet
    $ python setup.py build
    $ sudo python setup.py install

    or

    $ sudo easy_install cchardet

    Example

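    # -*- coding: utf-8 -*-
    import cchardet as chardet

    # Read the sample as raw bytes; detect() works on undecoded data.
    with open(r"test/testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt", "rb") as f:
        msg = f.read()

    # detect() returns a dict describing the guessed encoding.
    result = chardet.detect(msg)
    print(result)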
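
The detection result is usually used to decode the bytes afterwards. The following is a minimal sketch of that step, not part of cChardet itself. It assumes Python 3 and that the result is a chardet-style dict with an "encoding" key that may be None; the helper name decode_detected and its fallback parameter are hypothetical.

```python
def decode_detected(data, detect, fallback="utf-8"):
    """Decode bytes using the encoding reported by a chardet-style detect().

    `detect` is any callable that takes bytes and returns a dict with an
    "encoding" key (for example cchardet.detect). This helper is a
    hypothetical illustration, not an API provided by cChardet.
    """
    result = detect(data)
    # The detector may report no encoding at all, so fall back in that case.
    encoding = result.get("encoding") or fallback
    try:
        return data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # LookupError: Python has no codec under the reported name.
        # UnicodeDecodeError: the guess was wrong for this input.
        return data.decode(fallback, errors="replace")


# Usage with cChardet (requires the package to be installed):
#
#     import cchardet
#     with open("some_file.txt", "rb") as f:
#         text = decode_detected(f.read(), cchardet.detect)
```

Passing the detector in as an argument keeps the helper independent of which library is installed, so the same code works with chardet or cchardet.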

    Test

    $ sudo pip install -U chardet nose    # or: sudo easy_install -U chardet nose
    $ cd test
    $ nosetests --nocapture tests.py

    Benchmark

    code: tests.TestCchardetSpeed

    sample: test/testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt

    Performance:

    CPU: Intel Core i7 860 2.8GHz

    RAM: DDR3-1333 16GB

    Platform: Kubuntu 12.04 amd64, Python 2.7.3 64-bit

    Result:

    chardet:      0.32 calls/s
    cchardet:   975.32 calls/s
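
A calls-per-second figure like the ones above can be measured along the following lines. This is only a sketch under stated assumptions: it is not the code in tests.TestCchardetSpeed, it targets Python 3 (time.perf_counter), and the function name calls_per_second and its min_seconds parameter are hypothetical.

```python
import time


def calls_per_second(detect, data, min_seconds=1.0):
    """Call detect(data) repeatedly and return the measured calls per second.

    Hypothetical helper for illustration; `detect` is any callable that
    accepts bytes, such as chardet.detect or cchardet.detect.
    """
    calls = 0
    start = time.perf_counter()
    while True:
        detect(data)
        calls += 1
        elapsed = time.perf_counter() - start
        # Stop once the time budget is used up; also require a non-zero
        # elapsed time so the division below is always defined.
        if elapsed >= min_seconds and elapsed > 0:
            break
    return calls / elapsed


# Usage (requires chardet and cchardet to be installed):
#
#     import chardet, cchardet
#     with open("test/testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt", "rb") as f:
#         msg = f.read()
#     print("chardet:  %.2f calls/s" % calls_per_second(chardet.detect, msg))
#     print("cchardet: %.2f calls/s" % calls_per_second(cchardet.detect, msg))
```

Running for a minimum wall-clock time rather than a fixed call count keeps the measurement usable for both a slow and a fast detector.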

    License

    • The MIT License: src/cchardet
    • Other libraries: see the license files in the src/ext directory.

    Thanks

    Contact

    My blog

    Issues

    Sorry for my poor English :)