mat2 issues
https://0xacab.org/jvoisin/mat2/-/issues
2023-09-07T14:51:11Z

Non-utf8 PDF can't be lightweight-cleaned
https://0xacab.org/jvoisin/mat2/-/issues/194
2023-09-07T14:51:11Z, jvoisin

A user reported privately that the following file can't be cleaned: [fail_b_올해_상반기_한국_입국_탈북민_99명-향후_입국_추이_지켜봐야____RFA_자유아시아방송_Safari.pdf](/uploads/b35f7f486813a9b232d71323810b4b5b/fail_b_올해_상반기_한국_입국_탈북민_99명-향후_입국_추이_지켜봐야____RFA_자유아시아방송_Safari.pdf)
```console
$ mat2 ./mat2 -L ./fail\ b\ 올해\ 상반기\ 한국\ 입국\ 탈북민\ 99명-향후\ 입국\ 추이\ 지켜봐야”\ —\ RFA\ 자유아시아방송\ Safari.pdf
[-] ./fail b 올해 상반기 한국 입국 탈북민 99명-향후 입국 추이 지켜봐야” — RFA 자유아시아방송 Safari.pdf can't be cleaned: input string not valid UTF-8
[255]
$
```
It seems that cairo/poppler do not like CJK stuff, sigh.

WebP support?
https://0xacab.org/jvoisin/mat2/-/issues/165
2023-03-07T11:36:44Z, Rachel Veer

Came across a webp image I wanted *clean*, but mat2 says webp isn't supported. A simple workaround is saving in another format (à la .jpg/png) and cleaning from there, though I still wanted to inquire: will there be any webp support in the future?

ICO format support
https://0xacab.org/jvoisin/mat2/-/issues/164
2022-03-16T19:35:13Z, Romain

A user on Metadata Cleaner's matrix channel [asked about metadata in ICO files](https://matrix.to/#/!XDGcWYIURqLwjtlmIW:gnome.org/$bR5Rtm3k7KfkwcOHrDyGDv1gsujgX1aABPJdrir0hfA?via=matrix.org&via=gnome.org&via=t2bot.io) (the format used by Windows applications for their icons and websites for their favicons).
According to Wikipedia, the format can [store PNG images](https://en.wikipedia.org/wiki/ICO_(file_format)#PNG_format), so in theory metadata could be passed that way.
After some testing with [icoutils](https://www.nongnu.org/icoutils/), when creating an ICO from a PNG containing metadata with the raw option (`icotool -c -r source.png -o icon.ico`) and extracting it again (`icotool -x icon.ico -o extracted.png`), the metadata is still present in the extracted PNG. However, the generated ICO appears to be broken. Without the raw option, metadata is stripped, but the ICO works.
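
To check which images inside an `.ico` or `.cur` file are stored as raw PNG (and so could carry PNG metadata), a minimal sketch that reads the file's directory might look like the following. It is not mat2 code and does not use `icotool`; the function name `list_ico_entries` is hypothetical, and it assumes a well-formed file with the usual 6-byte header followed by 16-byte directory entries.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def list_ico_entries(data: bytes):
    """Return one dict per image stored in an ICO/CUR file (hypothetical helper)."""
    # Header: reserved (must be 0), type (1 = icon, 2 = cursor), image count.
    reserved, kind, count = struct.unpack_from("<HHH", data, 0)
    if reserved != 0 or kind not in (1, 2):
        raise ValueError("not an ICO/CUR file")
    entries = []
    for i in range(count):
        # Directory entry: width, height, colour count, reserved (1 byte each),
        # two 16-bit fields (planes/bpp for icons, hotspot for cursors),
        # then the payload size and its offset in the file.
        width, height, _colors, _res, _f1, _f2, size, offset = struct.unpack_from(
            "<BBBBHHII", data, 6 + 16 * i
        )
        payload = data[offset:offset + size]
        entries.append({
            "width": width or 256,   # a stored 0 means 256 pixels
            "height": height or 256,
            "is_png": payload.startswith(PNG_SIGNATURE),
            "size": size,
        })
    return entries
```

Entries with `is_png` set are the ones where PNG chunks such as text metadata could survive; the others hold bitmap data.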
It may be possible for mat2 to use `icotool` to extract images from `.ico` and `.cur` files and create a new file from them, but that would require some testing and studying the command line options, as the format can include several images at different sizes plus some metadata, which differs depending on whether the file is an icon or a cursor.

Evaluate the relevance of mat2 wrt. the USA Library of Congress most used formats
https://0xacab.org/jvoisin/mat2/-/issues/157
2023-05-03T20:42:27Z, jvoisin

There is a [really nice paper](https://osf.io/cxh9s/) ([local mirror](/uploads/726ca748875f2aaa54a01068c823cc09/39_Mark_Cooper_LP.pdf)) about the most used file formats at the USA's Library of Congress. We should take a look at it, and implement formats used there but not supported by mat2.
It boils down to:
- [ ] jp2
- [x] tif
- [x] jpg
- [ ] xml - we can't really support it
- [x] pdf
- [x] txt
- [x] gif
- [x] gz
- [ ] i41
- [ ] mxf
- [ ] mpg
- [ ] wav
- [ ] mov
- [ ] iso
  - https://github.com/clalancette/pycdlib
- [ ] dv
- [x] gz
- [x] zip
- [ ] rar - Python's library to handle this format, [rarfile](https://rarfile.readthedocs.io/api.html), doesn't provide enough control to remove all the metadata.
- [x] tar

Randomize xml:id in LibreOffice documents
https://0xacab.org/jvoisin/mat2/-/issues/118
2020-06-17T21:08:42Z, georg, milestone: 1.0 - Pony

Reading #71, I learnt that LibreOffice has a similar problem, which we should probably take care of.
http://officeopenxml.com/WPnumbering.php

Add support for RTF files
https://0xacab.org/jvoisin/mat2/-/issues/91
2019-02-21T00:19:36Z, jvoisin, milestone: 1.0 - Pony

Apparently, [it's possible](https://en.wikipedia.org/wiki/Rich_Text_Format) to embed metadata in rtf files.
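
For illustration, RTF keeps document properties such as title and author in an `{\info ...}` group, so a first step could be dropping that group by brace matching. The sketch below is only that: `strip_rtf_info` is a hypothetical helper, not part of mat2, it assumes the file has already been decoded to text and contains no `\bin` data, and it does not claim to cover every place where an RTF file could hold identifying data.

```python
def strip_rtf_info(rtf: str) -> str:
    """Remove the first {\\info ...} group from RTF text (hypothetical helper)."""
    start = rtf.find("{\\info")
    if start == -1:
        return rtf  # no document-information group present
    depth = 0
    i = start
    while i < len(rtf):
        ch = rtf[i]
        if ch == "\\":
            # Skip the backslash and the next character, so that escaped
            # braces (\{ and \}) are not counted as group delimiters.
            i += 2
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Found the brace closing the \info group: cut it out.
                return rtf[:start] + rtf[i + 1:]
        i += 1
    raise ValueError("unbalanced braces in \\info group")
```

Applied to `{\rtf1\ansi{\info{\title Secret}}\pard Hello\par}`, this leaves the document body in place and removes only the information group.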