mat2 issues
https://0xacab.org/jvoisin/mat2/-/issues
Updated: 2019-07-13T13:05:19Z

Improve zip compression
https://0xacab.org/jvoisin/mat2/-/issues/109
Updated: 2019-07-13T13:05:19Z
Author: jvoisin

As mentioned in #107, mat2 currently uses the default `ZIP_STORED` compression method for all zip files.
Maybe we should instead use the same method as the file being cleaned. This would make fingerprinting a bit easier, but could also dramatically decrease the size of the produced archives. I think that it's worth it.
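A minimal sketch of the idea, assuming only Python's standard `zipfile` module. The helper name `copy_with_same_compression` is hypothetical and not part of mat2; it only illustrates reusing each member's compression method when rewriting an archive.

```python
import zipfile


def copy_with_same_compression(src_path: str, dst_path: str) -> None:
    """Rewrite src_path into dst_path, keeping each member's compression method."""
    with zipfile.ZipFile(src_path) as zin, zipfile.ZipFile(dst_path, "w") as zout:
        for item in zin.infolist():
            # Build a fresh ZipInfo so that only the name and the compression
            # method are carried over from the original member.
            clean = zipfile.ZipInfo(filename=item.filename)
            clean.compress_type = item.compress_type
            # When a ZipInfo is passed, writestr() uses its compress_type.
            zout.writestr(clean, zin.read(item))
```

A member stored with `ZIP_DEFLATED` in the input stays deflated in the output, while a `ZIP_STORED` member stays uncompressed.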
This is a good first issue, since it involves designing a proper integration of this feature into the already quite complex archive handling code :)
Milestone: 1.0 - Pony

PATH management is quite messy
https://0xacab.org/jvoisin/mat2/-/issues/101
Updated: 2020-03-04T16:26:54Z
Author: fuzzy

On several occasions, we call external programs.
The paths are hardcoded, whereas they should rely on whatever is set in the `$PATH` environment variable.
This leads to crashes on *BSD with ports and on macOS with brew or MacPorts, as the binaries are usually found in `/usr/local`.
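One way to rely on `$PATH` is the standard library's `shutil.which`, wrapped in a single shared helper. This is only a sketch: `get_executable_path` is a hypothetical name, not mat2's actual function.

```python
import shutil


def get_executable_path(name: str) -> str:
    """Return the absolute path of `name` as resolved through $PATH."""
    # shutil.which() walks $PATH and only returns files that are executable.
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"Unable to find {name!r} in $PATH")
    return path
```

Each parser could then call something like `get_executable_path("exiftool")` instead of carrying its own list of candidate locations.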
Also, the `get_path` pattern (checking possible paths, then checking whether we can execute the binary) is repeated in several files. This needs to be factored out.

Rename black/white list to block/allow list
https://0xacab.org/jvoisin/mat2/-/issues/96
Updated: 2019-03-06T20:07:33Z
Author: jvoisin
Milestone: 1.0 - Pony

Refactor how we're dealing with filenames starting with a dash
https://0xacab.org/jvoisin/mat2/-/issues/82
Updated: 2018-10-23T14:36:05Z
Author: jvoisin

Currently, some parsers copy files whose names start with a dash to work around the possible option injection issue. MAT2 should try to (temporarily) rename those files first, and only fall back to copying them to a temporary location upon failure. This will vastly improve performance for video format processing and on network-backed storage. It might also clean up the code a bit.
Milestone: 0.5.0 - Slug

Display metadata from files embedded in office documents
https://0xacab.org/jvoisin/mat2/-/issues/73
Updated: 2018-11-10T12:38:25Z
Author: jvoisin

Currently, mat2 doesn't display the metadata of files embedded in office documents; it only shows the metadata of the archive, handled in a flat dict.
I think that we might use a nested dict structure to handle this:
```json
{
  "my_file.docx": {
    "author": "jvoisin",
    "my_picture.png": {
      "producer": "the GIMP"
    },
    "creation_date": "yesterday"
  }
}
```
Or a flat dict, with prefixes:
```json
{
  "author": "jvoisin",
  "(my_picture) producer": "the GIMP",
  "creation_date": "yesterday"
}
```
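The two layouts are easy to convert between, so the nested dict could serve as the internal representation while the flat one is derived for display. A rough sketch follows. The `flatten` helper is hypothetical, and it uses the full embedded file name as the prefix, which differs slightly from the `(my_picture)` prefix shown above.

```python
def flatten(meta: dict, prefix: str = "") -> dict:
    """Turn a nested metadata dict into a flat one with '(file) ' prefixes."""
    flat = {}
    for key, value in meta.items():
        if isinstance(value, dict):
            # A dict value is treated as the metadata of an embedded file.
            flat.update(flatten(value, prefix=f"{prefix}({key}) "))
        else:
            flat[prefix + key] = value
    return flat


nested = {
    "author": "jvoisin",
    "my_picture.png": {"producer": "the GIMP"},
    "creation_date": "yesterday",
}
print(flatten(nested))
```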
But I'm open to other suggestions :)
Milestone: 0.6.0 - Sloth
Assignee: jvoisin

Simplify the cli's parallel processing implementation
https://0xacab.org/jvoisin/mat2/-/issues/66
Updated: 2019-07-19T22:19:27Z
Author: jvoisin

Currently, mat2 [is using](https://0xacab.org/jvoisin/mat2/blob/master/mat2) the [multiprocessing](https://docs.python.org/3/library/multiprocessing.html) module to create a `Pool` and apply an [`imap_unordered`](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.imap_unordered) on it, with some [itertools](https://docs.python.org/3/library/itertools.html) magic. I'm convinced that we should be able to write a more readable implementation, likely by using Python 3's fancy [asyncio](https://docs.python.org/3/library/asyncio.html) module instead.
Milestone: 1.0 - Pony

Use enum for the unknown file policy
https://0xacab.org/jvoisin/mat2/-/issues/60
Updated: 2018-09-06T09:13:29Z
Author: jvoisin

Modern Python versions support [enum](https://docs.python.org/3/library/enum.html); we should use it instead of hardcoding strings.
Milestone: 0.4.0 - Dolphin

with defusedxml installed, mat2 on a simple .docx fails
https://0xacab.org/jvoisin/mat2/-/issues/58
Updated: 2018-09-05T16:42:13Z
Author: dkg

now that i have `python3-defusedxml` installed, i get a failure with mat2 on a simple
[hello-world.docx](/uploads/f259765884a3553fdf9d0882916ca492/hello-world.docx)
```
0 dkg@alice:~/src/mat$ mat2 hello-world.docx
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/home/dkg/src/mat/mat2/mat2", line 85, in clean_meta
return p.remove_all()
File "/home/dkg/src/mat/mat2/libmat2/office.py", line 100, in remove_all
if self._specific_cleanup(full_path) is False:
File "/home/dkg/src/mat/mat2/libmat2/office.py", line 210, in _specific_cleanup
return self.__remove_revisions(full_path)
File "/home/dkg/src/mat/mat2/libmat2/office.py", line 173, in __remove_revisions
tree, namespace = _parse_xml(full_path)
File "/home/dkg/src/mat/mat2/libmat2/office.py", line 28, in _parse_xml
ET.register_namespace(key, value)
AttributeError: module 'defusedxml.ElementTree' has no attribute 'register_namespace'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/dkg/src/mat/mat2/mat2", line 154, in <module>
sys.exit(main())
File "/home/dkg/src/mat/mat2/mat2", line 150, in main
ret = list(p.imap_unordered(clean_meta, list(l)))
File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
raise value
AttributeError: module 'defusedxml.ElementTree' has no attribute 'register_namespace'
1 dkg@alice:~/src/mat$
```
Assignee: jvoisin

Using a proper logging system
https://0xacab.org/jvoisin/mat2/-/issues/28
Updated: 2018-07-10T19:31:10Z
Author: jvoisin

Currently, MAT2 is using `print()` everywhere; this isn't cool™.
Milestone: 2.0 - Eagle
Assignee: jvoisin

`pkgutil.walk_packages` not working when loading libmat2 from `PYTHONPATH`
https://0xacab.org/jvoisin/mat2/-/issues/27
Updated: 2018-06-21T21:36:01Z
Author: Jonas

Hello,
now that `setup.py` supports installing mat2 into the operating system, the way libmat2 parser modules are included needs to be changed. The relevant part of `libmat2/parser_factory.py`:
```python
# This loads every parser in a dynamic way
for module_loader, name, ispkg in pkgutil.walk_packages('.libmat2'):
    if not name.startswith('libmat2.'):
        continue
    elif name == 'libmat2.abstract':
        continue
    importlib.import_module(name)
```
This doesn't work when libmat2 is loaded from `PYTHONPATH`, as [`pkgutil.walk_packages`](https://docs.python.org/3/library/pkgutil.html#pkgutil.walk_packages) doesn't search in `PYTHONPATH` if you give it a path (`.libmat2`).
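A possible way around this, sketched under the assumption that the package itself is importable, is to import the package first and hand its own `__path__` to `pkgutil.walk_packages`, so the lookup follows wherever the package was actually found. The `import_submodules` helper and its `skip` parameter are hypothetical names, and the standard `json` package stands in for `libmat2` so that the example runs anywhere.

```python
import importlib
import pkgutil


def import_submodules(package_name: str, skip=()) -> list:
    """Import every submodule of an importable package and return their names."""
    package = importlib.import_module(package_name)
    imported = []
    # package.__path__ reflects the real location of the package, whether it
    # was installed system-wide or picked up through PYTHONPATH.
    for _finder, name, _ispkg in pkgutil.walk_packages(
        package.__path__, package.__name__ + "."
    ):
        if name in skip:
            continue
        importlib.import_module(name)
        imported.append(name)
    return imported


# Stand-in usage: with libmat2, `skip` would hold 'libmat2.abstract'.
print(import_submodules("json", skip=("json.tool",)))
```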
The best solution I could find is documented here: https://stackoverflow.com/questions/1707709/list-all-the-modules-that-are-part-of-a-python-package/1707786#1707786
Milestone: 0.1.2 - Duck
Assignee: jvoisin

Factorise a bit the codebase
https://0xacab.org/jvoisin/mat2/-/issues/10
Updated: 2018-03-31T23:04:31Z
Author: jvoisin

There is some duplicate code in the `docx` and `odt` processing. It would be nice to factorise this.
Milestone: 0.1 - Turtle
Assignee: jvoisin