How to Ignore Gzip Trailing Garbage Data in Python
Why does the Python gzip module raise an "OSError: Not a gzipped file" exception when other tools and libraries seem to work without error? That premise turned out not to be entirely accurate, but the question came from colleagues who were using DotNetZip, 7-Zip, and WinRAR to decompress a file without even a hint of an error.
Unfortunately, Python's gzip module wasn't as forgiving:
>>> import gzip
>>> gzf = gzip.GzipFile('trailing-garbage.gz')
>>> gzf.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python34\lib\gzip.py", line 360, in read
while self._read(readsize):
File "C:\Python34\lib\gzip.py", line 433, in _read
if not self._read_gzip_header():
File "C:\Python34\lib\gzip.py", line 297, in _read_gzip_header
raise OSError('Not a gzipped file')
OSError: Not a gzipped file
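To reproduce the error without the original file, here is a minimal sketch that builds a similar stream in memory. The payload and the garbage bytes are made up for illustration; they are not the contents of trailing-garbage.gz.

```python
import gzip
import io

# Hypothetical stand-in for trailing-garbage.gz:
# one valid gzip member followed by bytes that are not gzip data.
blob = gzip.compress(b"payload") + b"not gzip data"

try:
    gzip.GzipFile(fileobj=io.BytesIO(blob)).read()
except OSError as exc:
    # The gzip module tries to parse the trailing bytes as another
    # member header and gives up when the magic number does not match.
    error = exc
    print("gzip module raised:", exc)
else:
    error = None
```

The exact exception message and subclass may differ between Python versions, so the sketch only catches OSError.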
Compare that to gzip itself:
[nick@slug ~]$ gunzip -c trailing-garbage.gz > /dev/null
gzip: trailing-garbage.gz: decompression OK, trailing garbage ignored
[nick@slug ~]$ gunzip -c -q trailing-garbage.gz | wc -c
36
Gzip decompressed 36 bytes and warned us on stderr that trailing bytes were being ignored.
Going back to Python's gzip module, the internal buffer confirmed that all 36 bytes had also been decompressed before the exception was raised:
>>> len(gzf.extrabuf)
36
Now, compare the above with DotNetZip's command line utility:
GZip.exe: compress a file using GZip, or decompress a GZip-compressed file.
The original file is deleted after processing.
This tool depends on Ionic's DotNetZip library. This is version 1.9.1.8
of the utility. See http://dotnetzip.codeplex.com for info.
PS> .\GZip.exe .\trailing-garbage.gz -v -keep
Original : 79 bytes
Decompressed: 25 bytes
Comp Ratio : -216.0%
It only extracted the first member (25 of the 36 bytes) and silently discarded everything else. I'm not sure that should count as working.
And finally, the obligatory hex dump:
Using a hex editor, we can see markers for two members with gzip headers before the trailing "garbage" data begins. Multiple-member support is implemented in the canonical version of gzip and is described in the gzip file format specification (RFC 1952).
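The layout seen in the hex editor can be approximated in a few lines. This is only a sketch: the member payloads and garbage bytes are invented, and the check looks for the gzip magic bytes 1f 8b followed by the deflate method byte 08 at the start of each member.

```python
import gzip

# Invented payloads; the real file's contents are not shown here.
member1 = gzip.compress(b"first member\n")
member2 = gzip.compress(b"second member\n")
garbage = b"\x00\x01 not gzip data"

blob = member1 + member2 + garbage

# Magic number (1f 8b) plus compression method 08 (deflate).
MAGIC = b"\x1f\x8b\x08"

second_start = len(member1)
garbage_start = len(member1) + len(member2)

print(blob[:3] == MAGIC)                                  # first member header
print(blob[second_start:second_start + 3] == MAGIC)       # second member header
print(blob[garbage_start:].startswith(MAGIC))             # trailing bytes: no header
```

Concatenated members are what make the file "multi-member"; anything after the last member that does not begin with the magic bytes is the garbage.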
What about 7-Zip and WinRAR?
With the exception of not alerting that trailing data was being ignored, both 7-Zip and WinRAR had seemingly identical behavior to gzip, including --list characteristics for multiple members.
So how do we get similar behavior from the Python gzip module?
Thanks to Martin Panter's suggestion, there is a pretty reasonable workaround that works on 2.7 (with a small monkey patch), 3.3, 3.4, and even the current beta2 release of 3.5.
>>> import altgzip
>>> with altgzip.AltGzipFile('trailing-garbage.gz') as gz:
... data = gz.read()
...
decompression OK, trailing garbage ignored
>>> len(data)
36
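For readers who just need the bytes and don't want to subclass GzipFile, the same effect can be approximated with zlib directly. This is a separate sketch, not the altgzip implementation: the function name gunzip_ignoring_garbage is hypothetical, and it assumes the whole file fits in memory.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def gunzip_ignoring_garbage(blob):
    """Decompress every gzip member in blob; return (data, trailing_bytes)."""
    if not blob.startswith(GZIP_MAGIC):
        raise OSError("Not a gzipped file")
    parts = []
    rest = blob
    # Keep going while the remaining bytes look like another member.
    while rest.startswith(GZIP_MAGIC):
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        try:
            chunk = d.decompress(rest)
        except zlib.error:
            if not parts:
                raise  # the very first member is corrupt
            break      # later bytes only resembled a header: treat as garbage
        if not d.eof:
            raise EOFError("compressed data ended before the end-of-stream marker")
        parts.append(chunk)
        rest = d.unused_data  # whatever follows this member
    return b"".join(parts), rest


# Made-up example data: two members followed by junk.
blob = gzip.compress(b"hello ") + gzip.compress(b"world") + b"trailing junk"
data, trailing = gunzip_ignoring_garbage(blob)
if trailing:
    print("decompression OK, trailing garbage ignored")
print(len(data))
```

Returning the trailing bytes instead of dropping them lets the caller decide whether to warn, as gzip does, or stay silent, as 7-Zip and WinRAR do.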