How to use the Python 3.6 tarfile module to read from memory?

I would like to download a tar file from a URL into memory and then extract all of its contents to the folder dst. How should I do this?

My attempts are below, but neither of them works.

#!/usr/bin/python3.6
# -*- coding: utf-8 -*-

from pathlib import Path
from io import BytesIO
from urllib.request import Request, urlopen
from urllib.error import URLError
from tarfile import TarFile


def get_url_response( url ):
    req = Request( url )
    try:
        response = urlopen( req )
    except URLError as e:
        if hasattr( e, 'reason' ):
            print( 'We failed to reach a server.' )
            print( 'Reason: ', e.reason )
        elif hasattr( e, 'code'):
            print( 'The server couldn\'t fulfill the request.' )
            print( 'Error code: ', e.code )
    else:
        # everything is fine
        return response

url = 'https://dl.opendesktop.org/api/files/download/id/1566630595/s/6cf6f74c4016e9b83f062dbb89092a0dfee862472300cebd0125c7a99463b78f4b912b3aaeb23adde33ea796ca9232decdde45bb65a8605bfd8abd05eaee37af/t/1567158438/c/6cf6f74c4016e9b83f062dbb89092a0dfee862472300cebd0125c7a99463b78f4b912b3aaeb23adde33ea796ca9232decdde45bb65a8605bfd8abd05eaee37af/lt/download/Blue-Maia.tar.xz'
dst = Path().cwd() / 'Tar'

response = get_url_response( url )

with TarFile( BytesIO( response.read() ) ) as tfile:
    tfile.extractall( path=dst )

However, I got this error:

Traceback (most recent call last):
  File "~/test_tar.py", line 31, in <module>
    with TarFile( BytesIO( response.read() ) ) as tfile:
  File "/usr/lib/python3.6/tarfile.py", line 1434, in __init__
    fileobj = bltn_open(name, self._mode)
TypeError: expected str, bytes or os.PathLike object, not _io.BytesIO

I tried passing the BytesIO object to TarFile as a fileobj:

with TarFile( fileobj=BytesIO( response.read() ) ) as tfile:
    tfile.extractall( path=dst )

However, it still doesn't work:

Traceback (most recent call last):
  File "/usr/lib/python3.6/tarfile.py", line 188, in nti
    s = nts(s, "ascii", "strict")
  File "/usr/lib/python3.6/tarfile.py", line 172, in nts
    return s.decode(encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd2 in position 0: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/tarfile.py", line 2297, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "/usr/lib/python3.6/tarfile.py", line 1093, in fromtarfile
    obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
  File "/usr/lib/python3.6/tarfile.py", line 1035, in frombuf
    chksum = nti(buf[148:156])
  File "/usr/lib/python3.6/tarfile.py", line 191, in nti
    raise InvalidHeaderError("invalid header")
tarfile.InvalidHeaderError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "~/test_tar.py", line 31, in <module>
    with TarFile( fileobj=BytesIO( response.read() ) ) as tfile:
  File "/usr/lib/python3.6/tarfile.py", line 1482, in __init__
    self.firstmember = self.next()
  File "/usr/lib/python3.6/tarfile.py", line 2309, in next
    raise ReadError(str(e))
tarfile.ReadError: invalid header
Pippo answered 30/8, 2019 at 9:23 Comment(0)

This approach was very close to correct:

with TarFile( fileobj=BytesIO( response.read() ) ) as tfile:
    tfile.extractall( path=dst )

You should use tarfile.open instead of TarFile (see docs), and tell it that you are reading an xz file (mode='r:xz'). Note that this needs import tarfile rather than from tarfile import TarFile:

with tarfile.open( fileobj=BytesIO( response.read() ), mode='r:xz' ) as tfile:
    tfile.extractall( path=dst )

However, as you'll notice, this is still not enough.

The root problem? You're downloading from a site which disallows hotlinking. The website is blocking your attempt to download. Try printing out the response and you'll see you get a load of junk HTML instead of a tar.xz file.
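One way to confirm this before calling tarfile.open is to look at the first bytes of the payload: an xz stream starts with the six bytes FD 37 7A 58 5A 00, while an HTML error page does not. Below is a minimal sketch of that check. The names looks_like_xz and extract_xz_bytes are hypothetical helpers made up for illustration, not part of the tarfile module.

```python
from io import BytesIO
import tarfile

# Every .xz stream begins with these six magic bytes.
XZ_MAGIC = b"\xfd7zXZ\x00"


def looks_like_xz(data):
    """Return True if data starts with the xz magic bytes."""
    return data[:6] == XZ_MAGIC


def extract_xz_bytes(data, dst):
    """Extract an in-memory .tar.xz payload into dst, or raise ValueError."""
    if not looks_like_xz(data):
        # Show the start of the payload; an HTML page is easy to recognise.
        raise ValueError("not an xz archive, got: %r" % data[:60])
    with tarfile.open(fileobj=BytesIO(data), mode="r:xz") as tfile:
        tfile.extractall(path=dst)
```

With the question's code this would be called as extract_xz_bytes( response.read(), dst ). For a blocked download it fails early with a message showing the HTML, instead of a ReadError from deep inside tarfile.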

Kettle answered 30/8, 2019 at 9:58 Comment(1)
I used another .tar.xz type url that allowed downloading. Yes, using the tarfile.open() function worked. Thank you also for the reference, I overlooked it. Any chance/way to circumvent the hotlinking? - Pippo

Strangely, I managed to make it work using the tarfile.open() function, but not by instantiating a TarFile object. It seems the opening mode cannot be set correctly in the second case...
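A likely explanation, offered as a sketch rather than a definitive account: the TarFile constructor reads its fileobj as an uncompressed tar stream, whereas tarfile.open looks at the mode string and wraps the stream in a decompressor first. The self-contained example below builds a tiny .tar.xz in memory and compares the two calls. The helper make_tar_xz and the member name hello.txt are made up for illustration.

```python
from io import BytesIO
import tarfile


def make_tar_xz(name, payload):
    """Build a one-member .tar.xz archive in memory and return its bytes."""
    buf = BytesIO()
    with tarfile.open(fileobj=buf, mode="w:xz") as tar:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, BytesIO(payload))
    return buf.getvalue()


data = make_tar_xz("hello.txt", b"hello")

# The constructor treats the bytes as a plain tar stream, so the xz
# container is parsed as a tar header and rejected.
try:
    tarfile.TarFile(fileobj=BytesIO(data))
    constructor_ok = True
except tarfile.ReadError:
    constructor_ok = False

# tarfile.open decompresses first; "r:*" would also detect xz automatically.
with tarfile.open(fileobj=BytesIO(data), mode="r:xz") as tar:
    names = tar.getnames()

print(constructor_ok, names)
```

Run as a script, this prints False ['hello.txt']: the bare constructor fails on compressed data, while tarfile.open with mode "r:xz" lists the member.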

Anyway, this works:

from io import BytesIO
import tarfile

with open('Blue-Maia.tar.xz', 'rb') as f:
    tar = tarfile.open(fileobj=BytesIO( f.read() ), mode="r:xz")
    tar.extractall( path="test" )
    tar.close()

You could add a try...except...finally to ensure the tar file is always closed.
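Since TarFile objects support the context-manager protocol, a with statement is a shorter way to get the same guarantee. Here is a minimal sketch, where extract_tar_xz is a hypothetical helper name and fileobj is anything with a read() method, such as an open file or a urlopen response:

```python
from io import BytesIO
import tarfile


def extract_tar_xz(fileobj, dst):
    """Read a whole .tar.xz stream into memory, then extract it into dst."""
    data = BytesIO(fileobj.read())
    # The with statement closes the archive even if extractall() raises.
    with tarfile.open(fileobj=data, mode="r:xz") as tar:
        tar.extractall(path=dst)
        return tar.getnames()
```

With the local file from the snippet above, the call would be extract_tar_xz(f, "test") inside the with open(...) block.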

Update:

In your code:

response = get_url_response( url )
tar = tarfile.open(fileobj=BytesIO( response.read() ), mode="r:xz")
tar.extractall( path="test" )
tar.close()
Tutu answered 30/8, 2019 at 9:32 Comment(9)
Does your approach write to memory? I did not see BytesIO used. Can you please explain? - Pippo
Btw, your with open() statement returned FileNotFoundError: [Errno 2] No such file or directory: 'Blue-Maia.tar.xz' - Pippo
Oh sorry, I tried a few things and made a mistake while posting the solution... Fixed. - Tutu
And the with open is only there to stand in for the file object you get from your get_url_response function; the lines you need are the last 3. - Tutu
Can you show how your code links with response from my code? I can't see it in your script. The statement with open('Blue-Maia.tar.xz', 'rb') as f means that you are opening a file called "Blue-Maia.tar.xz" which already exists in your current working directory, and you are assigning this opened file to f. - Pippo
I did it because a file-like object (resulting from the with open) has the very same read() method as your response object; it was just to shorten the code and get to the essentials... However, I updated my answer. - Tutu
With your updated code, I got tarfile.ReadError: not an lzma file. - Pippo
Ah, strange. But the same error happens with the code from the accepted answer; why did you accept it if it is not working? - Tutu
Thanks for helping. Your updated answer finally looks similar to the answer by @Score_Under. I accepted that answer because it first explained my mistake and showed the correct syntax that I should have used, which answered my question. ;) - Pippo
