How do I extract data from a .tar.gz file (stored in the cloud) from a browser

Problem

I am making a single-page application that will store its data in one of the major cloud providers' blob storage (for example, Google Cloud Storage). The data in the cloud storage is a .tar.gz file, and I want to access this from a browser app.

Inside the tar file there will be hundreds of files, and I just want to get one of them and render it into HTML. I can already load the archive; the question is just how to get the data out of it.

Unsurprisingly, I am currently using TypeScript/JavaScript in the single-page application, but that could change if the answer were 'do it this way'.

I'm not worried about browser compatibility (I can specify things like 'only works in this browser'), but the browser doesn't have access to a file system and I can't 'shell out' to the operating system.

What I have tried

I've had a look for npm packages, and the closest I've come to is https://github.com/npm/node-tar (but that seems to need a file system). I am reasonably confident working with streams, but after reviewing the documentation it doesn't feel like zlib will do what I want out of the box. I didn't get a lot of hits from Google searching: most just gave the same advice I would ('shell out to the operating system and have tar do it'), but I can't follow that advice in the browser.

My alternative

If this doesn't work I will put a lambda/function in place to do the untarring. I like to avoid 'more moving parts' in a project when I can, but this might be needed.

Overwinter answered 25/12, 2020 at 8:45 Comment(0)

The result should be achievable with a combination of pako (a fast JavaScript port of zlib) and js-untar:

<script src="pako.min.js"></script>
<script src="untar.js"></script>
<script>
fetch('test.tar.gz').then(res => res.arrayBuffer()) // Download gzipped tar file and get ArrayBuffer
                    .then(pako.inflate)             // Decompress gzip using pako
                    .then(arr => arr.buffer)        // Get ArrayBuffer from the Uint8Array pako returns
                    .then(untar)                    // Untar
                    .then(files => {                // js-untar returns a list of files (See https://github.com/InvokIT/js-untar#file-object for details)
                        console.log(files);
                    });
</script>

test.tar.gz was made by running tar -czvf test.tar.gz test on a directory containing 3 text files, to check that both directories and files show up in the result.
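For anyone curious what js-untar is doing under the hood, the tar container itself is simple enough to parse by hand. A minimal sketch, assuming a plain ustar archive with no pax extensions (field offsets come from the standard tar header layout; js-untar handles many more cases than this):

```javascript
// Minimal tar parser sketch: each entry starts with a 512-byte header.
// The entry name is at offset 0 (100 bytes, NUL-padded) and the size at
// offset 124 (12 bytes, octal text). The data follows the header, padded
// out to a multiple of 512 bytes.
function readTarEntries(buffer) {
  const bytes = buffer instanceof Uint8Array ? buffer : new Uint8Array(buffer);
  const decoder = new TextDecoder();
  const entries = [];
  let offset = 0;
  while (offset + 512 <= bytes.length) {
    const name = decoder.decode(bytes.subarray(offset, offset + 100)).replace(/\0.*$/, "");
    if (!name) break; // an all-zero block marks the end of the archive
    const size = parseInt(decoder.decode(bytes.subarray(offset + 124, offset + 136)), 8);
    entries.push({ name, size, data: bytes.subarray(offset + 512, offset + 512 + size) });
    offset += 512 + Math.ceil(size / 512) * 512; // skip past the padded data
  }
  return entries;
}
```

Pulling one file out is then just a matter of finding its entry by name and decoding `data` (for example with TextDecoder) before rendering it into HTML.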

Cowbind answered 25/12, 2020 at 14:3 Comment(5)
This looks very nice, and thank you for the code. Unfortunately I need this to run in a browser, and (according to the last comment) this returns files. I am looking for a solution that works in places that don't have a file system. I suspect I am probably going to have to do this with a function, so I'll give this a try when I do that. – Overwinter
I spoke too soon: I should have looked at the code. js-untar returns a list of file objects, not files, and importantly it provides access to the content of each file via a buffer. It's exactly what I wanted, thank you. I'll give it a test over the next two days and report back how it went. – Overwinter
Thank you for the help. I am a little worried that js-untar has gone 3 years without a commit. It also only works in the browser. It would be nice to find a solution that works both in the browser and in Node (that's just for 'I like my code to be decoupled from the environment as much as possible' reasons). This gets me through my immediate task, though, and I can look for a nicer solution later. – Overwinter
@StaveEscura did you ever find a js-untar alternative? – Billups
Note that the tar format itself hasn't changed in decades: there is literally zero reason for the JS implementation to get updates, given that old JS is guaranteed to keep working forever. Also note that js-untar works in Node and the browser just fine. It yields a data object that is platform-agnostic and can be trivially worked with in both contexts. – Cephalo

Similar to @Lasse's answer, but with fewer dependencies and a performance improvement:

  1. You can replace pako with Browser's built-in decompression API.
  2. Piping fetch stream into decompression stream, so you are decompressing while fetching is still in progress.
  3. In addition, I recommend tarballjs, which in my opinion is a cleaner untar implementation and has recent repo activity. It is simple enough that you could pick up the maintenance if the author quits.

// CORS Anywhere is needed for downloading from GitHub. Visit https://cors-anywhere.herokuapp.com for details
const fetchSampleBlob = () => fetch("https://cors-anywhere.herokuapp.com/https://github.com/ankitrohatgi/tarballjs/tarball/master", {headers: {"X-Requested-With": "https://github.com"}})

const fetchStreamToDecompressionStream = (response) => response.body.pipeThrough(new DecompressionStream("gzip"));

const decompressionStreamToBlob = (decompressedStream) => new Response(decompressedStream).blob();

const blobToDir = (blob) => new tarball.TarReader().readFile(blob);


fetchSampleBlob()
  .then(fetchStreamToDecompressionStream)
  .then(decompressionStreamToBlob)
  .then(blobToDir)
  .then(console.log); // you should see a few files from the downloaded git repo tarball


/**
 * Output
 *
 * [
 *  {
 *    "name": "pax_global_header",
 *    "type": "g",
 *    "size": 52,
 *    "header_offset": 0
 *  },
 *  ...
 * ]
 */
<!-- This is served from the author's server. You should find a different host for performance and security reasons -->
<script src="https://arohatgi.info/tarballjs/tarball.js"></script>
Weapon answered 12/11, 2022 at 9:1 Comment(1)
This answer no longer works. We never make it to the console log. – Cephalo

© 2022 - 2024 — McMap. All rights reserved.