I have a web app where I need to be able to serve the user an archive of multiple files. I've set up a generic ArchiveExporter
, and made a ZipArchiveExporter
. Works beautifully! I can stream my data to my server, and archive the data and stream it to the user all without using much memory, and without needing a filesystem (I'm on Google App Engine).
Then I remembered about the whole zip64 thing with 4gb zip files. My archives can get potentially very large (high res images), so I'd like to have an option to avoid zip files for my larger input.
I checked out org.apache.commons.compress.archivers.tar.TarArchiveOutputStream
and thought I had found what I needed! Sadly when I checked the docs, and ran into some errors; I quickly found out you MUST pass the size of each entry as you stream. This is a problem because the data is being streamed to me with no way of knowing the size beforehand.
I tried counting and returning the written bytes from export()
, but TarArchiveOutputStream
expects a size in TarArchiveEntry
before writing to it, so that obviously doesn't work.
I can use a ByteArrayOutputStream
and read each entry entirely before writing its content so I know its size, but my entries can pontentially get very large; and this is not very polite to the other processes running on the instance.
I could use some form of persistence, upload the entry, and query the data size. However, that would be a waste of my google storage api calls, bandwidth, storage, and runtime.
I am aware of this SO question asking almost the same thing, but he settled for using zip files and there is no more relevant information.
What is the ideal solution to creating a tar archive with entries of unknown size?
public abstract class ArchiveExporter<T extends OutputStream> extends Exporter { //base class
public abstract void export(OutputStream out); //from Exporter interface
public abstract void archiveItems(T t) throws IOException;
}
public class ZipArchiveExporter extends ArchiveExporter<ZipOutputStream> { //zip class, works as intended
@Override
public void export(OutputStream out) throws IOException {
try(ZipOutputStream zos = new ZipOutputStream(out, Charsets.UTF_8)) {
zos.setLevel(0);
archiveItems(zos);
}
}
@Override
protected void archiveItems(ZipOutputStream zos) throws IOException {
zos.putNextEntry(new ZipEntry(exporter.getFileName()));
exporter.export(zos);
//chained call to export from other exporter like json exporter for instance
zos.closeEntry();
}
}
public class TarArchiveExporter extends ArchiveExporter<TarArchiveOutputStream> {
@Override
public void export(OutputStream out) throws IOException {
try(TarArchiveOutputStream taos = new TarArchiveOutputStream(out, "UTF-8")) {
archiveItems(taos);
}
}
@Override
protected void archiveItems(TarArchiveOutputStream taos) throws IOException {
TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName());
//entry.setSize(?);
taos.putArchiveEntry(entry);
exporter.export(taos);
taos.closeArchiveEntry();
}
}
EDIT this is what I was thinking with the ByteArrayOutputStream
. It works, but I cannot guarantee I will always have enough memory to store the whole entry at once, hence my streaming efforts. There has to be a more elegant way of streaming a tarball! Maybe this is a question more suited for Code Review?
protected void byteArrayOutputStreamApproach(TarArchiveOutputStream taos) throws IOException {
TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName());
try(ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
exporter.export(baos);
byte[] data = baos.toByteArray();
//holding ENTIRE entry in memory. What if it's huge? What if it has more than Integer.MAX_VALUE bytes? :[
int len = data.length;
entry.setSize(len);
taos.putArchiveEntry(entry);
taos.write(data);
taos.closeArchiveEntry();
}
}
EDIT This is what I meant by uploading the entry to a medium (Google Cloud Storage in this case) to accurately query the whole size. Seems like major overkill for what seems like a simple problem, but this doesn't suffer from the same ram problems as the solution above. Just at the cost of bandwidth and time. I hope someone smarter than me comes by and makes me feel stupid soon :D
protected void googleCloudStorageTempFileApproach(TarArchiveOutputStream taos) throws IOException {
TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName());
String name = NameHelper.getRandomName(); //get random name for temp storage
BlobInfo blobInfo = BlobInfo.newBuilder(StorageHelper.OUTPUT_BUCKET, name).build(); //prepare upload of temp file
WritableByteChannel wbc = ApiContainer.storage.writer(blobInfo); //get WriteChannel for temp file
try(OutputStream out = Channels.newOutputStream(wbc)) {
exporter.export(out); //stream items to remote temp file
} finally {
wbc.close();
}
Blob blob = ApiContainer.storage.get(blobInfo.getBlobId());
long size = blob.getSize(); //accurately query the size after upload
entry.setSize(size);
taos.putArchiveEntry(entry);
ReadableByteChannel rbc = blob.reader(); //get ReadChannel for temp file
try(InputStream in = Channels.newInputStream(rbc)) {
IOUtils.copy(in, taos); //stream back to local tar stream from remote temp file
} finally {
rbc.close();
}
blob.delete(); //delete remote temp file
taos.closeArchiveEntry();
}