I am trying to build an endpoint in Kotlin which accepts 10+ GB tar files, and processes the contents one by one.
The Tar file contains millions of JSON files, and I am running this application in a Docker container with very limited disk size, so extracting the entire archive to a temporary directory is not an option.
The following approach with Apache Commons Compress:
post(...) {
    val multipart = call.receiveMultipart()
    multipart.forEachPart { part ->
        if (part is PartData.FileItem) {
            part.streamProvider().use { inputStream ->
                BufferedInputStream(inputStream).use { bufferedInputStream ->
                    TarArchiveInputStream(bufferedInputStream).use { tarInput ->
leads to a java.io.IOException: Corrupted TAR archive. error. I believe this is due to providing the tar data as a stream instead of as one huge variable that contains all the bytes. I also cannot read the entire input stream into a single ByteArray and pass it to BufferedInputStream, because I don't have 20 GB of memory.
Any help is appreciated.
- inputStream -> java.io.InputStream
- bufferedInputStream -> java.io.BufferedInputStream
The example code doesn't contain any special types belonging to Kotlin or Ktor.
Update
Longer example by sending the file as the POST body:
call.receiveStream().buffered().use { bufferedInputStream ->
    TarArchiveInputStream(bufferedInputStream).use { tarInput ->
        var entry = tarInput.nextEntry
        while (entry != null) {
            if (!entry.isDirectory && entry.name.endsWith(".json")) {
                scope.launch {
                    val jsonString = tarInput.bufferedReader().readText()
                    val json: Map<String, JsonElement> =
                        Json.parseToJsonElement(jsonString).jsonObject
Update after Answer
// any stream that implements java.io.InputStream
val bodyStream = call.receiveStream()
val elems = sequence {
    bodyStream.buffered().use { bufferedInputStream ->
        TarArchiveInputStream(bufferedInputStream).use { tarInput ->
            while (true) {
                val entry = tarInput.nextEntry ?: break
                // Read the entry's content here, before nextEntry is called again,
                // then yield it so it can be processed later.
                val content = tarInput.readBytes()
                yield(entry.name to content)
            }
        }
    }
}
The problem was processing the tar asynchronously; it's explained in detail in the accepted answer.
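To make that concrete: TarArchiveInputStream is a single sequential stream, so calling nextEntry moves past whatever is left of the current entry. If the entry is read inside a coroutine that runs later, the stream may already have moved on. Below is a minimal, self-contained sketch of reading each entry synchronously and only handing the resulting text off for later processing. It assumes Apache Commons Compress is on the classpath; jsonEntries and sampleTar are hypothetical helper names introduced for illustration, and the in-memory tar stands in for the request body.

```kotlin
import org.apache.commons.compress.archivers.tar.TarArchiveEntry
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream
import java.io.ByteArrayInputStream
import java.io.ByteArrayOutputStream
import java.io.InputStream

// Hypothetical helper: lazily yields (entry name, entry text) for every .json entry.
// The stream is closed once the sequence has been fully consumed.
fun jsonEntries(input: InputStream): Sequence<Pair<String, String>> = sequence {
    TarArchiveInputStream(input.buffered()).use { tarInput ->
        while (true) {
            val entry = tarInput.nextEntry ?: break
            if (entry.isDirectory || !entry.name.endsWith(".json")) continue
            // Read the whole entry now, on the same thread, before nextEntry
            // is called again. Only this one entry is held in memory.
            val text = tarInput.readBytes().toString(Charsets.UTF_8)
            yield(entry.name to text)
        }
    }
}

// Hypothetical helper: builds a small tar in memory, purely for demonstration.
fun sampleTar(files: Map<String, String>): ByteArray {
    val out = ByteArrayOutputStream()
    TarArchiveOutputStream(out).use { tar ->
        for ((name, content) in files) {
            val bytes = content.toByteArray(Charsets.UTF_8)
            val entry = TarArchiveEntry(name)
            entry.size = bytes.size.toLong()
            tar.putArchiveEntry(entry)
            tar.write(bytes)
            tar.closeArchiveEntry()
        }
    }
    return out.toByteArray()
}

fun main() {
    val tar = sampleTar(
        linkedMapOf(
            "a.json" to "{\"id\":1}",
            "notes.txt" to "skip me",
            "b.json" to "{\"id\":2}"
        )
    )
    // Each pair is a plain value, so it could be passed to a coroutine from here.
    for ((name, text) in jsonEntries(ByteArrayInputStream(tar))) {
        println("$name -> $text")
    }
}
```

Because the yielded values no longer depend on the tar stream, parsing them with Json.parseToJsonElement inside scope.launch should be safe at that point; only the read from tarInput has to stay sequential.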
curl -v -F upload=@... http:/0.0.0.0:8080/..., but I get the same "Corrupted TAR archive" error with both small and large samples. – Disencumber
java.io.BufferedReader might brick the entire tar input? paste.com.tr/raw/rkswbpoa – Disencumber
BufferedInputStream is a very common and well-tested utility used by thousands of applications every day. It is fine. Please start by making sure you actually receive the raw data that you send. My guess would be either: 1. Network problems. 2. Multipart processing. I don't know, maybe bigger files are split into multiple parts? 3. Some inconsistency between how you receive the data and how you send it. In some cases curl automatically converts new lines in the sent files, maybe this is it? – Sensualism
--data-binary instead of -F. You would have to adapt the server side as well. – Sensualism