Edit/warning: there are potential gotchas with this solution, because it heavily uses MappedByteBuffer, and it's unclear how/when the corresponding resources are released. See this Q&A and JDK-4724038: "(fs) Add unmap method to MappedByteBuffer".
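If you need deterministic release of a mapping, one commonly cited workaround on JDK 9+ is the internal sun.misc.Unsafe.invokeCleaner method. This is an unsupported API, and the helper below is purely my sketch, not part of the solution that follows; use it at your own risk:

import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;

// Hypothetical helper (my naming): force-unmaps a MappedByteBuffer through
// the internal sun.misc.Unsafe.invokeCleaner (JDK 9+). Unsupported API;
// the buffer must never be touched again after this call.
final class UnmapHack {
    static void unmap(MappedByteBuffer buffer) {
        try {
            Class<?> unsafeClass = Class.forName("sun.misc.Unsafe");
            Field f = unsafeClass.getDeclaredField("theUnsafe");
            f.setAccessible(true); // may emit an illegal-access warning on newer JDKs
            Object unsafe = f.get(null);
            unsafeClass.getMethod("invokeCleaner", ByteBuffer.class)
                       .invoke(unsafe, buffer);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Explicit unmap not available", e);
        }
    }
}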
That being said, please also see the end of this post.
I would do exactly what Nim suggested:
wrap this in a class which maps in "blocks" and then moves the block along as you are writing. The algorithm for this is fairly straightforward: just pick a block size that makes sense for the data you are writing.
In fact, I did exactly that years ago and just dug up the code, it goes like this (stripped to the bare minimum for a demo, with a single method to write data):
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

public class SlidingFileWriterThingy {

    private static final long WINDOW_SIZE = 8 * 1024 * 1024L;

    private final RandomAccessFile file;
    private final FileChannel channel;
    private MappedByteBuffer buffer;
    private long ioOffset;  // logical write position within the file
    private long mapOffset; // file offset where the current window starts

    public SlidingFileWriterThingy(Path path) throws IOException {
        file = new RandomAccessFile(path.toFile(), "rw");
        channel = file.getChannel();
        remap(0);
    }

    public void close() throws IOException {
        file.close();
    }

    public void seek(long offset) {
        ioOffset = offset;
    }

    public void writeBytes(byte[] data) throws IOException {
        if (data.length > WINDOW_SIZE) {
            throw new IOException("Data chunk too big, length=" + data.length + ", max=" + WINDOW_SIZE);
        }
        // Slide the window if the chunk falls (even partially) outside of it
        boolean dataChunkWontFit = ioOffset < mapOffset || ioOffset + data.length > mapOffset + WINDOW_SIZE;
        if (dataChunkWontFit) {
            remap(ioOffset);
        }
        int offsetWithinBuffer = (int) (ioOffset - mapOffset);
        buffer.position(offsetWithinBuffer);
        buffer.put(data, 0, data.length);
        ioOffset += data.length; // advance, so consecutive writes are sequential
    }

    private void remap(long offset) throws IOException {
        // Map a fresh WINDOW_SIZE view starting at the new offset;
        // this grows the file to offset + WINDOW_SIZE if needed
        mapOffset = offset;
        buffer = channel.map(FileChannel.MapMode.READ_WRITE, mapOffset, WINDOW_SIZE);
    }
}
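If you adopt the unmap workaround sketched at the top, close() could also flush and explicitly release the current window. A hedged variant, assuming the hypothetical UnmapHack helper from above:

    public void close() throws IOException {
        if (buffer != null) {
            buffer.force();          // flush outstanding changes to disk
            UnmapHack.unmap(buffer); // explicit unmap via the unsupported API
            buffer = null;
        }
        file.close();
    }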
Here is a test snippet:
SlidingFileWriterThingy t = new SlidingFileWriterThingy(Paths.get("/tmp/hey.txt"));
t.writeBytes("Hello world\n".getBytes(StandardCharsets.UTF_8));
t.seek(1000);
t.writeBytes("Are we there yet?\n".getBytes(StandardCharsets.UTF_8));
t.seek(50_000_000);
t.writeBytes("No but seriously?\n".getBytes(StandardCharsets.UTF_8));
And what the output file looks like:
$ hexdump -C /tmp/hey.txt
00000000 48 65 6c 6c 6f 20 77 6f 72 6c 64 0a 00 00 00 00 |Hello world.....|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000003e0 00 00 00 00 00 00 00 00 41 72 65 20 77 65 20 74 |........Are we t|
000003f0 68 65 72 65 20 79 65 74 3f 0a 00 00 00 00 00 00 |here yet?.......|
00000400 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
02faf080 4e 6f 20 62 75 74 20 73 65 72 69 6f 75 73 6c 79 |No but seriously|
02faf090 3f 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |?...............|
02faf0a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
037af080
I hope I did not ruin everything by removing the unnecessary bits and renaming... At least the offset computation looks correct (0x3e0 + 8 = 1000, and 0x02faf080 = 50000000).
Number of blocks (left column) occupied by the file, and another non-sparse file of the same size:
$ head -c 58388608 /dev/zero > /tmp/not_sparse.txt
$ ls -ls /tmp/*.txt
8 -rw-r--r-- 1 nug nug 58388608 Jul 19 00:50 /tmp/hey.txt
57024 -rw-r--r-- 1 nug nug 58388608 Jul 19 00:58 /tmp/not_sparse.txt
The number of blocks (and the actual "sparseness") will depend on the OS and filesystem; the above was on Debian Buster with ext4. Sparse files are not supported on HFS+ on macOS, and on Windows they require the program to take specific steps I don't know enough about; that does not seem easy or even doable from Java, but I'm not sure.
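For what it's worth, NIO does at least expose a creation-time sparse hint via StandardOpenOption.SPARSE; it is documented as a hint, only meaningful together with CREATE_NEW, and ignored by filesystems that don't support it. A minimal sketch (file name and class name are my own, for illustration):

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class SparseHintDemo {
    public static void main(String[] args) throws Exception {
        // Ask the filesystem for a sparse file at creation time
        try (FileChannel ch = FileChannel.open(Paths.get("sparse_hint.bin"),
                StandardOpenOption.CREATE_NEW,
                StandardOpenOption.WRITE,
                StandardOpenOption.SPARSE)) {
            ch.position(50_000_000L); // jump far past EOF: the gap may stay unallocated
            ch.write(ByteBuffer.wrap("end\n".getBytes(StandardCharsets.UTF_8)));
        }
    }
}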
I don't have fresh numbers, but at the time this "sliding-MappedByteBuffer technique" was very fast, and as you can see above, it does leave holes in the file.
You'll need to adapt WINDOW_SIZE to something that makes sense for you, and add all the writeThingy methods you need, perhaps by wrapping writeBytes (see the sketch below); whatever suits you. Also, in this state it will grow the file as needed, but by chunks of WINDOW_SIZE, which you might also need to adapt.
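For example, such a convenience method could encode its argument and delegate to writeBytes. This writeLong is my illustration, not part of the original class; it would need java.nio.ByteBuffer imported:

    // Hypothetical wrapper: encodes a long in big-endian order
    // (ByteBuffer's default) and delegates to writeBytes()
    public void writeLong(long value) throws IOException {
        writeBytes(ByteBuffer.allocate(Long.BYTES).putLong(value).array());
    }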
Unless there is a very good reason not to, it's probably best to keep it simple with this single mechanism, rather than maintaining a complex dual-mode system.
About the fragility and memory consumption: I've run the stress test below on Linux without any issue for an hour, on a machine with 800GB of RAM, and on another very modest VM with 1GB of RAM. The system looks perfectly healthy, and the java process does not use any significant amount of heap memory.
String path = "/tmp/data.txt";
SlidingFileWriterThingy w = new SlidingFileWriterThingy(Paths.get(path));
final long MAX = 5_000_000_000L;
while (true) {
    long offset = 0;
    while (offset < MAX) {
        // Advance by a random, heavily skewed stride (mostly small, sometimes large)
        offset += Math.pow(Math.random(), 4) * 100_000_000;
        if (offset > MAX/5 && offset < 2*MAX/5 || offset > 3*MAX/5 && offset < 4*MAX/5) {
            // Keep 2 big "empty" bands in the sparse file
            continue;
        }
        w.seek(offset);
        w.writeBytes(("---" + new Date() + "---").getBytes(StandardCharsets.UTF_8));
    }
    w.seek(0);
    System.out.println("---");
    // Print the file's block usage and system memory after each pass
    Scanner output = new Scanner(new ProcessBuilder("sh", "-c", "ls -ls " + path + "; free")
            .redirectErrorStream(true).start().getInputStream());
    while (output.hasNextLine()) {
        System.out.println(output.nextLine());
    }
    Runtime r = Runtime.getRuntime();
    long memoryUsage = (100 * (r.totalMemory() - r.freeMemory())) / r.totalMemory();
    System.out.println("Mem usage: " + memoryUsage + "%");
    Thread.sleep(1000);
}
So yes, that's empirical; maybe it only works correctly on recent Linux systems, and maybe it's just luck with that particular workload... but I'm starting to think it's a valid solution on some systems and workloads, and it can be useful.