How to clone an inputstream in java in minimal time

Can someone tell me how to clone an inputstream, taking as little creation time as possible? I need to clone an inputstream multiple times for multiple methods to process the IS. I've tried three ways and things don't work for one reason or another.

Method #1: Thanks to the stackoverflow community, I found the following link helpful and have incorporated the code snippet in my program.

How to clone an InputStream?

However, using this code can take up to one minute (for a 10MB file) to create the cloned inputstreams and my program needs to be as fast as possible.

    int read = 0;
    byte[] bytes = new byte[1024*1024*2];

    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    while ((read = is.read(bytes)) != -1)
        bos.write(bytes,0,read);
    byte[] ba = bos.toByteArray();

    InputStream is1 = new ByteArrayInputStream(ba);
    InputStream is2 = new ByteArrayInputStream(ba);
    InputStream is3 = new ByteArrayInputStream(ba);

Method #2: I also tried using BufferedInputStream to clone the IS. This was fast (slowest creation time == 1ms. fastest == 0ms). However, after I sent is1 to be processed, the methods processing is2 and is3 threw an error saying there was nothing to process, almost like all 3 variables below referenced the same IS.

    is = getFileFromBucket(path,filename);
    ...
    ...
    InputStream is1 = new BufferedInputStream(is);
    InputStream is2 = new BufferedInputStream(is);
    InputStream is3 = new BufferedInputStream(is);

Method #3: I think the compiler is lying to me. I checked markSupported() for is1 for the two examples above. It returned true so I thought I could run

    is1.mark() 
    is1.reset()

or just

    is1.reset();

before passing the IS to my respective methods. In both of the above examples, I get an error saying it's an invalid mark.

I'm out of ideas now so thanks in advance for any help you can give me.

P.S. From the comments I've received, I need to clarify a few things about my situation: 1) This program is running on a VM. 2) The inputstream is passed in to me from another method; I'm not reading from a local file. 3) The size of the inputstream is not known.

Welborn answered 9/11, 2012 at 2:4 Comment(5)
Running the code for Method #1 takes 18 ms (for a 10 MB file) on my computer. Is there something wrong with your hardware? - Maya
Thanks for the reply. I don't think there's anything wrong with my hardware. It just hit me that I forgot to mention 2 things: a) this is on a VM and b) the inputstream is of a jpg file. The fastest it's taken is 11 sec but eyeballing my tests, it's averaging like 30 sec or so, the slowest was about 1 min (53 sec to be exact). - Welborn
You might get a minor boost if you do this (works if it's a FileInputStream): byte[] ba = new byte[is.available()]; new DataInputStream(is).readFully(ba); - Maya
@FJ - 18ms for a 10MB file says to me that the entire file must already be in the OS's in-memory buffer cache. - Threefold
@StephenC - Good point, easy to forget. - Maya
D
6

how to clone an inputstream, taking as little creation time as possible? I need to clone an inputstream multiple times for multiple methods to process the IS

You could create a custom ReusableInputStream class that, on the 1st full read, also writes everything it reads to an internal ByteArrayOutputStream. When the last byte has been read, it wraps the collected bytes in a ByteBuffer, and every subsequent full read is served from that same ByteBuffer, which is automatically flipped back to the start when its limit is reached. This saves you the separate up-front full read of your 1st attempt.

Here's a basic kickoff example:

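import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

public class ReusableInputStream extends InputStream {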

    private InputStream input;
    private ByteArrayOutputStream output;
    private ByteBuffer buffer;

    public ReusableInputStream(InputStream input) throws IOException {
        this.input = input;
        this.output = new ByteArrayOutputStream(input.available()); // Note: it's resizable anyway.
    }

    @Override
    public int read() throws IOException {
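        byte[] b = new byte[1];
        int n = read(b, 0, 1);
        // -1 signals end of stream; mask so that bytes >= 0x80 are not returned as negative ints.
        return (n == -1) ? -1 : (b[0] & 0xFF);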
    }

    @Override
    public int read(byte[] bytes) throws IOException {
        return read(bytes, 0, bytes.length);
    }

    @Override
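    public int read(byte[] bytes, int offset, int length) throws IOException {
        if (length == 0) {
            return 0; // Keeps a zero-length read from being mistaken for end of stream below.
        }

        if (buffer == null) {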
            int read = input.read(bytes, offset, length);

            if (read <= 0) {
                input.close();
                input = null;
                buffer = ByteBuffer.wrap(output.toByteArray());
                output = null;
                return -1;
            } else {
                output.write(bytes, offset, read);
                return read;
            }
        } else {
            int read = Math.min(length, buffer.remaining());

            if (read <= 0) {
                buffer.flip();
                return -1;
            } else {
                buffer.get(bytes, offset, read);
                return read;
            }
        }

    }

    // You might want to @Override close(), available(), etc. as well.
}

(note that the actual job is performed in int read(byte[], int, int) instead of in int read() and thus it's expected to be faster when the caller itself is also streaming using a byte[] buffer)

You could use it as follows (IOUtils is from Apache Commons IO):

InputStream input = new ReusableInputStream(getFileFromBucket(path,filename));
IOUtils.copy(input, new FileOutputStream("/copy1.ext"));
IOUtils.copy(input, new FileOutputStream("/copy2.ext"));
IOUtils.copy(input, new FileOutputStream("/copy3.ext"));

As for the performance, 1 minute per 10MB is more likely a hardware problem than a software problem. My 7200rpm laptop hard disk does it in less than 1 second.

Disorderly answered 9/11, 2012 at 15:56 Comment(1)
Thanks for the code snippet. I will try it out along with the other suggestions! - Welborn
T
3

However, using this code can take up to one minute (for a 10MB file) to create the cloned inputstreams and my program needs to be as fast as possible.

Well, copying a stream takes time, and (in general) that is the only way to clone a stream. Unless you tighten the scope of the problem, there is little chance that the performance can be significantly improved.

Here are a couple of circumstances where improvement is possible:

  • If you knew beforehand the number of bytes in the stream then you can read directly into the final byte array.

  • If you knew that the data is coming from a file, you could create a memory mapped buffer for the file.
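
Both circumstances can be sketched roughly as below. This is only an illustration under the stated assumptions: the class and method names (KnownSizeReads, readExactly, mapFile) and the length and file parameters are made up for the example, and neither case applies if the size is unknown or the data does not come from a local file.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class KnownSizeReads {

    // Case 1: the length is known up front (assumption), so the bytes
    // land directly in the final array with no intermediate buffer.
    static byte[] readExactly(InputStream in, int length) throws IOException {
        byte[] data = new byte[length];
        // readFully throws EOFException if the stream ends before `length` bytes.
        new DataInputStream(in).readFully(data);
        return data;
    }

    // Case 2: the data is a local file (assumption), so it can be mapped read-only.
    // The mapping stays valid after the channel is closed.
    static MappedByteBuffer mapFile(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }
}
```

In the first case each consumer can get its own new ByteArrayInputStream(data) over the same array; in the second, each consumer can call duplicate() on the mapped buffer to get an independent read position.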

But the fundamental problem is that moving lots of bytes around takes time. And the fact that it is taking 1 minute for a 10MB file (using the code in your question) suggests that the real bottleneck is not in Java at all.

Threefold answered 9/11, 2012 at 3:6 Comment(0)
R
2

Regarding your first approach, the one that puts all your bytes into a ByteArrayOutputStream:

  • First, this approach consumes a lot of memory. If you do not make sure that your JVM starts with enough memory allocated, it will need to dynamically request memory during the processing of your stream and this is time consuming.
  • Your ByteArrayOutputStream is initially created with a buffer of 32 bytes. Every time you try to put something in it that does not fit in the existing byte array, a new, bigger array is created and the old bytes are copied to the new one. Since you are writing up to 2MB at a time, you are forcing the ByteArrayOutputStream to copy its data over and over again into bigger arrays as the buffer grows.
  • Since the old arrays are garbage, it is probable that their memory is being reclaimed by the garbage collector, which makes your copying process even slower.
  • Perhaps you should create the ByteArrayOutputStream using the constructor that specifies an initial buffer size. The more accurately you set the size, the faster the process should be, because fewer intermediate copies will be required.
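
As a rough sketch of that last point, assuming you can at least estimate the size: the class name PresizedCopy, the method readAll, and the sizeHint parameter are invented for this example. If the hint is too small, the stream still grows as usual, and toByteArray() makes one final copy regardless.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class PresizedCopy {

    // sizeHint is only an estimate (hypothetical parameter); a good guess
    // avoids most of the grow-and-copy cycles described above.
    static byte[] readAll(InputStream in, int sizeHint) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream(Math.max(sizeHint, 32));
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] source = new byte[100_000]; // stand-in for the real input
        byte[] data = readAll(new ByteArrayInputStream(source), 128 * 1024);

        // Each "clone" is a cheap view over the same array with its own position.
        InputStream is1 = new ByteArrayInputStream(data);
        InputStream is2 = new ByteArrayInputStream(data);
        System.out.println(is1.available() + " " + is2.available());
    }
}
```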

Your second approach is bogus: you cannot decorate the same input stream with several other streams and expect things to work. As the bytes are consumed by one stream, the inner stream is exhausted as well and cannot provide the other streams with accurate data.
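
A small demonstration of the difference, using an in-memory source as a stand-in for the real stream (the class name SharedSourceDemo and the helper drain are made up for the example):

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class SharedSourceDemo {

    // Reads until end of stream and returns how many bytes were seen.
    static int drain(InputStream in) throws IOException {
        byte[] chunk = new byte[1024];
        int total = 0;
        int n;
        while ((n = in.read(chunk)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[5000];

        // Two wrappers around ONE source: what the first consumes is gone for the second.
        InputStream source = new ByteArrayInputStream(data);
        InputStream wrapped1 = new BufferedInputStream(source);
        InputStream wrapped2 = new BufferedInputStream(source);
        System.out.println(drain(wrapped1)); // 5000
        System.out.println(drain(wrapped2)); // 0

        // Two independent views over the same byte[]: each has its own read position.
        InputStream view1 = new ByteArrayInputStream(data);
        InputStream view2 = new ByteArrayInputStream(data);
        System.out.println(drain(view1)); // 5000
        System.out.println(drain(view2)); // 5000
    }
}
```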

Before I extend my answer, let me ask: are the other methods that receive copies of the input stream going to run on separate threads? If so, this sounds like a job for PipedOutputStream and PipedInputStream.

Roush answered 9/11, 2012 at 6:49 Comment(1)
Thanks for your reply. Since another method is passing the inputstream to me I don't know the size of the IS coming in. I played with making the byte array to be 8MB but it still took a long time. Someone suggested I use BufferedInputStream and I guess I wasn't using it correctly so my bad for the bogus use =) I do plan to use threads for my other methods so I'll look into your suggestion of PipedIS and PipedOS to see if it helps. Right now, I'm just trying to get everything to work serially before I start playing with threads. - Welborn
H
1

Do you intend the separate methods to run in parallel or sequentially? If sequentially, I see no reason to clone the input stream, so I have to assume you're planning to spin off threads to manage each stream.

I'm not near a computer right now to test this, but I'm thinking you'd be better off reading the input in chunks, of say 1024 bytes, and then pushing those chunks (or array copies of the chunks) onto your output streams with input streams attached to their thread ends. Have your readers block if there's no data available, etc.
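
An untuned sketch of that idea using java.io's PipedOutputStream and PipedInputStream might look like the following. The class PipeFanOut and its methods are invented for illustration, and the 64KB pipe size is an arbitrary choice. Each consumer has to read on its own thread, otherwise the producer blocks as soon as a pipe fills up, and the slowest consumer sets the pace for all of them.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.ArrayList;
import java.util.List;

final class PipeFanOut {

    private final List<PipedOutputStream> sinks = new ArrayList<>();

    // Registers one more consumer and returns the read end of its pipe.
    PipedInputStream addConsumer() throws IOException {
        PipedOutputStream sink = new PipedOutputStream();
        sinks.add(sink);
        return new PipedInputStream(sink, 64 * 1024);
    }

    // Reads the source once, in 1024-byte chunks, and writes every chunk to every pipe.
    void pump(InputStream source) throws IOException {
        byte[] chunk = new byte[1024];
        int n;
        try {
            while ((n = source.read(chunk)) != -1) {
                for (PipedOutputStream sink : sinks) {
                    sink.write(chunk, 0, n); // blocks while this consumer's pipe is full
                }
            }
        } finally {
            for (PipedOutputStream sink : sinks) {
                sink.close(); // lets the reader see end of stream
            }
        }
    }

    // Stand-in for real processing: just counts the bytes it receives.
    static Thread startCounter(String name, InputStream in) {
        Thread t = new Thread(() -> {
            long total = 0;
            byte[] buf = new byte[4096];
            try (InputStream stream = in) {
                int n;
                while ((n = stream.read(buf)) != -1) {
                    total += n;
                }
                System.out.println(name + " received " + total + " bytes");
            } catch (IOException e) {
                e.printStackTrace();
            }
        }, name);
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        PipeFanOut fanOut = new PipeFanOut();
        Thread t1 = startCounter("consumer-1", fanOut.addConsumer());
        Thread t2 = startCounter("consumer-2", fanOut.addConsumer());
        fanOut.pump(new ByteArrayInputStream(new byte[200_000])); // stand-in source
        t1.join();
        t2.join();
    }
}
```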

Hendrik answered 9/11, 2012 at 6:33 Comment(1)
Thanks for your reply and suggestion. Yes, I plan to use threads...once I figure out how to fix this bottleneck. I'll try to do that, read in chunks, but I'm getting the input stream passed in from another method so I'll need to see if it's feasible in my case. - Welborn
