Filter (search and replace) array of bytes in an InputStream
Asked Answered
E

6

26

I have an InputStream which takes the html file as input parameter. I have to get the bytes from the input stream .

I have a string: "XYZ". I'd like to convert this string to byte format and check if there is a match for the string in the byte sequence which I obtained from the InputStream. If there is then, I have to replace the match with the bye sequence for some other string.

Is there anyone who could help me with this? I have used regex to find and replace. however finding and replacing byte stream, I am unaware of.

Previously, I use jsoup to parse html and replace the string, however due to some utf encoding problems, the file seems to appear corrupted when I do that.

TL;DR: My question is:

Is a way to find and replace a string in byte format in a raw InputStream in Java?

Ella answered 12/10, 2011 at 16:44 Comment(4)
And why do you read the file as a byte stream? if you read it as a String (for instance, with a StringReader) you can solve your problem and forget about the encodingSpinthariscope
Why do convert the strings to byte arrays and compare them, instead of comparing the original strings?Balaklava
Basically what you need is tutorials.jenkov.com/java-howto/… .Diet
There's a good reason to use bytes and not rely on the character encoding. I've had problems with files that had utf-8 conversion issues compounded by people storing the corrupted content with the wrong encoding and also mixing different newline styles. Easy to clean up once you know what to look for. But you have to do it at the byte level.Gallice
O
37

Not sure you have chosen the best approach to solve your problem.

That said, I don't like to (and have as policy not to) answer questions with "don't" so here goes...

Have a look at FilterInputStream.

From the documentation:

A FilterInputStream contains some other input stream, which it uses as its basic source of data, possibly transforming the data along the way or providing additional functionality.


It was a fun exercise to write it up. Here's a complete example for you:

import java.io.*;
import java.util.*;

class ReplacingInputStream extends FilterInputStream {

    LinkedList<Integer> inQueue = new LinkedList<Integer>();
    LinkedList<Integer> outQueue = new LinkedList<Integer>();
    final byte[] search, replacement;

    protected ReplacingInputStream(InputStream in,
                                   byte[] search,
                                   byte[] replacement) {
        super(in);
        this.search = search;
        this.replacement = replacement;
    }

    private boolean isMatchFound() {
        Iterator<Integer> inIter = inQueue.iterator();
        for (int i = 0; i < search.length; i++)
            if (!inIter.hasNext() || search[i] != inIter.next())
                return false;
        return true;
    }

    private void readAhead() throws IOException {
        // Work up some look-ahead.
        while (inQueue.size() < search.length) {
            int next = super.read();
            inQueue.offer(next);
            if (next == -1)
                break;
        }
    }

    @Override
    public int read() throws IOException {    
        // Next byte already determined.
        if (outQueue.isEmpty()) {
            readAhead();

            if (isMatchFound()) {
                for (int i = 0; i < search.length; i++)
                    inQueue.remove();

                for (byte b : replacement)
                    outQueue.offer((int) b);
            } else
                outQueue.add(inQueue.remove());
        }

        return outQueue.remove();
    }

    // TODO: Override the other read methods.
}

Example Usage

class Test {
    public static void main(String[] args) throws Exception {

        byte[] bytes = "hello xyz world.".getBytes("UTF-8");

        ByteArrayInputStream bis = new ByteArrayInputStream(bytes);

        byte[] search = "xyz".getBytes("UTF-8");
        byte[] replacement = "abc".getBytes("UTF-8");

        InputStream ris = new ReplacingInputStream(bis, search, replacement);

        ByteArrayOutputStream bos = new ByteArrayOutputStream();

        int b;
        while (-1 != (b = ris.read()))
            bos.write(b);

        System.out.println(new String(bos.toByteArray()));

    }
}

Given the bytes for the string "Hello xyz world" it prints:

Hello abc world
Obsecrate answered 12/10, 2011 at 16:55 Comment(6)
+1 for the clean, queue-based implementation, but depending on the application, it might matter that this simple approach is slow: O(MN)*, where M is pattern length and N is file length. Also, depending on what you're searching for, ignoring the structure of HTML can get you into trouble.Xylina
good point and I completely agree with you. I just did the part I thought was fun :-) didn't even implement all read methods..Obsecrate
Try to find new byte[] { (byte) 0xFF, (byte) 0x00} you will be surprised You must user byte_value & 0xFF values for byte->integer instead simple write byte_value For example outQueue.offer((int) b); must be outQueue.offer((int) (b&0xFF));Sindhi
This is good but this doesn't work for special characters either. i had to use ` byte next = (byte) super.read(); inQueue.offer((int) next);` because otherwise super.read() will turn things into an int, which fails the findAmethyst
What do you mean with "special characters"?Obsecrate
Fine implementation if you don't need performance. I went with my solution above to avoid having to box primitives like you have to do with yours. Also, you are effectively forcing every byte through the readAhead (so you box every byte) and your comparison is expensive and unboxes. Mine is a simple == on a byte and the NOT_MATCHING state has zero overhead beyond doing that on 1 byte. It matters if you are processing a few GB of csv. Also it won't have the issue @DavidT mentioned.Gallice
R
4

The following approach will work but I don't how big the impact is on the performance.

  1. Wrap the InputStream with a InputStreamReader,
  2. wrap the InputStreamReader with a FilterReader that replaces the strings, then
  3. wrap the FilterReader with a ReaderInputStream.

It is crucial to choose the appropriate encoding, otherwise the content of the stream will become corrupted.

If you want to use regular expressions to replace the strings, then you can use Streamflyer, a tool of mine, which is a convenient alternative to FilterReader. You will find an example for byte streams on the webpage of Streamflyer. Hope this helps.

Rapt answered 22/6, 2012 at 14:44 Comment(0)
G
4

I needed something like this as well and decided to roll my own solution instead of using the example above by @aioobe. Have a look at the code. You can pull the library from maven central, or just copy the source code.

This is how you use it. In this case, I'm using a nested instance to replace two patterns two fix dos and mac line endings.

new ReplacingInputStream(new ReplacingInputStream(is, "\n\r", "\n"), "\r", "\n");

Here's the full source code:

/**
 * Simple FilterInputStream that can replace occurrances of bytes with something else.
 */
public class ReplacingInputStream extends FilterInputStream {

    // while matching, this is where the bytes go.
    int[] buf=null;
    int matchedIndex=0;
    int unbufferIndex=0;
    int replacedIndex=0;

    private final byte[] pattern;
    private final byte[] replacement;
    private State state=State.NOT_MATCHED;

    // simple state machine for keeping track of what we are doing
    private enum State {
        NOT_MATCHED,
        MATCHING,
        REPLACING,
        UNBUFFER
    }

    /**
     * @param is input
     * @return nested replacing stream that replaces \n\r (DOS) and \r (MAC) line endings with UNIX ones "\n".
     */
    public static InputStream newLineNormalizingInputStream(InputStream is) {
        return new ReplacingInputStream(new ReplacingInputStream(is, "\n\r", "\n"), "\r", "\n");
    }

    /**
     * Replace occurances of pattern in the input. Note: input is assumed to be UTF-8 encoded. If not the case use byte[] based pattern and replacement.
     * @param in input
     * @param pattern pattern to replace.
     * @param replacement the replacement or null
     */
    public ReplacingInputStream(InputStream in, String pattern, String replacement) {
        this(in,pattern.getBytes(StandardCharsets.UTF_8), replacement==null ? null : replacement.getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Replace occurances of pattern in the input.
     * @param in input
     * @param pattern pattern to replace
     * @param replacement the replacement or null
     */
    public ReplacingInputStream(InputStream in, byte[] pattern, byte[] replacement) {
        super(in);
        Validate.notNull(pattern);
        Validate.isTrue(pattern.length>0, "pattern length should be > 0", pattern.length);
        this.pattern = pattern;
        this.replacement = replacement;
        // we will never match more than the pattern length
        buf = new int[pattern.length];
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        // copy of parent logic; we need to call our own read() instead of super.read(), which delegates instead of calling our read
        if (b == null) {
            throw new NullPointerException();
        } else if (off < 0 || len < 0 || len > b.length - off) {
            throw new IndexOutOfBoundsException();
        } else if (len == 0) {
            return 0;
        }

        int c = read();
        if (c == -1) {
            return -1;
        }
        b[off] = (byte)c;

        int i = 1;
        try {
            for (; i < len ; i++) {
                c = read();
                if (c == -1) {
                    break;
                }
                b[off + i] = (byte)c;
            }
        } catch (IOException ee) {
        }
        return i;

    }

    @Override
    public int read(byte[] b) throws IOException {
        // call our own read
        return read(b, 0, b.length);
    }

    @Override
    public int read() throws IOException {
        // use a simple state machine to figure out what we are doing
        int next;
        switch (state) {
        case NOT_MATCHED:
            // we are not currently matching, replacing, or unbuffering
            next=super.read();
            if(pattern[0] == next) {
                // clear whatever was there
                buf=new int[pattern.length]; // clear whatever was there
                // make sure we start at 0
                matchedIndex=0;

                buf[matchedIndex++]=next;
                if(pattern.length == 1) {
                    // edgecase when the pattern length is 1 we go straight to replacing
                    state=State.REPLACING;
                    // reset replace counter
                    replacedIndex=0;
                } else {
                    // pattern of length 1
                    state=State.MATCHING;
                }
                // recurse to continue matching
                return read();
            } else {
                return next;
            }
        case MATCHING:
            // the previous bytes matched part of the pattern
            next=super.read();
            if(pattern[matchedIndex]==next) {
                buf[matchedIndex++]=next;
                if(matchedIndex==pattern.length) {
                    // we've found a full match!
                    if(replacement==null || replacement.length==0) {
                        // the replacement is empty, go straight to NOT_MATCHED
                        state=State.NOT_MATCHED;
                        matchedIndex=0;
                    } else {
                        // start replacing
                        state=State.REPLACING;
                        replacedIndex=0;
                    }
                }
            } else {
                // mismatch -> unbuffer
                buf[matchedIndex++]=next;
                state=State.UNBUFFER;
                unbufferIndex=0;
            }
            return read();
        case REPLACING:
            // we've fully matched the pattern and are returning bytes from the replacement
            next=replacement[replacedIndex++];
            if(replacedIndex==replacement.length) {
                state=State.NOT_MATCHED;
                replacedIndex=0;
            }
            return next;
        case UNBUFFER:
            // we partially matched the pattern before encountering a non matching byte
            // we need to serve up the buffered bytes before we go back to NOT_MATCHED
            next=buf[unbufferIndex++];
            if(unbufferIndex==matchedIndex) {
                state=State.NOT_MATCHED;
                matchedIndex=0;
            }
            return next;

        default:
            throw new IllegalStateException("no such state " + state);
        }
    }

    @Override
    public String toString() {
        return state.name() + " " + matchedIndex + " " + replacedIndex + " " + unbufferIndex;
    }

}
Gallice answered 2/12, 2016 at 22:1 Comment(3)
link to github.com/Inbot/inbot-utils/blob/master/src/main/java/io/inbot/… is deadOutofdate
Thanks for pointing that out, I forked the repository before the company shut down. Just updated the link.Gallice
Fine implementation if you don't need correctness. I went with my solution above to handle basic things like replacing ab with xx given input aab. Any solution that claims to be production quality should use KMP or something similar for a task like this.Obsecrate
X
3

There isn't any built-in functionality for search-and-replace on byte streams (InputStream).

And, a method for completing this task efficiently and correctly is not immediately obvious. I have implemented the Boyer-Moore algorithm for streams, and it works well, but it took some time. Without an algorithm like this, you have to resort to a brute-force approach where you look for the pattern starting at every position in the stream, which can be slow.

Even if you decode the HTML as text, using a regular expression to match patterns might be a bad idea, since HTML is not a "regular" language.

So, even though you've run into some difficulties, I suggest you pursue your original approach of parsing the HTML as a document. While you are having trouble with the character encoding, it will probably be easier, in the long run, to fix the right solution than it will be to jury-rig the wrong solution.

Xylina answered 12/10, 2011 at 17:41 Comment(0)
L
1

I needed a solution to this, but found the answers here incurred too much memory and/or CPU overhead. The below solution significantly outperforms the others here in these terms based on simple benchmarking.

This solution is especially memory-efficient, incurring no measurable cost even with >GB streams.

That said, this is not a zero-CPU-cost solution. The CPU/processing-time overhead is probably reasonable for all but the most demanding/resource-sensitive scenarios, but the overhead is real and should be considered when evaluating the worthiness of employing this solution in a given context.

In my case, our max real-world file size that we are processing is about 6MB, where we see added latency of about 170ms with 44 URL replacements. This is for a Zuul-based reverse-proxy running on AWS ECS with a single CPU share (1024). For most of the files (under 100KB), the added latency is sub-millisecond. Under high-concurrency (and thus CPU contention), the added latency could increase, however we are currently able to process hundreds of the files concurrently on a single node with no humanly-noticeable latency impact.

The solution we are using:

import java.io.IOException;
import java.io.InputStream;

public class TokenReplacingStream extends InputStream {

    private final InputStream source;
    private final byte[] oldBytes;
    private final byte[] newBytes;
    private int tokenMatchIndex = 0;
    private int bytesIndex = 0;
    private boolean unwinding;
    private int mismatch;
    private int numberOfTokensReplaced = 0;

    public TokenReplacingStream(InputStream source, byte[] oldBytes, byte[] newBytes) {
        assert oldBytes.length > 0;
        this.source = source;
        this.oldBytes = oldBytes;
        this.newBytes = newBytes;
    }

    @Override
    public int read() throws IOException {

        if (unwinding) {
            if (bytesIndex < tokenMatchIndex) {
                return oldBytes[bytesIndex++];
            } else {
                bytesIndex = 0;
                tokenMatchIndex = 0;
                unwinding = false;
                return mismatch;
            }
        } else if (tokenMatchIndex == oldBytes.length) {
            if (bytesIndex == newBytes.length) {
                bytesIndex = 0;
                tokenMatchIndex = 0;
                numberOfTokensReplaced++;
            } else {
                return newBytes[bytesIndex++];
            }
        }

        int b = source.read();
        if (b == oldBytes[tokenMatchIndex]) {
            tokenMatchIndex++;
        } else if (tokenMatchIndex > 0) {
            mismatch = b;
            unwinding = true;
        } else {
            return b;
        }

        return read();

    }

    @Override
    public void close() throws IOException {
        source.close();
    }

    public int getNumberOfTokensReplaced() {
        return numberOfTokensReplaced;
    }

}
Laynelayney answered 9/9, 2018 at 13:21 Comment(0)
R
1

I came up with this simple piece of code when I needed to serve a template file in a Servlet replacing a certain keyword by a value. It should be pretty fast and low on memory. Then using Piped Streams I guess you can use it for all sorts of things.

/JC

public static void replaceStream(InputStream in, OutputStream out, String search, String replace) throws IOException
{
    replaceStream(new InputStreamReader(in), new OutputStreamWriter(out), search, replace);
}

public static void replaceStream(Reader in, Writer out, String search, String replace) throws IOException
{
    char[] searchChars = search.toCharArray();
    int[] buffer = new int[searchChars.length];

    int x, r, si = 0, sm = searchChars.length;
    while ((r = in.read()) > 0) {

        if (searchChars[si] == r) {
            // The char matches our pattern
            buffer[si++] = r;

            if (si == sm) {
                // We have reached a matching string
                out.write(replace);
                si = 0;
            }
        } else if (si > 0) {
            // No match and buffered char(s), empty buffer and pass the char forward
            for (x = 0; x < si; x++) {
                out.write(buffer[x]);
            }
            si = 0;
            out.write(r);
        } else {
            // No match and nothing buffered, just pass the char forward
            out.write(r);
        }
    }

    // Empty buffer
    for (x = 0; x < si; x++) {
        out.write(buffer[x]);
    }
}
Renault answered 11/12, 2018 at 23:16 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.