Best way to find position in the Stream where given byte sequence starts
Asked Answered
P

7

18

How do you think what is the best way to find position in the System.Stream where given byte sequence starts (first occurence):

public static long FindPosition(Stream stream, byte[] byteSequence)
{
    long position = -1;

    /// ???
    return position;
}

P.S. The simpliest yet fastest solution is preffered. :)

Posh answered 24/9, 2009 at 14:14 Comment(6)
your question is confusing...what are you looking for? that specific sequence of bytes in the stream?Lastly
I think the question's title should be updated. Stream is misspelled as Steam, which makes it seem like a question that should be tagged Valve.Russom
@chollida: Actually, I came to this question just to fix that.Golgi
Actually i'm looking for guid in the stream.Posh
is memory an issue? or can you read the whole stream into an array of bytes?Lastly
Please check if my solution fits your needs.Cowie
C
7

I've reached this solution.

I did some benchmarks with an ASCII file that was 3.050 KB and 38803 lines. With a search byte array of 22 bytes in the last line of the file I've got the result in about 2.28 seconds (in a slow/old machine).

public static long FindPosition(Stream stream, byte[] byteSequence)
{
    if (byteSequence.Length > stream.Length)
        return -1;

    byte[] buffer = new byte[byteSequence.Length];

    using (BufferedStream bufStream = new BufferedStream(stream, byteSequence.Length))
    {
        int i;
        while ((i = bufStream.Read(buffer, 0, byteSequence.Length)) == byteSequence.Length)
        {
            if (byteSequence.SequenceEqual(buffer))
                return bufStream.Position - byteSequence.Length;
            else
                bufStream.Position -= byteSequence.Length - PadLeftSequence(buffer, byteSequence);
        }
    }

    return -1;
}

private static int PadLeftSequence(byte[] bytes, byte[] seqBytes)
{
    int i = 1;
    while (i < bytes.Length)
    {
        int n = bytes.Length - i;
        byte[] aux1 = new byte[n];
        byte[] aux2 = new byte[n];
        Array.Copy(bytes, i, aux1, 0, n);
        Array.Copy(seqBytes, aux2, n);
        if (aux1.SequenceEqual(aux2))
            return i;
        i++;
    }
    return i;
}
Cowie answered 24/9, 2009 at 16:8 Comment(2)
For future reference, PadLeftSequence is searching for the first non-matching byte which caused SequenceEqual to return false. It seems like a micro-optimisation to me, since one would expect SequenceEqual to early-return on a non-match anyway. Disclaimer: I have not done any measurements, this is only opinion.Backman
doesnt it only work if the sequence is at index of a length multipication? I mean, 6 bytes seq at index 4 will not be found ?Bare
L
5

If you treat the stream like another sequence of bytes, you can just search it like you were doing a string search. Wikipedia has a great article on that. Boyer-Moore is a good and simple algorithm for this.

Here's a quick hack I put together in Java. It works and it's pretty close if not Boyer-Moore. Hope it helps ;)

public static final int BUFFER_SIZE = 32;

public static int [] buildShiftArray(byte [] byteSequence){
    int [] shifts = new int[byteSequence.length];
    int [] ret;
    int shiftCount = 0;
    byte end = byteSequence[byteSequence.length-1];
    int index = byteSequence.length-1;
    int shift = 1;

    while(--index >= 0){
        if(byteSequence[index] == end){
            shifts[shiftCount++] = shift;
            shift = 1;
        } else {
            shift++;
        }
    }
    ret = new int[shiftCount];
    for(int i = 0;i < shiftCount;i++){
        ret[i] = shifts[i];
    }
    return ret;
}

public static byte [] flushBuffer(byte [] buffer, int keepSize){
    byte [] newBuffer = new byte[buffer.length];
    for(int i = 0;i < keepSize;i++){
        newBuffer[i] = buffer[buffer.length - keepSize + i];
    }
    return newBuffer;
}

public static int findBytes(byte [] haystack, int haystackSize, byte [] needle, int [] shiftArray){
    int index = needle.length;
    int searchIndex, needleIndex, currentShiftIndex = 0, shift;
    boolean shiftFlag = false;

    index = needle.length;
    while(true){
        needleIndex = needle.length-1;
        while(true){
            if(index >= haystackSize)
                return -1;
            if(haystack[index] == needle[needleIndex])
                break;
            index++;
        }
        searchIndex = index;
        needleIndex = needle.length-1;
        while(needleIndex >= 0 && haystack[searchIndex] == needle[needleIndex]){
            searchIndex--;
            needleIndex--;
        }
        if(needleIndex < 0)
            return index-needle.length+1;
        if(shiftFlag){
            shiftFlag = false;
            index += shiftArray[0];
            currentShiftIndex = 1;
        } else if(currentShiftIndex >= shiftArray.length){
            shiftFlag = true;
            index++;
        } else{
            index += shiftArray[currentShiftIndex++];
        }           
    }
}

public static int findBytes(InputStream stream, byte [] needle){
    byte [] buffer = new byte[BUFFER_SIZE];
    int [] shiftArray = buildShiftArray(needle);
    int bufferSize, initBufferSize;
    int offset = 0, init = needle.length;
    int val;

    try{
        while(true){
            bufferSize = stream.read(buffer, needle.length-init, buffer.length-needle.length+init);
            if(bufferSize == -1)
                return -1;
            if((val = findBytes(buffer, bufferSize+needle.length-init, needle, shiftArray)) != -1)
                return val+offset;
            buffer = flushBuffer(buffer, needle.length);
            offset += bufferSize-init;
            init = 0;
        }
    } catch (IOException e){
        e.printStackTrace();
    }
    return -1;
}
Lastly answered 24/9, 2009 at 14:19 Comment(3)
it might not be simplest, but it's pretty fast. it think given the constraints of reading from a stream doesn't allow for simple if you want speed. but I hope my code can alleviate some of your troubles, or be of help to someone in the future.Lastly
It seems initBufferSize variable in findBytes is not used.Agree
Heads up: This solution seems to be in Java, whereas OP asked for C#Civilize
H
3

You'll basically need to keep a buffer the same size as byteSequence so that once you've found that the "next byte" in the stream matches, you can check the rest but then still go back to the "next but one" byte if it's not an actual match.

It's likely to be a bit fiddly whatever you do, to be honest :(

Hebel answered 24/9, 2009 at 14:14 Comment(0)
U
3

I needed to do this myself, had already started, and didn't like the solutions above. I specifically needed to find where the search-byte-sequence ends. In my situation, I need to fast-forward the stream until after that byte sequence. But you can use my solution for this question too:

var afterSequence = stream.ScanUntilFound(byteSequence);
var beforeSequence = afterSequence - byteSequence.Length;

Here is StreamExtensions.cs

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace System
{

    static class StreamExtensions
    {
        /// <summary>
        /// Advances the supplied stream until the given searchBytes are found, without advancing too far (consuming any bytes from the stream after the searchBytes are found).
        /// Regarding efficiency, if the stream is network or file, then MEMORY/CPU optimisations will be of little consequence here.
        /// </summary>
        /// <param name="stream">The stream to search in</param>
        /// <param name="searchBytes">The byte sequence to search for</param>
        /// <returns></returns>
        public static int ScanUntilFound(this Stream stream, byte[] searchBytes)
        {
            // For this class code comments, a common example is assumed:
            // searchBytes are {1,2,3,4} or 1234 for short
            // # means value that is outside of search byte sequence

            byte[] streamBuffer = new byte[searchBytes.Length];
            int nextRead = searchBytes.Length;
            int totalScannedBytes = 0;

            while (true)
            {
                FillBuffer(stream, streamBuffer, nextRead);
                totalScannedBytes += nextRead; //this is only used for final reporting of where it was found in the stream

                if (ArraysMatch(searchBytes, streamBuffer, 0))
                    return totalScannedBytes; //found it

                nextRead = FindPartialMatch(searchBytes, streamBuffer);
            }
        }

        /// <summary>
        /// Check all offsets, for partial match. 
        /// </summary>
        /// <param name="searchBytes"></param>
        /// <param name="streamBuffer"></param>
        /// <returns>The amount of bytes which need to be read in, next round</returns>
        static int FindPartialMatch(byte[] searchBytes, byte[] streamBuffer)
        {
            // 1234 = 0 - found it. this special case is already catered directly in ScanUntilFound            
            // #123 = 1 - partially matched, only missing 1 value
            // ##12 = 2 - partially matched, only missing 2 values
            // ###1 = 3 - partially matched, only missing 3 values
            // #### = 4 - not matched at all

            for (int i = 1; i < searchBytes.Length; i++)
            {
                if (ArraysMatch(searchBytes, streamBuffer, i))
                {
                    // EG. Searching for 1234, have #123 in the streamBuffer, and [i] is 1
                    // Output: 123#, where # will be read using FillBuffer next. 
                    Array.Copy(streamBuffer, i, streamBuffer, 0, searchBytes.Length - i);
                    return i; //if an offset of [i], makes a match then only [i] bytes need to be read from the stream to check if there's a match
                }
            }

            return 4;
        }

        /// <summary>
        /// Reads bytes from the stream, making sure the requested amount of bytes are read (streams don't always fulfill the full request first time)
        /// </summary>
        /// <param name="stream">The stream to read from</param>
        /// <param name="streamBuffer">The buffer to read into</param>
        /// <param name="bytesNeeded">How many bytes are needed. If less than the full size of the buffer, it fills the tail end of the streamBuffer</param>
        static void FillBuffer(Stream stream, byte[] streamBuffer, int bytesNeeded)
        {
            // EG1. [123#] - bytesNeeded is 1, when the streamBuffer contains first three matching values, but now we need to read in the next value at the end 
            // EG2. [####] - bytesNeeded is 4

            var bytesAlreadyRead = streamBuffer.Length - bytesNeeded; //invert
            while (bytesAlreadyRead < streamBuffer.Length)
            {
                bytesAlreadyRead += stream.Read(streamBuffer, bytesAlreadyRead, streamBuffer.Length - bytesAlreadyRead);
            }
        }

        /// <summary>
        /// Checks if arrays match exactly, or with offset. 
        /// </summary>
        /// <param name="searchBytes">Bytes to search for. Eg. [1234]</param>
        /// <param name="streamBuffer">Buffer to match in. Eg. [#123] </param>
        /// <param name="startAt">When this is zero, all bytes are checked. Eg. If this value 1, and it matches, this means the next byte in the stream to read may mean a match</param>
        /// <returns></returns>
        static bool ArraysMatch(byte[] searchBytes, byte[] streamBuffer, int startAt)
        {
            for (int i = 0; i < searchBytes.Length - startAt; i++)
            {
                if (searchBytes[i] != streamBuffer[i + startAt])
                    return false;
            }
            return true;
        }
    }
}
Uprear answered 25/2, 2017 at 14:5 Comment(0)
G
1

Most of the answers did not pass my test* (see below of my code) on 16 Mbyte random binary streams.
Only bruno conde's answer did work, but was a bit slower than my code (~1300 ms vs ~600 ms).

The solution code

static long FindPosition(Stream stream, byte[] pattern)
{
    long foundPosition = -1;
    int i = 0;
    int b;

    while ((b = stream.ReadByte()) > -1)
    {
        if (pattern[i++] != b)
        {
            stream.Position -= i - 1;
            i = 0;
            continue;
        }

        if (i == pattern.Length)
        {
            foundPosition = stream.Position - i;
            break;
        }
    }

    return foundPosition;
}

Description
Basically, it iterates the stream and the pattern array in parallel byte by byte.
As soon as the current stream byte does not match with the current pattern byte, the stream position will be set to the position where the current comparison started, but one byte further. And the pattern's "position" (i) is reset to 0, of course.
If i reaches the end of the pattern, all the stream bytes before were matching the whole pattern, so the position is found and can be returned.

The benchmark tests

Test file preparation

This is the code I used to create the test file:

private static void CreateRandomBinaryFile(string fileName, int sizeInMegabyte)
{
    Random rnd = new();
    byte[] b = new byte[1024];
    int rounds = sizeInMegabyte * 1024;

    using FileStream fs = File.Create(fileName);
    for (int i = 0; i < rounds; i++)
    {
        rnd.NextBytes(b);
        fs.Write(b, 0, b.Length);
    }
}
CreateRandomBinaryFile("fileOne.bin", 16);

For the second test file I moved the rnd.NextBytes(b) out of and before the for-loop and added this code after the for-loop:

b[0] = 0x12;
b[1] = 0x13;
b[2] = 0x15;
fs.Write(b, 0, b.Length);

So I will get a file where 1024 random bytes are repeated 1024 * 16 times, but only the last 1024-sequence modified.

The benchmark

public delegate long FindPositionDelegate(Stream stream, byte[] pattern);

struct Test
{
    public string File;
    public byte[] Needle;
}

static void Main(string[] args)
{
    Dictionary<string, FindPositionDelegate> codeSoultions = new()
    {
        { "This", FindPosition },
        { "Bruno", BrunoConde.FindPosition }
    };

    List<Test> tests = new()
    {
        new Test() {
            File = @"fileOne.bin",
            Needle = new byte[] { 0x4a, 0xaf, 0x6f, 0x34 }
        },
        new Test()
        {
            File = @"fileTwo.bin",
            Needle = new byte[] { 0xfc, 0x1d, 0x40, 0x38, 0x0b, 0xe4, 0x50, 0x5b,
                0x1a, 0xbf, 0x2a, 0xab, 0x9b, 0x1b, 0x0f, 0xc7, 0x12, 0x13, 0x15 }
        }
    };

    for (int i = 0; i < 3; i++)
    {
        foreach (var solution in codeSoultions)
        {
            foreach (var test in tests)
            {
                using FileStream file = File.OpenRead(test.File);
                Stopwatch watch = Stopwatch.StartNew();
                long pos = solution.Value(file, test.Needle);
                watch.Stop();
                long elapsedMs = watch.ElapsedMilliseconds;
                Console.WriteLine($"Iteration: {i},\t" +
                    $"Method: {solution.Key},\t" +
                    $"Result: {pos},\t" +
                    $"Elapsed ms: {elapsedMs}");
            }
        }
    }
}

Test #1
This code: ~ 598 ms
Bruno code: ~ 1371 ms

Test #2
This code: ~ 1371 ms
Bruno code: ~ 1509 ms

*
dharga's code thows Exception but does work with small byte arrays.
Kind Contributor throws Exception but does work with small byte arrays.
The others, where I have commented, return wrong values.

Glendon answered 9/6, 2023 at 21:24 Comment(0)
A
0

Bit old question, but here's my answer. I've found that reading blocks and then searching in that is extremely inefficient compared to just reading one at a time and going from there.

Also, IIRC, the accepted answer would fail if part of the sequence was in one block read and half in another - ex, given 12345, searching for 23, it would read 12, not match, then read 34, not match, etc... haven't tried it, though, seeing as it requires net 4.0. At any rate, this is way simpler, and likely much faster.

static long ReadOneSrch(Stream haystack, byte[] needle)
{
    int b;
    long i = 0;
    while ((b = haystack.ReadByte()) != -1)
    {
        if (b == needle[i++])
        {
            if (i == needle.Length)
                return haystack.Position - needle.Length;
        }
        else
            i = b == needle[0] ? 1 : 0;
    }

    return -1;
}
Actinopod answered 21/9, 2011 at 4:17 Comment(1)
The edit still does not fix all issues. Consider haystack = { 1, 1, 1, 0, 0, 1, 1, 0 } and needle = { 1, 1, 0 }. This code returns 5, which is the second occurence, instead of the right/first one at position 1.Glendon
A
-1
static long Search(Stream stream, byte[] pattern)
{
    long start = -1;

    stream.Seek(0, SeekOrigin.Begin);

    while(stream.Position < stream.Length)
    {
        if (stream.ReadByte() != pattern[0])
            continue;

        start = stream.Position - 1;

        for (int idx = 1; idx < pattern.Length; idx++)
        {
            if (stream.ReadByte() != pattern[idx])
            {
                start = -1;
                break;
            }
        }

        if (start > -1)
        {
            return start;
        }
    }

    return start;
}
Apeak answered 3/6, 2021 at 17:17 Comment(3)
Welcome to Stack Overflow. Try to avoid code only answer, and give some explanation of your code.Hals
this is a bad idea for non-seekable stream.Decision
Consider haystack = { 1, 1, 1, 0, 0, 1, 1, 0 } and needle = { 1, 1, 0 }. This code returns 5, which is the second occurence, instead of the right/first one at position 1.Glendon

© 2022 - 2024 — McMap. All rights reserved.