How can I make reverse scanning of a binary file faster?
Asked Answered
P

2

27

I have a binary file specification that describes a packetized data structure. Each data packet has a two-byte sync pattern, so scanning for the beginning of a packet is possible, using a BinaryReader and FileStream combination:

while(!reader.EndOfFile)
{
    // Check for sync pattern.
    if (reader.ReadUInt16() != 0xEB25)
    {
        // Move to next byte.
        reader.BaseStream.Seek(-1, SeekOrigin.Current);
        continue;
    }

    // If we got here, a sync pattern was found.
}

This process works perfectly fine in the forward direction, but similar code scanning in the reverse direction is at least two orders of magnitude slower:

while(!reader.BeginningOfFile)
{
    // Check for sync pattern.
    if (reader.ReadUInt16() != 0xEB25)
    {
        // Move to previous byte.
        reader.BaseStream.Seek(-3, SeekOrigin.Current);
        continue;
    }

    // If we got here, a sync pattern was found.
}

I've tried a few workarounds, like moving back by an arbitrary amount (currently 1 megabyte) and scanning forward, but it's becoming clear that what I really need is a BinaryReader or FileStream that is modified to have adequate performance characteristics when reading in both forward and reverse directions.

I already have a FastFileStream which improves forward read performance by subclassing an ordinary FileStream and caching the Position and Length properties (it also provides the BeginningOfFile and EndOfFile properties). That's what drives the reader variable in the code above.

Is there something similar I could do to improve reverse reading performance, perhaps by incorporating a MemoryStream as a buffer?

Pleasure answered 5/3, 2012 at 22:34 Comment(10)
This process works perfectly fine in the forward direction. It is also awful. Read two bytes, if is not EB25 then seek one byte back.Creole
@L.B: The actual code is optimized a bit more than this... In the forward direction, I check for 0x25 first (respecting Little Endian), and then for 0xEB. The code I've posted here is simplified for clarity. Trust me, it's not the way I'm checking the bytes; the slowdown in the reverse direction is happening because file systems are not designed to work backwards like this.Pleasure
You're not going to have much luck here. All layers (Lib, OS, Hardware) are optimized for forward reading. Your approach of taking a big step back and then scan-forward seems reasonable.Castrato
Then I would try Memory Mapped filesCreole
@HenkHolterman: Agreed, but I need to abstract that approach out, and I'm not quite sure how to go about doing it. I'd like it to be the same API that FileStream currently has.Pleasure
@Creole It's an interesting thought, but these files are very large, and I still prefer the streaming API of the FileStream class. A FastForwardReverseFileStream would be a drop-in replacement.Pleasure
OK, you can go with FastForwardReverseFileStream. Just a note: using Memory Mapped file doesn't mean you load all the file content into the memory.Creole
I'd try to read pieces (eg 50Mb) from end to memory and perform fast array scan to search your pattern. Next read previous 50Mb piece and repeat.Lowenstern
If all else fails, reverse the contents of your file and scan forward. I know This is a bit of a shot in the dark.Commission
Why don't you use BufferedStream over FileStream? Moving backward or forward does not make difference as Buffered Stream will read file in block and scanning block in reverse order or forward order will not make any great difference, and even with normal FileStream, you can read a block of byte buffer and read it in reverse order, in simple for loop with decrementing index, filestream uses 4kb buffer anyway, but it is optimized for forward reading.Rosecan
N
16

L.B mentioned in a comment to use a Memory Mapped file, you may be impressed with the performance.

Please try something like this:

var memoryMapName = Path.GetFileName(fileToRead);

using (var mapStream = new FileStream(fileToRead, FileMode.Open))
{
    using (var myMap = MemoryMappedFile.CreateFromFile(
                            mapStream, 
                            memoryMapName, mapStream.Length,
                            MemoryMappedFileAccess.Read, null, 
                            HandleInheritability.None, false))
    {                    
        long leftToRead = mapStream.Length;
        long mapSize = Math.Min(1024 * 1024, mapStream.Length);
        long bytesRead = 0;
        long mapOffset = Math.Max(mapStream.Length - mapSize, 0);

        while (leftToRead > 1)
        {
            using (var FileMap = myMap.CreateViewAccessor(mapOffset, 
                                 mapSize, MemoryMappedFileAccess.Read))
            {
                long readAt = mapSize - 2;
                while (readAt > -1)
                {
                    var int16Read = FileMap.ReadUInt16(readAt);
                    //0xEB25  <--check int16Read here                            
                    bytesRead += 1;
                    readAt -= 1;
                }
            }

            leftToRead = mapStream.Length- bytesRead;
            mapOffset = Math.Max(mapOffset - mapSize, 0);
            mapSize = Math.Min(mapSize, leftToRead);
        }
    }
}
Noto answered 6/3, 2012 at 4:9 Comment(4)
I went ahead and upvoted your answer, because I think it's a good one, but just be aware that the question has a .NET 3.5 tag on it, and I believe that MemoryMappedFile is only available in .NET 4.0. :)Pleasure
To be fair you could always P/Invoke the mapped file API in 3.5 too, so if this way helps, you can still use it with some minor function call changes.Ber
@RobertHarvey You could do it in 3.5 but you would have to wrap the Win32 functionsDoll
@RobertHarvey Ahh, yes...I did see the 3.5 tag, but was hoping it wasn't a definite requirement. I did some tests with MemoryMapped files, and even reading forward is significantly faster than using a FileStream. I'm sorry this will not work for you.Noto
G
23

EDIT: Okay, I've got some code. Well, quite a lot of code. It allows you to scan forwards and backwards for packet headers.

I make no guarantee that it has no bugs, and you definitely want to tweak the buffer size to see how it performs... but given the same file you sent me, it at least shows the same packet header locations when scanning forwards and backwards :)

Before the code, I would still suggest that if you possibly can, scanning through the file once and saving an index of packet information for later use would probably be a better approach.

Anyway, here's the code (complete with no tests other than the sample program):

PacketHeader.cs:

using System;

namespace Chapter10Reader
{
    public sealed class PacketHeader
    {
        private readonly long filePosition;
        private readonly ushort channelId;
        private readonly uint packetLength;
        private readonly uint dataLength;
        private readonly byte dataTypeVersion;
        private readonly byte sequenceNumber;
        private readonly byte packetFlags;
        private readonly byte dataType;
        private readonly ulong relativeTimeCounter;

        public long FilePosition { get { return filePosition; } }
        public ushort ChannelId { get { return channelId; } }
        public uint PacketLength { get { return packetLength; } }
        public uint DataLength { get { return dataLength; } }
        public byte DataTypeVersion { get { return dataTypeVersion; } }
        public byte SequenceNumber { get { return sequenceNumber; } }
        public byte PacketFlags { get { return packetFlags; } }
        public byte DataType { get { return dataType; } }
        public ulong RelativeTimeCounter { get { return relativeTimeCounter; } }

        public PacketHeader(ushort channelId, uint packetLength, uint dataLength, byte dataTypeVersion,
            byte sequenceNumber, byte packetFlags, byte dataType, ulong relativeTimeCounter, long filePosition)
        {
            this.channelId = channelId;
            this.packetLength = packetLength;
            this.dataLength = dataLength;
            this.dataTypeVersion = dataTypeVersion;
            this.sequenceNumber = sequenceNumber;
            this.packetFlags = packetFlags;
            this.dataType = dataType;
            this.relativeTimeCounter = relativeTimeCounter;
            this.filePosition = filePosition;
        }

        internal static PacketHeader Parse(byte[] data, int index, long filePosition)
        {
            if (index + 24 > data.Length)
            {
                throw new ArgumentException("Packet header must be 24 bytes long; not enough data");
            }
            ushort syncPattern = BitConverter.ToUInt16(data, index + 0);
            if (syncPattern != 0xeb25)
            {
                throw new ArgumentException("Packet header must start with the sync pattern");
            }
            ushort channelId = BitConverter.ToUInt16(data, index + 2);
            uint packetLength = BitConverter.ToUInt32(data, index + 4);
            uint dataLength = BitConverter.ToUInt32(data, index + 8);
            byte dataTypeVersion = data[index + 12];
            byte sequenceNumber = data[index + 13];
            byte packetFlags = data[index + 14];
            byte dataType = data[index + 15];
            // TODO: Validate this...
            ulong relativeTimeCounter =
                (ulong)BitConverter.ToUInt32(data, index + 16) +
                ((ulong)BitConverter.ToUInt16(data, index + 20)) << 32;
            // Assume we've already validated the checksum...
            return new PacketHeader(channelId, packetLength, dataLength, dataTypeVersion, sequenceNumber,
                packetFlags, dataType, relativeTimeCounter, filePosition);
        }

        /// <summary>
        /// Checks a packet header's checksum to see whether this *looks* like a packet header.
        /// </summary>
        internal static bool CheckPacketHeaderChecksum(byte[] data, int index)
        {
            if (index + 24 > data.Length)
            {
                throw new ArgumentException("Packet header must is 24 bytes long; not enough data");
            }
            ushort computed = 0;
            for (int i = 0; i < 11; i++)
            {
                computed += BitConverter.ToUInt16(data, index + i * 2);
            }
            return computed == BitConverter.ToUInt16(data, index + 22);
        }
    }
}

PacketScanner.cs:

using System;
using System.Diagnostics;
using System.IO;

namespace Chapter10Reader
{
    public sealed class PacketScanner : IDisposable
    {
        // 128K buffer... tweak this.
        private const int BufferSize = 1024 * 128;

        /// <summary>
        /// Where in the file does the buffer start?
        /// </summary>
        private long bufferStart;

        /// <summary>
        /// Where in the file does the buffer end (exclusive)?
        /// </summary>
        private long bufferEnd;

        /// <summary>
        /// Where are we in the file, logically?
        /// </summary>
        private long logicalPosition;

        // Probably cached by FileStream, but we use it a lot, so let's
        // not risk it...
        private readonly long fileLength;

        private readonly FileStream stream;
        private readonly byte[] buffer = new byte[BufferSize];        

        private PacketScanner(FileStream stream)
        {
            this.stream = stream;
            this.fileLength = stream.Length;
        }

        public void MoveToEnd()
        {
            logicalPosition = fileLength;
            bufferStart = -1; // Invalidate buffer
            bufferEnd = -1;
        }

        public void MoveToBeforeStart()
        {
            logicalPosition = -1;
            bufferStart = -1;
            bufferEnd = -1;
        }

        private byte this[long position]
        {
            get 
            {
                if (position < bufferStart || position >= bufferEnd)
                {
                    FillBuffer(position);
                }
                return buffer[position - bufferStart];
            }
        }

        /// <summary>
        /// Fill the buffer to include the given position.
        /// If the position is earlier than the buffer, assume we're reading backwards
        /// and make position one before the end of the buffer.
        /// If the position is later than the buffer, assume we're reading forwards
        /// and make position the start of the buffer.
        /// If the buffer is invalid, make position the start of the buffer.
        /// </summary>
        private void FillBuffer(long position)
        {
            long newStart;
            if (position > bufferStart)
            {
                newStart = position;
            }
            else
            {
                // Keep position *and position + 1* to avoid swapping back and forth too much
                newStart = Math.Max(0, position - buffer.Length + 2);
            }
            // Make position the start of the buffer.
            int bytesRead;
            int index = 0;
            stream.Position = newStart;
            while ((bytesRead = stream.Read(buffer, index, buffer.Length - index)) > 0)
            {
                index += bytesRead;
            }
            bufferStart = newStart;
            bufferEnd = bufferStart + index;
        }

        /// <summary>
        /// Make sure the buffer contains the given positions.
        /// 
        /// </summary>
        private void FillBuffer(long start, long end)
        {
            if (end - start > buffer.Length)
            {
                throw new ArgumentException("Buffer not big enough!");
            }
            if (end > fileLength)
            {
                throw new ArgumentException("Beyond end of file");
            }
            // Nothing to do.
            if (start >= bufferStart && end < bufferEnd)
            {
                return;
            }
            // TODO: Optimize this more to use whatever bits we've actually got.
            // (We're optimized for "we've got the start, get the end" but not the other way round.)
            if (start >= bufferStart)
            {
                // We've got the start, but not the end. Just shift things enough and read the end...
                int shiftAmount = (int) (end - bufferEnd);
                Buffer.BlockCopy(buffer, shiftAmount, buffer, 0, (int) (bufferEnd - bufferStart - shiftAmount));
                stream.Position = bufferEnd;
                int bytesRead;
                int index = (int)(bufferEnd - bufferStart - shiftAmount);
                while ((bytesRead = stream.Read(buffer, index, buffer.Length - index)) > 0)
                {
                    index += bytesRead;
                }
                bufferStart += shiftAmount;
                bufferEnd = bufferStart + index;
                return;
            }

            // Just fill the buffer starting from start...
            bufferStart = -1;
            bufferEnd = -1;
            FillBuffer(start);
        }

        /// <summary>
        /// Returns the header of the next packet, or null 
        /// if we've reached the end of the file.
        /// </summary>
        public PacketHeader NextHeader()
        {
            for (long tryPosition = logicalPosition + 1; tryPosition < fileLength - 23; tryPosition++)
            {
                if (this[tryPosition] == 0x25 && this[tryPosition + 1] == 0xEB)
                {
                    FillBuffer(tryPosition, tryPosition + 24);
                    int bufferPosition = (int) (tryPosition - bufferStart);
                    if (PacketHeader.CheckPacketHeaderChecksum(buffer, bufferPosition))
                    {
                        logicalPosition = tryPosition;
                        return PacketHeader.Parse(buffer, bufferPosition, tryPosition);
                    }
                }
            }
            logicalPosition = fileLength;
            return null;
        }

        /// <summary>
        /// Returns the header of the previous packet, or null 
        /// if we've reached the start of the file.
        /// </summary>
        public PacketHeader PreviousHeader()
        {
            for (long tryPosition = logicalPosition - 1; tryPosition >= 0; tryPosition--)
            {
                if (this[tryPosition + 1] == 0xEB && this[tryPosition] == 0x25)
                {
                    FillBuffer(tryPosition, tryPosition + 24);
                    int bufferPosition = (int)(tryPosition - bufferStart);
                    if (PacketHeader.CheckPacketHeaderChecksum(buffer, bufferPosition))
                    {
                        logicalPosition = tryPosition;
                        return PacketHeader.Parse(buffer, bufferPosition, tryPosition);
                    }
                }
            }
            logicalPosition = -1;
            return null;
        }

        public static PacketScanner OpenFile(string filename)
        {
            return new PacketScanner(File.OpenRead(filename));
        }

        public void Dispose()
        {
            stream.Dispose();
        }
    }
}

Program.cs (for testing):

using System;
using System.Collections.Generic;
using System.Linq;

namespace Chapter10Reader
{
    class Program
    {
        static void Main(string[] args)
        {
            string filename = "test.ch10";

            Console.WriteLine("Forwards:");
            List<long> positionsForward = new List<long>();
            using (PacketScanner scanner = PacketScanner.OpenFile(filename))
            {
                scanner.MoveToBeforeStart();
                PacketHeader header;
                while ((header = scanner.NextHeader()) != null)
                {
                    Console.WriteLine("Found header at {0}", header.FilePosition);
                    positionsForward.Add(header.FilePosition);
                }
            }
            Console.WriteLine();
            Console.WriteLine("Backwards:");
            List<long> positionsBackward = new List<long>();
            using (PacketScanner scanner = PacketScanner.OpenFile(filename))
            {
                scanner.MoveToEnd();
                PacketHeader header;
                while ((header = scanner.PreviousHeader()) != null)
                {
                    positionsBackward.Add(header.FilePosition);
                }
            }
            positionsBackward.Reverse();
            foreach (var position in positionsBackward)
            {
                Console.WriteLine("Found header at {0}", position);
            }

            Console.WriteLine("Same? {0}", positionsForward.SequenceEqual(positionsBackward));
        }
    }
}
Grogan answered 8/3, 2012 at 17:48 Comment(13)
There's a Packet Header that has a field in it that specifies the size of the packet. I might read the packet, or I might use the size to skip to the next packet if that's not the packet that I want. This is just fine for forward scanning, provided the file is not corrupted, in which case I have to revert to scanning for sync patterns. The packets can be up to 1/2 megabyte in size. The specification for the packets is available in excruciating detail in this document, starting at section 10.6, if you're interested.Pleasure
@RobertHarvey: Thanks, will have a look at home tonight and try to whip up some code. Do you have a sample data file I could play with?Grogan
The optimization is necessary due to a Next/Previous feature in the UI. Skipping backwards by a single packet is not really a problem; the delay is perceptible but not egregious. Skipping backwards by many packets is not so imperceptible; the packets have a channel number assigned to them, and I might want the previous packet in the same channel, so I might have to go back a hundred packets or more to find it.Pleasure
Sample data is here: irig106.org/sample_data. I will get you the login and password.Pleasure
@RobertHarvey: It strikes me that the bytes 0xEB25 could occur anywhere in arbitrary data. If you're scanning backwards, what's to stop you from getting false positives, and treating some random bit of data as a packet header? The header checksum would help to prevent this to some extent, but it wouldn't be foolproof... Do you really need to scan backwards rather than scanning forwards once and remembering all the headers, then just seeking to the packets you want?Grogan
The Packet Header stores a simple checksum of itself, as well as a checksum of the data payload. Verifying the Packet Header checksum establishes the sync pattern as valid. The files are extremely large (measured in tens of gigabytes); pre-scanning each one to build an index would require about 30 to 60 minutes per file.Pleasure
@RobertHarvey: Yes, I know. My point is that if a data packet happens to contain an 0xEB25 sequence of bytes, it would be quite unlikely but not impossible for that portion of data, when parsed as a packet, to have the right checksum. (And really - 30 minutes per file? You can read a lot of data sequentially in that time. Like, huge amounts... especially if you're actually seeking to skip over the data parts.)Grogan
The sync pattern can be further confirmed by moving forward past the packet using the packet length, and verifying that there is another sync pattern present there. The seek operation needs to take place in milliseconds, not minutes (it's on a "previous" button in the UI.) Skipping over the data payload of each packet approximately doubles read performance, so my estimates might be off by 50%.Pleasure
@RobertHarvey: Right, but that's going to be pretty expensive, and still not entirely foolproof. Basically it looks like this was a file format which was fundamentally designed to be read forwards.Grogan
Quite right. There is an indexing capability already built-in, but since it's not a required feature of the specification, nobody uses it. :P So the index packets that would allow random access are simply not there.Pleasure
@RobertHarvey: Hmm. How often do you need to process new files? Could you perform the indexing yourself once initially, store that with the file, and then use that? I really think that going backwards is going to lead to slow, ugly, unreliable code. If you could ping me the username/password ([email protected]) for the samples that would be great :)Grogan
I sent you a sample file. Worth noting: code already exists that abstracts the process of reading and verifying the Packet Headers and data payload. But it depends on FileStream and BinaryReader objects. If the FileStream object can be optimized, no further work would be needed.Pleasure
@RobertHarvey: Well, if I'm going to put together a complete program, I might as well include that bit. I'm not going to interpret much of the data... just read it for properties.Grogan
N
16

L.B mentioned in a comment to use a Memory Mapped file, you may be impressed with the performance.

Please try something like this:

var memoryMapName = Path.GetFileName(fileToRead);

using (var mapStream = new FileStream(fileToRead, FileMode.Open))
{
    using (var myMap = MemoryMappedFile.CreateFromFile(
                            mapStream, 
                            memoryMapName, mapStream.Length,
                            MemoryMappedFileAccess.Read, null, 
                            HandleInheritability.None, false))
    {                    
        long leftToRead = mapStream.Length;
        long mapSize = Math.Min(1024 * 1024, mapStream.Length);
        long bytesRead = 0;
        long mapOffset = Math.Max(mapStream.Length - mapSize, 0);

        while (leftToRead > 1)
        {
            using (var FileMap = myMap.CreateViewAccessor(mapOffset, 
                                 mapSize, MemoryMappedFileAccess.Read))
            {
                long readAt = mapSize - 2;
                while (readAt > -1)
                {
                    var int16Read = FileMap.ReadUInt16(readAt);
                    //0xEB25  <--check int16Read here                            
                    bytesRead += 1;
                    readAt -= 1;
                }
            }

            leftToRead = mapStream.Length- bytesRead;
            mapOffset = Math.Max(mapOffset - mapSize, 0);
            mapSize = Math.Min(mapSize, leftToRead);
        }
    }
}
Noto answered 6/3, 2012 at 4:9 Comment(4)
I went ahead and upvoted your answer, because I think it's a good one, but just be aware that the question has a .NET 3.5 tag on it, and I believe that MemoryMappedFile is only available in .NET 4.0. :)Pleasure
To be fair you could always P/Invoke the mapped file API in 3.5 too, so if this way helps, you can still use it with some minor function call changes.Ber
@RobertHarvey You could do it in 3.5 but you would have to wrap the Win32 functionsDoll
@RobertHarvey Ahh, yes...I did see the 3.5 tag, but was hoping it wasn't a definite requirement. I did some tests with MemoryMapped files, and even reading forward is significantly faster than using a FileStream. I'm sorry this will not work for you.Noto

© 2022 - 2024 — McMap. All rights reserved.