Replacing a string within a stream in C# (without overwriting the original file)

I have a file that I'm opening into a stream and passing to another method. However, I'd like to replace a string in the file before passing the stream to the other method. So:

string path = "C:/...";
Stream s = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
//need to replace all occurrences of "John" in the file to "Jack" here.
CallMethod(s);

The original file should not be modified, only the stream. What would be the easiest way to do this?

Thanks...

Breakfast answered 16/9, 2013 at 19:33 Comment(1)
You could read it with a StreamReader here: msdn.microsoft.com/en-us/library/system.io.streamreader.aspx - Affidavit

It's a lot easier to read the file as lines and work with those than to force yourself to stick with a Stream. A Stream handles both text and binary data and exposes it as raw bytes, which makes this kind of replacement very hard. If you read a whole line at a time (so long as the text you are replacing never spans multiple lines), it's quite easy.

var lines = File.ReadLines(path)
    .Select(line => line.Replace("John", "Jack"));

Note that ReadLines still does stream the data, and Select doesn't need to materialize the whole thing, so you're still not reading the whole file into memory at one time when doing this.

If you don't actually need to stream the data you can easily just load it all as one big string, do the replace, and then create a stream based on that one string:

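// ReadAllText detects the file's encoding (UTF-8 unless a byte order mark says otherwise).
string data = File.ReadAllText(path)
    .Replace("John", "Jack");
// Encoding.ASCII turns every non-ASCII character into '?'; if the text may
// contain such characters, encode with the file's real encoding instead.
byte[] bytes = Encoding.ASCII.GetBytes(data);
Stream s = new MemoryStream(bytes);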
Arsenical answered 16/9, 2013 at 19:36 Comment(10)
What if the file is 10gb? - Affidavit
@JeroenvanLangen Then there's no problem. The only issue would be if a single line of text was 10GB. Hopefully that's not the case. - Arsenical
I like this answer better than mine. - Characterize
@servy, +1 for the streaming, very nice solution. Like you said, 1 line of 10GB shouldn't be a TextFile :) - Affidavit
It doesn't like it when I do CallMethod(lines). It's expecting System.IO.Stream as the parameter type. Am I missing a step? Thanks! - Breakfast
@Breakfast No, as I described in this answer, if you wanted to do this you'd need to alter your other code to work off of an IEnumerable<string> rather than using a Stream. It's possible to solve this problem by creating a new stream, but it would be a lot more work. - Arsenical
Unfortunately, I don't have access to the other code, it's in a different class that I can't modify. Second best option for me? - Breakfast
@Breakfast Is your file large, small, or possibly either? - Arsenical
It's very small, 50 lines at most. - Breakfast
+1. But I'd advise caution when using Encoding.ASCII.GetBytes: check whether the file is UTF8/Unicode (maybe with a BOM), or even sniff the encoding from the original stream. - Breda

If the file has extremely long lines, if the string being replaced may contain a newline, or if other constraints rule out File.ReadLines() while still requiring streaming, there is an alternative that uses streams only, although it is not trivial.

Implement your own stream decorator (wrapper) that performs the replacement: a class derived from Stream that takes another stream in its constructor, reads data from that stream in its Read(byte[], int, int) override, and performs the replacement in the buffer. See the notes to implementers in the Stream documentation for further requirements and suggestions.

Let's call the string being replaced "needle", the source stream "haystack" and the replacement string "replacement".

Needle and replacement need to be encoded using the encoding of the haystack contents (typically Encoding.UTF8.GetBytes()). Inside streams, the data is not converted to string, unlike in StreamReader.ReadLine(). Thus unnecessary memory allocation is prevented.

Simple cases: If both needle and replacement are just a single byte, the implementation is just a simple loop over the buffer, replacing all occurrences. If needle is a single byte and replacement is empty (i.e. deleting the byte, e.g. deleting carriage return for line ending normalization), it is a simple loop maintaining from and to indexes to the buffer, rewriting the buffer byte by byte.
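
A minimal sketch of the two simple cases, assuming the buffer holds raw bytes read from the haystack. The names SingleByteEdits, ReplaceByte and RemoveByte are hypothetical and only serve the illustration. Note that in UTF-8 this is only safe for ASCII bytes (values below 0x80), because those never occur inside a multi-byte sequence.

```csharp
using System;

static class SingleByteEdits
{
    // Replaces every occurrence of one byte with another, in place.
    public static void ReplaceByte(byte[] buffer, int offset, int count,
                                   byte needle, byte replacement)
    {
        for (int i = offset; i < offset + count; i++)
        {
            if (buffer[i] == needle) buffer[i] = replacement;
        }
    }

    // Deletes every occurrence of a byte by compacting the buffer in place.
    // Returns how many bytes remain, starting at offset.
    public static int RemoveByte(byte[] buffer, int offset, int count, byte needle)
    {
        int to = offset;
        for (int from = offset; from < offset + count; from++)
        {
            if (buffer[from] != needle) buffer[to++] = buffer[from];
        }
        return to - offset;
    }
}
```

A Read override built on RemoveByte would presumably have to read again when the result is 0 but the underlying stream still has data, since returning 0 signals end of stream.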

In more complex cases, implement the Knuth-Morris-Pratt (KMP) string-matching algorithm to perform the replacement:

  • Read the data from the underlying stream (haystack) to an internal buffer that is at least as long as needle and perform the replacement while rewriting the data to the output buffer. The internal buffer is needed so that data from a partial match are not published before a complete match is detected -- then, it would be too late to go back and delete the match completely.

  • Process the internal buffer byte by byte, feeding each byte into the KMP automaton. With each automaton update, write the bytes it releases to the appropriate position in output buffer.

  • When a match is detected by KMP, replace it: reset the automaton keeping the position in the internal buffer (which deletes the match) and write the replacement in the output buffer.

  • When the end of either buffer is reached, keep the unwritten output and the unprocessed part of the internal buffer, including the current partial match, as the starting point for the next call to the method, and return the current output buffer. The next call writes the remaining output and resumes processing the haystack where the current call stopped.

  • When the end of the haystack is reached, release the current partial match and write it to the output buffer.

Just be careful not to return an empty output buffer before processing all the data of haystack -- that would signal end of stream to the caller and therefore truncate the data.
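
The steps above can be sketched roughly as follows. This is one possible shape under stated assumptions, not a tuned implementation: the class name ReplacingStream and its members are hypothetical, the stream is read-only and non-seekable, and a simple Queue<byte> stands in for the output bookkeeping. It also takes a shortcut: the bytes held back during a partial match always equal a prefix of the needle, so they are re-emitted from the needle rather than stored separately.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical read-only decorator; the class and member names are illustrative.
public sealed class ReplacingStream : Stream
{
    private readonly Stream source;
    private readonly byte[] needle, replacement;
    private readonly int[] failure;                           // KMP failure table
    private readonly Queue<byte> pending = new Queue<byte>(); // output not yet returned
    private readonly byte[] chunk = new byte[4096];           // internal read buffer
    private int matched;                                      // needle bytes held back
    private bool sourceEnded;

    public ReplacingStream(Stream source, byte[] needle, byte[] replacement)
    {
        if (needle.Length == 0) throw new ArgumentException("needle must not be empty");
        this.source = source;
        this.needle = needle;
        this.replacement = replacement;
        failure = new int[needle.Length];
        for (int i = 1, k = 0; i < needle.Length; i++)
        {
            while (k > 0 && needle[i] != needle[k]) k = failure[k - 1];
            if (needle[i] == needle[k]) k++;
            failure[i] = k;
        }
    }

    // Feeds one haystack byte into the automaton and queues whatever it releases.
    private void Feed(byte b)
    {
        while (matched > 0 && needle[matched] != b)
        {
            int next = failure[matched - 1];
            // The held bytes equal needle[0..matched); release those that fall out.
            for (int i = 0; i < matched - next; i++) pending.Enqueue(needle[i]);
            matched = next;
        }
        if (needle[matched] == b) matched++;
        else pending.Enqueue(b);
        if (matched == needle.Length)
        {
            // Complete match: drop the held bytes and emit the replacement.
            foreach (byte r in replacement) pending.Enqueue(r);
            matched = 0;
        }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Keep reading until there is output, so 0 is returned only at the real end.
        while (pending.Count == 0 && !sourceEnded)
        {
            int n = source.Read(chunk, 0, chunk.Length);
            for (int i = 0; i < n; i++) Feed(chunk[i]);
            if (n == 0)
            {
                sourceEnded = true;
                // End of haystack: release the partial match unchanged.
                for (int i = 0; i < matched; i++) pending.Enqueue(needle[i]);
                matched = 0;
            }
        }
        int written = 0;
        while (written < count && pending.Count > 0)
            buffer[offset + written++] = pending.Dequeue();
        return written;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing) source.Dispose();
        base.Dispose(disposing);
    }
}
```

Assuming the file is UTF-8, the stream from the question could then be wrapped as new ReplacingStream(s, Encoding.UTF8.GetBytes("John"), Encoding.UTF8.GetBytes("Jack")) before being passed to CallMethod. Matches are replaced left to right without overlap, which mirrors string.Replace.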

Cronyism answered 27/10, 2020 at 10:18 Comment(0)

This question probably has many good answers. I'll share one I've used that has always worked for me and my peers.

I suggest you create a separate stream, say a MemoryStream. Read from your file stream and write into the memory stream. You can then extract strings from either one and replace what you need, and pass the memory stream along. That guarantees you are not altering the original stream, and you can still read the original values from it whenever you need them, though this approach uses roughly twice as much memory.
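
A minimal sketch of that idea, assuming the file is text and small enough to hold in memory. StreamReplace and ReplaceInCopy are hypothetical names, and UTF-8 is assumed whenever no byte order mark is present.

```csharp
using System.IO;
using System.Text;

static class StreamReplace
{
    // Reads all text from source, replaces oldValue with newValue, and returns
    // a new in-memory stream positioned at the start. The source is only read.
    public static MemoryStream ReplaceInCopy(Stream source, string oldValue, string newValue)
    {
        string text;
        Encoding encoding;
        // detectEncodingFromByteOrderMarks: true, leaveOpen: true so the caller
        // keeps ownership of the source stream.
        using (var reader = new StreamReader(source, Encoding.UTF8, true, 4096, true))
        {
            text = reader.ReadToEnd();
            encoding = reader.CurrentEncoding; // reflects a detected BOM, if any
        }
        // GetBytes does not write a BOM, so the copy starts with the text itself.
        byte[] bytes = encoding.GetBytes(text.Replace(oldValue, newValue));
        return new MemoryStream(bytes);
    }
}
```

With the stream from the question this would be called as CallMethod(StreamReplace.ReplaceInCopy(s, "John", "Jack")), leaving the file on disk untouched.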

Characterize answered 16/9, 2013 at 19:37 Comment(1)
Thanks for this suggestion, ended up using the MemoryStream. - Breakfast
