GZipStream with StreamReader.ReadLine only reads first line

I have a gzip file containing a txt file that needs to be cleaned up. I would like to read from the GZipped file line by line and then write the cleaned content to an output GZIP file all in one shot like this:

    using System.IO;
    using System.IO.Compression;          // GZipStream (built-in .NET)
    using System.Text;
    using ICSharpCode.SharpZipLib.GZip;   // GZipOutputStream (SharpZipLib)

    void ExtractAndFix(string inputPath, string outputPath) {
        // Decompress the input and re-compress the output in a single pass.
        using (GZipStream gzInput = new GZipStream(new FileStream(inputPath, FileMode.Open), CompressionMode.Decompress)) {
            using (StreamReader reader = new StreamReader(gzInput, Encoding.UTF8)) {
                using (GZipOutputStream gzipWriter = new GZipOutputStream(new FileStream(outputPath, FileMode.Create))) {
                    string line = null;
                    // Expected to loop over every line; in practice it stops after the first.
                    while ((line = reader.ReadLine()) != null) {
                        // Replace tabs with spaces and terminate each line with CRLF.
                        byte[] bytes = Encoding.UTF8.GetBytes(line.Replace("\t", " ") + "\r\n");
                        gzipWriter.Write(bytes, 0, bytes.Length);
                    }
                }
            }
        }
    }

But for some reason the call to line = reader.ReadLine() in the while loop reads only one line and then returns null (the reader reports end of stream). I've tried this with both the built-in .NET compression library and the ICSharpCode package, and I get the same behavior. I realize I could extract the full file, clean it, and then re-compress it, but I hate wasting the resources, hard drive space, etc. Note: these are large files (up to several GB compressed), so anything with MemoryStream is not going to be a good solution. Has anyone encountered anything like this before? Thank you.

Synagogue answered 18/9, 2014 at 17:22 Comment(3)
Are you sure that file is actually just a compressed stream and not a Zip archive? – Vowell
@Alexei Levenkov - If it were a Zip archive, the GZip stream could never be created; it would fail because the file type would be incorrect. – Synagogue
Possible duplicate of "Decompressing using GZipStream returns only the first line" – Benumb

After a lot of hair pulling I appear to have found the issue. For me the problem was further compounded by the fact that certain GZip files would work fine while others would display the behavior above. For example, if I created the archive myself with GZip it would work great, but certain other archives generated from other sources would not.
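One possible explanation for archives from some sources failing while self-made ones work, which this thread does not confirm, is a multi-member gzip file: the gzip format allows several compressed members to be concatenated, and some decompressors reportedly stop after the first one. The sketch below is one way to check how a given runtime's GZipStream behaves. The names GzipMemberCheck and CountLinesAcrossMembers are hypothetical, and MemoryStream is used only because the test data is a few bytes long.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipMemberCheck
{
    // Compresses one piece of text into a complete, standalone gzip member.
    static byte[] GzipMember(string text)
    {
        using (var buffer = new MemoryStream())
        {
            // leaveOpen: true so the buffer can still be read after the gzip stream is closed.
            using (var gz = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
            {
                byte[] bytes = Encoding.UTF8.GetBytes(text);
                gz.Write(bytes, 0, bytes.Length);
            }
            return buffer.ToArray();
        }
    }

    // Concatenates two gzip members (one line each) and returns how many
    // lines the decompressor yields when reading the combined stream.
    public static int CountLinesAcrossMembers()
    {
        var combined = new MemoryStream();
        foreach (string line in new[] { "first\n", "second\n" })
        {
            byte[] member = GzipMember(line);
            combined.Write(member, 0, member.Length);
        }
        combined.Position = 0;

        int count = 0;
        using (var gz = new GZipStream(combined, CompressionMode.Decompress))
        using (var reader = new StreamReader(gz, Encoding.UTF8))
        {
            while (reader.ReadLine() != null)
            {
                count++;
            }
        }
        return count;
    }
}
```

A result of 1 would suggest the decompressor stops at the end of the first member, while 2 means it reads through both.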

In short, the .NET GZip library proved unreliable for these files; don't use it. In addition, the ICSharpCode library I was using was a couple of years old. I'm not sure whether it used to piggyback on the underlying .NET code, but the version I had previously (0.85.4) gave the exact same behavior. When I upgraded to the latest version (0.86.0), it worked and I was able to read the full file.
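Since the fix came down to swapping the decompression library, it can help to keep the cleanup loop independent of any particular gzip implementation. A minimal sketch, assuming the loop only needs a TextReader and a TextWriter; LineCleaner and CleanLines are hypothetical names, not part of .NET or SharpZipLib:

```csharp
using System;
using System.IO;

public static class LineCleaner
{
    // Copies text line by line, replacing tabs with spaces and ending
    // every line with CRLF. Returns the number of lines written.
    public static long CleanLines(TextReader reader, TextWriter writer)
    {
        long count = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            writer.Write(line.Replace('\t', ' '));
            writer.Write("\r\n");
            count++;
        }
        // Push any buffered text down to the underlying stream.
        writer.Flush();
        return count;
    }
}
```

To use it with gzip files, wrap the decompressing stream in a StreamReader and the compressing stream in a StreamWriter, whichever library supplies those streams. Only the current line is held in memory, so file size should not matter.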

Hopefully this helps someone else with the same issue.

Synagogue answered 19/9, 2014 at 14:30 Comment(0)
