I have a gzip file containing a text file that needs to be cleaned up. I would like to read the gzipped file line by line and write the cleaned content to an output gzip file in a single pass, like this:

    using System.IO;
    using System.IO.Compression;
    using System.Text;
    using ICSharpCode.SharpZipLib.GZip; // GZipOutputStream

    void ExtractAndFix(string inputPath, string outputPath) {
        StringBuilder sbLine = new StringBuilder();
        // Decompress the input with the built-in GZipStream and read it as UTF-8 text.
        using (GZipStream gzInput = new GZipStream(new FileStream(inputPath, FileMode.Open), System.IO.Compression.CompressionMode.Decompress)) {
            using (StreamReader reader = new StreamReader(gzInput, Encoding.UTF8)) {
                // Recompress the cleaned lines with SharpZipLib's GZipOutputStream.
                using (GZipOutputStream gzipWriter = new GZipOutputStream(new FileStream(outputPath, FileMode.Create))) {
                    string line = null;
                    while ((line = reader.ReadLine()) != null) {
                        // Replace tabs with spaces and terminate each line with CRLF.
                        sbLine.Clear();
                        sbLine.Append(line.Replace("\t", " "));
                        sbLine.Append("\r\n");
                        byte[] bytes = Encoding.UTF8.GetBytes(sbLine.ToString());
                        gzipWriter.Write(bytes, 0, bytes.Length);
                    }
                }
            }
        }
    }

But for some reason the call to reader.ReadLine() in the while loop reads only one line and then returns null (the reader reports end of stream). I've tried this with both the built-in .NET compression library (System.IO.Compression) and the ICSharpCode package (SharpZipLib), and I get the same behavior.

I realize I could extract the full file, clean it, and then re-compress it, but I would rather not waste the resources, hard drive space, and so on. Note: these are large files (up to several GB compressed), so anything based on MemoryStream is not going to be a good solution.

Has anyone encountered anything odd like this before? Thank you.