Python Does Not Read Entire Text File

I'm running into a problem that I haven't seen anyone on StackOverflow encounter, or even google, for that matter.

My main goal is to be able to replace occurrences of a string in the file with another string. Is there a way to access all of the lines in the file?

The problem is that when I try to read in a large text file (1-2 gb) of text, Python only reads a subset of it.

For example, I'll run a really simple command such as:

newfile = open("newfile.txt","w")
f = open("filename.txt","r")
for line in f:
    replaced = line.replace("string1", "string2")
    newfile.write(replaced)

And it only writes the first 382 mb of the original file. Has anyone encountered this problem previously?

I tried a few different solutions such as using:

import fileinput
import sys

for i, line in enumerate(fileinput.input("filename.txt", inplace=1)):
    sys.stdout.write(line.replace("string1", "string2"))

But it has the same effect. Nor does reading the file in chunks such as using

f.read(10000)
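
For reference, the chunked attempt presumably looked something like this sketch (reconstructed from the snippet above; the 10000-character chunk size and file names are the ones already used in this question), and it stopped at the same point:

newfile = open("newfile.txt", "w")
f = open("filename.txt", "r")
while True:
    chunk = f.read(10000)               # read the file 10000 characters at a time
    if not chunk:                       # read() returns "" at (what Python thinks is) end of file
        break
    newfile.write(chunk.replace("string1", "string2"))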

I've narrowed it down to most likely being a reading problem and not a writing problem, because it happens even when simply printing out lines. I know that there are more lines. When I open the file in a full text editor such as Vim, I can see what the last line should be, and it is not the last line that Python prints.

Can anyone offer any advice or things to try?

I'm currently using a 32-bit version of Windows XP with 3.25 gb of ram, and running Python 2.7

Pronominal answered 28/3, 2012 at 10:45 Comment(7)
Reading line by line with an iterator should be a lazy operation, so it should work regardless of the size of the file. While it shouldn't affect your situation, you will also want to use with when opening files - it's a good practice that handles closing under exceptions correctly.Casease
That worked great! Thanks so much. *edit: I tried posting the iterator code here, but it wouldn't format, so I added it to the original post.Pronominal
Have you tried it with a different large text file? Is there something strange in the file at the 382 mb mark - some strange character that is being treated as the end of file?Stubblefield
I have. I thought it might have been the file at first, but I tried it with files of varying size from various sources. The largest I tried was 2.6 gb and the smallest was 560 mb, but they all stop at 382 mb.Pronominal
There's no reason your original code shouldn't have worked. It's also "lazy" as @Latty calls it. You shouldn't need to write your own iterator, or to read in chunks.Rosa
Related question: Line reading chokes on 0x1AValenti
I'd like to note that when I said iterator, that wasn't what I meant - I meant one as in your original example (for line in f). So, uh, no problem I guess, but I think the right answer here is codeape's.Casease

Try:

f = open("filename.txt", "rb")

On Windows, rb means open the file in binary mode. According to the docs, text mode vs. binary mode only has an impact on end-of-line characters. But (if I remember correctly) I believe opening files in text mode on Windows also treats the EOF character (Ctrl-Z, hex 1A) specially.
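
A small sketch to see this for yourself (assuming Windows and Python 2, as in the question; demo.txt is just a throwaway file name):

# Write a file containing a stray Ctrl-Z (0x1A) byte in the middle.
with open("demo.txt", "wb") as f:
    f.write("before\x1aafter")              # 12 bytes total

print(len(open("demo.txt", "r").read()))    # text mode: stops at the Ctrl-Z, prints 6
print(len(open("demo.txt", "rb").read()))   # binary mode: reads everything, prints 12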

You can also specify the mode when using fileinput:

fileinput.input("filename.txt", inplace=1, mode="rb")
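
Combining that with the fileinput loop from the question, the in-place replacement would look roughly like this (a sketch assembled from the snippets above, not something tested against the OP's files):

import fileinput
import sys

# mode="rb" avoids the text-mode Ctrl-Z truncation on Windows (Python 2).
for line in fileinput.input("filename.txt", inplace=1, mode="rb"):
    sys.stdout.write(line.replace("string1", "string2"))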
Kendallkendell answered 28/3, 2012 at 11:9 Comment(3)
That also works! I like that solution the most, because of how easy it is to change the existing code.Pronominal
How is it that "that also works"? This is clearly your problem. What other approach worked as well? Ah, I see in the comments: specifying a byte-length to be read, instead of using "readline".Tenderloin
I faced exactly the same problem. It works perfectly!Liberty

Are you sure the problem is with reading and not with writing out? Do you close the file that is written to, either explicitly with newfile.close() or using the with construct?

Not closing the output file is often the source of such problems when buffering is going on somewhere. If that's the case in your setting too, closing should fix your initial solutions.
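
For illustration, this is the original loop from the question with explicit close calls added - just a sketch of what this answer suggests, since whether buffering is actually the culprit here isn't confirmed:

newfile = open("newfile.txt", "w")
f = open("filename.txt", "r")
for line in f:
    newfile.write(line.replace("string1", "string2"))
f.close()
newfile.close()   # close() flushes any buffered output to disk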

Kilovolt answered 28/3, 2012 at 11:37 Comment(0)

If you use the file like this:

with open("filename.txt") as f:
    for line in f:
        newfile.write(line.replace("string1", "string2"))

It should only read into memory one line at a time, unless you keep a reference to that line in memory.
After each line is read it will be up to pythons garbage collector to get rid of it. Give this a try and see if it works for you :)

Duque answered 28/3, 2012 at 10:52 Comment(0)

Found the solution thanks to Gareth Latty. Using an iterator:

def read_in_chunks(file, chunk_size=1000):
    # Read chunk_size characters at a time; read() returns "" at end of file.
    while True:
        data = file.read(chunk_size)
        if not data:
            break
        yield data

This answer was posted as an edit to the question Python Does Not Read Entire Text File by the OP user1297872 under CC BY-SA 3.0.
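
A hypothetical usage of the generator above for the original replacement task might look like the following (file names taken from the question; binary mode per codeape's answer). Note that a match straddling a chunk boundary would be missed, which the line-by-line approaches above avoid.

with open("filename.txt", "rb") as src, open("newfile.txt", "wb") as dst:
    for chunk in read_in_chunks(src, chunk_size=1000):
        dst.write(chunk.replace("string1", "string2"))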

Trash answered 16/1, 2023 at 12:22 Comment(0)
