Is there a difference between : "file.readlines()", "list(file)" and "file.read().splitlines(True)"?
F

6

18

What is the difference between :

with open("file.txt", "r") as f:
    data = list(f)

Or :

with open("file.txt", "r") as f:
    data = f.read().splitlines(True)

Or :

with open("file.txt", "r") as f:
    data = f.readlines()

They seem to produce the exact same output. Is one better (or more pythonic) than the other ?

Flower answered 23/7, 2018 at 13:16 Comment(2)
Ofc, I tried first. They produce the exact same output. That's why I ask if there is any difference... (added a small edit for clarity).Flower
The biggest question is why you need that list. If you're eventually going to iterate over it once the most pythonic thing to do is never build it and iterate over the lines of the file instead.Whitening
A
9

Explicit is better than implicit, so I prefer:

with open("file.txt", "r") as f:
    data = f.readlines()

But, when it is possible, the most pythonic is to use the file iterator directly, without loading all the content to memory, e.g.:

with open("file.txt", "r") as f:
    for line in f:
        my_function(line)
Acerb answered 23/7, 2018 at 13:27 Comment(8)
I don't think I can use the iterator in this case. I need to read the first line from the file and do some stuff with it, then delete the first line from the file so that the second line becomes the first, and so on. I was thinking of using one of those, then using data = data[1:] and writing back to the file.Flower
@Bermuda: firstline = next(f). Then do stuff with it. Then with open("file.txt.temp", "w") as f2: f2.write(f.read()). Then move file.txt.temp over file.txt.Whitening
This works and this is exactly what I needed! Which is very nice... but I don't understand how it works. According to the doc, next() retrieves the next item from the iterator. No problem. But how come, when calling f2.write(f.read()) later, the first line has disappeared? Does f.read() share the same iterator with next() and therefore start reading from that point?Flower
@StevenRumbalski This is a really good way to accomplish what he wanted. I just think it deviates completely the purpose of his question. He should probably ask another question so you can post your proposed method. Personally, I wouldn't have known how to handle this. But I don't see how future users will find this answer considering how he formulated his question and the fact that it is a comment.Date
@StevenRumbalski Here is an open thread for your answer https://mcmap.net/q/741162/-how-to-read-and-delete-first-n-lines-from-file-in-python-elegant-solution-duplicate/7692463. Don't hesitate giving me feedback if you think I can improve wording of the question.Date
@Bermuda: An open file object acts as a line-by-line iterator. That's why data = list(f) works. next(someiter) tries to pull one item from an iterator. If you call f.read() after a line has been pulled off, it will continue reading where the iterator left off (unless you reset the file pointer with f.seek(0), which also resets iteration).Whitening
@scharette: Deviating from the purpose of the question is just fine for comments. The OP had an XY problem. What I wrote isn't an answer so much as a small teaching moment of basic Python. It doesn't detract from the answer here.Whitening
@StevenRumbalski I understand but what I meant was his question is still valid even if he wrongly formulated it. People will tend to vote for people telling him to use iterators because it is optimal for his specific case, but its not in the context of his question.Date
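
The approach described in the comments above (pull the first line with next(f), write the rest to a temporary file, then move it over the original) can be sketched roughly as follows. The function name pop_first_line and the ".temp" suffix are illustrative assumptions, not something from the thread, and the sketch assumes Python 3 text mode.

```python
import os


def pop_first_line(path):
    """Return the first line of `path` and rewrite the file without it.

    Hypothetical helper: the name and the ".temp" suffix are illustrative.
    """
    tmp_path = path + ".temp"
    with open(path, "r") as f, open(tmp_path, "w") as f2:
        # Pull one line off the file iterator ("" if the file is empty).
        first_line = next(f, "")
        # read() continues from where the iterator stopped,
        # so only the remaining lines are copied.
        f2.write(f.read())
    # Move the temporary file over the original one.
    os.replace(tmp_path, path)
    return first_line
```

Calling pop_first_line("file.txt") repeatedly would hand back one line at a time while the file shrinks. For large files this rewrites the whole remainder on every call, so it is only a reasonable choice when the file is small or the operation is rare.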
D
5

TL;DR;

Considering you need a list to manipulate them afterwards, your three proposed solutions are all syntactically valid. There is no better (or more pythonic) solution, especially since they all are recommended by the official Python documentation. So, choose the one you find the most readable and be consistent with it throughout your code. If performance is a deciding factor, see my timeit analysis below.


Here is the timeit run (10000 loops, ~20 lines in test.txt):

import timeit

def foo():
    with open("test.txt", "r") as f:
        data = list(f)

def foo1():
    with open("test.txt", "r") as f:
        data = f.read().splitlines(True)

def foo2():
    with open("test.txt", "r") as f:
        data = f.readlines()

print(timeit.timeit(stmt=foo, number=10000))
print(timeit.timeit(stmt=foo1, number=10000))
print(timeit.timeit(stmt=foo2, number=10000))

>>>> 1.6370758459997887
>>>> 1.410844805999659
>>>> 1.8176437409965729

I tried it with various numbers of loops and lines, and f.read().splitlines(True) always seems to perform a bit better than the other two.

Now, syntactically speaking, all of your examples seem to be valid. Refer to this documentation for more information.

According to it, if your goal is to read lines from a file, you can loop over the file object:

for line in f:
    ...

The documentation states that this is memory efficient, fast, and leads to simple code. It would be another good alternative in your case if you don't need to manipulate the lines in a list.

EDIT

Note that the True argument to splitlines matters here: keepends defaults to False, which strips the line endings, so without it the result no longer matches list(f) or f.readlines().
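
A minimal check of what the keepends argument of str.splitlines does, using an in-memory string and io.StringIO as stand-ins for a real file (the sample text is made up for illustration):

```python
import io

text = "first\nsecond\n"

# keepends defaults to False: the line endings are stripped.
without_ends = text.splitlines()
# keepends=True keeps the line endings, like readlines() does.
with_ends = text.splitlines(True)
# io.StringIO stands in for an opened text file here.
from_readlines = io.StringIO(text).readlines()

print(without_ends)    # ['first', 'second']
print(with_ends)       # ['first\n', 'second\n']
print(from_readlines)  # ['first\n', 'second\n']
```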

My personal recommendation

I don't want to make this answer too opinion-based, but I think it would be beneficial for you to know that I don't think performance should be your deciding factor until it is actually a problem for you, especially since all of these syntaxes are allowed and recommended in the official Python doc I linked.

So, my advice is:

First, pick the most logical one for your particular case and then choose the one you find the most readable and be consistent with it throughout your code.

Date answered 23/7, 2018 at 13:32 Comment(6)
Thank you if the only differences are stylistic, yes better perf are always nice :)Flower
@Flower Indeed, but note that you should also try timeit on your specific computer to see what seems to be the most efficient. Just out of curiosity, try my code and get back to me on what seems to be the best on your computer.Date
Relevant to your analysis, how many lines did test.txt contain? How big was the file?Unfit
@MichaelMior I edited the question by specifying the number of lines but as stated in the answer, I also tried multiple files size and number of loops. At least from what I was able to test, f.read().splitlines(True) was performing better. You can maybe confirm you have similar behavior.Date
@Date Thanks for sharing. I would be hesitant to draw any conclusions from a test with only 20 lines in a file, but I agree it's probably true that there's not a huge difference.Unfit
Thanks for the benchmarks! Folks should definitely measure on their particular machine. I have a ~300 MB text file, with a line count in the low millions, and on my 2019 MacBook Pro, f.read().splitlines() (regardless of the keepends parameter) is about 2 seconds slower than f.readlines().Armenian
B
5

All three of your options produce the same end result, but nonetheless, one of them is definitely worse than the other two: doing f.read().splitlines(True).

The reason this is the worst option is that it requires the most memory. f.read() reads the file content into memory as a single (maybe huge) string object, then calling .splitlines(True) on that additionally creates the list of the individual lines, and then only after that does the string object containing the file's entire content get garbage collected and its memory freed. So, at the moment of peak memory use - just before the memory for the big string is freed - this approach requires enough memory to store the entire content of the file in memory twice - once as a string, and once as an array of strings.

By contrast, doing list(f) or f.readlines() will read a line from disk, add it to the result list, then read the next line, and so on. So the whole file content is never duplicated in memory, and the peak memory use will thus be about half that of the .splitlines(True) approach. These approaches are thus superior to using .read() and .splitlines(True).
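
One way to check the peak-memory argument on your own files is the standard tracemalloc module. This is only a sketch: the helper names (load_with_list, load_with_readlines, load_with_splitlines, peak_bytes) are made up for illustration, and tracemalloc only reports memory allocated through Python's allocators, so the numbers are an approximation.

```python
import tracemalloc


def load_with_list(path):
    with open(path, "r") as f:
        return list(f)


def load_with_readlines(path):
    with open(path, "r") as f:
        return f.readlines()


def load_with_splitlines(path):
    with open(path, "r") as f:
        return f.read().splitlines(True)


def peak_bytes(loader, path):
    """Run loader(path) and return (number_of_lines, peak_traced_bytes)."""
    tracemalloc.start()
    try:
        data = loader(path)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return len(data), peak
```

For example, comparing peak_bytes(load_with_splitlines, "file.txt") with peak_bytes(load_with_readlines, "file.txt") on a large file should show whether the extra copy matters on your machine.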

As for list(f) vs f.readlines(), there's no concrete advantage to either of them over the other; the choice between them is a matter of style and taste.

Bair answered 29/12, 2019 at 13:14 Comment(0)
G
2

In the 3 cases, you're using a context manager to read a file. This file is a file object.

File Object

An object exposing a file-oriented API (with methods such as read() or write()). Depending on the way it was created, a file object can mediate access to a real on-disk file or to another type of storage or communication device (for example standard input/output, in-memory buffers, sockets, pipes, etc.). File objects are also called file-like objects or streams. The canonical way to create a file object is by using the open() function. https://docs.python.org/3/glossary.html#term-file-object

list

with open("file.txt", "r") as f:
    data = list(f)

This works because your file object is a stream-like object that can be iterated line by line. Converting it to a list works roughly like this:

[element for element in f]  # iteration stops when f raises StopIteration
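
Spelled out by hand, the iteration that list() performs looks roughly like the sketch below. The helper name to_list is hypothetical, and io.StringIO is used as a stand-in for an opened text file so the example needs no file on disk.

```python
import io


def to_list(iterable):
    """Hand-written equivalent of list(iterable), for illustration only."""
    it = iter(iterable)
    result = []
    while True:
        try:
            item = next(it)  # for a file object, this is the next line
        except StopIteration:
            break            # the iterator is exhausted (EOF for a file)
        result.append(item)
    return result


f = io.StringIO("first\nsecond\n")
lines = to_list(f)
print(lines)  # ['first\n', 'second\n']
```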

readlines method

with open("file.txt", "r") as f:
    data = f.readlines()

The method readlines() reads until EOF using readline() and returns a list containing the lines.

Difference with list:

  1. You can pass an optional size hint: f.readlines(hint) (the argument is called sizehint in the Python 2 documentation). It is an approximate size limit, not a number of lines.

  2. If the optional hint argument is present, instead of reading up to EOF, whole lines totalling approximately hint bytes (characters, for text files in Python 3; possibly after rounding up to an internal buffer size) are read.
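
A small sketch of the hint behaviour in Python 3, again with io.StringIO standing in for an opened text file and made-up sample text. The exact cut-off near the boundary is an implementation detail, so the hint values here are chosen well away from it.

```python
import io

text = "one\ntwo\nthree\n"

# No hint: read every line up to EOF.
all_lines = io.StringIO(text).readlines()

# Small hint: reading stops once the lines read so far exceed the hint,
# but whole lines are always returned.
first_only = io.StringIO(text).readlines(1)

# Slightly larger hint: the first line (4 characters) is not enough,
# so a second whole line is read as well.
first_two = io.StringIO(text).readlines(5)

print(all_lines)   # ['one\n', 'two\n', 'three\n']
print(first_only)  # ['one\n']
print(first_two)   # ['one\n', 'two\n']
```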

read

When should I ever use file.read() or file.readlines()?

Godesberg answered 23/7, 2018 at 13:44 Comment(0)
A
2

They're all achieving the same goal of returning a list of strings but using separate approaches. f.readlines() is the most Pythonic.

with open("file.txt", "r") as f:
    data = list(f)

f here is a file-like object, which is being iterated over through list, which returns lines in the file.


with open("file.txt", "r") as f:
    data = f.read().splitlines(True)

f.read() returns a string, which you split on newlines, returning a list of strings.


with open("file.txt", "r") as f:
    data = f.readlines()

f.readlines() does the same as above, it reads the entire file and splits on newlines.

Acceptable answered 23/7, 2018 at 18:14 Comment(0)
W
0

Based on my experience, you can use both splitlines() and readlines(), but keep in mind that splitlines() can sometimes give different results, because it splits on more line boundaries than the newline symbol alone (for example form feed \x0c or the Unicode line separator \u2028). So readlines() has a more predictable result if the data you process may contain such characters inside the text.
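
A minimal illustration of such a difference, assuming the text contains a form feed character. The sample text is made up, and io.StringIO stands in for an opened text file.

```python
import io

# "\x0c" (form feed) is a line boundary for str.splitlines,
# but not for file iteration or readlines().
text = "page one\x0cpage two\nlast line\n"

via_splitlines = text.splitlines(True)
via_readlines = io.StringIO(text).readlines()

print(via_splitlines)  # ['page one\x0c', 'page two\n', 'last line\n']
print(via_readlines)   # ['page one\x0cpage two\n', 'last line\n']
```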

Waylen answered 16/4 at 5:35 Comment(0)
